What to do if the leader crashes?

gustavohenrique · June 17, 2021, 11:12am

Hi,

I’m studying how to manage a Dgraph cluster and I’d like to know how to proceed when one alpha dies.
My PoC: the zero and the alphas run on different machines. Something happens and one of the alphas dies. If I launch another alpha, it appears healthy, but gRPC communication does not work.
I’m using Docker to simulate it.

Creating the zero:

docker network create dgraph_default

docker run -d --name zero -p 5080:5080 -p 6080:6080 --network dgraph_default --hostname zero -v $PWD/zero:/dgraph dgraph/dgraph dgraph zero --my=zero:5080 --telemetry sentry=false --replicas 3

And 2 alphas:

for i in $(seq 1 2); do
  docker run -d \
    --name alpha${i} \
    -p 908${i}:9080 \
    -p 708${i}:7080 \
    -p 808${i}:8080 \
    --network dgraph_default \
    --hostname alpha${i} \
    -v $PWD/alpha${i}:/dgraph \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my=alpha${i}:7080 --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
done

Everything is ok. Now, when I kill alpha1 (the leader) and launch alpha3:

docker rm -f alpha1
i=3; docker run -d \
    --name alpha${i} \
    -p 908${i}:9080 \
    -p 708${i}:7080 \
    -p 808${i}:8080 \
    --network dgraph_default \
    --hostname alpha${i} \
    -v $PWD/alpha${i}:/dgraph \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my=alpha${i}:7080 --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms

Checking the state via zero’s API at localhost:6080/state:

{
  "counter": "14",
  "groups": {
    "1": {
      "members": {
        "1": {
          "id": "1",
          "groupId": 1,
          "addr": "alpha1:7080",
          "leader": true,
          "amDead": false,
          "lastUpdate": "1623926750",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        },
        "2": {
          "id": "2",
          "groupId": 1,
          "addr": "alpha2:7080",
          "leader": false,
          "amDead": false,
          "lastUpdate": "0",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        },
        "3": {
          "id": "3",
          "groupId": 1,
          "addr": "alpha3:7080",
          "leader": false,
          "amDead": false,
          "lastUpdate": "0",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        }
      }
    ...

Checking the status via GraphQL on alpha2:

// http://localhost:8082/admin
"health": [
  {
    "address": "zero:5080",
    "status": "healthy",
    "instance": "zero",
    "version": "v21.03.0",
    "lastEcho": 1623927699,
    "group": "0",
    "uptime": 960,
    "ongoing": [],
    "indexing": []
  },
  {
    "address": "alpha1:7080",
    "status": "unhealthy",
    "instance": "alpha",
    "version": "v21.03.0",
    "lastEcho": 1623926779,
    "group": "1",
    "uptime": 29,
    "ongoing": [],
    "indexing": []
  },
  {
    "address": "alpha3:7080",
    "status": "healthy",
    "instance": "alpha",
    "version": "v21.03.0",
    "lastEcho": 1623927699,
    "group": "1",
    "uptime": 884,
    "ongoing": [],
    "indexing": []
  },
  {
    "address": "alpha2:7080",
    "status": "healthy",
    "instance": "alpha",
    "version": "v21.03.0",
    "lastEcho": 1623927699,
    "group": "1",
    "uptime": 946,
    "ongoing": [
      "opRollup"
    ],
    "indexing": []
  }
]

Alpha3 is reported as “healthy”, but it does not work: I cannot access http://localhost:8083/admin.

# docker logs alpha3

Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". Retrying...
Connection lost with alpha1:7080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host"
Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": error from client.subscribe: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". closer err: <nil>
Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". Retrying...
CONNECTING to alpha3:7080
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
Error reading GraphQL schema: Please retry again, server is not ready to accept requests
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...

So, what am I supposed to do if at least one replica dies?
And why is the offline alpha1 still on the list, and still the leader, according to /state?

dmai · June 17, 2021, 6:45pm

You’ll want to start off with an odd number of Alphas to maintain a majority quorum. Because you started with two Alphas and stopped one of them (alpha1, the leader), a majority of the Alpha group isn’t up, so even if the new third Alpha joins the cluster, there’s no current leader to connect to. There won’t be until you start Alpha 1 again.

This is why we enforce that Zero’s --replicas flag is odd. In this case, you’ve set it to 3. That’s correct, but starting this test with only two Alphas leaves Raft stuck without a quorum, as you saw here.
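
The majority rule behind this can be checked with a little arithmetic. Below is a minimal sketch, assuming the usual Raft majority of floor(N/2) + 1 voting members. The function names quorum and has_quorum are made up for illustration and are not part of Dgraph.

```shell
#!/bin/sh
# Majority quorum for a Raft group with N voting members: floor(N/2) + 1.
# (Illustrative helper, not a Dgraph command.)
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Exit status 0 when `alive` members are enough to elect a leader
# in a group of `members` voting members, non-zero otherwise.
has_quorum() {
  members=$1
  alive=$2
  [ "$alive" -ge "$(quorum "$members")" ]
}

# Compare a two-member group with a three-member group.
for n in 2 3; do
  q=$(quorum "$n")
  echo "members=$n quorum=$q tolerated_failures=$(( n - q ))"
done
```

With two members the quorum is 2, so losing either one leaves the group without a leader. With three members the quorum is still 2, so one member can fail.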

If you started with three Alphas and then killed the leader, then the two followers (the remaining majority) would detect that and elect a new leader amongst themselves.
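
To repeat the test from a state that can survive losing the leader, the group needs three Alphas up front. Here is a rough sketch that reuses the flags from the commands earlier in this thread. The start_alphas function and the DOCKER override are hypothetical helpers added for illustration, not Dgraph or Docker features.

```shell
#!/bin/sh
# Start `count` Alphas named alpha1..alphaN on the dgraph_default network.
# Flags are copied from the original post. Set DOCKER=echo to print the
# commands instead of running them (an assumption of this sketch).
# The port scheme 908N/708N/808N only works for N from 1 to 9.
start_alphas() {
  count=$1
  i=1
  while [ "$i" -le "$count" ]; do
    ${DOCKER:-docker} run -d \
      --name "alpha${i}" \
      -p "908${i}:9080" \
      -p "708${i}:7080" \
      -p "808${i}:8080" \
      --network dgraph_default \
      --hostname "alpha${i}" \
      -v "$PWD/alpha${i}:/dgraph" \
      dgraph/dgraph dgraph alpha --zero zero:5080 "--my=alpha${i}:7080" \
      --logtostderr --cache size-mb=2048 \
      --telemetry sentry=false --telemetry reports=false \
      --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
    i=$(( i + 1 ))
  done
}
```

Calling start_alphas 3 once the zero is running, then removing the leader with docker rm -f and checking localhost:6080/state again, should show one of the two remaining Alphas taking over as leader.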

gustavohenrique · June 18, 2021, 12:36am

@dmai Thanks a lot!
