What to do if the leader crashes?

Hi,

I’m studying how to manage a Dgraph cluster and I’d like to know how to proceed when one alpha dies.
My PoC is: I have the zero and alphas running on different machines. Something happens and one of the alphas dies. If I try to launch another alpha, it seems healthy but the gRPC communication does not work.
I’m using Docker to simulate it.

Creating the zero:

docker network create dgraph_default

docker run -d --name zero -p 5080:5080 -p 6080:6080 --network dgraph_default --hostname zero -v $PWD/zero:/dgraph dgraph/dgraph dgraph zero --my=zero:5080 --telemetry sentry=false --replicas 3

And 2 alphas:

for i in `seq 1 2`; do
  docker run -d \
    --name alpha${i} \
    -p 908${i}:9080 \
    -p 708${i}:7080 \
    -p 808${i}:8080 \
    --network dgraph_default \
    --hostname alpha${i} \
    -v $PWD/alpha${i}:/dgraph \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my=alpha${i}:7080 --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
done

Everything is ok. Now, when I kill alpha1 (the leader) and launch alpha3:

docker rm -f alpha1
i=3; docker run -d \
    --name alpha${i} \
    -p 908${i}:9080 \
    -p 708${i}:7080 \
    -p 808${i}:8080 \
    --network dgraph_default \
    --hostname alpha${i} \
    -v $PWD/alpha${i}:/dgraph \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my=alpha${i}:7080 --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms

Checking the state via zero’s API localhost:6080/state:

{
  "counter": "14",
  "groups": {
    "1": {
      "members": {
        "1": {
          "id": "1",
          "groupId": 1,
          "addr": "alpha1:7080",
          "leader": true,
          "amDead": false,
          "lastUpdate": "1623926750",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        },
        "2": {
          "id": "2",
          "groupId": 1,
          "addr": "alpha2:7080",
          "leader": false,
          "amDead": false,
          "lastUpdate": "0",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        },
        "3": {
          "id": "3",
          "groupId": 1,
          "addr": "alpha3:7080",
          "leader": false,
          "amDead": false,
          "lastUpdate": "0",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        }
      }
    ...

Checking the status via GraphQL on alpha2:

// http://localhost:8082/admin
"health": [
  {
    "address": "zero:5080",
    "status": "healthy",
    "instance": "zero",
    "version": "v21.03.0",
    "lastEcho": 1623927699,
    "group": "0",
    "uptime": 960,
    "ongoing": [],
    "indexing": []
  },
  {
    "address": "alpha1:7080",
    "status": "unhealthy",
    "instance": "alpha",
    "version": "v21.03.0",
    "lastEcho": 1623926779,
    "group": "1",
    "uptime": 29,
    "ongoing": [],
    "indexing": []
  },
  {
    "address": "alpha3:7080",
    "status": "healthy",
    "instance": "alpha",
    "version": "v21.03.0",
    "lastEcho": 1623927699,
    "group": "1",
    "uptime": 884,
    "ongoing": [],
    "indexing": []
  },
  {
    "address": "alpha2:7080",
    "status": "healthy",
    "instance": "alpha",
    "version": "v21.03.0",
    "lastEcho": 1623927699,
    "group": "1",
    "uptime": 946,
    "ongoing": [
      "opRollup"
    ],
    "indexing": []
  }
]

Alpha3 is “healthy”. But it does not work. I cannot access http://localhost:8083/admin.

# docker logs alpha3

Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". Retrying...
Connection lost with alpha1:7080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host"
Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": error from client.subscribe: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". closer err: <nil>
Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". Retrying...
CONNECTING to alpha3:7080
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
Error reading GraphQL schema: Please retry again, server is not ready to accept requests
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...

So, what am I supposed to do if at least one replica dies?
And why is the offline alpha1 still on the list, and still shown as the leader, according to /state?

You’ll want to start off with an odd number of Alphas to maintain a majority quorum. Because you started with 2 Alphas and then removed one of them (alpha1), a majority of the Alpha group isn’t up, so even if the new third Alpha tries to join the cluster, there’s no current leader to connect to. There won’t be one until you start alpha1 again.
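To make the quorum arithmetic concrete, here is a minimal sketch of the usual Raft majority rule (floor(n/2) + 1 voting members). The function names quorum and tolerated_failures are made up for illustration and are not part of Dgraph:

```shell
# Illustrative helpers only; these names are assumptions, not Dgraph commands.

# Majority quorum for a group of n voting members: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many members can be down while a majority is still reachable.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With 2 members the quorum is 2, so losing either one leaves the group without a majority. With 3 members the quorum is still 2, so one member can fail.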

This is why we enforce that Zero’s --replicas flag is odd. In this case, you’ve set it to 3. That’s correct, but starting this test with only two Alphas leaves Raft stuck without a majority as soon as one of them goes down, which is what happened here.

If you started with three Alphas and then killed the leader, then the two followers (the remaining majority) would detect that and elect a new leader amongst themselves.
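As a rough sketch, the three-Alpha version of the test could look like the following. It reuses the docker flags from the original post unchanged. The start_alpha helper and the DOCKER variable are hypothetical names added here for illustration. By default the helper only prints each command (a dry run); setting DOCKER=docker is assumed to be how you would actually launch the containers.

```shell
# start_alpha is a hypothetical helper, not part of Dgraph or Docker.
# By default it only prints the command; set DOCKER=docker to really run it.
start_alpha() {
  i=$1
  ${DOCKER:-echo docker} run -d \
    --name "alpha${i}" \
    -p "908${i}:9080" \
    -p "708${i}:7080" \
    -p "808${i}:8080" \
    --network dgraph_default \
    --hostname "alpha${i}" \
    -v "$PWD/alpha${i}:/dgraph" \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my="alpha${i}:7080" --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
}

# Start three Alphas up front so the group can lose one and keep a majority.
for i in 1 2 3; do
  start_alpha "$i"
done
```

After that, removing the leader with docker rm -f should leave two of the three members up, which is still a majority.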

@dmai Thanks a lot!
