Hi,
I’m studying how to manage a Dgraph cluster and I’d like to know how to proceed when one alpha dies.
My PoC is: I have the zero and the alphas running on different machines. Something happens and one of the alphas dies. If I launch another alpha to replace it, it appears healthy, but gRPC communication with it does not work.
I’m using Docker to simulate it.
Creating the zero:
docker network create dgraph_default
docker run -d --name zero -p 5080:5080 -p 6080:6080 --network dgraph_default --hostname zero -v $PWD/zero:/dgraph dgraph/dgraph dgraph zero --my=zero:5080 --telemetry sentry=false --replicas 3
And 2 alphas:
for i in $(seq 1 2); do
  docker run -d \
    --name alpha${i} \
    -p 908${i}:9080 \
    -p 708${i}:7080 \
    -p 808${i}:8080 \
    --network dgraph_default \
    --hostname alpha${i} \
    -v $PWD/alpha${i}:/dgraph \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my=alpha${i}:7080 --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
done
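Since the same docker run command is needed again for the replacement alpha, the launch could be wrapped in a small function. This is only a sketch: start_alpha and the DOCKER override are names made up for illustration, and the port scheme (908N, 708N, 808N) only works for single-digit indexes.

```shell
# start_alpha N: launch alphaN with the same flags as the loop above.
# start_alpha and DOCKER are illustrative names, not Dgraph or Docker features.
# Assumes N is a single digit, because host ports are built as 908N/708N/808N.
start_alpha() {
  i=$1
  "${DOCKER:-docker}" run -d \
    --name "alpha${i}" \
    -p "908${i}:9080" \
    -p "708${i}:7080" \
    -p "808${i}:8080" \
    --network dgraph_default \
    --hostname "alpha${i}" \
    -v "$PWD/alpha${i}:/dgraph" \
    dgraph/dgraph dgraph alpha --zero zero:5080 --my="alpha${i}:7080" --logtostderr \
    --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false \
    --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
}

# Dry run: print the command for alpha3 instead of executing it.
# DOCKER=echo start_alpha 3
```

Setting DOCKER=echo prints the assembled command line, which makes it easy to compare against the manual invocation before actually starting a container.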
Everything is OK at this point. Then I kill alpha1 (the leader) and launch alpha3:
docker rm -f alpha1
i=3; docker run -d \
  --name alpha${i} \
  -p 908${i}:9080 \
  -p 708${i}:7080 \
  -p 808${i}:8080 \
  --network dgraph_default \
  --hostname alpha${i} \
  -v $PWD/alpha${i}:/dgraph \
  dgraph/dgraph dgraph alpha --zero zero:5080 --my=alpha${i}:7080 --logtostderr --cache size-mb=2048 --telemetry sentry=false --telemetry reports=false --security whitelist=0.0.0.0/0 --limit query-timeout=500ms
Checking the state via the zero’s HTTP API at localhost:6080/state:
{
"counter": "14",
"groups": {
"1": {
"members": {
"1": {
"id": "1",
"groupId": 1,
"addr": "alpha1:7080",
"leader": true,
"amDead": false,
"lastUpdate": "1623926750",
"learner": false,
"clusterInfoOnly": false,
"forceGroupId": false
},
"2": {
"id": "2",
"groupId": 1,
"addr": "alpha2:7080",
"leader": false,
"amDead": false,
"lastUpdate": "0",
"learner": false,
"clusterInfoOnly": false,
"forceGroupId": false
},
"3": {
"id": "3",
"groupId": 1,
"addr": "alpha3:7080",
"leader": false,
"amDead": false,
"lastUpdate": "0",
"learner": false,
"clusterInfoOnly": false,
"forceGroupId": false
}
}
...
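To make the stale entry easier to spot, the /state output can be reduced to address/leader pairs. The following is a rough, text-based sketch: summarize_state is a made-up helper name, and it assumes every member object lists "addr" before "leader", as in the output above. A real JSON parser would be more robust.

```shell
# summarize_state: read a Zero /state document on stdin and print one
# "addr<TAB>leader" line per member entry it finds.
# Hypothetical helper; relies on "addr" preceding "leader" in each member.
summarize_state() {
  grep -oE '"(addr|leader)": *("[^"]*"|true|false)' |
    sed -E 's/^"[a-z]+": *//; s/"//g' |
    paste - -
}

# Usage, assuming the zero's HTTP port is published on localhost:6080 as above:
# curl -s localhost:6080/state | summarize_state
```

For the state shown above, alpha1:7080 would still be printed with leader set to true, even though the container no longer exists.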
Checking the status via GraphQL on alpha2:
// http://localhost:8082/admin
"health": [
{
"address": "zero:5080",
"status": "healthy",
"instance": "zero",
"version": "v21.03.0",
"lastEcho": 1623927699,
"group": "0",
"uptime": 960,
"ongoing": [],
"indexing": []
},
{
"address": "alpha1:7080",
"status": "unhealthy",
"instance": "alpha",
"version": "v21.03.0",
"lastEcho": 1623926779,
"group": "1",
"uptime": 29,
"ongoing": [],
"indexing": []
},
{
"address": "alpha3:7080",
"status": "healthy",
"instance": "alpha",
"version": "v21.03.0",
"lastEcho": 1623927699,
"group": "1",
"uptime": 884,
"ongoing": [],
"indexing": []
},
{
"address": "alpha2:7080",
"status": "healthy",
"instance": "alpha",
"version": "v21.03.0",
"lastEcho": 1623927699,
"group": "1",
"uptime": 946,
"ongoing": [
"opRollup"
],
"indexing": []
}
]
Alpha3 is reported as “healthy”, but it does not work: I cannot access http://localhost:8083/admin.
# docker logs alpha3
Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". Retrying...
Connection lost with alpha1:7080. Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host"
Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": error from client.subscribe: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". closer err: <nil>
Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha1 on 127.0.0.11:53: no such host". Retrying...
CONNECTING to alpha3:7080
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
Error reading GraphQL schema: Please retry again, server is not ready to accept requests
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
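To confirm which peers alpha3 fails to resolve, the hostnames can be pulled out of the DNS errors in the log. A minimal sketch, where unresolved_hosts is a made-up helper name and the "lookup HOST on" pattern is taken from the error lines above:

```shell
# unresolved_hosts: read log lines on stdin and print each distinct hostname
# that appears in a "lookup HOST on ..." DNS error.
# Hypothetical helper based on the error format shown in the log above.
unresolved_hosts() {
  grep -oE 'lookup [^ ]+ on' | awk '{print $2}' | sort -u
}

# Usage (the alpha logs to stderr because of --logtostderr, hence 2>&1):
# docker logs alpha3 2>&1 | unresolved_hosts
```

For the log above this should list only alpha1, which suggests alpha3 is still trying to join the group through the dead leader.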
So, what am I supposed to do when a replica dies?
And why is the offline alpha1 still on the list, and still marked as the leader, according to /state?