Queries stop working when the dgraph zero leader node goes down

Dgraph version: 21.12.0

If the dgraph zero leader node goes down, all queries and mutations start timing out until a new leader is elected.
My understanding was that alpha nodes serve data while zero nodes manage the cluster and move data between alpha instances. So, if zero leader goes down, shouldn’t the queries/mutations continue to work?

If this is expected, then how can I ensure that the app remains available during that period?

I’m hosting Dgraph in Google Cloud (GKE) with 5 alpha and 5 zero nodes.

Note: this only happens when the zero leader goes down.

Yes, when the leader dies the communication is lost.

You can read about it here https://dgraph.io/docs/design-concepts/raft/

You can try best-effort queries. https://dgraph.io/docs/clients/raw-http/#running-best-effort-queries
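For reference, a best-effort query over raw HTTP could look roughly like the sketch below. It assumes an Alpha HTTP endpoint at localhost:8080 (a placeholder), the be=true query parameter described on the linked docs page, and the application/dql content type. The helper names are made up for illustration. Whether this keeps reads responsive during a Zero leader change on 21.12 would need to be tested.

```python
import urllib.parse
import urllib.request


def build_best_effort_request(alpha_http_addr: str, dql: str) -> urllib.request.Request:
    """Build a POST to Alpha's /query endpoint with be=true (best-effort).

    alpha_http_addr is a placeholder such as "localhost:8080"; adjust it to
    wherever your Alpha HTTP port is exposed.
    """
    url = f"http://{alpha_http_addr}/query?" + urllib.parse.urlencode({"be": "true"})
    return urllib.request.Request(
        url,
        data=dql.encode("utf-8"),
        headers={"Content-Type": "application/dql"},
        method="POST",
    )


def run_query(req: urllib.request.Request, timeout_s: float = 5.0) -> str:
    """Send the request and return the raw JSON response body as text."""
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


# Example (needs a reachable Alpha, so it is left commented out):
# req = build_best_effort_request("localhost:8080", "{ q(func: has(name), first: 1) { uid } }")
# print(run_query(req))
```

Best-effort only applies to read-only queries; mutations still need Zero for UIDs, timestamps and commits.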

If this happens often, you should look at the context and find out why the leader keeps going down.

Mutations in particular need to lease UIDs and timestamps from Zero and commit their transactions through it. Without the leader this is compromised. Followers just forward the transaction to the leader.

I also recommend that you give every Alpha the addresses of all Zero servers. That way the Alpha instance itself can talk directly to another Zero and quickly find out who the leader is.

Yeah, we are looking into what’s causing the zero leader to go down.

But what we are also seeing is that after the zero leader goes down, it takes over 20 seconds before alpha starts responding to queries again. I have attached the logs from both alpha and zero, where you can see zero-4 (the leader) go down at 20:39:19 and a new leader elected by 20:39:20.
But according to the alpha logs, alpha keeps trying to connect to zero-4 after zero-2 has been elected leader.

alpha-logs.txt (35.9 KB)
zero-logs.txt (52.0 KB)

I also work with Jay and can share the configuration we’re running dgraph in, if that helps.

I’m deploying the 0.0.19 helm chart in namespace dgraph, with the helm release also named ‘dgraph’, and these changes to values.yaml:

zero.persistence.enabled=true
zero.replicaCount=5
alpha.persistence.enabled=true
alpha.replicaCount=5
alpha.extraFlags="--security whitelist=10.1.0.0:10.1.255.255"
image.tag=v21.12.0

I verified that dgraph alpha gets configured as follows within the statefulSet and has the hostnames of all the zero pods:

dgraph alpha --my=$(hostname -f | awk '{gsub(/.$/,""); print $0}'):7080 --zero dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-1.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-3.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-4.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --security whitelist=10.1.0.0:10.1.255.255
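To make the shape of that --zero value easier to check, here is a small sketch that rebuilds the comma-separated list from the replica count. The function name and parameters are made up for illustration, and the naming pattern is simply read off the command above, not taken from the chart templates.

```python
def zero_addresses(fullname: str, namespace: str, replicas: int, port: int = 5080) -> str:
    """Return the comma-separated Zero addresses in the form seen in the command above.

    fullname is the prefix used for the pods and headless service
    ("dgraph-dgraph" in this deployment).
    """
    hosts = [
        f"{fullname}-zero-{i}.{fullname}-zero-headless.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]
    return ",".join(hosts)


if __name__ == "__main__":
    # Five zeros in namespace "dgraph", as in this cluster.
    for addr in zero_addresses("dgraph-dgraph", "dgraph", 5).split(","):
        print(addr)
```

Comparing this output with the flag rendered in the statefulSet is a quick way to confirm that no Zero pod is missing from the list.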

What it looks like to me from the logs is that alpha isn’t retrying other zero nodes soon enough to find which one has become the leader when dgraph-dgraph-zero-4 became unavailable.

Leader election in zero happens at 20:39:20

dgraph-dgraph-zero-2 dgraph-dgraph-zero I0805 20:39:20.967599 19 log.go:34] 3 became leader at term 577

Alpha doesn’t see this until 20:39:24

dgraph-dgraph-alpha-1 dgraph-dgraph-alpha I0805 20:39:23.204873 19 groups.go:867] Got address of a Zero leader: dgraph-dgraph-zero-4.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080
dgraph-dgraph-alpha-2 dgraph-dgraph-alpha I0805 20:39:24.305603 18 groups.go:867] Got address of a Zero leader: dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080
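The gap can be measured directly from the timestamps in the lines quoted above. This is a minimal sketch that assumes the usual glog prefix (severity letter, MMDD, then HH:MM:SS.microseconds) and that both lines are from the same day; the function names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "I0805 20:39:20.967599" and captures the time of day.
GLOG_TIME = re.compile(r"[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})")


def glog_time(line: str) -> datetime:
    """Extract the time of day from a glog-style log line (date part ignored)."""
    match = GLOG_TIME.search(line)
    if match is None:
        raise ValueError(f"no glog timestamp found in: {line!r}")
    return datetime.strptime(match.group(1), "%H:%M:%S.%f")


def delay_between(earlier_line: str, later_line: str) -> timedelta:
    """Time elapsed between two log lines from the same day."""
    return glog_time(later_line) - glog_time(earlier_line)


if __name__ == "__main__":
    elected = "dgraph-dgraph-zero-2 dgraph-dgraph-zero I0805 20:39:20.967599 19 log.go:34] 3 became leader at term 577"
    noticed = (
        "dgraph-dgraph-alpha-2 dgraph-dgraph-alpha I0805 20:39:24.305603 18 groups.go:867] "
        "Got address of a Zero leader: dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080"
    )
    delay = delay_between(elected, noticed)
    print(f"{delay.total_seconds():.3f} s")  # prints "3.338 s"
```

So for these two lines, roughly 3.3 seconds pass between the election on zero-2 and alpha-2 learning the new leader's address.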

Please downgrade from this version. We are working on a new release with bug fixes and security fixes. This release was a bit troubled, and users have reported bugs.

Maybe we should add a flag to speed this up. Could you open an issue for this?

So the election was fast, but it is the acknowledgment of what happened that takes time?

That seems to be the case. If you look at the logs I attached in my previous comment, you can see the alpha is still trying to connect to the zero leader that has already been replaced by another leader.

Do you recommend going to 21.03.2? The release page suggests it’s end of life.

Ignore that for now.