Queries stop working when dgraph zero leader node goes down

itsjay · August 5, 2022, 4:32am

Dgraph version: 21.12.0

If the dgraph zero leader node goes down, all queries and mutations start timing out until a new leader is elected.
My understanding was that alpha nodes serve data while zero nodes manage the cluster and move data between alpha instances. So, if the zero leader goes down, shouldn’t the queries/mutations continue to work?

If this is expected, then how can I ensure that the app remains available during that period?

I’m hosting Dgraph in Google Cloud (GKE) with 5 alpha and 5 zero nodes.

Note: this only happens when the zero leader goes down.

MichelDiz · August 5, 2022, 2:09pm

Yes, when the leader dies the communication is lost.

You can read about it here: RAFT - Design concepts

You can try best-effort queries. https://dgraph.io/docs/clients/raw-http/#running-best-effort-queries
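
As a rough illustration of what the linked page describes, here is a minimal Python sketch that sends a DQL query to an Alpha over raw HTTP with the `be=true` query parameter. The helper names, the Alpha address, and the timeout are assumptions for illustration, and the `application/dql` content type is what recent versions expect (older versions may expect a different one), so check the linked docs for your release.

```python
import json
import urllib.parse
import urllib.request


def build_best_effort_request(alpha_url, dql):
    """Build a POST to <alpha_url>/query?be=true carrying a DQL query.

    alpha_url is an assumed Alpha HTTP address such as "http://localhost:8080".
    """
    params = urllib.parse.urlencode({"be": "true"})  # best-effort flag
    url = f"{alpha_url.rstrip('/')}/query?{params}"
    return urllib.request.Request(
        url,
        data=dql.encode("utf-8"),
        # Assumption: content type used by recent Dgraph versions.
        headers={"Content-Type": "application/dql"},
        method="POST",
    )


def run_best_effort_query(alpha_url, dql, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    req = build_best_effort_request(alpha_url, dql)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `run_best_effort_query("http://localhost:8080", "{ q(func: has(name), first: 1) { uid } }")` would run a small read-only query, where `name` is a hypothetical predicate. This only applies to queries, not to mutations.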

If this scenario happens often, you should analyze the context and find out why the leader keeps going down.

Mutations in particular need to request/lease UIDs and timestamps and to commit the transaction. Without the leader this is compromised. Followers just forward the transaction to the leader.

I also recommend that you put the addresses of all Zero servers in all your Alphas. That way each Alpha instance can communicate directly with another Zero to quickly identify who the leader is.

itsjay · August 5, 2022, 9:10pm

Yeah, we are looking into what’s causing the zero leader to go down.

But what we are also seeing is that even after the zero leader goes down and a new one is elected, it takes over 20 seconds before alpha starts responding to queries. I have attached the logs from both alpha and zero, where you can see zero-4 (the leader) go down at 20:39:19 and a new leader elected by 20:39:20.
But according to the alpha logs, alpha keeps trying to connect to zero-4 even though zero-2 has been elected leader.

alpha-logs.txt (35.9 KB)
zero-logs.txt (52.0 KB)

Ben_Livengood · August 5, 2022, 9:36pm

I also work with Jay and can share the configuration we’re running dgraph in, if that helps.

I’m deploying the 0.0.19 helm chart in namespace dgraph with these changes to values.yaml. The helm release is named ‘dgraph’ too.

zero.persistence.enabled=true
zero.replicaCount=5
alpha.persistence.enabled=true
alpha.replicaCount=5
alpha.extraFlags="--security whitelist=10.1.0.0:10.1.255.255"
image.tag=v21.12.0

I verified that dgraph alpha gets configured as follows within the statefulSet and has the hostnames of all the zero pods:

dgraph alpha --my=$(hostname -f | awk '{gsub(/\.$/,""); print $0}'):7080 --zero dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-1.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-3.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080,dgraph-dgraph-zero-4.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --security whitelist=10.1.0.0:10.1.255.255
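
To make it easier to check that the `--zero` value really lists every replica, here is a small Python sketch that rebuilds the comma-separated list. The naming pattern is only inferred from the hostnames in the command above, and the function name and parameters are hypothetical.

```python
def zero_addresses(release, namespace, replicas, port=5080):
    """Return the comma-separated Zero addresses for a --zero flag.

    Assumes pods are named <release>-dgraph-zero-<i> behind a headless
    service named <release>-dgraph-zero-headless, as seen in this thread.
    """
    base = f"{release}-dgraph-zero"
    return ",".join(
        f"{base}-{i}.{base}-headless.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    )


# Release "dgraph" in namespace "dgraph" with 5 Zero replicas.
print(zero_addresses("dgraph", "dgraph", 5))
```

Comparing this output with the flag in the rendered statefulSet is a quick way to spot a missing or misspelled Zero host.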

What it looks like to me from the logs is that alpha isn’t retrying other zero nodes soon enough to find which one has become the leader when dgraph-dgraph-zero-4 became unavailable.

Leader election in zero happens at 20:39:20

dgraph-dgraph-zero-2 dgraph-dgraph-zero I0805 20:39:20.967599 19 log.go:34] 3 became leader at term 577

Alpha doesn’t see this until 20:39:24

dgraph-dgraph-alpha-1 dgraph-dgraph-alpha I0805 20:39:23.204873 19 groups.go:867] Got address of a Zero leader: dgraph-dgraph-zero-4.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080
dgraph-dgraph-alpha-2 dgraph-dgraph-alpha I0805 20:39:24.305603 18 groups.go:867] Got address of a Zero leader: dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080
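
To put a number on that gap, here is a short Python sketch that parses the timestamps from the two log lines quoted above and computes the difference. The year is not part of the log prefix, so 2022 is assumed from the date of this post, and the helper name is made up for illustration.

```python
import re
from datetime import datetime

# Log prefix as seen above: severity letter, MMDD, then HH:MM:SS.microseconds.
LOG_TS = re.compile(r"\b[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6})\b")

ASSUMED_YEAR = "2022"  # assumption: the log prefix carries no year


def log_time(line):
    """Extract the timestamp from one log line as a datetime."""
    match = LOG_TS.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return datetime.strptime(ASSUMED_YEAR + match.group(1), "%Y%m%d %H:%M:%S.%f")


elected = log_time(
    "dgraph-dgraph-zero-2 dgraph-dgraph-zero I0805 20:39:20.967599 19 "
    "log.go:34] 3 became leader at term 577"
)
noticed = log_time(
    "dgraph-dgraph-alpha-2 dgraph-dgraph-alpha I0805 20:39:24.305603 18 "
    "groups.go:867] Got address of a Zero leader: "
    "dgraph-dgraph-zero-2.dgraph-dgraph-zero-headless.dgraph.svc.cluster.local:5080"
)

delay = (noticed - elected).total_seconds()
print(f"alpha noticed the new leader {delay:.3f}s after the election")  # 3.338s
```

So for these two lines the gap between the election and alpha-2 picking up the new leader is a little over 3 seconds. Running the same helper over the full attached logs would show how that relates to the 20+ seconds of failed queries.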

MichelDiz · August 5, 2022, 9:53pm

Please downgrade from this version. We are working on a new release with all the bug fixes and security fixes. This release was a bit troubled, and users have reported bugs.

Maybe we should add a flag to speed this up. Could you open an issue for this?

So the election was fast, but it is the acknowledgment of what happened that takes time?

itsjay · August 5, 2022, 10:15pm

That seems to be the case. If you look at the logs I attached in my previous comment, you can see the alpha is still trying to connect to the zero leader that has already been replaced by another leader.

itsjay · August 5, 2022, 10:22pm

Do you recommend going to 21.03.2? The release page suggests it’s end of life.

MichelDiz · August 5, 2022, 11:41pm

Ignore that for now.

Ben_Livengood · August 23, 2022, 10:38pm

Sorry, I am new to the issue tracking on this forum. Is this issue we opened (Queries stop working when dgraph zero leader goes node goes down - #5 by MichelDiz) a reasonable one to use for that request, or should it be a new issue with a specific “[feature request] flag to speed up alpha recovery when zero leader restarts” subject?

MichelDiz · August 26, 2022, 8:22pm

Sorry for the delay, I just noticed that question.

Well, you can open a feature request at

That would be great.

Thanks!
