Frequent Zero leadership change

surajdash · January 5, 2021, 6:51pm

Hi,

I am facing an issue where Dgraph upserts fail with the following error message: “rpc error: code = Unknown desc = No connection exists”. The configuration I am using is:
Dgraph in cluster mode (using the Kubernetes deployment file from https://github.com/dgraph-io/dgraph/blob/master/contrib/config/kubernetes/dgraph-ha/dgraph-ha.yaml)

I have also added a 2 vCPU / 6 GB resource limit to each Zero and Alpha node:
    resources:
      requests:
        memory: "2048Mi"
        cpu: "1000m"
      limits:
        memory: "6144Mi"
        cpu: "2000m"

On checking the logs from dgraph alpha I found that there were frequent error messages like the following -
I0105 18:42:06.672318 1 groups.go:931] Zero leadership changed. Renewing oracle delta stream.
E0105 18:42:06.672462 1 groups.go:907] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0105 18:42:07.671665 1 groups.go:863] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0105 18:42:08.486535 1 groups.go:875] Got Zero leader: dgraph-zero-0.dgraph-zero.xyz.svc.case.local:5080
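
To put a number on "frequent", the Alpha log can be scanned for the leadership-change message and the hits grouped by minute. The following is only a rough sketch: it assumes the log lines have the same shape as the ones quoted above, and the function name `leadership_changes_per_minute` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this post, for example:
# I0105 18:42:06.672318 1 groups.go:931] Zero leadership changed. ...
# Captures the MMDD date, the hour, and the minute.
LINE_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}):(\d{2}):\d{2}\.\d+ .*Zero leadership changed"
)


def leadership_changes_per_minute(lines):
    """Count 'Zero leadership changed' log lines, keyed by 'MMDD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            mmdd, hour, minute = match.groups()
            counts[f"{mmdd} {hour}:{minute}"] += 1
    return counts
```

Feeding it the lines of a saved Alpha log (for example from `open("alpha.log")`, a placeholder file name) gives one counter entry per minute, which makes it easier to see whether the changes line up with the upsert load.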

I noticed this error starts appearing frequently after I upsert ~1M nodes to dgraph.

I have tried using just one Dgraph Zero node and three Dgraph Alpha nodes, and the problem persists.

As a workaround, I have added a Dgraph keepalive ping and I retry when I get the error message. But I would love to get an RCA for the issue and to know whether I can do anything from my end to fix it.
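
For anyone wanting to reproduce the retry part of this workaround, here is a minimal sketch of a retry wrapper with exponential backoff. It is not tied to any Dgraph client: `operation` stands for whatever callable performs the upsert, and matching on the error text (`TRANSIENT_MARKERS`) is an assumption based only on the two messages quoted in this thread. The names `is_transient` and `retry_transient` are hypothetical.

```python
import random
import time

# Error substrings treated as transient. These come from the messages quoted
# in this thread; other deployments may need a different list.
TRANSIENT_MARKERS = ("No connection exists", "context canceled")


def is_transient(exc):
    """Return True if the error text contains one of the transient markers."""
    return any(marker in str(exc) for marker in TRANSIENT_MARKERS)


def retry_transient(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Re-raise non-transient errors, or the last transient one.
            if not is_transient(exc) or attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

Something like `retry_transient(lambda: run_upsert(batch))` would wrap a single upsert call, where `run_upsert` is a placeholder for your own function. This only hides the symptom; it does not address why the Zero leader keeps changing.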

wenweih · May 21, 2021, 9:21am

I got the same issue after upserting a bunch of records to the database. Is there a good solution?

MichelDiz · May 21, 2021, 1:30pm

If you are pushing the boundaries, you should give the nodes in your cluster more resources. 6 GB is too little to leave any margin. Each Alpha should have at least 16 GB for cases like yours. A “context canceled” error can happen when an Alpha dies.

If you limit your resources, you should limit your usage.
