Frequent Zero leadership change

Hi,

I am facing an issue where a Dgraph upsert fails with the following error message: "rpc error: code = Unknown desc = No connection exists". The configuration I am using is:
Dgraph in cluster mode (using the Kubernetes deployment file from https://github.com/dgraph-io/dgraph/blob/master/contrib/config/kubernetes/dgraph-ha/dgraph-ha.yaml)

I have also added a 2 vCPU / 6 GB resource limit to each Zero and Alpha node:
resources:
  requests:
    memory: "2048Mi"
    cpu: "1000m"
  limits:
    memory: "6144Mi"
    cpu: "2000m"

On checking the Dgraph Alpha logs, I found frequent messages like the following:
I0105 18:42:06.672318 1 groups.go:931] Zero leadership changed. Renewing oracle delta stream.
E0105 18:42:06.672462 1 groups.go:907] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
I0105 18:42:07.671665 1 groups.go:863] Leader idx=0x1 of group=1 is connecting to Zero for txn updates
I0105 18:42:08.486535 1 groups.go:875] Got Zero leader: dgraph-zero-0.dgraph-zero.xyz.svc.case.local:5080

I noticed this error starts appearing frequently after I upsert ~1M nodes to dgraph.

I have also tried using just one Dgraph Zero node with 3 Dgraph Alpha nodes, and the problem persists.

As a workaround, I have added a Dgraph keepalive ping and I retry the request when I get the error message. But I would love to get an RCA for the issue and to know whether there is anything I can do on my end to fix it.
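
A retry wrapper of the kind described above could look roughly like the following. This is only a minimal sketch under assumptions, not the code actually used: `with_retry` and `is_transient` are hypothetical helper names, the matched error strings are simply the ones quoted in this thread, and `op` stands in for whatever client call performs the upsert.

```python
import time


def is_transient(exc):
    """Heuristic (assumption): treat the errors quoted in this thread as retryable."""
    msg = str(exc)
    return "No connection exists" in msg or "context canceled" in msg


def with_retry(op, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call op() and retry transient failures with exponential backoff.

    op       -- zero-argument callable that performs one upsert attempt
    attempts -- maximum number of calls to op()
    sleep    -- injectable so the backoff can be observed without waiting
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            # Re-raise non-transient errors, or the last transient one.
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

With this shape, `op` would probably need to open a fresh transaction on every attempt rather than reuse one that has already failed. Retrying only hides the symptom; it does not explain why the connection drops.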

I got the same issue after upserting a bunch of records to the database. Is there a good solution?

If you are pushing the boundaries, you should give the nodes in your cluster more resources. 6GB is too little to leave any margin. Each Alpha should have at least 16GB for cases like yours. The "context canceled" error can happen when an Alpha dies.

If you limit your resources, you should limit your usage.