Mutation failed because Dgraph execution: Unhealthy connection

Hello :smiley:

I have 10 scripts that periodically insert or read data in Dgraph. The data are from the same node, for example, gitcommit. The scripts may run at the same time.

Unfortunately, during the execution I get the following error:

{'message': 'mutation failed because Dgraph execution failed because : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection', 'locations': [{'line': 2, 'column': 3}]}

Dgraph Alpha and Zero are running in Kubernetes. I have 5 Zeros and 6 Alphas. I have checked the logs, and the only errors or warnings I saw were:

From Dgraph Alpha

E1223 11:27:32.586522 20 groups.go:1000] No longer the leader of group 1. Exiting
E1223 11:27:32.586599 20 groups.go:937] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled

W1223 11:22:03.601350 20 draft.go:1313] Raft.Ready took too long to process: Timer Total: 549ms. Breakdown: [{disk 323ms} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false

From Dgraph Zero

W1221 14:29:42.811258 19 pool.go:204] Shutting down extra connection to dgraph-alpha-0.dgraph-alpha.dgraph-2011.svc.cluster.local:7080

W1221 17:15:58.941971 21 raft.go:922] Raft.Ready took too long to process: Timer Total: 838ms. Breakdown: [{proposals 838ms} {disk 0s} {advance 0s}]. Num entries: 1. Num committed entries: 0. MustSync: true
W1221 17:38:27.684235 21 raft.go:922] Raft.Ready took too long to process: Timer Total: 2.476s. Breakdown: [{disk 2.476s} {proposals 0s} {advance 0s}]. Num entries: 0. Num committed entries: 1. MustSync: false

The error occurs in a script that tries to insert a list with 1000 elements, although the list is only 9032 bytes in memory. Besides, I can insert many lists before this error occurs. I don’t know if this error is caused by the data payload.

I also don’t know if this error is caused by Shard rebalancing at the same time data is being inserted on Dgraph.

Any help would be appreciated :smiley:

Hi @jordan, can you confirm which version of Dgraph you are using? In earlier versions we experienced this because of slowness in Badger when managing the Raft WAL. In the recent v20.11 release, we have improved this behavior.

This happens when leadership is changing: the connections to the non-leader node are terminated and a new connection to the leader is created.
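Since a leadership change is transient, one possible client-side mitigation is to retry the failed request after a short delay. The following is only a minimal sketch, not something Dgraph or gql provides: `run_with_retry`, its parameters, and the idea of matching on the error text are all assumptions for illustration, and `send` stands for whatever callable in your script executes the mutation.

```python
import time


def run_with_retry(send, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() and retry with exponential backoff on 'Unhealthy connection'.

    Hypothetical helper: it assumes the error text from Dgraph ends up in
    str(exception), which may not hold for every client library.
    """
    for attempt in range(attempts):
        try:
            return send()
        except Exception as exc:
            transient = "Unhealthy connection" in str(exc)
            # Re-raise unrelated errors, and give up after the last attempt.
            if not transient or attempt == attempts - 1:
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

You would wrap the call that sends the mutation, for example `run_with_retry(lambda: client.execute(mutation))`, where `client` and `mutation` are the objects your script already has. Keep in mind that retrying a mutation can apply it twice if the first attempt actually succeeded on the server, so this fits best when the insert is safe to repeat.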

This error occurs when the node is unhealthy. A node is considered healthy if its last ping epoch time is within the last 2 seconds.
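To make the health rule concrete, here is a rough Python sketch of that kind of check. It is not Dgraph's actual code (Dgraph is written in Go); the `Connection` class, its method names, and the injectable clock are hypothetical, and only the 2-second window comes from the description above.

```python
import time

HEALTHY_WINDOW_SECONDS = 2.0  # window taken from the description above


class Connection:
    """Hypothetical stand-in for a pooled connection, for illustration only."""

    def __init__(self, now=time.monotonic):
        self._now = now            # clock function, injectable for testing
        self.last_ping = now()     # time of the last successful ping

    def record_ping(self):
        """Remember that a ping just succeeded."""
        self.last_ping = self._now()

    def is_healthy(self):
        """Healthy only if the last ping is recent enough."""
        return self._now() - self.last_ping <= HEALTHY_WINDOW_SECONDS
```

Under this model, a node that is too busy (for example, waiting on a slow disk) to answer pings for more than 2 seconds would be reported as unhealthy even though it is still running.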

Please confirm the Dgraph version and, if possible, describe how the script runs so that we can investigate further :slight_smile: .

This seems like it could be due to a slow disk.

v20.11.0

I’m checking the script again. I’m using gql with Python to read and insert data into Dgraph. Maybe I spend too much time processing messages (longer than 2 seconds) before sending them?