Production instance is taking the entire load for the cluster

We have a Dgraph instance in a production cluster that’s maxing out its CPU even though all the other Dgraph instances in the cluster are healthy and not taking much load. The overtaxed instance is running both Zero and Alpha, but we can’t get into it to see what’s taking all the load because it has become unresponsive, so we can’t upgrade it yet.
Any ideas what could cause this?

We rebooted the node but did not restart Zero and Alpha on it right away. While it was down, we were unable to access any data from Ratel or export any data from any of our Alpha nodes. As soon as we brought the Zero and Alpha services back up on that node, we could access the data again.
This suggests the Dgraph cluster is sensitive to a single node going down, which is a major reliability problem.

How many groups are in your cluster? What’s the replication factor?

If your data is not replicated and it’s all in a single group, then you’d see something similar to what you are experiencing. However, without knowing more details, it’s hard to give a better answer.

Could you share the output of the /state endpoint on Zero? (Perform a GET HTTP request to <zero_host>:<zero_port>/state.)
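
For example, something like this fetches and summarizes it (a minimal sketch: the Zero HTTP address below is an assumption, the default HTTP port is 6080, and the exact JSON layout of /state may vary between Dgraph versions):

# Sketch: fetch Zero's /state and print each group's Alpha members.
import json
import urllib.request

ZERO_HTTP = "http://localhost:6080"  # assumed address; use your Zero's HTTP host:port

with urllib.request.urlopen(f"{ZERO_HTTP}/state") as resp:
    state = json.load(resp)

# The response is expected to contain a "groups" map keyed by group id,
# each holding a "members" map of the Alphas serving that group.
for gid, group in state.get("groups", {}).items():
    members = [m.get("addr") for m in group.get("members", {}).values()]
    print(f"group {gid}: {members}")

That output shows how many groups you have and how many members are serving each group, which answers the two questions above.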

We’re getting some sort of deadlock between the Zero and Alpha nodes, so we’re having trouble bringing them up after the upgrade. The bulk import ran successfully, we copied the data into the p directory on each Alpha node, and the Zero nodes came up fine. However, we’re still having trouble getting the Alpha nodes to come up.
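
For reference, the copy step was roughly of this shape (a sketch only: the paths and shard number below are placeholders, and it assumes the bulk loader’s default out/<shard>/p output layout):

# Sketch: copy one bulk-loader output shard into an Alpha's p directory.
# The destination must not already exist, and the Alpha must not be running yet.
import shutil
from pathlib import Path

BULK_OUT = Path("/data/bulk/out")   # assumed bulk-loader output directory
ALPHA_P = Path("/data/dgraph/p")    # assumed data directory for this Alpha
SHARD = "0"                         # assumed shard served by this Alpha's group

shutil.copytree(BULK_OUT / SHARD / "p", ALPHA_P)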

I tried just bringing up a single Alpha node, and here’s the Alpha log:

I1022 21:07:15.531499       1 pool.go:148] CONNECTED to nodename08.our.domain.com:5082
I1022 21:07:15.536474       1 pool.go:148] CONNECTED to nodename12.our.domain.com:5082
I1022 21:07:15.540067       1 groups.go:121] Connected to group zero. Assigned group: 0
I1022 21:07:15.540093       1 groups.go:123] Raft Id after connection to Zero: 0x1
I1022 21:07:15.540248       1 pool.go:148] CONNECTED to nodename12.our.domain.com:7082
I1022 21:07:15.540295       1 pool.go:148] CONNECTED to nodename10.our.domain.com:7082
I1022 21:07:15.540332       1 pool.go:148] CONNECTED to nodename09.our.domain.com:7082
I1022 21:07:15.540367       1 pool.go:148] CONNECTED to nodename11.our.domain.com:7082
I1022 21:07:15.540385       1 pool.go:148] CONNECTED to nodename09.our.domain.com:5082
I1022 21:07:15.540401       1 pool.go:148] CONNECTED to nodename11.our.domain.com:5082
I1022 21:07:15.540426       1 pool.go:148] CONNECTED to nodename10.our.domain.com:5082
I1022 21:07:15.540485       1 draft.go:76] Node ID: 0x1 with GroupID: 1
I1022 21:07:15.540535       1 node.go:143] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc015578c60 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x22d6ac8 DisableProposalForwarding:false}
I1022 21:07:15.540815       1 node.go:321] Group 1 found 1 entries
I1022 21:07:15.540837       1 draft.go:1330] Calling IsPeer
W1022 21:07:15.541605       1 pool.go:237] Connection lost with nodename09.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.19:7082: connect: connection refused"
W1022 21:07:15.542254       1 pool.go:237] Connection lost with nodename11.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.21:7082: connect: connection refused"
W1022 21:07:15.542358       1 pool.go:237] Connection lost with nodename10.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.20:7082: connect: connection refused"
E1022 21:07:15.542409       1 draft.go:1363] Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.21:7082: connect: connection refused". Retrying...
W1022 21:07:15.542559       1 pool.go:237] Connection lost with nodename12.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.22:7082: connect: connection refused"
I1022 21:07:16.542839       1 draft.go:1330] Calling IsPeer
E1022 21:07:16.543216       1 draft.go:1363] Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.21:7082: connect: connection refused". Retrying...
E1022 21:07:17.545427       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:18.547681       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:19.549640       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:20.551858       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:21.554080       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:22.556148       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:23.558146       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:24.560198       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:25.562229       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:26.564203       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:27.566239       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:28.568360       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:29.570382       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:30.572511       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:31.574434       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:32.576429       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:33.578542       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:34.580742       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:35.582905       1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...

In the Dgraph Zero leader logs, it appears that the Alpha instance can connect, which is not what the Alpha logs are suggesting. Here’s the Zero leader log:

I1022 21:07:10.158931 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:11.160208 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:11.161285 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:12.162356 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:12.163453 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:13.164541 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:13.165571 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:15.441236 1 zero.go:415] Got connection request: id:1 addr:"nodename08.our.domain.com:7082"
I1022 21:07:15.442313 1 zero.go:536] Connected: id:1 addr:"nodename08.our.domain.com:7082"
I1022 21:07:17.446536 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:17.447607 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:18.448802 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:18.449954 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:19.450996 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:19.451903 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:20.452930 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:20.454091 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:21.455138 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:21.456256 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:22.457310 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:22.458332 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:23.459415 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:23.460339 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:24.461222 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:24.462271 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:25.463335 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:25.464374 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:26.465356 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:26.466331 1 zero.go:433] Connected: cluster_info_only:true

The logs are complaining that this Alpha can’t connect to the other Alphas: Zero thinks another node is the leader of the group, and this Alpha is trying to connect to it.

If you haven’t changed the data since the bulk load, I’d recommend removing the zeroes (along with all their data) and the w/ folder in each alpha.

Then restart the cluster with newly recreated zeroes and alphas.
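
As a rough sketch of that cleanup (assuming the default directory names, zw for Zero’s state and w for each Alpha’s Raft WAL, plus an assumed data root; leave each Alpha’s p directory alone, since it holds the bulk-loaded data):

# Sketch: remove Zero state and the Alpha Raft WAL before recreating the cluster.
import shutil
from pathlib import Path

DATA_ROOT = Path("/data/dgraph")  # assumed path; adjust to your deployment

def remove_if_exists(path: Path) -> None:
    if path.exists():
        shutil.rmtree(path)
        print(f"removed {path}")

# On each Zero node: remove Zero's state directory (zw by default).
remove_if_exists(DATA_ROOT / "zw")

# On each Alpha node: remove the Raft WAL directory (w by default), keep p/ intact.
remove_if_exists(DATA_ROOT / "w")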


Thanks. Does it matter if we start the zeros all at once, or should we wait a moment between starting them in sequence?

We started the Zero nodes with a 5-second delay between each one, then did the same for the Alpha nodes, and the cluster came up healthy.

There’s no need to delay the startup of the zeroes and alphas.
