We have a Dgraph instance in a production cluster that’s maxing out its CPU even though all the other Dgraph instances in the cluster are healthy (and not taking much load). The overtaxed instance is running both Zero and Alpha, but we can’t get into it to see what’s taking all the load because it has become unresponsive.
So, we can’t upgrade it yet.
Any ideas what could cause this?
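If the HTTP port on that instance still responds, one thing we may try is pulling a CPU profile from the Go pprof endpoints Dgraph exposes and inspecting it offline. A minimal sketch, assuming the default HTTP ports (8080 for Alpha, 6080 for Zero; adjust for any port offset) and a placeholder hostname:

```python
import urllib.request

HOST = "nodename08.our.domain.com"  # placeholder: use the overloaded instance's hostname

# Default HTTP ports are 8080 (Alpha) and 6080 (Zero); adjust if you run with a port offset.
ENDPOINTS = {
    "cpu_alpha.pb.gz": f"http://{HOST}:8080/debug/pprof/profile?seconds=30",
    "cpu_zero.pb.gz": f"http://{HOST}:6080/debug/pprof/profile?seconds=30",
}

for filename, url in ENDPOINTS.items():
    print(f"fetching {url} (takes ~30s) ...")
    with urllib.request.urlopen(url, timeout=90) as resp, open(filename, "wb") as out:
        out.write(resp.read())
    # Inspect offline with: go tool pprof <filename>
    print(f"wrote {filename}")
```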
We rebooted the node but didn’t yet start Zero and Alpha on it.
After we did this, we were unable to access any of the data from Ratel or export any data from any of our Alpha nodes.
As soon as we brought up the Zero and Alpha services on the node that had gone down, we were able to access the data again.
This means that the Dgraph cluster is sensitive to a single node going down, which is a major reliability problem.
How many groups are in your cluster? What’s the replication factor?
If your data is not replicated and it’s all in a single group, then you’d see something similar to what you are experiencing. However, without knowing more details, it’s hard to give a better answer.
Could you share the output of the /state endpoint in Zero? (Perform a GET HTTP request to <zero_host>:<zero_port>/state.)
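If it helps, here’s a rough sketch for summarizing that output (the Zero address is a placeholder; the default Zero HTTP port is 6080, plus any port offset you run with):

```python
import json
import urllib.request

# Placeholder address: replace with your Zero's host and HTTP port.
ZERO_STATE_URL = "http://nodename08.our.domain.com:6080/state"

with urllib.request.urlopen(ZERO_STATE_URL, timeout=10) as resp:
    state = json.load(resp)

# /state returns a JSON object whose "groups" map is keyed by group ID;
# each group lists its Alpha members under "members".
for gid, group in sorted(state.get("groups", {}).items()):
    members = group.get("members", {})
    print(f"group {gid}: {len(members)} alpha(s)")
    for m in members.values():
        print(f"  id={m.get('id')} addr={m.get('addr')} leader={m.get('leader', False)}")
```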
We’re getting some sort of deadlock between the Zero and Alpha nodes, so we’re having trouble bringing them up after the upgrade. We ran the bulk import successfully, copied the resulting data into the p directory on all the Alpha nodes, and brought the Zero nodes up without issue. However, we can’t get the Alpha nodes to come up.
I tried just bringing up a single Alpha node, and here’s the Alpha log:
I1022 21:07:15.531499 1 pool.go:148] CONNECTED to nodename08.our.domain.com:5082
I1022 21:07:15.536474 1 pool.go:148] CONNECTED to nodename12.our.domain.com:5082
I1022 21:07:15.540067 1 groups.go:121] Connected to group zero. Assigned group: 0
I1022 21:07:15.540093 1 groups.go:123] Raft Id after connection to Zero: 0x1
I1022 21:07:15.540248 1 pool.go:148] CONNECTED to nodename12.our.domain.com:7082
I1022 21:07:15.540295 1 pool.go:148] CONNECTED to nodename10.our.domain.com:7082
I1022 21:07:15.540332 1 pool.go:148] CONNECTED to nodename09.our.domain.com:7082
I1022 21:07:15.540367 1 pool.go:148] CONNECTED to nodename11.our.domain.com:7082
I1022 21:07:15.540385 1 pool.go:148] CONNECTED to nodename09.our.domain.com:5082
I1022 21:07:15.540401 1 pool.go:148] CONNECTED to nodename11.our.domain.com:5082
I1022 21:07:15.540426 1 pool.go:148] CONNECTED to nodename10.our.domain.com:5082
I1022 21:07:15.540485 1 draft.go:76] Node ID: 0x1 with GroupID: 1
I1022 21:07:15.540535 1 node.go:143] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc015578c60 Applied:0 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x22d6ac8 DisableProposalForwarding:false}
I1022 21:07:15.540815 1 node.go:321] Group 1 found 1 entries
I1022 21:07:15.540837 1 draft.go:1330] Calling IsPeer
W1022 21:07:15.541605 1 pool.go:237] Connection lost with nodename09.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.19:7082: connect: connection refused"
W1022 21:07:15.542254 1 pool.go:237] Connection lost with nodename11.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.21:7082: connect: connection refused"
W1022 21:07:15.542358 1 pool.go:237] Connection lost with nodename10.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.20:7082: connect: connection refused"
E1022 21:07:15.542409 1 draft.go:1363] Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.21:7082: connect: connection refused". Retrying...
W1022 21:07:15.542559 1 pool.go:237] Connection lost with nodename12.our.domain.com:7082. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.22:7082: connect: connection refused"
I1022 21:07:16.542839 1 draft.go:1330] Calling IsPeer
E1022 21:07:16.543216 1 draft.go:1363] Error while calling hasPeer: error while joining cluster: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 1.2.3.21:7082: connect: connection refused". Retrying...
E1022 21:07:17.545427 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:18.547681 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:19.549640 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:20.551858 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:21.554080 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:22.556148 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:23.558146 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:24.560198 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:25.562229 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:26.564203 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:27.566239 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:28.568360 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:29.570382 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:30.572511 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:31.574434 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:32.576429 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:33.578542 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:34.580742 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
E1022 21:07:35.582905 1 draft.go:1363] Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
In the Dgraph Zero leader’s logs, it appears that the Alpha instance can connect, which contradicts what the Alpha logs suggest. Here’s the Zero leader log:
I1022 21:07:10.158931 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:11.160208 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:11.161285 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:12.162356 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:12.163453 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:13.164541 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:13.165571 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:15.441236 1 zero.go:415] Got connection request: id:1 addr:"nodename08.our.domain.com:7082"
I1022 21:07:15.442313 1 zero.go:536] Connected: id:1 addr:"nodename08.our.domain.com:7082"
I1022 21:07:17.446536 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:17.447607 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:18.448802 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:18.449954 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:19.450996 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:19.451903 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:20.452930 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:20.454091 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:21.455138 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:21.456256 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:22.457310 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:22.458332 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:23.459415 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:23.460339 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:24.461222 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:24.462271 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:25.463335 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:25.464374 1 zero.go:433] Connected: cluster_info_only:true
I1022 21:07:26.465356 1 zero.go:415] Got connection request: cluster_info_only:true
I1022 21:07:26.466331 1 zero.go:433] Connected: cluster_info_only:true
The Alpha logs are complaining about not being able to connect to another Alpha: Zero thinks another node is the leader of group 1, so this Alpha keeps trying to reach it.
If you haven’t changed the data since the bulk load, I’d recommend removing the Zeros (along with all their data) and the w/ folder in each Alpha, then restarting the cluster with freshly created Zeros and Alphas.
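Here’s a minimal sketch of that cleanup, assuming the default directory names (zw/ for Zero’s data, w/ and p/ for each Alpha) and hypothetical paths; run it on each host after stopping the Dgraph processes:

```python
import shutil
from pathlib import Path

# Hypothetical paths: adjust to wherever your Zero and Alpha data live on each host.
ZERO_DATA_DIRS = [Path("/data/dgraph/zero/zw")]   # Zero's data directory (zw/ by default)
ALPHA_WAL_DIRS = [Path("/data/dgraph/alpha/w")]   # Alpha's Raft WAL directory (w/)
# Do NOT touch the Alpha p/ directories; they hold the bulk-loaded data.

for d in ZERO_DATA_DIRS + ALPHA_WAL_DIRS:
    if d.exists():
        print(f"removing {d}")
        shutil.rmtree(d)
    else:
        print(f"skipping {d} (not found)")
```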
Thanks. Does it matter if we start the zeros all at once, or should we wait a moment between starting them in sequence?
We started the Zero nodes with a 5-second delay between each one, and we started the Alpha nodes with a 5-second delay between each one, and the cluster came up healthy.
There’s no need to delay the startup of the Zeros and Alphas.