Sorry, I couldn’t find a nice way to reproduce this.
Basically, I ran all dgraph instances in a kubernetes pod. There are 5 dgraph zeros and 20 dgraph servers in the pod. Each of them runs in its own container. Each server and zero has an unique group ID assigned via flag idx
After ingesting a large amount of data ~31GB in dgraph database, I tried to make a copy of the database:
- stop the running dgraph pod.
- copy the database.
- restart the pod.
I saw following logs with spiking CPU usage:
similar logs for all dgraph zero:
Setting up grpc listener at: 0.0.0.0:7080
Setting up http listener at: 0.0.0.0:8080
2018/01/15 22:41:37 node.go:233: Found Snapshot, Metadata: {ConfState:{Nodes:[1 2 3 4 5] XXX_unrecognized:[]} Index:27053 Term:3 XXX_unrecognized:[]}
2018/01/15 22:41:37 node.go:248: Found hardstate: {Term:3 Vote:1 Commit:27881 XXX_unrecognized:[]}
2018/01/15 22:41:37 node.go:260: Group 0 found 828 entries
2018/01/15 22:41:37 raft.go:400: Restarting node for dgraphzero
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7104. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7104
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7098. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7098
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7094. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7094
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7102. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7102
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7092. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7092
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7093
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7100. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7100
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7108. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7108
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7107
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7099. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7099
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7096. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7096
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7103
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7090. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7090
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7105. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7105
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7095. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7095
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7101. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7101
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7091. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7091
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7097. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7097
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7106. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7106
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7109
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7081
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7082
2018/01/15 22:41:37 pool.go:167: Echo error from localhost:7083. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7083
2018/01/15 22:41:37 pool.go:118: == CONNECT ==> Setting localhost:7084
2018/01/15 22:41:37 raft.go:567: INFO: 1 became follower at term 3
2018/01/15 22:41:37 raft.go:315: INFO: newRaft 1 [peers: [1,2,3,4,5], term: 3, commit: 27881, applied: 27053, lastindex: 27881, lastterm: 3]
Running Dgraph zero...
2018/01/15 22:41:38 zero.go:316: Got connection request: id:18 addr:"localhost:7098"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:27 addr:"localhost:7107"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:12 addr:"localhost:7092"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:29 addr:"localhost:7109"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:21 addr:"localhost:7101"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:23 addr:"localhost:7103"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:22 addr:"localhost:7102"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:19 addr:"localhost:7099"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:10 addr:"localhost:7090"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:38 zero.go:316: Got connection request: id:24 addr:"localhost:7104"
2018/01/15 22:41:38 zero.go:412: Connected
2018/01/15 22:41:39 zero.go:316: Got connection request: id:17 addr:"localhost:7097"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:39 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:39 zero.go:316: Got connection request: id:13 addr:"localhost:7093"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:39 zero.go:316: Got connection request: id:14 addr:"localhost:7094"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:39 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:39 zero.go:316: Got connection request: id:15 addr:"localhost:7095"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:39 zero.go:316: Got connection request: id:20 addr:"localhost:7100"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:39 zero.go:316: Got connection request: id:16 addr:"localhost:7096"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:39 zero.go:316: Got connection request: id:25 addr:"localhost:7105"
2018/01/15 22:41:39 zero.go:412: Connected
2018/01/15 22:41:40 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:40 zero.go:316: Got connection request: id:28 addr:"localhost:7108"
2018/01/15 22:41:40 zero.go:412: Connected
2018/01/15 22:41:40 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:41 zero.go:316: Got connection request: id:11 addr:"localhost:7091"
2018/01/15 22:41:41 zero.go:412: Connected
2018/01/15 22:41:41 zero.go:316: Got connection request: id:26 addr:"localhost:7106"
2018/01/15 22:41:41 zero.go:412: Connected
2018/01/15 22:41:42 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:43 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:44 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:45 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:46 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:46 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:46 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:46 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:46 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:47 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
2018/01/15 22:41:47 raft.go:1070: INFO: 1 no leader at term 3; dropping index reading msg
Similar logs for all dgraph server
2018/01/15 22:41:40 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:40 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:41 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:41 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:42 raft.go:749: INFO: e is starting a new election at term 3
2018/01/15 22:41:42 raft.go:580: INFO: e became candidate at term 4
2018/01/15 22:41:42 raft.go:664: INFO: e received MsgVoteResp from e at term 4
2018/01/15 22:41:42 raft.go:621: INFO: e became leader at term 4
2018/01/15 22:41:42 node.go:301: INFO: raft.node: e elected leader e at term 4
2018/01/15 22:41:42 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:42 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:43 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:43 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:44 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:44 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:45 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:45 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:46 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:46 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/01/15 22:41:47 groups.go:457: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/01/15 22:41:47 groups.go:673: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
When I test the similar setup locally in terminal with much smaller scale (3 zeros and 3 servers) and data set (~50MB)
I noticed when a dgraph zero starts the term increases everytime and the peers array is empty :
INFO: newRaft 1 [peers: [], term: 45, commit: 171, applied: 0, lastindex: 171, lastterm: 45]
whereas in the kubernetes pod the peers array is [1,2,3,4,5] and the term does not increases :
INFO: newRaft 1 [peers: [1,2,3,4,5], term: 3, commit: 27881, applied: 27053, lastindex: 27881, lastterm: 3]