I've tried a master build, commit dgraph-io/dgraph@74bbd98 ("Fix nil pointer exception on restart after removing a peer"), which somewhat fixes the problem: both types of nodes are removed correctly, and blank nodes are able to join the cluster.
However, the cluster has issues with a hard reboot of a node (without removing it from the cluster or removing any data, just virsh destroy). After the crashed server starts back up, it tries to vote, and that takes way too long. Crashed server:
[root@swarm-manager-1 _data]# docker service logs -f dgraph_server-3
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 gRPC server started. Listening on port 9080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 HTTP server started. Listening on port 8080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 groups.go:86: Current Raft Id: 2
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 worker.go:99: Worker listening at address: [::]:7080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting zero-1:5080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 groups.go:109: Connected to group zero. Connection state: member:<id:2 addr:"server-3:7080" > state:<counter:168 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"server-2:7080" leader:true last_update:1517218389 > > members:<key:2 value:<id:2 group_id:1 addr:"server-3:7080" > > members:<key:3 value:<id:3 group_id:1 addr:"server-1:7080" > > tablets:<key:"_predicate_" value:<group_id:1 predicate:"_predicate_" space:6737282 > > tablets:<key:"actor.film" value:<group_id:1 predicate:"actor.film" space:154599 > > tablets:<key:"director.film" value:<group_id:1 predicate:"director.film" space:21934 > > tablets:<key:"genre" value:<group_id:1 predicate:"genre" space:32397 > > tablets:<key:"initial_release_date" value:<group_id:1 predicate:"initial_release_date" space:29272 > > tablets:<key:"name" value:<group_id:1 predicate:"name" space:8016290 > > tablets:<key:"performance.actor" value:<group_id:1 predicate:"performance.actor" space:189068 > > tablets:<key:"performance.character" value:<group_id:1 predicate:"performance.character" space:206256 > > tablets:<key:"performance.film" value:<group_id:1 predicate:"performance.film" space:184814 > > tablets:<key:"starring" value:<group_id:1 predicate:"starring" space:80342 > > > > zeros:<key:1 value:<id:1 addr:"zero-1:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"zero-2:5080" > > zeros:<key:3 value:<id:3 addr:"zero-3:5080" > > maxLeaseId:1040000 maxTxnTs:10000 maxRaftId:3 >
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting server-2:7080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting server-1:7080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting zero-2:5080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 pool.go:118: == CONNECT ==> Setting zero-3:5080
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 draft.go:139: Node ID: 2 with GroupID: 1
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 node.go:231: Found Snapshot, Metadata: {ConfState:{Nodes:[1 2 3] XXX_unrecognized:[]} Index:27 Term:5 XXX_unrecognized:[]}
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 node.go:246: Found hardstate: {Term:232 Vote:2 Commit:27 XXX_unrecognized:[]}
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 node.go:258: Group 1 found 0 entries
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 draft.go:657: Restarting node for group: 1
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 raft.go:567: INFO: 2 became follower at term 232
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:14 raft.go:315: INFO: newRaft 2 [peers: [1,2,3], term: 232, commit: 27, applied: 27, lastindex: 27, lastterm: 5]
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:17 raft.go:749: INFO: 2 is starting a new election at term 232
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:17 raft.go:580: INFO: 2 became candidate at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:17 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:17 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 3 at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:17 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 1 at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:19 raft.go:749: INFO: 2 is starting a new election at term 233
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:19 raft.go:580: INFO: 2 became candidate at term 234
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:19 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 234
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:19 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 3 at term 234
dgraph_server-3.1.y94hr1awxlwh@swarm-manager-3 | 2018/01/29 09:48:19 raft.go:651: INFO: 2 [logterm: 5, index: 27] sent MsgVote request to 1 at term 234
...
...
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:33 groups.go:669: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:34 groups.go:669: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:34 groups.go:453: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:35 groups.go:669: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:35 groups.go:453: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:36 raft.go:749: INFO: 2 is starting a new election at term 596
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:36 raft.go:580: INFO: 2 became candidate at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:36 raft.go:664: INFO: 2 received MsgVoteResp from 2 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:36 raft.go:651: INFO: 2 [logterm: 298, index: 50] sent MsgVote request to 1 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:36 raft.go:651: INFO: 2 [logterm: 298, index: 50] sent MsgVote request to 3 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:36 groups.go:453: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:39 raft.go:772: INFO: 2 [logterm: 298, index: 50, vote: 2] rejected MsgVote from 3 [logterm: 298, index: 50] at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:39 raft.go:567: INFO: 2 became follower at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:39 node.go:301: INFO: raft.node: 2 elected leader 3 at term 597
dgraph_server-3.1.ubltlopg8dzo@swarm-manager-3 | 2018/01/29 10:08:45 raft.go:731: INFO: 2 [term: 597] ignored a MsgVote message with lower term from 1 [term: 299]
The server that stayed alive:
dgraph_server-1.1.maeljubio2qv@swarm-manager-1 | 2018/01/29 10:08:24 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 9)
dgraph_server-1.1.maeljubio2qv@swarm-manager-1 | 2018/01/29 10:08:27 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 30)
dgraph_server-1.1.maeljubio2qv@swarm-manager-1 | 2018/01/29 10:08:30 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 11)
dgraph_server-1.1.maeljubio2qv@swarm-manager-1 | 2018/01/29 10:08:32 raft.go:692: INFO: 3 [logterm: 298, index: 50, vote: 3] ignored MsgVote from 2 [logterm: 298, index: 50] at term 298: lease is not expired (remaining ticks: 99)
The dgraph_server_health_status metric on the /debug/vars endpoint reports 1 during the rejected votes, and the same after the election. Is there a proper way to keep track of a server's state in terms of cluster membership (and data synchronization, if possible)?
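For now I am just polling the expvar endpoint myself. A minimal sketch of that in Go (the server-3:8080 address is from my setup; I am only reading the one metric the endpoint already exposes):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Poll Dgraph's expvar endpoint and print the health metric.
// "server-3:8080" is the HTTP address from my setup; adjust as needed.
func main() {
	for {
		resp, err := http.Get("http://server-3:8080/debug/vars")
		if err != nil {
			fmt.Println("unreachable:", err)
		} else {
			var vars map[string]interface{}
			if err := json.NewDecoder(resp.Body).Decode(&vars); err == nil {
				fmt.Println("dgraph_server_health_status =", vars["dgraph_server_health_status"])
			}
			resp.Body.Close()
		}
		time.Sleep(2 * time.Second)
	}
}
```

But as the logs above show, this metric stays at 1 even while the node cannot win a vote, which is why I am asking whether there is something better.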
The number of remaining ticks in the alive server's log looks almost random - is that how it is supposed to be? Is there a way to decrease the time before a crashed node can rejoin (decrease the lease)? Both times I hard-rebooted the node, it took about 15 minutes for it to get back into the cluster.
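From reading the etcd/raft sources, my understanding is that these rejections come from the CheckQuorum leader lease: a follower ignores MsgVote until roughly ElectionTick ticks have passed since it last heard from a leader, and the logged "remaining ticks" is the time left on that lease. A sketch of the knobs involved (the values here are hypothetical, not Dgraph's actual settings):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft"
)

func main() {
	// Hypothetical etcd/raft configuration illustrating the leader lease.
	// With CheckQuorum set, a follower rejects MsgVote while it has heard
	// from the leader within the last ElectionTick ticks, which produces
	// the "lease is not expired" lines above.
	cfg := &raft.Config{
		ID:              2,
		ElectionTick:    100, // hypothetical lease length, in ticks
		HeartbeatTick:   1,
		CheckQuorum:     true, // enables the lease-based vote rejection
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	fmt.Printf("leader lease ~ %d ticks\n", cfg.ElectionTick)
}
```

If that reading is right, a smaller ElectionTick (or a shorter tick interval) would shrink the window in which a rejoining node's votes are ignored, though even then it would not obviously account for the full 15 minutes.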