Master crash in a 3-node cluster

Hi @xiang90,

With persistence enabled, a follower that crashes and comes back in a 3-node cluster rejoins without problems. However, when the master crashes, a re-election is triggered and the group size appears to drop to 2. When the old master comes back, it starts another election, but the cluster never votes for it and the resurrected master never rejoins the cluster. It just keeps requesting votes indefinitely and gets rejected by the 2-node cluster every time.

Any ideas on how to recover from a master crash in a 3-node cluster, so that the master can rejoin on restart? All the relevant code is here:
https://github.com/dgraph-io/dgraph/blob/feature/draft9/worker/draft.go

Got 1 messages
raft2016/10/12 18:44:02 INFO: 1 received vote rejection from 3 at term 15
raft2016/10/12 18:44:02 INFO: 1 [quorum:2] has received 1 votes and 2 vote rejections
raft2016/10/12 18:44:02 INFO: 1 became follower at term 15
[0]              READY START
[0]              READY DONE
[4294967295]              TICK 2016-10-12 18:44:03.320508586 +1100 AEDT
[0]              TICK 2016-10-12 18:44:03.434959873 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:04.331848964 +1100 AEDT
[0]              TICK 2016-10-12 18:44:04.450604173 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:05.343566176 +1100 AEDT
[0]              TICK 2016-10-12 18:44:05.463059285 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:06.35575666 +1100 AEDT
[0]              TICK 2016-10-12 18:44:06.474499665 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:07.367596142 +1100 AEDT
[0]              TICK 2016-10-12 18:44:07.487229936 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:08.380241432 +1100 AEDT
[0]              TICK 2016-10-12 18:44:08.499838177 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:09.392065851 +1100 AEDT
[0]              TICK 2016-10-12 18:44:09.51236689 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:10.404390105 +1100 AEDT
[0]              TICK 2016-10-12 18:44:10.524522537 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:11.417668781 +1100 AEDT
[0]              TICK 2016-10-12 18:44:11.536925334 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:12.431381896 +1100 AEDT
[0]              TICK 2016-10-12 18:44:12.549092943 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:13.444706711 +1100 AEDT
[0]              TICK 2016-10-12 18:44:13.562947122 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:14.457022009 +1100 AEDT
[0]              TICK 2016-10-12 18:44:14.575437056 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:15.470047351 +1100 AEDT
[0]              TICK 2016-10-12 18:44:15.588736343 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:16.48604611 +1100 AEDT
[0]              TICK 2016-10-12 18:44:16.601996649 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:17.502481185 +1100 AEDT
[0]              TICK 2016-10-12 18:44:17.616928505 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:18.515287966 +1100 AEDT
[0]              TICK 2016-10-12 18:44:18.631820613 +1100 AEDT
[4294967295]              TICK 2016-10-12 18:44:19.528850807 +1100 AEDT
raft2016/10/12 18:44:19 INFO: 1 is starting a new election at term 15
raft2016/10/12 18:44:19 INFO: 1 became candidate at term 16
raft2016/10/12 18:44:19 INFO: 1 received vote from 1 at term 16
raft2016/10/12 18:44:19 INFO: 1 [logterm: 2, index: 7] sent vote request to 3 at term 16
raft2016/10/12 18:44:19 INFO: 1 [logterm: 2, index: 7] sent vote request to 2 at term 16
[4294967295]              READY START
[4294967295]              READY DONE
Got message: {Type:MsgVoteResp To:1 From:3 Term:16 LogTerm:0 Index:0 Entries:[] Commit:0 Snapshot:{Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} Reject:true RejectHint:0 Context:[20 0 0 0 0 0 0 0 0 0 10 0 24 0 12 0 8 0 4 0 10 0 0 0 20 0 0 0 255 255 255 255 3 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 108 111 99 97 108 104 111 115 116 58 49 50 51 52 55 0] XXX_unrecognized:[]}
Got 1 messages
raft2016/10/12 18:44:19 INFO: 1 received vote rejection from 3 at term 16
raft2016/10/12 18:44:19 INFO: 1 [quorum:2] has received 1 votes and 1 vote rejections
Got message: {Type:MsgVoteResp To:1 From:2 Term:16 LogTerm:0 Index:0 Entries:[] Commit:0 Snapshot:{Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} Reject:true RejectHint:0 Context:[20 0 0 0 0 0 0 0 0 0 10 0 24 0 12 0 8 0 4 0 10 0 0 0 20 0 0 0 255 255 255 255 2 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 108 111 99 97 108 104 111 115 116 58 49 50 51 52 54 0] XXX_unrecognized:[]}
Got 1 messages
raft2016/10/12 18:44:19 INFO: 1 received vote rejection from 2 at term 16
raft2016/10/12 18:44:19 INFO: 1 [quorum:2] has received 1 votes and 2 vote rejections
raft2016/10/12 18:44:19 INFO: 1 became follower at term 16
[4294967295]              READY START
[4294967295]              READY DONE
[0]              TICK 2016-10-12 18:44:19.645680985 +1100 AEDT
raft2016/10/12 18:44:19 INFO: 1 is starting a new election at term 15
raft2016/10/12 18:44:19 INFO: 1 became candidate at term 16
raft2016/10/12 18:44:19 INFO: 1 received vote from 1 at term 16
raft2016/10/12 18:44:19 INFO: 1 [logterm: 2, index: 4] sent vote request to 2 at term 16
raft2016/10/12 18:44:19 INFO: 1 [logterm: 2, index: 4] sent vote request to 3 at term 16
[0]              READY START
[0]              READY DONE
Got message: {Type:MsgVoteResp To:1 From:3 Term:16 LogTerm:0 Index:0 Entries:[] Commit:0 Snapshot:{Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} Reject:true RejectHint:0 Context:[16 0 0 0 0 0 10 0 20 0 8 0 0 0 4 0 10 0 0 0 16 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 108 111 99 97 108 104 111 115 116 58 49 50 51 52 55 0] XXX_unrecognized:[]}
Got 1 messages
raft2016/10/12 18:44:19 INFO: 1 received vote rejection from 3 at term 16
raft2016/10/12 18:44:19 INFO: 1 [quorum:2] has received 1 votes and 1 vote rejections
Got message: {Type:MsgVoteResp To:1 From:2 Term:16 LogTerm:0 Index:0 Entries:[] Commit:0 Snapshot:{Data:[] Metadata:{ConfState:{Nodes:[] XXX_unrecognized:[]} Index:0 Term:0 XXX_unrecognized:[]} XXX_unrecognized:[]} Reject:true RejectHint:0 Context:[16 0 0 0 0 0 10 0 20 0 8 0 0 0 4 0 10 0 0 0 16 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 108 111 99 97 108 104 111 115 116 58 49 50 51 52 54 0] XXX_unrecognized:[]}
Got 1 messages
raft2016/10/12 18:44:19 INFO: 1 received vote rejection from 2 at term 16
raft2016/10/12 18:44:19 INFO: 1 [quorum:2] has received 1 votes and 2 vote rejections
raft2016/10/12 18:44:19 INFO: 1 became follower at term 16
[0]              READY START
[0]              READY DONE
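
For anyone reading the log above: the `[quorum:2]` figure and the vote rejections follow two standard Raft rules, which can be sketched roughly as below. This is only an illustration of the rules as described in the Raft paper, not the etcd/raft implementation and not a diagnosis of this issue. The function names (`quorum`, `logUpToDate`, `grantVote`) and the voter's log position used in `main` are hypothetical; only the candidate's `[logterm: 2, index: 7]` comes from the log.

```go
package main

import "fmt"

// quorum returns the number of votes needed to win an election in a
// cluster of n voting members (a strict majority).
func quorum(n int) int {
	return n/2 + 1
}

// logUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's: a higher last term wins, and on equal
// terms the longer (or equal) log wins.
func logUpToDate(candTerm, candIndex, voterTerm, voterIndex uint64) bool {
	if candTerm != voterTerm {
		return candTerm > voterTerm
	}
	return candIndex >= voterIndex
}

// grantVote applies both conditions a voter checks within a term:
// it has not already voted for a different node (0 means "no vote yet"),
// and the candidate's log is at least as up-to-date as its own.
func grantVote(votedFor, candidate, candTerm, candIndex, voterTerm, voterIndex uint64) bool {
	canVote := votedFor == 0 || votedFor == candidate
	return canVote && logUpToDate(candTerm, candIndex, voterTerm, voterIndex)
}

func main() {
	// A 3-node cluster needs 2 votes, matching "[quorum:2]" in the log.
	fmt.Println("quorum for 3 nodes:", quorum(3))

	// Candidate 1 advertises [logterm: 2, index: 7] (from the log).
	// The voter's last entry (term 3, index 9) is a made-up value
	// chosen only to show a rejection.
	fmt.Println("vote granted:", grantVote(0, 1, 2, 7, 3, 9))
}
```

If a restarted node's persisted log is behind the logs of the two surviving nodes, this rule alone is enough for every vote request to be rejected, however often it retries. Whether that is what happened here cannot be determined from the log alone.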

Ping, @xiang90! Could you please reply to this?

Resolved this via email.

