What I want to do
Fix a group with a corrupt peer.
What I did
I had a peer with the missing-file Badger corruption that I have brought up here before, so I had to call /removeNode on one peer of a group. This group has now had a couple of peers removed, and it is down without a leader but is still trying to send pre-votes to the removed peers:
I0705 16:33:45.813998 21 log.go:34] 1 is starting a new election at term 15
I0705 16:33:45.814029 21 log.go:34] 1 became pre-candidate at term 15
I0705 16:33:45.814033 21 log.go:34] 1 received MsgPreVoteResp from 1 at term 15
I0705 16:33:45.814046 21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to 2 at term 15
I0705 16:33:45.814055 21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to d at term 15
I0705 16:33:45.814060 21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to e at term 15
I0705 16:33:45.814693 21 log.go:34] 1 received MsgPreVoteResp from d at term 15
I0705 16:33:45.814715 21 log.go:34] 1 [quorum:3] has received 2 MsgPreVoteResp votes and 0 vote rejections
I0705 16:33:46.186119 21 log.go:34] 1 [logterm: 14, index: 11230462, vote: d] cast MsgPreVote for d [logterm: 14, index: 11230462] at term 15
W0705 16:33:46.815264 21 node.go:420] Unable to send message to peer: 0xe. Error: Do not have address of peer 0xe
W0705 16:33:46.815292 21 node.go:420] Unable to send message to peer: 0x2. Error: Do not have address of peer 0x2
Peer "e" was just removed, and peer "2" was removed weeks ago. A new peer was added, but it cannot find a leader, so it is sitting there doing nothing:
Error while calling hasPeer: Unable to reach leader in group 1. Retrying...
You can see in the /state output that the new alpha (15) has been added to group 1:
{
  "1": {
    "id": "1",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-2.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "1625492088",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  },
  "13": {
    "id": "13",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-0.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "1624292805",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  },
  "15": {
    "id": "15",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-1.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "0",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  }
}
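To line up the two outputs, here is a small sketch that compares the IDs in /state with the IDs in the Raft log. It assumes the Raft log prints node IDs in hexadecimal without a prefix, which the "0xe" and "0x2" warnings seem to suggest; under that assumption "d" in the log is alpha 13 in /state. The helper names are my own, not anything from Dgraph.

```python
# IDs as they appear in the /state output (decimal strings).
state_ids = ["1", "13", "15"]

# IDs as they appear in the Raft log lines (assumed hexadecimal, no prefix).
log_ids = ["1", "2", "d", "e"]


def to_hex(decimal_id: str) -> str:
    """Render a decimal /state ID the way the Raft log appears to print it."""
    return format(int(decimal_id), "x")


def to_decimal(hex_id: str) -> int:
    """Convert a Raft log ID (assumed hex) back to the decimal /state form."""
    return int(hex_id, 16)


state_as_hex = {to_hex(i) for i in state_ids}
log_as_hex = set(log_ids)

# Peers the Raft group still contacts that /state no longer lists.
in_log_only = sorted(log_as_hex - state_as_hex)
# Peers /state lists that the Raft group never contacts.
in_state_only = sorted(state_as_hex - log_as_hex)

print("contacted by Raft but not in /state:", in_log_only)
print("in /state but not contacted by Raft:", in_state_only)
```

If that reading is right, the removed peers 2 and e (14) are still in the Raft configuration, while the new alpha 15 (f) is known to Zero but is not yet a member of the group's Raft configuration.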
I assume that leader election is failing because the group still expects votes from 4 peers in total (the log shows quorum:3) and only 2 of them, peers 1 and d, are alive. I would have hoped that /removeNode would remove the nodes as members of this Raft group, but it has not. Is this a bug or somehow expected?
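For reference, this is the arithmetic behind my assumption, as a rough sketch. It uses the usual Raft majority rule (more than half of the voting members); the voter list is simply the set of IDs that show up in the pre-vote lines of the log, so it is my reading of the log rather than a dump of the actual Raft configuration.

```python
def quorum(voter_count: int) -> int:
    """Majority needed to win an election with this many voting members."""
    return voter_count // 2 + 1


# Voters inferred from the log: node 1 itself plus the peers it sends MsgPreVote to.
voters = ["1", "2", "d", "e"]
# Peers that actually answer (1 votes for itself, d responds).
responding = ["1", "d"]

needed = quorum(len(voters))
can_elect = len(responding) >= needed
print(f"voters={len(voters)} quorum={needed} responding={len(responding)} can_elect={can_elect}")

# What I expected after /removeNode: only the live peers remain as voters.
needed_after_removal = quorum(len(responding))
can_elect_after_removal = len(responding) >= needed_after_removal
print(f"after removal: quorum={needed_after_removal} can_elect={can_elect_after_removal}")
```

With 4 voters the quorum is 3, which matches the "[quorum:3]" line in the log, and 2 responses can never reach it. If the removed peers were really gone from the configuration, the quorum would drop to 2 and the two live peers could elect a leader.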
Is there anything I can do to fix this? My cluster is effectively down until we can fix this group.
Dgraph metadata
V21.03.1
shardReplicas=3
groups=4