Raft group cannot pass leader election

What I want to do

Fix a group that has a corrupt peer.

What I did

I had a peer with the missing-file Badger corruption I have brought up here before, so I had to call /removeNode on one peer of the group. This group has had a couple of peers removed at this point, and it is now down without a leader while still sending pre-votes to the removed peers:

I0705 16:33:45.813998      21 log.go:34] 1 is starting a new election at term 15
I0705 16:33:45.814029      21 log.go:34] 1 became pre-candidate at term 15
I0705 16:33:45.814033      21 log.go:34] 1 received MsgPreVoteResp from 1 at term 15
I0705 16:33:45.814046      21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to 2 at term 15
I0705 16:33:45.814055      21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to d at term 15
I0705 16:33:45.814060      21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to e at term 15
I0705 16:33:45.814693      21 log.go:34] 1 received MsgPreVoteResp from d at term 15
I0705 16:33:45.814715      21 log.go:34] 1 [quorum:3] has received 2 MsgPreVoteResp votes and 0 vote rejections
I0705 16:33:46.186119      21 log.go:34] 1 [logterm: 14, index: 11230462, vote: d] cast MsgPreVote for d [logterm: 14, index: 11230462] at term 15
W0705 16:33:46.815264      21 node.go:420] Unable to send message to peer: 0xe. Error: Do not have address of peer 0xe
W0705 16:33:46.815292      21 node.go:420] Unable to send message to peer: 0x2. Error: Do not have address of peer 0x2

Peer 'e' was just removed, and peer '2' was removed weeks ago. A new peer was added, but the new peer cannot find a leader, so it is sitting there doing nothing:

Error while calling hasPeer: Unable to reach leader in group 1. Retrying...

You can see in the /state output that the new alpha (15) has been added to group 1:

{
  "1": {
    "id": "1",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-2.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "1625492088",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  },
  "13": {
    "id": "13",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-0.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "1624292805",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  },
  "15": {
    "id": "15",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-1.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "0",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  }
}
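
For reference, this membership list comes from the /state HTTP endpoint. A minimal Go sketch of fetching it, assuming Zero's default HTTP port 6080 and an illustrative hostname (not the actual service name in my cluster):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Zero serves /state on its HTTP port (6080 by default); the hostname
	// below is illustrative only.
	resp, err := http.Get("http://graphdb-b-dgraph-zero-0:6080/state")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response contains, per group, the members map shown above.
	fmt.Println(string(body))
}
```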

I assume that leader election is failing because it is expecting votes from 4 peers in total and only 2 are alive. I would have hoped that /removeNode would have removed those nodes as members of this Raft group, but it has not. Is this a bug, or is it somehow expected?
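
To make the arithmetic explicit, here is a minimal sketch of the quorum math, assuming the voter set implied by the pre-vote log above (1, 2, d, e), of which only 1 and d respond:

```go
package main

import "fmt"

func main() {
	// Voter set implied by the pre-vote log above: 1, 2, d (13), e (14).
	// Peers 2 and e were removed via /removeNode but still count as voters.
	voters := []uint64{0x1, 0x2, 0xd, 0xe}
	alive := map[uint64]bool{0x1: true, 0xd: true}

	quorum := len(voters)/2 + 1 // 4 voters -> quorum of 3, matching "[quorum:3]" in the log
	reachable := 0
	for _, id := range voters {
		if alive[id] {
			reachable++
		}
	}

	// Prints: quorum=3 reachable=2 electable=false
	// With only 2 of 4 voters responding, the pre-vote can never reach a majority.
	fmt.Printf("quorum=%d reachable=%d electable=%v\n", quorum, reachable, reachable >= quorum)
}
```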

Is there anything I can do to help here? My cluster is effectively down until we can fix this group.

Dgraph metadata

v21.03.1
shardReplicas=3
groups=4

I have found that the dgraph debug tool has some code to manage Dgraph's custom raftwal implementation. Here is the printout of dgraph debug -o run against the w/ directory on one of the nodes in that group:

I0706 05:39:13.768829      75 storage.go:125] Init Raft Storage with snap: 11113694, first: 11113695, last: 0
Raft Id = 1 Groupd Id = 1

Snapshot Metadata: {ConfState:{Nodes:[1 2 13 14] Learners:[] XXX_unrecognized:[]} Index:11113694 Term:14 XXX_unrecognized:[]}
Snapshot Alpha: {Context:id:1 group:1 addr:"graphdb-b-dgraph-alpha-2.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080"  Index:11113694 ReadTs:13871182 Done:false SinceTs:0}

Hardstate: {Term:15 Vote:13 Commit:11229254 XXX_unrecognized:[]}
Checkpoint: 11229253
Last Index: 11113694 . Num Entries: 18446744073709551615 .

You can see that ConfState.Nodes still includes 2 and 14 (0xe), both of which are removed peers.

The only options I have for working with the WAL in this binary are TruncateUntil and SetSnapshot. I have to be honest and say I do not know how to safely use either of those in this situation, nor do I know how they would affect the nodes listed in the group.

Is it possible I need to advance the raft state manually with this tool?

PS: I can post the entire output of dgraph debug -w ./w/ if you want it; it is 116,782 lines.

Is there any wisdom anyone can provide on how to do proper surgery on the Raft storage?

I have a version of the debug tool that can edit the ConfState (via SaveSnapshot()), which includes the Raft members, and I have run it on a copy of the w/ directory. However, I do not know what else is needed to properly drive the Raft state machine (e.g. what to give --snap in the debug tool).
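
For concreteness, the ConfState edit I mean looks roughly like the sketch below. The walStore interface is a stand-in for whatever storage handle the patched debug tool opens (the real raftwal API in v21.03 may differ), and the raftpb import path and Nodes field are taken from the etcd version Dgraph vendors and the dump above:

```go
package main

import (
	"log"

	"go.etcd.io/etcd/raft/raftpb"
)

// walStore is an assumed stand-in for the raftwal storage handle; only the
// two methods used here are modelled.
type walStore interface {
	Snapshot() (raftpb.Snapshot, error)
	SaveSnapshot(raftpb.Snapshot) error
}

// dropRemovedPeers rewrites the snapshot's ConfState so it only lists the
// peers that should remain members (e.g. the live members shown by /state).
func dropRemovedPeers(store walStore, keep []uint64) error {
	snap, err := store.Snapshot() // ConfState currently shows Nodes:[1 2 13 14]
	if err != nil {
		return err
	}
	snap.Metadata.ConfState.Nodes = keep // drop the removed peers 2 and 14 (0xe)
	return store.SaveSnapshot(snap)
}

func main() {
	// Wiring up a real store is deliberately left out; this should only ever
	// be run against a copy of the w/ directory.
	_ = dropRemovedPeers
	log.Println("sketch only")
}
```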

All I want is to get the group up enough that I can run a full export of what is there already, and I will rebuild the cluster from that.

OK, well, after 40 hours of Dgraph outage, my team and I were able to apply enough patches to Dgraph to get it to come up and export.

If anyone from Dgraph ever looks at this:

  • we wrapped the raftwal storage interface with one that filters out Raft peers that were removed according to the membership. This allowed Raft elections to succeed (a rough sketch follows this list).
    • the real issue is that the custom raftwal implementation (or something around it) was not removing peers that had been removed via the /removeNode endpoint, even though the Dgraph side (as opposed to the etcd/raft side) knew these peers had been correctly removed.
    • fundamentally, the change we applied may be an acceptable safeguard if it proves impractical to figure out why the peers were not being removed from storage in the first place.
  • after this, the group was stalled tens of thousands of transactions (at least from its point of view) away from a usable readTS. This was very confusing: only best-effort queries succeeded, and only if you hit a member of that group directly. For some reason, it did not appear to be making progress on advancing that timestamp.
  • we then applied another patch that allowed an export to be taken without waiting for the readTS to be the latest according to the Zeros. This allowed a full cluster export to succeed, where before it would wait indefinitely to reach a current readTS.
    • we probably lost some changes in the WAL for that group, but after a couple of days of partial downtime, we had to choose a slightly destructive solution over none at all.
  • after all of the above, I was able to use the export to rebuild the 12-node cluster.
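
Below is a rough sketch of the storage wrapper from the first bullet, written against etcd's raft.Storage interface and the older ConfState.Nodes field visible in the dump above. The actual patch was adapted to Dgraph's internal raftwal types, so the names and import paths here are illustrative:

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// filteredStorage delegates to the underlying raftwal storage but strips any
// peer that the membership says was removed from every ConfState it returns.
type filteredStorage struct {
	raft.Storage                 // underlying storage; Entries, Term, etc. pass through
	removed      map[uint64]bool // peer IDs removed via /removeNode
}

func (s *filteredStorage) filter(cs raftpb.ConfState) raftpb.ConfState {
	keep := make([]uint64, 0, len(cs.Nodes))
	for _, id := range cs.Nodes {
		if !s.removed[id] {
			keep = append(keep, id)
		}
	}
	cs.Nodes = keep
	return cs
}

// InitialState and Snapshot are the two places raft learns the voter set,
// so both return the filtered ConfState.
func (s *filteredStorage) InitialState() (raftpb.HardState, raftpb.ConfState, error) {
	hs, cs, err := s.Storage.InitialState()
	return hs, s.filter(cs), err
}

func (s *filteredStorage) Snapshot() (raftpb.Snapshot, error) {
	snap, err := s.Storage.Snapshot()
	snap.Metadata.ConfState = s.filter(snap.Metadata.ConfState)
	return snap, err
}

func main() {
	// Demonstrate the idea with an in-memory storage holding the ConfState
	// from the dump above ([1 2 13 14]); 2 and 14 were removed via /removeNode.
	ms := raft.NewMemoryStorage()
	_ = ms.ApplySnapshot(raftpb.Snapshot{Metadata: raftpb.SnapshotMetadata{
		ConfState: raftpb.ConfState{Nodes: []uint64{1, 2, 13, 14}},
		Index:     11113694,
		Term:      14,
	}})

	fs := &filteredStorage{Storage: ms, removed: map[uint64]bool{2: true, 14: true}}
	_, cs, _ := fs.InitialState()
	fmt.Println(cs.Nodes) // [1 13] -> with 2 live voters out of 2, an election can succeed
}
```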

All in all, this was a massive pain; it is quite unfortunate that we had to read Dgraph code for two days to figure this out ourselves.