Raft group cannot pass leader election

What I want to do

Fix a group that has a corrupt peer.

What I did

I had a peer with the missing-file Badger corruption I have brought up here before, so I had to call /removeNode on one peer of the group. This group has had a couple of peers removed at this point, and it is now down without a leader while still sending pre-votes to the removed peers:

I0705 16:33:45.813998      21 log.go:34] 1 is starting a new election at term 15
I0705 16:33:45.814029      21 log.go:34] 1 became pre-candidate at term 15
I0705 16:33:45.814033      21 log.go:34] 1 received MsgPreVoteResp from 1 at term 15
I0705 16:33:45.814046      21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to 2 at term 15
I0705 16:33:45.814055      21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to d at term 15
I0705 16:33:45.814060      21 log.go:34] 1 [logterm: 14, index: 11230462] sent MsgPreVote request to e at term 15
I0705 16:33:45.814693      21 log.go:34] 1 received MsgPreVoteResp from d at term 15
I0705 16:33:45.814715      21 log.go:34] 1 [quorum:3] has received 2 MsgPreVoteResp votes and 0 vote rejections
I0705 16:33:46.186119      21 log.go:34] 1 [logterm: 14, index: 11230462, vote: d] cast MsgPreVote for d [logterm: 14, index: 11230462] at term 15
W0705 16:33:46.815264      21 node.go:420] Unable to send message to peer: 0xe. Error: Do not have address of peer 0xe
W0705 16:33:46.815292      21 node.go:420] Unable to send message to peer: 0x2. Error: Do not have address of peer 0x2

Peer 'e' was just removed, and peer '2' was removed weeks ago. A new peer was added, but the new peer cannot find a leader, so it is sitting there doing nothing:

Error while calling hasPeer: Unable to reach leader in group 1. Retrying...

You can see in the /state output that the new alpha (15) has been added to group 1:

{
  "1": {
    "id": "1",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-2.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "1625492088",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  },
  "13": {
    "id": "13",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-0.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "1624292805",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  },
  "15": {
    "id": "15",
    "groupId": 1,
    "addr": "graphdb-b-dgraph-alpha-1.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080",
    "leader": false,
    "amDead": false,
    "lastUpdate": "0",
    "learner": false,
    "clusterInfoOnly": false,
    "forceGroupId": false
  }
}
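
For reference, this membership list comes from the /state HTTP endpoint. A minimal Go sketch of fetching it, assuming Zero's default HTTP port 6080 and an illustrative hostname (not the actual service name in my cluster):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Zero serves /state on its HTTP port (6080 by default); the hostname
	// below is illustrative only.
	resp, err := http.Get("http://graphdb-b-dgraph-zero-0:6080/state")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The response contains, per group, the members map shown above.
	fmt.Println(string(body))
}
```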

I assume that leader election is failing because it is expecting votes from 4 peers in total and only 2 are alive. I would have hoped that /removeNode would have removed those nodes as members of this Raft group, but it has not. Is this a bug, or is it somehow expected?
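
To make the arithmetic explicit, here is a minimal sketch of the quorum math, assuming the voter set implied by the pre-vote log above (1, 2, d, e), of which only 1 and d respond:

```go
package main

import "fmt"

func main() {
	// Voter set implied by the pre-vote log above: 1, 2, d (13), e (14).
	// Peers 2 and e were removed via /removeNode but still count as voters.
	voters := []uint64{0x1, 0x2, 0xd, 0xe}
	alive := map[uint64]bool{0x1: true, 0xd: true}

	quorum := len(voters)/2 + 1 // 4 voters -> quorum of 3, matching "[quorum:3]" in the log
	reachable := 0
	for _, id := range voters {
		if alive[id] {
			reachable++
		}
	}

	// Prints: quorum=3 reachable=2 electable=false
	// With only 2 of 4 voters responding, the pre-vote can never reach a majority.
	fmt.Printf("quorum=%d reachable=%d electable=%v\n", quorum, reachable, reachable >= quorum)
}
```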

Is there anything I can do to help here? My cluster is effectively down until we can fix this group.

Dgraph metadata

v21.03.1
shardReplicas=3
groups=4

I have found that the dgraph debug tool has some code to manage Dgraph's custom raftwal implementation. Here is the printout of dgraph debug -o run against the w/ directory on one of the nodes in that group:

I0706 05:39:13.768829      75 storage.go:125] Init Raft Storage with snap: 11113694, first: 11113695, last: 0
Raft Id = 1 Groupd Id = 1

Snapshot Metadata: {ConfState:{Nodes:[1 2 13 14] Learners:[] XXX_unrecognized:[]} Index:11113694 Term:14 XXX_unrecognized:[]}
Snapshot Alpha: {Context:id:1 group:1 addr:"graphdb-b-dgraph-alpha-2.graphdb-b-dgraph-alpha-headless.data-engine.svc.cluster.local:7080"  Index:11113694 ReadTs:13871182 Done:false SinceTs:0}

Hardstate: {Term:15 Vote:13 Commit:11229254 XXX_unrecognized:[]}
Checkpoint: 11229253
Last Index: 11113694 . Num Entries: 18446744073709551615 .

You can see that ConfState.Nodes still includes 2 and 14 (0xe), both of which are removed peers.

The only options I have for working with the WAL in this binary are TruncateUntil and SetSnapshot. I have to be honest and say I do not know how to safely use either of those in this situation, nor do I know how they would affect the nodes listed in the group.

Is it possible I need to advance the raft state manually with this tool?

PS: I can post the entire output of dgraph debug -w ./w/ if you want it; it is 116,782 lines.

Is there any wisdom anyone can provide on how to do proper surgery on the Raft storage?

I have a version of the debug tool that can edit the ConfState (via SaveSnapshot()), which includes the Raft members, and I have run it on a copy of the w/ directory. However, I do not know what else is needed to properly drive the Raft state machine (e.g. what to give --snap in the debug tool).
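
For concreteness, the ConfState edit I mean looks roughly like the sketch below. The walStore interface is a stand-in for whatever storage handle the patched debug tool opens (the real raftwal API in v21.03 may differ), and the raftpb import path and Nodes field are taken from the etcd version Dgraph vendors and the dump above:

```go
package main

import (
	"log"

	"go.etcd.io/etcd/raft/raftpb"
)

// walStore is an assumed stand-in for the raftwal storage handle; only the
// two methods used here are modelled.
type walStore interface {
	Snapshot() (raftpb.Snapshot, error)
	SaveSnapshot(raftpb.Snapshot) error
}

// dropRemovedPeers rewrites the snapshot's ConfState so it only lists the
// peers that should remain members (e.g. the live members shown by /state).
func dropRemovedPeers(store walStore, keep []uint64) error {
	snap, err := store.Snapshot() // ConfState currently shows Nodes:[1 2 13 14]
	if err != nil {
		return err
	}
	snap.Metadata.ConfState.Nodes = keep // drop the removed peers 2 and 14 (0xe)
	return store.SaveSnapshot(snap)
}

func main() {
	// Wiring up a real store is deliberately left out; this should only ever
	// be run against a copy of the w/ directory.
	_ = dropRemovedPeers
	log.Println("sketch only")
}
```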

All I want is to get the group up enough that I can run a full export of what is there already, and I will rebuild the cluster from that.

OK, well, after 40 hours of Dgraph outage, my team and I were able to apply enough patches to Dgraph to get it to come up and export.

If anyone from Dgraph ever looks at this:

  • we wrapped the raftwal storage interface with one that filters out Raft peers that were removed according to the membership. This allowed Raft elections to succeed (a rough sketch follows this list).
    • the real issue is that the custom raftwal implementation (or something around it) was not removing peers that had been removed via the /removeNode endpoint, even though the Dgraph side (as opposed to the etcd/raft side) knew these peers had been correctly removed.
    • fundamentally, the change we applied may be an acceptable safeguard if it proves impractical to figure out why the peers were not being removed from storage in the first place.
  • after this, the group was stalled tens of thousands of transactions (at least from its point of view) away from a usable readTS. This was very confusing: only best-effort queries succeeded, and only if you hit a member of that group directly. For some reason, it did not appear to be making progress on advancing that timestamp.
  • we then applied another patch that allowed an export to be taken without waiting for the readTS to be the latest according to the Zeros. This allowed a full cluster export to succeed, where before it would wait indefinitely to reach a current readTS.
    • we probably lost some changes in the WAL for that group, but after a couple of days of partial downtime, we had to choose a slightly destructive solution over none at all.
  • after all of the above, I was able to use the export to rebuild the 12-node cluster.
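
Below is a rough sketch of the storage wrapper from the first bullet, written against etcd's raft.Storage interface and the older ConfState.Nodes field visible in the dump above. The actual patch was adapted to Dgraph's internal raftwal types, so the names and import paths here are illustrative:

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// filteredStorage delegates to the underlying raftwal storage but strips any
// peer that the membership says was removed from every ConfState it returns.
type filteredStorage struct {
	raft.Storage                 // underlying storage; Entries, Term, etc. pass through
	removed      map[uint64]bool // peer IDs removed via /removeNode
}

func (s *filteredStorage) filter(cs raftpb.ConfState) raftpb.ConfState {
	keep := make([]uint64, 0, len(cs.Nodes))
	for _, id := range cs.Nodes {
		if !s.removed[id] {
			keep = append(keep, id)
		}
	}
	cs.Nodes = keep
	return cs
}

// InitialState and Snapshot are the two places raft learns the voter set,
// so both return the filtered ConfState.
func (s *filteredStorage) InitialState() (raftpb.HardState, raftpb.ConfState, error) {
	hs, cs, err := s.Storage.InitialState()
	return hs, s.filter(cs), err
}

func (s *filteredStorage) Snapshot() (raftpb.Snapshot, error) {
	snap, err := s.Storage.Snapshot()
	snap.Metadata.ConfState = s.filter(snap.Metadata.ConfState)
	return snap, err
}

func main() {
	// Demonstrate the idea with an in-memory storage holding the ConfState
	// from the dump above ([1 2 13 14]); 2 and 14 were removed via /removeNode.
	ms := raft.NewMemoryStorage()
	_ = ms.ApplySnapshot(raftpb.Snapshot{Metadata: raftpb.SnapshotMetadata{
		ConfState: raftpb.ConfState{Nodes: []uint64{1, 2, 13, 14}},
		Index:     11113694,
		Term:      14,
	}})

	fs := &filteredStorage{Storage: ms, removed: map[uint64]bool{2: true, 14: true}}
	_, cs, _ := fs.InitialState()
	fmt.Println(cs.Nodes) // [1 13] -> with 2 live voters out of 2, an election can succeed
}
```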

All in all, this was a massive pain; it is quite unfortunate that we had to read Dgraph code for two days to figure this out ourselves.