Remove a bad node in K8s after a storage error (Alpha node) and recover the Alpha

After a storage error, we see the following in the log file of an Alpha node in the K8s cluster:


19 log.go:34] 6 [term: 0] received a MsgHeartbeat message with higher term from 4 [term: 5]
] 6 became follower at term 5

tocommit(25580) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(25580) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

The non-working node is in group 2 and has ID 6.

What is the correct way to remove the node from the cluster and restore it?
A GET request for removeNode?id=6&group=2 to dgraph-alpha-public:8080
returns a 404 Not Found error,

but a GET request for state to dgraph-alpha-public:8080 returns a correct 200 response.

State of the cluster:

"counter": "10943",
"groups": {
    "1": {
        "members": {
            "1": {
                "id": "1",
                "groupId": 1,
  1. Maybe the removeNode request with its parameters needs to be sent to another port of the Alpha node?

dgraph-alpha-public currently listens on ports 8080 and 8090.

  2. Then, let’s assume the remove operation has completed. What steps do we need to follow to recover node id=6 in group=2 after it lost all its data, while the other 2 nodes in the group continue to work normally? How do we start the replication process? For example, by adding a new, clean Alpha node with id=6 to group 2 and starting replication from the 2 other Alpha nodes in the same group?

The current documentation does not clearly describe this procedure (recovery after failure of Alpha nodes or Zero nodes).
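
To check whether a node is still listed in a group, the state response can be inspected programmatically. The following is a minimal sketch that relies only on the fields visible in the fragment above ("groups", "members", and the member IDs as keys); the sample data is made up for illustration and is not the real cluster state.

```python
def group_members(state: dict) -> dict:
    """Map each group ID to the sorted list of member IDs in a state response."""
    result = {}
    for group_id, group in state.get("groups", {}).items():
        members = group.get("members", {})
        # Member IDs are JSON object keys, so they arrive as strings.
        result[group_id] = sorted(members, key=int)
    return result


def is_member(state: dict, group_id: int, node_id: int) -> bool:
    """Return True if node_id is listed as a member of group_id."""
    return str(node_id) in group_members(state).get(str(group_id), [])


if __name__ == "__main__":
    # Hypothetical sample shaped like the fragment quoted above.
    sample = {
        "groups": {
            "2": {
                "members": {
                    "4": {"id": "4", "groupId": 2},
                    "6": {"id": "6", "groupId": 2},
                }
            }
        }
    }
    print(group_members(sample))
    print(is_member(sample, group_id=2, node_id=6))
```

After a successful removal, `is_member(state, 2, 6)` would be expected to return False for a freshly fetched state.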

Dgraph metadata

dgraph version

Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae

According to the topic “Replacing zero and server nodes” (#4 by nbnh),

the GET request should go to the Zero host on its default HTTP port 6080, and there should be no “/” at the end.

Only one question still bothers me:
Will the data stored in the group be safe after I delete one of the 3 members and add a new one to the group in its place? Or does this mean losing the data that was stored on the broken Alpha node?

Yes it’s the zero you have to call /removeNode on.
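
A minimal sketch of that request in Python, using the id and group values from the question and Zero's default HTTP port 6080. The host name `dgraph-zero-public` is an assumption (substitute the name of your own Zero service); note that the path has no trailing slash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def remove_node_url(zero_host: str, node_id: int, group: int, port: int = 6080) -> str:
    """Build the /removeNode URL for Zero's HTTP port (no trailing slash)."""
    query = urlencode({"id": node_id, "group": group})
    return f"http://{zero_host}:{port}/removeNode?{query}"


def remove_node(zero_host: str, node_id: int, group: int, port: int = 6080) -> str:
    """Send the GET request to Zero and return the response body as text."""
    url = remove_node_url(zero_host, node_id, group, port)
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "dgraph-zero-public" is an assumed service name, not taken from this thread.
    print(remove_node_url("dgraph-zero-public", node_id=6, group=2))
```

Only `remove_node_url` is exercised here; `remove_node` performs the actual network call and should be run only once you are sure the node is the one you want to drop.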

The process you are following is to fully remove a node from the cluster, so yes, all data needs to be deleted from that node…

but you probably have more than one node in each group if you are running a real system - the point of this is every member of the group has the exact same data.

  1. take a full export
  2. remove bad node with /removeNode
  3. stop that node’s process
  4. delete all the state on that node (p,w,t directories)
  5. bring up process with no state
  6. zero sees it as a new node, and sees a group that is down one node - and assigns it to the group where the node was removed
  7. leader of group copies all data to new node.
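
Step 4 can be scripted. The sketch below deletes the p, w and t directories under a data directory; the directory path is a placeholder you must supply, and the function should only be run after the node has been removed and its process stopped (steps 2 and 3).

```python
import shutil
from pathlib import Path

# State directories named in step 4.
STATE_DIRS = ("p", "w", "t")


def wipe_alpha_state(data_dir: str) -> list:
    """Delete the p, w and t directories under data_dir.

    Returns the names of the directories that were actually removed.
    Anything else in data_dir is left untouched.
    """
    removed = []
    for name in STATE_DIRS:
        target = Path(data_dir) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

In Kubernetes the same effect is often achieved by deleting the persistent volume claim of the broken pod, but that depends on how your cluster is deployed.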

I have fixed some nodes as described above,

but in Alpha group 3 (3 nodes: 1 healthy - node6, 2 dead - node7 & 8) I saw this message after deleting the data on the dead nodes:
A tick missed to fire. Node blocks too long!

How can I fix it? How can I debug this or understand why this error happens?

1 healthy, 2 dead

You may want to read up on raft. Basically, if you have a group size of 3, you have a failure tolerance of 1 node. If you have a group size of 5, you have a failure tolerance of 2 nodes.

I won’t be able to give you advice on fixing that since the group is not in a working state - it is beyond its failure tolerance.

Thank you for the answer.
Yes, there are many implementations of the Raft protocol, and I don’t know exactly which one the Dgraph team uses.
Is the formula (N/2)-1 correct for fault tolerance in Dgraph’s Raft implementation?

How can I change the default group size? It is now 3 members. For example, if I create 6 nodes, there will be 2 groups (3 nodes + 3 nodes). I want to change the group size from 3 to 6.

If you have 2 groups each with 3 members, and you want each group to have 5 members, then change the --replicas flag on the zeros to 5 and restart them, then add 4 more alpha nodes. They will be automatically assigned to the groups to make them 5 each.

5 members in a group will give a failure tolerance of 2 - having 6 members of a group will still only allow 2 failures so no reason to make it 6 really.
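
The numbers in this thread (3 members tolerate 1 failure, 5 tolerate 2, 6 still tolerate 2) match the standard Raft majority rule: a group needs floor(N/2)+1 members alive, so it tolerates floor((N-1)/2) failures. The formula (N/2)-1 asked about earlier does not reproduce those numbers. A small sketch, assuming Dgraph groups follow this standard rule; the helper names are illustrative only.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-member Raft group."""
    return n // 2 + 1


def failure_tolerance(n: int) -> int:
    """Number of members that can fail while a majority survives."""
    return (n - 1) // 2


def alphas_to_add(groups: int, old_replicas: int, new_replicas: int) -> int:
    """Extra Alpha nodes needed to grow every group to new_replicas members."""
    return groups * (new_replicas - old_replicas)


if __name__ == "__main__":
    for n in (3, 5, 6):
        print(n, quorum(n), failure_tolerance(n))
    # prints:
    # 3 2 1
    # 5 3 2
    # 6 4 2
    print(alphas_to_add(groups=2, old_replicas=3, new_replicas=5))  # prints 4
```

This also shows why the group with 1 healthy and 2 dead members cannot make progress: 1 live member is below the quorum of 2.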