Remove bad node in k8s after error with storage (Alpha node) and Recover Alpha

1Const1 · August 21, 2021, 11:14am

After a storage error, we see the following in the log file of an Alpha node in our K8s cluster:


19 log.go:34] 6 [term: 0] received a MsgHeartbeat message with higher term from 4 [term: 5]
] 6 became follower at term 5

tocommit(25580) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(25580) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

The non-working node is in group 2 and has ID 6.

What is the correct way to remove the node from the cluster and restore it?
A GET request for removeNode?id=6&group=2 sent to dgraph-alpha-public:8080
returns a 404 Not Found error,

but a GET request for state sent to dgraph-alpha-public:8080 returns a correct 200 response.

State of cluster

{
  "counter": "10943",
  "groups": {
    "1": {
      "members": {
        "1": {
          "id": "1",
          "groupId": 1,
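To confirm which group a node belongs to before removing it, the state response can be walked programmatically. The following is a minimal sketch that assumes only the fields visible in the excerpt above ("groups", "members", and member IDs as keys); the function names and the sample data are hypothetical and not part of Dgraph.

```python
from typing import Dict, List, Optional


def members_by_group(state: dict) -> Dict[str, List[str]]:
    """Map each group ID to the sorted list of member IDs found in a state document."""
    result = {}
    for group_id, group in state.get("groups", {}).items():
        members = group.get("members", {})
        # Member IDs are JSON object keys (strings); sort them numerically.
        result[group_id] = sorted(members.keys(), key=int)
    return result


def group_of(state: dict, node_id: str) -> Optional[str]:
    """Return the ID of the group containing node_id, or None if it is not listed."""
    for group_id, member_ids in members_by_group(state).items():
        if node_id in member_ids:
            return group_id
    return None


# Hypothetical sample shaped like the excerpt above, not real cluster output.
sample_state = {
    "counter": "10943",
    "groups": {
        "1": {"members": {"1": {"id": "1", "groupId": 1}}},
        "2": {"members": {"6": {"id": "6", "groupId": 2}, "4": {"id": "4", "groupId": 2}}},
    },
}
```

With the sample above, group_of(sample_state, "6") identifies the group to pass along with the node ID when removing the node.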

Maybe the removeNode request with its parameters needs to be sent to a different port of the Alpha node?

dgraph-alpha-public currently listens on ports 8080 and 8090.

Then, let's assume the remove operation has completed. What steps do we need to follow to recover node id=6 in group=2 after it lost all its data, while the other 2 nodes in the group continue to work normally? How do we start the replication process? For example, can we add a new, clean Alpha node with id=6 to group=2 and start replication from the 2 other Alpha nodes in the same group?

The current documentation does not describe this procedure clearly (recovery after a failure of Alpha nodes or Zero nodes).

Dgraph metadata

dgraph version

Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae

1Const1 · August 21, 2021, 1:54pm

According to the topic "Replacing zero and server nodes" (#4 by nbnh),

the GET request should be sent to the Zero host on its default HTTP port 6080, and no "/" is needed at the end of the URL.

Only one question still worries me:
Will the data stored in the group be safe after I delete one of its 3 members and add a new one to the group in its place? Or does this mean losing the data that was stored on the broken Alpha node?

iluminae · August 21, 2021, 7:54pm

Yes it’s the zero you have to call /removeNode on.

The process you are following is to fully remove a node from the cluster, so yes, all data needs to be deleted from that node…

…but you probably have more than one node in each group if you are running a real system - the point of this is every member of the group has the exact same data.

1. Take a full export.
2. Remove the bad node with /removeNode.
3. Stop that node's process.
4. Delete all the state on that node (p, w, t directories).
5. Bring up the process with no state.
6. Zero sees it as a new node, and sees a group that is down one node - and assigns it to the group where the node was removed.
7. The leader of the group copies all data to the new node.
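The /removeNode call in the steps above can be sketched as follows. This is only an illustration, assuming Zero's HTTP port is the default 6080 mentioned earlier in the thread; the host name dgraph-zero-public and the function names are hypothetical and should be replaced with whatever your cluster uses.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def remove_node_url(zero_host: str, node_id: int, group_id: int, port: int = 6080) -> str:
    """Build the removeNode URL for a Zero host (no trailing slash)."""
    query = urlencode({"id": node_id, "group": group_id})
    return f"http://{zero_host}:{port}/removeNode?{query}"


def remove_node(zero_host: str, node_id: int, group_id: int,
                port: int = 6080, timeout: float = 10.0) -> str:
    """Send the GET request to Zero and return the response body as text."""
    url = remove_node_url(zero_host, node_id, group_id, port)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (hypothetical host name), to be run only after taking a full export:
# print(remove_node("dgraph-zero-public", node_id=6, group_id=2))
```

Keeping the URL construction separate from the request makes it easy to print and check the URL before anything is actually sent to the cluster.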

1Const1 · August 25, 2021, 4:58pm

I have fixed some nodes as described above,

but in Alpha group 3 (3 nodes: 1 healthy, node6; 2 dead, node7 and node8) I saw this message after deleting the data on the dead nodes:
A tick missed to fire. Node blocks too long!

How can I fix it? How can I debug it or understand why this error happens?

iluminae · August 25, 2021, 5:03pm

1 healthy, 2 dead

You may want to read up on raft. Basically, if you have a group size of 3, you have a failure tolerance of 1 node. If you have a group size of 5, you have a failure tolerance of 2 nodes.

I won't be able to give you advice on fixing that since the group is not in a working state - it is beyond its failure tolerance.

1Const1 · October 7, 2021, 7:25pm

Thank you for the answer.
Yes, there are many implementations of the Raft protocol, and I don't know exactly which one the Dgraph team uses.
Is this formula ( =(N/2)-1) correct for fault tolerance in Dgraph's Raft protocol implementation?

How can I change the default group size? It is currently 3 members. For example, if I create 6 nodes, there will now be 2 groups (3 nodes + 3 nodes). I want to change the group size from 3 to 6.

iluminae · October 7, 2021, 9:41pm

If you have 2 groups each with 3 members, and you want each group to have 5 members, then change the --replicas flag on the zeros to 5 and restart them, then add 4 more alpha nodes. They will be automatically assigned to the groups to make them 5 each.

5 members in a group will give a failure tolerance of 2 - having 6 members of a group will still only allow 2 failures so no reason to make it 6 really.
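The numbers quoted in this thread (3 members tolerate 1 failure, 5 and 6 members both tolerate 2) follow from Raft's majority rule: a group stays available while more than half of its members are up. A small sketch of that arithmetic, with hypothetical function names:

```python
def quorum(group_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return group_size // 2 + 1


def failure_tolerance(group_size: int) -> int:
    """Number of members that can fail while a majority remains."""
    return group_size - quorum(group_size)


for size in (3, 5, 6):
    print(size, quorum(size), failure_tolerance(size))
```

For odd group sizes this is the same as (N-1)/2, which is why an even-sized group gains no extra tolerance over the odd size just below it.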
