After a storage error we see the following in the log file of an Alpha node in our K8s cluster:
19 log.go:34] 6 [term: 0] received a MsgHeartbeat message with higher term from 4 [term: 5]
] 6 became follower at term 5
tocommit(25580) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(25580) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
This non-working node is in GROUP=2 with ID=6.
What is the correct way to remove the node from the cluster and restore it?
A GET request to removeNode?id=6&group=2 on dgraph-alpha-public:8080 gives a 404 Not Found error, but a GET /state request to dgraph-alpha-public:8080 correctly answers with 200.
Maybe the removeNode request with these params needs to be sent to a different port of the Alpha node? dgraph-alpha-public currently listens on ports 8080 and 8090.
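Roughly, the checks being run look like this (service name and ports as described above, run from inside the cluster):

```
# /state on the Alpha HTTP port answers with 200:
curl -i "http://dgraph-alpha-public:8080/state"

# ...but /removeNode on the same Alpha port returns 404 Not Found:
curl -i "http://dgraph-alpha-public:8080/removeNode?id=6&group=2"
```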
Then let's assume the removal operation was completed. What steps do we need to follow to recover node id=6 in group=2 after it has lost all its data, while the other 2 nodes in the group continue working as normal? How do we start the replication process? For example, adding a new, clean Alpha node with id=6 to group=2 and having it replicate from the 2 other Alpha nodes in the same group?
The current documentation does not describe this procedure clearly (recovery after failure of Alpha nodes or Zero nodes).
The request should be a GET to the Zero host's HTTP port (default 6080); also, no "/" is needed at the end.
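A minimal sketch of the corrected call, assuming the Zero HTTP endpoint is exposed in the cluster as dgraph-zero-public (that service name is an assumption; use whatever your Zero service is called):

```
# /removeNode is a Zero endpoint, served on Zero's HTTP port (default 6080).
# The service name dgraph-zero-public is assumed here.
curl "http://dgraph-zero-public:6080/removeNode?id=6&group=2"
# note: no trailing "/" after the query parameters
```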
Only one question is bothering me now:
Will the data stored in the group be safe after deleting one of the 3 members and adding a new one back into the group, or does this mean losing the data that was stored on the broken Alpha node?
Yes it’s the zero you have to call /removeNode on.
The process you are following is to fully remove a node from the cluster, so yes, all data needs to be deleted from that node…
…but you probably have more than one node in each group if you are running a real system - the point of this is every member of the group has the exact same data.
1. Take a full export.
2. Remove the bad node with /removeNode.
3. Stop that node's process.
4. Delete all the state on that node (p, w, t directories).
5. Bring up the process with no state.
6. Zero sees it as a new node, sees a group that is down one node, and assigns it to the group where the node was removed.
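A rough shell sketch of those steps for a Kubernetes deployment like the one above; the pod name dgraph-alpha-2, the data path /dgraph, and the dgraph-zero-public service name are assumptions for illustration, not exact commands for your cluster:

```
# 1. Take a full export from a healthy Alpha
#    (older Dgraph versions expose export as a GET on the Alpha HTTP port).
curl "http://dgraph-alpha-public:8080/admin/export"

# 2. Remove the bad node (id=6, group=2) from the cluster via Zero.
curl "http://dgraph-zero-public:6080/removeNode?id=6&group=2"

# 3. Stop the bad node's process and wipe its state: the p, w and t
#    directories. With a StatefulSet + PVC you would wipe or recreate the
#    volume; pod name and path here are illustrative.
kubectl exec dgraph-alpha-2 -- rm -rf /dgraph/p /dgraph/w /dgraph/t
kubectl delete pod dgraph-alpha-2

# 4. The recreated pod starts with no state; Zero treats it as a new node
#    and assigns it to the group that is one member short.
```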
but with a group of 3 Alphas (3 nodes: 1 healthy - node 6, 2 dead - nodes 7 & 8) I saw this message after deleting the data on the dead nodes:
A tick missed to fire. Node blocks too long!
How can I fix it? How can I debug or understand why this error happens?
You may want to read up on raft. Basically, if you have a group size of 3, you have a failure tolerance of 1 node. If you have a group size of 5, you have a failure tolerance of 2 nodes.
I won't be able to give you advice on fixing that since the group is not in a working state - it is beyond its failure tolerance.
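Put as a worked formula, the standard Raft majority rule (consistent with the numbers above) gives the failure tolerance f of a group of N members:

```latex
% a majority of floor(N/2) + 1 members must stay up, so
f = N - \left(\left\lfloor \tfrac{N}{2} \right\rfloor + 1\right)
  = \left\lfloor \tfrac{N-1}{2} \right\rfloor
% N = 3 -> f = 1,   N = 5 -> f = 2,   N = 6 -> f = 2
```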
Thank you for the answer.
Yes, there are many implementations of the Raft protocol, and I don't know which one exactly the Dgraph team uses.
Is this formula ( f = (N/2) - 1 ) correct for fault tolerance in Dgraph's Raft protocol implementation?
How can I resize the default group size? Right now it is 3 members. For example, if I create 6 nodes, there will be 2 groups (3 nodes + 3 nodes). I want to change the group size from 3 to 6.
If you have 2 groups each with 3 members, and you want each group to have 5 members, then change the --replicas flag on the zeros to 5 and restart them, then add 4 more alpha nodes. They will be automatically assigned to the groups to make them 5 each.
5 members in a group will give a failure tolerance of 2 - having 6 members in a group will still only allow 2 failures, so there is no real reason to make it 6.
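A minimal sketch of that resize, assuming plain dgraph zero / dgraph alpha invocations; the hostnames and the --my/--zero addresses are illustrative, only --replicas is the flag being changed:

```
# Restart every Zero with the new group size (it was --replicas 3 before).
dgraph zero --my=zero-0:5080 --replicas 5

# Then start 4 more Alphas; Zero assigns them to the existing groups
# automatically until each group has 5 members.
dgraph alpha --my=alpha-new:7080 --zero=zero-0:5080
```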