Alphas cannot elect a leader

I noticed the following entries in the logs:

I0115 10:07:02.822532      17 log.go:34] 1 is starting a new election at term 7
I0115 10:07:02.822568      17 log.go:34] 1 became pre-candidate at term 7
I0115 10:07:02.822574      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 7
I0115 10:07:02.822586      17 log.go:34] 1 [logterm: 7, index: 40885] sent MsgPreVote request to 3 at term 7
I0115 10:07:02.823123      20 log.go:34] 2 [logterm: 7, index: 40918, vote: 3] rejected MsgPreVote from 1 [logterm: 7, index: 40885] at term 7
I0115 10:07:02.823155      17 log.go:34] 3 [logterm: 7, index: 40918, vote: 3] rejected MsgPreVote from 1 [logterm: 7, index: 40885] at term 7
I0115 10:07:02.822594      17 log.go:34] 1 [logterm: 7, index: 40885] sent MsgPreVote request to 2 at term 7
I0115 10:07:04.922394      17 log.go:34] 1 is starting a new election at term 7
I0115 10:07:04.922434      17 log.go:34] 1 became pre-candidate at term 7
I0115 10:07:04.922440      17 log.go:34] 1 received MsgPreVoteResp from 1 at term 7
I0115 10:07:04.922453      17 log.go:34] 1 [logterm: 7, index: 40885] sent MsgPreVote request to 3 at term 7
I0115 10:07:04.922461      17 log.go:34] 1 [logterm: 7, index: 40885] sent MsgPreVote request to 2 at term 7
I0115 10:07:04.922917      17 log.go:34] 3 [logterm: 7, index: 40918, vote: 3] rejected MsgPreVote from 1 [logterm: 7, index: 40885] at term 7
I0115 10:07:04.922934      20 log.go:34] 2 [logterm: 7, index: 40918, vote: 3] rejected MsgPreVote from 1 [logterm: 7, index: 40885] at term 7

These entries repeat over and over; it looks like the Alphas cannot elect a new leader. Query times are much longer than usual. Is it possible to fix that somehow?

I fixed the problem by removing one of the Alphas (picked at random).


Here is a more detailed explanation (as far as I understand it):

  • Alpha 0 was the leader
  • Alpha 0 had some pending mutations when its pod was terminated
  • the other Alphas elected a new leader while Alpha 0 was down
  • when Alpha 0 came back, it tried to finish those pending mutations, but it was no longer the leader, so it started an election
  • Alpha 0's vote requests were rejected because the other two Alphas had already elected a new leader (see the sketch after this list)
  • so the Alphas ended up in an infinite loop of trying to elect a new leader
  • terminating the current leader allowed Alpha 0 to become the new leader and finish the mutations
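
The rejections in the logs are consistent with Raft's vote rule: a voter only grants a vote or pre-vote if the candidate's log is at least as up-to-date as its own, and here Alpha 0 asks with index 40885 while the others are already at 40918. Below is a minimal Go sketch of that rule, assuming the standard Raft check rather than Dgraph's or etcd/raft's actual code:

package main

import "fmt"

// upToDate reports whether a candidate's log (candTerm, candIndex) is at
// least as up-to-date as a voter's log. This is the condition a Raft voter
// checks before granting a vote or pre-vote; it is a sketch of the rule,
// not the actual Dgraph or etcd/raft code.
func upToDate(candTerm, candIndex, voterTerm, voterIndex uint64) bool {
	return candTerm > voterTerm ||
		(candTerm == voterTerm && candIndex >= voterIndex)
}

func main() {
	// Values taken from the log excerpt above: node 1 campaigns with
	// [logterm: 7, index: 40885], while voters 2 and 3 are at
	// [logterm: 7, index: 40918] and have already voted for node 3.
	fmt.Println(upToDate(7, 40885, 7, 40918)) // false -> MsgPreVote rejected
}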

Shouldn't this be resolved in some other way? For example, if an Alpha notices pending mutations, couldn't it delegate them to the leader?

While you found the solution, I was wondering if @ibrahim might be interested in working on the root cause?

I tried to follow the steps that @lukaszlenart mentioned but I don’t see the alpha getting stuck in a loop.

  • so the Alphas ended up in an infinite loop of trying to elect a new leader

This shouldn't happen. When an old node realizes that it is receiving heartbeats with a higher term, it becomes a follower and should stop trying to become the leader.
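
A minimal sketch of that step-down rule, assuming plain Raft semantics rather than the actual etcd/raft or Dgraph code:

package raftsketch

// Sketch of the step-down rule described above: any message carrying a
// higher term makes the receiver adopt that term and revert to follower,
// which ends any election it was running. (A candidate likewise steps down
// on an equal-term heartbeat from the current leader.)

type state int

const (
	follower state = iota
	preCandidate
	candidate
	leader
)

type node struct {
	id    uint64
	term  uint64
	state state
	lead  uint64 // id of the node we currently believe is the leader
}

// stepHeartbeat handles an incoming heartbeat from peer `from` at `term`.
func (n *node) stepHeartbeat(from, term uint64) {
	if term > n.term {
		n.term = term
		n.state = follower
		n.lead = from
	}
}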

@lukaszlenart are you able to reliably reproduce this issue?

This shouldn't happen. When an old node realizes that it is receiving heartbeats with a higher term, it becomes a follower and should stop trying to become the leader.

This is my understanding of the logs I posted at the start of this topic - I left the cluster alone for the whole weekend and all I saw were exactly the same became pre-candidate at term 7 entries, and the cluster was terribly slow.
Terminating the leader allowed Alpha 0 to become the leader and apply the pending mutations. After that, the cluster became operational again.

@lukaszlenart are you able to reliably reproduce this issue?

Somewhat, yes. When I use our UI and perform a few delete operations on some nodes, I see that a transaction starts but is never committed or rolled back, and the Dgraph client starts throwing Caused by: io.grpc.StatusRuntimeException: UNKNOWN: : context deadline exceeded
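
Our client is the Java one (dgraph4j), but the flow is roughly equivalent to this Go sketch with dgo v2 (the endpoint and the mutation are made up for illustration); presumably the deadline on the transaction's context is what surfaces as the context deadline exceeded error when Commit never goes through:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Hypothetical endpoint, for illustration only.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Give the whole transaction a deadline; when the cluster has no
	// stable leader, Commit never finishes and a deadline like this is
	// what the client ends up reporting as "context deadline exceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	txn := dg.NewTxn()
	defer txn.Discard(ctx) // best-effort rollback if Commit never happened

	// Hypothetical delete, standing in for the UI's delete operations.
	mu := &api.Mutation{DelNquads: []byte("<0x123> * * .")}
	if _, err := txn.Mutate(ctx, mu); err != nil {
		fmt.Println("mutate failed:", err)
		return
	}
	if err := txn.Commit(ctx); err != nil {
		fmt.Println("commit failed:", err)
	}
}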

In the Dgraph logs I see this:

I0119 12:13:40.355373      17 draft.go:1611] Skipping snapshot at index: 50857. Insufficient discard entries: 1. MinPendingStartTs: 58828
I0119 12:13:40.379684      17 draft.go:1448] Found 1 old transactions. Acting to abort them.
I0119 12:13:40.380960      20 draft.go:602] Skipping snapshot at 50857, because found one at 50857
I0119 12:13:40.380871      18 draft.go:602] Skipping snapshot at 50857, because found one at 50857
I0119 12:13:40.381519      17 draft.go:1409] TryAbort 1 txns with start ts. Error: <nil>
I0119 12:13:40.381536      17 draft.go:1432] TryAbort selectively proposing only aborted txns: txns:<start_ts:58828 >

So then I start terminating pods; to terminate the leader I have to use --force:

kubectl delete pod intouch-dgraph-alpha-2 --force --grace-period=0

and then I see this in the logs:

I0119 12:16:42.991090      18 log.go:34] 3 became pre-candidate at term 13
I0119 12:16:42.991096      18 log.go:34] 3 received MsgPreVoteResp from 3 at term 13
I0119 12:16:42.991651      18 log.go:34] 1 [logterm: 13, index: 50893, vote: 2] rejected MsgPreVote from 3 [logterm: 13, index: 50891] at term 13
I0119 12:16:42.991108      18 log.go:34] 3 [logterm: 13, index: 50891] sent MsgPreVote request to 1 at term 13
I0119 12:16:42.991115      18 log.go:34] 3 [logterm: 13, index: 50891] sent MsgPreVote request to 2 at term 13
I0119 12:16:45.591182      18 log.go:34] 3 is starting a new election at term 13
I0119 12:16:45.591788      20 log.go:34] 2 [logterm: 13, index: 50893, vote: 2] rejected MsgPreVote from 3 [logterm: 13, index: 50891] at term 13
I0119 12:16:45.591207      18 log.go:34] 3 became pre-candidate at term 13
I0119 12:16:45.591213      18 log.go:34] 3 received MsgPreVoteResp from 3 at term 13
I0119 12:16:45.591230      18 log.go:34] 3 [logterm: 13, index: 50891] sent MsgPreVote request to 1 at term 13
I0119 12:16:45.591238      18 log.go:34] 3 [logterm: 13, index: 50891] sent MsgPreVote request to 2 at term 13

so then I gracefully terminate the new leader:

kubectl delete pod intouch-dgraph-alpha-1

and hopefully that resolves the problem; if not, I need to terminate the new leader again.

I was able to reproduce the problem again and I think it's related to transaction size - the transaction contains 6 queries and 6 mutations, and such a transaction hangs on the Dgraph side; Dgraph isn't able to abort or drop it:

I0119 14:23:31.249922      16 draft.go:1448] Found 1 old transactions. Acting to abort them.
I0119 14:23:31.251924      16 draft.go:1409] TryAbort 1 txns with start ts. Error: <nil>
I0119 14:23:31.251947      16 draft.go:1432] TryAbort selectively proposing only aborted txns: txns:<start_ts:60011 >
I0119 14:21:48.098522      21 draft.go:1611] Skipping snapshot at index: 52106. Insufficient discard entries: 1. MinPendingStartTs: 60011
I0119 14:24:31.249745      16 draft.go:1611] Skipping snapshot at index: 52106. Insufficient discard entries: 1. MinPendingStartTs: 60011
I0119 14:24:31.249804      16 draft.go:1448] Found 1 old transactions. Acting to abort them.
I0119 14:24:31.259791      16 draft.go:1409] TryAbort 1 txns with start ts. Error: <nil>
I0119 14:24:31.259814      16 draft.go:1432] TryAbort selectively proposing only aborted txns: txns:<start_ts:60011 >
I0119 14:25:31.249798      16 draft.go:1611] Skipping snapshot at index: 52106. Insufficient discard entries: 1. MinPendingStartTs: 60011

So now killing the leader breaks the cluster, as I described above.

Just to clarify: when this transaction hangs, the cluster slows down - that's why we need to restart the pods.
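
For reference, the shape of the transaction that gets stuck looks roughly like this Go/dgo sketch (placeholder queries and data; our real code is Java). Everything shares one transaction and only the final Commit resolves it; while that start_ts stays pending, the server keeps skipping snapshots, which seems to be what the MinPendingStartTs lines above show:

package example

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// runBigTxn sketches a transaction with 6 queries and 6 mutations, like the
// one described above. Nothing is visible to other clients until the final
// Commit; if that Commit never completes, the txn's start_ts stays pending
// on the server. Queries and data here are placeholders, not the real ones.
func runBigTxn(ctx context.Context, dg *dgo.Dgraph) error {
	txn := dg.NewTxn()
	defer txn.Discard(ctx) // roll back if we bail out before Commit

	for i := 0; i < 6; i++ {
		// Placeholder query standing in for one of the 6 real queries.
		if _, err := txn.Query(ctx, `{ q(func: has(name), first: 1) { uid } }`); err != nil {
			return err
		}
		// Placeholder mutation standing in for one of the 6 real mutations.
		mu := &api.Mutation{
			SetNquads: []byte(fmt.Sprintf("_:n%d <name> \"node %d\" .", i, i)),
		}
		if _, err := txn.Mutate(ctx, mu); err != nil {
			return err
		}
	}
	return txn.Commit(ctx)
}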

I'm having the same problem!
After randomly removing nodes, the remaining nodes cannot elect a leader.

I think we found the main cause of these problems - it's still under investigation, but this is the best hunch we have for now.

Our Dgraph cluster had problems once again yesterday. One of the Zero Pods was terminated unexpectedly - no log entries, no OOM, nothing. It happened during heavy load on the cluster, with a lot of transactions going on. This put the cluster in an unstable state.

After digging through the logs and Grafana, we found that one of our EC2 boxes was terminated at the same time. It is a Spot EC2 instance which is restarted every 24h - it should mainly be used by Jenkins, but it looks like there was a bug in a scheduling filter, so regular Pods (like the Dgraph Pods) were also scheduled on those Spot instances. There is also a monitor which tries to re-schedule Pods running on a Spot instance onto another EC2 box, and in most cases it worked well; yet under heavy load on Dgraph this could fail from time to time.

We are going to fix the filter or temporarily remove the Spot instances - hopefully that will make this issue go away :smiley: