Moved from GitHub dgraph/5836
Posted by christian-roggia:
What version of Dgraph are you using?
This issue has been consistently observed in all the following versions:
- v20.03
- v20.07.0-beta.Jun15
- v20.07.0-beta.Jun22
Have you tried reproducing the issue with the latest release?
Yes.
What is the hardware spec (RAM, OS)?
- Google Kubernetes Engine (e2-highmem-4)
- Helm chart (3 replicas of alpha and zero)
- 4 CPUs
- 32 GB of RAM
- SSD storage
- Preemptible
Steps to reproduce the issue (command/config used to run Dgraph).
- Launch dgraph on a preemptible node, where a sudden restart of the node with no graceful shutdown can happen at any time.
- Start heavy write operations (a load generator of this kind is sketched after this list).
- Wait for a node to be restarted.
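For reference, the write load is simply many concurrent clients committing small mutations in a loop. Below is a minimal, illustrative sketch using the dgo client; the endpoint, predicate, and mutation shape are assumptions and not our exact workload.

```go
// Minimal write-load sketch against a Dgraph alpha (illustrative only).
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to an alpha's client gRPC port (9080 by default);
	// the service address here is an assumption.
	conn, err := grpc.Dial("dgraph-alpha.parser.svc.cluster.local:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	ctx := context.Background()
	for i := 0; ; i++ {
		// Each iteration commits one small mutation; running many copies
		// of this loop keeps the cluster under constant write load.
		mu := &api.Mutation{
			CommitNow: true,
			SetNquads: []byte(fmt.Sprintf(`_:n <name> "entity-%d" .`, i)),
		}
		if _, err := dg.NewTxn().Mutate(ctx, mu); err != nil {
			log.Printf("mutation %d failed: %v", i, err)
		}
	}
}
```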
Expected behavior and actual result.
The expected result is that dgraph recovers after a sudden restart.
Partial data loss is also expected, since dgraph was not shut down gracefully.
NOTE: Preemptible machines are not intended to host databases, but they are a good simulation of a disaster (e.g. a cloud provider outage, a network failure, or an unexpected crash of dgraph).
The actual behavior is instead a complete failure of the cluster, which is unable to recover and enters a state where it is no longer reachable.
Received first state update from Zero: counter:3111932 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593993800 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-alpha-1.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593986968 > > members:<key:3 value:<id:3 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593951535 > > tablets:<......> > snapshot_ts:5445471 checksum:17334002985932778878 > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero-headless.parser.svc.cluster.local:5080" > > zeros:<key:2 value:<id:2 addr:"dgraph-zero-1.dgraph-zero-headless.parser.svc.cluster.local:5080" leader:true > > zeros:<key:3 value:<id:3 addr:"dgraph-zero-2.dgraph-zero-headless.parser.svc.cluster.local:5080" > > maxLeaseId:5550000 maxTxnTs:6690000 maxRaftId:3 cid:"27dc9254-8fd4-4c73-bf77-659f319d0830" license:<maxNodes:18446744073709551615 expiryTs:1596482528 enabled:true >
Sending Snapshot Created batch of size: 175 MB in 12.405286971s.
Sending Snapshot Time elapsed: 01m18s, bytes sent: 588 MB, speed: 7.5 MB/sec
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255449
Sending Snapshot Created batch of size: 255 MB in 15.353450423s.
Sending Snapshot Time elapsed: 01m37s, bytes sent: 843 MB, speed: 8.7 MB/sec
Sending Snapshot Created batch of size: 243 MB in 6.770549492s.
Sending Snapshot Time elapsed: 01m45s, bytes sent: 1.1 GB, speed: 10 MB/sec
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255448
Sending Snapshot Created batch of size: 110 MB in 43.377727473s.
Sending Snapshot Created batch of size: 101 MB in 46.845357675s.
Sending Snapshot Sent 20942742 keys
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
Streaming done. Sent 20942742 entries. Waiting for ACK...
Received ACK with done: true
Stream snapshot: OK
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
What is important to notice here are the lines
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
and
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
Attempted solutions to the issue
We tried the following steps to force the recovery of the database:
- Scale the cluster back to 0 replicas (gracefully) and scale back up to 3 replicas
- Restart 2 out of 3 replicas with clean persistent storage (no data inside)
- Scale the cluster to a single replica
After scaling the entire cluster down to 0 replicas and back up to 3 replicas for a clean restart, the following logs appear:
I0706 05:37:06.533661 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:38:06.551211 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:39:06.445989 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:40:06.478813 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:41:06.475061 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:42:06.509597 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:43:06.479442 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:44:06.513435 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:45:06.462318 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:46:06.454675 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:47:06.557153 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:48:06.572650 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:49:06.539283 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:50:06.562204 15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
Additional notes
We suspect that the issue is caused by dgraph-alpha, which accepts and continues to process new entries even when the cluster is not ready (i.e. the majority of replicas are still offline). The error might also be caused by dgraph-zero being unavailable, or by the sudden crash of the entire cluster, which is comparable to a disaster scenario.
We have just enabled liveness and readiness probes on our deployments. We urge the dgraph team to enable such probes by default, as it is unexpected for a Helm chart to ship with probes disabled. We also suspect that having probes disabled contributes significantly to this issue, as write operations resume before the cluster is ready.
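For context, the probes we enabled are plain HTTP checks against alpha's /health endpoint on its HTTP port (8080 by default). The sketch below only illustrates the kind of check such a probe performs; the address and timeout are illustrative and not the chart's actual probe configuration.

```go
// Illustrative health check of the kind a readiness probe would perform.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://localhost:8080/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "alpha not reachable:", err)
		os.Exit(1) // non-zero exit would mark the pod as not ready
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "alpha not healthy, status:", resp.Status)
		os.Exit(1)
	}
	fmt.Println("alpha healthy")
}
```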
This behavior was also observed when the crash was only partial, with the majority of replicas going offline and restarting (2 out of 3 replicas unavailable).
We strongly encourage the dgraph team to simulate disasters with preemptible machines, as this is a valuable way to verify dgraph's disaster recovery capabilities. If a scenario like the one described here were to happen in a production environment, where the cluster experiences a sudden disruption of service and is unable to recover, it would likely have a severe impact on normal and business operations, with long downtime for any service that relies on dgraph.