[disaster recovery] Cluster unable to recover after crash during intensive writing operations

Moved from GitHub dgraph/5836

Posted by christian-roggia:

What version of Dgraph are you using?

This issue has been consistently observed in all the following versions:

  • v20.03
  • v20.07.0-beta.Jun15
  • v20.07.0-beta.Jun22

Have you tried reproducing the issue with the latest release?

Yes.

What is the hardware spec (RAM, OS)?

  • Google Kubernetes Engine (e2-highmem-4)
  • Helm chart (3 replicas of alpha and zero)
  • 4 CPUs
  • 32 GB of RAM
  • SSD storage
  • Preemptible

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Launch dgraph on a preemptible node, where a sudden restart of the node without a graceful shutdown may happen (a sketch of such a setup follows this list).
  2. Start heavy write operations.
  3. Wait for a node to be restarted.
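
A rough sketch of how we provision such an environment (the node-pool flags and the Helm chart repository/name are assumptions based on our setup, not verified instructions for every version):

# Sketch only: create a preemptible GKE node pool matching the spec above.
gcloud container node-pools create dgraph-preemptible \
  --cluster=my-cluster \
  --machine-type=e2-highmem-4 \
  --num-nodes=3 \
  --preemptible

# Install the Dgraph Helm chart (repo URL assumed from the official charts);
# the chart defaults deploy 3 alpha and 3 zero replicas.
helm repo add dgraph https://charts.dgraph.io
helm install dgraph dgraph/dgraph

# Then run a sustained write load (e.g. via dgraph live) and wait for GKE
# to preempt one of the nodes.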

Expected behavior and actual result.

The expected result is that dgraph is able to recover after a sudden restart.
It is also expected that some partial data loss may be observed, as dgraph was not shut down gracefully.

NOTE: Preemptible machines are not intended to host databases, but they are a very good simulation of a disaster (e.g. an outage at a cloud provider, a network failure, or an unexpected crash of dgraph).

The actual behavior of dgraph is instead a complete failure of the cluster, which is unable to recover and enters a state where it is no longer reachable.

Received first state update from Zero: counter:3111932 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"dgraph-alpha-0.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593993800 > > members:<key:2 value:<id:2 group_id:1 addr:"dgraph-alpha-1.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593986968 > > members:<key:3 value:<id:3 group_id:1 addr:"dgraph-alpha-2.dgraph-alpha-headless.parser.svc.cluster.local:7080" last_update:1593951535 > > tablets:<......> > snapshot_ts:5445471 checksum:17334002985932778878 > > zeros:<key:1 value:<id:1 addr:"dgraph-zero-0.dgraph-zero-headless.parser.svc.cluster.local:5080" > > zeros:<key:2 value:<id:2 addr:"dgraph-zero-1.dgraph-zero-headless.parser.svc.cluster.local:5080" leader:true > > zeros:<key:3 value:<id:3 addr:"dgraph-zero-2.dgraph-zero-headless.parser.svc.cluster.local:5080" > > maxLeaseId:5550000 maxTxnTs:6690000 maxRaftId:3 cid:"27dc9254-8fd4-4c73-bf77-659f319d0830" license:<maxNodes:18446744073709551615 expiryTs:1596482528 enabled:true > 
Sending Snapshot Created batch of size: 175 MB in 12.405286971s.
Sending Snapshot Time elapsed: 01m18s, bytes sent: 588 MB, speed: 7.5 MB/sec
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255449
Sending Snapshot Created batch of size: 255 MB in 15.353450423s.
Sending Snapshot Time elapsed: 01m37s, bytes sent: 843 MB, speed: 8.7 MB/sec
Sending Snapshot Created batch of size: 243 MB in 6.770549492s.
Sending Snapshot Time elapsed: 01m45s, bytes sent: 1.1 GB, speed: 10 MB/sec
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 is starting a new election at term 7
1 became pre-candidate at term 7
1 received MsgPreVoteResp from 1 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 2 at term 7
1 [logterm: 7, index: 3270956] sent MsgPreVote request to 3 at term 7
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255448
Sending Snapshot Created batch of size: 110 MB in 43.377727473s.
Sending Snapshot Created batch of size: 101 MB in 46.845357675s.
Sending Snapshot Sent 20942742 keys
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
Streaming done. Sent 20942742 entries. Waiting for ACK...
Received ACK with done: true
Stream snapshot: OK
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7
Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529

What is important to notice here is

1 [logterm: 7, index: 3270956, vote: 1] rejected MsgPreVote from 3 [logterm: 6, index: 2518247] at term 7

and

Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529

Attempted solutions to the issue

We tried the following steps to force the recovery of the database (a sketch of the corresponding scaling commands follows the list):

  • Scale the cluster back to 0 replicas (gracefully) and scale back up to 3 replicas
  • Restart 2 out of 3 replicas with clean persistent storage (no data inside)
  • Scale the cluster to a single replica
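
Concretely, the scale-down/scale-up cycle was along these lines (a sketch; the StatefulSet names and the namespace are the ones generated by the Helm chart in our deployment, so treat them as assumptions for other setups):

# Scale everything down gracefully, then bring it back up.
kubectl -n parser scale statefulset dgraph-alpha --replicas=0
kubectl -n parser scale statefulset dgraph-zero --replicas=0
# Wait for the pods to terminate, then restore the original replica counts.
kubectl -n parser scale statefulset dgraph-zero --replicas=3
kubectl -n parser scale statefulset dgraph-alpha --replicas=3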

After scaling down the entire cluster to 0 replicas and back up to 3 replicas for a clean restart, the following logs appear:

I0706 05:37:06.533661      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:38:06.551211      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:39:06.445989      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:40:06.478813      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:41:06.475061      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:42:06.509597      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:43:06.479442      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:44:06.513435      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:45:06.462318      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:46:06.454675      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:47:06.557153      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:48:06.572650      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:49:06.539283      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529
I0706 05:50:06.562204      15 draft.go:1505] Skipping snapshot at index: 2605051. Insufficient discard entries: 0. MinPendingStartTs: 5255529

Additional notes

We suspect that the issue is caused by dgraph-alpha, which accepts and continues to process new entries even when the cluster is not ready (i.e. the majority of replicas are still offline). The error might also be caused by dgraph-zero not being available, or by the sudden crash of the entire cluster, which is comparable to a disaster.

We just enabled liveness and readiness probes on our deployments. We urge the dgraph team to enable such probes by default, as users do not expect a Helm chart to ship with probes disabled. We also suspect that running with probes disabled contributes to this issue, because write operations resume before the cluster is ready.
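
For reference, the probes we enabled simply target alpha's HTTP health endpoint; a manual check of that endpoint looks roughly like this (the hostname is from our deployment and the path/port are the defaults, so treat the exact values as assumptions):

# What a liveness/readiness probe would hit: alpha serves /health on its
# HTTP port (8080 by default).
curl -sf http://dgraph-alpha-0.dgraph-alpha-headless.parser.svc.cluster.local:8080/health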

This behavior was also observed when the crash was only partial, with the majority of replicas going offline and restarting (2 out of 3 replicas unavailable).

We strongly encourage the dgraph team to simulate disasters with preemptible machines, as this is a valuable experiment to verify dgraph’s disaster-recovery capabilities. If a scenario like the one described here were to happen in a production environment, where the cluster experiences a sudden disruption of its services and is unable to recover, it would likely have a tremendous impact on normal operations as well as business operations, with long downtime for the services that rely on dgraph.

fl-max commented :

This is good to know. We recently disabled probes on Alpha because the recovery time is extremely long when a pod bounces, exceeding any reasonable initialDelaySeconds value and causing the pod to go into a CrashLoop.

christian-roggia commented :

After more tests, we can now conclude that probes do not help with cluster recovery. Once a disaster happens and all zero and alpha replicas collapse at once, we are completely unable to recover the database.

This finding is extremely worrying for us, as it means the cluster won’t recover from a general outage (e.g. a datacenter failure, a Kubernetes failure, or a panic introduced by an update).

We have not been able to bring a single cluster back online after a non-graceful shutdown.

seanlaff commented :

Hey @christian-roggia, I’ve been running into a problem with similar symptoms, as outlined here: Active Mutations stuck forever

Do you see your mutations stuck like this?
[screenshot: Screen Shot 2020-06-11 at 10 20 38 AM]

If I overload a node with writes long enough, I will get into this state, and be unable to recover even if I cut all load. I’m not running on preemptible nodes but I am running with the official helm chart on GKE.

dmitryyankowski commented :

@christian-roggia has this been resolved in master yet? I think I’ve experienced something like this as well locally (I wasn’t running Kubernetes, just the alpha and zero binaries themselves).

christian-roggia commented :

@seanlaff we are setting up monitoring with Datadog right now to better understand where the underlying issue might be coming from.

@dmitryyankowski AFAIK there is no fix yet; I also believe it is still unknown where in the code (or in which component) this issue is located.

We have confirmed that this issue does not appear only in disasters where all nodes experience an outage: it is enough for a single alpha node to go offline suddenly during write-intensive operations, and the cluster will not recover from this state.

It might be an issue related to resource starvation or too heavy a load on the workers, but in general I would not expect the database to collapse and never recover.

We switched to non-preemptible machines where dgraph had around 4 CPUs and 30 GB of RAM for each alpha replica (for a total of 12 CPUs and 90 GB of RAM). One alpha node restarted due to a change in the StatefulSet configuration; this happened in the middle of a full import of our dataset (some hundreds of millions of nodes). The database collapsed and did not recover, with the same symptoms described above. We had to wipe dgraph again; this time we restarted the cluster with HA disabled (single node) and ran the full import successfully. The current situation is not ideal, and I am still worried about large and heavy operations once we go to production and HA is enabled.

dmitryyankowski commented :

Thanks for all that info, @christian-roggia. Please keep us updated in this thread if you run into more issues and such! Definitely want to keep an eye on this!

martinmr commented :

@christian-roggia I don’t think scaling back the cluster will work. I am not sure exactly what would happen but we can try it on our end. Restarting nodes with clean storage makes things worse because now those nodes need to catch up.

The logs don’t seem like an issue. Skipping a snapshot just means there are not enough entries in the write-ahead log to guarantee making one.

Also, what options are you passing to dgraph alpha? Some of those options are related to snapshots, so they are relevant to this issue.
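
For context, the kind of snapshot-related setting I have in mind looks like this; the flag name is taken from the dgraph alpha --help output in these releases, so treat it and the default value as assumptions to verify against your version:

# Example only: lower the Raft-entry threshold so snapshots are taken more
# often (the default is around 10000 in these releases, to my knowledge).
dgraph alpha --snapshot_after 5000 ...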

Also, can you be more specific about the symptoms? Do requests to the cluster fail immediately or do they hang? If they hang I suspect the cluster is recovering but it’s doing the recovery while receiving new mutations so it’s catching up. Can you try this with no load on the cluster? Just start a new cluster, add some data, and wait for it to crash but stop sending load to the cluster before the crash. If the issue is indeed related to having to replay a lot of entries in the write-ahead log, then the cluster should recover fine when there’s no load.

We’ll look into running Dgraph on preemptible nodes on our end.

jarifibrahim commented :

@christian-roggia @seanlaff were you running dgraph in ludicrous mode?

seanlaff commented :

@jarifibrahim No ludicrous mode for me. However I have been using --badger.tables=disk and L0OnMemory=false.

My reproduction was: aim a very high write workload at dgraph, more than the hardware can handle. Eventually a node will OOM. At that point I cut all load, but after waiting 24 hours the mutations never progressed.

iluminae commented :

Another data point here:
I have a locked-up alpha that was slammed with a single, particularly gruesome upsert (one request touching 2M nodes, not sustained load) and has stopped responding for hours now (even after a restart). I only get this in the logs every 60s:

I0710 21:11:21.913375      15 draft.go:1367] Found 1 old transactions. Acting to abort them.
I0710 21:11:21.917590      15 draft.go:1328] TryAbort 1 txns with start ts. Error: <nil>
I0710 21:11:21.917622      15 draft.go:1351] TryAbort selectively proposing only aborted txns: txns:<start_ts:100139 >

According to the Prometheus stats, badger_v2_disk_writes_total is slowly incrementing. Even schema {} queries time out. The node is currently sitting at 2 GB of RAM out of 30 GB available, using 1 core of the 4 available.

running master@156bc23bca4ef941ed5b1e84638961764bd59f27

christian-roggia commented :

@martinmr we use the standard Helm chart for deployment on Kubernetes; the following args are passed to alpha:

dgraph alpha ${TLS_OPTIONS} --my=$(hostname -f):7080 --lru_mb 24576 --zero dgraph-zero-0.dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080

Also, can you be more specific about the symptoms? Do requests to the cluster fail immediately or do they hang? If they hang I suspect the cluster is recovering but it’s doing the recovery while receiving new mutations so it’s catching up. Can you try this with no load on the cluster? Just start a new cluster, add some data, and wait for it to crash but stop sending load to the cluster before the crash. If the issue is indeed related to having to replay a lot of entries in the write-ahead log, then the cluster should recover fine when there’s no load.

The alpha nodes are hanging and no longer responding to any query or mutation; the CPU and RAM are no longer under intensive utilization. We tried letting the cluster run for a few hours to check whether it would recover, but it didn’t. We are now planning to set up Datadog so that we can properly monitor whether the cluster is doing anything at all. Also, once an alpha node goes down there is no load on the cluster anymore, since it rejects all mutations with a timeout (context deadline exceeded).

I would suggest that, if the cluster is recovering, a goroutine ticking every 1-5 minutes and logging the recovery progress would be a nice addition, so that the cluster doesn’t look completely unresponsive when it is in fact performing operations in the background.

The logs don’t seem like an issue. Skipping a snapshot just means there are not enough entries in the write ahead log to guarantee making one.

We reported this log as a symptom of the disaster because it is the only log available after a crash. There is no other log of any kind, just alpha notifying us that it is skipping snapshots. This makes it even harder to understand what is really going on.

We’ll look into running Dgraph in preemptive nodes on our end.

Thank you; hopefully it will also help the Dgraph team simulate real disasters.

@jarifibrahim we are not using ludicrous mode but that’s something we were definitely planning to test for the initial mass import.

@christian-roggia we are experiencing exactly the same issue. You can try to use “Best effort” queries just against that node and they should work. You can also try to perform a mutation against that node and the data will be persisted: if you query other nodes, you will notice that the data exists.

Running normal queries against that node will fail with timeout.
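
For reference, a best-effort query sent directly to the affected alpha over HTTP looks roughly like this (the be=true parameter and the application/graphql+- content type are what we use on v20.03, so treat them as assumptions for other versions; the query itself is only a placeholder):

# Best-effort query against the stuck alpha's HTTP port (8080 by default),
# e.g. run from inside that pod via kubectl exec.
curl -s -X POST 'http://localhost:8080/query?be=true' \
  -H 'Content-Type: application/graphql+-' \
  -d '{ q(func: has(dgraph.type), first: 1) { uid } }'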

We have no idea other than to just export everything and start the cluster from scratch, which scares us, as we are going live with our product in a few months :\
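
For completeness, the export we would fall back to is the admin endpoint, roughly as below; the path is the one documented for v20.03, and newer versions may require the GraphQL /admin endpoint instead, so treat this as an assumption:

# Trigger an export of all data to the alpha's export directory
# (run against a healthy alpha; verify the endpoint for your version).
curl -s http://localhost:8080/admin/export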