Active Mutations stuck forever

Hi all, following up with progress from the previous thread, “Dgraph can’t idle without being oomkilled after large data ingestion”.

We’ve been getting better performance with the changes being pushed to master, which has been great (and also with the WIP changes in https://github.com/dgraph-io/dgraph/pull/5535). I also want to note that L0OnMemory=false makes a major difference for us; without it, the cluster falls over from OOM almost immediately. Any thoughts on exposing that variable as a runtime config?
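For context, here’s a minimal sketch of the option we’re flipping, assuming it maps to Badger v2’s KeepL0InMemory setting (the path and the standalone Badger usage are illustrative only; in our case the same toggle is applied inside Dgraph’s store setup):

```go
// Minimal sketch: open a Badger v2 store with level-0 tables kept on disk
// rather than in memory, trading some write throughput for a much smaller
// memory footprint. The directory path is illustrative only.
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	opts := badger.DefaultOptions("/data/p").
		WithKeepL0InMemory(false) // keep L0 on disk to reduce memory pressure

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

Being able to set that per deployment, rather than rebuilding the image, is what we’re after.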

However, we’re now running into a situation where, if we push our ingestion rate slightly above what Dgraph can comfortably handle, we end up with stuck mutations:

Screen Shot 2020-06-11 at 10.20.38 AM

We cut all load on the cluster, but those mutations never clear on alpha-0, and we can’t ingest any further. Here are some logs from after I tried restarting the node to fix it. Throughout this time, memory and CPU were only at about 25% utilization. What should I try next?

logout_restart_rightimg (295.4 KB)

This image is built from current master with the L0 flag set to keep L0 on disk.

If your Alpha is stuck right now, can you share your goroutine stack trace?

https://dgraph.io/docs/howto/#goroutine-stack
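That doc boils down to pulling the goroutine profile from the pprof endpoint on the Alpha’s HTTP port. Here’s a minimal Go sketch of the same thing, assuming the default Alpha HTTP port 8080 and that the endpoint is reachable from wherever this runs:

```go
// Minimal sketch: fetch a full, human-readable dump of every goroutine from a
// running Alpha's pprof endpoint and save it to a file. Port 8080 is the
// default Alpha HTTP port; adjust if yours differs.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks pprof for full stack traces of all goroutines.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatalf("fetching goroutine dump: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("alpha-goroutine-dump.txt")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing dump: %v", err)
	}
}
```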

Also, it looks like your Zero was down for some time but then recovered.

Sure can. And yup, after the cluster had been in this state for an hour, I tried bouncing everything to see if it was a transient issue, so I think there was some Zero churn at the beginning of the log.

dump (956.2 KB)

Here’s the goroutine dump

Also, here are slightly updated logs from the time of the goroutine trace: alpha0logs (299.5 KB)

Since it feels a bit like a deadlock, I looked to see whether the traces revealed anything. I sampled at 100%, let it run, and then let the cluster shut down gracefully.

Here are the long spans I found. There’s nothing too useful in them, and I’m guessing these are just the long-lived connections between the Alphas, but I thought I’d drop them here just in case.

This log is from an Alpha follower (Raft ID 1). It looks like there were some issues with Zero. I bet the Alpha leader (Raft ID 3) could not connect to Zero properly and get updates from it, or this Alpha follower could not connect to the Alpha leader. Either way, all the queries end up stuck waiting for a timestamp, since timestamps are handed out by Zero.
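If it helps narrow that down, one quick check is whether Zero is reachable and reports a healthy leader. Here’s a minimal sketch that queries Zero’s /state endpoint, assuming the default Zero HTTP port 6080:

```go
// Minimal sketch: query Zero's /state endpoint and print the cluster-state
// JSON (group membership, leaders, timestamp/lease counters, etc.).
// Port 6080 is the default Zero HTTP port; adjust if yours differs.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:6080/state")
	if err != nil {
		log.Fatalf("Zero unreachable: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading /state: %v", err)
	}
	fmt.Println(string(body))
}
```

If Zero responds and shows a leader for every group but the Alphas still can’t get timestamps, that points more toward the Alpha-to-Zero (or Alpha-to-Alpha-leader) connections than toward Zero itself.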

I’ve hit the same problem during online writes since v20.03.2; v20.03.1 is fine.
Here are the Zero and Alpha goroutine logs:
bx-bossnlpgraphdb-11_202007010650_alpha.log.tar.gz (51.0 KB) bx-bossnlpgraphdb-11_202007010650_zero.log.tar.gz (8.2 KB)


@JimWen there’s a discussion over here that you may be interested in: [disaster recovery] Cluster unable to recover after crash during intensive writing operations · Issue #5836 · dgraph-io/dgraph · GitHub. The symptoms bear a resemblance to what we’ve seen here.