Active Mutations stuck forever

Hi all, following up with progress from the previous thread, “Dgraph can’t idle without being oomkilled after large data ingestion”.

We’ve been getting better performance with the changes being pushed to master, which has been great (and also with the WIP changes in https://github.com/dgraph-io/dgraph/pull/5535). I also want to note that L0OnMemory=false makes a major difference for us; without it, the cluster falls over from OOM almost immediately. Any thoughts on exposing that variable as a runtime config?
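For context, here’s a minimal sketch of the option we’re flipping, assuming it maps to Badger v2’s KeepL0InMemory setting (the path and the standalone Badger usage are illustrative only; in our case the same toggle is applied inside Dgraph’s store setup):

```go
// Minimal sketch: open a Badger v2 store with level-0 tables kept on disk
// rather than in memory, trading some write throughput for a much smaller
// memory footprint. The directory path is illustrative only.
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	opts := badger.DefaultOptions("/data/p").
		WithKeepL0InMemory(false) // keep L0 on disk to reduce memory pressure

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

Being able to set that per deployment, rather than rebuilding the image, is what we’re after.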

However, we’re now running into a situation where, if we push our ingestion rate slightly above what Dgraph can comfortably handle, we end up with stuck mutations:

Screen Shot 2020-06-11 at 10.20.38 AM

We cut all load on the cluster, but those mutations never clear on alpha-0, and we can’t ingest any further. Here are some logs from after I tried restarting the node to fix it. Throughout this time, memory and CPU were only at about 25% utilization. What should I try next?

logout_restart_rightimg (295.4 KB)

This image is built from current master with the L0 flag set to keep L0 on disk.

If your Alpha is stuck right now, can you share your goroutine stack trace?

https://dgraph.io/docs/howto/#goroutine-stack
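That doc boils down to pulling the goroutine profile from the pprof endpoint on the Alpha’s HTTP port. Here’s a minimal Go sketch of the same thing, assuming the default Alpha HTTP port 8080 and that the endpoint is reachable from wherever this runs:

```go
// Minimal sketch: fetch a full, human-readable dump of every goroutine from a
// running Alpha's pprof endpoint and save it to a file. Port 8080 is the
// default Alpha HTTP port; adjust if yours differs.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks pprof for full stack traces of all goroutines.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatalf("fetching goroutine dump: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("alpha-goroutine-dump.txt")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing dump: %v", err)
	}
}
```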

Also, it looks like your Zero was down for some time but then recovered.

Sure can. And yup, after the cluster had been in this state for an hour, I tried bouncing everything to see if it was a transient issue, so I think there was some Zero churn at the beginning of the log.

dump (956.2 KB)

Here’s the goroutine dump

Also, here are slightly updated logs from the time of the goroutine trace: alpha0logs (299.5 KB)

Since it feels a bit like a deadlock, I looked to see whether the traces revealed anything. I sampled at 100%, let it run, and then let the cluster shut down gracefully.

Here are the long spans I found. There’s nothing too useful in them, and I’m guessing these are just the long-lived connections between the Alphas, but I thought I’d drop them here just in case.

This log is from an Alpha follower (Raft ID 1). It looks like there were some issues with Zero. I bet the Alpha leader (Raft ID 3) could not connect to Zero properly and get updates from it, or this Alpha follower could not connect to the Alpha leader. Either way, all the queries end up stuck waiting for a timestamp, since timestamps are handed out by Zero.
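If it helps narrow that down, one quick check is whether Zero is reachable and reports a healthy leader. Here’s a minimal sketch that queries Zero’s /state endpoint, assuming the default Zero HTTP port 6080:

```go
// Minimal sketch: query Zero's /state endpoint and print the cluster-state
// JSON (group membership, leaders, timestamp/lease counters, etc.).
// Port 6080 is the default Zero HTTP port; adjust if yours differs.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:6080/state")
	if err != nil {
		log.Fatalf("Zero unreachable: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading /state: %v", err)
	}
	fmt.Println(string(body))
}
```

If Zero responds and shows a leader for every group but the Alphas still can’t get timestamps, that points more toward the Alpha-to-Zero (or Alpha-to-Alpha-leader) connections than toward Zero itself.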

I’ve hit the same problem during online writes since v20.03.2; v20.03.1 is fine.
Here are the Zero and Alpha goroutine logs:
bx-bossnlpgraphdb-11_202007010650_alpha.log.tar.gz (51.0 KB) bx-bossnlpgraphdb-11_202007010650_zero.log.tar.gz (8.2 KB)


@JimWen there’s a discussion over here that you may be interested in: [disaster recovery] Cluster unable to recover after crash during intensive writing operations · Issue #5836 · dgraph-io/dgraph · GitHub. The symptoms bear a resemblance to what we’ve seen here.