We’ve been getting better performance with the changes being pushed to master, which has been great (and also with the WIP changes in https://github.com/dgraph-io/dgraph/pull/5535). I also want to note that L0OnMemory=false makes a major difference for us; without it, the cluster falls over from OOM almost immediately. Any thoughts on exposing that variable as a runtime config option?
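For context, here’s a minimal sketch of how we’re toggling this today when opening the store directly. This assumes Badger v2’s builder-style option API (`WithKeepL0InMemory`); the exact option name and the data directory are placeholders for our setup, not Dgraph’s internals:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Keeping L0 tables on disk instead of in memory trades a little read
	// latency for a much smaller resident set. For our workload that's the
	// difference between a stable cluster and an immediate OOM.
	opts := badger.DefaultOptions("/data/badger"). // path is our setup
		WithKeepL0InMemory(false)              // the knob we'd like exposed at runtime

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

Being able to flip this per-deployment without a rebuild is what we’re after.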
However, we’re now running into a situation where, if we push our ingestion rate slightly above what Dgraph can comfortably handle, mutations get stuck.
We cut all load on the cluster, but those mutations never clear on alpha-0, and we can’t ingest anything further. Here are some logs from after I tried restarting the node to fix it. Throughout this time, memory and CPU were only at about 25% utilization. What should I try next?
Sure can, and yup: after the cluster had been in this state for an hour, I tried bouncing everything to see if it was a transient issue, so I think there was some Zero churn at the beginning of the log.
Since it feels a bit like a deadlock, I checked whether the traces revealed anything. I sampled at 100%, let it run for a while, and then let the cluster shut down gracefully.
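For reference, this is roughly how I turned sampling up. I’m assuming the `--trace` flag on `dgraph alpha` (present on the builds I’ve used; check `dgraph alpha --help` on yours), and the hostnames/ports below are just my cluster’s layout:

```shell
# Trace every request instead of the default sample fraction.
# --my / --zero addresses are placeholders for our deployment.
dgraph alpha --trace 1.0 --my=alpha0:7080 --zero=zero0:5080

# Dgraph uses golang.org/x/net/trace, so the collected spans show up
# on the alpha's HTTP port under the debug pages:
curl http://alpha0:8080/debug/requests
```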
Here are the long spans I found. Nothing too useful in them; I’m guessing these are just the long-lived connections between the Alphas, but I thought I’d drop them here just in case.
This log is from an Alpha follower (Raft ID 1). It looks like there were some issues with Zero. My bet is that the Alpha leader (Raft ID 3) could not connect to Zero properly and receive updates from it, or that this Alpha follower could not connect to the Alpha leader. Either way, all the queries are stuck waiting for a timestamp.