Why is the server overloaded with pending proposals?

We’re running Dgraph 23.1.0, and the following error occurs when ingesting data:

groups.go:1096] While proposing delta with MaxAssigned: 78591850 and num txns: 0.  Error=Server overloaded with pending proposals. Please retry later.  Retrying...

Reference line of code: dgraph/worker/groups.go at main · dgraph-io/dgraph · GitHub

Any advice on what this means? Or how the MaxAssigned is set?

Thanks

Found some information here:
https://dgraph.io/docs/deploy/cli-command-reference/#dgraph-alpha

pending-proposals=256; Number of pending mutation proposals. Useful for rate limiting.

In the case where we don’t want to rate limit, but would rather have the alpha do more work, would we increase this value? If so, what’s a reasonable increase, and how would one track how many pending proposals are queued?

I found this: Metrics - Deploy, although I don’t immediately see how to poll those metrics.

This issue is also helpful for seeing the endpoints that serve the metrics: Docs: Missing Prometheus metrics or wrong metrics name or removed. (Grafana related) · Issue #4772 · dgraph-io/dgraph · GitHub

localhost/debug/prometheus_metrics
localhost/debug/vars

Note: /debug/vars and /debug/prometheus_metrics are specific to the alpha they are served from.

Any advice on what this means? Or how the MaxAssigned is set?

So, MaxAssigned is the maximum timestamp that has been issued so far; it keeps increasing with each proposal. The error means that the alpha has hit its limit on pending proposals.
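
To make that a bit more concrete, here is a rough Go sketch of the general pattern behind the error (not Dgraph’s actual code): there is a fixed budget of in-flight proposals, a proposal that cannot claim a slot fails fast with an overloaded error, and the caller backs off and retries, which is exactly what the log line is reporting. The 256 below just mirrors the documented default.

package main

import (
	"errors"
	"fmt"
	"time"
)

var errOverloaded = errors.New("server overloaded with pending proposals, please retry later")

// proposalLimiter caps how many proposals can be in flight at once,
// playing the role of the pending-proposals setting.
type proposalLimiter struct {
	slots chan struct{}
}

func newProposalLimiter(max int) *proposalLimiter {
	return &proposalLimiter{slots: make(chan struct{}, max)}
}

// propose tries to claim a slot. If every slot is taken, it fails fast so the
// caller can back off and retry instead of piling up more work.
func (l *proposalLimiter) propose(apply func()) error {
	select {
	case l.slots <- struct{}{}:
	default:
		return errOverloaded
	}
	defer func() { <-l.slots }()
	apply()
	return nil
}

func main() {
	limiter := newProposalLimiter(256)

	commitDelta := func() { time.Sleep(10 * time.Millisecond) } // stand-in for applying a delta
	for {
		if err := limiter.propose(commitDelta); err != nil {
			fmt.Println(err, "- retrying...")
			time.Sleep(100 * time.Millisecond)
			continue
		}
		break
	}
}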

We’d rather have the alpha do more work; would we increase this value? If so, what’s a reasonable increase, and how would one track how many pending proposals are queued?

Increasing this value won’t help the alpha do more work; it would only let the alpha queue up more work. As more proposals queue up, the alpha uses more memory. Since the proposals are retried after this error anyway, a deeper queue won’t help performance much. You can still increase the value and see if it helps, though. A reasonable upper bound is roughly the memory you have left divided by the average size of your mutations / queries.
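
To put a rough, purely hypothetical number on that: if the alpha has about 4 GB of memory headroom and an average mutation proposal is around 1 MB, the absolute ceiling would be on the order of 4096 MB / 1 MB ≈ 4096 pending proposals, and in practice you’d stay well below that (say 512 or 1024) to leave room for queries and everything else the alpha is doing. Both figures are illustrative; measure your own proposal sizes before picking a value.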

I found this: Metrics - Deploy, although I don’t immediately see how to poll those metrics.

You can use Grafana (with Prometheus scraping the alphas) to track this, or you can write a poller that hits the HTTP endpoint you mentioned and parses the metrics; a rough sketch of such a poller is below.
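
If you go the do-it-yourself route, here is a minimal Go sketch of such a poller. It assumes the alpha’s HTTP port is the default 8080 and that the gauge is named dgraph_pending_proposals_total; check the raw output of /debug/prometheus_metrics on your own alpha for the exact metric name, since I’m going from memory here.

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

const (
	// Point this at the alpha you care about; the metrics are per alpha.
	metricsURL = "http://localhost:8080/debug/prometheus_metrics"
	metricName = "dgraph_pending_proposals_total" // confirm against your own output
)

func main() {
	for {
		if err := pollOnce(); err != nil {
			log.Printf("poll failed: %v", err)
		}
		time.Sleep(10 * time.Second)
	}
}

// pollOnce fetches the metrics page and prints the lines for the metric we care about.
func pollOnce() error {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Skip the "# HELP" / "# TYPE" comment lines and unrelated metrics.
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metricName) {
			continue
		}
		fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), line)
	}
	return scanner.Err()
}

Log the output somewhere and you can watch how close the queue gets to the configured limit while your ingest is running.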

I would add that the typical approach is to optimize the queries, or whatever other work the server is doing.

For queries in particular, you can use a binary-search approach: take out parts of the query, including individual filter conditions, to see which aspect of a slow query is taking the most time. Time taken is usually a good proxy for workload (unless the Apply Channel is backed up and queries are blocked waiting for updates to make it through a critical section of code, which can happen in an overloaded system processing too many updates).

Then construct a minimal case of the slow query (e.g. the specific combination of query conditions or fields that causes the slowdown) that the community can advise on optimizing.

Having Jaeger traces is also a good way to see what parts of a query are slow.

@Damon and @harshil_goel, thanks for the advice. I am in the process of getting Jaeger up and running. I’ll report back with anything of substance I find.

Any suggestion on what a safe --memory.max-traces value would be? Or how large a trace is? I’m concerned the container will die with an out-of-memory error if I don’t set this.
Ref: Deployment — Jaeger documentation

@rahst12 Personally, we haven’t seen any issues with traces. Dgraph doesn’t publish a lot of events right now, so individual traces won’t be that big. If you do end up with a lot of traces, you can reduce the trace sampling percentage.
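
One related note, with the caveat that I’m going from memory on the flag syntax: the sampling rate is set on the alpha as part of the same superflag that points Dgraph at Jaeger, something like --trace "ratio=0.01; jaeger=http://localhost:14268", where ratio is the fraction of requests that get traced (0.01 is, I believe, the default), so you can tune it up or down depending on how much trace volume you want.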