Extremely slow Oracle().WaitForTs() during mutation

I am trying to get to the bottom of severe performance issues on our dgraph system. I have traces indicating that posting.Oracle().WaitForTs() (here) is taking several seconds on what seems to be most of my queries:

166.75ms: funcId=847459680 funcName=processTaskhca.xid message=Start.
166.75ms: message=Waiting for startTs: 16052571 at node: 3, gid: 1
4.85s:    message=Done waiting for maxAssigned. Attr: "\x00\x00\x00\x00\x00\x00\x00\x00hca.xid" ReadTs: 16052571 Max: 16052571
4.87s:    message=Done waiting for checksum match

What does this indicate?

Dgraph version v21.03, 1 group, 3 Alphas (20 cores / 27GiB RAM each), 2TiB disk, running in GKE.
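
In case it helps, this is roughly the PromQL I have been using to watch how far the Alphas' max-assigned timestamps drift apart; the dgraph_max_assigned_ts metric name is my assumption from the Dgraph metrics docs, so adjust if yours differs:

# Spread between the highest and lowest max-assigned timestamp reported by
# the Alphas. A persistently large spread would explain long WaitForTs stalls,
# since a query's readTs has to wait for the local MaxAssigned to catch up.
max(dgraph_max_assigned_ts) - min(dgraph_max_assigned_ts)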

Are the other instances (Alphas) healthy? Or are they hitting some limitation (like resources)?

I do see the Alphas using 18 of the 20 available cores, but that is more than the 16 Dgraph has available in Dgraph Cloud, so I feel like that's a reasonable amount. The Prometheus metric dgraph_num_queries_total shows a total of ~1.7k queries per second being served, which does not sound like too much, and traffic seems to be spread across the 3 Alphas in the group fairly evenly. Google shows no disk throttling on the PVCs; we have roughly 800MB/s of throughput quota on the 2TiB disk.
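
For reference, the ~1.7k/s figure comes from a PromQL expression along these lines on my dashboard (summing the per-Alpha rates):

# Cluster-wide query rate over the last 5 minutes, summed across the Alphas.
sum(rate(dgraph_num_queries_total[5m]))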

Are you actually seeing a loss in performance? Some of these logs are just noise for users; they are there to guide engineers through certain issues and aren't always a sign of anything serious.

My production system has a backlog of 50 million messages to insert, and the fastest it can ingest them is ~3k nquads/s. The backlog is growing by the minute, so the 3k/s is a real problem. When we scale-tested Dgraph locally we were able to get ~130k nquads/s, so I am trying to understand what is causing the extreme slowdown.

What cloud service are you using? My guess is that container systems have some drawbacks related to I/O. Comparing with a bare-metal machine running Dgraph's binary directly is unfair, because there Dgraph has the whole machine to itself, whereas in a containerized model the container system controls everything. So it is either a matter of tuning Docker/K8s, or a real limitation you could avoid by spreading your cluster out horizontally more and more.

I am using GKE (Google). My scale tests were also in k8s.

I suppose I can attempt to scale the machines down and scale the system out horizontally into many groups. It is a bit rough to do that in production, but I am rather desperate for performance at this point.

Hi @iluminae. Are you seeing these numbers in a Dgraph Cloud backend?

No, this is running in my own GKE cluster.

And what are the machine specs for Zero?

Are these numbers for both Alphas and Zeros?

You can check the dgraph_pending_proposals_total metric to see if there's a bunch of proposals queuing up in the system, which would cause more delays.
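
A minimal PromQL sketch for that, assuming you are already scraping the Alphas with Prometheus:

# Pending Raft proposals per Alpha. If this metric behaves as a gauge of
# currently queued proposals (as Dgraph's metric docs suggest), a sustained
# high value means mutations are backing up behind Raft; if it turns out to
# be a counter, look at its rate() instead.
max by (instance) (dgraph_pending_proposals_total)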

The Zeros are 3x (4 cores, 10GiB) and have extremely low utilization. Here is the kubectl top pods output (I have boosted my Alphas to 30 cores / 27GiB RAM):

NAME                     CPU(cores)   MEMORY(bytes)
graphdb-dgraph-alpha-0   27186m       25486Mi
graphdb-dgraph-alpha-1   27924m       25054Mi
graphdb-dgraph-alpha-2   28195m       25605Mi
graphdb-dgraph-zero-0    208m         192Mi
graphdb-dgraph-zero-1    204m         190Mi
graphdb-dgraph-zero-2    200m         647Mi

Here are the pending proposals: