Extremely slow Oracle().WaitForTs() during mutation

I am trying to get to the bottom of severe performance issues on our dgraph system. I have traces indicating that posting.Oracle().WaitForTs() (here) is taking several seconds on what seems to be most of my queries:

166.75ms: funcId=847459680 funcName=processTaskhca.xid message=Start.
166.75ms: message=Waiting for startTs: 16052571 at node: 3, gid: 1
4.85s:    message=Done waiting for maxAssigned. Attr: "\x00\x00\x00\x00\x00\x00\x00\x00hca.xid" ReadTs: 16052571 Max: 16052571
4.87s:    message=Done waiting for checksum match

What does this indicate?

Dgraph version v21.03, 1 group, 3 Alphas (20 cores / 27GiB RAM each), 2TiB disk, running in GKE.
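
In case it helps, this is roughly the PromQL I have been using to watch how far the Alphas' max-assigned timestamps drift apart; the dgraph_max_assigned_ts metric name is my assumption from the Dgraph metrics docs, so adjust if yours differs:

# Spread between the highest and lowest max-assigned timestamp reported by
# the Alphas. A persistently large spread would explain long WaitForTs stalls,
# since a query's readTs has to wait for the local MaxAssigned to catch up.
max(dgraph_max_assigned_ts) - min(dgraph_max_assigned_ts)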

Are the other instances (Alphas) healthy? Or are they hitting some limitation (like resources)?

I do see the Alphas using 18 of the 20 available cores, but that is more than the 16 Dgraph has available in Dgraph Cloud, so I feel like that's a reasonable amount. The Prometheus metric dgraph_num_queries_total shows a total of ~1.7k queries per second being served, which does not sound like too much, and traffic seems to be spread across the 3 Alphas in the group fairly evenly. Google shows no disk throttling on the PVCs; we have roughly 800MB/s of throughput quota on the 2TiB disk.
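
For reference, the ~1.7k/s figure comes from a PromQL expression along these lines on my dashboard (summing the per-Alpha rates):

# Cluster-wide query rate over the last 5 minutes, summed across the Alphas.
sum(rate(dgraph_num_queries_total[5m]))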

Are you actually seeing a loss in performance? Some of these logs are just noise for users; they are there to guide engineers through certain issues and aren't always a sign of anything serious.

My production system has a backlog of 50 million messages to insert, and the fastest it can ingest them is ~3k nquads/s. The backlog is growing by the minute, so the 3k/s is a real problem. When we scale-tested Dgraph locally we were able to get ~130k nquads/s, so I am trying to understand what is causing the extreme slowdown.

What cloud service are you using? My guess is that container systems have some drawbacks related to I/O. Comparing with a bare-metal machine running Dgraph's binary directly is unfair, because there Dgraph has the whole machine to itself, whereas in a containerized model the container system controls everything. So it is either a matter of tuning Docker/K8s, or a real limitation you could avoid by spreading your cluster out horizontally more and more.

I am using GKE (Google). My scale tests were also in k8s.

I suppose I can attempt to scale the machines down and scale the system out horizontally into many groups. It is a bit rough to do that in production, but I am rather desperate for performance at this point.

Hi @iluminae. Are you seeing these numbers in a Dgraph Cloud backend?

No, this is running in my own GKE cluster.

And what are the machine specs for Zero?

Are these numbers for both Alphas and Zeros?

You can check the dgraph_pending_proposals_total metric to see if there's a bunch of proposals queuing up in the system, which would cause more delays.
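
A minimal PromQL sketch for that, assuming you are already scraping the Alphas with Prometheus:

# Pending Raft proposals per Alpha. If this metric behaves as a gauge of
# currently queued proposals (as Dgraph's metric docs suggest), a sustained
# high value means mutations are backing up behind Raft; if it turns out to
# be a counter, look at its rate() instead.
max by (instance) (dgraph_pending_proposals_total)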

The Zeros are 3x (4 cores, 10GiB) and have extremely low utilization. Here is the kubectl top pods output (I have boosted my Alphas to 30 cores / 27GiB RAM):

NAME                     CPU(cores)   MEMORY(bytes)
graphdb-dgraph-alpha-0   27186m       25486Mi
graphdb-dgraph-alpha-1   27924m       25054Mi
graphdb-dgraph-alpha-2   28195m       25605Mi
graphdb-dgraph-zero-0    208m         192Mi
graphdb-dgraph-zero-1    204m         190Mi
graphdb-dgraph-zero-2    200m         647Mi

Here are the pending proposals: