I am trying to get to the bottom of severe performance issues on our dgraph system. I have traces indicating that posting.Oracle().WaitForTs() (here) is taking several seconds on what seems to be most of my queries:
166.75ms: funcId=847459680 funcName=processTaskhca.xid message=Start.
166.75ms: message=Waiting for startTs: 16052571 at node: 3, gid: 1
4.85s: message=Done waiting for maxAssigned. Attr: "\x00\x00\x00\x00\x00\x00\x00\x00hca.xid" ReadTs: 16052571 Max: 16052571
4.87s: message=Done waiting for checksum match
What does this indicate?
dgraph version v21.03, 1 group, 3 alphas, 20 cores and 27GiB RAM each, 2TiB disk, in GKE.
I do see the alphas using 18 of the 20 available cores, which is more than the 16 dgraph gets in dgraph cloud, so I feel like that's a decent amount. The Prometheus stat dgraph_num_queries_total shows a total of ~1.7k queries per second being served, which does not sound like too much. Traffic seems to be spread fairly evenly across the 3 alphas in the group. Google shows no disk throttling on the PVCs; we have roughly an 800MB/s throughput quota on the 2TiB disk.
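For anyone reading along: my reading of the trace is that the query is blocked until the node's MaxAssigned timestamp catches up to the query's ReadTs (the "Waiting for startTs" / "Done waiting for maxAssigned" lines). Below is a minimal sketch of that kind of wait, purely illustrative and not Dgraph's actual oracle code; the type and field names are my own.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// oracle is a toy stand-in for a timestamp oracle: readers block until
// maxAssigned has caught up to their read timestamp.
type oracle struct {
	mu          sync.Mutex
	cond        *sync.Cond
	maxAssigned uint64
}

func newOracle() *oracle {
	o := &oracle{}
	o.cond = sync.NewCond(&o.mu)
	return o
}

// WaitForTs blocks until maxAssigned >= readTs. If the node is slow to learn
// about newly assigned timestamps, this is where a query spends its time.
func (o *oracle) WaitForTs(readTs uint64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	for o.maxAssigned < readTs {
		o.cond.Wait()
	}
}

// SetMaxAssigned is called as the node learns about newly assigned timestamps.
func (o *oracle) SetMaxAssigned(ts uint64) {
	o.mu.Lock()
	if ts > o.maxAssigned {
		o.maxAssigned = ts
	}
	o.mu.Unlock()
	o.cond.Broadcast()
}

func main() {
	o := newOracle()
	start := time.Now()
	go func() {
		// Simulate the node only hearing about ts 16052571 a few seconds later.
		time.Sleep(3 * time.Second)
		o.SetMaxAssigned(16052571)
	}()
	o.WaitForTs(16052571) // this is the several-second stall seen in the trace
	fmt.Printf("done waiting after %v\n", time.Since(start))
}
```

If that mental model is right, the ~4.7s gap in the trace appears to be time spent waiting for the node to learn that timestamp 16052571 has been assigned, not time spent executing the query itself.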
Are you actually seeing any loss in perf? Some of these logs are just noise for users; they are there to guide engineers through certain issues, and aren't always a sign of anything too serious.
My production system has a backlog of 50 million messages to insert, and the fastest it can ingest them is ~3k nquads/s. The backlog is growing by the minute, so the 3k/s rate is a real problem. When we scale-tested dgraph locally we were able to get ~130k nquads/s, so I am trying to understand what is causing the extreme slowdown.
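For reference, here is roughly how I think about the nquads/s number: batched set mutations through the Go client dgo, timed end to end. This is just a minimal sketch with a placeholder endpoint, batch size, and predicate, not our actual loader.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Placeholder endpoint; in k8s this would be the alpha service address.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	ctx := context.Background()
	const batches = 100
	const nquadsPerBatch = 1000 // batch size is a guess; tune for your workload

	start := time.Now()
	for i := 0; i < batches; i++ {
		// Build one batch of N-Quads; the blank nodes and predicate are placeholders.
		var nquads []byte
		for j := 0; j < nquadsPerBatch; j++ {
			nquads = append(nquads, []byte(fmt.Sprintf(
				"_:m%d_%d <hca.xid> \"msg-%d-%d\" .\n", i, j, i, j))...)
		}
		txn := dg.NewTxn()
		_, err := txn.Mutate(ctx, &api.Mutation{SetNquads: nquads, CommitNow: true})
		if err != nil {
			txn.Discard(ctx)
			log.Fatal(err)
		}
	}
	elapsed := time.Since(start).Seconds()
	fmt.Printf("~%.0f nquads/s\n", float64(batches*nquadsPerBatch)/elapsed)
}
```

(Batch size and client parallelism obviously matter a lot here; the sketch is single-threaded just to show the shape of the loop.)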
What cloud service are you using? My guess is that container systems have some drawbacks related to IO. Comparing against a bare-metal machine running Dgraph's binary directly isn't quite fair, since there Dgraph has total use of the machine, whereas in a containerized model the container system controls everything. So it is either a matter of tuning Docker/K8s, or a real limitation you could avoid by spreading your cluster horizontally more and more.
I am using GKE (Google). My scale tests were also in k8s.
I suppose I can attempt to scale down the individual alphas and scale the system out horizontally into many groups. It is a bit rough to do that in production, but I am rather desperate for performance at this point.