Increasing latency

Thanks @vdubc, one of our engineers, @ashishgoswami, is looking at the issue right now and will get back to you soon.

Hello @pawan. I have an update: it seems to be related to AWS volumes and IOPS limits.
What’s strange is that we don’t see any disk problems in the Dgraph logs. We’ll try local storage and I’ll report back here afterwards.
Thank you for your time

UPD:
@pawan
The situation has repeated (a fresh cluster of 3 instances, replicas=3, local storage): all instances write at about 7 MB/s, and after roughly 40 minutes the write speed falls to 1 MB/s, latency rises from 100 ms to 10 s, and the same errors appear (“Read index context timed out”, “Got error: Assigning IDs is only allowed on leader”), with the same logs.
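
For illustration only, here is a minimal sketch of the kind of sustained write load described above, assuming a hypothetical dgo client, endpoint, and predicate; the actual benchmark program is not included in this thread.

```go
// load.go: a hypothetical, minimal write-load generator, NOT the benchmark
// program discussed in this thread. Endpoint and predicate are illustrative.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to one Alpha's gRPC endpoint; adjust the address to your cluster.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Commit small mutations in a loop and log how long each commit takes,
	// so latency growth over time becomes visible on the client side.
	for i := 0; ; i++ {
		start := time.Now()
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		mu := &api.Mutation{
			SetNquads: []byte(fmt.Sprintf(`_:n <name> "item-%d" .`, i)),
			CommitNow: true,
		}
		_, err := dg.NewTxn().Mutate(ctx, mu)
		cancel()
		if err != nil {
			log.Printf("mutation %d failed after %s: %v", i, time.Since(start), err)
			continue
		}
		if i%1000 == 0 {
			log.Printf("mutation %d committed in %s", i, time.Since(start))
		}
	}
}
```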

Hey @vdubc, we have tried something similar on our side and are able to reproduce the issues you are seeing. We are working to fix those issues. Will update you on the progress.

Hello, @ashishgoswami. Thank you for the update.
Could you share an issue number or a link to the problem on github.com (if you’ve created one) so that I can track it?
Thanks

Hey @vdubc, we have two PRs to address the issue.

https://github.com/dgraph-io/dgraph/pull/4453
https://github.com/dgraph-io/dgraph/pull/4472

We will merge those into master by tomorrow. You can run your workload on master and let us know your findings.


Hey @vdubc, we have merged both PRs. We are still working on more optimisations.
In the meantime, can you try running your benchmarks on the master branch?

Hey, @ashishgoswami. Thank you for the update. I built and ran master on fresh instances, but after a few hours of work the errors occurred again, along with increasing latency:

b not forwarding to leader 16 at term 3; dropping proposal
Read index context timed out
Assigning IDs is only allowed on leader.


dgraph-logs.zip (32.2 KB)

Hey @vdubc, thanks for getting back to us. We will look into the logs and get back to you.
Also, if possible, please try running your benchmarks with a cluster of 1 Zero and 3 Alphas and let us know your findings.

Hey, @ashishgoswami.
It has been working for 23 hours without errors (one Zero and three Alphas). The latency is much lower, but it is still increasing (it has already gone from 40 ms to 10 s), and the disk write speed has fallen from 4.3 to 1.0 MB/s.

Charts attached: RPS to my service, P99 latency, P75 latency.

dgraph-logs.zip (249.3 KB)

Some update: after 4 days of running, the errors are back in the logs and the latency leaves much to be desired. Logs attached:

dgraph-logs.zip (1.9 MB)

4 posts were split to a new topic: Increasing Latency in v20.03.3

Hey @vdubc, we recently released Dgraph v20.11.0. Can you run your tests on the latest release? We’ve made a bunch of performance improvements in it.

Hey @ibrahim. Yes, I can, but I need to set up my whole flow again and that takes time; I’ll come back in a few days.


Hey @vdubc

Sending a quick follow-up to check whether you’ve had a chance to run the test with Dgraph v20.11.0, as suggested by @ibrahim.

Thanks

Hey @omar! Yes, I started the flow several hours ago and it looks good for now (as it did previously), so now I’m watching the metrics and waiting. I’ll come back when I have the results.


@vdubc great, please keep us posted

Hi @vdubc

Sending a quick follow-up: how are things going? Do your metrics still look good?

Please share an update when you get a chance.

Hi @vdubc

We’ve released Dgraph v20.11.1 (Changelog)

Wondering whether you’ve had a chance to test your use case with one of our latest releases.

Best,
Omar Ayoubi

Hey, @omar. It (v20.11.0) looked much better, but still not good: after two days of good latency (50-500 ms), transactions started to hang (for up to 8 hours). After restarting the service it worked for about an hour and then went back to hanging again.
I’ll try to gather logs and metrics when I get a chance, after other issues (in one or two weeks). Sorry, but this is no longer a mandatory project for me; I may try to bring in someone else from our team or come back myself. Sorry for the delay.
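
A side note that may help while gathering data: bounding each transaction with a context deadline makes a hung commit return an error instead of blocking for hours, so the hang shows up in the service logs with a timestamp. Below is a minimal sketch under that assumption, using a dgo client; the endpoint, predicate, and timeout are illustrative, not part of the original setup.

```go
// deadline.go: a hypothetical helper that bounds a single Dgraph transaction
// with a deadline; endpoint, predicate, and timeout are illustrative.
package main

import (
	"context"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
	"google.golang.org/grpc"
)

// mutateWithDeadline runs one mutation and gives up after the given timeout,
// so a hanging commit surfaces as a context deadline error.
func mutateWithDeadline(dg *dgo.Dgraph, nquads string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	_, err := txn.Mutate(ctx, &api.Mutation{
		SetNquads: []byte(nquads),
		CommitNow: true,
	})
	return err
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	if err := mutateWithDeadline(dg, `_:probe <name> "probe" .`, 10*time.Second); err != nil {
		log.Printf("transaction did not complete within 10s: %v", err)
	}
}
```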

@vdubc would you be able to share your program with us? (I couldn’t find it in any of the previous messages). It’s easier for me to debug the issue if I can reproduce it on our end. I believe @ashishgoswami had a program but I can’t find it right now.