7. Logs of server and zero: s.log (93.0 KB) z.log (18.0 KB)
8. What are we doing wrong? After several hours the memory runs out (tested on 16), the CPU cores go to 100%, and errors such as “Assigning IDs is only allowed on leader.” appear, along with long delays on sync, disk, or proposals. The data itself is written in transactions almost without errors.
Thanks for sharing a detailed post about the problem you are facing. Could you share some more details about your usage of Dgraph?
I am assuming you are using AWS instances with 32GB RAM for each instance? Is that correct?
How many nodes do you have in your graph? Does this problem happen on a fresh cluster or a cluster which already has some data?
How many groups does your Dgraph cluster have? Are the 5 alpha servers part of the same group or different groups? What is your replication factor?
I don’t see anything wrong in the logs. If you can share a way for us to replicate this, we can investigate it more easily. I am also happy to get on a call with you to understand the problem. Feel free to drop me a mail at PAWAN AT DGRAPH.IO
Hello, @pawan. Thanks for your attention, and I’m sorry for the delay in answering:
Yes, that’s correct.
We have zero, alpha and ratel nodes on all five instances. This happens on a fresh cluster.
They are in the same group, the replication factor is 5.
I would like to share a way to replicate this, but I can’t: it’s a real data flow that is difficult to imitate.
Thanks for your reply @vdubc. We are going to look into this right away and try to replicate this on our end. We’ll keep you informed about our findings.
Thank you @pawan. I have been trying to reproduce it with benchmarks and our flow, and it seems I’ve found a problem. I’ll try some more tomorrow and write back with the result. Thank you
Hello @pawan, this week I benchmarked my service in many ways, for 3 hours each, and everything was OK: the latency stayed constant at up to 500ms. Today I ran my service again against the same cleaned cluster and waited for the errors to appear. The logs with errors are attached, maybe they help: dgraph-server.log (201.2 KB) dgraph-zero.log (114.6 KB)
Hello @pawan. I have an update; it seems to be related to AWS volumes and IOPS blocking:
It’s strange that we don’t see any disk problems in the Dgraph logs. We’ll try local storage, and I’ll post the news here afterwards.
Thank you for your time
UPD: @pawan
The situation repeats (fresh cluster of 3 instances, replicas=3, local storage): all instances write at about 7M/s, and after about 40 minutes the write speed falls (to 1M/s), the latency increases (from 100ms to 10s), and the errors appear (“Read index context timed out”, “Got error: Assigning IDs is only allowed on leader”). The logs look the same.
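For anyone trying to reproduce this, here is a minimal sketch of how the slowdown described above could be put into a single number. It assumes per-request latencies in milliseconds are already being collected in order; the function name `latency_drift` and the `window` parameter are hypothetical and not part of Dgraph or any client library.

```python
from statistics import median


def latency_drift(samples_ms, window=5):
    """Ratio of the median of the last `window` latencies to the
    median of the first `window` latencies.

    `samples_ms` is an ordered list of positive latencies in
    milliseconds (an assumption of this sketch). A ratio near 1.0
    means latency is stable; a large ratio means it has degraded.
    """
    if len(samples_ms) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    first = median(samples_ms[:window])
    last = median(samples_ms[-window:])
    return last / first
```

With latencies going from 100ms to 10s, as in the run above, the ratio would be around 100, while a healthy run should stay close to 1.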
Hey @vdubc, we have tried something similar on our side and are able to reproduce the issues you are seeing. We are working to fix those issues. Will update you on the progress.
Hello, @ashishgoswami. Thank you for the update.
Could you leave me an issue number or a link to the problem on github.com (if you’ve created one) so that I can track it?
Thanks
Hey @vdubc, we have merged both PRs. We are still trying more optimisations.
In the meantime, can you try running your benchmarks on the master branch?
Hey, @ashishgoswami. Thank you for the update. I built from master and ran it on fresh instances, but after a few hours of work the latency increased and the errors occurred:
b not forwarding to leader 16 at term 3; dropping proposal
Read index context timed out
Assigning IDs is only allowed on leader.
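To see how often each of these messages shows up, and in which log, a small tally can help. The sketch below only does substring matching on the three messages quoted in this thread; the function name `tally_errors` is hypothetical, and the real logs may contain other relevant errors that this list misses.

```python
from collections import Counter

# Error messages quoted in this thread; the list is not exhaustive.
KNOWN_ERRORS = (
    "Assigning IDs is only allowed on leader",
    "Read index context timed out",
    "dropping proposal",
)


def tally_errors(lines):
    """Count how many lines contain each known error message.

    `lines` is any iterable of strings, for example an open log file.
    """
    counts = Counter()
    for line in lines:
        for msg in KNOWN_ERRORS:
            if msg in line:
                counts[msg] += 1
    return counts
```

It can be run over an open log file, for example `tally_errors(open("dgraph-server.log", errors="replace"))`, and the counts from the alpha and zero logs compared side by side.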
Hey @vdubc, thanks for getting back to us. We will look into the logs and get back to you.
Also, if possible, please try to run your benchmarks on a cluster with 1 zero and 3 alphas and let us know your findings.
Hey, @ashishgoswami.
It has been working for 23 hours without errors (one zero and three alphas). The latency is still increasing, though much more slowly (already from 40ms to 10s), and the disk write speed has fallen from 4.3 to 1.0 MB/sec.
Hey @vdubc, we recently released Dgraph v20.11.0. Can you run the tests on the latest release? We’ve made a bunch of performance improvements in this release.