Increasing latency

we have:

1. dgraph 1.1.0:

Built at 2019-10-28T17:50:33.873Z
Commit: c8e58bf
Commit Info: c8e58bf Mon Oct 28 19:50:19 2019 +0200 (HEAD -> master, origin/master, origin/HEAD)

2. 5 instances x 4cores, 32GB each (aws cloud).
3. dgo client (v2)
4. schema:

<friend_block>: [uid] @reverse .
<friends>: [uid] @reverse .
<in_room>: [uid] @count @reverse .
<logged_at>: datetime @index(hour) .
<user_id>: int @index(int) @upsert .

type User {
    user_id: int
    logged_at: datetime
    friend_block: [uid]
    friends: [uid]
    in_room: [uid]

5. two main endpoints with queries and graphs of latency for the first hour:
5.1: creating user with edges (latency starts about 30ms, as expected)

# get

# create
_:vtx <dgraph.type> \"User\" .
_:vtx <logged_at> \"2019-11-05T14:14:05+02:00\"^^<xs:dateTime> .
_:vtx <user_id> \"1\"^^<xs:int> .

# get friends
{get(func:eq(user_id,1, 2)){uid,user_id}}

# relate
<0x9c41> <friend_block> <0x9c42>
<0x9c41> <friends> <0x9c42>

5.2: deleting user: <0x9c41> * * .

6. Metrics:

7. Logs of server and zero:
s.log (93.0 KB)
z.log (18.0 KB)

8. What are we doing wrong? After several hours, the memory runs out (tested on 16), the cores of CPU take off to 100% and such errors appear:
Assigning IDs is only allowed on leader. with big delay on sync, disk or proposals. The data is writing almost without errors in transactions.

Hey @vdubc

Thanks for sharing a detailed post about the problem that you are facing. Could you share some more details about your usage of Dgraph.

  1. I am assuming you are using AWS instances with 32GB RAM for each instance? Is that correct?

  2. How many nodes do you have in your graph? Does this problem happen on a fresh cluster or a cluster which already has some data?

  3. How many groups does your Dgraph cluster have? Are the 5 alpha servers part of the same group or different groups? What is your replication factor?

I don’t see anything wrong in the logs. If you can share us a way for us to replicate this, we can investigate this more easily. I am also happy to get on a call with you to understand the problem here. Feel free to drop me a mail at PAWAN AT DGRAPH.IO

Hello, @pawan. Thanks for your attention and I’m sorry for the delaying with the answer:

  1. Yes, that’s correct.
  2. We have zero, alpha and ratel nodes on all five instances. This happens on a fresh cluster.
  3. They are in the same group, the replication factor is 5.
    I would like to share the way to replicate this but I can’t because it’s real data flow, difficult to imitate.

Thanks for your reply @vdubc. We are going to look into this right away and try to replicate this on our end. We’ll keep you informed about our findings.

Thank you @pawan. I was trying to reproduce it by benchmarks and our flow and seems I’ve found a problem. I’ll try tomorrow some and write about a result. Thank you

1 Like

If you are able to reproduce it using benchmarks and are able to share the benchmark test then that would be really helpful to us. Thanks!

Hello @pawan, this week I was benchmarking my service in many ways by 3 hours each and everything was ok, the latency was till 500ms and kept constant. Today I did run again the same cleaned cluster, my service and waited for the appearance of errors, - the logs with errors are attached, maybe it helps:
dgraph-server.log (201.2 KB)
dgraph-zero.log (114.6 KB)

Thanks @vdubc, one of our engineers @ashishgoswami is looking at the issue right now and would get back to you soon.

Hello @pawan. I have some update, seems it relates to AWS volumes and IOPS blocking:
It’s strange, why in dgraph-logs we don’t see problems on disks. We’ll try local storages and I’ll write here about the news after.
Thank you for your time

The situation is repeated (fresh cluster from 3 instances, replicas=3, local storages), all instances are writing about 7M/s and in about 40min the writing speed is falling(to 1M/s), latency is increasing (from 100ms to 10s), the errors are appears (“Read index context timed out”, “Got error: Assigning IDs is only allowed on leader”), the same logs.

Hey @vdubc, we have tried something similar on our side and are able to reproduce the issues you are seeing. We are working to fix those issues. Will update you on the progress.

Hello, @ashishgoswami. Thank you for the update.
Could you leave me an issue number or link on the problem at (if you’ve created) so that I can track?

Hey @vdubc, we have two PRs to address the issue.

We will merge those into master by tomorrow. You can run your workload on master and let us know your findings.

1 Like

Hey @vdubc, we have merged both the PRs. We are still trying more optimisations.
In the mean time can you try running your benchmarks on master branch?

Hey, @ashishgoswami. Thank you for the update, I did build and run on fresh instances but the errors occurred after a few hours of work and increasing latency.

b not forwarding to leader 16 at term 3; dropping proposal
Read index context timed out
Assigning IDs is only allowed on leader. (32.2 KB)

Hey @vdubc, thanks for getting back to us. We will look into the logs and get back to you.
Also, if possible, please try to run your benchmarks with 1 zero and 3 alphas cluster and let us know your findings.

Hey, @ashishgoswami.
It’s working 23 hours without errors (one zero and three alphas). The latency is much slower but anyway increasing (already from 40ms to 10s) and disk writing speed has fallen from 4.3 to 1.0 MB/sec.

RPS to my service:


P75: (249.3 KB)

Some update:
after 4 days of work there are errors in the logs and the latency leaves much to be desired: (1.9 MB)