Increasing latency

vdubc · November 5, 2019, 2:30pm

we have:

1. dgraph 1.1.0:

Built at 2019-10-28T17:50:33.873Z
Commit: c8e58bf
Commit Info: c8e58bf Mon Oct 28 19:50:19 2019 +0200 (HEAD -> master, origin/master, origin/HEAD)

2. 5 instances x 4 cores, 32GB each (AWS cloud).
3. dgo client (v2)
4. schema:

<friend_block>: [uid] @reverse .
<friends>: [uid] @reverse .
<in_room>: [uid] @count @reverse .
<logged_at>: datetime @index(hour) .
<user_id>: int @index(int) @upsert .

type User {
    user_id: int
    logged_at: datetime
    friend_block: [uid]
    friends: [uid]
    in_room: [uid]
}

5. Two main endpoints, with their queries and latency graphs for the first hour:
5.1: creating a user with edges (latency starts at about 30ms, as expected)

# get
{get(func:eq(user_id,1)){uid,user_id}}

# create
_:vtx <dgraph.type> "User" .
_:vtx <logged_at> "2019-11-05T14:14:05+02:00"^^<xs:dateTime> .
_:vtx <user_id> "1"^^<xs:int> .

# get friends
{get(func:eq(user_id,1, 2)){uid,user_id}}

# relate
<0x9c41> <friend_block> <0x9c42> .
<0x9c41> <friends> <0x9c42> .

5.2: deleting user: <0x9c41> * * .

6. Metrics:

7. Logs of server and zero:
s.log (93.0 KB)
z.log (18.0 KB)

8. What are we doing wrong? After several hours the memory runs out (tested on 16), all CPU cores climb to 100%, and errors like this one appear, along with long delays on sync, disk or proposals:
Assigning IDs is only allowed on leader.
The data is written in transactions almost without errors.
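
For reference, the create and relate mutations from 5.1 can be generated with a little string formatting. This is only a sketch in plain Python (the thread itself uses the dgo client). The helper names `user_nquads` and `relate_nquads` are hypothetical, and sending the result to Dgraph is left out:

```python
def user_nquads(user_id: int, logged_at: str, blank: str = "vtx") -> str:
    """Build the N-Quads that create one User node under a blank-node label."""
    subject = f"_:{blank}"
    lines = [
        f'{subject} <dgraph.type> "User" .',
        f'{subject} <logged_at> "{logged_at}"^^<xs:dateTime> .',
        f'{subject} <user_id> "{user_id}"^^<xs:int> .',
    ]
    return "\n".join(lines)


def relate_nquads(src_uid: str, dst_uid: str, predicates) -> str:
    """Build one uid-to-uid N-Quad per predicate; each line ends with ' .'."""
    return "\n".join(f"<{src_uid}> <{p}> <{dst_uid}> ." for p in predicates)
```

Calling `user_nquads(1, "2019-11-05T14:14:05+02:00")` gives the three create lines shown in 5.1, and `relate_nquads("0x9c41", "0x9c42", ["friend_block", "friends"])` gives the two relate lines.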

pawan · November 11, 2019, 11:16pm

Hey @vdubc

Thanks for sharing a detailed post about the problem you are facing. Could you share some more details about your usage of Dgraph?

I am assuming you are using AWS instances with 32GB RAM for each instance? Is that correct?
How many nodes do you have in your graph? Does this problem happen on a fresh cluster or a cluster which already has some data?
How many groups does your Dgraph cluster have? Are the 5 alpha servers part of the same group or different groups? What is your replication factor?

I don’t see anything wrong in the logs. If you can share a way for us to replicate this, we can investigate it more easily. I am also happy to get on a call with you to understand the problem. Feel free to drop me a mail at PAWAN AT DGRAPH.IO

vdubc · November 19, 2019, 9:47am

Hello, @pawan. Thanks for your attention, and sorry for the delay in answering:

Yes, that’s correct.
We have zero, alpha and ratel nodes on all five instances. This happens on a fresh cluster.
They are in the same group, the replication factor is 5.
I would like to share the way to replicate this but I can’t because it’s real data flow, difficult to imitate.
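
A side note on what a replication factor of 5 in a single group implies. Dgraph replicates each group with Raft, where a write is committed only after a majority of the replicas has persisted it. The sketch below is generic Raft majority arithmetic, not Dgraph code, and the function names are made up for illustration:

```python
def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return replicas - raft_quorum(replicas)
```

With 5 replicas a commit needs 3 acknowledgements and the group survives 2 failures. With 3 replicas it needs 2 acknowledgements and survives 1 failure.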

pawan · November 25, 2019, 10:26pm

Thanks for your reply @vdubc. We are going to look into this right away and try to replicate this on our end. We’ll keep you informed about our findings.

vdubc · November 25, 2019, 11:17pm

Thank you @pawan. I have been trying to reproduce it with benchmarks and our flow, and it seems I’ve found a problem. I’ll try a few more things tomorrow and write up the result. Thank you

pawan · November 26, 2019, 2:55am

If you are able to reproduce it using benchmarks and are able to share the benchmark test then that would be really helpful to us. Thanks!

vdubc · December 2, 2019, 10:49am

Hello @pawan, this week I benchmarked my service in many ways, for 3 hours each, and everything was OK: the latency stayed at up to 500ms and remained constant. Today I ran my service again on the same cluster after cleaning it and waited for errors to appear. The logs with the errors are attached; maybe they help:
dgraph-server.log (201.2 KB)
dgraph-zero.log (114.6 KB)

pawan · December 6, 2019, 7:37am

Thanks @vdubc, one of our engineers @ashishgoswami is looking at the issue right now and would get back to you soon.

vdubc · December 12, 2019, 11:08am

Hello @pawan. I have an update: it seems to be related to AWS volumes and IOPS blocking:

It’s strange that the Dgraph logs show no disk problems. We’ll try local storage and I’ll report back here afterwards.
Thank you for your time

UPD:
@pawan
The situation repeats (fresh cluster of 3 instances, replicas=3, local storage): all instances write at about 7M/s, and after about 40 minutes the write speed falls (to 1M/s), latency increases (from 100ms to 10s), and errors appear (“Read index context timed out”, “Got error: Assigning IDs is only allowed on leader”). The logs are the same.

ashishgoswami · December 16, 2019, 2:54pm

Hey @vdubc, we have tried something similar on our side and are able to reproduce the issues you are seeing. We are working to fix those issues. Will update you on the progress.

vdubc · January 2, 2020, 12:26pm

Hello, @ashishgoswami. Thank you for the update.
Could you give me the issue number or a link to the problem on github.com (if you’ve created one) so that I can track it?
Thanks

ashishgoswami · January 2, 2020, 1:49pm

Hey @vdubc, we have two PRs to address the issue.

https://github.com/dgraph-io/dgraph/pull/4453
https://github.com/dgraph-io/dgraph/pull/4472

We will merge those into master by tomorrow. You can run your workload on master and let us know your findings.

ashishgoswami · January 14, 2020, 8:16am

Hey @vdubc, we have merged both the PRs. We are still trying more optimisations.
In the meantime, can you try running your benchmarks on the master branch?

vdubc · January 22, 2020, 3:04pm

Hey, @ashishgoswami. Thank you for the update. I built master and ran it on fresh instances, but after a few hours of work the latency increased and these errors occurred:

b not forwarding to leader 16 at term 3; dropping proposal
Read index context timed out
Assigning IDs is only allowed on leader.

dgraph-logs.zip (32.2 KB)
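
To see how often each of these messages shows up over a run, the log lines can be tallied with a simple substring match. This is a rough sketch: the exact layout of Dgraph log lines (timestamps, severity prefixes) is not shown in this thread, so the code only assumes that each message appears somewhere in its line. `count_errors` is a hypothetical helper name:

```python
# Substrings taken from the three error lines quoted above.
ERROR_MARKERS = (
    "not forwarding to leader",
    "Read index context timed out",
    "Assigning IDs is only allowed on leader",
)


def count_errors(lines, markers=ERROR_MARKERS):
    """Count how many lines contain each marker substring."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts
```

Passing an open log file works as well, since iterating over a file yields its lines: `count_errors(open("some.log"))`, where the file name is a placeholder.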

ashishgoswami · January 22, 2020, 4:01pm

Hey @vdubc, thanks for getting back to us. We will look into the logs and get back to you.
Also, if possible, please try to run your benchmarks with 1 zero and 3 alphas cluster and let us know your findings.

vdubc · January 24, 2020, 9:15am

Hey, @ashishgoswami.
It has been running for 23 hours without errors (one zero and three alphas). The latency grows much more slowly but is still increasing (already from 40ms to 10s), and the disk write speed has fallen from 4.3 to 1.0 MB/sec.

[Graphs not preserved: RPS to my service, P99, P75]

dgraph-logs.zip (249.3 KB)
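
For anyone comparing their own numbers with P99 and P75 figures like the ones in this post (presumably latency percentiles), here is one common way to compute them from raw samples. It uses the nearest-rank definition, which is an assumption on my part; monitoring systems often interpolate or use histogram buckets instead, so values can differ slightly:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

For example, `percentile(latencies_ms, 99)` and `percentile(latencies_ms, 75)` over the same window, where `latencies_ms` is a hypothetical list of request latencies, show whether only the tail or the bulk of requests is slowing down.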

vdubc · January 27, 2020, 2:27pm

An update:
After 4 days of work there are errors in the logs, and the latency leaves much to be desired:

dgraph-logs.zip (1.9 MB)

chewxy · December 19, 2020, 9:05pm

4 posts were split to a new topic: Increasing Latency in v20.03.3

ibrahim · December 22, 2020, 7:27am

Hey @vdubc, we recently released Dgraph v20.11.0. Can you run the tests on the latest release? We’ve made a bunch of performance improvements in this release.

vdubc · December 23, 2020, 10:58am

Hey @ibrahim. Yes, I can, but I need to restore my whole flow, which takes time. I’ll come back in a few days.
