Increasing latency

(Valerii S ) #1

We have:

1. dgraph 1.1.0:

Built at 2019-10-28T17:50:33.873Z
Commit: c8e58bf
Commit Info: c8e58bf Mon Oct 28 19:50:19 2019 +0200 (HEAD -> master, origin/master, origin/HEAD)

2. 5 instances x 4 cores, 32 GB each (AWS cloud).
3. dgo client (v2)
4. schema:

<friend_block>: [uid] @reverse .
<friends>: [uid] @reverse .
<in_room>: [uid] @count @reverse .
<logged_at>: datetime @index(hour) .
<user_id>: int @index(int) @upsert .

type User {
    user_id: int
    logged_at: datetime
    friend_block: [uid]
    friends: [uid]
    in_room: [uid]
}

5. Two main endpoints, with queries and latency graphs for the first hour:
5.1: creating a user with edges (latency starts at about 30 ms, as expected)

# get

# create
_:vtx <dgraph.type> "User" .
_:vtx <logged_at> "2019-11-05T14:14:05+02:00"^^<xs:dateTime> .
_:vtx <user_id> "1"^^<xs:int> .

# get friends
{get(func:eq(user_id,1, 2)){uid,user_id}}

# relate
<0x9c41> <friend_block> <0x9c42> .
<0x9c41> <friends> <0x9c42> .

5.2: deleting user: <0x9c41> * * .

6. Metrics:

7. Logs of server and zero:
s.log (93.0 KB)
z.log (18.0 KB)

8. What are we doing wrong? After several hours the memory runs out (tested on 16), the CPU cores go to 100%, and errors like this appear:
Assigning IDs is only allowed on leader.
together with long delays on sync, disk or proposals. Inside transactions, the data is written almost without errors.

(Pawan Rawal) #2

Hey @vdubc

Thanks for sharing a detailed post about the problem that you are facing. Could you share some more details about your usage of Dgraph?

  1. I am assuming you are using AWS instances with 32GB RAM for each instance? Is that correct?

  2. How many nodes do you have in your graph? Does this problem happen on a fresh cluster or a cluster which already has some data?

  3. How many groups does your Dgraph cluster have? Are the 5 alpha servers part of the same group or different groups? What is your replication factor?

I don’t see anything wrong in the logs. If you can share a way for us to replicate this, we can investigate it more easily. I am also happy to get on a call with you to understand the problem. Feel free to drop me a mail at PAWAN AT DGRAPH.IO

(Valerii S ) #3

Hello, @pawan. Thanks for your attention, and sorry for the delay in answering:

  1. Yes, that’s correct.
  2. We have zero, alpha and ratel nodes on all five instances. This happens on a fresh cluster.
  3. They are in the same group, the replication factor is 5.
    I would like to share a way to replicate this, but I can’t: it is a real data flow that is difficult to imitate.
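
As background for this setup: with all five Alphas in one group and a replication factor of 5, the group forms a single Raft group, so every write is replicated to all five nodes and has to be acknowledged by a majority of them. A small Go sketch of that arithmetic (the function name quorum is hypothetical, chosen for illustration):

```go
package main

import "fmt"

// quorum returns the majority size of a Raft group with the given
// number of voting replicas.
func quorum(replicas int) int {
	return replicas/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		q := quorum(n)
		fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n", n, q, n-q)
	}
}
```

For five replicas this gives a quorum of three, so the group keeps accepting writes with up to two nodes down.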

(Pawan Rawal) #4

Thanks for your reply @vdubc. We are going to look into this right away and try to replicate this on our end. We’ll keep you informed about our findings.

(Valerii S ) #5

Thank you @pawan. I was trying to reproduce it with benchmarks and our flow, and it seems I’ve found a problem. I’ll try a few things tomorrow and write about the result. Thank you

(Pawan Rawal) #6

If you are able to reproduce it using benchmarks and are able to share the benchmark test then that would be really helpful to us. Thanks!

(Valerii S ) #7

Hello @pawan, this week I benchmarked my service in many ways, for 3 hours each, and everything was OK: the latency stayed constant, at up to 500 ms. Today I ran my service again on the same cleaned cluster and waited for the errors to appear. The logs with the errors are attached; maybe they help:
dgraph-server.log (201.2 KB)
dgraph-zero.log (114.6 KB)

(Pawan Rawal) #8

Thanks @vdubc, one of our engineers, @ashishgoswami, is looking at the issue right now and will get back to you soon.