Throughput Issues with Dgraph ^v1 (currently running the nightly through docker)

I have been working with @pawan to try to understand what is going on with our dev Dgraph cluster. I will describe our setup, then the symptoms of our issues:

Setup
We have a cluster of 3 servers, 1 Dgraph Zero, and 1 replica running in GCE (all 3 machines are n1-standard-4s). Machine 1 hosts Dgraph Zero as well as the first server.

WE USE NO @upsert directives in our schema

Symptoms

  • When we start off, things seem to be fine
  • As we start to get more and more items coming through our event stream into the graph, we start to see a lot of these errors: “rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.”
  • We retry these over time and they seem to eventually go through, but we get these errors very, very often (at least a few per second); see the retry sketch after this list
  • Then we start to see that the machines seem to lock up for writes: “rpc error: code = DeadlineExceeded desc = context deadline exceeded”
  • Then we notice that queries even start to take longer, and also start to time out
  • CPU seems fine on each machine at this time, likely lower than when it's processing correctly
  • After a great deal of time (15-30+ minutes), things seem to start moving again and CPU goes back up (like it's actually doing something)
  • This happens pretty consistently every time I run data
  • Each machine has 15G of memory and is only using around 6, so today we don't see the OOMs we saw yesterday. This is a much more expensive setup than our v0.8.3 cluster, which was completely fine and never OOMed, yet up until today (the 29th) we were getting OOMs every day despite the better hardware.
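
Since we lean on retries so much, here is roughly the shape of the retry loop, written as a minimal, self-contained sketch. It assumes the Go client (dgo); the endpoint, payload, attempt count, and backoff are illustrative rather than our exact code:

```go
package main

import (
	"context"
	"log"
	"strings"
	"time"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

// isRetryable reports whether a mutation failed with one of the two transient
// errors seen above: the commit-time abort, or the mutate-time conflict with a
// pending transaction (which arrives with code = Unknown, so we match on the
// message).
func isRetryable(err error) bool {
	if err == dgo.ErrAborted {
		return true
	}
	return strings.Contains(err.Error(), "Conflicts with pending transaction")
}

// mutateWithRetry commits one JSON mutation, retrying with a small backoff
// when Dgraph rejects it because of a conflicting concurrent write.
func mutateWithRetry(ctx context.Context, dg *dgo.Dgraph, setJSON []byte, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		txn := dg.NewTxn()
		_, err = txn.Mutate(ctx, &api.Mutation{SetJson: setJSON, CommitNow: true})
		_ = txn.Discard(ctx) // best-effort cleanup
		if err == nil || !isRetryable(err) {
			return err
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
	}
	return err
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	payload := []byte(`{"PROFILE_type": 1}`)
	if err := mutateWithRetry(context.Background(), dg, payload, 5); err != nil {
		log.Fatalf("mutation failed after retries: %v", err)
	}
}
```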

Below is a link to a folder in Google Drive with the profiles of each machine at a time when we start to see timeouts. There is also a file with the logs from each machine and Dgraph Zero from TODAY (the 29th).

I also added the profiles from yesterday’s run when we started to see the system really degrade, but not logs.

https://drive.google.com/open?id=1pj6W2rdnbekQgE-zUuc6tQxBefVD7lSi

@pawan – Can you look into this on Monday?

Having a look at this now. I don’t see anything suspicious in the logs. @willcj33 can you share the output of dgraph version? I need to have the correct commit checked out locally to make any sense of the logs. The profiles don’t seem to make sense with the current master or v1.0.4. Were these profiles taken using the instructions mentioned at Get started with Dgraph?

Do you have a timeout on the client for mutations?

Also, we have been fixing issues rapidly, so I’d suggest giving the current nightly a try.

@pawan
The version I had when running this was:

Dgraph version   : v1.0.4-dev
Commit SHA-1     : 8f9eff32
Commit timestamp : 2018-03-26 14:54:41 +1100
Branch           : HEAD

The profile was taken per machine like so: {server}/debug/pprof/profile through a port forward to the server in my browser. It’s the same way I have always done it and it triggered the profile download. Now there are two sets in there, one from the day I made this ticket, and one set from the previous day. They should be labeled accordingly.
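
For completeness, the same profile can be grabbed without a browser. A minimal sketch, assuming the default HTTP port 8080 behind the port forward and a 30-second CPU profile:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Fetch a 30-second CPU profile from a Dgraph server's pprof endpoint and
// write it to a file; this is the same data the browser download produces.
func main() {
	resp, err := http.Get("http://localhost:8080/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("cpu-profile.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("profile written to cpu-profile.pb.gz")
}
```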

Do you have a timeout on the client for mutations? → Yes, like 10-15 minutes.
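
For context, that timeout is just a deadline on the context passed to each mutation. A minimal sketch of the idea, assuming the Go client (dgo); the function name is illustrative and the 15-minute value mirrors what is described above:

```go
package ingest

import (
	"context"
	"time"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
)

// mutateWithTimeout runs a single mutation under a client-side deadline.
// If Dgraph does not respond within 15 minutes, the call fails with
// "context deadline exceeded", matching the errors quoted earlier.
func mutateWithTimeout(dg *dgo.Dgraph, payload []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
	defer cancel()

	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	if _, err := txn.Mutate(ctx, &api.Mutation{SetJson: payload}); err != nil {
		return err
	}
	return txn.Commit(ctx)
}
```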

I plan to rerun the data today with the latest version again, but in the 4 times I’ve pulled master over the last two weeks, I have seen no improvements (except for the schema fix you made for me).

@pawan : Reran with the newest master today:

Dgraph version   : v1.0.4-dev
Commit SHA-1     : 774fe461
Commit timestamp : 2018-04-03 14:36:12 +1000
Branch           : HEAD

It seems that things were a bit better. We mainly got the “Transaction has been aborted. Please retry.” error on commit, but since we retry those, it’s less of a concern. However, I still don’t get why this is happening so often when I am not using @upsert. I have one theory, but I’d like some direction on this. Also, even though I think things pretty much worked all the way through, it took far too long again. We started off really strong, but then messages in our pub/sub started stacking up and getting acked much less often. This COULD be due to the error above, I suppose, if Dgraph was rejecting many calls over and over, or it could be something else; I’m not sure. I am adding our 3 profiles from today, taken when things were processing a bit slower, at the link below. There were no restarts as far as I could see and nothing stood out too much in the server logs.

https://drive.google.com/open?id=1GmDHPrcf8OGZDvv4rcq7YzP6uVlo-ErP

I am not able to make much sense of the profile. That could be because I am using the pprof tool that comes with Go v1.9.2. Can you try upgrading to this Go version and taking a profile again, please? Aborts can happen even if you are not doing upserts; they occur whenever concurrent transactions try to modify the same S-P (subject-predicate) combinations.
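
To make that concrete, here is a small sketch of two transactions touching the same S-P pair, written against the Go client with an illustrative uid and predicate. Depending on timing and version, the losing transaction fails either at mutate time (“Conflicts with pending transaction. Please abort.”) or at commit time (“Transaction has been aborted. Please retry.”):

```go
package conflictdemo

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
)

// demoConflict opens two transactions that both write to the same
// subject-predicate pair (<0x1>, PROFILE_type). Only one of them can win;
// the other is rejected and has to be retried by the client, even though no
// @upsert directive is involved.
func demoConflict(ctx context.Context, dg *dgo.Dgraph) {
	txnA := dg.NewTxn()
	txnB := dg.NewTxn()
	defer txnA.Discard(ctx)
	defer txnB.Discard(ctx)

	nquad := []byte(`<0x1> <PROFILE_type> "1" .`)

	_, errA := txnA.Mutate(ctx, &api.Mutation{SetNquads: nquad})
	_, errB := txnB.Mutate(ctx, &api.Mutation{SetNquads: nquad})
	fmt.Println("mutate A:", errA, "mutate B:", errB)

	fmt.Println("commit A:", txnA.Commit(ctx))
	fmt.Println("commit B:", txnB.Commit(ctx)) // at most one commit succeeds
}
```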

Hey @pawan, I got most of that figured out on my own, but now I am getting a cannot allocate memory error on the node that Dgraph Zero is running on, even though I’ve set lru_mb to 7168 on a 15G machine. Again, I run both Dgraph Zero and a Dgraph server on this machine; it’s the only one that has had this issue. Below is the link with the error in it. One change I made was to set the replica count to 3, with 3 machines.

https://drive.google.com/open?id=1egCkQX_JEiy1YoiDviHNAZIRUrSd9mE8

Also, with replica = 3 and 3 servers, I get A LOT more of this in the Dgraph logs → 21:33:15 node.go:344: Error while sending message to node with addr: dev-dgraph-master-02:7080, err: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Also, with that same setup, the following query gives the following error:

```
{
  start(func: eq(PROFILE_type, 1)) {
    PROFILE_type
  }
}
```


strconv.ParseInt: parsing "": invalid syntax


And this edge is indexed as an int.

Hey @willcj33

After dgraph-io/dgraph@98c6102 (“Change default to memorymap and update recommendation to one-third RA…”), the memory used by Dgraph should decrease. It is also recommended to set lru_mb to one-third of your total memory (roughly 5120 MB on a 15 GB machine) to avoid going OOM. This change is already part of master.

The timeout for sending messages was 1 sec earlier but has been increased in the nightly so this shouldn’t happen anymore.

Can you try quoting 1 with double-quotes?
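
If you are querying through the Go client, a minimal sketch of what I mean, with the value quoted so the int index can parse it (the wrapper function and names are just illustrative):

```go
package queries

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo"
)

// profileQuery is the query from above, with the eq() value quoted as "1".
const profileQuery = `{
  start(func: eq(PROFILE_type, "1")) {
    PROFILE_type
  }
}`

// runProfileQuery executes the query and prints the JSON response.
func runProfileQuery(ctx context.Context, dg *dgo.Dgraph) error {
	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	resp, err := txn.Query(ctx, profileQuery)
	if err != nil {
		return err
	}
	fmt.Println(string(resp.Json))
	return nil
}
```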
