I have been working with @pawan to try to understand what is going on with our dev dgraph cluster. I will describe our setup, then describe the symptoms of our issues:
Setup
We have a cluster of 3 servers, 1 dgraphzero, 1 replica running in GCE (all 3 machines are n1-standard-4’s). Machine 1 hosts dgraph zero as well as the first server.
WE USE NO @upsert directives in our schema
Symptoms
When we start off, things seem to be fine
As we start to get more and more items coming through our event stream, into the graph, we start to see a lot of these errors: “rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.”
We retry these over time and they seem to eventually go through, but we get these errors very, very often (at least a few per second)
Then we start to see that the machines seem to lock up for writes: “rpc error: code = DeadlineExceeded desc = context deadline exceeded”
Then we notice that queries even start to take longer, and also start to time out
CPU seems fine on each machine at this time, likely lower than when it's processing correctly
After a great deal of time (15 to 30+ minutes) things seem to start to move again and CPU goes back up (like it's actually doing something)
This happens pretty consistently every time I run data
With 15G of memory and only about 6G in use, we no longer see the OOMs we saw yesterday. This cluster is a much more expensive setup than our v0.8.3 cluster had, and that one was completely fine. Up until today (29th) we were getting OOMs every day on a better setup than we used to have (and we never got OOMs on the old one).
Below is a link to a folder in google drive with the profiles of each machine at a time when we start to see timeouts. Also there is a file for the logs from each machine and dgraph zero TODAY (29th).
I also added the profiles from yesterday’s run when we started to see the system really degrade, but not logs.
Having a look at this now. I don’t see anything suspicious in the logs. @willcj33 can you share the output of dgraph version? I need to have the correct commit checked out locally to make any sense of the logs. The profiles don’t seem to make sense with the current master or v1.0.4. Were these profiles taken using the instructions mentioned at Get started with Dgraph?
Do you have a timeout on the client for mutations?
Also, we have been fixing issues rapidly, so I’d suggest giving the current nightly a try.
Dgraph version : v1.0.4-dev
Commit SHA-1 : 8f9eff32
Commit timestamp : 2018-03-26 14:54:41 +1100
Branch : HEAD
The profile was taken per machine like so: {server}/debug/pprof/profile through a port forward to the server in my browser. It’s the same way I have always done it and it triggered the profile download. Now there are two sets in there, one from the day I made this ticket, and one set from the previous day. They should be labeled accordingly.
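For anyone who wants to script the same download instead of using a browser, here is a minimal Go sketch. It assumes the server exposes the standard Go net/http/pprof handlers at the path quoted above; the base URL (a local port forward on port 8080), the output file name, and the helper names profileURL and fetchProfile are made up for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// profileURL builds the CPU profile URL for a process exposing net/http/pprof.
// The "seconds" query parameter sets how long the profile is sampled.
func profileURL(base string, seconds int) string {
	return fmt.Sprintf("%s/debug/pprof/profile?seconds=%d", strings.TrimRight(base, "/"), seconds)
}

// fetchProfile downloads the profile and writes it to dst.
// It returns the number of bytes written.
func fetchProfile(base string, seconds int, dst io.Writer) (int64, error) {
	// The request blocks while the profile is sampled, so allow extra time.
	client := &http.Client{Timeout: time.Duration(seconds+30) * time.Second}
	resp, err := client.Get(profileURL(base, seconds))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.Copy(dst, resp.Body)
}

func main() {
	if len(os.Args) < 2 {
		// Hypothetical example address: a port forward to one server.
		fmt.Println("usage: fetchprofile http://localhost:8080")
		return
	}
	f, err := os.Create("cpu.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	n, err := fetchProfile(os.Args[1], 30, f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to cpu.pprof\n", n)
}
```

The resulting file can be opened with go tool pprof, ideally together with the exact binary that produced it so that symbols resolve.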
Do you have a timeout on the client for mutations? → Yes, like 10-15 minutes.
I plan to rerun data today with the latest version again, but the past 4 times I’ve pulled master over the last two weeks, I have seen no improvements (except for the schema fix you made for me).
Dgraph version : v1.0.4-dev
Commit SHA-1 : 774fe461
Commit timestamp : 2018-04-03 14:36:12 +1000
Branch : HEAD
It seems that things were a bit better. We mainly got the “Transaction has been aborted. Please retry.” error on commit, but since we retry, it is less of a concern. However, I still don’t get why this is happening so often when I am not using @upsert. I have one theory, but I’d like some direction on this. Also, even though I think things pretty much worked all the way through, it took far too long again. We started off really strong, but then messages in our pub/sub started stacking up and were acked much less often. This COULD be due to the above error, I suppose, if dgraph was rejecting many calls over and over, or it could be something else, not sure. I am adding our 3 profiles from today, taken when things were processing a bit slower, in the link below. There were no restarts as far as I could see and nothing that stood out too much in the server logs.
I am not able to make much sense of the profile. Could be because I am using the pprof tool which comes with Go v1.9.2. Can you try upgrading to this Go version and take a profile again, please? Aborts can happen even if you are not doing upserts. They will occur whenever concurrent transactions try to modify the same S P (subject, predicate) combinations.
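To make the point about S P conflicts concrete, here is a toy model in Go of first-committer-wins conflict detection. It is only an illustration of the general idea and not Dgraph's actual implementation; the oracle, txn, and key types and the sample UIDs and predicates are all invented for this sketch.

```go
package main

import "fmt"

// key identifies one subject-predicate (S P) pair.
type key struct{ subject, predicate string }

// txn records its start timestamp and the keys it has written.
type txn struct {
	startTs uint64
	writes  []key
}

// oracle hands out timestamps and remembers the commit timestamp
// of the most recent write to each key.
type oracle struct {
	nextTs     uint64
	lastCommit map[key]uint64
}

func newOracle() *oracle {
	return &oracle{nextTs: 1, lastCommit: map[key]uint64{}}
}

func (o *oracle) begin() *txn {
	t := &txn{startTs: o.nextTs}
	o.nextTs++
	return t
}

func (t *txn) set(subject, predicate string) {
	t.writes = append(t.writes, key{subject, predicate})
}

// commit returns false (abort) if any key written by t was committed
// by another transaction after t started.
func (o *oracle) commit(t *txn) bool {
	for _, k := range t.writes {
		if o.lastCommit[k] > t.startTs {
			return false
		}
	}
	ts := o.nextTs
	o.nextTs++
	for _, k := range t.writes {
		o.lastCommit[k] = ts
	}
	return true
}

func main() {
	o := newOracle()

	// Two concurrent transactions touch the same S P pair; no upsert involved.
	a, b := o.begin(), o.begin()
	a.set("0x1", "event_count")
	b.set("0x1", "event_count")
	fmt.Println("a committed:", o.commit(a))
	fmt.Println("b committed:", o.commit(b))

	// A transaction started after a's commit can write the same pair.
	c := o.begin()
	c.set("0x1", "event_count")
	fmt.Println("c committed:", o.commit(c))
}
```

In this model, an event stream where many workers update the same node and predicate at once would abort frequently, while writes spread across different S P pairs would not conflict.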
Hey @pawan, I got most of that figured out on my own, but now I am getting a “cannot allocate memory” error on the node that dgraph zero is running on, even though I’ve set lru_mb to 7168 on a 15G machine. Again, I run both dgraph zero and a dgraph server on this machine. It’s the only one that has had this issue. Below is the link with the error in it. One change I made was to set replicas to 3, with 3 machines.
Also, with replica = 3 and 3 servers, I get a lot more of this in the dgraph logs → 21:33:15 node.go:344: Error while sending message to node with addr: dev-dgraph-master-02:7080, err: rpc error: code = DeadlineExceeded desc = context deadline exceeded