I have been working with @pawan to try to understand what is going on with our dev dgraph cluster. I will describe our setup, then describe the symptoms of our issues:
We have a cluster of 3 servers, 1 dgraphzero, 1 replica running in GCE (all 3 machines are n1-standard-4’s). Machine 1 hosts dgraph zero as well as the first server.
WE USE NO @upsert directives in our schema
- When we start off, things seem to be fine
- As we start to get more and more items coming through our event stream, into the graph, we start to see a lot of these errors: “rpc error: code = Unknown desc = Conflicts with pending transaction. Please abort.”
- We retry these over time and they seem to go through eventually, but we get these errors very, very often (at least a few per second)
- Then we start to see that the machines seem to lock up for writes: “rpc error: code = DeadlineExceeded desc = context deadline exceeded”
- Then we notice that queries even start to take longer, and also start to time out
- CPU seems fine on each machine at this time, likely lower than when it's processing correctly
- After a great deal of time (15 to 30+ minutes), things seem to start to move again and CPU goes back up (like it's actually doing something)
- This happens pretty consistently every time I run data
- With 15G of memory and only about 6G in use, we don’t see the OOMs we saw yesterday. The current setup is much more expensive than the one our v0.8.3 cluster had, and that cluster was completely fine. Up until today (29th) we were getting OOMs every day with a better setup than we used to have (and we never got OOMs on the old one).
Below is a link to a folder in Google Drive with the profiles of each machine at a time when we start to see timeouts. There is also a file with the logs from each machine and dgraph zero from TODAY (29th).
I also added the profiles from yesterday’s run, taken when we started to see the system really degrade, but not the logs.