I’m not sure where to even start. I recently came into a job where Dgraph (20.07) has been running in production as the backing store for their authentication service for some time now.
As the data has grown (~1TB now), the service has been impacted pretty significantly.
Those impacts include:
- K8s alpha/zero pods restarting every other day
- Major CPU Spikes causing complete downtime
- Write performance degradation
It seems that while writes are happening, the entire DB slows down and reads become very slow. During brief periods with no writes, reads speed back up and are relatively fast again.
My question for the community: do you have any guidance on how I could go about optimizing this?
Solutions we’ve attempted so far:
- Increasing RAM from 32GB to 64GB (did not seem to help)
Some of the issues we’ve been experiencing have been noted by others in the community in the past:
- Raft.Ready took too long to process:
- Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
- High memory utilization on alpha node (use of memory cache)
- Context Exceeded recurring again
Some things I have noticed:
- When CPU spikes to 100% utilization, I see these logs:
I1027 08:30:30.709017 17 groups.go:900] Got Zero leader: dgraph-zero-2.dgraph-zero.dgraph-prod.svc.cluster.local:5080
W1027 08:30:30.749268 17 draft.go:1245] Raft.Ready took too long to process: Timer Total: 703ms. Breakdown: [{disk 703ms} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I1027 08:30:33.621398 19 groups.go:956] Zero leadership changed. Renewing oracle delta stream.
E1027 08:30:33.621520 19 groups.go:932] Error in oracle delta stream. Error: rpc error: code = Canceled desc = context canceled
W1027 08:30:36.046003 19 draft.go:1245] Raft.Ready took too long to process: Timer Total: 11.395s. Breakdown: [{disk 11.395s} {proposals 0s} {advance 0s}] Num entries: 0. MustSync: false
I1027 08:30:38.626573 19 log.go:34] 2 is starting a new election at term 152
- The “Raft.Ready took too long to process” log occurs very frequently (a quick disk-latency check is sketched below).
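Since both Raft.Ready breakdowns above attribute essentially all of the time to {disk ...}, one sanity check I’ve been considering is measuring write+fsync latency directly on an alpha’s data volume while the spikes are happening. A minimal sketch in Go (the mount path and iteration count are assumptions, not our actual setup):

```go
// Minimal write+fsync latency probe. Run from a debug pod that mounts the same
// EBS volume as an alpha; the path below is an assumption, adjust to your mount.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	path := "/dgraph/fsync-probe.tmp" // hypothetical path on the alpha's data volume
	buf := make([]byte, 4096)

	f, err := os.Create(path)
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)
	defer f.Close()

	var worst time.Duration
	for i := 0; i < 100; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		// Sync forces an fsync, which is the same operation the Raft WAL waits on.
		if err := f.Sync(); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	fmt.Printf("worst write+fsync latency over 100 iterations: %s\n", worst)
}
```

If worst-case latency climbs into the hundreds of milliseconds during a spike, the disk attribution in those logs is real even with burst balance to spare.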
A solution we are considering is obtaining an Enterprise license and using Learner Nodes as read-only replicas, so that users don’t feel the write-induced performance degradation. Does anyone have input on this?
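For context on what the read path would look like: regardless of learner nodes, the idea is to serve auth lookups through read-only, best-effort queries so they don’t block behind writes. A rough sketch with the Go client (dgo v200 is an assumption on my part; the endpoint and query below are placeholders):

```go
// Sketch: routing read-only auth lookups through a read-only, best-effort
// transaction, which avoids a commit and lets the alpha serve the read at a
// timestamp it already knows about instead of asking Zero for a new one.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("dgraph-alpha.dgraph-prod.svc.cluster.local:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	txn := dg.NewReadOnlyTxn().BestEffort()
	defer txn.Discard(context.Background())

	const q = `{ user(func: eq(email, "user@example.com")) { uid } }` // hypothetical query
	resp, err := txn.Query(context.Background(), q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}
```

My understanding is that learner nodes are designed to serve exactly this kind of best-effort read locally, which is why they look attractive for our read-heavy auth traffic.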
Current Deployment stack is the following:
- Amazon EKS Kubernetes cluster using the HA Cluster setup.
- 3 Dgraph alpha pods in 1 Alpha Group
- 3 Dgraph zero pods in 1 Zero Group
- 970Gi Storage using AWS EBS Volume GP2 for alpha pods
- 170Gi Storage using AWS EBS Volume GP2 for zero pods
We’ve explored switching to Provisioned IOPS SSD EBS volumes (io1/io2); however, the gp2 volumes we are using show almost 100% burst balance, so our assumption is that disk is not where the bottleneck lies.
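For reference, burst balance can also be tracked programmatically by pulling the minimum BurstBalance datapoints from CloudWatch; a sketch with aws-sdk-go v1 (region and volume ID are placeholders):

```go
// Sketch: pull the minimum EBS BurstBalance over the last 24 hours for one volume.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	cw := cloudwatch.New(sess)

	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/EBS"),
		MetricName: aws.String("BurstBalance"),
		Dimensions: []*cloudwatch.Dimension{
			{Name: aws.String("VolumeId"), Value: aws.String("vol-0123456789abcdef0")}, // hypothetical volume ID
		},
		StartTime:  aws.Time(time.Now().Add(-24 * time.Hour)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(300),
		Statistics: []*string{aws.String("Minimum")},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, dp := range out.Datapoints {
		fmt.Printf("%s min burst balance: %.1f%%\n", dp.Timestamp.Format(time.RFC3339), *dp.Minimum)
	}
}
```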