Hi,
Our Dgraph Alpha nodes are getting OOMKilled by Kubernetes as RAM consumption crosses our configured limits. We are trying to find out why this is happening and are looking for guidance. Here is a summary of our setup:
Dgraph runs in HA mode on i3.2xlarge instances (8 cores / 64GB), and data is stored on local disk (SSDs). We are still on Dgraph 20.11.
We notice that there is a huge disparity between the RAM usage as seen by the OS/kubelet and what is actually in use by the Go process. The Prometheus metrics show usage for each Alpha pod between 1GB and 10GB, while the kubelet accounts for 50GB to 60GB.
Taking a couple of heap profiles shows us that in_use memory for the Alpha process is actually close to what’s reported by Prometheus.
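For reference, this is roughly how we grab the profiles (a sketch; it assumes Alpha’s default HTTP port 8080 is reachable from where the command runs, adjust host/port for your setup):

# interactive in_use heap profile from Alpha's pprof endpoint
go tool pprof -inuse_space http://localhost:8080/debug/pprof/heap
# or just save the raw profile to inspect later
curl -s http://localhost:8080/debug/pprof/heap -o alpha-heap.pb.gz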
My guess is that the difference comes from Badger memory-mapping its data files.
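A rough way to check that from inside a pod (a sketch, assuming the dgraph process is PID 1 in the container, a reasonably recent kernel, and cgroup v1 on the node):

# split resident memory into anonymous vs file-backed (mmap'd SSTs show up as RssFile)
grep -E 'VmRSS|RssAnon|RssFile|RssShmem' /proc/1/status
# memory charged to the container's cgroup, including page cache, which is what the kubelet accounts for
grep -E '^(cache|rss|mapped_file) ' /sys/fs/cgroup/memory/memory.stat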
Looking inside the Alpha pods, I checked the size of the Dgraph directories and got back:
dgraph-alpha-0:/dgraph# du -sh *
71G p
643M t
11M w
dgraph-alpha-1:/dgraph# du -sh *
191G p
617M t
2.1M w
dgraph-alpha-2:/dgraph# du -sh *
74G p
801M t
3.0M w
So the p directory is quite large. I don’t know exactly why data is unbalanced on alpha-1, or whether that’s an issue. I also don’t know whether Dgraph attempts to map the entire p directory into memory and is crashing for that reason.
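One thing we could check (a sketch; it assumes Zero’s HTTP port 6080 is reachable from inside the cluster) is Zero’s /state output, which lists the group membership of each Alpha and the tablets per group with their sizes:

# group membership and per-group tablet sizes as reported by Zero
curl -s http://localhost:6080/state | jq '.groups'

If all three Alphas turn out to be replicas of the same group, their p directories should hold the same data, so the size gap would presumably come from the Badger side rather than from sharding.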
Alpha logs don’t seem to show any meaningful error.
We see things like:
No longer the leader of group 1. Exiting
Error occured while aborting transaction: rpc error: code = Canceled desc = context canceled
Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": while receiving from stream: rpc error: code = Unavailable desc = transport is closing. closer err: <nil>
around restarts, which I guess is expected since we lose connections (we also use GraphQL subscriptions, so it might also arise when we stop a subscription).
We also see a couple of warnings like:
Raft.Ready took too long to process: Timer Total: 531ms. Breakdown: [{disk 382ms} {advance 149ms} {proposals 0s}] Num entries: 0. MustSync: false
unable to write CID to file open : no such file or directory
No membership update for 10s. Closing connection to Zero.
Another thing we noted is a spike in the Pending Proposals metric.
Any ideas on how to debug this further, or on what the root cause could be?
PS: I’ve attached our dashboard.
I can also send the captured heap profiles if that’s useful (I can’t upload here as a new user).