Dgraph Alpha Eating Up All RAM

Hi,

Our Dgraph Alpha nodes are getting OOMKilled by Kubernetes as RAM consumption crosses our configured limits. We are trying to find out why this is happening and are looking for guidance. Here is a summary of our setup:

Dgraph runs in HA mode on i3.2xlarge instances (8 cores / 64GB) and data is stored on local SSDs. We are still on Dgraph 20.11.

We notice that there is a huge disparity between the RAM usage as seen by the OS/kubelet and what is actually in use by the Go process.

The reported Prometheus metrics show usage for each Alpha pod between 1GB and 10GB, while the kubelet accounts for 50GB to 60GB.

Taking a couple of heap profiles shows us that in-use memory for the Alpha process is actually close to what’s reported by Prometheus.
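(For reference, this is roughly how we grabbed them — a sketch assuming the default Alpha HTTP port, 8080, and that the pod name is dgraph-alpha-0:)

kubectl port-forward dgraph-alpha-0 8080:8080 &
go tool pprof -top http://localhost:8080/debug/pprof/heap       # summary of in-use heap
curl -s -o alpha-0.heap http://localhost:8080/debug/pprof/heap  # or save the raw profile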

My guess is that the difference comes from Badger memory-mapping data.
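One way I can think of to sanity-check that guess (a sketch to run inside an Alpha pod, assuming cgroup v1 and a kernel recent enough to have smaps_rollup) is to compare the process’s anonymous vs file-backed resident memory with the cgroup figures the kubelet derives its working-set number from:

ALPHA_PID=$(pgrep -f 'dgraph alpha' | head -n1)
# anonymous vs file-backed resident memory of the Alpha process
grep -E '^(Rss|Anonymous|Shared_Clean|Private_Clean)' /proc/$ALPHA_PID/smaps_rollup
# the kubelet's working set is roughly usage_in_bytes minus total_inactive_file
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
grep -E '^total_(rss|cache|mapped_file|inactive_file)' /sys/fs/cgroup/memory/memory.stat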

When looking inside the Alpha pods, I checked the size of the dgraph directories and got back:

dgraph-alpha-0:/dgraph# du -sh *
71G	p
643M	t
11M	w

dgraph-alpha-1:/dgraph# du -sh *
191G	p
617M	t
2.1M	w

dgraph-alpha-2:/dgraph# du -sh *
74G	p
801M	t
3.0M	w

So the p directory is quite large.

I don’t know exactly why the data is unbalanced on alpha-1, or whether that’s an issue. I also don’t know whether Dgraph attempts to map the entire p directory into memory and crashes for that reason.

Alpha logs don’t seem to show any meaningful error.

We see things like

No longer the leader of group 1. Exiting

Error occured while aborting transaction: rpc error: code = Canceled desc = context canceled

Error during SubscribeForUpdates for prefix "\x00\x00\x15dgraph.graphql.schema\x00": while receiving from stream: rpc error: code = Unavailable desc = transport is closing. closer err: <nil>

around restarts, which I guess is expected as we lose connections (we also use GraphQL subscriptions, so it might also arise when we stop a subscription).

A couple of warnings like

Raft.Ready took too long to process: Timer Total: 531ms. Breakdown: [{disk 382ms} {advance 149ms} {proposals 0s}] Num entries: 0. MustSync: false
unable to write CID to file  open : no such file or directory
No membership update for 10s. Closing connection to Zero.

Another thing we noted is a spike in the Pending Proposals metric.

Any idea on how to debug this further, or what the root cause could be?

PS: Attached our dashboard.

I can also send the captured heap profiles if that’s useful (I can’t upload here as a new user).

Nope, Dgraph doesn’t push the whole dataset into RAM. It does mmap some data into RAM, but that part is small.

That is important.

What is the size of it?

Weird, are you sure there isn’t anything else running at the same time?

Yes, storing graphs requires some space. Also, there is some tmp data that will be wiped out soon.

That Alpha (the leader) might be dead.

Looks like you are doing a heavy load. During a load, Dgraph expands the dataset a bit in RAM, and that should be cleaned up over time by jemalloc.

To avoid this, I would recommend balancing the load between Alphas, so that memory handling works evenly across the cluster.

There were some issues related to RAM, and some of them have been addressed, so updating is a good thing to do.

1.8TB

Yes, we are using tolerations and pod affinity to ensure no other process is running on the nodes besides Dgraph (and a couple of small k8s sidecars like the kubelet and some monitoring agents).

likely OOMKilled :slight_smile:

How can I monitor that process besides looking at memory usage metrics? Can we trigger that cleanup manually?

What do you mean by balancing exactly? Spinning up more replicas (we have 3 right now), or maybe using sharding? Right now we connect to Dgraph over the GraphQL interface and we have a k8s service in front of the HTTP endpoint, so the load should already be split across the 3 Alphas.

Here is one

There are jemalloc logs in the instances.

Not sure, I never saw such an option, neither in the docs nor in the code.

If just the leader died due to OOM, that means the whole load is on its back. I mean, on a single instance.

Nope, each transaction should go to each Alpha in a round-robin (or better) balancing manner. Sharding you are already doing: each new Alpha without replicas is a shard/group.

You say that you have 3 replicas with 3 Alphas. Hmm, so you have no shards. In my view, having 3 replicas is quite resource-consuming, because any incoming data will be immediately replicated to the other instances. In that case, I would put the replicas on separate machines, because you are going to stress out the machine. That’s my opinion. I would keep only shards on the same machine.

Sharding is really good for performance. You should use it.

Well, in the case of replicas, the data is not split, it is replicated, so every transaction is copied to the other 2 Alphas. But still, a single Alpha can end up overloaded with tasks if you don’t balance the transactions between them.

@dmai can you take a look?

Our setup is 3 Alphas running on dedicated i3.2xlarge EC2 instances. All Alphas are part of the same group, as we don’t use sharding. To my understanding, that means the entire dataset we hold is replicated on all 3 Alpha nodes, right?
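(To double-check the topology on our side — a sketch, assuming the default Zero HTTP port 6080, the usual StatefulSet pod name, and curl being available in the image; jq runs locally:)

kubectl exec dgraph-zero-0 -- curl -s localhost:6080/state | jq '.groups | keys'
kubectl exec dgraph-zero-0 -- curl -s localhost:6080/state | jq '.groups."1".members'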

Our k8s service is, in theory, balancing requests across all 3 Alpha nodes.
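(If it helps, this is roughly how we could verify the split per pod — a sketch assuming the default Alpha HTTP port 8080, curl in the image, and that dgraph_num_queries_total is the right request counter on 20.11:)

for pod in dgraph-alpha-0 dgraph-alpha-1 dgraph-alpha-2; do
  echo "== $pod"
  kubectl exec "$pod" -- curl -s localhost:8080/metrics | grep '^dgraph_num_queries_total'
done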

So, any ideas from the heap profile?

@lminaudier The heap profile you shared says that 2452.59 MB of memory was currently in-use. That’s well below the 64 GB available on the i3.2xlarge instances you’re running Dgraph on. Because the in-use memory in the profile is so low, it doesn’t help explain what’s taking up memory; the usage is nowhere close to the ~64 GB that would indicate an OOM scenario.

The dashboard metrics you shared show that you have spikes of 7k - 9k pending queries or 1k pending mutations at once. That many pending requests could account for increased memory usage.

It also looks like you have a lot of transaction aborts based on the dashboard. Pending transactions require memory and transaction aborts are ultimately processing work that went to waste. It’d help if you can minimize the number of aborts by either discarding txns when you don’t need them or by reducing the number of conflicts you have in updates.

If the Dgraph in-use memory metric you’re charting is from dgraph_memory_inuse_bytes (Go heap in-use), then the gap is probably the Go idle memory that the Go runtime keeps around instead of releasing it back to the OS. Idle memory gets released back as needed by the OS.
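If you want to compare those figures directly on a pod, the companion metrics dgraph_memory_idle_bytes and dgraph_memory_proc_bytes are exposed on the same /metrics endpoint as dgraph_memory_inuse_bytes. A quick sketch (assuming the default Alpha HTTP port 8080 and curl in the image):

kubectl exec dgraph-alpha-0 -- curl -s localhost:8080/metrics | grep -E '^dgraph_memory_(inuse|idle|proc)_bytes'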

Thanks for the info.

So, let’s say our code is not buggy and we need to make those mutations concurrently. Does that mean our only way to scale to support that is to use larger instance sizes with more RAM, because we actually “need” it?

FYI: we are querying and mutating Dgraph through the GraphQL layer, so we have no real control over the underlying Dgraph transactions AFAIK, hence my question.