Dgraph can't idle without being OOM-killed after large data ingestion

I’ve upserted ~100 million nodes into Dgraph. Each node is small, just a few fields, but they all have one relationship back to the same “root” node, which is how I’m trying to achieve namespacing in the cluster. They also each have an xid field.
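
For context, each upsert looks roughly like this with the dgo client (a simplified sketch; “item-42” and the root uid `0x1` are placeholders, and it assumes an index on `xid`):

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to one Alpha (the address is a placeholder).
	conn, err := grpc.Dial("alpha0:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Upsert one node by xid: look it up, create it only if it's missing,
	// and point it back at the shared "root" node for namespacing.
	query := `query {
		q(func: eq(xid, "item-42")) { v as uid }
	}`
	mu := &api.Mutation{
		Cond: `@if(eq(len(v), 0))`,
		SetNquads: []byte(`_:new <xid> "item-42" .
_:new <project> <0x1> .`),
	}
	req := &api.Request{
		Query:     query,
		Mutations: []*api.Mutation{mu},
		CommitNow: true,
	}
	if _, err := dg.NewTxn().Do(context.Background(), req); err != nil {
		log.Fatal(err)
	}
}
```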

I have 3 Zeros and 3 Alphas running on GKE with the v20.03.0 release. The Alphas each have their own n1-highcpu-16 node, with the Alpha LRU cache set to 2048 MB (via the Dgraph Helm chart).

With this setup, I could consistently write at around 10k upserts per second. However, once I reach a certain scale, a Dgraph node gets OOM-killed by Kubernetes.

Once that node gets shut down, the cluster never recovers. The node gets OOM-killed repeatedly even though the cluster is idle and ingestion is paused.

I’ve attached the logs for the 3 Zeros and Alphas, as well as another set of logs from alpha1 (the bad node) after it restarted and got OOM-killed again.

zero0 (16.7 KB), zero1 (6.4 KB), zero2 (5.2 KB), alpha1 (81.0 KB), alpha2 (77.2 KB), alpha0 (109.8 KB), alpha1_restart (65.8 KB)

It feels like the bad node is trying to read files from disk that are too large to fit in memory. What can I do to mitigate this, and is this expected behavior?

I added a 4th Alpha node, but the problem persists, although now it’s alpha0 running out of memory.

3 E0501 00:00:00.649519 1 writer.go:49] TxnWriter got error during callback: Unable to write to value log file: "p/001532.vlog": write p/001532.vlog: cannot allocate memory

How much memory are you allocating to these instances via K8s? I see a lot of data being written, and Badger is busy doing a lot of compactions in the background. So, this data ingestion seems quite memory intensive.

Also note that Go does get somewhat memory hungry. So, I’d err on the side of giving it more memory rather than less.

Hey @mrjn, thanks for looking into it.

Right now these are running without memory limits on the n1-highcpu-16 nodes, so each Dgraph Alpha has about 12-14 GB to play with. However, these logs are from a time when no active reads/writes were happening against the cluster, so I think it’s all internal compaction/replication after the big data load.

The more I think about this: does Dgraph shard a single predicate across multiple nodes? Specifically, are we bound to a graph whose most common predicate must fit within what a single Alpha node can handle?

Any other tips on resource sizing or cluster layout for upsert speed are also welcome :slight_smile:

If you’re doing a serious data load, I’d give it 32GB.

No. A single predicate is kept within one shard. See section 2.3 in Dgraph Whitepapers: In-Depth Insights and Analysis.

A single Alpha node can handle a lot. Behind it is Badger, which is being used to serve TBs of data.

If ingestion speed is important, I’d suggest looking into ludicrous mode. I can’t find it in our docs yet, but you can see it as a flag on both Alpha and Zero in the latest release.

Thanks @mrjn, that answers all my questions :+1: I’ll try some bigger nodes, but ludicrous mode also sounds interesting. We have a few request patterns that are transactional in nature, but the majority of them can likely survive with weaker consistency guarantees. I assume it’s an on/off thing for the entire cluster? I’ll keep my eye out for docs, but if there are any rough drafts about the tradeoffs I’d love to read up.

Ludicrous mode gives you the same guarantees as a typical NoSQL database, down from the strict snapshot isolation (SI) transactional guarantees that Dgraph provides by default.

@dmai: Do we have documentation somewhere?

Yeah, we do. https://dgraph.io/docs/deploy/#ludicrous-mode

Unfortunately ludicrous mode didn’t provide a performance increase for me. I’m blasting Dgraph with upserts, and I think the read-then-write nature of those is outside the scope of ludicrous mode? On my 30 GB, 3-node cluster I struggle to get over 50 upserts per second.

Which version of Dgraph are you using?

v20.03.1. I made bulkier requests (400 upserts per request, up from 10) and that helped a good deal: up from 50 upserts/s into the thousands, but that seems to be my ceiling. This is a small sample of the type of requests I’m making. Mostly I’m attempting to create nodes with external IDs and some relations.

dgtest (5.8 KB)
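
Roughly, the batched requests are built like this (a simplified sketch of one way to pack many xid upserts into a single request; `buildBatch` and `rootUID` are just names I’m using here, the attachment above has the real queries):

```go
package ingest

import (
	"fmt"
	"strings"

	"github.com/dgraph-io/dgo/v2/protos/api"
)

// buildBatch packs many xid-based upserts into one api.Request: one query
// block plus one conditional mutation per xid, each creating the node only
// if it doesn't exist yet and linking it back to the shared root node.
func buildBatch(xids []string, rootUID string) *api.Request {
	var query strings.Builder
	query.WriteString("query {\n")
	mutations := make([]*api.Mutation, 0, len(xids))

	for i, xid := range xids {
		// Query block qN binds variable vN to the existing uid, if any.
		fmt.Fprintf(&query, "  q%d(func: eq(xid, %q)) { v%d as uid }\n", i, xid, i)

		// Conditional mutation: only runs when nothing matched this xid.
		nquads := fmt.Sprintf("_:n%d <xid> %q .\n_:n%d <project> <%s> .\n",
			i, xid, i, rootUID)
		mutations = append(mutations, &api.Mutation{
			Cond:      fmt.Sprintf("@if(eq(len(v%d), 0))", i),
			SetNquads: []byte(nquads),
		})
	}
	query.WriteString("}")

	return &api.Request{
		Query:     query.String(),
		Mutations: mutations,
		CommitNow: true,
	}
}
```

The whole batch then goes out in one call, e.g. `dg.NewTxn().Do(ctx, buildBatch(batch, "0x1"))`.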

Are you making these requests concurrently? Is Dgraph maxing out its CPU core allocation?

Actually, looking at the mutations, it looks like there are only two predicates being used: xid and project. That might mean only 2 goroutines are doing all the writing, one per predicate. That wouldn’t give you great performance.

@ashishgoswami: Perhaps we can look into running more mutations concurrently within the same predicate.


Yup, I have 3 processes, each one connected to a different Alpha. CPU doesn’t seem to be maxed; it’s hanging around 8 out of 16 cores. Memory usually hangs around 10 GB of 30 GB. I tried to take some Jaeger traces. There were some really slow ones when I was doing batch sizes of 10 (handleUidPostings spans were over 2 seconds), but with my bigger batch sizes I don’t catch any anomalous multi-second traces.

Are these processes sending requests serially or concurrently? You really want to send many requests concurrently. I’d say perhaps 16 in flight at all times, but you can experiment with when your cores start becoming saturated (>80% usage), and increase or decrease the number in flight accordingly.
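
Something like this, roughly (a sketch, not tested; `runWorkers` and the request channel are just illustrative names):

```go
package loader

import (
	"context"
	"log"
	"sync"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
)

// runWorkers keeps roughly numWorkers upsert requests in flight at all
// times: each worker pulls pre-built requests off a shared channel and
// commits them, so concurrency is bounded by the number of workers.
func runWorkers(ctx context.Context, dg *dgo.Dgraph, reqs <-chan *api.Request, numWorkers int) {
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range reqs {
				if _, err := dg.NewTxn().Do(ctx, req); err != nil {
					log.Printf("upsert batch failed: %v", err)
				}
			}
		}()
	}
	wg.Wait()
}
```

Start with numWorkers around 16 and adjust it based on where CPU saturation kicks in.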

I created this issue in Github so we can investigate: Run mutations concurrently per predicate in Ludicrous mode · Issue #5403 · dgraph-io/dgraph · GitHub

Ah, they were sending serially. I’ve upped the concurrency to have ~16 in flight at once. I’m seeing increased throughput (up to 2k/s from 1k/s), but I’m struggling to get the CPU utilization up without blowing out memory first. I’m guessing this has to do with the load being so heavy on xid compared to any other predicate. Here are the resources: memory climbs until it hits 30 GB and the node gets OOM-killed. CPU on the loaded Alpha is ~60-70%.

If you want to debug, you can take a Go heap profile when the memory usage is high. We can have a look at it and see what we find.

https://dgraph.io/docs/howto/#profiling-information
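
For example, a tiny sketch that just downloads the heap profile from an Alpha’s HTTP endpoint (assuming the default HTTP port 8080; you can then open the saved file with go tool pprof):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Fetches a heap profile from an Alpha's pprof endpoint and saves it locally.
// The host name is a placeholder; adjust it for your cluster.
func main() {
	resp, err := http.Get("http://alpha0:8080/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```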

Sure, I’ll grab one in a moment. However, when I see OOM kills like this there’s a huge memory spike on reboot that usually causes another OOM kill. I’ll try to grab a snapshot at the peak, but I’m unsure if these are two different issues.

Here’s the reboot spike:
[screenshot: Screen Shot 2020-05-08 at 4.59.32 PM]

Hi Sean, we could also schedule a quick call with one of our Dgraph engineers next week if you would be interested. Please let me know; I can be reached here or at derek@dgraph.io.

Grabbed a heap snapshot near the peak. Once a node is OOM-killed, each Alpha gets into an OOM-kill loop. I saw some tickets related to high memory usage on Badger restores; maybe that’s related? I’m unsure whether this heap snapshot exhibits my original OOM kill from heavy load or the OOM kill from reboot (perhaps they have the same root cause).

EDIT: This is while tracing is still on, which I realize is probably taking a good bit of the memory. I’ll grab a cleaner one.
pprof.dgraph.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz (66.3 KB)

Here you can see them struggling to come back online

@dereksfoster99 Sure, that would be great.