Memory use and crashes when live loading

(I’m running in Kubernetes at the moment)

I’m trying to do an upgrade from v1.1.1 to v20.07.0. I’ve exported my data, which comes to about 28 million quads, and I’m trying to import it into the instance I’m upgrading to using the live loader. I believe I’m running into memory issues: on the clusters where the upgrade has succeeded, I’ve noticed that the Alpha node tends to hit 6-7 GB resident and cruise around there. In other, less successful restorations, the pod terminates, and when it restarts the live loader is unable to resume; it sits at 0 transactions committed indefinitely.
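For reference, the live load is being invoked roughly like this (service names and file paths below are placeholders for our actual setup, and I’m recalling the flags from memory):

# Rough sketch of our live loader invocation (paths and addresses are placeholders).
dgraph live \
  --files /export/g01.rdf.gz \
  --schema /export/g01.schema.gz \
  --alpha dgraph-alpha:9080 \
  --zero dgraph-zero:5080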

When I run top on the Alpha afterwards, CPU usage is quite high, but it doesn’t seem to be doing anything. The Alpha log contains these entries, approximately one per minute:

I0807 02:15:12.662504      14 draft.go:1531] Skipping snapshot at index: 36395. Insufficient discard entries: 1. MinPendingStartTs: 51043
I0807 02:15:12.663280      14 draft.go:1367] Found 3 old transactions. Acting to abort them.
I0807 02:15:12.684843      14 draft.go:1328] TryAbort 3 txns with start ts. Error: <nil>
I0807 02:15:12.684906      14 draft.go:1351] TryAbort selectively proposing only aborted txns: txns:<start_ts:51052 > txns:<start_ts:51190 > txns:<start_ts:51043 > 
I0807 02:15:40.685652      14 draft.go:1370] Done abortOldTransactions for 3 txns. Error: Server overloaded with pending proposals. Please retry later

Hello @m18e,

Generally we recommend 32 GB for most use cases. For the live load process in particular, we’d recommend at least temporarily allocating 32 GB to the Alpha pods; once it’s finished, you can size them back down.
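Since you mentioned Kubernetes, one low-friction way to do that temporarily is to patch the Alpha StatefulSet’s resource requests/limits and let it roll the pod. A minimal sketch, assuming a StatefulSet named dgraph-alpha with a container named alpha in a dgraph namespace (adjust the names to match your deployment):

# Bump the Alpha's memory request/limit to 32Gi (resource names here are assumptions).
kubectl -n dgraph patch statefulset dgraph-alpha --patch '
spec:
  template:
    spec:
      containers:
        - name: alpha
          resources:
            requests:
              memory: "32Gi"
            limits:
              memory: "32Gi"
'

When the live load is done, patch the values back down or re-apply your original manifest.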

In my own tests, using a 21-million RDF dataset, I found that the live loader hit memory limits almost immediately with 8 GB, but it ran more smoothly with 32 GB.

@m18e: could you share your cluster resource configuration?

@joaquin / @vvbalaji We use nodes with 16 GB RAM right now. I didn’t see anything in the Dgraph documentation about recommended sizing; I did see mentions of t2.medium instances in the multi-host setup guide, so I figured our default instance size would be sufficient, especially given that our data set is only about 400 MB gzipped at the moment.

Our cluster is simple, we have a single alpha right now.

We still generally recommend 32 GB. When we do performance testing with the live loader, we use a 16 vCPU, 60 GB memory system for a 16-million dataset.

This may not be ideal, but could you possibly size the EC2 instance up to something with more resources, and then size it back down after the live loader process finishes?

@m18e: Would you be able to share the memory profile and the 400 MB dataset that is causing issues during the live load?
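If it helps, the Alpha serves Go’s pprof endpoints on its HTTP port (8080 by default), so a heap profile can be captured with something like the following (the pod name is hypothetical, and the port-forward is only needed if the port isn’t already reachable from your machine):

# Forward the Alpha's HTTP port locally, then save a heap profile.
kubectl -n dgraph port-forward pod/dgraph-alpha-0 8080:8080 &
curl -o alpha_heap.pprof http://localhost:8080/debug/pprof/heap
# Or inspect it interactively if the Go toolchain is installed:
go tool pprof http://localhost:8080/debug/pprof/heap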

@joaquin Where are these recommendations? Can you please update your documentation with these recommendations?

I don’t really want to have to go through the process of upsizing our nodes whenever we do data ingestion just so that the live load doesn’t fall over. It defeats the point of having a live load if we need downtime while we upsize and downsize our nodes!

@vvbalaji Unfortunately we cannot share our dataset because of confidentiality restrictions.

Is it possible to have the live load resume successfully if the process is terminated because of resource constraints?

Why is it so memory hungry? Is it aggressively caching data? If so, is it possible to tune it so that it is more conservative with what’s cached?

Just some clarification: the 60 GB was just for testing; 32 GB is generally recommended, but it is dependent on the data and usage. Sometimes you can get by with 16 GB, but it is something that must be monitored to establish a baseline.

There is not a simple rule such as “if you have x predicates, you need y GB of memory.” This is because each use case is a bespoke solution, dependent on the data, queries, traffic, etc.

I will talk this over with my team on Monday, look at our current docs, and see if we can follow up with something more concrete. We are continuously fine-tuning to find optimal memory usage, including caching and memory allocation schemes.
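On the caching question: one knob that should still be available on v20.07 Alphas, if I recall correctly, is the --lru_mb flag, which is meant to cap the LRU cache size. The usual suggestion is roughly one third of available RAM; setting it lower trades cache hits for a smaller footprint. A sketch (treat the flag’s availability and effect on your version as an assumption to verify with dgraph alpha --help):

# Assumption: v20.07 still accepts --lru_mb; lower it to be more conservative with cache memory.
dgraph alpha --lru_mb 4096 --zero dgraph-zero:5080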

We have guidelines documented that list 32 GiB as a common setup:

Live Loader will impact a Dgraph cluster, so the cluster may need to be temporarily scaled up to handle the load; it can then be scaled down afterward.

@m18e: following up to check whether you got a chance to capture the memory profile from the OOM crash.

We are actively working on addressing excessive memory usage and uncontrolled growth in memory usage (commit enabling recent changes that reduce the memory footprint: build(jemalloc): Install Jemalloc if missing via Makefile. (#6463) · dgraph-io/dgraph@02b5118 · GitHub). If you can try it and share a memory profile, or even share a memory profile from the version you are using, we can work on addressing it.

Hello! I unfortunately haven’t been able to jump on this because of some work deadlines, but I wouldn’t mind giving this a try sometime. I’ll let you know if/when I’m able to do so.

Just updating here that these memory improvements are slated for the next release, Dgraph v20.11. You can try out those changes today in the release/v20.11 branch on GitHub.
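If you’d rather build it yourself than wait for release binaries, something along these lines should work (assumes Go, GCC, and make are installed; per the commit above, the Makefile should fetch jemalloc if it is missing):

# Build and install the dgraph binary from the release/v20.11 branch.
git clone https://github.com/dgraph-io/dgraph.git
cd dgraph
git checkout release/v20.11
make install   # installs the dgraph binary into $GOPATH/bin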

@m18e happy to help set you up with the new build if you’re still seeing these memory issues.