Out of memory error


(Nikita Zaletov) #1

What can trigger this issue? We hit an OOM error while loading ~20M edges (via dgraph4j) on an m5.large instance with 8 GB of RAM.

lru_mb was set to 1024. Since it's just a cache size, can it lead to OOM?


(Michel Conrado (Support Engineer)) #2

It is recommended to set it to half the available memory; that rule was created precisely to prevent OOM.

That said, you are (probably) loading 20 million edges in a short period, which leaves little room for Dgraph to allocate and process data. Any database with low resources would die under that kind of usage.

It would also help to know how much CPU you devote to the instance, and it's always important to say which version you're using.

Loading 21million.rdf (available at https://github.com/dgraph-io/benchmarks/tree/master/data/release) via Live Load, Dgraph uses around 12 to 15 GB of memory, because those resources are required during that period: Dgraph needs to process and allocate memory for the work (a graph is complex). So a 1 GB cache and 8 GB of total RAM is a lot to ask of the instance for that load.

Besides the load itself, Dgraph runs processes like snapshots and tablet balancing, and they occur independently.

Dgraph can improve on this, but only 8 GB for this level of load in a single shot is too little for now.

PS: If you look at the requirements of other graph databases, they recommend 2 GB to 16 GB of RAM, depending on the use case.


(Nikita Zaletov) #3

I heard about a rule of using 1/3 of RAM. But still, can lru_mb be the cause? I thought it's just a cache size (going by its name); if I set it lower than 1/3 of RAM, it shouldn't fail with an OOM error, right?
As for now, we are loading this data in one thread using batches of ~5,000-10,000 edges: we wait for the answer, commit, then load the next batch.
We can increase RAM to 16 GB or even more, but is there any guarantee it will not fail at 40M edges on the next try?
It would be nice to have some firm rules about memory usage, because we are planning to load ~10 billion rows in the future.

Why did we choose 8 GB? That value is recommended in the docs as the amount with which Dgraph should work stably.

I am using the latest version, 1.0.9.
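For reference, the single-threaded loop described here can be sketched roughly as follows. This is a minimal sketch: `sendBatch` is a hypothetical stand-in for the real dgraph4j round trip (open a transaction, mutate the batch as N-Quads, commit), and the batch size is just the range quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SequentialLoader {
    // Split N-Quad strings into batches of at most batchSize edges.
    public static List<List<String>> toBatches(List<String> nquads, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < nquads.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                nquads.subList(i, Math.min(i + batchSize, nquads.size()))));
        }
        return batches;
    }

    // One batch per transaction, committed before the next batch starts.
    // sendBatch is a hypothetical stand-in for dgraph4j's
    // newTransaction() / mutate() / commit() sequence.
    public static void load(List<String> nquads, int batchSize,
                            Consumer<List<String>> sendBatch) {
        for (List<String> batch : toBatches(nquads, batchSize)) {
            sendBatch.accept(batch); // blocks until the commit finishes
        }
    }
}
```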


(Michel Conrado (Support Engineer)) #4

The OOM in this case is, I think, related to low resources and high demand.

1,000 RDF/s is a good ceiling for low resources. But the problem is how constant the writes are over time: the more spaced out they are, the better, if you have low resources.

As I said, the amount of memory needed depends on usage. The more you increase the write load, the more memory and CPU you need.

We can consider adding use-case-based docs. But what is in the docs now are standard values for basic and intermediate workloads.


(Nikita Zaletov) #5

The problem is that we already have hundreds of millions of events in Kafka, and Dgraph is now the bottleneck, so it will process as much as it can. There is no gap between batches: we read a batch of 1,000 resources, form a batch of edges, and run a mutation; we wait for the answer, commit, wait until the commit finishes, and then fetch and process the next batch, and so on. So if we create a cluster of powerful machines with 128 GB of RAM and process 100,000 edges per second, will that cause an OOM error again?


(Daniel Mai) #6

Dgraph will try to use the necessary amount of memory to service requests.

For large amounts of data we recommend using the live loader to load data in reasonable batches or the bulk loader to load data quickly into a new cluster.


(Nikita Zaletov) #7

But our requests were batch mutations in sequence, each of only 5,000-10,000 edges. What amount of RAM do you suggest to handle a load of 10 billion edges in this manner?


(Nikita Zaletov) #8

Unfortunately, we can't use the live or bulk loader and have to run the mutations ourselves, for many reasons.


(Michel Conrado (Support Engineer)) #9

A somewhat different approach: I'm not sure how Kafka works (or exactly what you're using), but I believe you could create a simple program that parses the records from Kafka into RDF and then feed that to the bulk loader. That way you would have something more efficient for this case and could do it in one giant batch, once and for all.
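As an illustration, converting a stream record to an N-Quad line for the bulk loader could look like this. The (subject, predicate, object) record shape is an assumption, and blank-node labels (`_:x`) are used so the bulk loader assigns the uids:

```java
public class RdfConverter {
    // Hypothetical record shape: (subjectId, predicate, objectId).
    // Blank-node labels (_:id) are resolved to uids by the bulk loader.
    public static String toNQuad(String subjectId, String predicate, String objectId) {
        return "_:" + subjectId + " <" + predicate + "> _:" + objectId + " .";
    }

    // A literal-valued edge, e.g. a name property.
    public static String toLiteralNQuad(String subjectId, String predicate, String value) {
        return "_:" + subjectId + " <" + predicate + "> \"" + value + "\" .";
    }
}
```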


(Nikita Zaletov) #10

When processing Kafka (and any other stream) there is no start and no end, so right after one giant batch there will be the next one. So there is no difference between a giant batch and a few-kilobyte batch.
Also, we need lookups for each batch, because some of the nodes are already created and some we still need to create, so it's complicated, and RDF is not an option.

I still have a question: is there some amount of RAM above which Dgraph works stably? Or is it always a lottery under regular heavy load?


(Nikita Zaletov) #11

Maybe some parameters like badger.vlog or others would make it stable?


(Nikita Zaletov) #12

And is there a difference in terms of memory usage between committing after each mutation and enabling autocommit?


(Manish R Jain) #13

If you can give more memory to Dgraph, that always helps, for sure. I'd keep the batch sizes reasonable, say 1,000 edges per transaction. You could run the txns concurrently, say 10 txns in flight at any point in time.
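That suggestion (batches of ~1,000 edges, ~10 transactions in flight) can be sketched like this. `sendBatch` is again a hypothetical stand-in for a dgraph4j newTransaction()/mutate()/commit() round trip, and the pool-based fan-out is one possible way to cap the number of in-flight transactions, not the client's own API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentLoader {
    static final int BATCH_SIZE = 1_000; // edges per transaction, as suggested
    static final int IN_FLIGHT = 10;     // max concurrent transactions

    private final AtomicLong committed = new AtomicLong();

    // Stand-in for one dgraph4j transaction: mutate the batch, then commit.
    private void sendBatch(List<String> batch) {
        committed.addAndGet(batch.size());
    }

    // Fan the batches out over a fixed pool so at most IN_FLIGHT
    // transactions are running at any point in time.
    public long load(List<String> edges) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(IN_FLIGHT);
        for (int i = 0; i < edges.size(); i += BATCH_SIZE) {
            List<String> batch = new ArrayList<>(
                edges.subList(i, Math.min(i + BATCH_SIZE, edges.size())));
            pool.submit(() -> sendBatch(batch));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return committed.get();
    }
}
```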

If you can capture a heap profile when Dgraph is about to go OOM, that would help determine the cause.

https://docs.dgraph.io/howto/#retrieving-debug-information


(Nikita Zaletov) #14

Thank you for the answer. We will try to reduce the batch size and increase RAM.