How to import 300G of data into a cluster?

There is a cluster of three machines (CentOS 7); each machine runs one Zero and three Alphas.
The three Alphas on each machine belong to different groups.
Each machine has 32G of memory. How can I import 300G of data?
The 300G of data is generated programmatically and divided into three 100G datasets (three predicates).
I tried the bulk loader, but there was not enough memory and the import failed.
With the live loader, the import took too long, and during the import the Zero leader changed and the import failed.
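For reference, a minimal sketch of the two loaders discussed here, with the knobs that most affect memory and cluster load. The file names, addresses, and flag values are assumptions, and flag names can differ between Dgraph versions:

```bash
# Bulk loader: runs offline against Zero, before the Alphas hold any data.
# Fewer worker goroutines and smaller map buffers trade speed for a lower
# peak memory footprint; use one reduce shard per Alpha group.
dgraph bulk \
  -f /data/part1.rdf.gz,/data/part2.rdf.gz,/data/part3.rdf.gz \
  -s /data/schema.txt \
  -z localhost:5080 \
  --map_shards 3 \
  --reduce_shards 3 \
  --num_go_routines 2 \
  --mapoutput_mb 32

# Live loader: runs against a live cluster. Smaller batches and lower
# concurrency reduce memory pressure on the Alphas, at the cost of time.
dgraph live \
  -f /data/part1.rdf.gz \
  -s /data/schema.txt \
  -a localhost:9080 \
  -z localhost:5080 \
  -b 500 \
  -c 5
```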


It’s hard.

That will eat up a TON of memory!!! Dgraph feeds on memory. With 300G it will laugh at 32Gb. I loaded about 1/2G and it took me 16Gb. It really depends on the schema, though: how many fields are indexed, how the data is spread across predicates, and so on.
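As a hypothetical illustration, a small three-predicate schema like the one below: every @index or @reverse directive is extra index data the loader has to build on top of the raw triples, so the same dataset can need noticeably more memory.

```bash
# Hypothetical schema file: each @index/@reverse below adds index data
# that the bulk or live loader must build while ingesting the triples.
cat > /data/schema.txt <<'EOF'
name:   string @index(term) .
email:  string @index(exact) @upsert .
friend: [uid]  @reverse .
EOF
```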

Take a look at https://github.com/EnricoMi/dgraph-dbpedia. If I understood it correctly, he loads a 12Gb dataset there and it used 64Gb of memory.


If you guys try the master version, you will see a nice improvement in RAM usage, especially from the jemalloc commits. I think these commits will land in the next major release, but I’m not sure.


I have successfully imported 1.5TB of RDF on a machine with 1TB of memory. Maybe you can generate the out directory with the bulk loader on a high-memory machine, and then copy the data from the out directory to the three CentOS machines.
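If you go that route, a rough sketch of the copy step, assuming the bulk load was run with --reduce_shards 3; the hostnames and target paths below are placeholders for your actual layout:

```bash
# Each reduce shard out/N maps to one Alpha group (out/0 -> group 1, etc.).
# Copy a shard's p directory into the p directory of every Alpha that will
# serve that group, before starting those Alphas.
scp -r out/0/p centos1:/data/alpha-group1/p
scp -r out/1/p centos1:/data/alpha-group2/p
scp -r out/2/p centos1:/data/alpha-group3/p
# Repeat for centos2 and centos3 if they host replicas of the same groups,
# then start the Alphas against the same Zero cluster the bulk loader used.
```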

I presume the Docker image dgraph/dgraph:master contains that improvement? I’ll give it a try.
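A quick way to check which build that tag gives you (the image tag is taken from the post above; the commands are a sketch assuming the dgraph binary is on the image's PATH):

```bash
# Pull the nightly/master build and print its version string.
docker pull dgraph/dgraph:master
docker run --rm dgraph/dgraph:master dgraph version
```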

I guess it does. I saw that it was built a day ago, so it probably does. Manish has also tweaked when the GC runs, but that commit is very recent. jemalloc is the most important change here for now.

Another question on this topic: is there a good post on how the bulk loader’s memory consumption scales with dataset size, in the sense of how it scales with the number of predicates, distinct URIs, and triples in the dataset?

Given these parameters (and any others that are relevant) for my dataset, can I compute or estimate the memory required to successfully load it with the bulk loader?

Not as far as I know. We try to estimate based on the dataset size, but adding more indexes and more directives requires more resources, so it is not exact math. In my opinion the minimum would be 29GB of RAM. If you are just playing around, you can use 4GB.

In my tests I got much better RAM usage results, so 29GB is good enough for big datasets. Not with the previous versions, though; those need more resources.

Let us know if you still have any issues with this. We have a task force dedicated to these RAM and performance issues.
