How to import 300G of data into a cluster?

There is a cluster of three machines (CentOS 7); each machine runs one Zero and three Alphas.
The three Alphas on each machine belong to different groups.
Each machine has 32G of memory. How can I import 300G of data?
The 300G of data is generated programmatically and divided into three 100G datasets (three predicates).
I tried the bulk loader, but there was not enough memory and the import failed.
With the live loader, the import took too long, and during the import the Zero leader changed and the import failed.
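For reference, a minimal sketch of the two loaders discussed here, with the knobs that most affect memory and cluster load. The file names, addresses, and flag values are assumptions, and flag names can differ between Dgraph versions:

```bash
# Bulk loader: runs offline against Zero, before the Alphas hold any data.
# Fewer worker goroutines and smaller map buffers trade speed for a lower
# peak memory footprint; use one reduce shard per Alpha group.
dgraph bulk \
  -f /data/part1.rdf.gz,/data/part2.rdf.gz,/data/part3.rdf.gz \
  -s /data/schema.txt \
  -z localhost:5080 \
  --map_shards 3 \
  --reduce_shards 3 \
  --num_go_routines 2 \
  --mapoutput_mb 32

# Live loader: runs against a live cluster. Smaller batches and lower
# concurrency reduce memory pressure on the Alphas, at the cost of time.
dgraph live \
  -f /data/part1.rdf.gz \
  -s /data/schema.txt \
  -a localhost:9080 \
  -z localhost:5080 \
  -b 500 \
  -c 5
```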


It’s hard.

That will eat up a TON of memory!!! Dgraph feeds on memory. With 300G it will laugh at 32Gb. I loaded about 1/2G and it took me 16Gb. It really depends on the schema, though: how many fields are indexed, how the data is spread across predicates, and so on.
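As a hypothetical illustration, a small three-predicate schema like the one below: every @index or @reverse directive is extra index data the loader has to build on top of the raw triples, so the same dataset can need noticeably more memory.

```bash
# Hypothetical schema file: each @index/@reverse below adds index data
# that the bulk or live loader must build while ingesting the triples.
cat > /data/schema.txt <<'EOF'
name:   string @index(term) .
email:  string @index(exact) @upsert .
friend: [uid]  @reverse .
EOF
```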

Take a look at https://github.com/EnricoMi/dgraph-dbpedia. If I understood it correctly, he loads a 12Gb dataset there and it used 64Gb of memory.


If you guys try the master version, you will see a nice improvement in RAM usage, especially from the jemalloc commits. I think these commits will land in the next major release, but I’m not sure.


I have successfully imported 1.5TB of RDF on a machine with 1TB of memory. Maybe you can generate the out directory with the bulk loader on a high-memory machine, and then copy the data from the out directory to the three CentOS machines.
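If you go that route, a rough sketch of the copy step, assuming the bulk load was run with --reduce_shards 3; the hostnames and target paths below are placeholders for your actual layout:

```bash
# Each reduce shard out/N maps to one Alpha group (out/0 -> group 1, etc.).
# Copy a shard's p directory into the p directory of every Alpha that will
# serve that group, before starting those Alphas.
scp -r out/0/p centos1:/data/alpha-group1/p
scp -r out/1/p centos1:/data/alpha-group2/p
scp -r out/2/p centos1:/data/alpha-group3/p
# Repeat for centos2 and centos3 if they host replicas of the same groups,
# then start the Alphas against the same Zero cluster the bulk loader used.
```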

I presume the Docker image dgraph/dgraph:master contains that improvement? I’ll give it a try.
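A quick way to check which build that tag gives you (the image tag is taken from the post above; the commands are a sketch assuming the dgraph binary is on the image's PATH):

```bash
# Pull the nightly/master build and print its version string.
docker pull dgraph/dgraph:master
docker run --rm dgraph/dgraph:master dgraph version
```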

I guess it does. I saw that it was built a day ago, so it probably does. Manish has also tweaked when the GC runs, but that commit is very recent. jemalloc is the most important change here for now.

Another question on this topic: is there a good post on how the bulk loader’s memory consumption scales with dataset size, in the sense of how it scales with the number of predicates, distinct URIs, and triples in the dataset?

Given these parameters (and any others that are relevant) for my dataset, can I compute or estimate the memory required to successfully load it with the bulk loader?

Not as far as I know. We try to estimate based on the dataset size, but adding more indexes and more directives requires more resources, so it is not exact math. In my opinion the minimum would be 29GB of RAM. If you are just playing around, you can use 4GB.

In my tests I got much better RAM usage results, so 29GB is good enough for big datasets. Not with the previous versions, though; those need more resources.

Let us know if you still have any issues with this. We have a task force dedicated to these RAM and performance issues.
