I am trying to ingest 100B RDF records into Dgraph (~1TB gzip compressed). My setup is 3 x i3.2xlarge EC2 machines on AWS. The number of edges between nodes is much smaller than the number of nodes in the graph.
I have tried running the bulk loader on an i3.16xlarge machine (64 vCPUs, 488 GB RAM) and ran it for 9 hours before giving up. The last output it produced:
MAP 09h07m02s rdf_count:14.29G err_count:180.8k rdf_speed:435.5k/sec edge_count:36.67G edge_speed:1.117M/sec
processing file (836 out of 6186): /dgraph1/rdf/part-00442-9b4506f3-5b0d-402d-a471-a63ae01b6ec6.rdf.gz
I also tried running mutations from Spark (200 executors running in parallel), and in 50 minutes it could only perform 13M mutations (and I am sure there are repeats too, given that a few tasks failed).
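For comparison, here is my rough arithmetic on the two runs (nothing Dgraph-specific, just the numbers above):

    # Rough throughput comparison between the two ingestion attempts above.
    spark_rate = 13e6 / (50 * 60)   # 13M mutations in 50 min ~= 4,333/sec
    bulk_map_rate = 435.5e3         # rdf_speed reported in the bulk loader log
    print(f"Spark: {spark_rate:,.0f} RDFs/sec; bulk map phase: {bulk_map_rate:,.0f} RDFs/sec "
          f"(~{bulk_map_rate / spark_rate:.0f}x faster)")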
I am just wondering if this is expected or whether I need to tune Dgraph further. Let me know if you need more information.
1.1M edges/sec is around what we expect from bulk loader runs. 100B is a large data set: at 1.1M edges/sec, 100B RDFs should take the bulk loader a bit longer than a full day to finish.
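Spelled out (plain arithmetic on those two numbers):

    # ETA implied by the expected rate: 100B RDFs at 1.1M per second.
    hours = 100e9 / 1.1e6 / 3600
    print(f"~{hours:.1f} hours")    # ~25.3 hours, i.e. a bit over a full day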
It took 1 hour and 4 minutes of total runtime to insert 1.9 billion RDFs (I would need to re-check whether the runtime today is higher or lower, since Dgraph has changed a lot since then). Taking into consideration that you said 100 billion RDFs, we can suppose it would take around 52~60 hours. That test in the blog post had 50 GB of compressed RDFs; you have 1 terabyte, 20x bigger.
Just to give you a notion: if you have a 1 Gbps upload link (about 450 GB/hr), it would take about 2.27 hours just to upload the data. So it's a lot of data. It's also safe to say that you'll need 8 TB of storage available to be sure; it depends a lot on the content and the indexing used. At that time, Dgraph expanded 50 GB of input to more or less 200 GB on disk. I think today it is a lot less, as the latest versions generate much smaller files in practice.
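Writing those estimates out (assuming your 1 TB is roughly 1 TiB = 1024 GB, which is what the 2.27-hour figure implies):

    # Extrapolating the 1.9B-RDF benchmark and the transfer/storage estimates above.
    load_hours = (100e9 / 1.9e9) * 64 / 60   # ~56 h, within the 52~60 h range
    upload_hours = 1024 / 450                # ~1 TiB at ~450 GB/hr (1 Gbps) ~= 2.27 h
    disk_tb = 1 * (200 / 50)                 # ~4x on-disk expansion seen in that test -> ~4 TB
    print(f"load ~{load_hours:.0f} h, upload ~{upload_hours:.2f} h, "
          f"~{disk_tb:.0f} TB on disk (8 TB to be safe)")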
But what would your expectations be in real numbers?
It would be nice if it finished within half a day or so. That way I could run it whenever needed and repeat the process if it fails or there is a mistake on my end.
Let me ask you this: what if I run the bulk loader in parallel on multiple machines after manually dividing the data? Is there an easy way to combine the outputs of the bulk loader?
Not sure; it could create conflicts in UID generation.
This would be possible if you could manually set the UID range, i.e. start the second bulk load at a much later range (much later, in your case). But I think all of this would be quite complex to deal with.
And in theory (assuming no conflicts), once you brought the second output up as another group, Dgraph Zero would start balancing the load.
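If you wanted to experiment with that, here is a minimal sketch, assuming Dgraph Zero's HTTP /assign endpoint, which leases UIDs on demand (on Zero's HTTP port, 6080 by default). The Zero address and the size of the reserved block below are assumptions, and whether two independent bulk loader outputs can really be stitched together this way is exactly the open question:

    # Hypothetical sketch: reserve a large UID block on Dgraph Zero before the
    # second bulk load, so later-minted UIDs start beyond everything used by the
    # first run. Zero's /assign?what=uids&num=N endpoint leases UIDs and bumps
    # the max leased UID; this does NOT by itself make two bulk loads mergeable.
    import urllib.request

    ZERO_HTTP = "http://localhost:6080"   # assumption: your Zero's HTTP address
    RESERVE = 200_000_000_000             # assumption: upper bound on UIDs from run #1

    with urllib.request.urlopen(f"{ZERO_HTTP}/assign?what=uids&num={RESERVE}") as resp:
        print(resp.read().decode())       # Zero returns the leased [startId, endId] range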