Data Ingestion very slow

amanmangal · September 24, 2018, 1:11pm

Hi,

I am trying to ingest 100B RDF records into dgraph (~1TB gzip compressed). My setup has 3 x i3.2xlarge EC2 machine on AWS. The number of edges between nodes is much smaller than number of nodes in the graph.

I have tried running bulk loader using i3.16xlarge machine (64 CPU, 488 GB) and I ran it for 9 hours before giving up. The last output it had -

MAP 09h07m02s rdf_count:14.29G err_count:180.8k rdf_speed:435.5k/sec edge_count:36.67G edge_speed:1.117M/sec
processing file (836 out of 6186): /dgraph1/rdf/part-00442-9b4506f3-5b0d-402d-a471-a63ae01b6ec6.rdf.gz

I also tried running queries using Spark (200 executors running in parallel) and in 50 minutes it could only perform 13M mutations (I am sure there are repeats too given a few task failed).

I am just wondering if this is expected or whether I need to further fine tune dgraph. Let me know if you need more information

Thanks
Aman Mangal

dmai · September 24, 2018, 11:38pm

1.1M edges/sec is around what we expect from bulk load runs. 100B is a large data set. with 100B RDFs and 1.1M edges/sec, the bulk loader run time should take a bit longer than a full day to finish.

amanmangal · September 25, 2018, 1:42am

But isn’t that just the MAP phase, what about the reduce phase?

MichelDiz · September 25, 2018, 4:02am

You’re using almost the same config as the used in this post https://blog.dgraph.io/post/bulkloader/

It takes 1 hour and 4 min to insert 1.9 billion RDFs in total runtime (Need to review whether the runtime today is higher or lower today, since the Dgraph has changed a lot since then.). Taking in consideration that you said 100 billion RDFs, so we can suppose it take 52~60 hours. That test in the blog post had 50GB of compressed RDFs. You have 1 Terabyte. 20x bigger.

Just to have a notion If you have 1Gbps of upload link, you would take about 2.27 Hours just to upload (450 GB/hr). So its a lot of data. Also it’s safe to say that you’ll need 8TB of storage available to be sure. It depends a lot on the content and indexing used. At that time Dgraph expanded from 50GB to more or less 200GB. I think today is a lot less, the latest versions are generating much smaller files in practice.

But what would your expectations be in real numbers?

amanmangal · September 25, 2018, 5:11am

It would be nice if it finishes within half a day or so. This will ensure that I can potentially do it whenever needed and repeat the process if it fails or there is a mistake from my end.

Let me ask you this, what if I run the bulk loader in parallel on multiple machines after manually dividing data. Is there an easier way to combine the output of bulk loader?

MichelDiz · September 25, 2018, 2:40pm

Not sure, it can create conflicts with UID generation.

This would be possible if you could manually change the UID range. So putting the second Bulkload for later range (very later in your case). But I think all this would be quite complex to deal with.

And in theory (without conflicts), by gear the second output group, Dgraph Zero would start balancing the load.

system · October 25, 2018, 2:40pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Loading close to 1M edges/sec into Dgraph - Dgraph Blog Blog	3	1464	November 15, 2018
Bulk loader becomes slow when memory gets full Users	20	2169	December 17, 2017
Improve throughput of bulk loader with distributed loading Dgraph dgraph , kind:enhancement , priority:p2 , status:accepted , popular	21	1026	February 6, 2020
I imported 50 billion rdf into dgraph in 15 hours Dgraph	8	571	June 6, 2020
Problem with dgraph bulk -r command Users	8	844	June 15, 2018

Data Ingestion very slow

Related topics