Moved from GitHub dgraph/2628
Posted by danielmai:
Experience Report
The Dgraph bulk loader is the fastest way to load data into Dgraph, running at close to 1M edges/sec. This currently satisfies most users, but for extremely large data sets on the order of terabytes, bulk loading the entire data set takes days, if not weeks.
What you wanted to do
Complete a bulk load of a multi-terabyte RDF triple data set in a timely manner.
What you actually did
Ran the bulk loader on a multi-terabyte RDF triple data set on an i3.metal AWS instance with 14 TB of SSD space.
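For context, a run like this uses the `dgraph bulk` subcommand. The invocation below is only illustrative: the file paths are hypothetical, and exact flag names vary between Dgraph versions.

```sh
# Hypothetical paths. --map_shards/--reduce_shards only control
# parallelism within a single machine, which is the limitation here.
dgraph bulk -f data.rdf.gz -s data.schema \
  --map_shards=4 --reduce_shards=1 \
  --tmp /mnt/ssd/tmp --out /mnt/ssd/out
```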
Why that wasn’t great, with examples
The bulk loader job did not finish on the i3.metal instance: disk space ran out during the map phase. The map phase writes intermediate key-value data to the tmp directory before the reduce phase begins, and for a multi-terabyte input that intermediate output, alongside the input itself, exceeded the 14 TB of SSD available.
What could be improved
Since the bulk loader's map phase cannot complete on a single machine for data sets this large, a distributed map-reduce bulk loader would make such loads possible at all, while also increasing throughput enough to cut the wait from weeks to days or hours.
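To make the suggestion concrete, here is a minimal sketch of the partitioning step a distributed map phase would need: hash each triple to one of N mapper machines so that intermediate map output lands on many disks instead of one. This is an illustration, not Dgraph's actual design; the `Triple` type, `assignMapper` function, and `mapperAddrs` list are all hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Triple is a simplified RDF triple; the real bulk loader works on
// parsed N-Quads, but this stands in for the unit of map work.
type Triple struct {
	Subject, Predicate, Object string
}

// assignMapper hashes the subject so that all triples for one entity
// land on the same mapper node, keeping each node's intermediate map
// output disjoint instead of piling up on a single machine's disk.
func assignMapper(t Triple, numMappers int) int {
	h := fnv.New32a()
	h.Write([]byte(t.Subject))
	return int(h.Sum32()) % numMappers
}

func main() {
	// Hypothetical pool of mapper nodes, each with its own local SSD.
	mapperAddrs := []string{"mapper-0:7080", "mapper-1:7080", "mapper-2:7080"}

	triples := []Triple{
		{"<0x01>", "<name>", `"Alice"`},
		{"<0x02>", "<name>", `"Bob"`},
		{"<0x01>", "<friend>", "<0x02>"},
	}

	// A real loader would stream each triple (or batches of them) to its
	// assigned mapper over the network; here we just print the plan.
	for _, t := range triples {
		idx := assignMapper(t, len(mapperAddrs))
		fmt.Printf("%v -> %s\n", t, mapperAddrs[idx])
	}
}
```

Partitioning by subject means each mapper's intermediate output is disjoint, so a reduce phase could pull shards from every machine in parallel rather than contending for one disk.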