I am trying to ingest 100B RDF records into Dgraph (~1 TB gzip-compressed). My setup is 3 x i3.2xlarge EC2 machines on AWS. The number of edges between nodes is much smaller than the number of nodes in the graph.
I tried running the bulk loader on an i3.16xlarge machine (64 vCPUs, 488 GB RAM) and let it run for 9 hours before giving up. Its last output was:
```
MAP 09h07m02s rdf_count:14.29G err_count:180.8k rdf_speed:435.5k/sec edge_count:36.67G edge_speed:1.117M/sec
processing file (836 out of 6186): /dgraph1/rdf/part-00442-9b4506f3-5b0d-402d-a471-a63ae01b6ec6.rdf.gz
```
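For context, here is my rough back-of-envelope projection from that log line (assuming the map-phase rate stays constant, which is probably optimistic):

```python
# Back-of-envelope: projected MAP-phase duration at the observed rate.
# Numbers come from the bulk loader output above; the constant-rate
# assumption is mine, and throughput may well degrade over time.
total_rdfs = 100e9    # ~100B RDF records to ingest
rdf_speed = 435.5e3   # rdf_speed from the log, records/sec

seconds = total_rdfs / rdf_speed
print(f"Projected MAP phase: {seconds / 3600:.0f} hours "
      f"({seconds / 86400:.1f} days)")
# -> Projected MAP phase: 64 hours (2.7 days), before REDUCE even starts
```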
I also tried running mutations from Spark (200 executors in parallel), and in 50 minutes it performed only 13M mutations (I am sure there are repeats in there too, given that a few tasks failed).
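Roughly what that Spark job does (a simplified sketch, not my exact job: the Alpha endpoint, batch size, input path, and column name are illustrative placeholders, and it goes through Dgraph's HTTP /mutate endpoint with commitNow rather than a native client):

```python
# Sketch of pushing RDF mutations to Dgraph from Spark.
# Assumptions: a single Alpha reachable at ALPHA_URL, and an input
# DataFrame with one string column "nquad" holding N-Quad lines.
import requests
from pyspark.sql import SparkSession

ALPHA_URL = "http://dgraph-alpha:8080/mutate?commitNow=true"  # placeholder host
BATCH = 1000  # N-Quads per mutation request (illustrative)

def flush(nquads):
    # Wrap the batch in an RDF set-mutation and commit it immediately.
    body = "{ set { " + "\n".join(nquads) + " } }"
    resp = requests.post(ALPHA_URL, data=body.encode(),
                         headers={"Content-Type": "application/rdf"})
    resp.raise_for_status()

def mutate_partition(rows):
    # Each executor batches its partition's N-Quads and POSTs them.
    buf = []
    for row in rows:
        buf.append(row.nquad)
        if len(buf) >= BATCH:
            flush(buf)
            buf = []
    if buf:
        flush(buf)

spark = SparkSession.builder.appName("dgraph-load").getOrCreate()
df = spark.read.text("s3://bucket/rdf/").withColumnRenamed("value", "nquad")
df.foreachPartition(mutate_partition)
```

Since each batch commits with commitNow=true, a task that fails partway through and gets retried by Spark re-sends batches that were already committed, which is where the repeats come from.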
I am just wondering whether this throughput is expected at this scale, or whether I need to further tune Dgraph. Let me know if you need more information.