I went through the post "how to import 4 billion rdf to dgraph quickly" (Issue #1323 · dgraph-io/dgraph · GitHub) and found my situation very similar to it, except that I have already tried dgraph-bulk-loader.
I have a personal server with a 32-core CPU, 64 GB of memory, and a 1.5 TB SSD, and the dataset is around 3 billion edges.
At first edge_speed was very high, up to 2M edges/sec. However, once memory usage reached about 90% of the total (it then stayed at that level, which I assume is the loader's strategy for avoiding out-of-memory), edge_speed dropped gradually.
The import hasn't finished yet, so I can only share the progress report at around the 4-hour mark below; I expect the speed to fall below 100k/sec before it is done:
MAP 04h15m28s rdf_count:570.9M rdf_speed:37.24k/sec edge_count:1.671G edge_speed:109.0k/sec
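To put that guess on a rough footing, here is a back-of-envelope calculation using the figures from the report above. The 3-billion-edge total is my own dataset size from earlier, and the real remaining time will be longer since the rate is still falling:

```go
package main

import "fmt"

func main() {
	// Figures from the bulk loader's progress line after ~4h15m:
	//   edge_count:1.671G  edge_speed:109.0k/sec
	// The ~3 billion total edges is the dataset size I mentioned above.
	const (
		totalEdges   = 3.0e9   // approximate dataset size
		mappedEdges  = 1.671e9 // edges mapped so far
		currentSpeed = 109.0e3 // edges/sec, and still trending downward
	)

	remaining := totalEdges - mappedEdges
	hours := remaining / currentSpeed / 3600

	// Even at the current (already declining) rate, the MAP phase alone
	// needs at least this many more hours; the real figure will be higher.
	fmt.Printf("remaining MAP time >= %.1f hours\n", hours)
}
```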
While memory and the SSD are busy, the CPU is mostly idle:
%Cpu(s): 4.4 us, 2.6 sy, 0.0 ni, 37.6 id, 55.3 wa, 0.0 hi, 0.1 si, 0.0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18464 hj 20 0 0.099t 0.058t 37216 S 347.4 94.1 1447:13 dgraph-bulk-loa
The post shows the bulk loader's impressive import speed, but it is a pity that it doesn't take memory scalability into consideration. Many users like me and the reporter of issues/1323 run Dgraph on personal servers, where memory is not as easy to scale as on cloud servers. I could max out my memory at 128 GB, but I'm not sure it would make much difference, since it would still fill up easily.
I'm just wondering whether the bulk loader could optimize its swapping/flushing strategy, or make better use of the CPU, when memory fills up.
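To make the suggestion concrete, below is a minimal sketch of what I have in mind, not dgraph's actual implementation. The names (batchBytes, maxInFlight, processAndFlush) and the budget numbers are all my own stand-ins: the idea is simply to cap how many map batches stay resident with a semaphore, so the loader flushes sorted chunks to disk early and keeps the cores busy on parse/sort/compress work instead of letting RSS grow until it stalls on I/O:

```go
package main

import (
	"fmt"
	"sync"
)

const (
	batchBytes      = 64 << 20 // hypothetical size of one in-memory map batch
	maxInFlight     = 8        // memory budget = maxInFlight * batchBytes
	numBatchesTotal = 32       // stand-in for batches produced from the RDF stream
)

func main() {
	sem := make(chan struct{}, maxInFlight) // bounds resident batches
	var wg sync.WaitGroup

	for i := 0; i < numBatchesTotal; i++ {
		sem <- struct{}{} // block the reader once the memory budget is spent
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // release budget once the chunk is flushed
			processAndFlush(id)
		}(i)
	}
	wg.Wait()
}

// processAndFlush stands in for: parse RDF, build map entries, sort them,
// and write the chunk out to the temp directory on the SSD.
func processAndFlush(id int) {
	fmt.Printf("flushed map chunk %d (~%d MB)\n", id, batchBytes>>20)
}
```

With a fixed budget like this, filling memory to 90% and slowing down would be replaced by a steady stream of smaller flushes, which should trade a bit of extra disk traffic for much more even CPU utilization. Whether that is feasible inside the current map/reduce design is of course up to the dgraph team.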