Bulk Loader REDUCE problem - it's very slow

Hi, I’m having trouble with the Bulk Loader, specifically during the REDUCE phase. I simply can’t get it to perform.

The machine averages about 10% CPU load and 10 MB/s disk IO. That’s a machine with 48 processors and NVMe SSDs in a RAID configuration. The machine is not the bottleneck; I just can’t get the Bulk Loader to make use of it. What could be holding it back?

I have tried many different combinations of --map_shards, --reduce_shards, --reducers & --badger.compression. Essentially those settings make no (speed) difference. Is there some other magic switch?
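For reference, my invocations look roughly like this — the paths and flag values below are illustrative, not the exact command I ran:

```shell
# Illustrative bulk load invocation; paths and flag values are examples only.
dgraph bulk \
  --files /data/rdf \
  --schema /data/schema.txt \
  --map_shards 4 \
  --reduce_shards 2 \
  --reducers 2 \
  --badger.compression zstd:1 \
  --num_go_routines 48
```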

For my current tests I have 100 rdf.gz files, each a little more than 800 MB. (This represents about 5% of the total dataset.)

The MAP phase takes about 2h but seems to scale well with the number of available cores and the --num_go_routines setting. A bigger machine should help here.

The REDUCE phase takes 6-7h and I haven’t been able to do anything about that.


Dgraph version   : v20.11.1
Dgraph codename  : tchalla-1
Dgraph SHA-256   : cefdcc880c0607a92a1d8d3ba0beb015459ebe216e79fdad613eb0d00d09f134
Commit SHA-1     : 7153d13fe
Commit timestamp : 2021-01-28 15:59:35 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

The https://dgraph.io/blog/post/bulkloader/ post mentions a map/shuffle/reduce paradigm and discusses how having multiple shufflers is critical to performance. When I run the Bulk Loader I don’t really see a shuffle phase and I’m not aware of a way to control the number of shufflers.

On the Bulk Loader docs page (https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/) there is a section where the console output of a simple execution is shown. Among the displayed configuration parameters I see:

"NumShufflers": 1,

I don’t see that parameter when I run the Bulk Loader.

I’m guessing there has been a redesign of the Bulk Loader tool since that blog post (and the docs) were written, and those changes were not properly tested on large data sets and machines with many cores. The reduce phase doesn’t scale past what you’d have on an average desktop (or good laptop) machine.

That’s strange. Typically the reduce phase is pretty fast. Is your dataset somehow skewed? As in, is there one predicate that’s overwhelmingly larger than the others?

We’d need to get some profiles to understand what the CPU is doing, if you can get us those.


Is it ok if I just add --profile_mode=cpu to a dgraph bulk execution similar to the ones I’ve been doing lately?

Did that – will have the results in 10h or so.
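For anyone following along, this is roughly what the profiling run looks like, plus how the resulting profile can be inspected locally with Go’s pprof tool (the exact bulk flags are elided; `cpu.pprof` is the file the run produced for me):

```shell
# Capture a CPU profile during the bulk load by adding the flag
# to the usual invocation (other flags elided here).
dgraph bulk --profile_mode=cpu ...

# Inspect the resulting profile; 'top' lists the hottest functions.
go tool pprof -top cpu.pprof
```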

Regarding a possibly skewed dataset: I don’t think so. I’ll investigate whether it’s possible to share the actual data files.

@mrjn I now have a cpu.pprof file, but it’s too large (4.2MB) to upload here. How can I give this to you?

@apete Feel free to email me the CPU profile (daniel@dgraph.io) and we’ll take a look.

hey @apete, in the profile you shared, it looks like a lot of time is spent collecting information about the memory allocations that we do (via jemalloc). This change was introduced in

I’ll create a PR to fix this and share it here. That should improve the performance of the reduce phase.


Sounds good. I’ll be very happy to test the new version.


Any progress on this? I did try running with v20.11.2. It seemed a bit faster overall, but as far as I could see this problem is not solved.

Hi @apete, we are exploring multiple options to optimize this. We want it to be a correct and performant solution. We should have a more concrete update by the end of this week. Please stay tuned :).


In the meantime, could we give @apete a binary which doesn’t do jemalloc profiling?

@apete please use this binary. It has memory-leak profiling disabled and should resolve your performance issue. In the meantime, we are working on fixing this issue and we will update you once it is done.

Do let us know if you still face any issues.
Quick tip: If you face a permission denied error when using this binary, just set the correct access permission on it using the chmod command.
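For example, assuming the downloaded binary is named `dgraph` in the current directory:

```shell
# Make the downloaded binary executable, then verify it runs.
chmod +x ./dgraph
./dgraph version
```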