Bulk Loader REDUCE problem - it's very slow

Hi, I’m having trouble with the Bulk Loader, in particular during the REDUCE phase. I simply can’t get it to perform.

The machine averages about 10% CPU load and 10 MB/s disk IO. That’s a machine with 48 processors and NVMe SSDs in a RAID configuration. The machine is not the limit; I just can’t get the Bulk Loader to make use of it. What could be holding it back?

I have tried many different combinations of --map_shards, --reduce_shards, --reducers & --badger.compression. Essentially, those settings make no (speed) difference. Is there some other magic switch?
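For reference, a typical invocation looks something like this (paths and flag values here are illustrative, not my exact setup; the flags are the ones mentioned above):

```shell
# Example bulk load invocation (illustrative paths/values). Varying the
# sharding and compression flags like this made no measurable speed difference.
dgraph bulk \
  -f /data/rdf \
  -s /data/schema.txt \
  --map_shards=4 \
  --reduce_shards=2 \
  --num_go_routines=48 \
  --badger.compression=zstd:1
```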

For the current tests I have 100 rdf.gz files, each a little more than 800 MB. (This represents about 5% of the total dataset.)

The MAP phase takes about 2h but seems to scale well with the number of available cores and the --num_go_routines setting. A bigger machine should help here.

The REDUCE phase takes 6-7h and I’m not able to do anything about this.


Dgraph version   : v20.11.1
Dgraph codename  : tchalla-1
Dgraph SHA-256   : cefdcc880c0607a92a1d8d3ba0beb015459ebe216e79fdad613eb0d00d09f134
Commit SHA-1     : 7153d13fe
Commit timestamp : 2021-01-28 15:59:35 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

The “Loading close to 1M edges/sec into Dgraph” blog post mentions a map/shuffle/reduce paradigm and discusses how having multiple shufflers is critical to performance. When I run the Bulk Loader I don’t really see a shuffle phase, and I’m not aware of a way to control the number of shufflers.

On the Bulk Loader docs page (https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/) there is a section where the console output of a simple execution is shown. Among the displayed configuration parameters I see:

"NumShufflers": 1,

I don’t see that when I run it.

I’m guessing there has been a redesign of the Bulk Loader tool since that blog post (and the docs) were written, and those changes were not properly tested on large data sets and machines with many cores. The reduce phase doesn’t scale past what you’d get on an average desktop (or good laptop) machine.

That’s strange. Typically the reduce phase is pretty fast. Is your dataset somehow skewed? As in, is there one predicate that’s overwhelmingly larger than the others?

We’d need to get some profiles to understand what the CPU is doing, if you can get us those.


Is it ok if I just add --profile_mode=cpu to a dgraph bulk execution similar to the ones I’ve been doing lately?

Did that – will have the results in 10h or so.

Regarding a possibly skewed dataset: I don’t think so. I will investigate whether it’s possible to share the actual data files.
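One rough way I can check for skew, assuming the files are gzipped N-Triples with the predicate in the second whitespace-separated column, is to count predicate frequencies:

```shell
# Count how often each predicate occurs across the gzipped RDF files.
# A single predicate dominating the output would indicate a skewed dataset.
zcat *.rdf.gz | awk '{ print $2 }' | sort | uniq -c | sort -rn | head
```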

@mrjn I now have a cpu.pprof file, but it’s too large (4.2MB) to upload here. How can I give this to you?

@apete Feel free to email me the CPU profile ([email protected]) and we’ll take a look.
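If size is an issue over email too: pprof files are binary protobuf and usually compress well, so gzipping before attaching can shrink them considerably:

```shell
# Compress the profile before sharing; -k keeps the original file around.
gzip -k cpu.pprof   # produces cpu.pprof.gz
```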

hey @apete, in the profile you shared, it looks like a lot of time is spent collecting information about the memory allocations that we do (via jemalloc). This change was introduced in

I’ll create a PR to fix this and share it here. That should improve the performance of the reduce phase.


Sounds good. I’ll be very happy to test the new version.


Any progress on this? I did try running with v20.11.2. It seemed a bit faster overall, but as far as I could see this problem is not solved.

Hi @apete, we are exploring multiple options to optimize this. We want it to be a correct and performant solution. We should have a more concrete update by the end of this week. Please stay tuned :).


In the mean time, could we give @apete a binary which doesn’t do jemalloc profiling?

@apete please use this binary. It has memory-leak profiling disabled and should resolve your performance issue. In the meantime, we are working on fixing this issue and will update you once it is done.

Do let us know if you still face any issues.
Quick tip: if you get a “permission denied” error using this binary, just set the correct access permissions on the binary using the chmod command.

I have tried that binary. It’s faster than v20.11.2, but not by very much. The whole process now takes about 6.5h.

If I remember correctly v20.11.2 did it in about 7h and v20.11.1 took more than 8h.

Level Done
[06:07:00-0800] REDUCE 06h26m18s 100.00% edge_count:9.732G edge_speed:593.6k/sec plist_count:8.287G plist_speed:505.5k/sec. Num Encoding MBs: 0. jemalloc: 0 B 
Total: 06h26m18s
2021/03/04 06:07:20 profile: cpu profiling disabled, /tmp/profile633888346/cpu.pprof
dgraph bulk -f /mnt/fast1/sentinelData/ExportGraphTriplesSentinel/test -s      346412.47s user 20974.90s system 1583% cpu 6:26:44.37 total

“1583% cpu” – is that the average CPU load? This machine has 24 cores; with hyperthreads that’s 48 “processors”. During the MAP phase CPU usage looks good, but in the REDUCE phase the machine has plenty of power to spare.
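A quick sanity check on that number – 1583% CPU over 48 logical processors works out to roughly a third of the machine being busy on average:

```shell
# 1583% CPU spread over 48 logical processors: fraction of the machine in use.
awk 'BEGIN { printf "%.1f%%\n", 1583 / 48 }'   # → 33.0%
```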

Hey @apete, can you please share one more CPU profile? You can email it to [email protected]. It would be helpful in understanding whether anything is still blocking the performance. We have removed one of the major bottlenecks in this binary.

Did you type that email address correctly? It didn’t work. I sent the file to [email protected] instead and asked him to forward it.


Thanks @apete. Apologies, it got mixed up in the suggested text (I have corrected it now). I will connect with Daniel and get back to you with an update :slight_smile:

Can you guys try with this commit:

That sounds promising. I’ll see if I can build that tomorrow. Is it ok to build on Mac and then run on Linux?

Yes, but you need to set GOOS and GOARCH appropriately. Let me know if you need help.
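Something like this should work, assuming a Go toolchain on the Mac and a checkout of the dgraph repo at that commit (note this is a plain cross-build sketch; extras such as jemalloc sit behind a build tag that needs CGO, which the project’s Makefile normally handles, so they are skipped here):

```shell
# Cross-compile a linux/amd64 dgraph binary from macOS,
# run from the root of the dgraph repository checkout.
GOOS=linux GOARCH=amd64 go build -o dgraph-linux ./dgraph
```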

I was just about to see if I could build a version…

That commit is from Jan 23 and as far as I can see it was merged to master on that day. Would this be something new?

I think I prefer if you provide me with a binary to try.