Bulk Loader REDUCE problem - it's very slow

Hi, I’m having trouble with the Bulk Loader, specifically during the REDUCE phase. I simply can’t get it to perform.

The machine averages about 10% CPU load and 10 MB/s disk IO. That’s a machine with 48 processors and NVMe SSDs in a RAID configuration. The machine is not the bottleneck; I just can’t get the Bulk Loader to make use of it. What could be holding it back?

I have tried many different combinations of --map_shards, --reduce_shards, --reducers & --badger.compression. Essentially those settings make no (speed) difference. Is there some other magic switch?
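For reference, my invocations look roughly like this — the paths and flag values below are illustrative, not the exact command I ran:

```shell
# Illustrative bulk load invocation; paths and flag values are examples only.
dgraph bulk \
  --files /data/rdf \
  --schema /data/schema.txt \
  --map_shards 4 \
  --reduce_shards 2 \
  --reducers 2 \
  --badger.compression zstd:1 \
  --num_go_routines 48
```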

For my current tests I have 100 rdf.gz files, each a little more than 800 MB. (This represents about 5% of the total dataset.)

The MAP phase takes about 2h but seems to scale well with the number of available cores and the --num_go_routines setting. A bigger machine should help here.

The REDUCE phase takes 6-7h and I haven’t been able to do anything about that.


Dgraph version   : v20.11.1
Dgraph codename  : tchalla-1
Dgraph SHA-256   : cefdcc880c0607a92a1d8d3ba0beb015459ebe216e79fdad613eb0d00d09f134
Commit SHA-1     : 7153d13fe
Commit timestamp : 2021-01-28 15:59:35 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

The https://dgraph.io/blog/post/bulkloader/ post mentions a map/shuffle/reduce paradigm and discusses how having multiple shufflers is critical to performance. When I run the Bulk Loader I don’t really see a shuffle phase and I’m not aware of a way to control the number of shufflers.

On the Bulk Loader docs page (https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/) there is a section where the console output of a simple execution is shown. Among the displayed configuration parameters I see:

"NumShufflers": 1,

I don’t see that parameter when I run the Bulk Loader.

I’m guessing there has been a redesign of the Bulk Loader tool since that blog post (and the docs) were written, and those changes were not properly tested on large data sets and machines with many cores. The reduce phase doesn’t scale past what you’d have on an average desktop (or good laptop) machine.

That’s strange. Typically the reduce phase is pretty fast. Is your dataset somehow skewed? As in, is there one predicate that’s overwhelmingly larger than the others?

We’d need to get some profiles to understand what the CPU is doing, if you can get us those.


Is it ok if I just add --profile_mode=cpu to a dgraph bulk execution similar to the ones I’ve been doing lately?

Did that – will have the results in 10h or so.
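For anyone following along, this is roughly what the profiling run looks like, plus how the resulting profile can be inspected locally with Go’s pprof tool (the exact bulk flags are elided; `cpu.pprof` is the file the run produced for me):

```shell
# Capture a CPU profile during the bulk load by adding the flag
# to the usual invocation (other flags elided here).
dgraph bulk --profile_mode=cpu ...

# Inspect the resulting profile; 'top' lists the hottest functions.
go tool pprof -top cpu.pprof
```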

Regarding a possibly skewed dataset: I don’t think so. I’ll investigate whether it’s possible to share the actual data files.

@mrjn I now have a cpu.pprof file, but it’s too large (4.2MB) to upload here. How can I give this to you?

@apete Feel free to email me the CPU profile (daniel@dgraph.io) and we’ll take a look.

hey @apete, in the profile you shared, it looks like a lot of time is spent collecting information about the memory allocations that we do (via jemalloc). This change was introduced in

I’ll create a PR to fix this and share it here. That should improve the performance of the reduce phase.


Sounds good. I’ll be very happy to test the new version.


Any progress on this? I did try running with v20.11.2. It seemed a bit faster overall, but as far as I could see this problem is not solved.

Hi @apete, we are exploring multiple options to optimize this. We want it to be a correct and performant solution. We should have a more concrete update by the end of this week. Please stay tuned :).


In the meantime, could we give @apete a binary which doesn’t do jemalloc profiling?

@apete please use this binary. It has memory-leak profiling disabled and should resolve your performance issue. In the meantime, we are working on fixing this issue and we will update you once it is done.

Do let us know if you still face any issues.
Quick tip: If you face a permission denied error when using this binary, just set the correct access permission on it using the chmod command.
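For example, assuming the downloaded binary is named `dgraph` in the current directory:

```shell
# Make the downloaded binary executable, then verify it runs.
chmod +x ./dgraph
./dgraph version
```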