Bulk Loader REDUCE problem - it's very slow

Hi, I’m having trouble with the Bulk Loader, in particular during the REDUCE phase. I simply can’t get it to perform.

The machine averages about 10% CPU load and 10 MB/s disk IO. That’s a machine with 48 processors and NVMe SSDs in a RAID configuration. The machine is not the limit; I just can’t get the Bulk Loader to make use of it. What could be holding it back?

I have tried many different combinations of --map_shards, --reduce_shards, --reducers & --badger.compression. Essentially, those settings make no (speed) difference. Is there some other magic switch?
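For reference, a typical invocation looks something like this (paths and flag values here are illustrative, not my exact setup; the flags are the ones mentioned above):

```shell
# Example bulk load invocation (illustrative paths/values). Varying the
# sharding and compression flags like this made no measurable speed difference.
dgraph bulk \
  -f /data/rdf \
  -s /data/schema.txt \
  --map_shards=4 \
  --reduce_shards=2 \
  --num_go_routines=48 \
  --badger.compression=zstd:1
```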

For the current tests I have 100 rdf.gz files, each a little more than 800 MB. (This represents about 5% of the total dataset.)

The MAP phase takes about 2h but seems to scale well with the number of available cores and the --num_go_routines setting. A bigger machine should help here.

The REDUCE phase takes 6-7h and I’m not able to do anything about this.


Dgraph version   : v20.11.1
Dgraph codename  : tchalla-1
Dgraph SHA-256   : cefdcc880c0607a92a1d8d3ba0beb015459ebe216e79fdad613eb0d00d09f134
Commit SHA-1     : 7153d13fe
Commit timestamp : 2021-01-28 15:59:35 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

The “Loading close to 1M edges/sec into Dgraph” blog post mentions a map/shuffle/reduce paradigm and discusses how having multiple shufflers is critical to performance. When I run the Bulk Loader I don’t really see a shuffle phase, and I’m not aware of a way to control the number of shufflers.

On the Bulk Loader docs page (https://dgraph.io/docs/deploy/fast-data-loading/bulk-loader/) there is a section where the console output of a simple execution is shown. Among the displayed configuration parameters I see:

"NumShufflers": 1,

I don’t see that when I run it.

I’m guessing there has been a redesign of the Bulk Loader tool since that blog post (and the docs) were written, and those changes were not properly tested on large data sets and machines with many cores. The reduce phase doesn’t scale past what you’d get on an average desktop (or good laptop) machine.

That’s strange. Typically the reduce phase is pretty fast. Is your dataset somehow skewed? As in, is there one predicate that’s overwhelmingly larger than the others?

We’d need to get some profiles to understand what the CPU is doing, if you can get us those.


Is it ok if I just add --profile_mode=cpu to a dgraph bulk execution similar to the ones I’ve been doing lately?

Did that – will have the results in 10h or so.

Regarding a possibly skewed dataset: I don’t think so. I will investigate whether it’s possible to share the actual data files.
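One rough way I can check for skew, assuming the files are gzipped N-Triples with the predicate in the second whitespace-separated column, is to count predicate frequencies:

```shell
# Count how often each predicate occurs across the gzipped RDF files.
# A single predicate dominating the output would indicate a skewed dataset.
zcat *.rdf.gz | awk '{ print $2 }' | sort | uniq -c | sort -rn | head
```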

@mrjn I now have a cpu.pprof file, but it’s too large (4.2MB) to upload here. How can I give this to you?

@apete Feel free to email me the CPU profile ([email protected]) and we’ll take a look.
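If size is an issue over email too: pprof files are binary protobuf and usually compress well, so gzipping before attaching can shrink them considerably:

```shell
# Compress the profile before sharing; -k keeps the original file around.
gzip -k cpu.pprof   # produces cpu.pprof.gz
```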

hey @apete, in the profile you shared, it looks like a lot of time is spent collecting information about the memory allocations that we do (via jemalloc). This change was introduced in

I’ll create a PR to fix this and share it here. That should improve the performance of the reduce phase.


Sounds good. I’ll be very happy to test the new version.


Any progress on this? I did try running with v20.11.2. It seemed a bit faster overall, but as far as I could see this problem is not solved.

Hi @apete, we are exploring multiple options to optimize this. We want it to be a correct and performant solution. We should have a more concrete update by the end of this week. Please stay tuned :).


In the mean time, could we give @apete a binary which doesn’t do jemalloc profiling?

@apete please use this binary. It has memory-leak profiling disabled and should resolve your performance issue. In the meantime, we are working on fixing this issue and will update you once it is done.

Do let us know if you still face any issues.
Quick tip: if you get a “permission denied” error using this binary, just set the correct access permissions on the binary using the chmod command.

I have tried that binary. It’s faster than v20.11.2, but not by very much. The whole process now takes about 6.5h.

If I remember correctly v20.11.2 did it in about 7h and v20.11.1 took more than 8h.

Level Done
[06:07:00-0800] REDUCE 06h26m18s 100.00% edge_count:9.732G edge_speed:593.6k/sec plist_count:8.287G plist_speed:505.5k/sec. Num Encoding MBs: 0. jemalloc: 0 B 
Total: 06h26m18s
2021/03/04 06:07:20 profile: cpu profiling disabled, /tmp/profile633888346/cpu.pprof
dgraph bulk -f /mnt/fast1/sentinelData/ExportGraphTriplesSentinel/test -s      346412.47s user 20974.90s system 1583% cpu 6:26:44.37 total

“1583% cpu” – is that the average CPU load? This machine has 24 cores; with hyperthreads that’s 48 “processors”. During the MAP phase CPU usage looks good, but in the REDUCE phase the machine has plenty of power to spare.
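A quick sanity check on that number – 1583% CPU over 48 logical processors works out to roughly a third of the machine being busy on average:

```shell
# 1583% CPU spread over 48 logical processors: fraction of the machine in use.
awk 'BEGIN { printf "%.1f%%\n", 1583 / 48 }'   # → 33.0%
```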

Hey @apete, can you please share one more CPU profile? You can email it to [email protected]. It would be helpful in understanding whether anything is still blocking the performance. We have removed one of the major bottlenecks in this binary.

Did you type that email address correctly? It didn’t work. I sent the file to [email protected] instead and asked him to forward it.


Thanks @apete. Apologies, it got mixed up in the suggested text (I have corrected it now). I will connect with Daniel and get back to you with an update :slight_smile:

Can you guys try with this commit:

That sounds promising. I’ll see if I can build that tomorrow. Is it ok to build on Mac and then run on Linux?

Yes, but you need to set GOOS and GOARCH appropriately. Let me know if you need help.
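Something like this should work, assuming a Go toolchain on the Mac and a checkout of the dgraph repo at that commit (note this is a plain cross-build sketch; extras such as jemalloc sit behind a build tag that needs CGO, which the project’s Makefile normally handles, so they are skipped here):

```shell
# Cross-compile a linux/amd64 dgraph binary from macOS,
# run from the root of the dgraph repository checkout.
GOOS=linux GOARCH=amd64 go build -o dgraph-linux ./dgraph
```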

I was just about to see if I could build a version…

That commit is from Jan 23 and as far as I can see it was merged to master on that day. Would this be something new?

I think I prefer if you provide me with a binary to try.