Stuck at REDUCE 100% for more than 30 hours

I’m trying to bulk load an RDF file of about 7.2 billion lines into Dgraph. It takes about 5 hours to reach REDUCE 100.00%, and then it has been stuck at REDUCE 100.00% for more than 35 hours. I didn’t see any error logs. Is that normal? Thanks!

The memory and CPU usage of the dgraph process:

top
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 217284 root      20   0  272.2g  17.6g   4856 S   0.0  28.1   2498:11 dgraph

free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        30Gi       580Mi       234Mi        31Gi        31Gi
Swap:            0B          0B          0B

The current output is stuck like this:

{"log":"[07:59:55Z] REDUCE 04h54m48s 100.00% edge_count:17.64G edge_speed:2.098M/sec plist_count:5.194G plist_speed:617.8k/sec. Num Encoding MBs: 1169. jemalloc: 85 GiB \n","stream":"stdout","time":"2021-03-20T07:59:55.725052084Z"}
{"log":"[07:59:56Z] REDUCE 04h54m49s 100.00% edge_count:17.64G edge_speed:2.098M/sec plist_count:5.194G plist_speed:617.7k/sec. Num Encoding MBs: 1169. jemalloc: 84 GiB \n","stream":"stdout","time":"2021-03-20T07:59:56.725634432Z"}
{"log":"[07:59:57Z] REDUCE 04h54m50s 100.00% edge_count:17.64G edge_speed:2.097M/sec plist_count:5.194G plist_speed:617.7k/sec. Num Encoding MBs: 0. jemalloc: 85 GiB \n","stream":"stdout","time":"2021-03-20T07:59:57.725067779Z"}
{"log":"[07:59:58Z] REDUCE 04h54m51s 100.00% edge_count:17.64G edge_speed:2.097M/sec plist_count:5.194G plist_speed:617.6k/sec. Num Encoding MBs: 0. jemalloc: 85 GiB \n","stream":"stdout","time":"2021-03-20T07:59:58.725784392Z"}
{"log":"[07:59:59Z] REDUCE 04h54m52s 100.00% edge_count:17.64G edge_speed:2.097M/sec plist_count:5.194G plist_speed:617.5k/sec. Num Encoding MBs: 0. jemalloc: 86 GiB \n","stream":"stdout","time":"2021-03-20T07:59:59.72567529Z"}
{"log":"[08:00:00Z] REDUCE 04h54m53s 100.00% edge_count:17.64G edge_speed:2.097M/sec plist_count:5.194G plist_speed:617.4k/sec. Num Encoding MBs: 0. jemalloc: 86 GiB \n","stream":"stdout","time":"2021-03-20T08:00:00.72606354Z"}
{"log":"[08:00:01Z] REDUCE 04h54m54s 100.00% edge_count:17.64G edge_speed:2.096M/sec plist_count:5.194G plist_speed:617.4k/sec. Num Encoding MBs: 0. jemalloc: 86 GiB \n","stream":"stdout","time":"2021-03-20T08:00:01.725897345Z"}
{"log":"[08:00:02Z] REDUCE 04h54m55s 100.00% edge_count:17.64G edge_speed:2.096M/sec plist_count:5.194G plist_speed:617.3k/sec. Num Encoding MBs: 0. jemalloc: 86 GiB \n","stream":"stdout","time":"2021-03-20T08:00:02.72562498Z"}
{"log":"[08:00:03Z] REDUCE 04h54m56s 100.00% edge_count:17.64G edge_speed:2.096M/sec plist_count:5.194G plist_speed:617.2k/sec. Num Encoding MBs: 0. jemalloc: 86 GiB \n","stream":"stdout","time":"2021-03-20T08:00:03.725594303Z"}
{"log":"[08:00:04Z] REDUCE 04h54m57s 100.00% edge_count:17.64G edge_speed:2.096M/sec plist_count:5.194G plist_speed:617.1k/sec. Num Encoding MBs: 0. jemalloc: 86 GiB \n","stream":"stdout","time":"2021-03-20T08:00:04.725355753Z"}

....
....
....

[20:45:25Z] REDUCE 41h40m18s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.91k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
[20:45:26Z] REDUCE 41h40m19s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
[20:45:27Z] REDUCE 41h40m20s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
[20:45:28Z] REDUCE 41h40m21s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
badger 2021/03/21 20:45:29 INFO: Compaction backed off 2704000 times
badger 2021/03/21 20:45:29 INFO: Compaction backed off 2704000 times
[20:45:29Z] REDUCE 41h40m22s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
badger 2021/03/21 20:45:29 INFO: Compaction backed off 2704000 times
badger 2021/03/21 20:45:29 INFO: Compaction backed off 2704000 times
[20:45:30Z] REDUCE 41h40m23s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
[20:45:31Z] REDUCE 41h40m24s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 
[20:45:32Z] REDUCE 41h40m25s 100.00% edge_count:17.64G edge_speed:125.3k/sec plist_count:5.194G plist_speed:36.90k/sec. Num Encoding MBs: 0. jemalloc: 76 GiB 

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 
[Decoder]: Using assembly version of decoder
Page Size: 4096

Dgraph version   : v20.11.1
Dgraph codename  : tchalla-1
Dgraph SHA-256   : cefdcc880c0607a92a1d8d3ba0beb015459ebe216e79fdad613eb0d00d09f134
Commit SHA-1     : 7153d13fe
Commit timestamp : 2021-01-28 15:59:35 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs/.
For discussions about Dgraph     , visit http://discuss.dgraph.io.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.

Have you tried reproducing the issue with the latest release?

Tried dgraph:v20.11.1 and dgraph:v20.11.2; both have this issue.

What is the hardware spec (RAM, OS)?

64 GB RAM, Arch Linux running Docker, NVMe SSD

Steps to reproduce the issue (command/config used to run Dgraph).

Expected behaviour and actual result.



Hi @parly, are you using the Live Loader or the Bulk Loader? It looks like you may be using the Live Loader.

@ibrahim, if OP is using the Live Loader, here is a related issue

The logs are from the bulk loader.


@parly - Please share the command used, add this flag for more logs, and give us a goroutine dump at verbosity level 2.

dgraph bulk (...) --v 3

We need some debug information https://dgraph.io/docs/howto/retrieving-debug-information/

I believe for the bulk loader it would be something like http://localhost:8080/debug/pprof/goroutine?debug=2
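For reference, a minimal way to capture and compress that dump from the command line (assuming the bulk loader's --http endpoint is listening on localhost:8080; adjust host and port to match your setup):

curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=2' -o goroutine-debug2.txt   # plain-text goroutine stacks
gzip goroutine-debug2.txt   # compress before attaching to the forum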

Hi, thanks. I will run it again with higher verbosity and report back once it finishes.

@chewxy It’s from the bulk loader.

Hi,
I reran the reduce phase with the command:

dgraph bulk --v 3 --skip_map_phase --replace_out --mapoutput_mb=2048 --num_go_routines=12 --ignore_errors --out $DGRAPHOUT --tmp $DGRAPHTMP --map_shards=1 --reduce_shards=1 -z $ZEROURL -f $DATADIR.rdf.gz --format rdf -s $SCHEMA --http 0.0.0.0:8080

Attached is the goroutine stack dump after it hung on REDUCE for 1 hour. Previously it hung for around 40 hours before I canceled the task.

From the dump I don’t see any errors, but there are a lot of goroutines in handleBlock. The dgraph bulk load process uses almost no CPU, and the out and tmp directories haven’t changed since it got stuck at 100%, which means there are no I/O writes. The machine still has plenty of available memory.
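For anyone wanting to confirm the same symptoms, a rough way to watch the process and the output directories (hypothetical commands, assuming the same $DGRAPHOUT and $DGRAPHTMP paths as in the bulk command above, and that the sysstat package is installed):

du -sh "$DGRAPHOUT" "$DGRAPHTMP"                        # directory sizes stop growing once writes stall
pidstat -u -d -p "$(pgrep -d, -f 'dgraph bulk')" 5      # per-process CPU and disk I/O, sampled every 5 seconds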

By the way, I can successfully run the bulk loader on a much smaller dataset with the same schema. I also removed all characters other than [a-zA-Z0-9.- ] when preprocessing the data.

Please let me know if you have any suggestions. Thanks!

goroutine?debug=2.gz (134.7 KB)

We have accepted your issue and someone will take a closer look.

Do you think it could be related? What special characters were used?

Thanks. No, I don’t think so, since the RDF file is already preprocessed and doesn’t contain any special characters.

By the way, attached is the dump after 18 hours: goroutine?debug=2.gz (133.9 KB)

Hey @parly, Thanks for reporting the issue. The issue has been fixed in the v20.11.3 release.

What was the issue? Link to the PR?

The issue was compaction back-off. At some point, the LSM tree would have become completely filled up and unable to expand because no compactions were running.

The compaction back-off mechanism has already been removed in the Badger version pointed to by 20.11. We have also removed it from master.

https://github.com/dgraph-io/badger/pull/1686
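As a side note, a quick way to see how often compaction was backing off is to count those log lines, assuming the loader's output was redirected to a file (the file name here is just an example):

grep -c 'Compaction backed off' bulk.log   # number of back-off log lines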


@parly, can you please upgrade to 20.11.3 and let us know if you still see this issue?
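If you are running it through Docker as described above, upgrading should just be a matter of pulling the newer tag (assuming the official dgraph/dgraph image is used):

docker pull dgraph/dgraph:v20.11.3
docker run --rm dgraph/dgraph:v20.11.3 dgraph version   # confirm the binary reports v20.11.3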

v21.12.0
Data: 260 GB

We saw this error today.
The MAP phase ran for almost 32 hours.
The REDUCE phase has been stuck at 1.5% for more than 8 hours.

The Zero config: 32 cores, 256 GB RAM, 2 TB disk
Below is the command:

dgraph bulk -f /coldstart/data -s /coldstart/schema/my_schema.rdf --replace_out --num_go_routines=10 --reducers=2 --format=rdf --store_xids --map_shards=9 --reduce_shards=3 > check.log &