Dgraph does not respond to any requests while it logs "LOG Compact" during online mutations or live load

Moved from GitHub dgraph/4960

Posted by JimWen:

What version of Dgraph are you using?

Dgraph version : v1.2.1
Dgraph SHA-256 : 3f18ff84570b2944f4d75f6f508d55d902715c7ca2310799cc2991064eb046f8
Commit SHA-1 : ddcda9296
Commit timestamp : 2020-02-06 15:31:05 -0800
Branch : HEAD
Go version : go1.13.5

Have you tried reproducing the issue with the latest release?

With v1.2.1

What is the hardware spec (RAM, OS)?

128 GB RAM & 1.8 TB SSD

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

Steps to reproduce the issue (command/config used to run Dgraph).

3-node cluster

Zero (rebalance_interval is set very high here to disable auto rebalance):
nohup $dgpath zero --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --rebalance_interval 12000h > ./log/zero$index.log 2>&1 &

Alpha:
nohup $dgpath alpha --snapshot_after=1000 --abort_older_than 15m --my=$local_ip:$port -o $offset --zero=$zero_1:5080 --config $cfpath --cwd alpha$index > ./log/alpha$index.log 2>&1 &

Mutations through the API, or a live load of about 10 billion N-Quads.
The command line is like:

./bin/dgraph live -b 1000 -c 20 -f ./data/2020-03-01 --format rdf -a "172.21.46.35:9080,172.21.46.39:9080,172.21.46.40:9080"

Expected behaviour and actual result.

The server works fine at first, then stays in a suspended state for a long time, then resumes, then suspends again, and so on.
The log is as follows:
I0318 12:14:37.877077 272392 log.go:34] Rolling up Sent 4905402 keys
I0318 12:14:37.877101 272392 draft.go:1102] Rolled up 4905396 keys. Done
I0318 12:14:37.877110 272392 draft.go:445] List rollup at Ts 10799123: OK.
I0318 12:14:37.878363 272392 predicate_move.go:195] Was instructed to delete tablet: dgraph.password
I0318 12:14:37.878618 272392 index.go:1083] Dropping predicate: [dgraph.password]
I0318 12:14:37.878632 272392 log.go:34] Writes flushed. Stopping compactions now…
E0318 12:15:07.114528 272392 log.go:32] unable to process discardstats with error: failed to push discard stats to write channel: Writes are blocked, possibly due to DropAll or Close
I0318 12:15:07.293204 272392 log.go:34] LOG Compact 1->2, del 9 tables, add 9 tables, took 41.192207806s
I0318 12:15:07.293262 272392 log.go:34] Compaction for level: 1 DONE
I0318 12:15:07.293311 272392 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
I0318 12:15:07.293339 272392 log.go:34] Running for level: 0
I0318 12:15:24.745589 272392 log.go:34] LOG Compact 0->1, del 9 tables, add 4 tables, took 17.45222122s
I0318 12:15:24.745642 272392 log.go:34] Compaction for level: 0 DONE
I0318 12:15:24.802984 272392 log.go:34] Got compaction priority: {level:0 score:1.74 dropPrefix:[0 0 15 100 103 114 97 112 104 46 112 97 115 115 119 111 114 100]}
I0318 12:15:24.803061 272392 log.go:34] Running for level: 0
I0318 12:15:39.769220 272392 draft.go:1353] Skipping snapshot at index: 10672779. Insufficient discard entries: 1. MinPendingStartTs: 10825054
I0318 12:15:41.545892 272392 log.go:34] LOG Compact 0->1, del 6 tables, add 4 tables, took 16.742809109s
I0318 12:15:41.545935 272392 log.go:34] Compaction for level: 0 DONE
I0318 12:15:43.236355 272392 log.go:34] LOG Compact 1->1, del 1 tables, add 1 tables, took 1.690368584s
I0318 12:15:46.814592 272392 log.go:34] LOG Compact 2->2, del 1 tables, add 1 tables, took 3.578099591s
I0318 12:15:51.135498 272392 log.go:34] LOG Compact 3->3, del 1 tables, add 1 tables, took 4.320559547s
I0318 12:15:51.135522 272392 log.go:34] DropPrefix done
I0318 12:15:51.135540 272392 log.go:34] Resuming writes

This seems to be an old problem, but it is still present, and it severely affects the online business and performance.
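To measure how long each suspended period lasts, one option is to poll the Alpha's /health endpoint from another shell and log every window in which it stops answering. This is only a sketch; the host and the default HTTP port 8080 below are assumptions and need adjusting for the -o offset used in the alpha command above.

# Log a line whenever the Alpha stops answering /health within 2 seconds
while true; do
  if ! curl -s -m 2 -o /dev/null "http://172.21.46.35:8080/health"; then
    echo "$(date '+%F %T') alpha not responding"
  fi
  sleep 1
done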

JimWen commented:

Can anybody help me?

MichelDiz commented:

Lower the -c 20 flag to -c 1 to see if the problem continues.

What replication factor is used?

Are you using a localhost Zero instance? I don’t see the '--zero' flag on the command used for the live load.

JimWen commented:

Lower the -c 20 flag to -c 1 to see if the problem continues.

What replication factor is used?

Are you using a localhost Zero instance? I don’t see the '--zero' flag on the command used for the live load.

The replication factor is set to 1.
--zero is the default, 127.0.0.1:5080.
Why should I lower the -c 20 flag to -c 1? As I see it, that is to reduce conflicts, but there are few conflicts here.

MichelDiz commented:

I’m just troubleshooting step by step with you until an engineer is available to take over.
I don’t see how else to help; someone with more in-depth knowledge of the core code needs to analyze your case. The more information you can provide, the better.

The -c 1 flag decreases the “pressure” of the load.
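For reference, the same live load with the concurrency lowered and the Zero address given explicitly could look like the line below; the Zero address is an assumption based on the default port 5080 running on the first node.

./bin/dgraph live -b 1000 -c 1 -f ./data/2020-03-01 --format rdf -a "172.21.46.35:9080,172.21.46.39:9080,172.21.46.40:9080" -z "172.21.46.35:5080"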

Some profiling would help too: https://docs.dgraph.io/howto/#profiling-information
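Following that page, profiles can be pulled from the Alpha's pprof endpoints; the host and the default HTTP port 8080 below are assumptions and need adjusting for the -o offset used when starting the alpha.

# 30-second CPU profile from the Alpha
go tool pprof "http://172.21.46.35:8080/debug/pprof/profile?seconds=30"
# Heap profile
go tool pprof "http://172.21.46.35:8080/debug/pprof/heap"
# Full goroutine dump, useful while the server appears stuck
curl "http://172.21.46.35:8080/debug/pprof/goroutine?debug=2" > goroutines.txt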

aakside commented:

I’m looking into this. A recent change prevents blocking writes on level 1 compaction, which should reduce unresponsiveness somewhat.