Why does Dgraph stop responding to all requests, printing only "Rolling up" or "LOG Compact" logs, while I run an insert test?

Dgraph version : v1.1.1
Dgraph SHA-256 : 20a10e9db316b565464e0a04cea09657eebbaea3b91460ab651997ada63243cc
Commit SHA-1 : 8994a5740
Commit timestamp : 2019-12-16 18:24:50 -0800
Branch : HEAD
Go version : go1.13.5

OS INFO:
8Cores, 15GB Memory, Linux ss1.bessystem.com 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

mutation client:
Upsert data: 1 billion. Thread number: 1. One batch for upsert: 100
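The load pattern above (a single thread sending upserts in batches of 100) can be sketched roughly as follows. This is an illustrative stand-in, not the actual test client: the real code would open a dgraph4j transaction and mutate each batch, which is only indicated by a comment here.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchUpsert {
    static final int BATCH_SIZE = 100;

    // Accumulates RDF triples and counts how many batches a run
    // of the given size would send to the server.
    static int countBatches(int totalRecords) {
        List<String> batch = new ArrayList<>();
        int batchesSent = 0;
        for (int i = 0; i < totalRecords; i++) {
            batch.add("_:u" + i + " <FriendTest> \"v" + i + "\" .");
            if (batch.size() == BATCH_SIZE) {
                // a real client would open a transaction here,
                // mutate the batch, and commit it
                batchesSent++;
                batch.clear();
            }
        }
        if (!batch.isEmpty()) batchesSent++; // flush the remainder
        return batchesSent;
    }

    public static void main(String[] args) {
        System.out.println(countBatches(1000)); // 10 batches of 100
    }
}
```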

When I run a performance test against Dgraph, I start a client that continuously writes data to the server. After a period of time (10–20 minutes), the client is blocked by the Dgraph server for a long time, and the Alpha server outputs the following logs:


I0111 15:41:20.345895   18569 log.go:34] Rolling up Created batch of size: 11 MB in 2.088280354s.
I0111 15:41:22.001260   18569 log.go:34] Rolling up Created batch of size: 6.6 MB in 1.641390541s.
I0111 15:41:22.001338   18569 log.go:34] Rolling up Sent 1796032 keys
I0111 15:41:22.084420   18569 draft.go:1076] Rolled up 1796003 keys. Done
I0111 15:41:22.084526   18569 draft.go:425] List rollup at Ts 163069: OK.
I0111 15:41:22.087491   18569 predicate_move.go:191] Was instructed to delete tablet: FriendTest
I0111 15:41:22.089882   18569 index.go:1000] Dropping predicate: [FriendTest]
I0111 15:41:22.089948   18569 log.go:34] Writes flushed. Stopping compactions now...
W0111 15:41:32.085107   18569 draft.go:1107] While sending membership to Zero. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0111 15:41:43.357438   18569 log.go:34] Got compaction priority: {level:0 score:1.74 dropPrefix:[0 0 10 70 114 105 101 110 100 84 101 115 116]}
I0111 15:41:43.357609   18569 log.go:34] Running for level: 0
I0111 15:45:50.311154   18569 log.go:34] LOG Compact 0->1, del 7 tables, add 3 tables, took 4m6.95344879s
I0111 15:45:50.311272   18569 log.go:34] Compaction for level: 0 DONE
I0111 15:46:14.950379   18569 draft.go:1168] Found 1 old transactions. Acting to abort them.
I0111 15:46:14.955593   18569 draft.go:1129] TryAbort 1 txns with start ts. Error: <nil>
I0111 15:46:14.955694   18569 draft.go:1152] TryAbort selectively proposing only aborted txns: txns:<start_ts:163406 >
I0111 15:47:14.950493   18569 draft.go:1168] Found 1 old transactions. Acting to abort them.
I0111 15:47:14.955148   18569 draft.go:1129] TryAbort 1 txns with start ts. Error: <nil>
I0111 15:47:14.955191   18569 draft.go:1152] TryAbort selectively proposing only aborted txns: txns:<start_ts:163406 >
I0111 15:47:17.256523   18569 log.go:34] LOG Compact 1->1, del 1 tables, add 1 tables, took 1m26.945152819s

and the Zero server outputs the following logs:

I0111 15:18:25.568738   17864 oracle.go:107] Purged below ts:155885, len(o.commits):211, len(o.rowCommit):148400
W0111 15:18:25.568943   17864 raft.go:729] Raft.Ready took too long to process: Timer Total: 618ms. Breakdown: [{proposals 616ms} {sync 2ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
I0111 15:18:26.320982   17864 raft.go:616] Writing snapshot at index: 76710, applied mark: 76939
I0111 15:21:22.039188   17864 oracle.go:107] Purged below ts:163069, len(o.commits):34, len(o.rowCommit):24500
W0111 15:21:22.039719   17864 raft.go:729] Raft.Ready took too long to process: Timer Total: 1.041s. Breakdown: [{proposals 1.035s} {sync 6ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
I0111 15:21:26.328691   17864 raft.go:616] Writing snapshot at index: 80340, applied mark: 80399
I0111 15:21:45.202544   17864 zero.go:696] Tablet: FriendTest does not belong to group: 1. Sending delete instruction.
W0111 15:21:55.196113   17864 zero.go:660] While deleting predicates: rpc error: code = Canceled desc = context canceled
I0111 15:41:22.085781   17864 zero.go:696] Tablet: FriendTest does not belong to group: 1. Sending delete instruction.
W0111 15:41:32.085464   17864 zero.go:660] While deleting predicates: rpc error: code = Canceled desc = context canceled

During this time, the performance-test clients cannot continue inserting data and remain blocked. The server does not process any requests, as if it were hung.
After a few minutes, the Alpha server outputs the following log:

alpha-stdout-stderr.log

I0111 15:47:17.256605   18569 log.go:34] DropPrefix done
I0111 15:47:17.256697   18569 log.go:34] Resuming writes
I0111 15:47:17.256764   18569 schema.go:79] Deleting schema for predicate: [FriendTest]

The server then resumes from the suspended state and starts processing insert requests again.
My question is: why is the server suspended for such a long time? Am I using Dgraph incorrectly? The following commands are how I start Dgraph:

nohup ./dgraph zero >> zero.log &
nohup ./dgraph alpha --lru_mb 4096 --zero localhost:5080 >> alpha.log &

When the above problem appeared, I tried modifying the commands as follows:

nohup ./dgraph zero >> zero.log &
nohup ./dgraph alpha --max_retries 10 --pending_proposals 20 --snapshot_after 5000 --lru_mb 4096 --zero localhost:5080 >> alpha.log &

The problem still exists. How should I solve it? It looks like a serious server-hang problem.

@martinmr fixed a similar issue where Zero kept trying to delete Dgraph types over and over again, but because types are not predicates, it was unable to delete them and kept looping. This sounds similar. Can you confirm that FriendTest is a type?

Yes, I can confirm that. But my question is why the Alpha server blocks the client for such a long time when I use the Java client to continuously mutate RDF data into the server.

For every transaction, I call the commit operation in the try block and the close operation in the finally block.
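The transaction pattern described above can be sketched like this. Txn is a stub standing in for a real client transaction (e.g. dgraph4j's Transaction); the method names here are illustrative, not the real client API. The point is the shape: commit in the try block, close in the finally block so cleanup always runs.

```java
public class TxnPattern {
    // Stub transaction; a real client transaction would talk to Alpha.
    static class Txn implements AutoCloseable {
        boolean committed = false;
        boolean closed = false;
        void mutate(String rdf) { /* send the mutation */ }
        void commit() { committed = true; }
        @Override public void close() {
            // a real close would discard the txn if commit never
            // happened, then release its resources
            closed = true;
        }
    }

    static Txn runOneBatch(String rdf) {
        Txn txn = new Txn();
        try {
            txn.mutate(rdf);
            txn.commit();
        } finally {
            txn.close(); // always runs, even if mutate/commit throws
        }
        return txn;
    }

    public static void main(String[] args) {
        Txn t = runOneBatch("_:a <name> \"x\" .");
        System.out.println(t.committed + " " + t.closed);
    }
}
```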

This is probably happening because compaction is running at L0. The current compression is also taking a lot of time. I tried running @wangweidonghb’s script on @ibrahim’s branch, which has a fix for this; I ran the script for 2 hours and was not able to reproduce the issue.


Yep, I hit the same problem (real-time streaming graph data). It caused my project to lose data during the blocked period, and performance suffered badly.