Dgraph version : v1.1.1
Dgraph SHA-256 : 20a10e9db316b565464e0a04cea09657eebbaea3b91460ab651997ada63243cc
Commit SHA-1 : 8994a5740
Commit timestamp : 2019-12-16 18:24:50 -0800
Branch : HEAD
Go version : go1.13.5
OS INFO:
8Cores, 15GB Memory, Linux ss1.bessystem.com 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
mutation client:
Upsert data: 1 billion. Thread number: 1. One batch for upsert: 100
When I perform a performance test on Dgraph, I start a client to continuously write data to dgraph。 After a period of time(10~20minutes)。The client is blocked by dgraph server with a long time and the server of “Alpha” output the following logs
I0111 15:41:20.345895 18569 log.go:34] Rolling up Created batch of size: 11 MB in 2.088280354s.
I0111 15:41:22.001260 18569 log.go:34] Rolling up Created batch of size: 6.6 MB in 1.641390541s.
I0111 15:41:22.001338 18569 log.go:34] Rolling up Sent 1796032 keys
I0111 15:41:22.084420 18569 draft.go:1076] Rolled up 1796003 keys. Done
I0111 15:41:22.084526 18569 draft.go:425] List rollup at Ts 163069: OK.
I0111 15:41:22.087491 18569 predicate_move.go:191] Was instructed to delete tablet: FriendTest
I0111 15:41:22.089882 18569 index.go:1000] Dropping predicate: [FriendTest]
I0111 15:41:22.089948 18569 log.go:34] Writes flushed. Stopping compactions now...
W0111 15:41:32.085107 18569 draft.go:1107] While sending membership to Zero. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0111 15:41:43.357438 18569 log.go:34] Got compaction priority: {level:0 score:1.74 dropPrefix:[0 0 10 70 114 105 101 110 100 84 101 115 116]}
I0111 15:41:43.357609 18569 log.go:34] Running for level: 0
I0111 15:45:50.311154 18569 log.go:34] LOG Compact 0->1, del 7 tables, add 3 tables, took 4m6.95344879s
I0111 15:45:50.311272 18569 log.go:34] Compaction for level: 0 DONE
I0111 15:46:14.950379 18569 draft.go:1168] Found 1 old transactions. Acting to abort them.
I0111 15:46:14.955593 18569 draft.go:1129] TryAbort 1 txns with start ts. Error: <nil>
I0111 15:46:14.955694 18569 draft.go:1152] TryAbort selectively proposing only aborted txns: txns:<start_ts:163406 >
I0111 15:47:14.950493 18569 draft.go:1168] Found 1 old transactions. Acting to abort them.
I0111 15:47:14.955148 18569 draft.go:1129] TryAbort 1 txns with start ts. Error: <nil>
I0111 15:47:14.955191 18569 draft.go:1152] TryAbort selectively proposing only aborted txns: txns:<start_ts:163406 >
I0111 15:47:17.256523 18569 log.go:34] LOG Compact 1->1, del 1 tables, add 1 tables, took 1m26.945152819s
and the server of “zero” output the following logs
I0111 15:18:25.568738 17864 oracle.go:107] Purged below ts:155885, len(o.commits):211, len(o.rowCommit):148400
W0111 15:18:25.568943 17864 raft.go:729] Raft.Ready took too long to process: Timer Total: 618ms. Breakdown: [{proposals 616ms} {sync 2ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
I0111 15:18:26.320982 17864 raft.go:616] Writing snapshot at index: 76710, applied mark: 76939
I0111 15:21:22.039188 17864 oracle.go:107] Purged below ts:163069, len(o.commits):34, len(o.rowCommit):24500
W0111 15:21:22.039719 17864 raft.go:729] Raft.Ready took too long to process: Timer Total: 1.041s. Breakdown: [{proposals 1.035s} {sync 6ms} {disk 0s} {advance 0s}]. Num entries: 1. MustSync: true
I0111 15:21:26.328691 17864 raft.go:616] Writing snapshot at index: 80340, applied mark: 80399
I0111 15:21:45.202544 17864 zero.go:696] Tablet: FriendTest does not belong to group: 1. Sending delete instruction.
W0111 15:21:55.196113 17864 zero.go:660] While deleting predicates: rpc error: code = Canceled desc = context canceled
I0111 15:41:22.085781 17864 zero.go:696] Tablet: FriendTest does not belong to group: 1. Sending delete instruction.
W0111 15:41:32.085464 17864 zero.go:660] While deleting predicates: rpc error: code = Canceled desc = context canceled
During this time, Clients for performance testing cannot continue to insert data and are in a blocked state. Server is not processing any requests as if it is hung.
After a few minutes, when the server of “Aplha” output the following log
alpha-stdout-stderr.log
I0111 15:47:17.256605 18569 log.go:34] DropPrefix done
I0111 15:47:17.256697 18569 log.go:34] Resuming writes
I0111 15:47:17.256764 18569 schema.go:79] Deleting schema for predicate: [FriendTest]
The server resumes from the suspended state to the normal state and starts processing insert data requests.
My question is why the server is in a suspended state for a long time? Is it a problem I use with dgraph? The following commands are my commands to start dgraph
nohup ./dgraph zero >> zero.log &
nohup ./dgraph alpha --lru_mb 4096 --zero localhost:5080 >> alpha.log &
When the above problem appears, I try to modify the commands like follows
nohup ./dgraph zero >> zero.log
nohup ./dgraph alpha --max_retries 10 --pending_proposals 20 --snapshot_after 5000 --lru_mb 4096 --zero localhost:5080 >> alpha.log &
The problem also exists, How should i solve this problem? It seems like a serious problem with server hung.