Experiencing hungs

Moved from GitHub dgraph/4818

Posted by FelixHolmes:

What version of Dgraph are you using?

Dgraph version   : v1.2.1
Dgraph SHA-256   : 3f18ff84570b2944f4d75f6f508d55d902715c7ca2310799cc2991064eb046f8
Commit SHA-1     : ddcda9296
Commit timestamp : 2020-02-06 15:31:05 -0800
Branch           : HEAD
Go version       : go1.13.5

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

CPU:16HC
RAM:64G
Disk:500G SSD
Linux version 3.13.0-141-generic (buildd@lgw01-amd64-018) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) ) #190-Ubuntu SMP Fri Jan 19 12:52:38 UTC 2018

Steps to reproduce the issue (command/config used to run Dgraph).

zero_config.yaml

my: "ip:5080"
# replicas: 3
# idx: 1
bindall: true

alpha_config.yaml

my: "ip:7080"
zero: "ip:5080"
lru_mb: 204800
snapshot_after: 5000
bindall: true
  1. Start db load 3.5*60M predicate(Stand-alone)
  2. Officially run for a while(Client rpc request)
  3. DB hung up not working(Appear periodically)
  4. the strange thing is:
    *** When I run an http request, the DB returns to normal ***
curl -H 'Content-Type: application/graphql+-' localhost:8080/query -XPOST -d '{
  UserRoot(func: eq(user_id,"xxxxxxx")) {
    count(~follow)
    count(follow)
  }
}'

Expected behaviour and actual result.

log info

0220 08:12:03.284002    1923 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0220 08:12:03.284385    1923 log.go:34] Running for level: 0
I0220 08:12:17.312002    1923 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 14.02743225s
I0220 08:12:17.312038    1923 log.go:34] Compaction for level: 0 DONE
I0220 08:13:39.919400    1923 draft.go:403] Creating snapshot at index: 29051684. ReadTs: 29506605.
I0220 08:15:39.962278    1923 draft.go:403] Creating snapshot at index: 29059170. ReadTs: 29514265.
I0220 08:16:40.022534    1923 draft.go:403] Creating snapshot at index: 29064950. ReadTs: 29520229.
I0220 08:16:40.898917    1923 log.go:34] Rolling up Time elapsed: 01s, bytes sent: 0 B, speed: 0 B/sec
I0220 08:16:41.212262    1923 log.go:34] Rolling up Created batch of size: 1.2 MB in 286.435µs.
I0220 08:16:41.898361    1923 log.go:34] Rolling up Time elapsed: 02s, bytes sent: 1.2 MB, speed: 611 kB/sec
I0220 08:16:42.898634    1923 log.go:34] Rolling up Time elapsed: 03s, bytes sent: 1.2 MB, speed: 408 kB/sec
I0220 08:16:43.898460    1923 log.go:34] Rolling up Time elapsed: 04s, bytes sent: 1.2 MB, speed: 306 kB/sec
I0220 08:16:44.251762    1923 log.go:34] Rolling up Created batch of size: 4.7 MB in 895.251µs.
I0220 08:16:44.898371    1923 log.go:34] Rolling up Time elapsed: 05s, bytes sent: 5.9 MB, speed: 1.2 MB/sec
I0220 08:16:45.898359    1923 log.go:34] Rolling up Time elapsed: 06s, bytes sent: 5.9 MB, speed: 990 kB/sec
I0220 08:16:46.898344    1923 log.go:34] Rolling up Time elapsed: 07s, bytes sent: 5.9 MB, speed: 848 kB/sec
I0220 08:16:46.961639    1923 log.go:34] Rolling up Created batch of size: 24 kB in 536.796µs.
I0220 08:16:47.898389    1923 log.go:34] Rolling up Time elapsed: 08s, bytes sent: 6.0 MB, speed: 745 kB/sec
I0220 08:16:48.898445    1923 log.go:34] Rolling up Time elapsed: 09s, bytes sent: 6.0 MB, speed: 662 kB/sec
I0220 08:16:49.536738    1923 log.go:34] Rolling up Created batch of size: 24 kB in 1.126048ms.
I0220 08:16:49.898485    1923 log.go:34] Rolling up Time elapsed: 10s, bytes sent: 6.0 MB, speed: 599 kB/sec
I0220 08:16:50.898363    1923 log.go:34] Rolling up Time elapsed: 11s, bytes sent: 6.0 MB, speed: 544 kB/sec
I0220 08:16:51.311832    1923 log.go:34] Rolling up Created batch of size: 900 kB in 4.458627ms.
I0220 08:16:51.898336    1923 log.go:34] Rolling up Time elapsed: 12s, bytes sent: 6.9 MB, speed: 574 kB/sec
I0220 08:16:52.898375    1923 log.go:34] Rolling up Time elapsed: 13s, bytes sent: 6.9 MB, speed: 530 kB/sec
I0220 08:16:53.898393    1923 log.go:34] Rolling up Time elapsed: 14s, bytes sent: 6.9 MB, speed: 492 kB/sec
I0220 08:16:54.277509    1923 log.go:34] Rolling up Created batch of size: 4.3 MB in 24.418724ms.
I0220 08:16:54.898375    1923 log.go:34] Rolling up Time elapsed: 15s, bytes sent: 11 MB, speed: 743 kB/sec
I0220 08:16:54.906653    1923 log.go:34] Rolling up Created batch of size: 292 kB in 3.221667ms.
I0220 08:16:55.149116    1923 log.go:34] Rolling up Created batch of size: 4.4 MB in 4.728967ms.
I0220 08:16:55.252344    1923 log.go:34] Rolling up Created batch of size: 446 kB in 3.589395ms.
I0220 08:16:55.462128    1923 log.go:34] Rolling up Created batch of size: 155 kB in 30.918885ms.
I0220 08:16:55.809269    1923 log.go:34] Rolling up Created batch of size: 4.2 MB in 4.255364ms.
I0220 08:16:55.901670    1923 log.go:34] Rolling up Time elapsed: 16s, bytes sent: 21 MB, speed: 1.3 MB/sec
I0220 08:16:56.427749    1923 log.go:34] Rolling up Created batch of size: 4.2 MB in 3.022367ms.
I0220 08:16:56.650256    1923 log.go:34] Rolling up Created batch of size: 948 kB in 2.328861ms.
I0220 08:16:56.898477    1923 log.go:34] Rolling up Time elapsed: 17s, bytes sent: 26 MB, speed: 1.5 MB/sec
I0220 08:16:57.106008    1923 log.go:34] Rolling up Created batch of size: 166 kB in 13.411763ms.
I0220 08:16:57.142121    1923 log.go:34] Rolling up Created batch of size: 4.2 MB in 6.250634ms.
I0220 08:16:57.191729    1923 log.go:34] Rolling up Created batch of size: 2.1 MB in 9.688053ms.
I0220 08:16:57.354832    1923 log.go:34] Rolling up Created batch of size: 39 kB in 246.94µs.
I0220 08:16:57.470197    1923 log.go:34] Rolling up Created batch of size: 2.8 MB in 13.836151ms.
I0220 08:16:57.562581    1923 log.go:34] Rolling up Created batch of size: 198 kB in 372.882µs.
I0220 08:16:57.676504    1923 log.go:34] Rolling up Created batch of size: 4.3 MB in 10.238222ms.
I0220 08:16:57.898854    1923 log.go:34] Rolling up Time elapsed: 18s, bytes sent: 40 MB, speed: 2.2 MB/sec
I0220 08:16:58.098687    1923 log.go:34] Rolling up Created batch of size: 4.2 MB in 2.745176ms.
I0220 08:16:58.484415    1923 log.go:34] Rolling up Created batch of size: 1.5 MB in 6.832797ms.
I0220 08:16:58.898399    1923 log.go:34] Rolling up Time elapsed: 19s, bytes sent: 45 MB, speed: 2.4 MB/sec
I0220 08:16:59.898344    1923 log.go:34] Rolling up Time elapsed: 20s, bytes sent: 45 MB, speed: 2.3 MB/sec
I0220 08:17:00.898279    1923 log.go:34] Rolling up Time elapsed: 21s, bytes sent: 45 MB, speed: 2.2 MB/sec
I0220 08:17:01.473237    1923 log.go:34] Rolling up Created batch of size: 2.5 MB in 10.933176ms.
I0220 08:17:01.473550    1923 log.go:34] Rolling up Sent 15240 keys
I0220 08:17:01.600877    1923 draft.go:1102] Rolled up 15222 keys. Done
I0220 08:17:01.601306    1923 draft.go:445] List rollup at Ts 29514265: OK.
I0220 08:17:39.949341    1923 draft.go:403] Creating snapshot at index: 29075363. ReadTs: 29530817.

FelixHolmes commented :

I have no way

FelixHolmes commented :

^@I0220 08:42:27.888278    1923 log.go:34] LOG Compact 1->2, del 7 tables, add 7 tables, took 25.697322143s
I0220 08:42:27.888778    1923 log.go:34] Compaction for level: 1 DONE
^@I0220 08:43:39.962043    1923 draft.go:1353] Skipping snapshot at index: 29202882. Insufficient discard entries: 1. MinPendingStartTs: 29660945
^@I0220 08:44:39.946749    1923 draft.go:1353] Skipping snapshot at index: 29202882. Insufficient discard entries: 1. MinPendingStartTs: 29660945
^@I0220 08:45:39.973555    1923 draft.go:1353] Skipping snapshot at index: 29202882. Insufficient discard entries: 1. MinPendingStartTs: 29660945
^@I0220 08:46:40.086277    1923 draft.go:1353] Skipping snapshot at index: 29202882. Insufficient discard entries: 1. MinPendingStartTs: 29660945

FelixHolmes commented :

I0220 08:47:40.000576    1923 draft.go:1353] Skipping snapshot at index: 29202882. Insufficient discard entries: 1. MinPendingStartTs: 29660945
I0220 08:47:40.001026    1923 draft.go:1194] Found 3 old transactions. Acting to abort them.
I0220 08:47:40.004242    1923 draft.go:1155] TryAbort 3 txns with start ts. Error: <nil>
I0220 08:47:40.004449    1923 draft.go:1178] TryAbort selectively proposing only aborted txns: txns:<start_ts:29660972 > txns:<start_ts:29661067 > txns:<start_ts:29660945 >
I0220 08:47:40.006262    1923 draft.go:1197] Done abortOldTransactions for 3 txns. Error: <nil>
I0220 08:47:52.407478    1923 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 13.1233301s
I0220 08:47:52.407556    1923 log.go:34] Compaction for level: 0 DONE
^@I0220 08:48:40.110161    1923 draft.go:403] Creating snapshot at index: 29230653. ReadTs: 29694315.
^@I0220 08:49:39.935132    1923 draft.go:1353] Skipping snapshot at index: 29230653. Insufficient discard entries: 0. MinPendingStartTs: 29689881

mangalaman93 commented :

Could you try increasing the snapshot_after to a larger number say 30000? It would be easier for us to look at the issue if you could provide us various profiles when the database is stuck. To take the profiles, you could run the following command:

curl http://<ip>:8080/debug/pprof/heap > heap.$(date +%s).pprof
curl http://<ip>:8080/debug/pprof/profile > cpu.$(date +%s).pprof
curl http://<ip>:8080/debug/pprof/blo > blo.$(date +%s).pprof

You could run this a few times and provide us multiple profiles. Feel free to email at aman[AT]dgraph.io.

It would be useful to have the output of the following link:
http://localhost:8080/debug/pprof/goroutine?debug=2

sleto-it commented :

Hi @FelixHolmes,

Thanks again for reporting this

Is the issue persisting? If yes, please could you increase snapshot_after and provide the profiles as @mangalaman93 suggested?

Many thanks,

sleto-it commented :

Hi @FelixHolmes,

Sending another quick follow-up. Please let us know if the issue persists

Many thanks,

Willem520 commented :

I think I meet the same problem. there is no error in log. but wen I use /health?all,I got some node unhealthy. and on these node, I can’t get any response by using curl http://127.0.0.1:8080/health.