I’m curious what’s happening with the snapshotting in the logs below. Dgraph has a hard time ingesting data while a snapshot is in progress, and today the snapshotting has not taken a break: there have been at least 12 snapshots in 12 hours, each lasting about 50 minutes.
The container this Dgraph alpha runs in has 68 CPUs and 128 GB of RAM.
The dgraph alpha command has the following parameters:
--raft="snapshot-after-entries=300000; snapshot-after-duration=0" --cache="size-mb=1024" --badger="maxlevels=7" --v=2 --telemetry="reports=false; sentry=false;"
The logs below cover 00:21:05 to 00:21:19. There were two seconds between the end of one snapshot and the start of the next. Each snapshot takes about 50 minutes.
Sending Snapshot [49m55s] Scan (1): ~1.9TiB/2.2TiB at 4.2 GiB/sec. Sent 534.1MiB at 450 KiB/sec. jemalloc: 652 MiB
Sending Snapshot Sent data of size 534 MiB
Streaming done. Waiting for ACK...
Received ACK with done: true
Stream snapshot OK
Operation completed with id: opSnapshot
RaftComm: [0x1] Sending message of type MsgSnap 0x2
Operation started with id: opSnapshot
Got StreamSnapshot request context:<id:2 group:1 addr:"IP" > index 57653001 read_ts:69608381 since_ts:69117275
Waiting to reach timestamp: 69608381
Sending Snapshot Streaming about 2.2 TiB of uncompressed data (651 GiB on disk)
Number of ranges found: 17
Sent range 0 for iteration: [...] of size 324 GiB
...
Sent range 10 for iteration: [...] of size 324 GiB
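As a rough sanity check on the numbers above, here is a small Python sketch that works out the average rates implied by the first snapshot's final progress line, the timestamp window the second request covers, and the share of the 12 hours spent snapshotting. The helper `parse_duration` and all variable names are my own and not part of Dgraph. The averages assume the `[49m55s]` figure is the total elapsed time for that snapshot; the rates printed in the log itself may be measured over a shorter window, so the two need not agree.

```python
import re


def parse_duration(text):
    """Convert a duration such as '49m55s' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if match is None or not text:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the "Sending Snapshot [49m55s] ..." line.
elapsed_s = parse_duration("49m55s")
sent_mib = 534.1      # "Sent 534.1MiB"
scanned_tib = 1.9     # "Scan (1): ~1.9TiB/2.2TiB"

# Average rates over the whole run (assumes elapsed_s is the full duration).
avg_send_kib_s = sent_mib * 1024 / elapsed_s
avg_scan_gib_s = scanned_tib * 1024 / elapsed_s

# Timestamp window requested by the follower: read_ts - since_ts.
ts_delta = 69608381 - 69117275

# 12 snapshots of about 50 minutes each within a 12 hour window.
busy_fraction = (12 * 50) / (12 * 60)

print(f"elapsed: {elapsed_s} s")
print(f"average send rate: {avg_send_kib_s:.1f} KiB/sec")
print(f"average scan rate: {avg_scan_gib_s:.2f} GiB/sec")
print(f"timestamps covered by the snapshot: {ts_delta}")
print(f"fraction of time spent snapshotting: {busy_fraction:.0%}")
```

The point of the comparison is only that each snapshot scans terabytes in order to send a few hundred MiB, which is why I suspect the scan rather than the network transfer is where the time goes.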
I do also see this:
Block cache might be too small. Metrics: hit: 823054276 miss: 10013620741 keys-added: 4166130363 keys-updated: 18004022 keys-evicted: 4165968127 cost-added: 18495162823489 cost-evicted: 18494464892587 sets-dropped: 1956262490 sets-rejected: 1557855574 gets-dropped: 96745088 gets-kept: 10608107904 gets-total: 10836675020 hit-ratio: 0.08
Cache life expectancy (in seconds):
-- Histogram:
Min value: 0
Max value: 60288
Count: 3461823279
50p: 2.00
75p: 2.00
90p: 4.00
[0, 2) 3001753322 86.71% 86.71%
[2, 4) 224294364 6.48% 93.19%
[4, 8) 75360293 2.18% 95.37%
...
[4096, 8192) 0.00% 100.00%
[32768, 65536) 0.00% 100.00%
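To double-check the reported hit ratio, here is a short Python sketch that parses part of the metrics line and recomputes it. It assumes the ratio is defined as hits / (hits + misses), which is my reading of the output rather than something I have confirmed in the cache's source; the `METRICS` string is a subset of the line above and `parse_metrics` is a hypothetical helper.

```python
import re

# Subset of the "Block cache might be too small" metrics line, copied verbatim.
METRICS = (
    "hit: 823054276 miss: 10013620741 keys-added: 4166130363 "
    "keys-evicted: 4165968127 gets-total: 10836675020 hit-ratio: 0.08"
)


def parse_metrics(line):
    """Turn 'name: value' pairs into a dict of floats (hypothetical helper)."""
    return {key: float(value) for key, value in re.findall(r"([a-z-]+): ([0-9.]+)", line)}


metrics = parse_metrics(METRICS)

# Assumed definition: hits divided by total lookups (hits + misses).
computed_ratio = metrics["hit"] / (metrics["hit"] + metrics["miss"])

# Share of added keys that were later evicted.
evicted_share = metrics["keys-evicted"] / metrics["keys-added"]

print(f"computed hit ratio: {computed_ratio:.4f}")
print(f"reported hit ratio: {metrics['hit-ratio']}")
print(f"evicted / added keys: {evicted_share:.4%}")
```

Under that assumption the computed value rounds to the reported 0.08, and nearly every key added to the block cache has since been evicted, which fits the histogram showing most entries living under two seconds.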
Is there any way to speed up a snapshot? Are we giving it too little cache? Any other suggestions?
Thanks