Snapshotting Speed & Frequency - Can it be faster?

I’m curious what’s happening with the snapshotting in the logs below. Dgraph is having a hard time ingesting data while snapshotting is going on. Unfortunately, today the snapshotting has not taken a break: there have been at least 12 snapshots, each lasting about 50 minutes, over the past 12 hours.

The container this dgraph alpha is running in has 68 CPUs and 128 GB of RAM.
The dgraph alpha command has the following parameters:
--raft="snapshot-after-entries=300000; snapshot-after-duration=0" --cache="size-mb=1024" --badger="maxlevels=7" --v=2 --telemetry="reports=false; sentry=false;"

The logs below are from between 00:21:05 and 00:21:19. There were two seconds between the end of one snapshot and the beginning of the next, and each snapshot takes about 50 minutes.

Sending Snapshot [49m55s] Scan (1): ~1.9TiB/2.2TiB at 4.2 GiB/sec.  Sent 534.1MiB at 450 KiB/sec. jemalloc: 652 MiB
Sending Snapshot Sent data of size 534 MiB
Streaming done.  Waiting for ACK...
Received ACK with done: true
Stream snapshot OK
Operation completed with id: opSnapshot
RaftComm: [0x1] Sending message of type MsgSnap 0x2
Operation started with id: opSnapshot
Got StreamSnapshot request context:<id:2 group:1 addr:"IP" > index 57653001 read_ts:69608381 since_ts:69117275
Waiting to reach timestamp: 69608381
Sending Snapshot Streaming about 2.2 TiB of uncompressed data (651 GiB on disk)
Number of ranges found: 17
Sent range 0 for iteration: [...] of size 324 GiB
...
Sent range 10 for iteration: [...] of size 324 GiB

I do also see this:

Block cache might be too small.  Metrics: hit: 823054276 miss: 10013620741 keys-added: 4166130363 keys-updated: 18004022 keys-evicted: 4165968127 cost-added: 18495162823489 cost-evicted: 18494464892587 sets-dropped: 1956262490 sets-rejected: 1557855574 gets-dropped: 96745088 gets-kept: 10608107904 gets-total: 10836675020 hit-ratio: 0.08
Cache life expectancy (in seconds):
 -- Histogram:
Min value: 0
Max value: 60288
Count: 3461823279
50p: 2.00
75p: 2.00
90p: 4.00
[0, 2) 3001753322 86.71% 86.71%
[2, 4) 224294364 6.48% 93.19%
[4, 8) 75360293 2.18% 95.37%
...
[4096, 8192) 0.00% 100.00%
[32768, 65536) 0.00% 100.00%
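For what it’s worth, the hit-ratio above seems to just be hits over total gets; a quick check (assuming hit-ratio = hit / (hit + miss)) matches the 0.08 reported:

    # hits / (hits + misses) from the block-cache metrics line above
    echo "823054276 / (823054276 + 10013620741)" | bc -l
    # prints ~0.0759, i.e. the ~0.08 hit-ratio Badger reports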

Is there any way to speed up a snapshot? Are we giving it too little cache? Any other suggestions?

Thanks

I have had a similar problem… I think most of these problems arise because of using Badger. I think you may want to reduce the levels in Badger to speed this up… --badger="maxlevels=7" <<- I believe this is the problem.

@MichelDiz do you have any insights here?

We restarted the cluster with 10 GB of cache to see if that made a difference:

--raft="snapshot-after-entries=300000; snapshot-after-duration=0" --cache="size-mb=10240" --badger="maxlevels=7" --v=2 --telemetry="reports=false; sentry=false;"

This got stuck in an ongoing “opRollup”. Most (all?) queries and mutations are blocked while the opRollup is running.

16:47:09 - Rolled up 1000 keys
17:50:10 - Rolled up 107000 keys
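Assuming the “Rolled up N keys” counter is cumulative, that is only about 28 keys per second over that hour:

    # (107000 - 1000) keys over the 3781 seconds between 16:47:09 and 17:50:10
    echo "(107000 - 1000) / (63*60 + 1)" | bc -l
    # prints ~28 keys/sec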

In hopes of the alpha starting up in a better state and processing any keys that need rolling up faster, we restarted again, giving the Docker instance more RAM and CPUs.

Looking at the cache configuration options, what are the advantages of changing the default cache sizes listed here: Dgraph CLI Reference - Deploy

      --cache string               Cache options
                                       percentage=0,65,35; Cache percentages summing up to 100 for various caches (FORMAT: PostingListCache,PstoreBlockCache,PstoreIndexCache)
                                       size-mb=1024; Total size of cache (in MB) to be used in Dgraph.
                                    (default "size-mb=1024; percentage=0,65,35;")

We most recently modified this to be: --cache="size-mb=32768; percentage=10,65,25;"
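If I am reading the percentage option correctly (it splits size-mb across PostingListCache, PstoreBlockCache and PstoreIndexCache, per the help text above), that works out to roughly:

    # 10% / 65% / 25% of the 32768 MB total cache
    echo "32768 * 0.10; 32768 * 0.65; 32768 * 0.25" | bc
    # prints 3276.80 (posting list), 21299.20 (pstore block), 8192.00 (pstore index)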

After restarting, we’re right back in an opRollup, and transactions are being logged as cancelled in the Dgraph logs.