Dgraph is filling up my disk, and now I can't start the service

A lot of update operations don't seem to trigger compaction, and my disk is already full of SST files.

Who can help me?

What I Want to Do

Is it possible to manually compact SST files?
Right now my only option is to delete all the files and restart the service so it re-syncs the data from the rest of the cluster. This is too slow!

Dgraph Metadata

dgraph version
Dgraph version   : v20.11.0-rc5
Dgraph codename  : tchalla
Dgraph SHA-256   : 95d845ecec057813d1a3fc94394ba1c18ada80f584120a024c19d0db668ca24e
Commit SHA-1     : b65a8b10c
Commit timestamp : 2020-12-14 19:09:28 +0530
Branch           : HEAD
Go version       : go1.15.5
jemalloc enabled : true

Hey @zzl221000, we do not have a way to manually compact the SST files. Badger’s compaction should clean up the deleted data automatically.
Can you show me the contents of your data directory? I’m looking for the total number of SST files and their total size.
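
As an aside beyond what this thread recommends: Badger itself exposes a Flatten call that pushes all tables down to the bottom LSM level, which is the closest thing it has to a manually triggered compaction. A minimal sketch, assuming the Badger v2 Go API, a placeholder path, and a stopped Alpha (the p directory is opened in Badger's managed mode); whether this is safe to run against a real p directory is not confirmed here:

package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Sketch only: open a copy of the p directory in managed mode,
	// with the Alpha stopped. The path is a placeholder.
	db, err := badger.OpenManaged(badger.DefaultOptions("/path/to/p"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Flatten forces all tables down to the bottom LSM level, the closest
	// Badger has to a manual compaction.
	if err := db.Flatten(2); err != nil {
		log.Fatal(err)
	}
}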

Here is a listing from the most abnormal node right now. The real data size should be about 50 GB.

Total number of files:

[root@zk02 p]# ll |wc -l
3079

Total size:

[root@zk02 p]# du -h --max-depth=1 ./
373G    ./

alpha3_p_file.txt (183.7 KB)

@zzl221000 would you be able to run dgraph debug on your data directory and share the output? The dgraph debug command will read all your data and print some statistics about it.

This may be related to this issue: "Vlog files use lots of disk space: Add option to set LSMOnly option when opening p dir".
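
For context, the linked issue is about opening the p directory with Badger's LSMOnly options, which raise the value threshold so values live inside the SST files instead of separate vlog files, letting normal compaction reclaim their space. A minimal sketch of the difference, assuming the Badger v2 Go API and a placeholder path:

package main

import (
	"fmt"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// With DefaultOptions, values above ValueThreshold are written to vlog
	// files, which only shrink when value log GC runs.
	def := badger.DefaultOptions("/path/to/p")

	// With LSMOnlyOptions, the threshold is raised so values stay in the
	// LSM tree (the SST files), where compaction reclaims space on its own.
	lsm := badger.LSMOnlyOptions("/path/to/p")

	fmt.Println("default ValueThreshold: ", def.ValueThreshold)
	fmt.Println("lsm-only ValueThreshold:", lsm.ValueThreshold)
}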

Hey @ibrahim! Everything has been fine with the cluster since I rebooted.
I'll post the debug output after the problem reappears.

@vnium would you like to run the debug tool and post the output? Maybe we’re both having the same problem.

I'm going to post the log in the next few days.

@ibrahim
The output of the debug tool is too large to post in full.

Posting directory size:

[root@zk04 dgraph]# du -h --max-depth=1 /dgraph/alpha1/p
337G    /dgraph/alpha1/p

End of the debug output:

badger 2021/01/23 20:33:45 INFO: Badger.Stream Sent data of size 265 GiB
badger 2021/01/23 20:33:45 INFO: Lifetime L0 stalled for: 0s
badger 2021/01/23 20:33:45 INFO: 
Level 0 [ ]: NumTables: 02. Size: 37 MiB of 0 B. Score: 0.00->0.00 Target FileSize: 64 MiB
Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 Target FileSize: 2.0 MiB
Level 2 [B]: NumTables: 05. Size: 8.5 MiB of 10 MiB. Score: 0.00->0.00 Target FileSize: 2.0 MiB
Level 3 [ ]: NumTables: 21. Size: 56 MiB of 57 MiB. Score: 0.00->0.00 Target FileSize: 4.0 MiB
Level 4 [ ]: NumTables: 105. Size: 562 MiB of 566 MiB. Score: 0.00->0.00 Target FileSize: 8.0 MiB
Level 5 [ ]: NumTables: 483. Size: 5.2 GiB of 5.5 GiB. Score: 0.00->0.00 Target FileSize: 16 MiB
Level 6 [ ]: NumTables: 920. Size: 55 GiB of 55 GiB. Score: 0.00->0.00 Target FileSize: 32 MiB
Level Done

tail -n 7 of the output file:

{d} attr: dgraph.type uid: 736325768  ts: 100794028 item: [49, b0100] sz: 49 dcnt: 1 key: 00000b6467726170682e7479706500000000002be37088
{d} attr: dgraph.type uid: 736325769  ts: 100794028 item: [49, b0100] sz: 49 dcnt: 1 key: 00000b6467726170682e7479706500000000002be37089
{d} attr: dgraph.type uid: 736325770  ts: 100794028 item: [48, b0100] sz: 48 dcnt: 1 key: 00000b6467726170682e7479706500000000002be3708a
{d} attr: dgraph.type uid: 736325771  ts: 100794148 item: [49, b0100] sz: 49 dcnt: 1 key: 00000b6467726170682e7479706500000000002be3708b
{d} attr: dgraph.type uid: 736325772  ts: 100794148 item: [84, b1000] sz: 84 dcnt: 0 isz: 84 icount: 1 key: 00000b6467726170682e7479706500000000002be3708c

Found 1634798107 keys

head -n 10 of the output file:

[Decoder]: Using assembly version of decoder
Page Size: 4096
Listening for /debug HTTP requests at port: 8080
Opening DB: /dgraph/alpha1/p
prefix = 
I�<���vt�O��x��"��1\"o��  ts: 3162147 item: [61, b0100] sz: 61 dcnt: 1 key: 00000b526c4e6f64652e726c6964020b4d910d49e23caac4ec76748e4fb0a778d616e69a22bd9c1013315c226fb01ccf
Y܆eC*�$e�TW�C��_�3{�Ro�ؒf  ts: 333083 item: [61, b0100] sz: 61 dcnt: 1 key: 00000b526c4e6f64652e726c6964020b4d910d59dc860765432ab62465ed5457ca43c6e95f11c3337bc9526f8fd89266
^"i�e���ߊ#e*{�b�^BF�����l7�  ts: 4552271 item: [61, b0100] sz: 61 dcnt: 1 key: 00000b526c4e6f64652e726c6964020b4d910d5e2269a4659a048dd8df8a23652a7bda62965e4246f08bf68f996c37a0
h����L���*����8e.rlid term: [11] M�
               �n�n���  ts: 2711668 item: [61, b0100] sz: 61 dcnt: 1 key: 00000b526c4e6f64652e726c6964020b4d910d689801ced2df1a4cabbceb2a95879c130589380ba0126e816e9a1dafb7
q�x�%���\���h���LI��*�c  ts: 29897333 item: [61, b0100] sz: 122 dcnt: 2 key: 00000b526c4e6f64652e726c6964020b4d910d7106bc78a125e9f49c5c9d82c1689a1ccacb4c491a0ea514cc072ae063

Hey @ibrahim, after observing for many days, I suspect the problem is related to the snapshot-skipping policy. How can I avoid the snapshots being skipped? The logs are as follows:

I0125 03:46:50.357040      17 draft.go:606] Creating snapshot at Index: 139409320, ReadTs: 189197838
I0125 03:47:41.336820      17 draft.go:1611] Skipping snapshot at index: 139409320. Insufficient discard entries: 0. MinPendingStartTs: 170455756
I0125 03:48:41.336092      17 draft.go:1611] Skipping snapshot at index: 139409320. Insufficient discard entries: 0. MinPendingStartTs: 170455756
I0125 03:49:41.337446      17 draft.go:1611] Skipping snapshot at index: 139409320. Insufficient discard entries: 0. MinPendingStartTs: 170455756
I0125 03:50:41.337501      17 draft.go:1611] Skipping snapshot at index: 139409320. Insufficient discard entries: 0. MinPendingStartTs: 170455756
...
I0125 04:07:50.332705      17 draft.go:606] Creating snapshot at Index: 139419376, ReadTs: 189209100
I0125 04:08:41.334584      17 draft.go:1611] Skipping snapshot at index: 139419378. Insufficient discard entries: 2. MinPendingStartTs: 170455756
I0125 04:09:41.335293      17 draft.go:1611] Skipping snapshot at index: 139419378. Insufficient discard entries: 2. MinPendingStartTs: 170455756
I0125 04:10:41.338297      17 draft.go:1611] Skipping snapshot at index: 139419378. Insufficient discard entries: 2. MinPendingStartTs: 170455756
...
I0125 04:28:50.338235      17 draft.go:606] Creating snapshot at Index: 139429608, ReadTs: 189219985
I0125 04:29:41.336136      17 draft.go:1611] Skipping snapshot at index: 139429610. Insufficient discard entries: 2. MinPendingStartTs: 170455756
I0125 04:30:41.335993      17 draft.go:1611] Skipping snapshot at index: 139429610. Insufficient discard entries: 2. MinPendingStartTs: 170455756

Alpha keeps skipping snapshots until my hard drive is full.
This happens frequently when I slow down data updates and writes: writing 5 to 10 RDFs per second used up 895 GB of disk after 6 hours.

@zzl221000 the snapshot logic calculates the number of entries that can be discarded and, based on that number, decides whether to create a snapshot.
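
A simplified sketch of that decision, with hypothetical names and a made-up threshold rather than the real draft.go code: the snapshot's read timestamp cannot move past the oldest pending transaction (MinPendingStartTs), and when that leaves too few Raft entries to discard, the snapshot is skipped, producing the "Insufficient discard entries" lines above.

package main

import "fmt"

// minDiscardEntries is a made-up threshold for illustration only.
const minDiscardEntries = 10

// shouldTakeSnapshot mimics the behaviour described above: cap the snapshot
// timestamp at the oldest pending transaction, and skip the snapshot when too
// few entries would be discarded by taking it.
func shouldTakeSnapshot(readTs, minPendingStartTs uint64, discardEntries int) (bool, string) {
	if minPendingStartTs <= readTs {
		readTs = minPendingStartTs - 1 // cannot snapshot past a pending transaction
	}
	if discardEntries < minDiscardEntries {
		return false, fmt.Sprintf("Skipping snapshot. Insufficient discard entries: %d. MinPendingStartTs: %d",
			discardEntries, minPendingStartTs)
	}
	return true, fmt.Sprintf("Creating snapshot at ReadTs: %d", readTs)
}

func main() {
	// With MinPendingStartTs stuck far behind the current read timestamp,
	// almost nothing is discardable, so the snapshot keeps being skipped,
	// as in the logs above.
	ok, msg := shouldTakeSnapshot(189197838, 170455756, 0)
	fmt.Println(ok, msg)
}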

I noticed that the MinPendingStartTs was at 170455756 for almost an hour. Were there no queries or mutations running at this time?

@ibrahim Mutations are being executed at the rate of one per second.