Database size and compression level

Some time ago I started to suspect that my dgraph database took up too much disk space, so I decided to run some experiments. The results seem quite confusing to me and raise a few questions.

In the beginning the size of the database was 39gb (p and w folders together). I exported it to an rdf file and then imported it using the bulk loader. The compression level was 3 during the import (the same default that dgraph-alpha had used before). After the import the database shrank to 29gb.
Why did this happen? What had dgraph stored in those 10gb before the import?

The rdf file is about 5gb gzipped and 31gb unpacked. Taking into account that compression is enabled by default and there are just 3 hash indices for string predicates, I expected the database to be far smaller than 29gb. Why does it take up so much space?

I tried setting different compression levels during the import, from 1 up to 20, using the parameter --badger.compression_level. In every case the database size is 29gb. Does this parameter work?

Thanks!

Yeah, I have the same problem with my db.

Honestly, I can't understand the purpose of the .vlog files. What exactly do they contain?
During one of my experiments I deleted several .vlog files and… the db still worked.

However, when I deleted all the .vlog files, the db felt sick. Can somebody explain this behavior?

Thanks

@Phill240 @Lash The dgraph data directories (p and w) contain sst and vlog files, which are generated by Badger. An SST file stores either (key, value) or (key, value pointer) entries, where a value pointer is (file, offset, len) in a vlog file. A value is stored in the LSM tree (the sst files) if it is smaller than a threshold (default is 1 KB); otherwise it lives in a vlog file. The vlog file is also the write-ahead log: all operations are logged to the vlog file and then cleaned up later.
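
To make the threshold rule concrete, here is a toy model of key-value separation in Python. It is only a sketch of the idea described above, not Badger's actual code: the names ToyStore, ValuePointer and VALUE_THRESHOLD are made up for illustration, a single in-memory buffer stands in for the vlog files, and the write-ahead logging of every operation is left out.

```python
from typing import Dict, NamedTuple, Union

VALUE_THRESHOLD = 1024  # bytes; the 1 KB default mentioned in the thread


class ValuePointer(NamedTuple):
    """Where a large value lives inside a vlog file (hypothetical layout)."""
    file_id: int
    offset: int
    length: int


class ToyStore:
    """Toy key-value separation; illustrative only, not Badger's code."""

    def __init__(self, threshold: int = VALUE_THRESHOLD) -> None:
        self.threshold = threshold
        # Stands in for the LSM tree (the sst files).
        self.lsm: Dict[bytes, Union[bytes, ValuePointer]] = {}
        # Stands in for a single .vlog file with file_id 0.
        self.vlog = bytearray()

    def put(self, key: bytes, value: bytes) -> None:
        if len(value) < self.threshold:
            # Small value: stored inline, next to the key.
            self.lsm[key] = value
        else:
            # Large value: appended to the vlog, only a pointer is kept inline.
            pointer = ValuePointer(0, len(self.vlog), len(value))
            self.vlog += value
            self.lsm[key] = pointer

    def get(self, key: bytes) -> bytes:
        entry = self.lsm[key]
        if isinstance(entry, ValuePointer):
            # Follow the pointer into the vlog buffer.
            return bytes(self.vlog[entry.offset:entry.offset + entry.length])
        return entry
```

In this model, anything that only touches the LSM side (such as SST compression) never sees the bytes of the large values, and dropping the vlog buffer leaves pointers that no longer resolve.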

The compression option in dgraph affects the SST files only, and since the majority of the disk space is occupied by the vlog files, you’re not seeing any significant size difference. If you look at the total size of the SST files with different compression settings, you should see a difference. @Phill240 if you have the data directories, can you share the total size of SST files with different compression levels?
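
One possible way to get those totals is a small Python helper that sums file sizes per extension under a data directory. This is just a sketch: size_by_extension is a made-up name, and the path in the usage comment is a placeholder for your own p or w directory.

```python
import os
from collections import defaultdict


def size_by_extension(data_dir):
    """Return total bytes per file extension under data_dir (recursive)."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            # ".sst", ".vlog", or "" for files without an extension.
            ext = os.path.splitext(name)[1].lower()
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


# Example usage (the path is a placeholder, replace it with your own):
# for ext, total in sorted(size_by_extension("/path/to/p").items()):
#     print(f"{ext or '(none)'}: {total / 1024**3:.2f} GiB")
```

Comparing the ".sst" and ".vlog" totals between two imports done with different compression levels should show whether the setting has any effect on the SST side.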

Compressing vlog files isn’t easy and badger doesn’t support it yet. This is something we might support in the future.

We’re also working on making vlog files pure WAL so that we can reclaim disk space faster.

You should not delete the vlog files. Your values could be stored in the vlog files, and if you delete them, you’ll have data loss. The DB would start, but you would see erroneous results. Dgraph runs ValueLogGC, which is supposed to free up the disk space for you.

@ibrahim thank you for the detailed answer!
Now things are clear to me. A significant part of my data is long strings that take up more than 1kb each, so they aren’t compressed.
I also checked the total size of the sst files produced by bulk loader with different compression levels. It is the same for level 1 and level 20 - 4.8gb
Is there any way to increase the threshold value of 1kb?
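
A rough back-of-the-envelope check with the numbers from this thread, written as a few lines of Python. It assumes that everything in the 29gb directory that is not an SST file is vlog data, which is only an approximation.

```python
total_gb = 29.0  # database size after the bulk load, from this thread
sst_gb = 4.8     # total size of the sst files reported above

# Assumption: everything that is not an SST file is vlog data.
vlog_gb = total_gb - sst_gb
vlog_share = vlog_gb / total_gb

print(f"vlog: about {vlog_gb:.1f} gb ({vlog_share:.0%} of the directory)")
```

Under that assumption, most of the directory is outside the reach of the compression setting.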


Interesting. Let me try to run this on my end and confirm this.

Is there any way to increase the threshold value of 1kb?

Unfortunately, no! The threshold is hardcoded. You can modify the code and try running it, but the badger GC takes time to reclaim vlog space, so you wouldn’t see a reduction in disk space instantly.

@Phill240 Are you running the latest version of dgraph? If not, please use the latest version. There were some bugs in badger because of which we weren’t cleaning up the disk space. We have made some changes which should improve disk space reclamation.

I use v20.07.0-rc1