Database size and compression level

Phill240 · July 30, 2020, 12:25pm

Some time ago I started to suspect that my dgraph database took up too much disk space, so I decided to run some experiments. The results seem quite confusing to me and raise a few questions.

In the beginning the size of the database was 39gb (p and w folders together). I exported it to an rdf file and then imported it using the bulk loader. The compression level was 3 during the import (the same as the dgraph-alpha default it used before). After the import the database shrank to 29gb.
Why did it happen? What had dgraph stored in those 10gb before the import?

The rdf file is about 5gb gzipped and 31gb unpacked. Taking into account that compression is enabled by default and there are just 3 hash indices for string predicates, I expected the database to be far smaller than 29gb. Why does it take up so much space?

I tried different compression levels during import, from 1 up to 20, using the --badger.compression_level parameter. In every case the database size is 29gb. Does this parameter work?

Thanks!

Lash · July 30, 2020, 5:12pm

Yeah, I have the same problem with my db.

Honestly, I can't understand the purpose of the .vlog files. What exactly do they contain?
During one of my experiments I deleted several .vlog files and… the db still worked.

However, when I deleted all the .vlog files, the db stopped working properly. Can somebody explain this behavior?

Thanks

ibrahim · July 31, 2020, 7:05am

@Phill240 @Lash The dgraph data directory (p or w) contains sst and vlog files, which are generated by Badger. An SST file stores either (key, value) or (key, value pointer), where the value pointer is (file, offset, len) in a vlog file. A value is stored in the LSM tree (the sst files) if it is smaller than a threshold (default is 1 KB). The vlog file is the write-ahead log: all operations are logged to the vlog file and cleaned up later.
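
To make the threshold rule concrete, here is a minimal Go sketch of the placement decision. It is not Badger's code: the function name `storedInLSM` is made up for illustration, and the strict "smaller than" comparison at the boundary is an assumption based on the wording above.

```go
package main

import "fmt"

// storedInLSM reports whether a value of valueLen bytes would be kept inline
// in the SST files. Assumption: values strictly smaller than the threshold
// stay in the LSM tree; everything else goes to a vlog file and the SST only
// holds a (file, offset, len) pointer to it.
func storedInLSM(valueLen, threshold int) bool {
	return valueLen < threshold
}

func main() {
	const threshold = 1 << 10 // 1 KB, the default mentioned above

	for _, n := range []int{100, 1023, 1024, 5000} {
		fmt.Printf("value of %d bytes -> inline in SST: %v\n", n, storedInLSM(n, threshold))
	}
}
```

Under this rule, a dataset dominated by long strings ends up mostly in the vlog files, no matter how the SST files are configured.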

The compression option in dgraph affects the SST files only. Since the majority of the disk space is occupied by the vlog files, you’re not seeing any significant difference in total size. If you look at the total size of the SST files with different compression settings, you should see a difference. @Phill240 if you have the data directories, can you share the total size of the SST files with different compression levels?

Compressing vlog files isn’t easy and badger doesn’t support it yet. This is something we might support in the future.

We’re also working on making vlog files pure WAL so that we can reclaim disk space faster.
https://github.com/dgraph-io/badger/pull/1445

You should not delete the vlog files. Your values could be stored in them, and if you delete them you’ll lose data. The DB would start, but you would see erroneous results. Dgraph runs ValueLogGC, which is supposed to free up the disk space for you.

Phill240 · July 31, 2020, 11:59am

@ibrahim thank you for the detailed answer!
Now things are clear to me. A significant part of my data is long strings that take up more than 1kb each, so they aren’t compressed.
I also checked the total size of the sst files produced by the bulk loader with different compression levels. It is the same for level 1 and level 20: 4.8gb.
Is there any way to increase the threshold value of 1kb?

ibrahim · July 31, 2020, 12:32pm

Interesting. Let me try to run this on my end and confirm this.

"Is there any way to increase the threshold value of 1kb?"

Unfortunately, no! The threshold is hardcoded. You can modify the code and try running it, but the badger GC takes time to reclaim vlog space, so you wouldn’t see a reduction in disk space instantly.

https://github.com/dgraph-io/dgraph/blob/a5c469d56741a4a6e2c668b5053dba17bbfa5860/worker/server_state.go#L158

		s.WALstore, err = badger.Open(opt)
		x.Checkf(err, "Error while creating badger KV WAL store")
	}
	{
		// Postings directory
		// All the writes to posting store should be synchronous. We use batched writers
		// for posting lists, so the cost of sync writes is amortized.
		x.Check(os.MkdirAll(Config.PostingDir, 0700))
		opt := badger.DefaultOptions(Config.PostingDir).
			WithValueThreshold(1 << 10 /* 1KB */).
			WithNumVersionsToKeep(math.MaxInt32).
			WithMaxCacheSize(1 << 30).
			WithKeepBlockIndicesInCache(true).
			WithKeepBlocksInCache(true).
			WithMaxBfCacheSize(500 << 20) // 500 MB of bloom filter cache.
		opt = setBadgerOptions(opt)

		// Print the options w/o exposing key.
		// TODO: Build a stringify interface in Badger options, which is used to print nicely here.
		key := opt.EncryptionKey
@Phill240 Are you running the latest version of dgraph? If not, please upgrade to it. There were some bugs in badger that prevented disk space from being cleaned up. We have made some changes that should improve disk space reclamation.

Phill240 · July 31, 2020, 1:17pm

I use v20.07.0-rc1
