@beepsoft A DB consists of valid data (which you would query for) and deleted/expired data (which your queries won’t be able to see). The deleted/expired data is removed eventually by badger compactions.
I believe your first DB instance loaded data via the live loader or mutations. In this case we write each entry to the vlog (this is the write-ahead log) and to sst files (which store the keys and the values). Eventually, your vlog and ssts will hold data that can be removed but that compactions or value log GC haven’t cleared yet (these are background processes which are supposed to clean things up).
When you do an export, we export only valid data (not the deleted ones). When this export is imported via bulk loader, the bulk loader won’t create a vlog (wal file) unless it needs to. This is why you see less data on disk now.
To summarize, an export gives you only valid data, and bulk loading this data gives you the minimal set of sst and vlog files that are needed. Both DBs have the same amount of valid data, but the old one also has deleted/expired data while the new one doesn’t.
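To make the export and bulk load effect concrete, here is a minimal toy sketch. It is not Badger or Dgraph code: `ToyStore` and its methods are hypothetical names, and the model assumes a purely append-only store where every overwrite or delete leaves the older records on disk until something cleans them up.

```python
class ToyStore:
    """Append-only toy store: every write or delete appends a record."""

    def __init__(self):
        # (key, value) pairs in write order; value None marks a delete.
        self.records = []

    def set(self, key, value):
        self.records.append((key, value))

    def delete(self, key):
        # A tombstone is appended; older versions stay on "disk".
        self.records.append((key, None))

    def export(self):
        """Return only the valid data: the latest live value per key."""
        latest = {}
        for key, value in self.records:
            latest[key] = value
        return {k: v for k, v in latest.items() if v is not None}

    @classmethod
    def bulk_load(cls, data):
        """Build a fresh store that contains only the given valid data."""
        store = cls()
        for key, value in data.items():
            store.set(key, value)
        return store


old = ToyStore()
for i in range(100):
    old.set(f"key{i}", "v1")
for i in range(100):
    old.set(f"key{i}", "v2")  # overwrites leave 100 stale records behind
for i in range(50):
    old.delete(f"key{i}")     # deletes add 50 tombstones

new = ToyStore.bulk_load(old.export())

print(len(old.records), len(new.records))  # 250 50
print(old.export() == new.export())        # True
```

Both stores answer queries identically, but the old one carries five times as many records. That is the same shape as the disk usage difference you are seeing, just without the background clean-up that a real DB would eventually run.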
This is a side effect of having stale data in the LSM tree (sst files). The new DB has only valid data, so it has to do less work while reading data. Less stale data == faster reads.
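A rough way to picture why stale data slows reads is the sketch below. It is a toy model, not how Badger actually looks up keys (a real LSM tree has levels, bloom filters and indexes); `get` and `compact` are hypothetical helpers, and each dict stands in for one sst file.

```python
def get(tables, key):
    """Search tables newest-first; return (value, tables_probed)."""
    probes = 0
    for table in tables:
        probes += 1
        if key in table:
            return table[key], probes
    return None, probes


def compact(tables):
    """Merge all tables into one, keeping only the newest live version."""
    merged = {}
    for table in reversed(tables):  # oldest first, so newer writes win
        merged.update(table)
    # Drop tombstones (None) once everything has been merged.
    return [{k: v for k, v in merged.items() if v is not None}]


# Newest table first. "a" was written long ago and never touched again.
tables = [
    {"b": "b3"},
    {"b": "b2", "c": None},  # None marks a delete of "c"
    {"b": "b1", "c": "c1"},
    {"a": "a1"},
]

before = get(tables, "a")
after = get(compact(tables), "a")

print(before)  # ('a1', 4)
print(after)   # ('a1', 1)
```

The value returned is the same in both cases. The only difference is how many tables full of old versions of `b` and `c` the read had to step over before it reached `a`.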
Right now, we don’t have a way to do this, but we’re working on it. We’re separating the vlog file so that the clean-up process becomes simpler (https://github.com/dgraph-io/badger/pull/1445). We’re planning to release this in the Dgraph v20.11 release (in November). We’re also working on adding some tooling to Badger so that it can be used to clean up disk space faster.
There are two hacky ways to clean things up. Please don’t try either of these unless you’re sure of what you’re doing.
- We have `badger flatten` command which will compact your ssts (please don’t run this command on the `p` directory or you’ll lose data).
- Snapshot - If you have a 3 alpha node cluster, you could delete the `p` and `w` directories from one of the alphas and it will get a new snapshot. Snapshots have only valid data and work similarly to the bulk loader.