Best practices for serving immutable databases


I have an unusual use case and wanted to get some advice from you.

I’m working on a PoC that uses Badger to serve static/immutable data. Basically, I have batch jobs that run a few times every day and produce millions of key-value pairs, and I’m looking for a fast and cost-effective way to make this data available to other applications.

A few more details:

  • the biggest of these jobs produces 19 million rows; gzipped, the total file size is around 1 GB. However, written to a Badger database this becomes around 6 GB (each value is a simple struct serialised using gob).
  • I’ll have several jobs (tens) that run a few times a day, each producing a few million rows, and I intend to save the output of each as its own Badger database (write to a database, create a backup of it, and publish it somewhere else). After a job finishes, I can discard the old version and serve the new one, so after creation the data will never be modified.

So my “serving” application will deal with tens of “read-only” Badger databases, each several GB in size with millions of entries. My goal is to make this cost-effective, so the required server RAM should be (way) smaller than the sum of all the Badger databases (I’m fine with my reads hitting disk).

With this long use case and description in mind, I would really appreciate advice on the following points:

  • Since the database will never be modified after creation, is there any way to compact the LSM tree even further before creating the backup?
  • Are there any knobs in Badger to optimize read performance and minimize memory consumption?

I would also really appreciate any other advice regarding this use case.

Thank you so much for your help. =)

Hey @agacera,
You should start by setting the LSMOnly options (badger.LSMOnlyOptions). This means your values are co-located with the keys, so you don’t need to read values from the value log. Note: the maximum value size in LSM-only mode is 1 MB.
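A minimal sketch of opening a database this way, assuming Badger v2 (the directory path is a placeholder):

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// LSMOnlyOptions raises the value threshold so values are stored
	// next to their keys in the LSM tree instead of in the value log.
	opts := badger.LSMOnlyOptions("/tmp/badger-ro") // placeholder path
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```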

You should set the ValueLogLoadingMode to FileIO which means we won’t use any memory for the value log.

The default loading mode for tables is MemoryMap, which means a table is mapped into RAM and paged in only when it is read. You might need to set this to FileIO as well if your system has serious memory limitations. If the TableLoadingMode is set to FileIO, read speed will suffer badly, because every key/value read has to go to disk (which is slow).
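Both loading modes are set on the Options value. A sketch of the combination described above, assuming Badger v2 (the `options` subpackage holds the FileIO/MemoryMap constants; the function name is illustrative):

```go
package store

import (
	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

// openReadOptimized opens dir with the value log read via plain file
// I/O (no mmap memory for it) while tables stay memory-mapped.
// Switch TableLoadingMode to options.FileIO too only if memory is
// extremely tight -- every read will then hit disk.
func openReadOptimized(dir string) (*badger.DB, error) {
	opts := badger.LSMOnlyOptions(dir).
		WithValueLogLoadingMode(options.FileIO).
		WithTableLoadingMode(options.MemoryMap) // the default
	return badger.Open(opts)
}
```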

If the tables are loaded in MemoryMap or LoadToRAM mode, then you can disable the cache (set MaxCacheSize = 0). By default, the cache takes up 1 GB of memory. But if the tables are opened in FileIO mode, I suggest you keep the cache, because it keeps blocks in memory, which should improve read speed (when a block is found in the cache).
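The two configurations could look like this (a sketch, assuming Badger v2):

```go
package store

import (
	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

// Tables memory-mapped: the block cache is redundant, so freeing its
// default 1 GB saves memory.
func optsMmapNoCache(dir string) badger.Options {
	return badger.LSMOnlyOptions(dir).
		WithMaxCacheSize(0)
}

// Tables read via FileIO: keep the default cache so hot blocks stay
// in RAM instead of being re-read from disk on every access.
func optsFileIOWithCache(dir string) badger.Options {
	return badger.LSMOnlyOptions(dir).
		WithTableLoadingMode(options.FileIO)
}
```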

You should keep compression and encryption disabled (they’re disabled by default).

You should set KeepL0InMemory to false, which means Level 0 tables won’t be kept in memory (level 0 usually holds about 640 MB of data). This should also reduce memory consumption.
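This is another builder option on the same Options value (a fragment, assuming Badger v2):

```go
package store

import (
	badger "github.com/dgraph-io/badger/v2"
)

// optsLowMemory disables keeping Level 0 tables resident in memory,
// trading some read/compaction speed for a smaller footprint.
func optsLowMemory(dir string) badger.Options {
	return badger.LSMOnlyOptions(dir).
		WithKeepL0InMemory(false)
}
```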

You can call db.Flatten() before you create the backup. Flatten moves all tables to a single level (which should drop all the stale data).
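Putting it together, the flatten-then-backup step could be sketched like this, assuming Badger v2 (the function name and worker count are illustrative; `since = 0` requests a full backup):

```go
package store

import (
	"os"

	badger "github.com/dgraph-io/badger/v2"
)

// flattenAndBackup compacts all tables into a single level, then
// streams a full backup of the database to path.
func flattenAndBackup(db *badger.DB, path string) error {
	if err := db.Flatten(4); err != nil { // 4 concurrent compaction workers
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = db.Backup(f, 0) // since = 0 means back up everything
	return err
}
```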

Please try out these suggestions and let us know how it goes :slight_smile:
