Hey!
I have an unusual use case and wanted to get some advice from you.
I’m working on a PoC that uses Badger to serve static/immutable data. Basically, I have batch jobs that run a few times every day and produce millions of key-value pairs, and I’m looking for a fast and cost-effective way to make this data available to other applications.
A few more details:
- the biggest of these jobs produces 19 million rows; gzipped, the total file size is around 1 GB. However, written to a Badger database this becomes around 6 GB (each value is a simple struct serialised using gob).
- I’ll have tens of jobs, each running a few times a day and producing a few million rows, and I intend to save the output of each one as its own Badger database (write to a database, create a backup of it and publish it somewhere else; a minimal sketch of this flow is right after this list). Once a job finishes, I can discard its old version and serve the new one, so after creation the data is never modified.
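For context, here is roughly the flow each job runs today (a minimal sketch: Row, loadJobRows and the paths are placeholders, and I’m on the v2 API, so names may differ in other versions):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"log"
	"os"

	badger "github.com/dgraph-io/badger/v2"
)

// Row stands in for the real struct produced by the batch job.
type Row struct {
	ID    string
	Value int64
}

func main() {
	// Write the job output into a fresh Badger directory.
	db, err := badger.Open(badger.DefaultOptions("/tmp/job-output"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	wb := db.NewWriteBatch()
	defer wb.Cancel()

	for _, r := range loadJobRows() {
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(r); err != nil {
			log.Fatal(err)
		}
		if err := wb.Set([]byte(r.ID), buf.Bytes()); err != nil {
			log.Fatal(err)
		}
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}

	// Dump the whole database into a backup file that gets published
	// somewhere else for the serving application to restore.
	f, err := os.Create("/tmp/job-output.backup")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := db.Backup(f, 0); err != nil {
		log.Fatal(err)
	}
}

// loadJobRows is a stub for reading the job's output.
func loadJobRows() []Row { return nil }
```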
So my “serving” application will deal with tens of “read-only” Badger databases, each several GB in size with millions of entries. I want this to be cost-effective, so the server’s RAM requirement should be (way) smaller than the combined size of all the Badger databases (I’m fine with reads hitting disk).
With this long description of my use case in mind, I would really appreciate advice on the following points:
- Since the database will never be modified after creation, is there any way to compact the LSM tree even further before creating the backup?
- Are there any knobs in Badger to optimize read performance and minimize memory consumption? (A sketch of what I’m currently trying is after this list.)
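On the first point, I noticed db.Flatten(numWorkers) in the API and wondered whether calling it after the batch load, right before db.Backup, is the intended way to squash everything into a single level, or whether there is something better suited for a write-once database.

On the second point, this is what I’m experimenting with on the serving side (again a sketch against the v2 options API; the path is a placeholder and I’m not at all sure these are the right knobs):

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
	"github.com/dgraph-io/badger/v2/options"
)

func main() {
	// Open each published database read-only and keep SSTables and the
	// value log on disk (FileIO) instead of mmap'ing them, hoping to trade
	// some read latency for a smaller memory footprint.
	opts := badger.DefaultOptions("/data/job-output"). // placeholder path
		WithReadOnly(true).
		WithTableLoadingMode(options.FileIO).
		WithValueLogLoadingMode(options.FileIO)

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Reads go through the usual transaction API.
	_ = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("some-key"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			log.Printf("value is %d bytes", len(val))
			return nil
		})
	})
}
```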
I would also really appreciate any other advice regarding this use case.
Thank you so much for your help. =)