Our store is quite large, around 700GiB (SSTs and value logs), comprising 3000+ tables of the default size of 64MiB each. We use v2.2007.2, and we can neither upgrade to master nor experiment with jemalloc (the way you’ve built it, using a custom prefix, causes some logistical problems for us).
Unfortunately, we are seeing OOMs during compactions because a lot of tables get picked at once. Two recent compactions looked like this:
Nov 19, 2020 @ 19:56:51.208 {"level":"info","ts":"2020-11-19T19:56:51.207Z","logger":"badgerbs","caller":"v2@v2.2007.2/levels.go:962","msg":"LOG Compact 3->4, del 381 tables, add 381 tables, took 5m3.856917917s\n"}
Nov 19, 2020 @ 19:51:47.347 {"level":"info","ts":"2020-11-19T19:51:47.346Z","logger":"badgerbs","caller":"v2@v2.2007.2/levels.go:962","msg":"LOG Compact 3->4, del 770 tables, add 770 tables, took 12m11.574382811s\n"}
Compacting 770 tables at once is quite a lot. At 64MiB per table, this requires 49280MiB (about 48GiB) of RAM since, if I’m reading the code correctly, compaction of all selected tables happens entirely in memory before the tables are flushed to disk. The large amount of retained heap ends up causing an OOM.
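To make the arithmetic explicit, here is a small sketch of the estimate. It assumes, as I do above, that every selected table is fully resident in memory for the duration of the compaction; `estimateCompactionMiB` is a name I made up for illustration, not anything in Badger.

```go
package main

import "fmt"

// estimateCompactionMiB returns the worst-case heap (in MiB) needed if every
// table selected for a compaction is held in memory at the same time.
// This is an assumption about Badger's behavior, not a measured figure.
func estimateCompactionMiB(tables, tableSizeMiB int) int {
	return tables * tableSizeMiB
}

func main() {
	const tableSizeMiB = 64 // default table size in our setup

	// 770 and 381 come from the log lines above; 100 is the cap we would like.
	for _, n := range []int{770, 381, 100} {
		fmt.Printf("%d tables x %d MiB = %d MiB\n",
			n, tableSizeMiB, estimateCompactionMiB(n, tableSizeMiB))
	}
}
```

By the same estimate, a compaction capped at 100 tables would need 6400MiB (6.25GiB), which would fit comfortably on our hosts.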
The question is: is there a way to limit how many tables get picked for compaction at once? We wouldn’t have these OOM-inducing memory spikes if Badger picked, for example, at most 100 tables at a time.
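To illustrate the behavior I’m asking for, here is a rough sketch of splitting one oversized selection into capped batches. This is not Badger code: `batchTables` and the use of plain `uint64` table IDs are hypothetical, and whether a compaction can actually be split this way without breaking key-range invariants is part of the question.

```go
package main

import "fmt"

// batchTables splits the selected table IDs into consecutive groups of at
// most maxPerBatch entries. Hypothetical helper, for illustration only.
func batchTables(ids []uint64, maxPerBatch int) [][]uint64 {
	if maxPerBatch <= 0 {
		return nil
	}
	var batches [][]uint64
	for start := 0; start < len(ids); start += maxPerBatch {
		end := start + maxPerBatch
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	// Stand-in for the 770 tables picked in the compaction logged above.
	ids := make([]uint64, 770)
	for i := range ids {
		ids[i] = uint64(i + 1)
	}

	batches := batchTables(ids, 100)
	// Prints "8 70": seven full batches of 100 and a final batch of 70.
	fmt.Println(len(batches), len(batches[len(batches)-1]))
}
```

With a cap like this, peak memory per step would be bounded by the batch size rather than by however many tables happen to be eligible.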