Badger compaction results in OOM and loss of data

What version of Go are you using (go version)?

$ go version
1.15.1

What operating system are you using?

Linux.

What version of Badger are you using?

v2.2007.2

Does this issue reproduce with the latest master?

Not sure; it's not easy for us to reproduce this, as the error happens sporadically and spontaneously when a number of factors concur (not all of which are apparent to me). Are there any recent reliability fixes in badger that would address this?

Steps to Reproduce the issue

I'm going to describe the setup and the result of our investigation here. We use Badger as our datastore in the Lotus Filecoin client. Filecoin is a blockchain that has been live for ~3 months. It produces a lot of chain and state data, and the store size of Badger nodes that host the entire history (we call them "archive nodes") is currently hovering around the 800-900GiB mark. Here's the component that interfaces with badger. I think our API usage is pretty normal. Perhaps the only notable feature we started using recently is badger's WriteBatch API, which we use to serve the PutMany(blks []blocks.Block) operation.
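For reference, a minimal sketch of how PutMany can be served with WriteBatch; this is not the actual Lotus code, and the key derivation from the block CID is simplified:

	package blockstore

	import (
		badger "github.com/dgraph-io/badger/v2"
		blocks "github.com/ipfs/go-block-format"
	)

	// putMany batches all blocks into a single WriteBatch and flushes once.
	func putMany(db *badger.DB, blks []blocks.Block) error {
		wb := db.NewWriteBatch()
		for _, b := range blks {
			// Key derivation from the CID is simplified; Lotus applies its own key scheme.
			if err := wb.Set(b.Cid().Bytes(), b.RawData()); err != nil {
				wb.Cancel() // discard buffered entries on failure
				return err
			}
		}
		return wb.Flush() // commit all buffered entries to the store
	}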

sidenote: we are aware that badger, and LSM trees in general, are not a good fit for our use case, with such immense data volumes, completely random keys and random access, arbitrary blob sizes, and read-intensive scenarios. Query duration increases with the store size, and on large stores CPU profiles show that 40% of time is spent in this hotspot: levelHandler.get => table.Iterator.seekFrom => binary.bigEndian.Uint32. We'll be moving to a hot/cold setup, where we use another B+ tree based storage engine for the hot side (around 1-2GiB; read-predominant) and badger for archival (write-predominant). All of this is orthogonal to the issue I'm describing, but my goal here is to illustrate that we've done quite a lot of research and are opening this issue after lengthy investigation.

On November 17th at 03:30:00 local time, the affected node synced a block from the network and wrote an entry to the badger store. The entry remained accessible until 04:40:00 (70 minutes later), at which point it simply vanished from badger. Note that we never delete data from badger. This is a feature of our system, which relies on content-addressed, immutable blobs.

Looking at the logs and trying to grok what happened around the 04:40:00 timestamp, I zeroed in on compaction. It turns out that precisely between 04:39:50 and 04:43:00 there was a surge in compaction activity. The compactions prior to these had taken place between 02:03:00 and 02:14:00.

I’m attaching the compaction logs:

{"ts":"2020-11-17T02:03:27.953Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:0 score:1 dropPrefixes:[]} for level: 0\\n"}
{"ts":"2020-11-17T02:03:34.425Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 0 DONE"}
{"ts":"2020-11-17T02:03:34.425Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.8078400008380413 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T02:03:35.597Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.0241438038647175 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T02:03:35.611Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.5476064756512642 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T02:03:35.611Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T02:04:19.111Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.5476064756512642 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T02:04:19.365Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.0241438038647175 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T02:04:24.720Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.2873733378946781 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T02:04:24.720Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T02:04:24.770Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T02:04:24.770Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.0248969364911318 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T02:04:32.641Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T02:04:32.641Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.0271391198039055 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T02:04:32.660Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T02:04:32.660Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.0242514505982399 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T02:04:45.792Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.0240997947752475 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T02:04:45.792Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T02:04:45.795Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T02:04:48.789Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T02:04:48.789Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.0078707920014858 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T02:07:52.358Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}
{"ts":"2020-11-17T02:07:52.359Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.0052684531360865 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T02:11:03.992Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}
{"ts":"2020-11-17T02:11:03.993Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.0026661136746406 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T02:14:31.992Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}
{"ts":"2020-11-17T02:14:31.993Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.0001312834769487 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T02:14:34.948Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}
{"ts":"2020-11-17T04:39:54.310Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:0 score:1 dropPrefixes:[]} for level: 0\\n"}
{"ts":"2020-11-17T04:40:02.244Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 0 DONE"}
{"ts":"2020-11-17T04:40:02.244Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.7029224000871181 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T04:40:04.028Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T04:40:04.028Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.4426887668669224 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T04:40:04.065Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.024687835201621 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T04:40:07.941Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T04:40:07.941Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 0] Running compaction: {level:1 score:1.1824557594954967 dropPrefixes:[]} for level: 1\\n"}
{"ts":"2020-11-17T04:40:10.859Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.024789121374488 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T04:40:10.859Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T04:40:11.221Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 0] Compaction for level: 1 DONE"}
{"ts":"2020-11-17T04:40:48.412Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.029772574082017 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T04:40:51.334Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:2 score:1.0039954263716937 dropPrefixes:[]} for level: 2\\n"}
{"ts":"2020-11-17T04:40:51.334Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T04:40:53.912Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 2 DONE"}
{"ts":"2020-11-17T04:40:53.913Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.005156219266355 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T04:40:54.742Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}
{"ts":"2020-11-17T04:40:54.743Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.0025780947878957 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T04:40:55.522Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}
{"ts":"2020-11-17T04:40:55.522Z","caller":"v2@v2.2007.2/levels.go:1000","msg":"[Compactor: 1] Running compaction: {level:3 score:1.0000095254927874 dropPrefixes:[]} for level: 3\\n"}
{"ts":"2020-11-17T04:43:13.769Z","caller":"v2@v2.2007.2/levels.go:1010","msg":"[Compactor: 1] Compaction for level: 3 DONE"}

I correlated this with Lotus restarts, and it turns out that halfway through the compaction, the process was restarted. The same happened during the 02:04:00 compaction surge. So there seems to be a causal link between compaction and restarts. My best guess is that the compaction caused a spike in memory usage, and the OOM killer kicked in and killed the process. I'm investigating this right now.

To conclude, I would ask (a) whether compaction is intended to be a malloc-heavy process, and (b) whether that can be curtailed. It looks like badger's eagerness for memory causes it to reproducibly OOM during compactions.

Finally, I am pretty confident that compaction should NEVER be a lossy process, no matter the circumstances. If this can't be guaranteed, then one cannot guarantee that badger will not lose data, and that makes it very unreliable for a multitude of use cases, especially when it's badger's own lack of memory frugality that leads it to crash in the first place.

What Badger options were set?

badger.DefaultOptions, overridden with:

	opts.DetectConflicts = false                // disable transaction conflict detection
	opts.CompactL0OnClose = false               // don't compact L0 when closing the DB
	opts.Truncate = true                        // allow truncating a corrupt value log on open
	opts.ValueLogLoadingMode = badger.MemoryMap // mmap value log files
	opts.TableLoadingMode = badger.MemoryMap    // mmap SSTables
	opts.ValueThreshold = 128                   // values above ~128 bytes go to the value log
	opts.MaxTableSize = 64 << 20                // 64 MiB per table

What did you do?

Explained above.

What did you expect to see?

No data loss, even after a restart during compaction.

What did you see instead?

Data loss.


Confirmed that badger compaction catapulted this instance to an OOM, i.e. it was not an external restart:

[Tue Nov 17 04:40:00 2020] Memory cgroup out of memory: Kill process 10858 (lotus) score 1804 or sacrifice child

We made a lot of memory updates to Badger recently. You should pick up Badger master and compile Badger with jemalloc.

@mrjn I reviewed some of the latest changes, and I don’t think they will solve the problem. The root cause is that badger picks an arbitrary number of tables for compaction, which may be 3, or may be 800 at a time, who knows.

Compactions make heavy use of memory (all newly built tables are kept in memory simultaneously before being flushed to disk), file descriptors, and other resources.

That's why you're seeing so many compaction-related problems, with different manifestations.

Until badger can limit the number of tables it processes at once, it will continue losing data as it is right now (many of our users are now complaining about this).

This can be solved either by (a) making badger estimate how much memory and other resources it has available and size the compaction accordingly, or (b) by allowing the user to specify the maximum number of tables to compact at once.

(b) would be good enough for now; (a) would be a further optimisation.
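To make (b) concrete, here is a rough, hypothetical sketch; neither this helper nor any MaxCompactionTables option exists in badger today, it just illustrates the knob we're asking for:

	package compaction

	import "github.com/dgraph-io/badger/v2/table"

	// capCompactionTables bounds how many tables a single compaction run may
	// pick, deferring the rest to a later run. Purely illustrative.
	func capCompactionTables(picked []*table.Table, maxTables int) []*table.Table {
		if maxTables > 0 && len(picked) > maxTables {
			return picked[:maxTables]
		}
		return picked
	}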

@mrjn
I'm working with @raulk. I've looked into using jemalloc, and the issue we are facing right now is that badger requires a custom (symbol-prefixed) build of jemalloc in the system local path, which is sub-optimal to say the least.

Have you considered statically linking jemalloc instead?
If that's an option instead of dynamic linking, I'm up for contributing it.

Our store is quite large, around 700GiB (SSTs and value logs), comprising 3000+ tables of the default size of 64MiB each. We use v2.2007.2, and we have no possibility of upgrading to master, nor of experimenting with jemalloc (there are some logistical problems with the way you've built it, using a custom prefix).

Unfortunately, we are seeing OOM during compactions, because a lot of tables get picked at once. A recent compaction looked like this:

Nov 19, 2020 @ 19:56:51.208 {"level":"info","ts":"2020-11-19T19:56:51.207Z","logger":"badgerbs","caller":"v2@v2.2007.2/levels.go:962","msg":"LOG Compact 3->4, del 381 tables, add 381 tables, took 5m3.856917917s\n"}
	
Nov 19, 2020 @ 19:51:47.347 {"level":"info","ts":"2020-11-19T19:51:47.346Z","logger":"badgerbs","caller":"v2@v2.2007.2/levels.go:962","msg":"LOG Compact 3->4, del 770 tables, add 770 tables, took 12m11.574382811s\n"}

770 tables to compact at once is quite a lot. At a size of 64MiB per table, that amounts to 49280MiB of RAM, since, if I'm reading the code correctly, compaction of all selected tables happens entirely in memory before the new tables are flushed to disk. The vast amount of retained heap ends up causing an OOM.

The question is: is there a way to limit how many tables get picked for compaction at once? We wouldn't have these OOM-inducing memory spikes if badger picked, say, 100 tables at a time.

Ok, looks like there's a throttler that limits concurrency by a factor of 5. However, badger opens the indices of the new tables while the old indices are not yet released, so memory usage still scales with the arbitrary number of tables Badger decided to compact.
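For anyone following along, the bounded-concurrency pattern I mean is roughly the sketch below (badger's y.Throttle works along these lines; the limit of 5 is the factor mentioned above, not a value verified against the source here):

	package throttle

	// throttle caps the number of goroutines doing work at the same time.
	type throttle chan struct{}

	func newThrottle(n int) throttle { return make(chan struct{}, n) }

	// Do blocks once n workers are already in flight; Done releases a slot.
	func (t throttle) Do()   { t <- struct{}{} }
	func (t throttle) Done() { <-t }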

I wonder if there's something else at play by which the table builders retain memory beyond their useful lifetime, e.g. due to a leak in manual memory management?

Maybe setting the fields of the builder to nil could assist the garbage collector. Alternatively, all of this could be done in a memory arena? Or builders could be pooled and reset?
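To illustrate the pooling idea, a generic sketch with sync.Pool; builder and its buf field are stand-ins here, not badger's actual table builder:

	package tablebuilder

	import (
		"bytes"
		"sync"
	)

	// builder is a stand-in for a table builder; buf represents its large internal buffers.
	type builder struct {
		buf bytes.Buffer
	}

	var builderPool = sync.Pool{
		New: func() interface{} { return new(builder) },
	}

	// acquireBuilder reuses a pooled builder instead of allocating a fresh one per table.
	func acquireBuilder() *builder {
		return builderPool.Get().(*builder)
	}

	// releaseBuilder resets the builder so it drops references to the data it just
	// wrote, then returns it to the pool for reuse.
	func releaseBuilder(b *builder) {
		b.buf.Reset()
		builderPool.Put(b)
	}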

One mystery I haven’t resolved yet is why an OOM during compaction leads to data loss.

This problem was fixed in master. I can’t seem to find the commit right now (on my tablet).

The way to install jemalloc is shown in the Makefile of the dgraph-io/dgraph repo. @ibrahim We should have a similar Makefile in Badger as well.

If jemalloc doesn't work for you, you can still build master purely with Go, but use GODEBUG=madvdontneed=1 so Go doesn't hold onto memory.