Badger loses data if compaction runs out of file descriptors

What version of Go are you using (go version)?

$ go version

What operating system are you using?

Linux

What version of Badger are you using?

github.com/dgraph-io/badger/v2 v2.2007.2

Does this issue reproduce with the latest master?

Likely

Steps to Reproduce the issue

Have a large amount of data in badger
Do things.
Wait for badger compaction to run, and then run out of file descriptors (one way to force this is sketched below)
Observe that data you’ve already written (that was Put with no error) is no longer available.
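
One way to provoke the descriptor exhaustion in a test program (a sketch only, assuming Linux; the limit of 256 is an arbitrary low value, not something from my setup):

    // Sketch: lower RLIMIT_NOFILE so the badger compactor runs out of
    // file descriptors sooner. 256 is an arbitrary low value.
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        log.Fatal(err)
    }
    rl.Cur = 256
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        log.Fatal(err)
    }
    // ... then open badger and keep writing until compaction kicks in.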

What Badger options were set?

Notably, SyncWrites is enabled, Truncate is enabled, and the value log threshold is set to 128.
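
Roughly, the options looked like this (a sketch; everything else was left at the defaults, and the path is just the directory from the logs below):

    // Sketch of the relevant badger options; other settings are defaults.
    opts := badger.DefaultOptions("/glob/lbench/repo").
        WithSyncWrites(true).    // Put does not return until the write is synced
        WithTruncate(true).      // truncate value log files to drop corrupt data if required
        WithValueThreshold(128)  // values larger than 128 bytes go to the value log
    db, err := badger.Open(opts)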

What did you do?

Used badger, lost data.

What did you expect to see?

My data not getting lost.

What did you see instead?

My data got lost.

Some relevant logs:

2020-10-29T10:44:38.817-0700	WARN	badger	v2@v2.2007.2/levels.go:1005	[Compactor: 0] LOG Compact FAILED with error: open /glob/lbench/repo: too many open files
While opening directory: /glob/lbench/repo.
github.com/dgraph-io/badger/v2.syncDir
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/dir_unix.go:109
github.com/dgraph-io/badger/v2.(*DB).syncDir
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/db.go:1767
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:710
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:933
github.com/dgraph-io/badger/v2.(*levelsController).doCompact
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:1003
github.com/dgraph-io/badger/v2.(*levelsController).runCompactor
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:404
runtime.goexit
	/usr/lib/go/src/runtime/asm_amd64.s:1374
while running compactions for: {elog:0xc0181269c0 thisLevel:0xc000416060 nextLevel:0xc0004160c0 top:[0xc0febef880] bot:[0xc10ae987e0 0xc145945420 0xc18728a460 0xc149a300e0] thisRange:{left:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 68 79 84 66 90 84 86 66 65 65 67 51 77 90 87 77 67 87 83 66 78 75 54 55 89 53 65 89 65 84 75 67 66 74 82 72 80 54 90 70 55 78 72 88 69 88 77 50 74 79 54 65 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 69 76 88 88 73 50 68 68 79 69 50 69 77 81 85 75 86 75 70 71 86 75 90 55 82 78 79 85 55 87 53 84 76 55 82 89 90 81 71 71 83 79 68 51 79 82 79 87 51 85 84 89 255 255 255 255 255 255 255 255] inf:false} nextRange:{left:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 68 76 66 83 84 84 71 70 84 50 86 79 78 81 88 53 89 52 79 77 83 84 89 81 78 68 53 66 54 72 52 88 88 85 89 77 78 80 73 55 72 79 83 72 77 73 77 66 67 76 83 65 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 69 77 65 50 54 73 89 87 55 81 52 65 74 54 50 53 55 52 90 65 86 75 87 54 89 85 87 85 67 51 66 80 70 85 53 79 67 89 82 80 77 86 50 85 73 75 73 50 79 53 66 52 255 255 255 255 255 255 255 255] inf:false} thisSize:69208873 dropPrefixes:[]}
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:717
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:933
github.com/dgraph-io/badger/v2.(*levelsController).doCompact
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:1003
github.com/dgraph-io/badger/v2.(*levelsController).runCompactor
	/home/why/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.2007.2/levels.go:404
runtime.goexit
	/usr/lib/go/src/runtime/asm_amd64.s:1374: {elog:0xc0181269c0 thisLevel:0xc000416060 nextLevel:0xc0004160c0 top:[0xc0febef880] bot:[0xc10ae987e0 0xc145945420 0xc18728a460 0xc149a300e0] thisRange:{left:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 68 79 84 66 90 84 86 66 65 65 67 51 77 90 87 77 67 87 83 66 78 75 54 55 89 53 65 89 65 84 75 67 66 74 82 72 80 54 90 70 55 78 72 88 69 88 77 50 74 79 54 65 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 69 76 88 88 73 50 68 68 79 69 50 69 77 81 85 75 86 75 70 71 86 75 90 55 82 78 79 85 55 87 53 84 76 55 82 89 90 81 71 71 83 79 68 51 79 82 79 87 51 85 84 89 255 255 255 255 255 255 255 255] inf:false} nextRange:{left:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 68 76 66 83 84 84 71 70 84 50 86 79 78 81 88 53 89 52 79 77 83 84 89 81 78 68 53 66 54 72 52 88 88 85 89 77 78 80 73 55 72 79 83 72 77 73 77 66 67 76 83 65 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 69 77 65 50 54 73 89 87 55 81 52 65 74 54 50 53 55 52 90 65 86 75 87 54 89 85 87 85 67 51 66 80 70 85 53 79 67 89 82 80 77 86 50 85 73 75 73 50 79 53 66 52 255 255 255 255 255 255 255 255] inf:false} thisSize:69208873 dropPrefixes:[]}
2020-10-29T10:44:38.817-0700	WARN	badger	v2@v2.2007.2/levels.go:411	While running doCompact: while running compactions for: {elog:0xc0181269c0 thisLevel:0xc000416060 nextLevel:0xc0004160c0 top:[0xc0febef880] bot:[0xc10ae987e0 0xc145945420 0xc18728a460 0xc149a300e0] thisRange:{left:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 68 79 84 66 90 84 86 66 65 65 67 51 77 90 87 77 67 87 83 66 78 75 54 55 89 53 65 89 65 84 75 67 66 74 82 72 80 54 90 70 55 78 72 88 69 88 77 50 74 79 54 65 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 69 76 88 88 73 50 68 68 79 69 50 69 77 81 85 75 86 75 70 71 86 75 90 55 82 78 79 85 55 87 53 84 76 55 82 89 90 81 71 71 83 79 68 51 79 82 79 87 51 85 84 89 255 255 255 255 255 255 255 255] inf:false} nextRange:{left:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 68 76 66 83 84 84 71 70 84 50 86 79 78 81 88 53 89 52 79 77 83 84 89 81 78 68 53 66 54 72 52 88 88 85 89 77 78 80 73 55 72 79 83 72 77 73 77 66 67 76 83 65 0 0 0 0 0 0 0 0] right:[47 98 108 111 99 107 115 47 85 68 83 65 69 73 69 77 65 50 54 73 89 87 55 81 52 65 74 54 50 53 55 52 90 65 86 75 87 54 89 85 87 85 67 51 66 80 70 85 53 79 67 89 82 80 77 86 50 85 73 75 73 50 79 53 66 52 255 255 255 255 255 255 255 255] inf:false} thisSize:69208873 dropPrefixes:[]}: While opening directory: /glob/lbench/repo.: open /glob/lbench/repo: too many open files

And then a short time later, my application experienced this error:

2020-10-29T10:44:52.332-0700	WARN	lotus-bench	lotus-bench/main.go:91	[from=f3vrulgk7cdr7pdmqfax4y6axwbl2auklbrqdghjscxo3mfj3lynkj4esc3jfmhjuusmjjle6bdovcsezsljpq,to=f08572,n=79084,m=6,h=63868] fatal error:
    github.com/filecoin-project/lotus/chain/vm.(*VM).ApplyMessage
        /home/why/code/lotus/chain/vm/vm.go:499
  - failed to get cbor object bafy2bzacecdxuxmn6be5kwh3ttya767tqc4bwgmzauuahzkzk44rd5clllkeg: failed to get block from blockstore (FATAL): blockstore: block not found (FATAL):
    github.com/filecoin-project/lotus/chain/vm.(*Runtime).StoreGet
        /home/why/code/lotus/chain/vm/runtime.go:121

This is attempting to read a value from badger that the application guarantees has already been written (and synced) to the datastore.
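
The pattern is essentially this (a simplified sketch, not the actual application code; key and value are placeholders):

    // Simplified sketch, not the actual lotus code; key/value are placeholders.
    err := db.Update(func(txn *badger.Txn) error {
        return txn.Set(key, value) // returned nil, and SyncWrites is on
    })

    // ...later, after the compaction failure above:
    err = db.View(func(txn *badger.Txn) error {
        _, err := txn.Get(key)
        return err // now badger.ErrKeyNotFound, despite the successful Set
    })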

Hey @whyrusleeping

Welcome to the community and thank you for your first bug report. They say many eyes make all bugs shallow. So thank you for making bugs shallow. We hope you stick around too :slight_smile: .

So I checked the code… it seems all Badger does is write a warning when there’s an error (levels.go:473 on master). This is not the best way of handling errors.
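
Paraphrasing the compactor loop (not the verbatim source):

    // Paraphrased from (*levelsController).runCompactor; not the verbatim source.
    if err := s.doCompact(id, p); err != nil {
        // The failure is only logged; the compactor moves on. This is the
        // "While running doCompact: ..." warning in the logs above.
        s.kv.opt.Warningf("While running doCompact: %v\n", err)
    }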

@naman can you chime in?

Welcome to the community and thank you for your first bug report.

heh


:sweat_smile:

Hi @whyrusleeping, thanks for reporting the issue. I have created a ticket for this and am working on it.
Are you able to consistently reproduce this? Can you also specify the machine specs you are running on: the file descriptor limit, memory, etc.?

hey @whyrusleeping

  - failed to get cbor object bafy2bzacecdxuxmn6be5kwh3ttya767tqc4bwgmzauuahzkzk44rd5clllkeg: failed to get block from blockstore (FATAL): blockstore: block not found (FATAL):

This error is from the application. Can you show us what error you got from badger? I was looking at the compaction code and it doesn’t seem like we would delete existing data if compaction fails.
From the logs you shared, it looks like the error was received on the first line of the following snippet:
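
(Paraphrased from (*levelsController).runCompactDef in levels.go, v2.2007.2; not the verbatim source.)

    newTables, decr, err := s.compactBuildTables(l, cd) // the "too many open files" error surfaces here
    if err != nil {
        return err // nothing has been replaced or deleted at this point
    }
    defer decr()
    // ... manifest changes for the new tables ...
    return nextLevel.replaceTables(cd.bot, newTables)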

The tables are replaced on the last line of the snippet above.
@whyrusleeping can you help us with the error that you got from badger while doing a lookup? Were you able to read the key, or were you not able to read the value for a particular key?


@ibrahim that error would be returned if badger returned an ErrNotFound.
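
Roughly, the lookup path on our side does this (a hypothetical sketch of the datastore wrapper, not the actual code; ErrNotFound here is the wrapper’s own sentinel):

    // Hypothetical sketch of the datastore wrapper's Get; not the actual code.
    err := db.View(func(txn *badger.Txn) error {
        item, err := txn.Get(key)
        if err != nil {
            return err // badger.ErrKeyNotFound when the key is missing
        }
        val, err = item.ValueCopy(nil)
        return err
    })
    if err == badger.ErrKeyNotFound {
        err = ErrNotFound // surfaces as "blockstore: block not found" above
    }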

It might also be useful to note that the badger datastore in question is over 700GB

It’s not easily reproducible, but I’ve seen a number of similar errors from our users.


We seem to be experiencing the same thing on Badger v3.2103.0: if the process runs out of file descriptors while doing compactions, the database can get corrupted, losing random keys. We are using the Badger database in managed mode with the jemalloc allocator, and after this happens some keys are gone (trying to fetch them returns ErrNotFound).
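
For reference, we open and read roughly like this (a simplified sketch; the path, timestamp, and key are placeholders):

    // Simplified sketch of our managed-mode usage; values are placeholders.
    db, err := badger.OpenManaged(badger.DefaultOptions("/path/to/db"))
    if err != nil {
        panic(err)
    }
    txn := db.NewTransactionAt(readTs, false) // read-only txn at our own timestamp
    defer txn.Discard()
    _, err = txn.Get(key)
    // After the fd exhaustion, err is badger.ErrKeyNotFound for keys that were
    // previously committed successfully.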

@ibrahim any thoughts?