Deadlock in Badger V3

What version of Go are you using (go version)?

$ go version 1.18

What operating system are you using?

Rhel 7

What version of Badger are you using?

v3.2103.2

Does this issue reproduce with the latest master?

could no test

Steps to Reproduce the issue

Got it in Prod, was only able to grab the stack trace

What Badger options were set?

NumMemtables=2
NumLevelZeroTables = 3
NumLevelZeroTablesStall = 5
SyncWrites = false
Compression = options.None
BlockCacheSize = 0
ValueLogFileSize = 32 << 20
MemTableSize = 16 << 20

What did you do?

What did you expect to see?

What did you see instead?

My app went into a deadlock in badger after running for a few days.
We already had similar issue in the past (once every few months) but the app was killed without any stacktraces. Application logs did not help.

This time I was able to get a full stack trace.
Here’s the interesting parts:

goroutine 219049701 [semacquire, 46 minutes]:
sync.runtime_SemacquireMutex(0x7?, 0x5?, 0xc09349a5c0?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:63
github.com/dgraph-io/badger/v3.(*levelHandler).numTables(0xc0d92a3da0?)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/level_handler.go:223 +0x59
github.com/dgraph-io/badger/v3.(*levelsController).pickCompactLevels(0xc0896a4930)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:553 +0xe5
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func3()
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:478 +0x53
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc0896a4930, 0x3, 0xc1000d2840)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:518 +0x3a2
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:355 +0x4e

goroutine 219049701 [semacquire, 46 minutes]:
sync.runtime_SemacquireMutex(0x7?, 0x5?, 0xc09349a5c0?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:63
github.com/dgraph-io/badger/v3.(*levelHandler).numTables(0xc0d92a3da0?)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/level_handler.go:223 +0x59
github.com/dgraph-io/badger/v3.(*levelsController).pickCompactLevels(0xc0896a4930)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:553 +0xe5
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func3()
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:478 +0x53
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc0896a4930, 0x3, 0xc1000d2840)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:518 +0x3a2
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:355 +0x4e

goroutine 219049700 [semacquire, 46 minutes]:
sync.runtime_SemacquireMutex(0x105?, 0x18?, 0x15f4706?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).Lock(0x0?)
	/usr/local/go/src/sync/rwmutex.go:144 +0x71
github.com/dgraph-io/badger/v3.(*levelHandler).deleteTables(0xc0d92a3da0, {0xc0df46eae0, 0x3, 0x10?})
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/level_handler.go:89 +0x45
github.com/dgraph-io/badger/v3.(*levelsController).runCompactDef(_, _, _, {0xc08c282500, 0x2, {0x5, {0xc0a0472e40, 0x7, 0x7}, {0xc0a0472e80, ...}}, ...})
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:1441 +0x3c5
github.com/dgraph-io/badger/v3.(*levelsController).doCompact(0xc0896a4930, 0x2, {0x0, 0x3ff0000000000000, 0x3ff25c4b9e988225, {0x0, 0x0, 0x0}, {0x5, {0xc0a0472e40, ...}, ...}})
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:1522 +0x510
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func2({0x0, 0x3ff0000000000000, 0x3ff25c4b9e988225, {0x0, 0x0, 0x0}, {0x5, {0xc0a0472e40, 0x7, 0x7}, ...}})
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:466 +0x78
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func3()
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:489 +0x158
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc0896a4930, 0x2, 0xc1000d2840)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:518 +0x3a2
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:355 +0x4e

goroutine 219049698 [semacquire, 46 minutes]:
sync.runtime_SemacquireMutex(0x5376c6?, 0x66?, 0x9818a5082a387316?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:63
github.com/dgraph-io/badger/v3.(*compactDef).lockLevels(0x1a35280)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:1098 +0x66
github.com/dgraph-io/badger/v3.(*levelsController).fillTablesL0ToL0(0xc0896a4930, 0xc0626f37a0)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:1123 +0x105
github.com/dgraph-io/badger/v3.(*levelsController).fillTablesL0(0x22453f8?, 0xc0000500b0?)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:1233 +0x47
github.com/dgraph-io/badger/v3.(*levelsController).doCompact(0xc0896a4930, 0x0, {0x0, 0x3ff0000000000000, 0x3fe19ee1acd1698f, {0x0, 0x0, 0x0}, {0x5, {0xc09349a180, ...}, ...}})
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:1506 +0x2c5
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func2({0x0, 0x3ff0000000000000, 0x3fe19ee1acd1698f, {0x0, 0x0, 0x0}, {0x5, {0xc09349a180, 0x7, 0x7}, ...}})
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:466 +0x78
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func3()
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:489 +0x158
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc0896a4930, 0x0, 0xc1000d2840)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:518 +0x3a2
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:355 +0x4e

goroutine 219049699 [semacquire, 46 minutes]:
sync.runtime_SemacquireMutex(0x7?, 0x5?, 0xc09349a7c0?)
	/usr/local/go/src/runtime/sema.go:71 +0x25
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:63
github.com/dgraph-io/badger/v3.(*levelHandler).numTables(0xc0d92a3da0?)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/level_handler.go:223 +0x59
github.com/dgraph-io/badger/v3.(*levelsController).pickCompactLevels(0xc0896a4930)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:553 +0xe5
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor.func3()
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:478 +0x53
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc0896a4930, 0x1, 0xc1000d2840)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:518 +0x3a2
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:355 +0x4e

additionnally there was hundreds of goroutines like these

goroutine 3126 [select]:
github.com/dgraph-io/badger/v3.(*levelsController).runCompactor(0xc001c518f0, 0x0, 0xc001c597a0)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:509 +0x30d
created by github.com/dgraph-io/badger/v3.(*levelsController).startCompact
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/levels.go:355 +0x4e

goroutine 5132 [select, 16 minutes]:
github.com/dgraph-io/badger/v3.(*vlogThreshold).listenForValueThresholdUpdate(0xc0019a1600)
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/value.go:1174 +0x11a
created by github.com/dgraph-io/badger/v3.Open
	/root/go/pkg/mod/github.com/ronanh/badger/v3@v3.0.0-20210716083501-e41fa3b54c7f/db.go:380 +0x16dc

I’ve spent some time reviewing the source code to understand what happened, but the problem is not obvious to me and I’m not familiar with the codebase.

I suspect that it may be related to a double RLock in LockLevels() coupled with a concurrent Lock in deleteTables().
see the answer from brad here: https://github.com/golang/go/issues/15418

This could happen if thisLevel and nextLevel are identical in LockLevels.
I saw this piece of code in runCompactDef: if thisLevel.level == nextLevel.level, which seems to indicate it’s a legit case.

Any thought about this?

1 Like