Based on the discussions with @vvbalaji and @Paras, we’re going ahead with reverting the 3 commits suspected of causing the crashes on master and releasing a new badger version, v20.07.
Badger v20.07 will be used in the dgraph v1.2.x, v20.03.x, and v20.07 releases.
In the past few months, there have been multiple badger crashes, and this document lists the currently open crashes which we haven’t been able to reproduce (and fix).
- Crash in table.addHelper (Crash when online write · Issue #5752 · dgraph-io/dgraph · GitHub), first seen in v20.03.1 (based on sentry data)
This crash is caused by a uint32 (max value 4 GB) overflow. The user who originally reported the issue has SSTs as big as 1.8 GB. When compaction tries to merge two or more such tables, the uint32 in table.builder overflows, which causes the slice index crash. We still don’t know how the user ended up with such a big table in badger.
- Balaji Junior will raise a PR to add more asserts to catch the int overflow (a sketch of the idea follows this item). Badger will still crash because the table size is more than (or near) 4 GB, but this would give us a better sense of the crash instead of leaving us to untangle uint32 overflows.
- There are 3 events about this crash on sentry, and all three of them originated from the same source. The version of dgraph reported in the github issue and the sentry issue is the same, so I’m presuming we actually have only one deployment which has crashed thrice.
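As a rough illustration of the asserts, here is a minimal sketch in Go. The `builder` type and `addBlock` method are hypothetical stand-ins, not Badger’s actual internals:

```go
package main

import (
	"fmt"
	"math"
)

// builder is a stand-in for badger's table builder; the real struct differs.
type builder struct {
	buf []byte
}

// addBlock appends block data after asserting that the running size still
// fits in a uint32, so we fail loudly instead of wrapping around and
// producing a bad slice index during compaction.
func (b *builder) addBlock(block []byte) error {
	if uint64(len(b.buf))+uint64(len(block)) > math.MaxUint32 {
		return fmt.Errorf("adding %d bytes to a %d-byte table overflows uint32",
			len(block), len(b.buf))
	}
	b.buf = append(b.buf, block...)
	return nil
}

func main() {
	b := &builder{}
	if err := b.addBlock(make([]byte, 1024)); err != nil {
		fmt.Println("assert fired:", err)
	}
}
```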
- Crash in vlog.Read ([Release Blocker] valueLog.Read slice index crash · Issue #1389 · dgraph-io/badger · GitHub), first seen in v20.03.0 (from sentry)
This crash also looks like a uint32 overflow: the reported slice bounds are [4294967295:17]. It could be a result of data corruption. The slice indices are read from the SST files, and if the data is corrupted, the indices will be incorrect. The bug could be a result of the background compression/encryption change (Compress/Encrypt Blocks in the background (#1227) · dgraph-io/badger@b13b927 · GitHub), or it could be happening because we’re misusing the slice pool used for compression/decompression. Since we haven’t been able to reproduce this, we still don’t know the root cause.
- There is no PR/fix for it right now. I’m running the Live/Bulk/Flock/Benchmark_write/Bank tools on multiple computers, all of them using a modified version of badger which verifies each table after creating it (sketched after this item). I haven’t found any crashes so far in my tests.
- There are 8 events about this crash on sentry and all of them are from the same server.
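The per-table verification in my test runs is roughly the following. The `iterator` interface here is a simplified stand-in for badger’s table iterator, not its real API:

```go
package verify

import "fmt"

// iterator is a minimal stand-in for badger's table iterator; the real
// API differs.
type iterator interface {
	Rewind()
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
}

// verifyTable walks every entry of a freshly built table. A corrupted block
// index makes the iterator panic with a slice-bounds error, so we convert
// the panic into an error we can log at build time instead of crashing
// much later during a read.
func verifyTable(it iterator) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("table verification failed: %v", r)
		}
	}()
	for it.Rewind(); it.Valid(); it.Next() {
		_ = it.Key()
		_ = it.Value()
	}
	return nil
}
```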
- Crash in table.block ([Release Blocker] t.Block slice index crash · Issue #1388 · dgraph-io/badger · GitHub), first seen in v20.03.2
This crash is another data corruption issue. The background compression/encryption could be causing this one as well, but since we haven’t been able to reproduce it, we don’t know for sure what’s causing it. A defensive bounds check that would turn this panic into a diagnosable error is sketched after this item.
- There is no PR/fix for it right now. The same verification runs described above (Live/Bulk/Flock/Benchmark_write/Bank with per-table verification) cover this crash too, and I haven’t found any crashes so far in my tests.
- There is only 1 event about this crash on sentry.
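A minimal sketch of such a defensive check, assuming hypothetical names rather than badger’s actual t.block code:

```go
package table

import "fmt"

// readBlock validates the offset and size read from the (possibly
// corrupted) table index against the actual table length before slicing,
// so corruption surfaces as an error instead of a slice-bounds panic.
func readBlock(data []byte, off, sz uint32) ([]byte, error) {
	end := uint64(off) + uint64(sz) // widen first so the sum cannot wrap
	if end > uint64(len(data)) {
		return nil, fmt.Errorf("block [%d:%d] out of range for %d-byte table; "+
			"index is likely corrupted", off, end, len(data))
	}
	return data[off:end], nil
}
```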
- Illegal wiretype 6 error (Proto: Illegal wiretype 6 crash on alpha · Issue #5789 · dgraph-io/dgraph · GitHub and https://dgraph.atlassian.net/browse/DGRAPH-1655)
This crash could also be a side effect of corruption. The protobuf is unmarshalled from the value read from badger, and if the value is corrupted, the unmarshal fails.
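For illustration, a checksum-before-unmarshal guard could look like the sketch below. The 4-byte CRC32 prefix layout is an assumption for this example, not badger’s actual value format:

```go
package values

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// decodeValue checks a CRC32 prefix before the bytes are handed to
// proto.Unmarshal, so corruption surfaces as a checksum error rather than
// an "illegal wiretype" crash. The prefix layout is assumed for this sketch.
func decodeValue(raw []byte) ([]byte, error) {
	if len(raw) < 4 {
		return nil, fmt.Errorf("value too short: %d bytes", len(raw))
	}
	want := binary.BigEndian.Uint32(raw[:4])
	body := raw[4:]
	if got := crc32.ChecksumIEEE(body); got != want {
		return nil, fmt.Errorf("value checksum mismatch: got %#x, want %#x", got, want)
	}
	return body, nil // only now is it safe to unmarshal the protobuf
}
```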
These are the currently open crashes, and I don’t have a fix for them yet. These bugs could be serious if they’re corrupting the SST files. Martin and Balaji Junior have also looked into some of these issues, and we haven’t found anything yet.
Since we cannot reproduce these issues and find the root cause, I’d like to revert the following changes on the Badger master branch and cut a new release v20.05.0 (as discussed in Badger release process) so that we can do the dgraph release.
- Background compression/encryption (Compress/Encrypt Blocks in the background (#1227) · dgraph-io/badger@b13b927 · GitHub). This was a performance improvement in building tables. Without this patch, badger compactions will be slower, since building each table will take longer.
- Buffer pool for decompression (Buffer pool for decompression (#1308) · dgraph-io/badger@aadda9a · GitHub). This was a memory optimization to reduce the memory consumption of decompression. There were multiple reports about the high memory usage of decompression, and this commit reduced that consumption. Without it, dgraph (and badger with compression) will use more memory (see Decompression uses too much memory · Issue #1239 · dgraph-io/badger · GitHub). The misuse pattern we suspect is sketched after this list.
- Fix for the buffer pool race condition (fix: Fix race condition in block.incRef (#1337) · dgraph-io/badger@21735af · GitHub). There was a race condition in the reuse of the buffer pool which caused a crash. Reverting this wouldn’t affect anything as long as we revert the buffer pool commit as well.
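For context, the buffer pool misuse we suspect looks roughly like the pattern below. This is illustrative, not badger’s actual implementation:

```go
package pool

import "sync"

// bufPool mimics the decompression buffer pool; details are illustrative.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 4096) },
}

// decompress shows the suspected misuse: the buffer is returned to the pool
// while the caller still holds a slice into it, so a later Get can scribble
// over bytes the caller believes are immutable table data.
func decompress(src []byte) []byte {
	buf := bufPool.Get().([]byte)[:0]
	buf = append(buf, src...) // stand-in for the real decompression step
	out := buf
	bufPool.Put(buf) // BUG: out still aliases buf after it re-enters the pool
	return out
}
```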
The commits I’ve suggested above are my best guesses, and I think reverting them is the best bet we have for now. The obvious question is: do we reintroduce these improvements in the future? I don’t know. If we’re corrupting data and we can’t reproduce it, we will never be confident enough to bring these optimizations back.