Current state of badger crashes

Based on the discussions with @vvbalaji and @Paras, we’re going ahead with reverting the 3 commits suspected of causing the crashes on master and releasing a new Badger version, v20.07.
Badger v20.07 will be used in the Dgraph v1.2.x, v20.03.x, and v20.07 releases.


In the past few months, there have been multiple Badger crashes, and this document lists the currently open crashes which we haven’t been able to reproduce (and fix).

  1. Crash in table.addHelper (Crash when online write · Issue #5752 · dgraph-io/dgraph · GitHub), first seen in v20.03.1 (based on Sentry data)
    This crash is caused by a uint32 (max size 4 GB) overflow. The user who originally reported the issue has SSTs as big as 1.8 GB. When compaction tries to merge two or more such tables, the uint32 in table.builder overflows, which causes the slice index crash. We still don’t know how the user ended up with such a big table in Badger.
    • Balaji Junior will raise a PR to add more asserts to catch the int overflow. Badger will still crash because the table size is more than (or near) 4 GB, but this would give us a clearer picture of the crash instead of leaving us to untangle uint32 overflows (a minimal sketch of such a guard follows this list).
    • There are 3 events for this crash on Sentry, and all three of them originated from the same source. The Dgraph version reported in the GitHub issue and in the Sentry issue is the same, so I’m presuming we actually have only one deployment, which has crashed three times.
  2. Crash in vlog.Read ([Release Blocker] valueLog.Read slice index crash · Issue #1389 · dgraph-io/badger · GitHub), first seen in v20.03.0 (from Sentry)
    This crash also looks like a uint32 overflow [4294967295:17]. It could be a result of data getting corrupted: the slice indices are read from the SST files, and if the data is corrupted, the indices would be incorrect. The bug could be a result of Compress/Encrypt Blocks in the background (#1227) · dgraph-io/badger@b13b927 · GitHub, or it could be happening because we’re misusing the slice pool used for compression/decompression. Since we haven’t been able to reproduce this, we still don’t know the root cause.
    • There is no PR/fix for it right now. I’m running the Live/Bulk/Flock/Benchmark_write/Bank tools on multiple computers, all of them using a modified version of Badger which verifies each table after creating it. I haven’t found any crashes so far in my tests.
    • There are 8 events for this crash on Sentry, and all of them are from the same server.
  3. Crash in table.block ([Release Blocker] t.Block slice index crash · Issue #1388 · dgraph-io/badger · GitHub), first seen in v20.03.2
    This crash is another data corruption issue. The background compression/encryption could be causing this one as well. Since we haven’t been able to reproduce this, we don’t know for sure what’s causing it.
    • There is no PR/fix for it right now. I’m running the Live/Bulk/Flock/Benchmark_write/Bank tools on multiple computers, all of them using a modified version of Badger which verifies each table after creating it. I haven’t found any crashes so far in my tests.
    • There is only 1 event for this crash on Sentry.
  4. Illegal wiretype 6 error (Proto: Illegal wiretype 6 crash on alpha · Issue #5789 · dgraph-io/dgraph · GitHub and https://dgraph.atlassian.net/browse/DGRAPH-1655)
    This crash could also be a side effect of corruption. The protobuf is unmarshalled from the value read from Badger; if the value is corrupted, the unmarshal fails.
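
To make the assert mentioned under item 1 concrete, here is a minimal sketch of the kind of guard such a PR could add. This is a hedged illustration, not Badger’s actual code: the function and names below are made up, and the real check would live inside table.Builder.

```go
// Illustrative overflow guard: fail loudly with a descriptive error instead of
// letting a uint32 offset wrap around when a table approaches the 4 GB limit.
package main

import (
	"fmt"
	"math"
)

// checkTableSize is a hypothetical helper; Badger's real assert differs.
func checkTableSize(estimatedSize uint64) error {
	if estimatedSize > math.MaxUint32 {
		return fmt.Errorf("table size %d exceeds the uint32 limit (%d); refusing to build the table",
			estimatedSize, uint32(math.MaxUint32))
	}
	return nil
}

func main() {
	// 1.8 GB tables (as reported in issue #5752) pass on their own...
	fmt.Println(checkTableSize(1800 << 20)) // <nil>
	// ...but a compaction that merges a few of them can cross 4 GB.
	fmt.Println(checkTableSize(5 << 30)) // error
}
```

Failing early is still a crash from the user’s point of view, but the report then names the real condition (table too big) instead of surfacing as an out-of-range slice index.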

These are the currently open crashes, and I don’t have a fix for them yet. These bugs could be serious if they’re corrupting the SST files. Martin and Balaji Junior have also looked into some of these issues, and we haven’t found anything yet.

Since we cannot reproduce these issues and find the root cause, I’d like to revert the following changes on the Badger master branch and cut a new release, v20.05.0 (as discussed in Badger release process), so that we can do the Dgraph release.

The commits I’ve suggested above are my best guesses, and I think reverting them is the best bet we have for now. The obvious question is: do we reintroduce these improvements in the future? I don’t know. If we’re corrupting data and we can’t reproduce it, we will never be confident enough to introduce these optimizations again.

What do you think @mrjn @vvbalaji @Paras @LGalatin ?

Are we running some or all of these tests with the --race flag?

Identifying a stable point to make progress on Dgraph is our current priority. We can focus on the steps to introduce optimizations after we establish a stable point (and I am hopeful that we can do that)

You could prevent a user from setting such big table sizes via options.
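
For illustration, a hedged sketch of what capping the table size via options could look like, assuming the Badger v2 API (DefaultOptions/WithMaxTableSize; later versions renamed this option). The 256 MB cap is an arbitrary example, not Badger’s default.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

// openWithCappedTableSize clamps a user-supplied table size so individual
// SSTs stay far below the 4 GB uint32 limit. The cap value is illustrative.
func openWithCappedTableSize(dir string, requested int64) (*badger.DB, error) {
	const maxAllowed = 256 << 20 // 256 MB
	if requested <= 0 || requested > maxAllowed {
		requested = maxAllowed
	}
	opts := badger.DefaultOptions(dir).WithMaxTableSize(requested)
	return badger.Open(opts)
}

func main() {
	db, err := openWithCappedTableSize("/tmp/badger-demo", 8<<30) // user asks for 8 GB, gets 256 MB
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

Whether a hard cap or a clamp-with-warning is the right behaviour is a product decision; the point is only that the options layer is where a 4 GB-plus table can be prevented.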

I’d be less worried about crashes coming from one server. Perhaps the users have modified the binary, or something. Do we know they’re only using the official release?

I reckon that as you do the 1 TB test of Badger this quarter, you’d be able to reproduce them more effectively. Generally speaking, I don’t like taking a step back, but I do agree about the urgency of getting Dgraph stable.

Would it adversely impact the memory requirements/recommendations for Badger and Dgraph?

Yes, the optimization was done to reduce the memory requirement. Without it, there should be an increase in the amount of memory used.

I don’t think there are many Badger users who use compression. Dgraph uses it by default.
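
For context, here is a minimal sketch of the buffer-pool pattern behind the decompression optimization (PR #1308), written with made-up names rather than Badger’s actual code. The comment marks the misuse pattern suspected earlier in this thread: recycling a pooled buffer while a decompressed block still references it.

```go
package main

import "sync"

// decompressPool hands out reusable scratch buffers so that each block read
// does not allocate a fresh slice.
var decompressPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 4<<10) },
}

// decompress stands in for the real snappy/zstd call.
func decompress(dst, src []byte) []byte {
	return append(dst[:0], src...)
}

func getBlock(compressed []byte) []byte {
	buf := decompressPool.Get().([]byte)
	out := decompress(buf, compressed)
	// Misuse to avoid: calling decompressPool.Put(buf) here, while `out`
	// (which may alias buf) has been handed to the caller, would let another
	// goroutine overwrite the "decompressed" block in place. That could
	// produce the kind of bad offsets seen in issues #1388/#1389, so the
	// buffer should only go back to the pool once the block is no longer
	// referenced.
	return out
}

func main() {
	_ = getBlock([]byte("some compressed block"))
}
```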

  1. Could we quantify the increase (10% or 50%)?
  2. Since this change is not part of v20.03.1 (the last stable release), it is not a regression from that release.
  3. Was any user/customer unable to use Dgraph without this fix, and did we add it to unblock that user?

I’m assuming all three questions are related to the decompression memory optimization: Buffer pool for decompression (#1308) · dgraph-io/badger@aadda9a · GitHub

No, we have never measured it, and it’s hard to measure these memory optimizations.

Correct. v20.03.1 does not have this memory optimization for decompression.

I remember there were some customers/open-source users reporting high memory usage, but I don’t see any associated tickets on https://dgraph.atlassian.net/browse/BADGER-169 .

I’m skeptical that reverting the commits would solve the issue. Here are my findings for the issues, which are actually not related to the background compression and decompression.

Issue: 4 GB table overflow.
Finding: discardVersion is not set, so the table grows indefinitely. I’ve added an assert for this (add assert to check integer overflow for table size by captain-me0w · Pull Request #1402 · dgraph-io/badger · GitHub). I think snapshots are not sent to the Alpha to set the discardVersion.
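
To make the discardVersion point concrete, here is a hedged sketch of the managed-mode call involved, assuming the Badger v2 API. In Dgraph, SetDiscardTs is expected to be called after a snapshot so that compaction can drop old versions up to that timestamp; if that never happens, every version is kept and tables keep growing.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Dgraph opens Badger in managed mode and controls timestamps itself.
	db, err := badger.OpenManaged(badger.DefaultOptions("/tmp/badger-managed-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Example timestamp only; it mirrors the snapshotTs ("54489876") visible
	// in the Zero status attached to issue #5752.
	var snapshotTs uint64 = 54489876

	// Deleted/expired versions at or below this timestamp may be dropped
	// during compaction. Without this call, nothing is ever discarded.
	db.SetDiscardTs(snapshotTs)
}
```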

Issue: vlog.Read overflow.
Finding: If StreamWriter requests are big, Badger will create a vlog file with a size greater than 4 GB. I’ve added a fix for this (return error if the vlog writes exceeds more that 4GB. by captain-me0w · Pull Request #1400 · dgraph-io/badger · GitHub).
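
A minimal sketch of the kind of guard that fix adds, with illustrative names rather than the actual diff: reject a value-log write whose resulting offset would no longer fit in a uint32, instead of letting vlog.Read later index with a wrapped offset such as 4294967295.

```go
package main

import (
	"fmt"
	"math"
)

// validateVlogWrite is a hypothetical helper; the real check in PR #1400 is
// implemented differently inside the value log.
func validateVlogWrite(currentOffset uint32, entrySize int) error {
	if uint64(currentOffset)+uint64(entrySize) > math.MaxUint32 {
		return fmt.Errorf("vlog write of %d bytes at offset %d would push the file past 4GB",
			entrySize, currentOffset)
	}
	return nil
}

func main() {
	fmt.Println(validateVlogWrite(1<<30, 64<<20))           // <nil>
	fmt.Println(validateVlogWrite(math.MaxUint32-10, 1024)) // error
}
```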

The fixes above are not related to background compression and decompression, so I believe that reverting will not solve the problem entirely.

There is one more bug fix which is related to background compression and decompression (Fix assert in background compression and encryption. by captain-me0w · Pull Request #1366 · dgraph-io/badger · GitHub).

The biggest problem is that we are not able to reproduce these crashes; these bugs are hard to catch with our existing unit and integration tests.

We need to improve our testing cycle. I’m proposing that we create a chaos framework that inserts a bigger dataset and runs every night. At my previous company, we had such chaos tests to catch bugs like these in our load balancer product.
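
A rough sketch of what one nightly chaos-style check could look like, assuming the Badger v2 API; the dataset size and value layout are illustrative, and a real run would use far more data, concurrency, and fault injection. The idea is simply to write a verifiable dataset and re-read every key, so corruption introduced along the compression/compaction path shows up as a failed check rather than a crash in production.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-chaos-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const n = 100000 // a nightly run would write a much larger dataset

	// Write phase: each value is derived from its key, so it can be verified
	// later without keeping a copy in memory.
	for i := 0; i < n; i++ {
		key := make([]byte, 8)
		binary.BigEndian.PutUint64(key, uint64(i))
		value := bytes.Repeat(key, 128) // 1 KB value
		if err := db.Update(func(txn *badger.Txn) error {
			return txn.Set(key, value)
		}); err != nil {
			log.Fatal(err)
		}
	}

	// Verify phase: every stored value must still be 128 repetitions of its key.
	if err := db.View(func(txn *badger.Txn) error {
		for i := 0; i < n; i++ {
			key := make([]byte, 8)
			binary.BigEndian.PutUint64(key, uint64(i))
			item, err := txn.Get(key)
			if err != nil {
				return err
			}
			if err := item.Value(func(v []byte) error {
				if !bytes.Equal(v, bytes.Repeat(key, 128)) {
					log.Fatalf("corrupted value for key %x", key)
				}
				return nil
			}); err != nil {
				return err
			}
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```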

CC: @mrjn @vvbalaji

Yes, that could be one of the reasons for the int overflow. I believe this conclusion was made based on the size of the SSTs seen in Crash when online write · Issue #5752 · dgraph-io/dgraph · GitHub

-rw------- 1 work work 501M Jul 6 10:01 034725.sst
-rw------- 1 work work 719M Jul 6 10:56 035595.sst
-rw------- 1 work work 1.1G Jul 6 10:12 034913.sst
-rw------- 1 work work 1.3G Jul 6 12:02 036595.sst

But I see snapshots were taken ("snapshotTs":"54489876"). See the Zero status in Crash when online write · Issue #5752 · dgraph-io/dgraph · GitHub (I couldn’t figure out from the status whether there was a significant delay in snapshots).

Correction: this fixes the assert, not the underlying issue. Badger will still crash if that assert fails (which we’ve presumed to be the cause of the crash).

I came up with those 3 commits because we had not seen crashes before those changes were introduced (and we cannot reproduce the crashes). There is a chance that someone started using Dgraph at scale, with a couple of TBs of data, and so we’re seeing all these int overflows which we’ve not seen before.

So currently we have two possible explanations:

  1. The code has bugs which are causing the int overflows (because of corruption), or
  2. The integer overflows happen because someone is using Dgraph (and Badger) at a scale that is pushing the limits.

Explanation 1 might be fixed by reverting those commits; explanation 2 points to a bigger problem that doesn’t have a straightforward solution yet.