Managed mode and lost keys after crash

We are using BadgerDB (v2.2007.2) in managed mode and are occasionally seeing a strange issue during our test suite runs where we randomly kill processes to test crash recovery. We use SetDiscardTs to configure the timestamp below which any deleted/expired key versions can be pruned.

It seems that for some reason after a restart a historic key above the version specified by SetDiscardTs is lost.

In this case the discard timestamp was set to 21263 before the crash. Using badger info to dump all key versions we see the following (the key was deleted at version 22064 and the latest version is 22223):

Key: 003d7d2a19724a5029a48cf9f71be0cfc728e191947afa6958006f6cd4f01c7b6d	version: 22064	size: 33	meta: b0000	{deleted}

The correct variant (from a process that was not killed) should look like this as the key should be live between versions 21572 and 22064:

Key: 003d7d2a19724a5029a48cf9f71be0cfc728e191947afa6958006f6cd4f01c7b6d version: 22064  size: 33        meta: b0000     {deleted}
Key: 003d7d2a19724a5029a48cf9f71be0cfc728e191947afa6958006f6cd4f01c7b6d version: 21572  size: 141       meta: b0000

Any idea what could be the reason for losing keys? NumVersionsToKeep is set to 1, but I assume that this only affects versions below or at the discard timestamp?

Thank you for investigating this @kostko! This seems like a bug to me.

yes, it should affect only versions below the discardTs. The following code is what removes the keys and it shouldn’t drop any keys as long as the discardTs is set correctly.
https://github.com/dgraph-io/badger/blob/d5a25b83fbf4f3f61ff03a9202e36f5b75544426/levels.go#L610-L647

@kostko can you confirm that discardTs was set correctly? I don’t see any obvious reason why we would drop that key.

The issue is hard to reproduce, but we’ll add some more logging to make sure that the discard timestamp was set correctly. But based on the existing logs, the discard timestamp was set correctly (as specified above) and there is only a single point where we set it.

Based on my reading of the Badger code, the timestamp is initially set to zero (when opening the database), so it shouldn’t cause anything to get pruned, right?

Ok just confirmed that the discard timestamp was definitely set correctly, e.g. to 21263. Also looked at the database dumps a bit closer and found the following:

  1. There seems to be a lot of older deleted entries which were not discarded (versions that are even below the set discard timestamp). E.g., a key that got created at version 20823 and deleted at version 20854 was not discarded. So the discard timestamp may not be responsible as otherwise all the other keys would be discarded as well? Or is discarding more selective so this is not unusual?

  2. There are actually lots of other keys that were lost in a similar way. ALL of the missing keys were created at version 21572 and deleted at some later version. BUT there are also keys that correctly exist, created at version 21572 and deleted at some later version, so it’s not like all of the keys at that version were lost.

I checked and we are killing processes multiple times during the test run and the previous time that the process was killed was right around when version 21572 was committed. So it could be that killing the process caused loss of recent data – which is fine as the data would have been replied from a different source in this case. But it seems that the metadata indicated that the version was successfully committed, so the replay did not happen.

To explain, we are keeping two kinds of data in Badger – metadata (always at version 1) and actual versioned data (at its respective version). We always first commit the data updates (at its respective version) and then as the very last step commit the updated metadata (at version 1).

Do I understand correctly that given this order of commits in case of a crash:

  • Recent data and metadata key updates may be lost (fine, we’ll just replay as metadata indicates we don’t have the given version).
  • Recent metadata key updates may be lost (again fine, we’ll just replay even if data is already there).
  • Recent data key updates are lost while the metadata key updates are not lost (this would be problematic as then we wouldn’t know that something has actually been lost and needs replay).

E.g., is the last case a possibility?

After the upgrade to v3 we no longer seem to be experiencing these issues in our stress tests. :crossed_fingers:

2 Likes