Race condition: Missing entries in key value store

I am playing around with using Badger as an event store for an app I’m building. My unit tests uncovered something unsettling, but at the moment I am assuming this is an ID10T error (problem lies between my head and keyboard).

I’m using Badger v3.2011.1 in in-memory mode for unit testing. The only other pertinent detail is that I build with CGO_ENABLED=0. I have two tests so far, and one is failing for a reason I can’t pin down:

  • Test 1: append an event for the aggregate, retrieve it from the key store: passing
  • Test 2: for two different aggregates (separate prefixes), store a different number of events and validate that they were all retrieved: failing

It’s the latter case I am concerned with. My approach is to use composite keys for the events: {fact}:{ulid}. I can successfully query the set of events for each {fact} or aggregate, but the set of events I am expecting is not getting written properly.
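To make the key scheme concrete, here is a minimal sketch of building {fact}:{ulid} keys and selecting one aggregate's keys by prefix. It uses only the standard library and a plain sorted slice instead of Badger, and the names eventKey and keysFor are hypothetical, as are the short placeholder IDs standing in for ULIDs. The one design point it illustrates is that the prefix should include the ":" separator, otherwise "test.1" would also match an aggregate named "test.10".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// eventKey builds a composite key of the form {fact}:{id}.
// The id stands in for a ULID; any lexically sortable string works here.
func eventKey(fact, id string) string {
	return fact + ":" + id
}

// keysFor returns the keys that belong to one fact from a sorted key list.
// The prefix includes the ":" separator so that "test.1" does not also
// match keys of a fact named "test.10".
func keysFor(sortedKeys []string, fact string) []string {
	prefix := fact + ":"
	// Jump to the first key >= prefix, then walk while the prefix matches.
	start := sort.SearchStrings(sortedKeys, prefix)
	var out []string
	for _, k := range sortedKeys[start:] {
		if !strings.HasPrefix(k, prefix) {
			break
		}
		out = append(out, k)
	}
	return out
}

func main() {
	keys := []string{
		eventKey("test.1", "01A"),
		eventKey("test.1", "01B"),
		eventKey("test.10", "01C"),
		eventKey("test.2", "01D"),
	}
	sort.Strings(keys)
	fmt.Println(keysFor(keys, "test.1"))
	fmt.Println(keysFor(keys, "test.2"))
}
```

The seek-then-check-prefix loop is the same shape a prefix scan takes in a sorted key-value store, so the separator rule carries over.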

The test creates 3 events for the aggregate “test.1” and 9 events for the aggregate “test.2”. I’m able to retrieve all 3 events for “test.1” (most of the time), but only 2 of the 9 events for “test.2”. When I dump the list of keys, the dump matches what the test retrieved: the missing events are not in the store. This leads me to believe I am querying correctly, but not writing correctly.

This snippet does not return any errors, and it is the pattern I am using to write each event, with the assumption that each transaction is fully committed:

if err = db.Update(func(txn *badger.Txn) error {
	return txn.Set(key, value)
}); err != nil {
	return err
}

This is embedded inside of an “Append” method that I wrote to abstract the details away, and ensure the keys are created and queried appropriately. To be fair, the unit test is writing faster than the system will ever deal with in production, but it is disconcerting to not have entries written and no errors returned.
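One thing worth ruling out in a situation like this: setting a key that already exists replaces the old value without any error, so "fewer entries than writes, no errors" can come from repeated keys as well as from lost writes. Below is a minimal sketch of a test-only helper for telling the two apart. The keyAudit type and its methods are hypothetical and are not part of Badger; the idea is to call record(key) inside Append just before the write and compare the totals at the end of the test.

```go
package main

import (
	"fmt"
	"sync"
)

// keyAudit counts how often each key is handed to the store, so repeated
// keys can be told apart from writes that were genuinely lost.
type keyAudit struct {
	mu   sync.Mutex
	seen map[string]int
}

func newKeyAudit() *keyAudit {
	return &keyAudit{seen: make(map[string]int)}
}

// record notes one write and reports whether the key was used before.
func (a *keyAudit) record(key string) (duplicate bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.seen[key]++
	return a.seen[key] > 1
}

// summary returns the number of writes attempted and of distinct keys.
func (a *keyAudit) summary() (writes, distinct int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, n := range a.seen {
		writes += n
	}
	return writes, len(a.seen)
}

func main() {
	audit := newKeyAudit()
	// Made-up keys: the second and third writes reuse the same key.
	for _, k := range []string{"test.2:A", "test.2:B", "test.2:B"} {
		if audit.record(k) {
			fmt.Println("duplicate key:", k)
		}
	}
	writes, distinct := audit.summary()
	fmt.Println("writes:", writes, "distinct keys:", distinct)
}
```

If the distinct-key count equals the number of entries found in the store, the writes are being committed and the keys are colliding; if it equals the number of writes, the keys are fine and the problem lies elsewhere.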

Please advise on where I may be making incorrect assumptions.

Additional information: when the first aggregate (“test.1”) fails, it also has only 2 items in the key store.

Example:

[test.1:01F6SW8K1J4YT7V8VQHFRF8VR6 
 test.1:01F6SW8K1NVT5Q282G66W0BNMP 
 test.2:01F6SW8K1P4VPCD8MYQMRHJ5R8 
 test.2:01F6SW8K1PY9XDBPNCEF6KXDP8]

New information: this appears to be a race condition. If I add the following to my append method, then all the data is stored and retrieved:

time.Sleep(10 * time.Millisecond)

I hate having random “sleep” calls in my code, because the “fix” is fragile and under the wrong conditions the problem will come back. The real question, then: what is the correct way to ensure that the write is complete?

Does anyone have insight on this? It’s rather perplexing, and if there is actual synchronization work I have to do in my application to use Badger correctly and robustly, I would very much like to know.

time.Sleep() calls are a serious code smell, and I don’t trust them to keep working on a constrained device (i.e. slower, fewer cores) or in a container.
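If the missing entries turn out to come from identical keys being generated within the same clock tick (an assumption, not something confirmed in this thread), a sleep only hides the collision. A sketch of one alternative follows: a generator that guarantees unique, increasing IDs no matter how fast it is called. Note that this is not a ULID; it is a plain timestamp-plus-bump stand-in, and the seqID type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// seqID hands out IDs that are unique and strictly increasing even when
// many are requested within a single clock tick.
type seqID struct {
	mu   sync.Mutex
	last int64
}

// next returns the current time in nanoseconds, bumped by one whenever
// the clock has not moved past the previously issued value.
func (s *seqID) next() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now().UnixNano()
	if now <= s.last {
		now = s.last + 1
	}
	s.last = now
	return now
}

// key formats {fact}:{id} with a fixed-width, zero-padded id so that
// keys sort lexically in the order they were issued.
func (s *seqID) key(fact string) string {
	return fmt.Sprintf("%s:%020d", fact, s.next())
}

func main() {
	var gen seqID
	// Nine keys requested back to back, with no sleep in between.
	for i := 0; i < 9; i++ {
		fmt.Println(gen.key("test.2"))
	}
}
```

The mutex makes the generator safe to share across goroutines, and the bump-by-one rule removes any dependence on clock resolution, which can differ between operating systems.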

Hey @bloritsch, thanks for reaching out. Can you please post the entire unit test, with your business logic removed? You can mail it to me at naman@dgraph.io if you can’t share it here.

Nothing super proprietary here. It’s a work in progress, and I had to get some things updated after changing how I wanted to store items in Badger. The example is in a repository I made public here:

This is a small part of a bigger project, and I want the project to be event sourced. I’m using Badger to store the events and pull them back later. Again, on my machine (Windows 10), I had to use CGO_ENABLED=0 since Git Bash does not include GCC, and I prefer to keep implementations pure when possible.

The committed code has the failure in place. I do have a commented-out line marked with a //FIXME: in the Append() method. If you uncomment it, you can see that the slight delay lets all the events be stored and retrieved. I just wish there was something better than time.Sleep().

Hey @bloritsch, I ran the tests from your repo on my Linux machine. They ran just fine, even with the race detector. I think it might be an issue with Windows. We no longer support Windows.

I would not be surprised to hear that. Unfortunately, I develop on Windows. It wouldn’t be the first time that something that works on Linux/Mac doesn’t work on Windows. I had a similar issue with a Java library that abstracted blob storage. The filesystem implementation worked on any Unix/Linux-based platform, but had issues on Windows. Between speed and permissions differences, it’s really hard to properly address it all.

The target deployment environment is a Linux container; I was just hoping I didn’t need special “workaround” code. I can work with it, though, if it compiles and runs on Linux without the dropped messages.