Lost transactions on old hardware: No errors returned

I’ve kept building out my project on the assumption that the “if windows, add delay” workaround on writes was sufficient, since Windows is not officially supported by the Badger team. It turns out the issue may go deeper than that. My target environment is a Linux container, so I set up the project with a Dockerfile to build and test my solution, and the problem still occurs on a Linux kernel.

The problem is specifically related to writing transactions to the Badger KV store in rapid succession. My repository is on GitHub: D-Haven/fact-totem (Fact Totem is an event store designed to allow authenticated access to different aggregates). The familiar code from the last time I raised the issue is in the eventstore package.

Potentially relevant stats:

  • CPU 4 core i5-4690K
  • Go 1.16
  • Badger v3.2103.0
  • SSD reads: ~810 MB/s 1MB sequential, 670 MB/s random 4K
  • SSD writes: ~740 MB/s 1MB sequential, 520 MB/s random 4K
  • Run inside docker container goboring/golang:1.16.5b7

This is a data-loss-on-write issue: once the data is written, reads work perfectly. Also, my tests use memory rather than disk. I’m not sure whether that is relevant.

I have re-introduced my blanket 1ms delay on write transactions and that seems to have addressed the issue for now, but I’m not positive about how the thing will perform when dealing with fractional CPU restrictions in a Kubernetes pod. I don’t know how much headroom the 1ms delay gives me as it relates to additional pressure.

On a lark I tried an RWMutex, but the problem in my test isn’t concurrent reads and writes, so that made no difference. I am still losing transactions without this call after every write:

time.Sleep(1 * time.Millisecond)

It’s always whole transactions at least, so Badger never gets corrupted, but the net result is that state is lost and there are no error messages suggesting the write failed. That causes me great concern.

Can you provide a reproducible working example of this?

I added the branch demo-racecondition to D-Haven/fact-totem on GitHub so you can see it in action.

On my machine I can reproduce it every time, both inside the Docker builder and outside it. If you run go test ./... directly you may see the race condition (the eventstore package will fail with missing entries). However, at least one person on your team tried something like that and it worked without a problem. The assumption was that because I’m developing on Windows, which is not directly supported, it wasn’t something to look into further. My Docker builder is running on a Linux kernel and I still see the errors.

I would run it through the docker build like this:

docker build . -t local/fact-totem

You should see the error. If you don’t, then it is related to my older hardware. It is easier to say “upgrade” than to do it in today’s market. I also have an archived project I used to demo the problem, but it wasn’t reproducible on your teammate’s machine and it doesn’t have a Dockerfile: bloritsch/eventstore on GitHub (demo project for potential race condition).

OK… So on further investigation, the problem isn’t so much Badger as my ID generator, which was not incrementing the way I wanted.

It would really be helpful for me to be able to set certain entries as Write Once so I can get an error message in my use case. That will be a separate topic though…

That also explains why the 1ms delay worked regardless of platform.