I’ve built more of my project on the assumption that the “if Windows, add delay” workaround on writes was sufficient, since Windows is not officially supported by the Badger team. It turns out the issue runs a bit deeper than that. My target environment is a Linux container, so I set up the project with a Dockerfile to build and test my solution, and the problem still occurs on a Linux kernel.
SSD reads: ~810 MB/s 1MB sequential, 670 MB/s random 4K
SSD writes: ~740 MB/s 1MB sequential, 520 MB/s random 4K
Tests run inside the Docker container goboring/golang:1.16.5b7
This is a data-loss-on-write issue: once the data is actually written, reads work perfectly. Additionally, for my tests I’m using in-memory storage rather than disk; I’m not sure whether that is relevant.
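For reference, this is roughly how I open the store for the tests. It is a sketch rather than my exact code, and it assumes the Badger v3 import path:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func openTestDB() *badger.DB {
	// In-memory mode: nothing touches the filesystem, so Windows vs. Linux
	// filesystem behaviour should not be a factor in these tests.
	opts := badger.DefaultOptions("").WithInMemory(true)
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	return db
}
```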
I have re-introduced my blanket 1ms delay on write transactions, and that seems to have addressed the issue for now, but I’m not confident how it will behave under the fractional CPU limits of a Kubernetes pod. I don’t know how much headroom the 1ms delay gives me under additional pressure.
On a lark I did try a RWMutex, but the problem isn’t concurrent reads and writes in my test, so that made no difference. I still lose transactions unless I add this call after every write:
time.Sleep(1 * time.Millisecond)
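For context, the sleep sits immediately after the write transaction commits, roughly like this. It is a sketch; appendEvent and the key/value handling are placeholders, not my real event store code:

```go
package main

import (
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

// appendEvent stands in for my write path: commit a transaction, then
// pause 1ms. Without the Sleep, a read issued immediately afterwards
// appeared to be missing whole transactions.
func appendEvent(db *badger.DB, key, value []byte) error {
	err := db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, value)
	})
	if err != nil {
		return err
	}
	time.Sleep(1 * time.Millisecond) // the blanket delay in question
	return nil
}
```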
At least it’s always whole transactions, so Badger itself never gets corrupted, but the net result is that state is lost and there is no error message suggesting the write failed. That concerns me greatly.
On my machine I can reproduce it every time, both inside the Docker builder and outside it. If you run go test ./... directly you may see the race condition (the eventstore package fails with missing entries). However, at least one person on your team tried something like that and it worked without a problem; the assumption was that, because I’m developing on Windows and Windows isn’t directly supported, it wasn’t worth looking into further. My Docker builder runs on a Linux kernel and I still see the errors.
I run it through the Docker build like this:
docker build . -t local/fact-totem
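The Dockerfile is nothing exotic; it is roughly this shape (a sketch of the structure, not the exact file), with the tests running in the builder stage, which is where the failures show up:

```dockerfile
# Builder stage: the tests run as part of the build, so the failure
# shows up during `docker build` itself.
FROM goboring/golang:1.16.5b7 AS builder
WORKDIR /src
COPY . .
RUN go mod download
RUN go test ./...
RUN go build -o /out/fact-totem .
# (final runtime stage omitted; it just copies the binary out of the builder)
```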
You should see the error. If you don’t, then it is related to my older hardware, and “just upgrade” is easier said than done in today’s market. I also have an archived project I used to demo the problem, but it wasn’t reproducible on your teammate’s machine and it doesn’t have a Dockerfile: GitHub - bloritsch/eventstore: Demo project for potential race condition
OK… so on further investigation, the problem isn’t so much Badger as my ID generator, which was not incrementing the way I wanted.
It would be really helpful for me to be able to mark certain entries as write-once so that I get an error message in my use case. That will be a separate topic though…
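In the meantime, the closest I can get is a check inside the same transaction before setting the key, something like the sketch below; ErrAlreadyWritten is a name I made up for illustration:

```go
package main

import (
	"errors"

	badger "github.com/dgraph-io/badger/v3"
)

// ErrAlreadyWritten is a hypothetical error used only for this sketch.
var ErrAlreadyWritten = errors.New("key already written")

// putOnce emulates write-once semantics: instead of silently overwriting
// an existing key, it returns an error, so a misbehaving ID generator
// surfaces immediately rather than looking like lost data.
func putOnce(db *badger.DB, key, value []byte) error {
	return db.Update(func(txn *badger.Txn) error {
		_, err := txn.Get(key)
		switch {
		case err == nil:
			return ErrAlreadyWritten
		case errors.Is(err, badger.ErrKeyNotFound):
			return txn.Set(key, value)
		default:
			return err
		}
	})
}
```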
The ID generator bug also explains why the 1ms delay worked regardless of platform.
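To make the failure mode concrete for anyone searching for the same symptom: if the ID generator hands out a duplicate ID, the later transaction overwrites the earlier one under the same key, with no error, which looks exactly like Badger dropping a whole transaction. The sketch below is illustrative only (not my actual generator); a millisecond-resolution timestamp ID is one way to end up here, and it also shows why sleeping 1ms between writes masks the bug:

```go
package main

import (
	"fmt"
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

// nextID is illustrative only: a millisecond-resolution timestamp hands
// out the same ID to writes that land within the same millisecond.
func nextID() uint64 {
	return uint64(time.Now().UnixNano() / int64(time.Millisecond))
}

// Two calls in quick succession can generate the same key, so the second
// Set replaces the first entry. No error is returned, which matches the
// "whole transaction silently lost" symptom I was blaming on Badger.
func appendWithGeneratedID(db *badger.DB, value []byte) error {
	key := []byte(fmt.Sprintf("event/%020d", nextID()))
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, value)
	})
}
```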