Discussion: BadgerDB should offer arbitrarily sized, atomic transactions

Moved from GitHub badger/1325

Posted by mvanderkroon:

Observation

The current BadgerDB implementation keeps in-flight transactions in memory until txn.Commit() is called. If the number of operations exceeds a predetermined limit, an ErrTxnTooBig error is returned. The de facto way of dealing with this seems to be to commit the pending transaction there and then, create a new transaction, and continue the sequence of operations.

Problem statement

This is unacceptable for many applications for two reasons:

  1. It will leave the database in an inconsistent state; perhaps not at the physical level, but certainly at the application level.
  2. Concurrent transactions will use up available RAM quickly and unpredictably.

In my opinion, the current situation essentially breaks the atomicity guarantee for any transaction larger than (some function of) available RAM, or for any combination of transactions that together exceeds that limit.

Real-world examples

To illustrate: our use cases require regularly replacing blocks containing large numbers of keys, which involves deleting the previously existing block (currently used by the application and its users) and then inserting the new block. The number of keys is not fixed. One can easily see how inconvenient it is for the users of our application if the ErrTxnTooBig error happens right at the moment we have finished deleting the previously existing keyset: the entire block currently in use by the application would be removed before being replaced with new data. This is what I mean by leaving the database in an inconsistent state at the application level.

Additionally, we can imagine another situation: many blocks in various parts of the application are written to simultaneously, as external users might do. Once any combination of these actions at some point exceeds the available RAM, it crashes the application. Perhaps the ErrTxnTooBig mechanism takes this into account, I'm not sure.

Discussion

I believe the author of issue #1242 was also hinting at this problem, but I think the issue goes beyond ErrTxnTooBig, and a solution (if this were accepted as a problem to begin with!) could entail potentially far-reaching changes in how BadgerDB stores in-flight transactions. Postgres could serve as a good example of how to achieve it: How Postgres Makes Transactions Atomic (brandur.org).

My understanding is that many other databases follow similar practices; for Neo4j I know this is the case for certain. I also don't mean to imply this is a bad choice per se by any means; perhaps BadgerDB/Dgraph and others are aimed at use cases where atomicity of arbitrarily sized transactions is not needed and raw performance has priority, and that is OK!

However, the (optional) ability to deal with arbitrarily sized transactions would make BadgerDB/Dgraph a suitable solution for a wider domain of real-world applications.

I'm curious to learn:

  • Whether it is perceived as a problem at all
  • Is a solution to this problem part of the future of BadgerDB
  • If the above answers are both yes, how should it be approached
  • Is there anything I could do to help

jarifibrahim commented:

Hey @mvanderkroon, thank you for the detailed write-up. The Postgres article was definitely helpful. The transaction limits are set based on the max table size.

The transaction can be at most 15% of the table size. The default size of the table is 64 MB, so the max batch size would be about 9.6 MB. I don't know why we have a 15% limit. It would be worth experimenting with some other numbers (maybe 50%?). You might be able to work around the transaction size limit by setting a high value for max table size.

Fixing the transaction-too-big issue isn't easy. Postgres stores uncommitted transactions separately and then adds them to the main WAL/BTree. A similar change in badger would be a big one, and it might take a decent amount of time to build.

Whether it is perceived as a problem at all

I believe a lot of users might be facing this problem and we should fix this.

Is a solution to this problem part of the future of BadgerDB

Yes, we're always working on improving badger. This seems like a useful improvement. I'll add this to our backlog.

If the above answers are both yes, how should it be approached

I believe the first step would be to study how postgres/mysql/rocksdb deal with this problem; a detailed write-up about how we could tackle it in badger would be a good starting point.

Is there anything I could do to help

Badger is open source and we encourage people to contribute. Anything you can contribute toward improving badger or fixing the transaction-too-big issue will be helpful to all the contributors 🙂

mvanderkroon commented:

@jarifibrahim thanks for the equally detailed response. I'm happy to learn this is something I could contribute to BadgerDB.

Thank you for the pointer towards the max table size. I looked into the mechanics and ran a few tests, and I believe this would help extend the number of operations before receiving an ErrTxnTooBig (essentially providing a dial to regulate the transaction size within which atomicity is guaranteed), but the transaction would still be held in RAM. For our use case this latter part is problematic; users may send variable-sized payloads to the service, and thus it's difficult to dimension the hardware appropriately.

I am, however, working on a poor man's solution: essentially a layer on top of BadgerDB's transactional layer. By assigning blocks of data unique IDs (versions) and making those part of the keys, I can stream data to the database, committing whenever I hit ErrTxnTooBig. Once the process is done, I switch the 'active block ID' over to the latest inserted version and initiate a job that deletes the previous version. This is by no means a 'good' solution that I would present back to this repository (for one, it feels like I'm reimplementing logic that is already present in some form or another), but for our (narrow) use case it will do fine. At the same time, this process will allow me to gain more knowledge and understanding of BadgerDB, as well as an intuition for a proper way forward. I'll circle back once I reach that point.

jarifibrahim commented:

@mvanderkroon WriteBatch might be useful to you: badger/batch.go at ef28ef36b5923f12ffe3a1702bdfa6b479db6637 · dgraph-io/badger · GitHub. It is a layer on top of transactions which allows inserting data at a higher speed (and also takes care of ErrTxnTooBig). The only caveat is that you cannot read data using a write batch, but I'm guessing you're not reading any data.

mvanderkroon commented:

Thank you, that looks very useful for my use case.

Hey folks,

I'm affected by the same issue as the original poster. I see this was ported over from GitHub back in 2020. @jarifibrahim mentioned they'd add it to the backlog, but I imagine, as with most nontrivial projects, you probably get new tickets and feature requests quite often, and it can be hard to get to older things like this one. Is there any chance you could please revisit this?

I found BadgerDB recently and was very impressed by it. I'd like to use it to replace SQLite as the catalog layer for a storage system. I've gotten to a point where the SQL queries are a significant portion of the system's complexity, and Badger seems to be a great solution for that.

I need to store metadata about files, some of which are streamed in and stored in fragments. I need to reference individual fragments for deduplication purposes. There can easily be 100k fragments in a file, potentially more. Of course, I need atomicity for storing a given file. However, with the limits I've seen described here, it seems there's a good chance I won't be able to store some files in a single transaction. That is a non-starter, as the DB would be left in an inconsistent state with dangling references if the process dies at the wrong time.

I also have other operations which need atomicity and can grow quite large, e.g. deleting a "partition" of sorts. A partition would typically have 100k-1M files, possibly more. This would be much easier to implement correctly with atomicity; I could probably get away with something like db.DropPrefix for part of it if I design the schema just right, but there will be dangling references (e.g. indexes) which need updating. Without atomicity, that again means the process might die at the wrong time and leave the DB inconsistent.

I could try to implement a journaling layer on top of Badger, but that really defeats the whole purpose. I'm bound to get it subtly wrong in hard-to-detect ways, and would basically just be implementing a DBMS on top of another DB…

Some of this might be doable if we could write a tree of keys under a separate namespace (a different prefix) and then atomically move it (i.e. rename all keys with that prefix). That way, we could do things in an RCU style, as long as the atomic rename didn't have size limitations. The ideal solution, however, would be to have arbitrarily large transactions.

Could you please revive this request?

Friendly ping, any comments please?