Supporting transactions in Badger

Transactions

• Take inspiration from BoltDB’s APIs for transactions.
• For Update transactions, serialize them via a channel.
• For Read-only transactions, run them directly (if we can solve the problem while committing, described below).

Update Txn:
• Say, an update reads one set of keys (call them R_r) and writes to another set (R_w).
• If we serialize txns, we don’t need a timestamp oracle.
◦ But we do need one to solve the replay issue below.
• If a txn does a Put/Set, we can write those updates out to the value log.
• The only part deferred to commit would be updating the vptrs in the LSM tree.
• Just before commit, run value log sync if required.
◦ This can be better achieved by only writing to value log once.
◦ And keeping a local write map to serve any reads.
• Update the LSM tree at commit to make the keys point to the updated values/vptrs (a sketch of this flow follows).
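Roughly, that flow could look like this – a minimal Go sketch where Txn, update, and the in-memory store map are made-up stand-ins, not Badger’s actual API:

```go
// Sketch: update txns serialized through a channel, with a local write map
// serving reads inside the txn. Names and types here are hypothetical.
package txnsketch

type Txn struct {
	writes map[string][]byte // local write map: serves reads of keys written in this txn
	store  map[string][]byte // stand-in for the LSM tree + value log
}

func (t *Txn) Set(key string, val []byte) { t.writes[key] = val }

func (t *Txn) Get(key string) ([]byte, bool) {
	if v, ok := t.writes[key]; ok { // read-your-own-writes from the local map
		return v, true
	}
	v, ok := t.store[key]
	return v, ok
}

type update func(*Txn) error

// runUpdates serializes all update txns: one goroutine owns the channel, so
// no two update txns ever run concurrently and no conflicts can occur.
func runUpdates(store map[string][]byte, updateCh <-chan update) {
	for fn := range updateCh {
		txn := &Txn{writes: map[string][]byte{}, store: store}
		if err := fn(txn); err != nil {
			continue // user returned an error: txn aborted, nothing applied
		}
		// Commit: write the value log once, then update the LSM tree so the
		// keys point at the new values/vptrs (both collapsed into one map here).
		for k, v := range txn.writes {
			store[k] = v
		}
	}
}
```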

Problem while committing:
• When committing a transaction, we’d write to the R_w keys in a loop.
• We need to ensure that while we’re updating the R_w key vptrs, no reads happen for these keys.
• Otherwise, the same read-only txn could read some rows from a previous commit, and others from the latest commit.
• In effect, the reads should happen either entirely before or entirely after these rows are updated.

Solution:
◦ For that, we’d need to acquire a write lock over the skiplist.
◦ And run all the reads via a read lock over the skiplist.
◦ This could have a significant impact on read-write latencies, because the skiplist currently uses atomic operations and allows reads while writes are happening elsewhere.
◦ Alternatively, we could keep a separate hash map that tracks the rows being updated, and pause reads on just those rows (see the sketch right below).
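Something along these lines – a tracker that marks the rows being committed, so reads pause only on those rows instead of taking a lock over the whole skiplist; all names here are hypothetical, not Badger internals:

```go
package commitlock

import "sync"

type commitTracker struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inFlight map[string]struct{} // keys whose vptrs are being rewritten right now
}

func newCommitTracker() *commitTracker {
	t := &commitTracker{inFlight: map[string]struct{}{}}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// beginCommit marks the keys of one txn as being updated.
func (t *commitTracker) beginCommit(keys []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, k := range keys {
		t.inFlight[k] = struct{}{}
	}
}

// endCommit clears the marks and wakes up any readers waiting on them.
func (t *commitTracker) endCommit(keys []string) {
	t.mu.Lock()
	for _, k := range keys {
		delete(t.inFlight, k)
	}
	t.mu.Unlock()
	t.cond.Broadcast()
}

// waitIfPending blocks a read only while its key is mid-commit; reads of
// unrelated keys are unaffected, unlike a read lock over the whole skiplist.
func (t *commitTracker) waitIfPending(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for {
		if _, pending := t.inFlight[key]; !pending {
			return
		}
		t.cond.Wait()
	}
}
```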

Impact on other components

• Since all Update txns are serialized, we won’t have any aborted txns due to conflicts.
• We don’t need to incorporate timestamps into keys; the number of keys generated is the same as before.
• Writes to the value log behave the same way; we don’t have any extra writes to the value log.
• If sync writes are set, then every txn would cause a sync on the value log, which would impact write performance.

Replay issue caused by application crash:
• Txns can still abort if there are disk errors, or if the user returns an error (instead of nil), causing the Txn to abort.
• On replay, we shouldn’t pick up partial writes from such transactions (i.e. where only some keys were updated).

Solution:
• This might require us to incorporate a txn timestamp (from the timestamp oracle) into the value log.
• Append the txn timestamp (call it ts) to every write to the value log.
• The last entry of the txn in the value log must be a commit entry, carrying the same ts.
• On replay, as we iterate serially over the updates, we wait for a commit entry with the same ts before applying them to the LSM tree (see the sketch after this list).
• If some of the writes were lost to disk issues, then none of the updates would be applied.
• Txns don’t abort otherwise, so no other cases are possible.
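A sketch of that replay rule, assuming a made-up entry layout with a txn timestamp and a commit marker (not Badger’s actual value log format):

```go
package replay

type entry struct {
	key, value []byte
	txnTs      uint64 // timestamp handed out by the oracle
	isCommit   bool   // last entry of a txn in the value log
}

// replay buffers a txn's entries and applies them to the LSM tree only once
// a commit entry with the same timestamp is seen.
func replay(log []entry, apply func(key, value []byte)) {
	var pending []entry
	var pendingTs uint64
	for _, e := range log {
		switch {
		case e.isCommit && e.txnTs == pendingTs:
			// Commit marker found: the txn made it to disk completely,
			// so it is now safe to update the LSM tree.
			for _, p := range pending {
				apply(p.key, p.value)
			}
			pending, pendingTs = nil, 0
		case e.isCommit:
			// Commit for a txn we never saw fully; drop whatever we had.
			pending, pendingTs = nil, 0
		default:
			if e.txnTs != pendingTs {
				// A new txn started without a commit for the previous one:
				// the previous txn was partially written, so discard it.
				pending, pendingTs = nil, e.txnTs
			}
			pending = append(pending, e)
		}
	}
	// Anything still pending at the end never got its commit entry,
	// so partial writes are never applied.
}
```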

Value log GC and replay:
• When we rewrite a value log, mark all the rewritten entries with a flag.
◦ In fact, we should mark all the entries from a txn with a flag as well.
• On replay (after a crash), if we see an entry with this flag, we apply it to the LSM tree directly.
• We don’t need to look for a commit entry before applying it to the LSM tree (see the sketch below).
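That flag would just short-circuit the commit check on replay. A tiny sketch, with a hypothetical meta bit rather than Badger’s actual entry format:

```go
package replay

// bitRewritten is a hypothetical meta flag set when value log GC rewrites an
// entry (or when an entry belongs to an already-completed txn).
const bitRewritten byte = 1 << 6

// applyOnReplay decides whether an entry can be applied to the LSM tree.
func applyOnReplay(meta byte, seenCommitForTxn bool) bool {
	if meta&bitRewritten != 0 {
		return true // rewritten by GC: already known to be committed
	}
	return seenCommitForTxn // otherwise only apply after the commit entry
}
```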


This is the biggest point of contention towards transactions I think. So, ideas are welcome.

If we add versions to every key, then all our reads can be served from a timestamp less than the latest commit ts T_c, and only once the txn is committed would the read ts advance to T_c. That way, we don’t need to acquire locks.

The downside is that we’d need to store a lot more data – every key-value update would be stored as a version. While this is common when doing writes concurrently, it might be avoidable in our case, where writes happen serially.
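For illustration, a minimal sketch of what versioned keys plus a read ts could look like; the encoding and names are made up, not Badger’s format:

```go
package mvcc

import (
	"encoding/binary"
	"sync/atomic"
)

var readTs uint64 // latest ts visible to read-only txns

// versionedKey appends the commit ts so every update is stored as a new version.
func versionedKey(key []byte, commitTs uint64) []byte {
	out := make([]byte, len(key)+8)
	copy(out, key)
	binary.BigEndian.PutUint64(out[len(key):], commitTs)
	return out
}

// visible reports whether a stored version can be served to readers: versions
// written by a commit that is still in flight stay hidden.
func visible(versionTs uint64) bool {
	return versionTs <= atomic.LoadUint64(&readTs)
}

// advanceReadTs is called only once a commit has fully updated the LSM tree.
func advanceReadTs(commitTs uint64) {
	atomic.StoreUint64(&readTs, commitTs)
}
```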

On further thought, maybe read-only txns don’t really care – if they see rows R_{i-1} and R_{i+1} at two different versions while a commit is happening, it’s probably alright. It matters only if they decide to do a write based on those reads; but then they’d become an update txn, which runs serially with all the other update txns, which are guaranteed not to conflict.

This means that we can apply the updates to multiple rows in an update txn serially, and have these updates become visible to read-only txns as they become available, instead of only after all of them have become available.

My concern here is that for our purposes we need to think about how to do transactions for Dgraph, not how to do transactions for Badger. It doesn’t seem clear to me that Badger transactions are something that would help the implementation of Dgraph transactions. It might be that there’s some particular new feature of Badger that could help - for example, having an externally introduced version number, and being able to tell Badger it can insta-expire old versions of keys below some threshold, is something that might be useful to a distributed transaction system.

If we are interested in Badger-only transactions:

Ultimately there’s some kind of multi-versioning to be done – either multiple keys with different version numbers (and we already have version numbers a.k.a. cas counter values in Badger) or some kind of pure-functional-datastructure-esque overarching DB snapshot like what the backup implementation does (which is slow the same way that making an iterator is, because it has to iterate through all the tables up-front).

We could also have versioned tables, instead of versioned key/values – each table has a version interval describing when it was created and deleted, either [low, infinity) or [low, deletionVersion). The upside to this is that we don’t have old versions of keys percolating around waiting to get deleted in some compaction (which would be costly, I think, because we’d have to be able to figure out if each key is an old version or not). The downside is that instead of a simple binary search in table lookup when performing a get, we have to do something a bit more fussy.
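A small sketch of the versioned-tables idea, with a hypothetical table descriptor carrying a version interval (purely illustrative, not Badger’s table metadata):

```go
package vtables

import "math"

// stillLive is the deletionVersion for tables that haven't been deleted,
// i.e. an interval of [created, infinity).
const stillLive uint64 = math.MaxUint64

type table struct {
	created uint64 // version at which the table came into existence
	deleted uint64 // version at which it was replaced; stillLive if not yet
	// ... key range, file handle, etc. would live here
}

// liveAt reports whether this table should be consulted for a read at version v.
func (t table) liveAt(v uint64) bool {
	return t.created <= v && v < t.deleted
}

// tablesFor is the "bit more fussy" lookup: filter by version interval first,
// then do the usual key-range search within the surviving tables.
func tablesFor(all []table, v uint64) []table {
	var out []table
	for _, t := range all {
		if t.liveAt(v) {
			out = append(out, t)
		}
	}
	return out
}
```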

I’m curious about this point:

From that specific perspective, I don’t see how it would impact performance, which most likely means I’m missing details. If syncs are on, I’d expect writes to already be synchronizing before writes return (somewhat implied by SyncWrites). So in that sense, a transaction mechanism might actually help the performance in some cases as inside the transaction the logic is free not to synchronize.

In other words, some of the performance today comes from batching independent writes, but if you have logic which does a sequence of reads and writes, the database might synchronize N times in an attempt to persist more. With transactions, larger logical changes may be grouped and ignore syncing for longer, since we don’t care about having just half of those entries persisted.

The performance impact would only come up if independent transactions were serialized and not grouped together, because then the positive impact of not synchronizing inside the transaction would be dwarfed by the negative impact of making every other write wait. But the implementation might encourage merging of independent transactions, similar to what BoltDB does with the Batch method.
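Something like the following, in the spirit of BoltDB’s Batch – a hypothetical group-commit loop where independent commits queue up and share one sync (names and structure are made up):

```go
package groupcommit

type commit struct {
	entries [][]byte
	done    chan error
}

// commitLoop drains all currently queued commits, writes them as one batch,
// and pays for a single sync on their behalf.
func commitLoop(commitCh chan commit, writeAndSync func([][]byte) error) {
	for c := range commitCh {
		group := []commit{c}
		// Grab whatever else is already waiting, without blocking.
	drain:
		for {
			select {
			case more, ok := <-commitCh:
				if !ok {
					break drain
				}
				group = append(group, more)
			default:
				break drain
			}
		}
		var all [][]byte
		for _, g := range group {
			all = append(all, g.entries...)
		}
		err := writeAndSync(all) // one fsync amortized over the whole group
		for _, g := range group {
			g.done <- err
		}
	}
}
```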

Just as a data point, in some simple experiments I’ve run with large keys (~600kb) and SSDs, tuning concurrency via Batch brings BoltDB much closer to Badger, which we should expect as it greatly reduces the number of syncs per operation. BoltDB performed 208 adds/sec vs. 279 adds/sec from Badger. Both using the same concurrency setup. Without Batch, BoltDB was at least twice as slow, and that’s even disregarding the benefits mentioned above, since each transaction was doing a single put operation.

That’s a good point. My comment was assuming each transaction runs serially in a loop, and syncs to disk on commit. Surely, batching up multiple of them, if they have no deps on each other, would amortize the cost of sync; just like it happens today.

This is probably off-topic, but wanted to clarify. If you run Badger in a loop with 1 set per iteration, no batching would happen in Badger, and each set would force a disk sync (assuming SyncWrites was set to true). This isn’t something we’ve optimized for, which is why you probably see BoltDB’s writes much closer to Badger’s. In our experience and benchmarks, Badger’s writes are significantly faster.

Apologies in advance if I’m missing something, but that’s not what I see in a quick skim through the code:

  1. Set calls BatchSet at kv.go:826 with a single item
  2. BatchSet calls sendToWriteCh at kv.go:775
  3. sendToWriteCh does some fiddling, but ultimately sends the entries via writeCh in kv.go:739 or kv.go:749
  4. Now, here is the trick: doWrites loops attempting to take requests out of writeCh in kv.go:677. If writeCh happens to return quickly enough, or writeRequests happens to be blocked writing (or more likely, syncing), it will create a set of requests until len(reqs) >= 3*kvWriteChCapacity.
  5. Those requests go down to writeRequests in kv.go:707 which calls KV.writeRequests with the list of grouped requests.
  6. writeRequests sends those requests to the value log at kv.go:618.
  7. The ValueLog groups those entries together in the same buffer in value.go:869, and sends them toDisk together.

So that looks pretty analogous to what happens when we use Batch in BoltDB: we have multiple unrelated requests being merged together in a buffer and syncing to disk at once. I’ve also consciously used a single set inside the Batch operation in BoltDB, to make the comparison even more apples-to-apples.
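For readers following along, this is roughly the shape of that batching loop – not the actual Badger code from kv.go, just a stripped-down sketch of steps 4–7 with made-up names and a stand-in capacity:

```go
package writeloop

const maxPending = 3 * 1000 // stand-in for 3*kvWriteChCapacity

type request struct{ entries [][]byte }

// doWrites keeps pulling requests off the channel while a previous write (or
// sync) is still in progress, then hands the accumulated group to a single
// writeRequests call, which writes and syncs the value log once.
func doWrites(writeCh chan *request, writeRequests func([]*request) error) {
	var reqs []*request
	for r := range writeCh {
		reqs = append(reqs, r)
		// Keep collecting whatever is immediately available, up to a cap.
	collect:
		for len(reqs) < maxPending {
			select {
			case more, ok := <-writeCh:
				if !ok {
					break collect
				}
				reqs = append(reqs, more)
			default:
				break collect
			}
		}
		_ = writeRequests(reqs) // one value-log write (and sync) for the group
		reqs = nil
	}
}
```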

I’ve had a quick look in the benchmarks and apparently they disable all syncing to disk, which is something a bit curious to have completely omitted from a database performance article. I’m sure there’s a class of use cases where this would shine such as initial bulk loading, but at least for the multiple use cases I’ve been responsible for over the years, that’s a non-starter. We need to have some form of persistence guarantees, and we need to understand what these are clearly. Dgraph also makes the situation slightly worse as it splits even the individual elements of the row/doc into multiple entities and potentially multiple writes, so as it is we’d really not get even close to it for production cases unless we end up with some form of read-only or non-live updating use case.

If you notice here, at kv.go:690: as soon as the pending channel can accept the request, we goto write.

Now, the loop case I was talking about is where you do Set or BatchSet in a loop, and each call has to return before the next Set is called. Therefore, each request would only have one entry in it, and then it would do the write. The write must finish before the next request is queued.

So, in effect, only one write is happening at a time. Of course, in a real-world use case, you’d be running multiple goroutines, so you can have multiple pending requests which can be picked up by that loop. But typically, I’ve seen devs new to Badger do Set in a loop and complain about performance. We even have an FAQ for it.
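To illustrate the difference, a sketch against a hypothetical store interface (not Badger’s exact Set signature): the plain loop gives the write path only one pending request at a time, while concurrent goroutines let requests pile up and get batched into fewer syncs.

```go
package usage

import "sync"

type store interface {
	Set(key, value []byte) error
}

// slowLoop: each Set must return (write + possible sync) before the next begins.
func slowLoop(s store, keys, vals [][]byte) {
	for i := range keys {
		_ = s.Set(keys[i], vals[i])
	}
}

// concurrentSets: many in-flight Sets give the writer something to batch.
func concurrentSets(s store, keys, vals [][]byte) {
	var wg sync.WaitGroup
	for i := range keys {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = s.Set(keys[i], vals[i])
		}(i)
	}
	wg.Wait()
}
```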

Our benchmark strategy was this: load up a dataset at least 2x the amount of RAM, using a production-level server (so not a 2GB machine).

Bolt took 8h20m to load up 250M keys with 128B values. This is with sync writes set to false. If we were to run this job with sync writes set to true, assuming it takes at least 3x that, we’re looking at a loading time of over a day. That’s a no-go, particularly given that the goal is loading up a big data set, where each request doesn’t need to be flushed to disk before returning.

But, I can understand why you and other folks think that sync writes are a must in benchmarks. We might run populate just with sync writes set to true (but for a smaller data set), to get those numbers out. In that case, Badger would have an even bigger win against Bolt.

That’s assuming that multiple such entries mean multiple writes to disk, which would be slower (compared to ??). That’d be a wrong assumption. Multiple such updates would get batched up and the cost of the disk write amortized. In fact, LSM trees are adept at providing very high write throughput, which is what the Badger benchmarks show as well.

Yes, as soon as an event happens, it will let another batch in. How does that change the fact that it batches writes?

Yes, exactly. I’m indeed talking about real world usage.

Right, as I said, there are certainly some use cases for in-memory databases and for databases with weak persistence guarantees, and I apologize if I sounded like I was pushing an agenda on you. After all, it’s your project and you need to do what’s best for the users and for the market you’re aiming it at.

Where I’ve tested – which, by the way, uses a meaningful delta (15%) on top of a pre-populated database rather than trying to rush an initial data set into an empty database (my real world) – I can’t reproduce the major advantage demonstrated in the blog post benchmarks, per earlier messages. In fact, if I tuned the setup to be even more realistic, with multiple writes per batch (which I’d argue is the typical use case), the BoltDB figure would necessarily improve further.

Not sure why you’re bringing up performance as an answer to that point. The remark was entirely about the lack of predictable persistence when streaming multiple items, which is a larger issue for us because each item is actually an individual value in a document/row.

Can you share your benchmark code? With SyncWrites enabled, the write-throughput gap between Badger and Bolt should only widen, not shrink.

I hope you’re not conflating the benchmarks with the applications for Badger. Everyone’s use case is different. If a database is using a WAL, then there might not be a need for sync writes. The important thing is that an option is available to users, so they can choose which way they want to go. There’s no right answer here.

My apologies! I thought this comment tied into the rest of the conversation about benchmarks.

If a write from Dgraph comes back with a success, then that write has been persisted. You won’t lose it. That’s a guarantee we make (as long as FS doesn’t get corrupted). So, that should make it easy to think about the behavior if you’re streaming multiple items.

The test was nothing special. Just loaded a little over 12GB into a database with ~600kb per key, then benchmarked the jump to 15GB with 8 workers. As mentioned, one item per set, same setup for both. I don’t have something at hand that I can push right away to reproduce it, as I’ve repeatedly changed the sandbox, but I can look into it again if it ends up being mysterious.

About persistence, given the strategies we discussed above, it does sound like if we’re adding multiple edges, there’s a chance for only some of the edges to be persisted in the case of a crash. Of course, that may be considered corruption, but that’s the essence of the issue at hand.

I’m about to go offline on a trip which should take me a couple of days, so apologies for the lack of further feedback over the next several hours. If someone is going to be in NY next week, let’s meet.

The only case where this happens is if you sent the write (with all the edges for the same node/row), and the system crashed before it returned successfully. (In fact, no other case should cause this – if the write came back successful, then all of these edges were persisted.) In that case, you don’t know what was persisted and what wasn’t. That can only be solved with transactions (which we’re looking into).

Have a nice trip! We’re in Sydney, Australia – so if your trip takes you down under, we’ll be up for a beer :-).
