• Take inspiration from BoltDB’s APIs for transactions.
• Serialize Update transactions via a channel.
• Run read-only transactions directly (if we can solve the problem while committing, described below).
• Say an Update txn reads the /lrr keys and writes to the /lrw keys.
• If we serialize txns, we don’t need /tor (timestamp oracle).
◦ But we need it to solve the replay issue below.
• If a txn does a Put/Set, we can write the entries out to the value log.
• The only care needed is in updating the vptrs in the LSM tree.
• Just before commit, run value log sync if required.
◦ This can be better achieved by only writing to value log once.
◦ And keeping a local write map to serve any reads.
• Update LSM tree at commit to make the keys point to updated values/vptrs.
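The points above can be sketched roughly as follows. This is a minimal illustration, not the actual design: the names `DB`, `Txn`, `Open`, `Update`, `incrCounter` and `errAbort` are hypothetical, a plain map stands in for the LSM tree, and the value log is left out entirely. It shows Update txns funnelled through a channel to a single goroutine, a local write map serving reads inside the txn, and the LSM tree being touched only at commit.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"sync"
)

// DB is a toy store; the map stands in for the LSM tree (assumption).
type DB struct {
	mu      sync.RWMutex
	lsm     map[string]string
	updates chan updateReq
}

type updateReq struct {
	fn   func(txn *Txn) error
	done chan error
}

// Txn buffers writes in a local map and serves reads from it first.
type Txn struct {
	db     *DB
	writes map[string]string
}

func Open() *DB {
	db := &DB{lsm: make(map[string]string), updates: make(chan updateReq)}
	go db.runUpdates()
	return db
}

// runUpdates executes Update txns one at a time, in arrival order.
func (db *DB) runUpdates() {
	for req := range db.updates {
		txn := &Txn{db: db, writes: make(map[string]string)}
		err := req.fn(txn)
		if err == nil {
			txn.commit() // only successful txns reach the LSM tree
		}
		req.done <- err
	}
}

// Update hands fn to the single update goroutine and waits for the result.
func (db *DB) Update(fn func(txn *Txn) error) error {
	req := updateReq{fn: fn, done: make(chan error, 1)}
	db.updates <- req
	return <-req.done
}

func (txn *Txn) Set(key, value string) { txn.writes[key] = value }

// Get checks the local write map before falling back to the LSM tree.
func (txn *Txn) Get(key string) (string, bool) {
	if v, ok := txn.writes[key]; ok {
		return v, true
	}
	txn.db.mu.RLock()
	defer txn.db.mu.RUnlock()
	v, ok := txn.db.lsm[key]
	return v, ok
}

// commit points the keys at the updated values in one pass.
func (txn *Txn) commit() {
	txn.db.mu.Lock()
	defer txn.db.mu.Unlock()
	for k, v := range txn.writes {
		txn.db.lsm[k] = v
	}
}

var errAbort = errors.New("abort")

// incrCounter is a read-modify-write txn; a missing key counts as 0.
func incrCounter(txn *Txn) error {
	v, _ := txn.Get("counter")
	n, _ := strconv.Atoi(v)
	txn.Set("counter", strconv.Itoa(n+1))
	return nil
}

func main() {
	db := Open()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = db.Update(incrCounter)
		}()
	}
	wg.Wait()
	// A txn that returns an error leaves no trace in the LSM tree.
	_ = db.Update(func(txn *Txn) error {
		txn.Set("counter", "-1")
		return errAbort
	})
	_ = db.Update(func(txn *Txn) error {
		v, _ := txn.Get("counter")
		fmt.Println("counter =", v) // prints "counter = 100"
		return nil
	})
}
```

Because every Update runs on the same goroutine, the 100 concurrent increments cannot lose an update. Calling `Update` from inside another `Update` would deadlock in this sketch.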
Problem while committing:
• When committing a transaction, we’d write to /lrw keys in a loop.
• We need to ensure that while we’re updating /lrw key vptrs, no reads happen for these keys.
• Otherwise, the same read-only txn could read some rows from a previous commit, and others from the latest commit.
• In effect, the reads should be either before or after these rows are updated.
◦ For that, we’d need to acquire a Write lock for Skiplist.
◦ And run all the reads via a Read lock over Skiplist.
◦ This can have a significant impact on read and write latencies, because the skiplist currently relies on atomic operations and allows reads while writes are happening elsewhere.
◦ Alternatively, we could keep a separate hash map, which tracks the list of rows being updated and pauses reads on those rows.
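The lock-based option above might look like the sketch below. It is only an illustration under stated assumptions: `Index`, `Commit`, `View` and `tornReads` are hypothetical names, a map stands in for the skiplist, and a version number stands in for a vptr. It also assumes the read lock is held for the whole read-only txn, which is one way to get "either before or after"; the hash-map alternative is not shown.

```go
package main

import (
	"fmt"
	"sync"
)

// Index stands in for the skiplist: key -> vptr (here just a version number).
type Index struct {
	mu    sync.RWMutex
	vptrs map[string]uint64
}

func NewIndex() *Index { return &Index{vptrs: make(map[string]uint64)} }

// Commit applies all vptr updates of one txn under the write lock,
// so no reader can observe the loop half-way through.
func (ix *Index) Commit(updates map[string]uint64) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for k, v := range updates {
		ix.vptrs[k] = v
	}
}

// View runs fn under the read lock, so every lookup made inside fn
// sees the index either before or after any given commit.
func (ix *Index) View(fn func(get func(key string) uint64)) {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	fn(func(key string) uint64 { return ix.vptrs[key] })
}

// tornReads commits the same version to keys "a" and "b" repeatedly while
// readers compare them, and counts reads that saw a mix of two commits.
func tornReads(commits, readers int) int {
	ix := NewIndex()
	var wg sync.WaitGroup
	var mu sync.Mutex
	torn := 0
	for r := 0; r < readers; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < commits; i++ {
				ix.View(func(get func(string) uint64) {
					if get("a") != get("b") {
						mu.Lock()
						torn++
						mu.Unlock()
					}
				})
			}
		}()
	}
	for v := uint64(1); v <= uint64(commits); v++ {
		ix.Commit(map[string]uint64{"a": v, "b": v})
	}
	wg.Wait()
	return torn
}

func main() {
	fmt.Println("torn reads:", tornReads(1000, 4)) // prints "torn reads: 0"
}
```

The cost noted above is visible here: every commit blocks all readers, and a long-running `View` delays every commit.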
Impact on other components:
• Since all Update txns are serialized, we won’t have any txns aborted due to conflicts.
• We don’t need to incorporate the timestamps into keys, so the number of keys generated is the same as before.
• The writes to value log act the same way. We don’t have any extra writes to value log.
• If sync writes are set, then every txn would cause a sync on value log, which would impact write performance.
Replay issue caused by application crash:
• Txns can abort if there are disk errors, or if the user function returns an error instead of nil.
• On replay, we shouldn’t pick up partial writes from transactions (i.e. only some keys were updated).
• This might require us to incorporate a /tor txn timestamp in value log.
• Append a /tor txn timestamp /tts on every write to value log.
• The last entry to value log must be a commit entry, with the same /tts.
• On replay, as we iterate serially over the updates, we wait for the commit entry with the same /tts before applying them to the LSM tree.
• If some of the writes were lost to disk issues, then none of the updates would be applied.
• Txns don’t abort for any other reason (Update txns are serialized), so no other cases are possible.
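A rough sketch of the replay rule above, under stated assumptions: `Entry`, its fields and `Replay` are hypothetical names, a slice stands in for the value log, a map stands in for the LSM tree, and entries of one txn are assumed to be contiguous because Update txns are serialized.

```go
package main

import "fmt"

// Entry is a toy value log record. TTS is the txn timestamp appended to
// every write; Commit marks the final entry of a txn (assumed layout).
type Entry struct {
	Key, Value string
	TTS        uint64
	Commit     bool
}

// Replay walks the log serially and applies a txn's writes only once the
// commit entry with the same TTS is seen. Writes without a matching
// commit entry are dropped.
func Replay(log []Entry) map[string]string {
	lsm := make(map[string]string)
	var pending []Entry
	var curTTS uint64
	for _, e := range log {
		if len(pending) > 0 && e.TTS != curTTS {
			// A new timestamp started before the previous txn committed:
			// that txn was partial, so discard its buffered writes.
			pending = pending[:0]
		}
		curTTS = e.TTS
		if e.Commit {
			for _, p := range pending {
				lsm[p.Key] = p.Value
			}
			pending = pending[:0]
			continue
		}
		pending = append(pending, e)
	}
	// Anything still pending here never saw its commit entry.
	return lsm
}

func main() {
	log := []Entry{
		{Key: "a", Value: "1", TTS: 1},
		{Key: "b", Value: "2", TTS: 1},
		{TTS: 1, Commit: true},
		{Key: "a", Value: "3", TTS: 2}, // crash before the commit entry
	}
	fmt.Println(Replay(log)) // prints "map[a:1 b:2]"
}
```

The second txn wrote `a=3` but never reached its commit entry, so replay leaves `a` at the value from the first txn.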
Value log GC and replay:
• When we rewrite a value log, mark all the rewritten entries with a flag.
◦ In fact, we should mark all the entries from a txn as well with a flag.
• On a replay (after a crash), if we see any entry with the rewritten flag, we apply it to the LSM tree directly.
• We don’t need to look for a commit entry before applying to LSM tree.
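One possible reading of the flags above, sketched under assumptions: the bit names `bitTxn`, `bitCommit` and `bitRewritten`, the `Meta` field and the `Replay` function are all hypothetical. Entries moved by value log GC are applied immediately, presumably because their original commit entry is not rewritten with them, while txn entries still wait for their commit entry.

```go
package main

import "fmt"

// Hypothetical flag bits stored with each value log entry.
const (
	bitTxn       byte = 1 << 0 // entry belongs to a txn; wait for its commit entry
	bitCommit    byte = 1 << 1 // commit entry of a txn
	bitRewritten byte = 1 << 2 // entry was moved by value log GC
)

type Entry struct {
	Key, Value string
	TTS        uint64
	Meta       byte
}

// Replay applies rewritten entries directly and txn entries only after
// the commit entry with the same TTS. Entries with no flag are ignored
// in this sketch.
func Replay(log []Entry) map[string]string {
	lsm := make(map[string]string)
	var pending []Entry
	for _, e := range log {
		switch {
		case e.Meta&bitRewritten != 0:
			lsm[e.Key] = e.Value // no commit entry needed
		case e.Meta&bitCommit != 0:
			for _, p := range pending {
				if p.TTS == e.TTS {
					lsm[p.Key] = p.Value
				}
			}
			pending = pending[:0]
		case e.Meta&bitTxn != 0:
			if len(pending) > 0 && pending[0].TTS != e.TTS {
				pending = pending[:0] // previous txn never committed
			}
			pending = append(pending, e)
		}
	}
	return lsm
}

func main() {
	log := []Entry{
		{Key: "x", Value: "old", Meta: bitRewritten},
		{Key: "y", Value: "1", TTS: 7, Meta: bitTxn},
		{TTS: 7, Meta: bitCommit},
		{Key: "z", Value: "2", TTS: 8, Meta: bitTxn}, // no commit entry
	}
	fmt.Println(Replay(log)) // prints "map[x:old y:1]"
}
```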