Introducing Badger: A fast key-value store written natively in Go - Dgraph Blog


(Manish R Jain) #1

We have built an efficient and persistent log structured merge (LSM) tree based key-value store, natively in Go language. It is based upon WiscKey paper included in USENIX FAST 2016. This design is highly SSD-optimized and separates keys from values to minimize I/O amplification; leveraging both the sequential and the random performance of SSDs.

We call it Badger. Based on benchmarks, Badger is at least 3.5x faster than RocksDB when doing random reads. For value sizes between 128B to 16KB, data loading is 0.86x - 14x faster compared to RocksDB, with Badger gaining significant ground as value size increases. On the flip side, Badger is currently slower for range key-value iteration, but that has a lot of room for optimization.

Background and Motivation

Word about RocksDB

RocksDB is the most popular and probably the most efficient key-value store in the market. It originated in Google as SSTable which formed the basis for Bigtable, then got released as LevelDB. Facebook then improved LevelDB to add concurrency and optimizations for SSDs and released that as RocksDB. Work on RocksDB has been continuously going on for many years now, and it’s used in production at Facebook and many other companies.

So naturally, if you need a key-value store, you’d gravitate towards RocksDB. It’s a solid piece of technology, and it works. The biggest issue with using RocksDB is that it is written in C++; requiring the use of Cgo to be called via Go.

Cgo: The necessary evil

At Dgraph, we have been using RocksDB via Cgo since we started. And we’ve faced many issues over time due to this dependency. Cgo is not Go, but when there are better libraries in C++ than Go, Cgo is a necessary evil.

The problem is, Go CPU profiler doesn’t see beyond Cgo calls. Go memory profiler takes it one step further. Forget about giving you memory usage breakdown in Cgo space, Go memory profiler fails to even notice the presence of Cgo code. Any memory used by Cgo would not even make it to the memory profiler. Other tools like Go race detector, don’t work either.

Cgo has caused us pthread_create issues in Go1.4, and then again in Go1.5, due to a bug regression. Lightweight goroutines become expensive pthreads when Cgo is involved, and we had to modify how we were writing data to RocksDB to avoid assigning too many goroutines.

Cgo has caused us memory leaks. Who owns and manages memory when making calls is just not clear. Go, and C are at the opposite spectrums. One doesn’t let you free memory, the other requires it. So, you make a Go call, but then forget to Free(), and nothing breaks. Except much later.

Cgo has given us a unmaintainable code. Cgo makes code ugly. The Cgo layer between RocksDB was the one piece of code no one in the team wanted to touch.

Surely, we fixed the memory leaks in our API usage over time. In fact, I think we have fixed them all by now, but I can’t be sure. Go memory profiler would never tell you. And every time someone complains about Dgraph taking up more memory or crashing due to OOM, it makes me nervous that this is a memory leak issue.

Huge undertaking

Everyone I told about our woes with Cgo, told me that we should just work on fixing those issues. Writing a key-value store which can provide the same performance as RocksDB is a huge undertaking, not worth our effort. Even my team wasn’t sure. I had my doubts as well.

I have great respect for any piece of technology which has been iterated upon by the smartest engineers on the face of the planet for years. RocksDB is that. And if I was writing Dgraph in C++, I’d happily use it.

But, I just hate ugly code.

And I hate recurring bugs. No amount of effort would have ensured that we would no longer have any more issues with using RocksDB via Cgo. I wanted a clean slate, and my profiler tools back. Building a key-value store in Go from scratch was the only way to achieve it.

I looked around. The existing key-value stores written in Go didn’t even come close to RocksDB’s performance. And that’s a deal breaker. You don’t trade performance for cleanliness. You demand both.

So, I decided we will replace our dependency on RocksDB, but given this isn’t a priority for Dgraph, none of the team members should work on it. This would be a side project that only I will undertake. I started reading up about B+ and LSM trees, recent improvements to their design, and came across WiscKey paper. It had great promising ideas. I decided to spend a month away from core Dgraph, building Badger.

That’s not how it went. I couldn’t spend a month away from Dgraph. Between all the founder duties, I couldn’t fully dedicate time to coding either. Badger developed during my spurts of coding activity, and one of the team members’ part-time contributions. Work started end January, and now I think it’s in a good state to be trialed by the Go community.

LSM trees

Before we delve into Badger, let’s understand key-value store designs. They play an important role in data-intensive applications including databases. Key-value stores allow efficient updates, point lookups and range queries.

There are two popular types of implementations: Log-structured merge (LSM) tree based, and B+ tree based. The main advantage LSM trees have is that all the foreground writes happen in memory, and all background writes maintain sequential access patterns. Thus they achieve a very high write thoughput. On the other hand, small updates on B+-trees involve repeated random disk writes, and hence are unable to maintain high throughput write workload1.

To deliver high write performance, LSM-trees batch key-value pairs and write them sequentially. Then, to enable efficient lookups, LSM-trees continuously read, sort and write key-value pairs in the background. This is known as a compaction. LSM-trees do this over many levels, each level holding a factor more data than the previous, typically size of Li+1 = 10 x size of Li.

Within a single level, the key-values get written into files of fixed size, in a sorted order. Except level zero, all other levels have zero overlaps between keys stored in files at the same level.

Each level has a maximum capacity. As a level Li fills up, its data gets merged with data from lower level Li+1 and files in Li deleted to make space for more incoming data. As data flows from level zero to level one, two, and so on, the same data is re-written multiple times throughout its lifetime. Each key update causes many writes until data eventually settles. This constitutes write amplification. For a 7 level LSM tree, with 10x size increase factor, this can be 60; 10 for each transition from L1->L2, L2->L3, and so on, ignoring L0 due to special handling.

Conversely, to read a key from LSM tree, all the levels need to be checked. If present in multiple levels, the version of key at level closer to zero is picked (this version is more up to date). Thus, a single key lookup causes many reads over files, this constitutes read amplification. WiscKey paper estimates this to be 336 for a 1-KB key-value pair.

LSMs were designed around hard drives. In HDDs, random I/Os are over 100x slower than sequential ones. Thus, running compactions to continually sort keys and enable efficient lookups is an excellent trade-off.

However, SSDs are fundamentally different from HDDs. The difference between their sequential and random reads are not nearly as large as HDDs. In fact, top of the line SSDs like Samsung 960 Pro can provide 440K random read operations per second, with 4KB block size. Thus, an LSM-tree that performs a large number of sequential writes to reduce later random reads is wasting bandwidth needlessly.

Badger

Badger is a simple, efficient, and persistent key-value store. Inspired by the simplicity of LevelDB, it provides Get, Set, Delete, and Iterate functions. On top of it, it adds CompareAndSet and CompareAndDelete atomic operations (see GoDoc). It does not aim to be a database and hence does not provide transactions, versioning or snapshots. Those things can be easily built on top of Badger.

Badger separates keys from values. The keys are stored in LSM tree, while the values are stored in a write-ahead log called the value log. Keys tend to be smaller than values. Thus this set up produces much smaller LSM trees. When required, the values are directly read from the log stored on SSD, utilizing its vastly superior random read performance.

Guiding principles

These are the guiding principles that decide the design, what goes in and what doesn’t in Badger.

  • Write it natively in Go language.
  • Use the latest research to build the fastest key-value store.
  • Keep it simple, stupid.
  • SSD-centric design.

Key-Value separation

The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.

Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.

In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.

Write Amplification

Thus, the LSM tree generated by Badger is much smaller than that of RocksDB. This smaller LSM-tree reduces the number of levels, and hence number of compactions required to achieve stability. Also, values are not moved along with keys, because they’re elsewhere in value log. Assuming 1KB value and 16 byte keys, the effective write amplification per level is (10*16 + 1024)/(16 + 1024) ~ 1.14, a much smaller fraction.

You can see the performance gains of this approach compared to RocksDB as the value size increases; where loading data to Badger takes factors less time (see Benchmarks below).

Read Amplification

As mentioned above, the size of LSM tree generated by Badger is much smaller. Each file at each level stores lots more keys than typical LSM trees. Thus, for the same amount of data, fewer levels get filled up. A typical key lookup requires reading all files in level zero, and one file per level from level one and onwards. With Badger, filling fewer levels means, fewer files need to be read to lookup a key. Once key (along with value pointer) is fetched, the value can be fetched by doing random read in value log stored on SSD.

Furthermore, during benchmarking, we found that Badger’s LSM tree is so small, it can easily fit in RAM. For 1KB values and 75 million 22 byte keys, the raw size of the entire dataset is 72 GB. Badger’s LSM tree size for this setup is a mere 1.7G, which can easily fit into RAM. This is what causes Badger’s random key lookup performance to be at least 3.5x faster, and Badger’s key-only iteration to be blazingly faster than RocksDB.

Crash resilience

LSM trees write all the updates in memory first in memtables. Once they fill up, memtables get swapped over to immutable memtables, which eventually get written out to files in level zero on disk.

In the case of a crash, all the recent updates still in memory tables would be lost. Key-value stores deal with this issue, by first writing all the updates in a write-ahead log. Badger has a write-ahead log, it’s called value log.

Just like a typical write-ahead log, before any update is applied to LSM tree, it gets written to value log first. In the case of a crash, Badger would iterate over the recent updates in value log, and apply them back to the LSM tree.

Instead of iterating over the entire value log, Badger puts a pointer to the latest value in each memtable. Effectively, the latest memtable which made its way to disk would have a value pointer, before which all the updates have already made their way to disk. Thus, we can replay from this pointer onwards, and reapply all the updates to LSM tree to get all our updates back.

Overall size

RocksDB applies block compression to reduce the size of LSM tree. Badger’s LSM tree is much smaller in comparison and can be stored in RAM entirely, so it doesn’t need to do any compression on the tree. However, the size of value log can grow quite quickly. Each update is a new entry in the value log, and therefore multiple updates for the same key take up space multiple times.

To deal with this, Badger does two things. It allows compressing values in value log. Instead of compressing multiple key-values together, we only compress each key-value individually. This provides the best possible random read performance. The client can set it so compression is only done if the key-value size is over an adjustable threshold, set by default to 1KB.

Secondly, Badger runs value garbage collection. This runs periodically and samples a 100MB size of a randomly selected value log file. It checks if at least a significant chunk of it should be discarded, due to newer updates in later logs. If so, the valid key-value pairs would be appended to the log, the older file discarded, and the value pointers updated in the LSM tree. The downside is, this adds more work for LSM tree; so shouldn’t be run when loading a huge data set. More work is required to only trigger this garbage collection to run during periods of little client activity.

Pricing

But, given the fact that SSDs are getting cheaper and cheaper, using extra space in SSD is almost nothing compared to having to store and serve a major chunk of LSM tree from memory. Consider this:

For 1KB values, 75 million 16 byte keys, RocksDB’s LSM tree is 50GB in size. Badger’s value log is 74GB (without value compression), and LSM tree is 1.7GB. Extrapolating it three times, we get 225 million keys, RocksDB size of 150GB and Badger size of 222GB value log, and 5.1GB LSM tree.

Using Amazon AWS US East (Ohio) datacenter:

  • To achieve a random read performance equivalent of Badger (at least 3.5x faster), RocksDB would need to be run on an r3.4xlarge instance, which provides 122 GB of RAM for $1.33 per hour; so most of its LSM tree can fit into memory.
  • Badger can be run on the cheapest storage optimized instance i3.large, which provides 475GB NVMe SSD (fio test: 100K IOPS for 4KB block size), with 15.25GB RAM for $0.156 per hour.
  • The cost of running Badger is thus, 8.5x cheaper than running RocksDB on EC2, on-demand.
  • Going 1-year term all upfront payment, this is $6182 for RocksDB v/s $870 for Badger, still 7.1x cheaper. That’s a whopping 86% saving.

Benchmarks

Setup

We rented a storage optimized i3.large instance from Amazon AWS, which provides 450GB NVMe SSD storage, 2 virtual cores along with 15.25GB RAM. This instance provides local SSD, which we tested via fio to sustain close to 100K random read IOPS for 4KB block sizes.

The data sets were chosen to generate sizes too big to fit entirely in RAM, in either RocksDB or Badger.

Value size Number of keys (each key = 22B) Raw data size 128B 250M 35GB 1024B 75M 73GB 16KB 5M 76GB

We then loaded data one by one, first in RocksDB then in Badger, never running the loaders concurrently. This gave us the data loading times and output sizes. For random Get and Iterate, we used Go benchmark tests and ran them for 3 minutes, going down to 1 minute for 16KB values.

All the code for benchmarking is available in this repo. All the commands ran and their measurements recorded are available in this log file. The charts and their data is viewable here.

Results

In the following benchmarks, we measured 4 things:

  • Data loading performance
  • Output size
  • Random key lookup performance (Get)
  • Sorted range iteration performance (Iterate)

All the 4 measurements are visualized in the following charts.

Data loading performance: Badger’s key-value separation design shows huge performance gains as value sizes increase. For value sizes of 1KB and 16KB, Badger achieves 4.5x and 11.7x more throughput than RocksDB. For smaller values, like 16 bytes not shown here, Badger can be 2-3x slower, due to slower compactions (see further work).

Store size: Badger generates much smaller LSM tree, but a larger value size log. The size of Badger’s LSM tree is proportional only to the number of keys, not values. Thus, Badger’s LSM tree decreases in size as we progress from 128B to 16KB. In all three scenarios, Badger produced an LSM tree which could fit entirely in RAM of the target server.

Random read latency: Badger’s Get latency is only 18% to 27% of RocksDB’s Get latency. In our opinion, this is the biggest win of this design. This happens because Badger’s entire LSM tree can fit into RAM, significantly decreasing the amount of time it takes to find the right tables, check their bloom filters, pick the right blocks and retrieve the key. Value retrieval is then a single SSD file.pread away.

In contrast, RocksDB can’t fit the entire tree in memory. Even assuming it can keep the table index and bloom filters in memory, it would need to fetch the entire blocks from disk, decompress them, then do key-value retrieval (Badger’s smaller LSM tree avoids the need for compression). This obviously takes longer, and given lack of data access locality, caching isn’t as effective.

Range iteration latency: Badger’s range iteration is significantly slower than RocksDB’s range iteration, when values are also retrieved from SSD. We didn’t expect this, and still don’t quite understand it. We expected some slowdown due to the need to do IOPS on SSD, while RocksDB does purely serial reads. But, given the 100K IOPS i3.large instance is capable of, we didn’t even come close to using that bandwidth, despite pre-fetching. This needs further work and investigation.

On the other end of the spectrum, Badger’s key-only iteration is blazingly faster than RocksDB or key-value iteration (latency is shown by the almost invisible red bar). This is quite useful in certain use cases we have at Dgraph, where we iterate over the keys, run filters and only retrieve values for a much smaller subset of keys.

Further work

Speed of range iteration

While Badger can do key-only iteration blazingly fast, things slow down when it also needs to do value lookups. Theoretically, this shouldn’t be the case. Amazon’s i3.large disk optimized instance can do 100,000 4KB block random reads per second. Based on this, we should be able to iterate 100K key-value pairs per second, in other terms six million key-value pairs per minute.

However, Badger’s current implementation doesn’t produce SSD random read requests even close to this limit, and the key-value iteration suffers as a result. There’s a lot of room for optimization in this space.

Speed of compactions

Badger is currently slower when it comes to running compactions compared to RocksDB. Due to this, for a dataset purely containing smaller values, it is slower to load data to Badger. This needs more optimization.

LSM tree compression

Again in a dataset purely containing smaller values, the size of LSM tree would be significantly larger than RocksDB because Badger doesn’t run compression on LSM tree. This should be easy to add on if needed, and would make a great first-time contributor project.

B+ tree approach

1 Recent improvements to SSDs might make B+-trees a viable option. Since WiscKey paper was written, SSDs have made huge gains in random write performance. A new interesting direction would be to combine the value log approach, and keep only keys and value pointers in the B+-tree. This would trade LSM tree read-sort-merge sequential write compactions with many random writes per key update and might achieve the same write throughput as LSM for a much simpler design.

Conclusion

We have built an efficient key-value store, which can compete in performance against top of the line key-value stores in market. It is currently rough around the edges, but provides a solid platform for any industrial application, be it data storage or building another database.

We will be replacing Dgraph’s dependency on RocksDB soon with Badger; making our builds easier, faster, making Dgraph cross-platform and paving the way for embeddable Dgraph. The biggest win of using Badger is a performant Go native key-value store. The nice side-effects are ~4 times faster Get and a potential 86% reduction in AWS bills, due to less reliance on RAM and more reliance on ever faster and cheaper SSDs.

So try out Badger in your project, and let us know your experience.

Top image: Juno spacecraft is the fastest moving human made object, traveling at a speed of 265,00 kmph relative to Earth.


This is a companion discussion topic for the original entry at https://open.dgraph.io/post/badger/

(Wind85) #2

Very interesting, I haven’t looked at the code, I just read the article, the results are impressive, congrats! I have a simple question, how does it perform with frequent delete/updates operations, does it degrade?


(Ahmad Baitalmal) #3

Very impressive and a good undertaking indeed. How distributable is Badger at the moment? Would would a DHT implementation like Riak fit with Badger’s architecture now?


(Allen Belletti) #4

Hi Manish, this is great-looking work. You clearly weren’t afraid to roll your own when needed!

I was thinking about your surprisingly low performance when interating through K-V pairs. Is it possible that you’re not generating enough concurrency to achieve 100K random read IOPS? For example it’s rare to see more than ~10K at queue depth 1.

If that’s the case, bet you could deliver some crazy performance on Optane, which loves QD 1.


(Manish R Jain) #5

No. LSM trees don’t really degrade with more updates to the same key. Even the underlying SSD shouldn’t degrade with LSM trees, because of sequential write patterns.

Comparatively, in B±trees, for a long-running job doing many random block rewrites, SSD GC will kick in later, and the random write performance and endurance of SSD would suffer.

Badger is embeddable. So, it won’t distribute on its own. You’d need to write something on top of Badger to build a distributed key-value store.

I’ve tried with a queue size of 100K, and using a Goroutine per key to retrieve value. But, given that the client would iterate over the values in order, more concurrency doesn’t really help (Goroutines are not FIFO, they are unordered). In fact, anything over like 10 Goroutines would give the same throughput.

Channel + 10 goroutines approach is something worth trying. So, is looking carefully into SSD read latency to determine its effect on generating enough read requests, given an ordered queue of fixed size.

Octane looks promising but currently expensive. Might have to wait a bit before it hits AWS.


(Wind85) #7

Thanks for answering so promptly, I am sorry about it, I didn’t explain what I meant in detail, I didn’t mean frequent updates on the same key, what I meant was since it’s gc’ed I supposed it uses something like thumb-stoning and doesn’t actually delete the key therefore the delete operation is an update not actual delete. So in this case does the gc kicking in degrade performance?


(Agniva De Sarker) #8

Hi,

Just a small question -

If we don’t set the syncWrite option, and if the machine loses power, then the data which is not synced to disk will be lost right ?

How much does the write performance degrade when you set the syncWrite option ? Just want to know how much is the tradeoff here.


(Hyc) #9

You’re wrong by a huge margin as to “RocksDB is the most efficient key-value store”. I could’ve saved you a lot of trouble.

http://lmdb.tech/bench/microbench/
http://lmdb.tech/bench/hyperdex/
http://lmdb.tech/bench/inmem/
http://lmdb.tech/bench/ondisk/

RocksDB wastes RAM and CPU cycles in local cache/buffer management, as well as suffering from the write amplification and compaction delays you discussed. But your approach still suffers from compaction delays. LMDB has O(logN) write amplifcation and no write delays, as well as no wasted RAM or CPU cycles. Its performance is completely deterministic. It is demonstrably the most efficient key-value store in existence.

The LSM approach also suffers from extravagant use of file descriptors; in the RocksDB case tens of thousands of file descriptors open simultaneously for larger data sets. This makes it completely unsuitable for use in a lot of network-facing applications, where the system is already competing for resources for descriptors for sockets.

Docs, presentations, more benchmarks… https://symas.com/lightning-memory-mapped-database/


(Ahmad Baitalmal) #10

I see, this is more like ETS in Erlang which is something that’s been missing in Go. Awesome work, thx.


(Manish R Jain) #11

There’s no GC going on for tombstoned keys. They would remain within the store.

That’s right.

It’s quite significant. But, you can amortize the cost by doing batched writes. That’s what we do in Dgraph. We easily push 10,000 key-values in one write. That way, only one synced file write is required to push all those updates to value log.

I’ll read up. Any pointers to detailed design doc?


(Wind85) #12

As always thanks for answering, therefore if I put in 10.000 k/v with value size 1Mb and then I delete half of them, the space occupied by the deleted values will not be reclaimed; If I have understood correctly is this a deliberate design decision?


(Hyc) #13

Scroll down to Project Presentations https://symas.com/lightning-memory-mapped-database/ and get the 2011 LDAPCon paper and slides. Or any of the later ones like the SDC presentation, or the talk at CMU.


(Gero) #14

Hi, are you aware or know any project that uses Badger in production? Thanks!


(Manish R Jain) #15

It just got released. So, no. But, Dgraph v0.8 release and onwards would be based on Badger, not RocksDB.


(Manish R Jain) #16

The values would be stored in the value log. The value garbage collector would realize that a large chunk of the values are no longer valid, so it would rewrite the value log discarding invalid entries. So, you’ll get your space back.

The LSM tree would only store the keys and the value pointers. When you delete these keys, they’d still be within the LSM tree, but would have a tombstone marker. Having said that, keys don’t take much space, they tend to be smaller, and we do delta encoding.


(Wind85) #17

Thanks for answering, yes I know about the value log, I have read the article :slight_smile: thanks for specifying it anyway. So the value garbage collection invalidates the keys kept in the LSM tree, because values are going to be written in a new value log, therefore the tree needs to be updated as well? Am I right or I am missing something?


(Manish R Jain) #18

Yes. That’s right. The value pointers of valid keys would be updated in the lsm tree.


(Wind85) #19

Have you ever considered adopting the following strategy, for example keeping a separate file with a pointer to the deleted value and the value size, let’s call it a “recycle bin”, that would introduce some minimal fragmentation but it would avoid another LSM tree rebuilding or the creation of a new value log altogether, basically on deletion no operation would be performed on the actual value just thumb-stoning the key in the LSM and add a pointer and value size of the deleted value on the recycle-bin? Just wanted to know your thoughts on this kind of recycle strategy.


(Wind85) #20

Anyway that’s sort of what Neo4j uses a reclamation strategy, it’s also very interesting how they use native graph storage instead of a key value store, it leads to greater performance. Have you have beckmark dgraph against Neo4j?


(Manish R Jain) #21

Having another separate file is another layer of complexity that you need to build, ensure persistence for, crash resilience etc. LSM tree + value log is sufficiently qualified to do this job, and in fact, do it pretty close to what you’re suggesting there. We store the deletion record in value log, and then set a tombstone in LSM tree.

Yes, we did. And Dgraph is way faster. https://open.dgraph.io/post/benchmark-neo4j/