Dgraph’s next release v21.12, codenamed Zion, has been a very special release for many reasons — some of which might not be obvious right now.
Part of what makes this release special is the introduction of Roaring Bitmaps. Improving Dgraph’s posting list structures had been on our TODO list for a long time. For context, Dgraph assigns a unique 64-bit unsigned integer to every node in the graph. These node IDs are kept grouped in a sorted list, known as a posting list, a term typically used by search engines.
Since inception, Dgraph has used group varint encoding for storage and plain Go lists of integers (via protocol buffers) for transport. Converting from a compressed structure to an uncompressed one isn’t efficient. Moreover, uncompressed plain integer lists consume a lot of space (8 bytes per integer) and are expensive to keep in RAM or ship over the network.
In group varint encoding, the sorted list is divided up into fixed-size blocks of, say, 128 integers. The first integer in a block is stored whole, and the rest are stored as diffs. Because the diffs are smaller integers, varint encoding reduces their space consumption.
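To make the idea concrete, here is a minimal sketch of delta-plus-varint encoding for a sorted block of UIDs. It is illustrative only and not Dgraph’s actual encoder; the helper name encodeBlock is ours.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBlock delta-encodes one sorted block of UIDs: the first UID is written
// whole, the rest as varint-encoded differences from their predecessor.
// Illustrative sketch only, not Dgraph's production encoder.
func encodeBlock(uids []uint64) []byte {
	var out []byte
	var buf [binary.MaxVarintLen64]byte
	prev := uint64(0)
	for i, uid := range uids {
		delta := uid
		if i > 0 {
			delta = uid - prev
		}
		n := binary.PutUvarint(buf[:], delta)
		out = append(out, buf[:n]...)
		prev = uid
	}
	return out
}

func main() {
	uids := []uint64{1000000, 1000003, 1000010, 1000042}
	enc := encodeBlock(uids)
	// 4 UIDs shrink from 32 bytes (8 bytes each, uncompressed) to a handful of
	// bytes, because the deltas are tiny.
	fmt.Printf("%d UIDs -> %d bytes\n", len(uids), len(enc))
}
```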
Group varint is a popular encoding mechanism, still in production at various big companies. But Roaring Bitmaps have been making waves. They are used by Google’s Procella, Apache Lucene (and hence Elasticsearch), Netflix Atlas and many more. We at Dgraph have also been impressed by their simplicity and efficiency. As mentioned in the Dgraph paper, switching to roaring bitmaps has been on the cards for a while, but posting list structures are integrated so deeply into Dgraph’s storage, communication and query processing that previous attempts hit various roadblocks despite months of work.
This time around, we were able to push through the various challenges and achieve our goal of integrating Roaring Bitmaps (henceforth, RB). At the same time, we realized that the official RB Go library had a few shortcomings around memory usage, which significantly hampered our integration. So, we went down the path of rewriting the Go library to be more memory efficient.
In this blog post, I explain our motivation for the rewrite, the techniques we used and the impressive results we were able to achieve with this new library called sroar: Serialized Roaring Bitmaps.
Requirements and Motivation
The main requirement for roaring bitmaps was that the on-disk representation and the in-memory representation should be the same. This way, a posting list can be operated upon or shipped over the network just by reading it from Badger and doing a memory copy. Moreover, the set, remove, clone, union and intersection operations then wouldn’t require an expensive encoding / decoding step.
That wasn’t the case with the Go implementation of roaring bitmaps, RoaringBitmaps/Roaring. The on-disk format and the in-memory format were different. The on-disk format couldn’t be operated on directly and had to be converted to the in-memory format, which required an expensive decoding step.
Moreover, the in-memory format allocated memory on heap for every container in the bitmap, which led to a much higher memory consumption and lots of allocations, hence increasing the work for Go garbage collector.
As detailed in our blog post on manual memory management in Go, we replaced Go allocations with jemalloc in all critical paths to remove the consistent OOM errors we were getting, because the Go GC was unable to keep up with the pace of memory allocations and deallocations. We don’t expect this to be the case for typical Go systems, but for a high performance database system this became a very big issue for us. So, we replaced all smaller allocations on heap with bigger byte slice allocations via jemalloc.
Applying this learning to RBs meant avoiding a separate memory allocation for every container. Instead, the idea was to lay the entire RB out onto one bigger byte slice, converting memory allocations from O(N) to O(1), where N = number of containers in the RB.
We looked at how we could modify the available Go implementation to achieve this, but didn’t see a good, simple way to do it. This layout would impact everything from how containers are created, accessed, grown, shrunk and removed. So, we decided it was best to write an implementation of roaring bitmaps from scratch. The goal for this rewrite was to ensure that the in-memory representation and the on-disk representation are one and the same, with all data lying on a single byte slice, avoiding any encoding / decoding steps.
Per GitHub, the existing library is used by many other Go projects like InfluxDB, SourceGraph, Bleve and others. Seeing its importance not just for us at Dgraph, but across the wider Go community, we saw major room for improvement and felt the effort was worth investing in.
Go Roaring Bitmaps lead to memory fragmentation
Let’s look at how Roaring Bitmaps work. They use two kinds of containers: arrays and bitmaps. To give a quick primer on roaring bitmap containers, each container stores all the integers sharing a common 48-bit base key (mask of 0xFFFFFFFFFFFF0000), with the lower 16 bits represented as uint16s (mask of 0xFFFF).
An array container just stores a list of uint16s. Each integer takes 2 bytes (16 bits). Array containers are of variable length and can start as small as a single uint16. When an array container grows to 4096 uint16s, its size hits 8 kB. At that point, it makes sense to convert it to a bitmap container instead. A bitmap container uses 1 bit to represent each uint16 integer. Storing all 2^16 possible integers thus takes a maximum space of 2^16 bits / 8 bits per byte = 8 kB.
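As a rough sketch of that split, the masks below mirror the ones above; the helper name split is ours, not part of any library:

```go
package main

import "fmt"

const (
	baseMask = uint64(0xFFFFFFFFFFFF0000) // 48-bit base key shared by one container
	lowMask  = uint64(0x000000000000FFFF) // low 16 bits stored inside the container
)

// split breaks a 64-bit UID into the container's base key and the uint16
// that actually gets stored in that container.
func split(x uint64) (baseKey uint64, low uint16) {
	return x & baseMask, uint16(x & lowMask)
}

func main() {
	base, low := split(0x0000123456789ABC)
	fmt.Printf("base key: %#x, low 16 bits: %#x\n", base, low)
	// All UIDs sharing the same base key land in the same container: an array
	// of uint16s while small, and an 8 kB bitmap once it reaches 4096 entries.
}
```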
In the original implementation of roaring bitmaps, each container allocates its own memory on the heap. Thus, each RB does O(N) allocations on the heap, where N = number of containers. As an RB grows, its array containers grow too, leading to more memory re-allocations and more work for the Go GC. Not only that: cloning, encoding or decoding these containers to move between in-memory, on-wire and on-disk representations is an expensive step that consumes CPU cycles.
Introducing Sroar — Serialized Roaring Bitmaps
So we built sroar: Serialized Roaring Bitmaps. Sroar operates on 64-bit integers and uses a single byte slice to store all the keys and containers. This byte slice can then be stored on disk, operated on directly in memory, or transmitted over the wire. There’s no encoding / decoding step required. For all practical purposes, sroar can be treated just like a byte slice.
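For a feel of what skipping the encoding / decoding step looks like in code, here is a small usage sketch. The calls (NewBitmap, Set, ToBuffer, FromBuffer, Contains, GetCardinality) follow the dgraph-io/sroar public API as we understand it; check the repository for the authoritative signatures.

```go
package main

import (
	"fmt"

	"github.com/dgraph-io/sroar"
)

func main() {
	bm := sroar.NewBitmap()
	for _, uid := range []uint64{1, 42, 1 << 20, 1 << 40} {
		bm.Set(uid)
	}

	// ToBuffer hands back the underlying byte slice: this is both the
	// in-memory and the on-disk / on-wire representation.
	buf := bm.ToBuffer()

	// FromBuffer treats those same bytes as a bitmap again. No decoding
	// pass is needed before querying it.
	again := sroar.FromBuffer(buf)
	fmt.Println(again.Contains(42), again.GetCardinality())
}
```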
Sroar consolidates memory usage to one allocation
With sroar, we changed how an RB is generated and laid out. Sroar allocates one long byte array (actually a uint16 slice in Go). It uses the first portion of the array to store the base keys and the offsets of their corresponding containers. The base keys are all kept sorted for faster search, but the containers can be located anywhere in the array.
Each container starts with a header which stores its size, the type of the container (array or bitmap) and its cardinality. Size and type take 1 uint16 each. The maximum cardinality of a container is 2^16, which doesn’t fit in a single uint16, so the cardinality takes 2 uint16s. Thus the header is 4 uint16s (or 8 bytes) long.
The logic for storing uint16s inside these containers is standard roaring bitmap fare: the array container keeps its uint16s sorted, while the bitmap container uses bits to mark whether each uint16 is present or absent.
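A toy reading of such a header over a uint16 slice might look like the sketch below. The field order and index constants are illustrative assumptions; the real offsets live in sroar’s source.

```go
package main

import "fmt"

// Illustrative header layout, mirroring the description above: size and type
// take one uint16 each, and cardinality needs two uint16s because it can
// reach 2^16. The exact field order in sroar may differ.
const (
	idxSize        = 0 // total size of the container, in uint16s
	idxType        = 1 // array or bitmap container
	idxCardinality = 2 // spans indices 2 and 3 (low and high halves)
	headerSize     = 4 // header is 4 uint16s = 8 bytes
)

func cardinality(c []uint16) uint32 {
	return uint32(c[idxCardinality]) | uint32(c[idxCardinality+1])<<16
}

func main() {
	// A hypothetical array container holding 3 values: 4 header uint16s plus payload.
	container := []uint16{7, 0 /* array */, 3, 0, 10, 42, 999}
	fmt.Println("cardinality:", cardinality(container))
	fmt.Println("payload:", container[headerSize:])
}
```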
What’s more interesting is how these containers grow and shrink. To make the size changes happen, we added two functions, scootRight and scootLeft. These functions take an offset and a bySize, and move the contents of the data slice right or left accordingly. They also clear out the internal space created at the offset using runtime.memclrNoHeapPointers. Note that the underlying data slice might have to be grown as well to make space. These moves can get expensive, so Sroar keeps track of how many bytes of movement have happened as an efficiency metric, which we used to improve our implementation.
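A simplified sketch of the scooting idea is below. The real scootRight in sroar also grows the backing buffer lazily, fixes up container offsets, and clears memory with runtime.memclrNoHeapPointers; this toy version only shows the shift-and-clear step.

```go
package main

import "fmt"

// scootRight shifts everything from offset onwards right by bySize slots and
// zeroes the freshly opened hole. Simplified sketch; not sroar's actual code.
func scootRight(data []uint16, offset, bySize int) []uint16 {
	data = append(data, make([]uint16, bySize)...) // grow the tail
	copy(data[offset+bySize:], data[offset:len(data)-bySize])
	for i := offset; i < offset+bySize; i++ { // clear the hole
		data[i] = 0
	}
	return data
}

func main() {
	data := []uint16{1, 2, 3, 4, 5}
	fmt.Println(scootRight(data, 2, 3)) // [1 2 0 0 0 3 4 5]
}
```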
Fewer scoots means faster Sroar
Sroar has been optimized for the various intersection and union operations expected from RBs. A lot of this optimization hinges on avoiding scoots by laying out the keys and containers upfront. Let’s look at FastOr, which is a union of an arbitrary number of Sroars. To make this operation efficient, all the base keys from all source Sroars are first accumulated along with their cardinalities. The cardinalities per container are then summed up, assuming the worst case of no integer overlap.
If the summed cardinality is below 4096 integers, an array container is created upfront; for anything above, a bitmap container is created upfront. This way, the output Sroar gets its memory allocated upfront, avoiding the need to scoot memory right later. Then, this pre-allocated Sroar is ORed with each of the source Sroars in a loop. To optimize this further, the cardinalities are not calculated during the union; instead, they are calculated once all the unions are done. This improves performance significantly.
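From the caller's side, the whole two-pass union is a single call. The sketch below assumes the FastOr, NewBitmap, Set and GetCardinality functions of the dgraph-io/sroar package; the comments restate the behaviour described above.

```go
package main

import (
	"fmt"

	"github.com/dgraph-io/sroar"
)

func main() {
	// Build a few source bitmaps. In Dgraph, these would be posting lists
	// read straight out of Badger with no decoding step.
	var srcs []*sroar.Bitmap
	for i := uint64(0); i < 3; i++ {
		bm := sroar.NewBitmap()
		for j := uint64(0); j < 1000; j++ {
			bm.Set(i*500 + j) // deliberately overlapping ranges
		}
		srcs = append(srcs, bm)
	}

	// FastOr sizes the output containers upfront from the summed
	// cardinalities, ORs each source in, and computes the final
	// cardinality only once at the end.
	out := sroar.FastOr(srcs...)
	fmt.Println("union cardinality:", out.GetCardinality())
}
```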
Intersections tend to be much cheaper than unions. For intersections, Sroar iterates over the Bitmaps container-by-container, intersects them, and if the resulting container has any elements remaining, adds the container to the destination Bitmap. This is a relatively straightforward approach, and has worked really well in our tests so far.
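A toy model of that container-by-container walk is sketched below, with each bitmap modelled as a map from base key to the set of low 16-bit values. This only illustrates the control flow; sroar performs the same walk directly over its single uint16 slice.

```go
package main

import "fmt"

// toyBitmap models a roaring bitmap as base key -> set of low 16-bit values.
type toyBitmap map[uint64]map[uint16]bool

// intersect walks a's containers, intersects each with b's matching container,
// and keeps only the non-empty results, mirroring the approach described above.
func intersect(a, b toyBitmap) toyBitmap {
	out := toyBitmap{}
	for key, ca := range a {
		cb, ok := b[key]
		if !ok {
			continue // b has no container for this base key
		}
		dst := map[uint16]bool{}
		for v := range ca {
			if cb[v] {
				dst[v] = true
			}
		}
		if len(dst) > 0 {
			out[key] = dst
		}
	}
	return out
}

func main() {
	a := toyBitmap{0x10000: {1: true, 2: true}, 0x20000: {7: true}}
	b := toyBitmap{0x10000: {2: true, 3: true}}
	fmt.Println(intersect(a, b)) // map[65536:map[2:true]]
}
```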
Sroar is 7x faster using 15x less memory
To benchmark Sroar, we ran benchmarks:
- Using the real data set as described in RoaringBitmaps.
- On the 64-bit version of roaring bitmaps (roaring64).
- On FastOr, which is a more expensive operation than And or its equivalent (BenchmarkRealDataFastOr).
- On an AMD Ryzen Threadripper 2950X 16-Core Processor.
- Using Go benchmarks, run serially.
Based on the benchmarks, Sroar consumes 15x less memory and does 25x fewer memory allocations. And it achieves this, while also being 7x faster! See results here.
These results were impressive on their own. But, we saw even better performance results when integrated with Dgraph.
We ran Dgraph v21.12 (Zion), which includes sroar (with the posting list cache turned off), against the heaviest Dgraph Cloud user workload. We sampled the queries which used to take >100ms on v21.03 and replayed them with 32 concurrent requests. Compared against v21.03, we saw massive improvements.
Over 40% of the queries were at least 20x faster, while over 50% of the queries were at least 10x faster. This showed that the Dgraph Zion release performs exceptionally well under heavier workloads.
Another benchmark, for lighter workloads with only a single pending request, showed that over 50% of the queries were at least 2x faster.
Moreover, Dgraph with Sroar uses 10x less memory. In heavy load scenarios, where Dgraph was previously using over 50 GBs of memory, Dgraph with Sroar used under 5 GB.
So, not only is the Zion release 10x faster, it’s also 10x more memory efficient than Rocket (v21.03). These are incredibly delightful results, all thanks to Sroar!
Sroar is open source. Try it out
Serialized Roaring Bitmaps is a ground-up rewrite of Go Roaring Bitmaps that delivers faster operations, lower memory usage and fewer memory allocations than the existing Go implementation. Furthermore, Sroar shows even better performance and memory benefits when used within Dgraph. We think any user of Sroar should see improvements compared to the official library.
Sroar is available under Apache v2.0 license. Do try it out in your project and let us know your experiences! Sroar is being used in the upcoming Dgraph release v21.12 codenamed Zion.
Special thanks to Ahsan Barkati for his amazing work on Sroar.