Some time ago, I started reading a bit more about RocksDB. Let me talk about some topics that I find interesting.
Prefix databases and using a different memtable
By default, RocksDB keeps everything sorted using a skip list. However, some applications don’t really need this sorting, and in general, we expect an unordered map to be faster than an ordered map. RocksDB supports not-so-sorted databases. The idea is that you will specify a prefix_extractor which returns a prefix given any key. With these key prefixes, you can perform some “interesting optimizations”. Quote:
- Define prefix bloom filters, which can reduce read amplification of prefix range queries (e.g., give me all keys that start with prefix XXX).
- Use hash-map-based memtables to avoid binary search costs in memtables.
- Add hash index to table files to avoid binary search costs in table files.
Earlier on, I tried using RocksDB to store triplets instead of posting lists. Using hash-map-based memtables offered a vast speedup when iterating over keys. I guess it avoids a lot of traversal up and down one large skip list.
There are three different memtables: SkipList (the default), HashSkipList and HashLinkedList. The second seems to make the most sense for prefix databases, and performs the best for me. It’s basically a hash map from key prefix to a skip list for that prefix.
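As a concrete sketch, here is roughly how the prefix extractor and a hash-based memtable are wired up in C++ (the 4-byte fixed prefix and the DB path are illustrative, not recommendations):

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/memtablerep.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Tell RocksDB how to derive a prefix from a full key. Here we use a
  // fixed 4-byte prefix; the length is illustrative.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(4));

  // Swap the default skip-list memtable for a hash of skip lists,
  // bucketed by the prefix above.
  options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());

  // Hash-based memtables do not support concurrent memtable writes,
  // which newer RocksDB versions enable by default.
  options.allow_concurrent_memtable_write = false;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/prefix_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```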
For our use case, we could use “predicate|” as the prefix.
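A variable-length prefix like “predicate|” needs a custom SliceTransform rather than the fixed-length one. A minimal sketch (the class name is mine, and this is not Dgraph code):

```cpp
#include <cassert>
#include <cstring>

#include "rocksdb/slice.h"
#include "rocksdb/slice_transform.h"

// Treats everything up to and including the first '|' as the key prefix,
// so keys of the form "predicate|rest" share the bucket "predicate|".
class PredicatePrefixTransform : public rocksdb::SliceTransform {
 public:
  const char* Name() const override { return "PredicatePrefixTransform"; }

  rocksdb::Slice Transform(const rocksdb::Slice& src) const override {
    const void* p = memchr(src.data(), '|', src.size());
    assert(p != nullptr);  // Transform is only called when InDomain() holds.
    size_t len = static_cast<const char*>(p) - src.data() + 1;
    return rocksdb::Slice(src.data(), len);
  }

  bool InDomain(const rocksdb::Slice& src) const override {
    return memchr(src.data(), '|', src.size()) != nullptr;
  }
};

// Usage: options.prefix_extractor.reset(new PredicatePrefixTransform());
```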
I believe RocksDB was conceived mainly as a flash or disk store, not a memory store. What difference does that make, you may ask? Well, since the data lives on a medium much slower than memory, RocksDB uses an in-memory cache, a.k.a. the block cache. This is not needed when your main store is already in memory. That’s the basic idea of the PlainTable format: it does away with the cache and also with the block-based format.
PlainTable supports prefix-based databases only. It also does no compression and does not support delta encoding, which means you want shorter keys, e.g., by fingerprinting predicates.
I feel that if we want to sell Dgraph as the graph DB with the lowest possible latency (great for some applications, say finance), then we might want to support the PlainTable format and do some special tuning for in-memory stores. Otherwise, we can just ignore it for now.
Read, space/size, write amplifications
In the tuning guide, there is a good example. Say L0 and L1 each hold 500M, the level multiplier is 10, and the database size is 500G. Then the total space needed is 500M + 500M + 5G + 50G + 500G = 556G, where 1K = 1000 bytes for simplicity. Space amplification is 556G/500G ~ 1.1, i.e., we use about 10% more space than the data itself. Space amplification is easy to understand.
Write amplification is about how much you write compared to how much you actually want to store. L2 is 10X bigger than L1, so when you compact one byte from L1 into L2, you need to merge it with about 10 bytes from L2; the write amplification of that step is about 10. For one byte to eventually reach the last level, we write it out 1 + 2 + 10 + 10 + 10 = 33 times: once for the flush into L0, about twice for the L0-to-L1 compaction (the two levels are the same size), and about 10 times for each of the three remaining level transitions. That’s pretty large!
You might ask: why not just reduce the level multiplier to reduce both space and write amplification? Because the number of levels would increase, worsening read amplification: during a read, you might need to check every level.
Universal compaction (whatever that means…) reduces write amplification, but it doesn’t seem as polished and can cause memory spikes (as much as 2X memory). There does seem to be some discussion on how to improve this, though.
Using bigger bloom filters worsens space and write amplification but improves read amplification, obviously.
Using larger block sizes means our index can be smaller, so it improves space amplification but worsens read amplification as we might load more unused data (in the same block) into our in-memory cache.
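Both knobs above live in BlockBasedTableOptions. A sketch (the specific values of 10 bits per key and 16KB blocks are illustrative, not recommendations):

```cpp
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options BlockTuningSketch() {
  rocksdb::BlockBasedTableOptions table_options;
  // More bits per key -> bigger filter (worse space/write amplification),
  // but fewer false positives (better read amplification).
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  // Bigger blocks -> smaller index, but more unused neighboring data
  // pulled into the block cache on each read. Default is 4KB.
  table_options.block_size = 16 * 1024;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```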
There is a parameter, max_size_amplification_percent, which limits space amplification. Tightening this limit (lowering the percentage) causes more compaction (compacting everything down to the last level reduces space amplification) but worsens write amplification, since we do more compaction work.
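This knob belongs to universal compaction. A sketch of where it is set (the value 50 is illustrative; the default is 200):

```cpp
#include "rocksdb/options.h"

rocksdb::Options UniversalCompactionSketch() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  // If estimated space amplification exceeds this percentage, RocksDB
  // compacts everything down to a single sorted run.
  options.compaction_options_universal.max_size_amplification_percent = 50;
  return options;
}
```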