Introducing Badger: A fast key-value store written natively in Go - Dgraph Blog

(Wind85) #22

Awesome software! Thanks for answering…

(Sashas Lb) #23

Awesone software, small question though regards write amplification.
In your blog and also in the paper write amplification is only been calculated on LSM, and since LSM is quite small write amplification is also small. However amplification on value log is not considered at all, for example is half of value log is still valid when it is re-written write amplification on value is 1 (when value is originally written) + 1/2 (half of values are still valid when value log is re-written) + 1/4 (same second time) + 1/8 + 1/16 ~= 2. Can you please provide some insight on this, or correct me if i am wrong.

(Aaron Hopkins) #24

I suspect that by copying values around on disk into key order, RocksDB’s range iteration benefits from Linux’s buffer cache and readahead mechanisms. When reading from disk, Linux always asks for the full 4k block and stores the result in the buffer cache. And when files appear to be accessed roughly sequentially, Linux will read more blocks than you asked for. The result for values accessed in roughly sequential order is that the next accesses can be answered directly out of the cache without waiting on storage; reading from RAM is still substantially lower latency than reading from fast SSD.

Badger will be at a disadvantage for values much smaller than 4k, since Linux will be fetching the 4k block anyway, handing Badger the chunk of that it asked for, and wasting the rest unless Badger gets around to asking for it soon after. Try changing Badger’s Iterate to return in log order, rather than key order and benchmark again. At least for iterating over the whole dataset, it’ll substantially increase the odds of a buffer cache hit.

For ranges substantially smaller than the whole dataset, this likely won’t help, though. The only way to get cache locality there is to try to arrange for the values to be in key order more often; perhaps log compaction is a good time to reorder the values by key? Even if chunks of the log were in key order, it would likely help.

(Manish R Jain) #25

That’s true, but not the whole story. Assuming 16B keys, and 1 KB value, with 10x amplification factor in LSM tree, the calculation is (10 * 16 + 1024) / (16 + 1024) ~ 1.14. This includes the write of the value to value log.

The thing about value log rewrite is that you can control it. It’s not inherent part of the design. Even assuming that you’re rewriting once the validity reaches 50%, which gives you an amplification of 2 for values, that’s still only ~ 2.12 write amplification factor.

Now compare this with a typical LSM tree. To go one level down, that’s 10x write amplification. Both keys and values need to be rewritten, period. This is still close to 5x less writes than typical LSM tree.

Sure that could be done, and we might actually do that. But, in general cases, you need the keys to be in sorted order, and not random order.

However, irrespective of the “wastage” that occurs during a read lower than 4KB, an SSD can easily do 100K reads per second for a 4KB block. At that rate, the iteration of Badger is as good as the linear iteration done by typical LSM tree like RocksDB. So, that’s not a disadvantage, it’s the design of Badger. We’re trading more random reads for much lower read/write amplification, and smaller LSM tree sizes.

During log compaction / rewrite, we do order the logs by keys.

(Aaron Hopkins) #26

In the 128 byte value case, RocksDB is effectively fetching something like 32 values per (hundreds of microseconds) round-trip to the SSD and then using the much lower latency (tens of microseconds) buffer cache reads to get the remainder of the values out of the cache. This gives it a much lower average read latency during Iterate. And single-threaded disk throughput (which I think is what both RocksDB and Badger are doing with Iterate) is just the inverse of the average read latency.

You’re right that SSDs can do > 100K requests/sec, but only when driven by tens of concurrent requests. If you want to take advantage of the full throughput available an SSD during an Iterate, you’ll need to concurrently prefetch, perhaps with a channel asking a worker pool to fetch the next n values which get passed back to Iterate to walk through in-order.

(Manish R Jain) #27

That’s being done in Badger. When you iterate, it prefetches values via goroutines.

(nichoc) #28

Hi, great-looking work firstly. I’m interested in Dgraph and Badger, and I have tried some experiments on dgraph v0.8.0 recently. There is a problem, after I loaded about 200 million rdfs into dgraph (raw data about 18 GB), size of dir p grows to 2.7 TB (vlog files occupied most of the space), looks extraordinarily large
comparing with v0.7.7 , is that normal or I missed something?

(Manish R Jain) #29

Is this 18GB uncompressed raw data, or compressed?

We have tried loading up 500M RDFs (including indexes and reverses), and that consumed 800GB on disk. So, if your 200M (original) RDFs are consuming 2.7TB, that must mean the actual mutations (with indices) must be a big factor more. You can see how many mutations are you doing with /debug/vars. You can sum up predicateStats to get that.

vlog can grow fast, but we also have a garbage collector which would rewrite the value log files, once half of their entries are discarded. So, in a long running Dgraph instance, the value log will reduce significantly.

There’s more work going on in v0.8.1, where we compress our posting lists – that should also alleviate this size growth rate.

(nichoc) #30

Thanks for your answering. 18GB dataset is uncompressed. I checked again that the number of RDFs processed is 300M, I cannot find the /debug/vars (just manually download dgraph), the endline of my dgraph log is below:
Writing snapshot to WAL: {Data:[9 1 0 0 0 0 0 0 0 16 1 26 15 108 111 99 97 108 104 111 115 116 58 49 50 51 52 54] Metadata:{ConfState:{Nodes:[1] XXX_unrecognized:} Index:323478 Term:3 XXX_unrecognized:} XXX_unrecognized:}
does that index metric explain something?
my schema listed below:
field_1: string @index(exact, term, fulltext) .
field_2: int @index(int) .
field_3: uid @count .
field_4: uid @count .

(Manish R Jain) #31

/debug/vars is on http port. If you ran the default, it would be on localhost:8080/debug/vars. You can pick these up via Prometheus and render them in Grafana.

So, your field_1 would add a LOT of mutations. You’re running with both term and full-text, which are both expensive, generating multiple mutations (as many as the number of terms) per text mutation. And if you have long strings, then exact would be expensive as well (see if hash index is more applicable to you).

If you need more help, feel free to create a Github issue and we can help you debug this further.

(nichoc) #32

I see, many thanks! I’ll try it.

(Deepak Jois) #33

Hi Howard

I am Deepak from Dgraph, and I am currently writing some code to benchmark lmdb (via lmdb-go bindings) and Badger. I have written small functions to benchmark some simple operations.

Do you think you could take a look at snippets of Go code below and let me know if I am doing something egregiously wrong when I am using the API exposed by LMDB? Thank you in advance!