Introducing Ristretto: A High-Performance Go Cache - Dgraph Blog


(Manish R Jain) #1

With over six months of research and development, we’re proud to announce the initial release of Ristretto: A High Performance, Concurrent, Memory-Bound Go cache. It is contention-proof, scales well and provides consistently high hit-ratios.

Preface

It all started with needing a memory-bound, concurrent Go cache in Dgraph. We looked around for a solution, but we couldn’t find a great one. We then tried using a sharded map, with shard eviction to release memory, which caused us memory issues. We then repurposed Groupcache’s LRU, using mutex locks for thread safety. After having it around for a year, we noticed that the cache suffered from severe contention. A commit to remove that cache caused our query latency to dramatically improve by 5-10x. In essence, our cache was slowing us down!

We concluded that the concurrent cache story in Go is broken and must be fixed. In March, we wrote about the State of Caching in Go, mentioning the problem of databases and systems requiring a smart memory-bound cache which can scale to the multi-threaded environment Go programs find themselves in. In particular, we set these as the requirements for the cache:

  1. Concurrent
  2. High cache-hit ratio
  3. Memory-bounded (limit to configurable max memory usage)
  4. Scale well as the number of cores and goroutines increases
  5. Scale well under non-random key access distribution (e.g. Zipf).

After publishing the blog post, we built a team to address the challenges mentioned therein and create a Go cache library worthy of being compared to non-Go cache implementations. In particular, Caffeine which is a high performance, near-optimal caching library based on Java 8. It is being used by many Java-based databases, like Cassandra, HBase, and Neo4j. There’s an article about the design of Caffeine here.

Ristretto: Better Half of Espresso

We have since read the literature, extensively tested implementations and discussed every variable there is to consider in writing a cache library. And today, we are proud to announce that it is ready for the wider Go community to use and experiment with.

Before we begin explaining the design of Ristretto, here’s a code snippet which shows how to use it:

func main() {
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1e7,     // Num keys to track frequency of (10M).
		MaxCost:     1 << 30, // Maximum cost of cache (1GB).
		BufferItems: 64,      // Number of keys per Get buffer.
	})
	if err != nil {
		panic(err)
	}
	cache.Set("key", "value", 1) // set a value
	// wait for value to pass through buffers
	time.Sleep(10 * time.Millisecond)
	value, found := cache.Get("key")
	if !found {
		panic("missing value")
	}
	fmt.Println(value)
	cache.Del("key")
}
Guiding Principles

Ristretto is built on three guiding principles:

  1. Fast Accesses
  2. High Concurrency and Contention Resistance
  3. Memory Bounding.

In this blog post, we’ll discuss these three principles and how we achieved them in Ristretto.

Fast Access

As much as we love Go and its opinionated stance on features, some of Go design decisions prevented us from squeezing out all the performance we wanted to. The most notable one is Go’s concurrency model. Due to the focus on CSP, most other forms of atomic operations have been neglected. This makes it hard to implement lock-free structures that would be useful in a cache library. For example, Go does not provide thread-local storage.

At its core, a cache is a hash map with rules about what goes in and what goes out. If the hash map doesn’t perform well, then the whole cache will suffer. As opposed to Java, Go does not have a lockless concurrent hashmap. Instead, thread safety in Go is achieved via explicitly acquiring mutex locks.

We experimented with multiple implementations (using the store interface within Ristretto) and found sync.Map performs well for read-heavy workloads but deteriorates for write workloads. Considering there’s no thread-local storage, we found the best overall performance with sharded mutex-wrapped Go maps. In particular, we chose to use 256 shards to ensure that this would perform well even with a 64-core server.

With a shard based approach, we also needed to find a quick way to calculate which shard a key should go in. This requirement and the concern about long keys consuming too much memory led us to using uint64 for keys, instead of storing the entire key. The rationale was that we’ll need the hash of the key in multiple places and doing it once at entry allowed us to reuse that hash, avoiding any more computation.

To generate a fast hash, we borrowed runtime.memhash from Go Runtime. This function uses assembly code to quickly generate a hash. Note that the hash has a randomizer that is initialized whenever the process starts, which means the same key would not generate the same hash on the next process run. But, that’s alright for a non-persistent cache. In our experiments, we found that it can hash 64-byte keys in under 10ns.

BenchmarkMemHash-32 200000000	 8.88 ns/op
BenchmarkFarm-32    100000000	 17.9 ns/op
BenchmarkSip-32      30000000	 41.1 ns/op
BenchmarkFnv-32      20000000	 70.6 ns/op

We then used this hash as not only the stored key but also to figure out the shard the key should go into. This does introduce a chance of key collision, that’s something we plan to deal with later.

Concurrency and Contention Resistance

Achieving high hit ratios requires managing metadata about what’s present in the cache and what should be present in the cache. This becomes very hard when balancing the performance and scalability of the cache across goroutines. Luckily, there’s a paper called BP-Wrapper written about a system framework making any replacement algorithms almost lock contention-free. The paper describes two ways to mitigate contention: prefetching and batching. We only use batching.

Batching works pretty much how you’d think. Rather than acquiring a mutex lock for every metadata mutation, we wait for a ring buffer to fill up before we acquire a mutex and process the mutations. As described in the paper, this lowers contention considerably with little overhead.

We apply this method for all Gets and Sets to the cache.

Gets

All Gets to the cache are, of course, immediately serviced. The hard part is to capture the Get, so we can keep track of the key access. In an LRU cache, typically a key would be placed at the head of a linked list. In our LFU based cache, we need to increment an item’s hit counter. Both operations require thread-safe access to a cache global struct. BP-Wrapper suggests using batching to process the hit counter increments, but the question is how do we implement this batching process, without acquiring yet another lock.

This might sound like a perfect use case of Go channels, and it is. Unfortunately, the throughput performance of channels prevented us from using them. Instead, we devised a nifty way to use sync.Pool to implement striped, lossy ring buffers that have great performance with little loss of data.

Any item stored in the Pool may be removed automatically at any time without notification. That introduces one level of lossy behavior. Each item in Pool is effectively a batch of keys. When the batch fills up, it gets pushed to a channel. The channel size is deliberately kept small to avoid consuming too many CPU cycles to process it. If the channel is full, the batch is dropped. This introduces a secondary level of lossy behavior. A goroutine picks up this batch from the internal channel and processes the keys, updating their hit counter.

AddToLossyBuffer(key):
  stripe := b.pool.Get().(*ringStripe)
  stripe.Push(key)
  b.pool.Put(stripe)
Once buffer fills up, push to channel:
  select {
  case p.itemsCh <- keys:
      p.stats.Add(keepGets, keys[0], uint64(len(keys)))
      return true
  default:
      p.stats.Add(dropGets, keys[0], uint64(len(keys)))
      return false
  }
p.itemCh processing:
  func (p *tinyLFU) Push(keys []uint64) {
    for _, key := range keys {
      p.Increment(key)
    }
  }

The performance benefits of using a sync.Pool over anything else (slices, striped mutexes, etc.) are mostly due to the internal usage of thread-local storage, something not available as a public API to Go users.

Sets

The requirements for Set buffer is slightly different from Get. In Gets, we buffer up the keys, only processing them once the buffer fills up. In Sets, we want to process the keys as soon as possible. So, we use a channel to capture the Sets, dropping them on the floor if the channel is full to avoid contention. A couple of background goroutines pick sets from the channel and process the Set.

This approach, as with Gets, is designed to optimize for contention resistance. But, comes with a few caveats, described below.

select {
case c.setBuf <- &item{key: hash, val: val, cost: cost}:
    return true
default:
    // drop the set and avoid blocking
    c.stats.Add(dropSets, hash, 1)
    return false
}
Caveats

Sets in Ristretto are queued into a buffer, control is returned back to the caller, and the buffer is then applied to the cache. This has two side-effects:

  1. There is no guarantee that a set would be applied. It could be dropped immediately to avoid contention or could be rejected later by the policy.
  2. Even if a Set gets applied, it might take a few milliseconds after the call has returned to the user. In database terms, it is an eventual consistency model.

If however, a key is already present in the cache, Set would update the key immediately. This is to avoid a cached key holding a stale value.

Contention Resistance

Ristretto is optimized for contention resistance. This performs really well under heavy concurrent load, as we’ll see with throughput benchmarks below. However, it would lose some metadata in exchange for better throughput performance.

Interestingly, that information loss doesn’t hurt our hit ratio performance because of the nature of key access distributions. If we do lose metadata, it is generally lost uniformly while the key access distribution remains non-uniform. Therefore, we still achieve high hit ratios and the hit ratio degradation is small as shown by the following graph.

Memory Bounding

Key Cost

An infinitely large cache is practically impossible. A cache must be bounded in size. Many cache libraries would consider cache size to be the number of elements. We found that approach naive. Surely it works in a workload where values are of identical size. Most workloads, however, have variable-sized values. One value could cost a few bytes, another a few kilobytes and yet another, a few megabytes. Treating them as having the same memory cost isn’t realistic.

In Ristretto, we attach a cost to every key-value. Users can specify what that cost is when calling Set. We count this cost against the MaxCost of the cache. When the cache is operating at capacity, a heavy item could displace many lightweight items. This mechanism is nice in that it works well for all different workloads, including the naive approach where each key-value costs 1.

Admission Policy via TinyLFU

“What should we let into the cache?”

is answered by the admission policy. The goal, obviously, is to let in new items if they are more “valuable” than the current items. However, this becomes a challenge if you consider the overhead (latency and memory) required to track relevant item information pertaining to the “value” question.

Despite being a well-documented strategy for increasing hit ratios, most Go cache libraries have no admission policy at all. In fact, many LRU eviction implementations assume the latest key as the most valuable.

Moreover, most of the Go cache libraries use pure LRU or an approximation of LRU as their eviction policy. The quality of LRU approximation notwithstanding, some workloads are just better suited to LFU eviction policies. We’ve found this to be the case in our benchmarks using various traces.

For our admission policy, we looked at a new and fascinating paper called TinyLFU: A Highly Efficient Cache Admission Policy. At a very high level, TinyLFU provides three methods:

  • Increment(key uint64)
  • Estimate(key uint64) int (referred as ɛ)
  • Reset

The paper explains it best, but TinyLFU is an eviction-agnostic admission policy designed to improve hit ratios with very little memory overhead. The main idea is to only let in a new item if its estimate is higher than that of the item being evicted. We implemented TinyLFU in Ristretto using a Count-Min Sketch. It uses 4-bit counters to approximate the frequency of access for the item (ɛ). This small cost per key allows us to keep track of a much larger sample of the global keyspace, than would be possible using a normal key to frequency map.

TinyLFU also maintains the recency of key access by a Reset function. After N key increments, the counters get halved. So, a key that has not been seen for a while would have its counter get reset to zero; paving the way for more recently seen keys.

Eviction Policy via Sampled LFU

When the cache reaches capacity, every incoming key should displace one or more keys present in the cache. Not only that, the ɛ of incoming key should be higher than the ɛ of key being evicted. To find a key with low ɛ, we used the natural randomness provided by Go map iteration to pick a sample of keys and loop over them to find a key with the lowest ɛ.

We then compare the ɛ of this key against the incoming key. If the incoming key has a higher ɛ, then this key gets evicted (eviction policy). Otherwise, the incoming key is rejected (admission policy). This mechanism is repeated until the incoming key’s cost can be fit into the cache. Thus, a single incoming key may displace more than one key. Note that the cost of the incoming key does not play a factor in choosing the eviction keys.

With this approach, the hit ratios are within 1% of the exact LFU policies for a variety of workloads. This means we get the benefits of admission policy, conservative memory usage, and lower contention in the same little package.

// Snippet from the Admission and Eviction Algorithm
incHits := p.admit.Estimate(key)
for ; room < 0; room = p.evict.roomLeft(cost) {
    sample = p.evict.fillSample(sample)
    minKey, minHits, minId := uint64(0), int64(math.MaxInt64), 0
    for i, pair := range sample {
        if hits := p.admit.Estimate(pair.key); hits < minHits {
            minKey, minHits, minId = pair.key, hits, i
        }
    }
    if incHits < minHits {
        p.stats.Add(rejectSets, key, 1)
        return victims, false
    }
    p.evict.del(minKey)
    sample[minId] = sample[len(sample)-1]
    sample = sample[:len(sample)-1]
    victims = append(victims, minKey)
}
DoorKeeper

Before we place a new key in TinyLFU, Ristretto uses a bloom filter to first check if the key has been seen before. Only if the key is already present in the bloom filter, is it inserted into the TinyLFU. This is to avoid polluting TinyLFU with a long tail of keys that are not seen more than once.

When calculating ɛ of a key, if the item is included in the bloom filter, its frequency is estimated to be the Estimate from TinyLFU plus one. During a Reset of TinyLFU, the bloom filter is also cleared out.

Metrics

While optional, it is important to understand how a cache is behaving. We wanted to ensure that tracking metrics related to cache is not only possible, the overhead of doing so is low enough to be turned on and kept on.

Beyond hits and misses, Ristretto tracks metrics like keys and their cost being added, updated and evicted, sets being dropped or rejected, and gets being dropped or kept. All these numbers help understand the cache behavior on various workloads and pave way for further optimizations.

We initially used atomic counters for these. However, the overhead was significant. We narrowed the cause down to False Sharing. Consider a multi-core system, where different atomic counters (8-bytes each) fall in the same cache line (typically 64 bytes). Any update made to one of these counters, causes the others to be marked invalid. This forces a cache reload for all other cores holding that cache, thus creating a write contention on the cache line.

To achieve scalability, we ensure that each atomic counter completely occupies a full cache line. So, every core is working on a different cache line. Ristretto uses this by allocating 256 uint64s for each metric, leaving 9 unused uint64s between each active uint64. To avoid extra computation, the key hash is reused to determine which uint64 to increment.

Add:
	valp := p.all[t]
	// Avoid false sharing by padding at least 64 bytes of space between two
	// atomic counters which would be incremented.
	idx := (hash % 25) * 10
	atomic.AddUint64(valp[idx], delta)
Read:
	valp := p.all[t]
	var total uint64
	for i := range valp {
		total += atomic.LoadUint64(valp[i])
	}
	return total

When reading the metric, all the uint64s are read and summed up to get the latest number. With this approach, metrics tracking only adds around 10% overhead to the cache performance.

Benchmarks

Now that you understand the various mechanisms present in Ristretto, let’s look at the Hit ratio and Throughput benchmarks compared to other popular Go caches.

Hit Ratios

Hit ratios were measured using Damian Gryski’s cachetest along with our own benchmarking suite. The hit ratio numbers are the same across both utilities, but we added the ability to read certain trace formats (LIRS and ARC, specifically) as well as CSV output for easier graphing. If you want to write your own benchmarks or add a trace format, check out the sim package.

To get a better idea of the room for improvement, we added a theoretically optimal cache implementation, which uses future knowledge to evict items with the least amount of hits over its entire lifetime. Note that this is a clairvoyant LFU eviction policy, where other clairvoyant policies may use LRU. Depending on the workload, LFU or LRU may be better suited, but we found clairvoyant LFU useful for comparisons with Ristretto’s Sampled LFU.

Search

This trace is described as “disk read accesses initiated by a large commercial search engine in response to various web search requests.”

Database

This trace is described as “a database server running at a commercial site running an ERP application on top of a commercial database.”

Looping

This trace demonstrates a looping access pattern. We couldn’t include Fastcache, Freecache, or Bigcache implementations in this and the following benchmark because they have minimum capacity requirements that would skew the results. Some trace files are small and require small capacities for performance measurements.

CODASYL

This trace is described as “references to a CODASYL database for a one hour period.” Note that Ristretto’s performance suffers in comparison to the others here. This is because of the LFU eviction policy being a bad fit for this workload.

Throughput

Throughput was measured using the same utility as the previous blog post, which generates a large number of keys and alternates between goroutines for Getting and Setting according to the workload.

All throughput benchmarks were ran on an Intel Core i7-8700K (3.7GHz) with 16gb of RAM.

Mixed: 25% Writes, 75% Reads

Read: 100% Reads

Write: 100% Writes

Future Improvements

As you may have noticed in the CODASYL benchmarks, Ristretto’s performance suffers in LRU-heavy workloads. However, for most workloads, our Sampled LFU policy performs quite well. The question then becomes “How can we get the best of both worlds?”

In a paper called Adaptive Software Cache Management, this exact question is explored. The basic idea is placing an LRU “window” before the main cache segment, and adaptively sizing that window using hill-climbing techniques to maximize the hit ratio. Caffeine has already seen great results by doing this. Something we believe Ristretto can benefit from as well in the future.

Special Thanks

We would like to sincerely thank Ben Manes. His depth of knowledge and dedicated, selfless sharing has been a large factor in any progress we’ve made and we are honored to have had many conversations with him about all things caching. Ristretto would just not have been possible without his guidance, support and 99.9% availability on our internal Slack channel.

We would also like to thank Damian Gryski for his help with benchmarking Ristretto and writing a reference TinyLFU implementation.

Conclusion

We set out with the goal of making a cache library competitive with Caffeine. While not completely there, we did create something significantly better than most others in the Go world at the moment by using some new techniques that others can learn from.

Some initial experiments with using this cache in Dgraph are looking promising. And we hope to integrate Ristretto into both Dgraph and Badger in the upcoming months. Do check it out and perhaps use Ristretto to speed up your workloads!

Top Image: Blue Bottle Coffee Espresso Preparation Guide.


This is a companion discussion topic for the original entry at https://blog.dgraph.io/post/introducing-ristretto-high-perf-go-cache/

(Ido Barkan) #2

Kudos guys! This was long missed in the landscape. What are the chances you add a TTL based eviction?


(Manish R Jain) #3

Feel free to create an issue. We’ll surely consider adding TTL on keys. Seems like a general enough use case.


(Ido Barkan) #4

https://github.com/dgraph-io/ristretto/issues/43 is already open.


(Aliaksandr Valialkin) #5

Awesome article with interesting technical details!
It would be interesting to see the code used for benchmarks.

Three questions:

  1. Does Ristretto store each added value separately by pointer? If yes, then caches with big number of items would lead to non-trivial amounts of CPU time spent on GC scanning all these pointers.
  2. The ristretto.Cache.Get API doesn’t allow reusing memory for the returned value unlike fastcache.Cache.Get. This means that the method either performs memory allocation on each call or it returns pointer to value owned by the cache if answer to the first question is yes. Which approach does Ristretto use?
  3. Ristretto doesn’t store keys according to this issue - it stores only 64-bit hashes for keys. This means that Cache.Get can return invalid value for different key with 50% probability on a cache with 2^(64/2)=4 billion items according to birthday paradox. How to handle this case when using Ristretto?

(Manish R Jain) #6

Hey there, @valyala,

The links to the code used for benchmarks are all listed in the blog post itself.

  1. Ristretto exposes an interface{} for the value. It could be anything, a pointer, byte array, string, int etc. Ristretto currently does not do anything specific to reduce the GC effort of scanning through these key-values.

  2. Ristretto directly returns the value as provided by the user. If it is a pointer, it is returned as it is. It is up to the user to ensure that they don’t directly manipulate the value in any way (if that’s the question).

  3. We’re working on adding another hash to be stored along with the value to decrease the probability of collisions. We could perhaps make it configurable, so a user can decide how many bits of hash they want for collision detection.