The State of Caching in Go - Dgraph Blog


(Manish R Jain) #1

Every database system requires a smart cache. That is, a cache that keeps the most frequently or recently accessed items and has a configurable memory utilization limit.

Being a graph database, Dgraph can access thousands—even millions—of keys per query, depending upon the number of intermediate results. Accessing these via our key-value DB, Badger, results in disk seeks, which is something we want to reduce for performance reasons.

Typical access patterns follow a Zipfian distribution. The most frequently accessed keys are accessed exponentially more times than others. We see this in Dgraph as well.

We have been extremely happy with our choice of using the Go language to build Dgraph. A lot can be said about why Go makes a great language to write back-end code in, but that is a post in itself. We would not swap out Go for any other existing language. Although Go is great, the Go ecosystem is still lacking.

My general complaints around Go stem from the lack of a library ecosystem around the language. Go is mature in the sense that it is stable and has delivered upon the core promise of fast compilation, execution, and utilization of machine cores really well. However, for a language built around concurrency, it still lacks an ecosystem of performant, concurrent libraries which can scale well to the number of cores. A concurrent list or map is largely left as an exercise to the end-user—which would be just fine if this was a serially-executing language—but feels like a glaring omission when everything is built around concurrency.

In particular, Go lacks a concurrent LRU (or LFU) cache which can scale well enough to be a process-global cache. In this blog post, I will take you through the various attempts at workarounds that are typically advocated, including some which we have executed and learnt from within Dgraph. Aman will then present the design, performance and hit ratio comparison for the existing popular cache implementations in the Go ecosystem.

I’ll start by listing the requirements for the cache, the various approaches we can take to implement the cache, and how they fail to achieve them.

Requirements for the cache

  1. Concurrent.
  2. Memory-bounded (limit to configurable max memory usage).
  3. Scale well as the number of cores and goroutines increase.
  4. Scale well under non-random key access distribution (e.g. Zipf).
  5. High cache hit ratio
Use Go map with sync.Mutex

Using a Go map with a sync.Mutex (or sync.RWMutex) is the most commonly advocated approach to caching. This does result in all the goroutines blocking upon one lock, which can lead to contention. This also fails to keep a tab on the amount of memory usage. So, it does not work as a memory bounded cache in itself.

Fails on requirements 2-4.

Use lock striping with Go maps

This is the same concept as above, but splits keys using a fingerprint into many smaller Go map shards protected by mutex locks (see here). Many developers incorrectly believe lock striping to be a great way to avoid contention, particularly when setting the number of shards to exceed the number of threads in the program (> GOMAXPROCS).

In our initial attempts at building a simplified memory-bounded cache, we built this as well. To allow for releasing memory back to the OS, we would periodically pick a random shard and delete its map, allowing it to refill. This was a crude but simple technique which performed better than an LRU cache (explained below) but had many downsides.

One, Go is slow to release memory back to the OS but fast to request it. As soon as the shard was emptied, goroutines trying to access the keys in that shard would start allocating memory while the previous memory was still not released back fully, causing a spike in memory usage and a rapid OOM crash.

Also, what we failed to realize at the time was that the access patterns are bounded by Zipf’s law. The most frequently accessed keys are still under a few locks, thus becoming the cause of contention for all goroutines. The performance of this approach does not scale well with the number of cores.

Fails on requirements 2,4.

LRU cache

Go has a basic LRU cache implementation as part of groupcache. After our failed attempt with lock striping with map shards, we modified this LRU cache by introducing locks and making it concurrent. While this cache did solve the immediate issues around memory spikes caused by frequently and consistent memory releases, we realized it would introduce contention.

This cache is also sized by the number of entries, not the amount of memory they are consuming. Trying to estimate the memory usage of a complex data structure on the heap in Go is prohibitively expensive and almost impossible to get right, something we realized after trying in vain using multiple mechanisms. This got particularly hard because our data structures were changing after being placed in the cache (something that we plan to avoid going forward).

But, we did not comprehend how much contention that cache could cause. After having this cache for over a year, we realized that the contention around this cache was so significant, that removing it caused our queries to speed up 10x!

In this implementation, every read is a write which updates the relative positioning of the element in the recency linked list. Thus, all accesses wait on a single mutex lock. In addition, the critical section of LRU is slower than a map and does a lot of pointer dereferences, maintaining a map and a doubly-linked list (see code). Despite our efforts around lazy eviction, it still suffered from severe contention.

Fails on requirement 3-4.

Striped LRU cache

We did not bother trying this. From our experiments with striped map shards, we know it would only be an incremental improvement and would not scale well. (Though, for the purpose of benchmarking caches for this article, we implemented a striped LRU cache as described below.)

Would fail on requirement 4.

Popular Cache Implementations

Many other approaches aim at reducing the GC time spent on the map shards. GC time increases as the number of entries in the map increase. Reduction is achieved by allocating fewer, larger byte slices and storing many cache entries in each slice. This is an effective approach—we use this in Badger in multiple places (Skiplist, Table builder, etc.). Some of the popular caches in Go use this technique.

BigCache

BigCache divides the data into shards based on the hash of the key. Each shard contains a map and a ring buffer. Whenever a new element is set, it appends the element in the ring buffer of the corresponding shard and the offset into the buffer is stored in the map. If the same element is set more than once, the previous entries in the buffer are marked invalid. If the buffer is too small, it is expanded until the maximum capacity is reached.

Each map key is a uint32 hash and the value is a uint32 pointer to the offset in the buffer where the value is stored along with metadata information. If there are hash collisions, BigCache ignores the previous key and stores the current one into the map. Allocating fewer, larger buffers up front and using a map[uint32]uint32 is a great way to avoid paying the cost of GC sweeps.

FreeCache

FreeCache divides the cache into 256 segments. Each segment contains 256 slots and a ring buffer to store the data. When a new key is added to the cache, a segment id is identified using the lower eight bits of the hash of the key. Further, a slot is chosen using LSB 9-16 of the hash of the key. Dividing data into slots helps in reducing the search space while looking for a key in the cache.

The data is then appended into the ring buffer and offset is stored into a sorted array. If a ring buffer doesn’t have enough space, eviction is performed in the segment from the beginning of the ring buffer using a modified LRU policy. An entry is deleted from the ring buffer if the last access time for the entry is smaller than the average access time of the segment. To find an entry in a cache on Get, a binary search is performed in the sorted array in the corresponding slot.

GroupCache

GroupCache implements an exact LRU eviction policy using a linked list and a Go map. For a fair comparison, we implemented a sharding logic with 256 shards on top of GroupCache.

Performance Comparison

To compare performance of various caches, we generated a Zipf-distributed workload and ran benchmarks using an n1-highcpu-32 machine. The chart below compares performance of the three cache libraries for a read-only workload.

Read-Only

We can see that BigCache reads scales well given that reads are lock-free. FreeCache and GroupCache reads are not lock-free and don’t scale after a point (20 concurrent accesses). (lower value is better on y axis)

Write-Only

For a write-only workload, all the libraries seem to perform similarly. Though, FreeCache is slightly faster than the other two.

Read-Write (25% writes, 75% reads)

For a mixed workload containing 25% writes and 75% reads, while BigCache is the only library that clearly seems to scale well, the hit ratios are bad for a Zipf workload, as explained in the next section.

Hit Ratio Comparison

Hit Ratio for the three caches are shown below. FreeCache does pretty close to LRU policy implemented by GroupCache. BigCache, however, doesn’t do well for a Zipf-distributed workload for following reasons:

  • BigCache doesn’t utilize the buffer efficiently and may end up storing multiple entries for the same key in the buffer.
  • BigCache doesn’t update entries on access (read), hence, resulting in eviction of recently accessed keys.
Cache Size (# of elem) 10000 100000 1000000 10000000 BigCache - 37% 52% 55% FreeCache - 38% 55% 90% GroupCache 29% 40% 54% 90%

So, we can conclude that none of the cache libraries meet all the requirements.

GroupCache and FreeCache fails on requirement 4 whereas BigCache fails on requirement 5.

So, what are we left with?

Well, nothing really. We are not aware of a smart memory-bounded cache in Go that can meet the entire list of requirements. If you know of one, do let us know in the comments.

Meanwhile, we came across Caffeine, a Java library used by Cassandra, Finagle and other database systems. It uses TinyLFU, a highly efficient cache admission policy and uses various techniques to scale and perform well as the number of threads and cores grow, while providing a close to optimal hit ratio. You can read more about how it works in this article.

Caffeine meets all the five requirements I mentioned at the beginning, so we are looking into building a Go equivalent of Caffeine, something which can serve our needs and potentially fill the gap of a concurrent, performant, memory-bounded cache in the Go language. Let us know if you’d like to contribute or if you have something similar built already!

Acknowledgement

We want to thank Benjamin Manes for helping us replicate performance benchmarks from Caffeine into Go (code available here). We also would like to thank Damian Gryski for providing us with base framework to benchmark cache hit ratios (available here), which we also modified to work for our needs. He readily accepted our changes into his Github repository.

Top image: A ristretto doppio in Chiang Mai.


This is a companion discussion topic for the original entry at https://blog.dgraph.io/post/caching-in-go/

(Farhan) #2

Awesome writeup! I’d written this inmem cache for caching on server side ZIZOU keeping async eviction in mind, ran your benchmarks, not that efficient though compared to freecache, but yeah, can look into where I could make it better.


(Artem Krylysov) #3

Use of sync.Map is discouraged by the Go team for various reasons. One of them is that it does not reduce in size, despite deleting elements in the map

Thanks for the benchmarks, where can I find more details about the issue?


(Manish R Jain) #4

I think the best place to read about sync.Map would be the official documentation:

https://golang.org/pkg/sync/#Map


(Artem Krylysov) #5

Thanks, but I couldn’t find anything about sync.Map not reducing in size when you delete elements from it. Could you please point me to the right direction?


(Manish R Jain) #6

You can look at the map.Delete code here, in particular entry.Delete. It only swaps the pointer of value with nil, but does not result in the key being removed from the Map (unless the key happens to be in the dirty portion of the Map).

https://golang.org/src/sync/map.go?s=8705:8742#L257


(Artem Krylysov) #7

I see now. Thanks for explaining!


(Voodoo Entity) #8

Right now im working myself on a golang Graph(like) database. I faced the exact similar Problems. Even if i hoped the article would end in a solution that would solve them, it kinda calms me that i didn’t fail to see the “proper” way. I (for now) decided to use the RW.Mutex instead of a shard(Bigcache) solution. Still, thanks for the article :slight_smile: !


(Sebastian Himberger) #9

Thanks for the post! I was also recently looking at go caching libraries and didn’t turn up anything. Good to know I didn’t overlook anything.


(Sokolov Yura) #10

Deleted entries will be actually deleted eventually in missLocked when dirty replaces readonly.


(Sokolov Yura) #11

Demian Grysky has tinyLFU implementation https://github.com/dgryski/go-tinylfu .
Did you try it?


(Aman Mangal) #12

Not as part of the benchmarks, but clearly this will not have good concurrent performance (Requirement 3,4)


(Artem Grechka) #13

Hey guys, there’s great lib from HashiCorp: https://github.com/hashicorp/golang-lru


(Sokolov Yura) #14

No, it is not great if concurrency concerned (single RWMutex for whole cache)


(James) #15

Concurrency can be achieve by combining with single flight algorithm infront of lru cache?


(Aliaksandr Valialkin) #16

It would be great to benchmark https://github.com/VictoriaMetrics/fastcache . It should be faster than BigCache and should use less memory for the same amount of cached data due to lower memory fragmentation.


(Aman Mangal) #17

This looks very similar to bigcache. Would you help me understand why it is doing better in performance than BigCache? The one major difference that I observe is the use of two dimensional byte array, though, it is still used as a one dimensional ring buffer.


(Sokolov Yura) #18

Aliaksandr is an author of fasthttp. I have no doubds he could do any algorithm and datastructure very efficiently.


(Aliaksandr Valialkin) #19

This looks very similar to bigcache. Would you help me understand why it is doing better in performance than BigCache? The one major difference that I observe is the use of two dimensional byte array, though, it is still used as a one dimensional ring buffer.

There are multiple things that make it faster than BigCache:

  • It has less abstraction layers
  • It has API optimized for zero-allocation mode. For instance, Cache.Get accepts destination buffer for the returned value, so this buffer may be re-used.
  • As you already noticed, it splits shards into chunks in order to minimize CPU and memory overhead during shards’ growth.
  • It uses the fastest hash function - xxhash.

(Justin Lovero) #21

Looks like a Ristretto repo was added a couple of days ago: https://github.com/dgraph-io/ristretto. Presumably this is the repo for the Caffeine-inspired concurrent cache library you will be building?