When using `string` as keys, there could be hash collisions, and thus `Get` could return the wrong data

diggy · July 30, 2019, 5:29pm

Moved from GitHub ristretto/30

This is more a “design question” than an actual “issue”, however given its implications I think it is an important question which impacts either the design or the usage constraints.

Given that the KeyToHash (1) function supports string (and []byte), and it returns uint64, it is possible that there are hash collisions.

However the cache.Get (2) method doesn’t check if the found entry (if any) actually has the given key. (In fact the store doesn’t even support storing the key.)

(1) ristretto/z.go at master · dgraph-io/ristretto · GitHub
(2) ristretto/cache.go at master · dgraph-io/ristretto · GitHub

diggy · July 30, 2019, 6:09pm

manishrjain commented :

Yes. We decided to use uint64 knowing that that leaves us open to collisions to avoid paying for significant memory overhead with large keys.

The idea is that, if this becomes a problem, we’ll add another hashing technique (or allow a way to do so in general), which we can use to ascertain if the key is correct or not. If there’s a collision, we immediately evict the key from the cache. Also, we’re just going to assume that the chances of a second hash (different algorithm) colliding is too low to be handled.

diggy · July 30, 2019, 6:46pm

cipriancraciun commented :

(The statements bellow take into account that you use only a single 64bit hash function.)

Granted that by using an AES based hashing algorithm, the chance of a collision given exactly two entries is 2^63 (not 64 due to the “birthday effect”).

However quoting from Birthday problem on Wikipedia article (the probability table section), it would take only 200 million different keys to get a 0.1% chance for a collision (these keys don’t necessarily have to be all stored at the same time).

However by adding a second 64 bits hash function, you’re basically using now a “compound” 128 bits hash, which means that from a practical point of view you can just do the following:

instead of using AES to generate the hash, use SHA (one of the SHA functions in the family that are hardware accelerated and provides at least 192 bits);
use the first 64 bits of the hash as you are doing now to establish a “slot”;
use the remaining 128 bits to store besides the the actual data;

I think will provide a much better collision “margin”, and will increase the current storage requirements by only 16 bytes.

diggy · July 30, 2019, 8:14pm

manishrjain commented :

AESHash is supported by go runtime, and it does 64 bytes key in 5ns on my laptop. If we can find an SHA implementation which can give us this kind of performance, we can quickly switch.

diggy · July 30, 2019, 8:52pm

cipriancraciun commented :

[…] we can quickly switch.

But are you considering doing this “switch”? (I.e. introducing a way to check that Get returns the data for the “actual” key? Where “actual” means with a high degree of probability.)

Regarding the 192 bit hash function, one could always use the same aeshash function with three different “keys” (“seeds”), which practically would increase the hash time by a factor of 3.

diggy · July 30, 2019, 9:11pm

manishrjain commented :

I think doing it twice should be enough (128 bits), using the second one for detecting conflict.

diggy · July 30, 2019, 9:28pm

cipriancraciun commented :

I’m no mathematician / cryptographer but for me, given the following two conditions, it should do the trick:

use the mentioned 128 bit hash (split in the two 64 bit hashes);
provided that one doesn’t keep the values for “too much”;

I would define “too much” as:

given a probability of 1e-18 (which Wikipedia states it is the uncorrectable bit error rate for HDD’s),
one needs 2.6e10 different keys to reach a collision (with the previous chosen probability),
which if generated at a rate of ~10K per second,
should take around 28 days;

I.e. my conclusion (to be on the safe side) is that one shouldn’t keep cached data more than a week, and definitively it should be entirely flushed once a month.

diggy · September 26, 2019, 10:03am

6ecuk commented :

AESHash is supported by go runtime, and it does 64 bytes key in 5ns on my laptop. If we can find an SHA implementation which can give us this kind of performance, we can quickly switch.

@manishrjain
Hi, maybe look on GitHub - minio/sha256-simd: Accelerate SHA256 computations in pure Go using AVX512, SHA Extensions for x86 and ARM64 for ARM. On AVX512 it provides an up to 8x improvement (over 3 GB/s per core). SHA Extensions give a performance boost of close to 4x over native.

diggy · September 26, 2019, 11:16pm

manishrjain commented :

Yeah, minio sha256 looks useful.

diggy · September 27, 2019, 2:54am

dgryski commented :

For small keys, minio’s implementation has a lot of overhead. From the readme: Note that, because of the scheduling overhead, for small messages (< 1 MB) you will be better off using the regular SHA256 hashing. I have found this to be the case in my benchmarks also.

diggy · September 29, 2019, 5:37pm

andersfylling commented :

This also means the metrics from the benchmarks might be affected by collisions giving false positives when it comes to High hit ratio.

diggy · October 25, 2019, 5:56pm

karlmcguire commented :

Fixed in #88.

diggy · January 3, 2020, 1:49pm

templexxx commented :

Yes. We decided to use uint64 knowing that that leaves us open to collisions to avoid paying for significant memory overhead with large keys.

The idea is that, if this becomes a problem, we’ll add another hashing technique (or allow a way to do so in general), which we can use to ascertain if the key is correct or not. If there’s a collision, we immediately evict the key from the cache. Also, we’re just going to assume that the chances of a second hash (different algorithm) colliding is too low to be handled.

I think what you want is Cuckoo hashing

diggy · February 21, 2020, 10:30pm

martinmr commented :

@karlmcguire Hey. Is this fixed? Your comment says this was fixed by #88 but you closed and reopened this immediately. If it’s not fixed, what else should be done to close this ticket?

Topic		Replies	Views
Document behavior on key (hash) collisions Ristretto	1	1172	December 26, 2021
Using custom KeyToHash in configuration resulting in drop of cache hit ratio Ristretto ristretto	12	991	July 16, 2020
KeyToHash possibly letting key interfaces escape to heap Issues ristretto , status:accepted , kind:bug , priority:p1	2	912	September 30, 2020
Cache key is fingerprint Users	4	711	November 28, 2017
Introducing Ristretto: A High-Performance Go Cache - Dgraph Blog Blog	8	2593	August 11, 2021

When using `string` as keys, there could be hash collisions, and thus `Get` could return the wrong data

Related topics