When using `string` as keys, there could be hash collisions, and thus `Get` could return the wrong data

Moved from GitHub ristretto/30

Posted by cipriancraciun:

This is more a “design question” than an actual “issue”, however given its implications I think it is an important question which impacts either the design or the usage constraints.

Given that the KeyToHash (1) function supports string (and []byte), and it returns uint64, it is possible that there are hash collisions.

However the cache.Get (2) method doesn’t check if the found entry (if any) actually has the given key. (In fact the store doesn’t even support storing the key.)

(1) https://github.com/dgraph-io/ristretto/blob/master/z/z.go#L20
(2) https://github.com/dgraph-io/ristretto/blob/master/cache.go#L83

manishrjain commented :

Yes. We decided to use uint64 knowing that that leaves us open to collisions to avoid paying for significant memory overhead with large keys.

The idea is that, if this becomes a problem, we’ll add another hashing technique (or allow a way to do so in general), which we can use to ascertain if the key is correct or not. If there’s a collision, we immediately evict the key from the cache. Also, we’re just going to assume that the chances of a second hash (different algorithm) colliding is too low to be handled.

cipriancraciun commented :

(The statements bellow take into account that you use only a single 64bit hash function.)

Granted that by using an AES based hashing algorithm, the chance of a collision given exactly two entries is 2^63 (not 64 due to the “birthday effect”).

However quoting from Birthday problem on Wikipedia article (the probability table section), it would take only 200 million different keys to get a 0.1% chance for a collision (these keys don’t necessarily have to be all stored at the same time).


However by adding a second 64 bits hash function, you’re basically using now a “compound” 128 bits hash, which means that from a practical point of view you can just do the following:

  • instead of using AES to generate the hash, use SHA (one of the SHA functions in the family that are hardware accelerated and provides at least 192 bits);
  • use the first 64 bits of the hash as you are doing now to establish a “slot”;
  • use the remaining 128 bits to store besides the the actual data;

I think will provide a much better collision “margin”, and will increase the current storage requirements by only 16 bytes.

manishrjain commented :

AESHash is supported by go runtime, and it does 64 bytes key in 5ns on my laptop. If we can find an SHA implementation which can give us this kind of performance, we can quickly switch.

cipriancraciun commented :

[…] we can quickly switch.

But are you considering doing this “switch”? (I.e. introducing a way to check that Get returns the data for the “actual” key? Where “actual” means with a high degree of probability.)

Regarding the 192 bit hash function, one could always use the same aeshash function with three different “keys” (“seeds”), which practically would increase the hash time by a factor of 3.

manishrjain commented :

I think doing it twice should be enough (128 bits), using the second one for detecting conflict.

cipriancraciun commented :

I’m no mathematician / cryptographer but for me, given the following two conditions, it should do the trick:

  • use the mentioned 128 bit hash (split in the two 64 bit hashes);
  • provided that one doesn’t keep the values for “too much”;

I would define “too much” as:

  • given a probability of 1e-18 (which Wikipedia states it is the uncorrectable bit error rate for HDD’s),
  • one needs 2.6e10 different keys to reach a collision (with the previous chosen probability),
  • which if generated at a rate of ~10K per second,
  • should take around 28 days;

I.e. my conclusion (to be on the safe side) is that one shouldn’t keep cached data more than a week, and definitively it should be entirely flushed once a month.

6ecuk commented :

AESHash is supported by go runtime, and it does 64 bytes key in 5ns on my laptop. If we can find an SHA implementation which can give us this kind of performance, we can quickly switch.

@manishrjain
Hi, maybe look on https://github.com/minio/sha256-simd

manishrjain commented :

Yeah, minio sha256 looks useful.

dgryski commented :

For small keys, minio’s implementation has a lot of overhead. From the readme: Note that, because of the scheduling overhead, for small messages (< 1 MB) you will be better off using the regular SHA256 hashing. I have found this to be the case in my benchmarks also.

andersfylling commented :

This also means the metrics from the benchmarks might be affected by collisions giving false positives when it comes to High hit ratio.

karlmcguire commented :

Fixed in #88.

templexxx commented :

Yes. We decided to use uint64 knowing that that leaves us open to collisions to avoid paying for significant memory overhead with large keys.

The idea is that, if this becomes a problem, we’ll add another hashing technique (or allow a way to do so in general), which we can use to ascertain if the key is correct or not. If there’s a collision, we immediately evict the key from the cache. Also, we’re just going to assume that the chances of a second hash (different algorithm) colliding is too low to be handled.

I think what you want is Cuckoo hashing

martinmr commented :

@karlmcguire Hey. Is this fixed? Your comment says this was fixed by #88 but you closed and reopened this immediately. If it’s not fixed, what else should be done to close this ticket?