Hi,
I’m trying to use a custom KeyToHash function in my configuration but it has resulted in a massive drop in cache hit ratio from 90+% to almost 4%.
I’ve also updated to the most recent version of ristretto in my application.
I’m unable to understand the reason behind the drop. Can anyone please help?
Hi @muskan_sethia,
Assuming you are keeping the cache capacity and other parameters the same across the two configurations.
Can you please share what KeyToHash function you are using?
I reduced the cache capacity to half, but that cannot account for such a drop in the cache hit ratio.
My KeyToHash function is –
type SourceDestinationCellID struct {
	SourceCellID      s2.CellID
	DestinationCellID s2.CellID
}

KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	// Returned as (key hash, conflict hash).
	return uint64(srcDest.DestinationCellID), uint64(srcDest.SourceCellID)
}
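For context, here is roughly how it is plugged into the cache; the NumCounters, MaxCost and BufferItems values below are placeholders, not my real configuration:

// import "github.com/dgraph-io/ristretto"
cache, err := ristretto.NewCache(&ristretto.Config{
	NumCounters: 1e7,     // placeholder
	MaxCost:     1 << 30, // placeholder
	BufferItems: 64,
	KeyToHash: func(key interface{}) (uint64, uint64) {
		srcDest, _ := key.(models.SourceDestinationCellID)
		return uint64(srcDest.DestinationCellID), uint64(srcDest.SourceCellID)
	},
})
if err != nil {
	panic(err)
}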
Yes, that’s strange.
In both of the scenarios you mentioned, are the data set and the order of reading keys from the cache the same?
Yes. The only difference is that earlier I was converting these cell IDs into a string and letting the internal KeyToHash take care of the hashing, but now, since I already have a hash that can be converted to uint64, I'm using my custom function. Reading from the cache has also been changed accordingly.
You mean you were creating a string that is a concatenation of the source and destination cell IDs and using that string as the key, correct?
One thing that can happen is that when the keys are hashed in the first scenario, multiple string keys may map to the same hash value because of hash collisions (this is unlikely but possible), and then while accessing the cache you might get the same value for two different keys, where the latter key is actually a cache miss. In the second scenario this cannot happen, since you are using the actual cell IDs of S2 cells, which are certainly unique. This can lead to a higher cache miss ratio in the second scenario, but it's unlikely.
Can you just confirm that in the first scenario you are constructing the string keys in a manner that makes them distinct across different (SourceCellID, DestinationCellID) pairs, for example a concatenation of the two?
I was not simply converting the cell IDs into strings and concatenating them. I was using the s2 library to convert them into tokens, which are unique to the cell and are themselves strings. Hence, a collision is very unlikely.
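For reference, the first-scenario key was built roughly like this (a sketch; the exact separator and ordering may differ):

// First scenario: token-based string key, hashed by the default KeyToHash.
key := srcDest.SourceCellID.ToToken() + ":" + srcDest.DestinationCellID.ToToken()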
Also, I'm using other caching mechanisms too, and their cache hit ratios are similar to my first scenario. Applying the custom KeyToHash and updating to the new version of ristretto are the only changes I made, and they resulted in the massive drop.
I think the problem is your function. The conflict hash is there to prevent collisions, so the default version uses two different hash methods (see below):
case string:
	return MemHashString(k), xxhash.Sum64String(k)
However, in your function you are using one ID to generate one hash and the other one to generate the other hash. This is most likely leading to a lot of conflicts.
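To make that concrete, here is a small illustration with hypothetical cell IDs:

// Hypothetical IDs: two pairs that share a DestinationCellID.
a := models.SourceDestinationCellID{SourceCellID: 1, DestinationCellID: 42}
b := models.SourceDestinationCellID{SourceCellID: 2, DestinationCellID: 42}
// With your custom KeyToHash:
//   a -> key hash 42, conflict hash 1
//   b -> key hash 42, conflict hash 2
// Both entries compete for the same key hash, so at most one of them can be
// cached at a time; a Get on the other is reported as a miss because the
// conflict hash does not match.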
As a fix, you could try concatenating the two IDs into one byte array and then using the default KeyToHash. So something like:
KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	// One way to concatenate DestinationCellID and SourceCellID
	// into a single byte slice (uses "encoding/binary"):
	var b [16]byte
	binary.BigEndian.PutUint64(b[:8], uint64(srcDest.DestinationCellID))
	binary.BigEndian.PutUint64(b[8:], uint64(srcDest.SourceCellID))
	return z.KeyToHash(b[:])
}
The reason I moved to the custom KeyToHash was that, earlier, concatenating the two cell IDs caused significant memory allocation because of s2Cell.ToToken(). Since I already had a hash, I figured I could use a custom function to get rid of that allocation.
As per your suggestion, concatenating again leads me back to the same problem. The goal is to avoid the allocation.
@muskan_sethia If your srcDest.DestinationCellID and srcDest.SourceCellID are uint32, you can do something like
KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	return uint64(srcDest.DestinationCellID)<<32 | uint64(srcDest.SourceCellID), 0
}
But this is based on the assumption that your (DestinationCellID, SourceCellID) pair is unique.
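For illustration, with hypothetical 32-bit values the packing encodes the pair losslessly:

var (
	dst uint32 = 0x00000002 // hypothetical value
	src uint32 = 0x00000001 // hypothetical value
)
packed := uint64(dst)<<32 | uint64(src) // 0x0000000200000001
// Distinct (dst, src) pairs give distinct packed values, as long as both IDs
// really fit in 32 bits.
_ = packed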
As per your suggestion, using only one of the hashing methods might lead to higher collisions, right?
I tried to understand how collisions are handled in Ristretto, and as per my understanding, if a collision occurs, instead of returning !found the cache returns a wrong value. Is my understanding correct?
How does Ristretto handle the two hashing methods if separate keys, like in my initial example, are passed to it? Is there any detailed doc on this that I could refer to, or any other resource?
configurable collision checking by karlmcguire · Pull Request #88 · dgraph-io/ristretto · GitHub – The comments and the memhash implementation mentioned here don't show how the two hash methods reduce collisions.
Looking at the following code, I now have a fair idea of how the two hashes work as the key and the conflict hash.
func (m *lockedMap) get(key, conflict uint64) (interface{}, bool) {
	m.RLock()
	item, ok := m.data[key]
	m.RUnlock()
	if !ok {
		return nil, false
	}
	if conflict != 0 && (conflict != item.conflict) {
		return nil, false
	}
	// Handle expired items.
	if !item.expiration.IsZero() && time.Now().After(item.expiration) {
		return nil, false
	}
	return item.value, true
}
If I send the conflict hash as 0, it won't check for conflicts. So, I've decided to change it to the following instead –
KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	return uint64(srcDest.DestinationCellID)<<32 | uint64(srcDest.SourceCellID), uint64(srcDest.SourceCellID)<<32 | uint64(srcDest.DestinationCellID)
}
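As a quick sanity check with hypothetical small IDs, the swapped pair now gets a distinct key hash too:

// Needs "fmt"; keyToHash here is just the function above bound to a variable.
keyToHash := func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	return uint64(srcDest.DestinationCellID)<<32 | uint64(srcDest.SourceCellID),
		uint64(srcDest.SourceCellID)<<32 | uint64(srcDest.DestinationCellID)
}
k1, c1 := keyToHash(models.SourceDestinationCellID{SourceCellID: 1, DestinationCellID: 2})
k2, c2 := keyToHash(models.SourceDestinationCellID{SourceCellID: 2, DestinationCellID: 1})
fmt.Println(k1 != k2, c1 != c2) // true true: the swapped pair no longer shares a key hash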
Yeah, this looks reasonable.