Hi,
I’m trying to use a custom KeyToHash function in my configuration but it has resulted in a massive drop in cache hit ratio from 90+% to almost 4%.
I’ve also updated to the most recent version of ristretto in my application.
I’m unable to understand the reason behind the drop. Can anyone please help?
Hi @muskan_sethia,
Assuming you are keeping the cache capacity and other parameters the same across the two configurations.
Can you please share what KeyToHash function you are using?
I reduced the cache capacity to half, but that cannot account for such a drop in the cache hit ratio.
My KeyToHash function is –
type SourceDestinationCellID struct {
	SourceCellID      s2.CellID
	DestinationCellID s2.CellID
}

KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	// Returned as (key hash, conflict hash).
	return uint64(srcDest.DestinationCellID), uint64(srcDest.SourceCellID)
}
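For context, here is roughly how it is plugged into the cache; the NumCounters, MaxCost and BufferItems values below are placeholders, not my real configuration:

// import "github.com/dgraph-io/ristretto"
cache, err := ristretto.NewCache(&ristretto.Config{
	NumCounters: 1e7,     // placeholder
	MaxCost:     1 << 30, // placeholder
	BufferItems: 64,
	KeyToHash: func(key interface{}) (uint64, uint64) {
		srcDest, _ := key.(models.SourceDestinationCellID)
		return uint64(srcDest.DestinationCellID), uint64(srcDest.SourceCellID)
	},
})
if err != nil {
	panic(err)
}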
Yes, that’s strange.
In both of the scenarios you mentioned, are the data set and the order of reading keys from the cache the same?
Yes. The only difference is that earlier I was converting these cell IDs into a string and letting the internal KeyToHash take care of the hashing, but now, since I already have a hash that can be converted to uint64, I'm using my custom function. Reading from the cache has also been changed accordingly.
You mean you were creating a string that is a concatenation of the source and destination cell IDs and using that string as the key, correct?
One thing that can happen is that when the keys are hashed in the first scenario, multiple string keys may map to the same hash value because of hash collisions (this is unlikely but possible), and then while accessing the cache you might get the same value for two different keys, where the latter key is actually a cache miss. In the second scenario this cannot happen, since you are using the actual cell IDs of S2 cells, which are certainly unique. This can lead to a higher cache miss ratio in the second scenario, but it's unlikely.
Can you just confirm that in the first scenario you are constructing the string keys in a manner that makes them distinct across different (SourceCellID, DestinationCellID) pairs, for example a concatenation of the two?
I was not simply converting the cell IDs into strings and concatenating them. I was using the s2 library to convert them into tokens, which are unique to the cell and are themselves strings. Hence, a collision is very unlikely.
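For reference, the first-scenario key was built roughly like this (a sketch; the exact separator and ordering may differ):

// First scenario: token-based string key, hashed by the default KeyToHash.
key := srcDest.SourceCellID.ToToken() + ":" + srcDest.DestinationCellID.ToToken()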
Also, I'm using other caching mechanisms too, and their cache hit ratios are similar to my first scenario. Applying the custom KeyToHash and updating to the new version of ristretto are the only changes I made, and they resulted in the massive drop.
I think the problem is your function. The conflict hash is there to prevent collisions, so the default version uses two different hash methods (see below):
case string:
	return MemHashString(k), xxhash.Sum64String(k)
However, in your function you are using one ID to generate one hash and the other one to generate the other hash. This is most likely leading to a lot of conflicts.
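To make that concrete, here is a small illustration with hypothetical cell IDs:

// Hypothetical IDs: two pairs that share a DestinationCellID.
a := models.SourceDestinationCellID{SourceCellID: 1, DestinationCellID: 42}
b := models.SourceDestinationCellID{SourceCellID: 2, DestinationCellID: 42}
// With your custom KeyToHash:
//   a -> key hash 42, conflict hash 1
//   b -> key hash 42, conflict hash 2
// Both entries compete for the same key hash, so at most one of them can be
// cached at a time; a Get on the other is reported as a miss because the
// conflict hash does not match.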
As a fix, you could try concatenating the two IDs into one byte array and then using the default KeyToHash. So something like:
KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	// One way to concatenate DestinationCellID and SourceCellID
	// into a single byte slice (uses "encoding/binary"):
	var b [16]byte
	binary.BigEndian.PutUint64(b[:8], uint64(srcDest.DestinationCellID))
	binary.BigEndian.PutUint64(b[8:], uint64(srcDest.SourceCellID))
	return z.KeyToHash(b[:])
}
The reason I moved to the custom KeyToHash was that, earlier, concatenating the two cell IDs caused significant memory allocation because of s2Cell.ToToken(). Since I already had a hash, I figured I could use a custom function to get rid of that allocation.
As per your suggestion, concatenating again leads me back to the same problem. The goal is to avoid the allocation.
@muskan_sethia If your srcDest.DestinationCellID and srcDest.SourceCellID are uint32, you can do something like
KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	return uint64(srcDest.DestinationCellID)<<32 | uint64(srcDest.SourceCellID), 0
}
But this is based on the assumption that your (DestinationCellID, SourceCellID) pair is unique.
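For illustration, with hypothetical 32-bit values the packing encodes the pair losslessly:

var (
	dst uint32 = 0x00000002 // hypothetical value
	src uint32 = 0x00000001 // hypothetical value
)
packed := uint64(dst)<<32 | uint64(src) // 0x0000000200000001
// Distinct (dst, src) pairs give distinct packed values, as long as both IDs
// really fit in 32 bits.
_ = packed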
As per your suggestion, using only one of the hashing methods might lead to higher collisions, right?
I tried to understand how collisions are handled in Ristretto, and as per my understanding, if a collision occurs, instead of returning !found the cache returns a wrong value. Is my understanding correct?
How does Ristretto handle the two hashing methods if separate keys, like in my initial example, are passed to it? Is there any detailed doc on this that I could refer to, or any other resource?
configurable collision checking by karlmcguire · Pull Request #88 · dgraph-io/ristretto · GitHub – The comments and the memhash implementation mentioned here don't show how the two hash methods reduce collisions.
Looking at the following code, I now have a fair idea of how the two hashes work as the key and the conflict hash.
func (m *lockedMap) get(key, conflict uint64) (interface{}, bool) {
	m.RLock()
	item, ok := m.data[key]
	m.RUnlock()
	if !ok {
		return nil, false
	}
	if conflict != 0 && (conflict != item.conflict) {
		return nil, false
	}
	// Handle expired items.
	if !item.expiration.IsZero() && time.Now().After(item.expiration) {
		return nil, false
	}
	return item.value, true
}
If I send the conflict hash as 0, it won't check for conflicts. So, I've decided to change it to the following instead –
KeyToHash: func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	return uint64(srcDest.DestinationCellID)<<32 | uint64(srcDest.SourceCellID), uint64(srcDest.SourceCellID)<<32 | uint64(srcDest.DestinationCellID)
}
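As a quick sanity check with hypothetical small IDs, the swapped pair now gets a distinct key hash too:

// Needs "fmt"; keyToHash here is just the function above bound to a variable.
keyToHash := func(key interface{}) (uint64, uint64) {
	srcDest, _ := key.(models.SourceDestinationCellID)
	return uint64(srcDest.DestinationCellID)<<32 | uint64(srcDest.SourceCellID),
		uint64(srcDest.SourceCellID)<<32 | uint64(srcDest.DestinationCellID)
}
k1, c1 := keyToHash(models.SourceDestinationCellID{SourceCellID: 1, DestinationCellID: 2})
k2, c2 := keyToHash(models.SourceDestinationCellID{SourceCellID: 2, DestinationCellID: 1})
fmt.Println(k1 != k2, c1 != c2) // true true: the swapped pair no longer shares a key hash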
Yeah, this looks reasonable.