FrankReh commented :
@ben-manes Thanks for your comment and your interest in this feature. Your work on Caffeine and with W-TinyLFU is remarkable and I look forward to adding an LFU window in front of my own version of this package soon.
This issue I tried to raise yesterday I think I did a bad job of explaining. The current code doesn’t work. The fact that a 64 bit variable named hash is an input and that there are variables with names seed through seed and that XOR is being used easily obfuscates the problem - and why it’s been in the code base two months now and why it got past a code review.
There’s no bit swapping - there is no independence between the four row indexes.
The number of columns in the cm-sketch defines the number of bits used in
the bit mask. The same bit mask is used for all four rows. For convenience, the mask is of the low order bits so a simple bitwise-and makes the pseudo-random looking value fit into a power-of-2 index for the row.
In case further elaboration is helpful: the number of columns tends to be small compared to 64 bits in the input space. Say it is chosen as 16. The sketch has 16 columns. The current code uses the same 16 bits from the input variable each of the four times - the other 48 bit values are totally ignored. There is nothing that fiddles with bit positions. The fact that a randomly chosen value for each row is used to bitwise-xor those 16 bits doesn’t make the second, third or fourth row more useful. The tiny counters they end up holding are copies of the first row, just swizzled around.
In terms of the original cm-sketch paper, there is no independence between the four indexes, they are not pairwise independent.
The earlier code that indexed the counters of each row made an attempt to use all 64 bits, a different way for each row, so at least an approximation of pairwise independence.
One scheme, if one knew the cm-sketch structure was going to be 16x4, would be to use a different 16 bits of the input for each row’s index. But if you can’t rely on the input being well distributed (and the current default function in ristretto does not distribute well for keys that are recognized by the compiler as bytes or 32bit or 64 bit variables, then this would cause many tiny counters to be maxed out while not using most of the rest. So a true hashing scheme should be used, even when the key is fewer than 8 bytes. The default function does use a true hashing function for golang strings and byte slices so at least for those two forms of key, the 64 bits given to the cm-sketch will be well distributed. (I would suggest the golang types that are not strings or byte slices be copied, byte for byte, into a byte slice, and then using the byte slice hashing function provided by golang or by xxhash).
Lots of counters and other 64 bit variables use just one or two bytes of non zero data while most of the high order bytes are zero, so not putting those through a real hash also breaks a requirement of the cm-sketch algorithm - distributing the input well over all the tiny counters (just 4 bits wide in Caffeine’s and Ristretto’s case). But this is starting to digress too far from the first problem - the fact that the four rows are being indexed with the same small set of bits from the input. It’s interesting that nobody noticed, even during the code review process and I believe it has to do with the variable names and the fact that the original code also used bitwise-XOR - just not well enough for the case where the high 32 bits were all zero.
To prove this to myself - because I wouldn’t want to be wrong with a forum like this, I wrote unit tests with a evenly distributed pseudo-random input set and found the cm-sketch can be reduced to one row and the Estimate values it returns were identical to the four row version. Again, because higher order bits of the input are not used - the same small bit set are used for each index calculation and the latter three rows are reduced to being redundant.
I still hesitate pasting the code here. It is easily seen by following the GitHub provided link to the commit mentioned above.