cmSketch not benefitting from four rows

Moved from GitHub ristretto/108

Posted by FrankReh:

I believe commit f823dc4a defeated the purpose of the cm sketch’s four rows. The bitwise-XOR and mask do not create the mix of indexes the design expects: the same Increment/Estimate results, the same values, are achieved from a single row, with or without the bitwise-XOR. The earlier implementation seems to have been a good and very fast approximation of a distinct hash per row, except when the high 32 bits were all zero. One way to fix the earlier version could be to bitwise-OR a single bit into the top 32-bit half. I can provide my testing and benchmarking on this offline if interested.

For small values of row length as used in the current unit tests, this doesn’t matter. As the row length gets larger, the gap between the earlier algorithm and the current one widens and I believe becomes significant.

ben-manes commented :

Caffeine uses a 32-bit spread function derived by computational analysis, as described on StackOverflow. While Java only offers a 32-bit Object.hashCode(), the quality of this function and of the explicitly chosen seeds has been very good. I suspect that the use of random seeds here would degrade the distribution quality in comparison.

A similar analysis offering other recommended spreading functions is suggested by hash-prospector.

FrankReh commented :

@ben-manes Thanks for your comment and your interest in this feature. Your work on Caffeine and with W-TinyLFU is remarkable and I look forward to adding an LFU window in front of my own version of this package soon.

I think I did a bad job yesterday of explaining the issue I tried to raise. The current code doesn’t work. The fact that a 64-bit variable named hash is an input, that there are variables named seed[0] through seed[3], and that XOR is being used easily obfuscates the problem - and explains why it’s been in the code base for two months and why it got past code review.

There’s no bit swapping - there is no independence between the four row indexes.

The number of columns in the cm-sketch defines the number of bits used in the bit mask. The same bit mask is used for all four rows. For convenience, the mask covers the low-order bits, so a simple bitwise-AND fits the pseudo-random-looking value into a power-of-2 index for the row.

In case further elaboration is helpful: the number of columns tends to be small compared to the 64-bit input space. Say the mask covers 16 bits, i.e. the sketch has 2^16 columns. The current code uses the same 16 bits of the input variable all four times - the other 48 bits are totally ignored. Nothing fiddles with bit positions. The fact that a randomly chosen value for each row is bitwise-XORed into those 16 bits doesn’t make the second, third or fourth row more useful. The tiny counters they end up holding are copies of the first row, just swizzled around.

In terms of the original cm-sketch paper, there is no independence between the four indexes; they are not pairwise independent.
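
(For reference - hedged from memory of the Cormode–Muthukrishnan paper - the count-min guarantee leans on exactly that pairwise independence per row: with width $w = \lceil e/\varepsilon \rceil$ and depth $d = \lceil \ln(1/\delta) \rceil$, a point query satisfies)

```latex
\hat{a}_i \;\le\; a_i + \varepsilon \,\lVert a \rVert_1
\quad \text{with probability at least } 1 - \delta .
```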

The earlier code that indexed the counters of each row made an attempt to use all 64 bits, a different way for each row, so at least an approximation of pairwise independence.

One scheme, if one knew the cm-sketch structure was going to be 4 rows of 2^16 columns, would be to use a different 16 bits of the input for each row’s index. But if you can’t rely on the input being well distributed (and the current default function in ristretto does not distribute well for keys the compiler recognizes as byte, 32-bit, or 64-bit variables), then this would cause many tiny counters to be maxed out while most of the rest go unused. So a true hashing scheme should be used, even when the key is fewer than 8 bytes. The default function does use a true hash function for Go strings and byte slices, so at least for those two forms of key, the 64 bits given to the cm-sketch will be well distributed. (I would suggest the Go types that are not strings or byte slices be copied, byte for byte, into a byte slice, and then hashed with the byte-slice hashing function provided by Go or by xxhash.)

Lots of counters and other 64-bit variables use just one or two bytes of non-zero data while the high-order bytes are mostly zero, so not putting those through a real hash also breaks a requirement of the cm-sketch algorithm - distributing the input well over all the tiny counters (just 4 bits wide in Caffeine’s and Ristretto’s case). But this is digressing from the first problem - that the four rows are being indexed with the same small set of bits from the input. It’s interesting that nobody noticed, even during code review, and I believe it has to do with the variable names and the fact that the original code also used bitwise-XOR - just not well enough for the case where the high 32 bits were all zero.

To prove this to myself - because I wouldn’t want to be wrong in a forum like this - I wrote unit tests with an evenly distributed pseudo-random input set and found the cm-sketch can be reduced to one row, and the Estimate values it returns are identical to the four-row version. Again, because the higher-order bits of the input are not used - the same small set of bits is used for each index calculation - the latter three rows are reduced to being redundant.

I still hesitate pasting the code here. It is easily seen by following the GitHub provided link to the commit mentioned above.

ben-manes commented :

Thanks @FrankReh.

I cannot speak to the Golang implementation, as it has many differences from the Java library that I wrote - similar in concepts, yet different in all implementation choices. From glancing at its code, I can see why there would be concerns about the quality of its results.

Can you try evaluating with either (a) the changes that I suggested or (b) the Java version? I believe the Java version has a better distribution of the bits. It first scrambles the input hash using a verified good hash spreader. From this it uses the lower bits to choose the four rows (out of 16) to operate on. It then applies a fixed set of seeds, known to produce good, distinct spreads, to the scrambled hash. The top and bottom halves of each result are added together before reducing to the counter index. When run against various traces, this produced cache hit rates very similar to using an off-the-shelf 64-bit CMS library instead.

The Java version would still have the problems of not being pairwise independent, but hopefully it has a better approximation that is good enough for our needs. But if it too has this problem then I’d like to incorporate your test case and apply a fix.

static final long[] SEED = { // A mixture of seeds from FNV-1a, CityHash, and Murmur3
    0xc3a5c85c97cb3127L, 0xb492b66fbe98f273L, 0x9ae16a3b2f90404fL, 0xcbf29ce484222325L};

/**
 * Returns the table index for the counter at the specified depth.
 *
 * @param item the element's hash
 * @param i the counter depth
 * @return the table index
 */
int indexOf(int item, int i) {
  long hash = SEED[i] * item;
  hash += hash >>> 32;
  return ((int) hash) & tableMask;
}

/**
 * Applies a supplemental hash function to a given hashCode, which defends against poor quality
 * hash functions.
 */
int spread(int x) {
  x = ((x >>> 16) ^ x) * 0x45d9f3b;
  x = ((x >>> 16) ^ x) * 0x45d9f3b;
  return (x >>> 16) ^ x;
}

FrankReh commented :

@ben-manes
Thanks for the help. The 64-bit indexOf function works very well for four rows and 16 columns. Even better if I applied the spread to the 64 bits first. But what’s really cool is that hash-code generator you pointed us to, written up at https://nullprogram.com/blog/2018/07/31/

His description of a low-biased and perfectly invertible hash that his JIT program builds and evaluates is really something. I’ve been running it on a single core for the last couple of hours and it has already spit out a version that works best in all my tests so far. Far better than the original hash used in this package, and better than any I had cobbled together. I have two test cases I’m using for comparisons: one that records when the first false estimate occurs just before a new item is added to the sketch, and one that records when 0.1% of the new items added to the sketch have experienced false estimates.

For one test case, your indexOf(spread(x)) came out a little ahead; for the other, the JIT-generated algorithm came out a little ahead. I’m going to let the JIT keep chugging to see if it finds something that is, by its own definition, better. The score it gives the current one is 21, which is high by its standards: it means the hash is too biased, favoring some bit flips over others, and that’s just the kind of thing we want to smooth out when dealing with 4-bit counters that are easily maxed out.

But I’ve been limiting my tests today to cm-sketch structures that have exactly 64 counters in them - 4 rows of 16 columns, or 8x8, or even 16x4 - so I get a nice one-bit-per-counter mapping. When I expand the tests to again use wider rows, I’ll need more than 64 bits in total to index them, and your indexOf function provides that nicely (256 bits). So unless the JIT finds something extra special, or I figure out how to modify it, your function looks like the right choice.

The other idea I’m mulling over is looking at the JIT code to see if I can tweak it to rule out poorly biased guesses more quickly, so it can search for 64-to-256-bit hashes. Probably beyond the time I have, but these are interesting tools, and the ability to generate code to find a better fit for something is very useful.

And to save on run-time costs, your approach of using a few bits after the spread to select which four rows to work on sounds very nice. I may settle on 8 rows, despite the slight added overhead, because tests yesterday showed a marked benefit when I halved the columns but doubled the rows. But that was around an inflection point, because doubling the rows and halving the columns one more time degraded the results - I think because it became too easy to max out a 4-bit counter. But your approach of having only a subset of rows in play for a given hash is nice. Yet another level of indirection that gives a nice time-for-space tradeoff.

But alas, another factor will have to be what these things do to the actual L1 and L2 caches. Allocate too many bytes to the cm-sketch, and the read and write accesses to those counters will make the hardware caches thrash. But first things first: a hashing scheme that uses the 4-bit counters effectively over 4 or 8 rows. Thanks!

Already long winded so …

Taken from the hash prospector write-up, an important pair of goals for these kinds of hashes is:

  • High avalanche effect. When I flip one input bit, the output bits should each flip with a 50% chance.
  • Low bias. Ideally there is no correlation between which output bits flip for a particular flipped input bit.

I’ve written something that lets me see the avalanche and bias for each algorithm, one input bit at a time, so there are only 64 rows per scheme to look at. A lot of the schemes we’ve been using or looking at are highly biased - especially the ones that don’t come with magic 64-bit seed values to create a good mix right off the bat.

ben-manes commented :

Thanks for the detailed analysis, @FrankReh. I’m no expert in hashing but do enjoy reading about others’ findings, and @thomasmueller is to thank for the scheme used in Caffeine. His description and the program he wrote to discover the constants are in the StackOverflow post that I linked to above. Whatever the end result is, I’d be interested in seeing your code and porting any test cases to our frequency sketch, to help ensure it retains a high-quality scheme.

thomasmueller commented :

Hi - I agree with both @FrankReh and @ben-manes. The blog post https://nullprogram.com/blog/2018/07/31/ is very good.

If possible, I think it makes sense to use 64-bit spread / mix functions now. Back when I was testing this, the 32-bit spread function (in Java) was still a bit faster, but I think that’s no longer the case. I suggest using the Murmur3 64-bit finalizer:

h ^= h >> 33;
h *= 0xff51afd7ed558ccd;
h ^= h >> 33;
h *= 0xc4ceb9fe1a85ec53;
h ^= h >> 33;

That is C / C++ code; Java code would be, for example:

x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
x = x ^ (x >>> 33);

If you want to add a seed to that, it is sufficient to add it to x, for example (Java):

public static long hash64(long x, long seed) {
    x += seed;
    x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
    x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
    x = x ^ (x >>> 33);
    return x;
}

karlmcguire commented :

Thank you for the detailed writeup @FrankReh and input from everyone else.

The broader issue mentioned is that, because we are accepting interface{} keys, the default z.KeyToHash function only guarantees a well distributed output for underlying string and []byte inputs. I’ve been considering making the public API require []byte keys for other reasons (#107), so this issue has pretty much made up my mind. Doing so will make the solution to this issue infinitely easier, in my opinion.

Now that we can assume a well-distributed uint64 hash, @FrankReh’s idea of using independent 16-bit segments to find the counter index on each row appeals to me because of its simplicity. I’ve written up a PoC here and it appears to be benefitting from the rows.

func (s *Sketch) Increment(hash uint64) {
    for i := range s.blocks {
        // fibonacci hashing 16-bit segment of the hash to get a different
        // block index on each row, reduce collisions
        mixed := fib.Hash(hash << (i * 16) >> 48)
        // right shift based on the size of the row
        block := &s.blocks[i][mixed>>s.offset]
        // shift determines whether we use the left or right half of the block
        shift := (mixed & 1) * 4
        if (*block>>shift)&0x0f < 15 {
            *block += 1 << shift
        }
    }
}

I’m using Fibonacci Hashing, which is fast and provides a pretty good distribution for how simple it is. I’m not remotely close to a hashing expert, but this seems like a good solution.

I’m going to read some of the things suggested here and write some unit tests for Ristretto, then fix this issue.

thomasmueller commented :

I see; however, I think you can do even better (faster and more robust). “Fibonacci hashing is not a really good hash function” (a quote from the article you linked to), so it won’t protect you from bad hash codes, and for good hash codes it’s not needed. For the case you mentioned, it’s fast to generate a good 64-bit hash value first, split it into two halves, and then just add inside the loop - the same as done in the Google Guava Bloom filter, and here:

// good hash function (e.g. Murmur 3 64-bit finalizer)
int64_t hash = hasher(key);
// split in 2 halves
uint32_t a = (uint32_t)(hash >> 32);
uint32_t b = (uint32_t)hash;
for (int j = 0; j < k; j++) {
  int index = reduce(a, this->arrayLength);
  ...
  a += b;
}

The “reduce” part you can skip (it allows to use non-power-of-2 sizes). But the main loop is:

for i := range s.blocks {
    // ... index this row using a ...
    a += b
}

FrankReh commented :

@karlmcguire @thomasmueller Is there a concern that the hash isn’t good enough at distributing the entropy across all the row counters somewhat evenly? Or is the extra Fibonacci hash, or the idea of adding the upper bits to the lower bits, there to solve the problem of needing fewer than 64 bits for the total block indexing?

karlmcguire commented :

Is there a concern the hash isn’t good enough at distributing the entropy across all the row counters somewhat evenly?

Yes, this was my concern. If we’re allowing people to provide their own KeyToHash I’d be more comfortable if there was an extra mixing function with low overhead (doesn’t necessarily need to be a good hash function, just good enough at mixing a hash somewhat randomly).

@thomasmueller I agree that Fibonacci Hashing is not a good standalone hash function, but it does some mixing with just one multiplication operation. I like the solution you provided. I will try that out.

ben-manes commented :

Using @thomasmueller’s idea of adding the seed, I adapted my sketch indexing function slightly. This variation ensures that an item hash of zero, multiplied by the constant, doesn’t result in the same array index across all rows. The balance of cost versus quality from a mixing function followed by per-row seeded indexing has been pretty great in my experience.

int indexOf(int item, int i) {
  long hash = (item + SEED[i]) * SEED[i];
  hash += (hash >>> 32);
  return ((int) hash) & tableMask;
}

karlmcguire commented :

This solution proposed by @thomasmueller is about 30% faster without Fibonacci Hashing. I think it’s probably worth it; we just need to make it known that KeyToHash must be a strong hash function, as we’re not going to mix it.

func (s *Sketch) Increment(hash uint64) {
    // left shifting a because we'll later right shift it with s.offset
    a, b := hash<<32, hash
    for i := range s.blocks {
        // right shift based on the size of the row
        block := &s.blocks[i][a>>s.offset]
        // shift determines whether we use the left or right half of the block
        shift := (a & 1) * 4
        if (*block>>shift)&0x0f < 15 {
            *block += 1 << shift
        }
        a += b
    }
}

karlmcguire commented :

Hit ratio testing results:

  • 4 rows: 41% using S3 trace with 400,000 capacity
  • 1 row: 38% using S3 trace with 400,000 capacity
  • 4 rows: 43% using DS1 trace with 4,000,000 capacity
  • 1 row: 41% using DS1 trace with 4,000,000 capacity
  • 4 rows: 59% using OLTP trace with 10,000 capacity
  • 1 row: 59% using OLTP trace with 10,000 capacity

It appears multiple rows are benefitting the hit ratios now; I’ll make a PR.

FrankReh commented :

Okay, thanks for the clarification. I wasn’t sure this level of code needed protecting from a user, but it makes sense if you are opening up the API. The addition seemed to lose a carry bit once in a while, but I realize that is noise compared to what’s going on with the TinyLFU anyway.

I don’t have the tests in front of me right now, but I thought I saw an order-of-magnitude improvement when going from one row to four. Could something be wrong with that hit ratio testing, or maybe it’s not testing what I thought?

I had tried to create two levels of test criteria: using a PRNG, how many random numbers could be inserted before the next one was falsely reported as already being in the cm-sketch, and beyond that, how many could be inserted before 0.1% of new insertions were being falsely reported as present. Not sure those are useful measurements, but I thought they showed the true value of the cm-sketch, and the numbers shot up as one would expect when going from one row to two, and then from two to four.

Anyway, I’m happy to wait to see what gets pushed.

thomasmueller commented :

If we’re allowing people to provide their own KeyToHash I’d be more comfortable if there was an extra mixing function with low overhead (doesn’t necessarily need to be a good hash function, just good enough at mixing a hash somewhat randomly).

My main concern with Fibonacci hashing was that you did it inside the loop. I share the concern that if you allow people to provide their own KeyToHash (so their own hash function), it is problematic. Maybe you want to do at least some amount of mixing within the Increment method. For example, Java HashMap uses this: https://www.connect2java.com/tutorials/collections/how-hashmap-works-internally-in-java/

By the way, if you want to speed this up some more, maybe this part could be made faster by using branchless code:

if (*block>>shift)&0x0f < 15 {
    *block += 1 << shift
}

Reason: wherever you have a branch, the CPU will try to predict the outcome. Here it’s probably quite hard to predict, so you will have many branch misses. There are ways around that, using branchless operations: https://stackoverflow.com/questions/25615440/clamped-increment-integer

For this case, I assume it would be something like this:

// get the old value
x = (*block >> shift) & 0xf
// 0xf becomes 0; anything smaller wraps around (unsigned), setting the top bit
x = x - 0xf
// logical right shift: top bit set -> 1 (room to count), saturated -> 0
x = x >> 31
// now x is 1 only if we need to add 1, and 0 otherwise
*block += x << shift

karlmcguire commented :

@FrankReh I will try to recreate the testing methods you used, although I’m pretty confident because I would never expect the (overall cache) hit ratio to change by orders of magnitude.

@thomasmueller I still need to think about mixing, if anything we’ll probably go with the per-row seeds that @ben-manes is using.

As for the branchless operations: interestingly, the code proposed by you/StackOverflow is substantially slower (52.3M vs 94.6M ops/sec) than the “branching” method. Here are the two functions I benchmarked:

Branching: 94.6M ops/sec

func (s *Sketch) Increment(hash uint64) {
    a, b := hash<<32, hash
    for i := range s.blocks {
        block := &s.blocks[i][a>>s.offset]
        if shift := (a & 1) * 4; (*block>>shift)&0x0f < 15 {
            *block += 1 << shift
        }
        a += b
    }
}

Branchless: 52.3M ops/sec

func (s *Sketch) Increment(hash uint64) {
    a, b := hash<<32, hash
    for i := range s.blocks {
        block := &s.blocks[i][a>>s.offset]
        shift := (a & 1) * 4
        n := (*block >> shift) & 0x0f
        n = n - 0xf
        n = n >> 7
        *block += n << shift
        a += b
    }
}

Benchmark

func BenchmarkIncrement(b *testing.B) {
    s := NewSketch(1e5)
    h := fnv.New64a()
    hashes := make([]uint64, 1e6)
    for i := range hashes {
        h.Write([]byte(fmt.Sprintf("%d", i)))
        hashes[i] = h.Sum64()
    }
    b.SetBytes(1)
    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        // note: 1e6 is not a power of two, so index with % rather than a mask
        s.Increment(hashes[n%len(hashes)])
    }
}

I’m not super knowledgeable about Go’s compiler, but I assume it’s doing something weird and that’s what’s causing this. I do find it odd…

ben-manes commented :

It was also slower in Java, but the branchless version had such a large error bound that I think the difference may be negligible. The JIT is pretty good, and I didn’t look at the assembly to determine how it might have optimized things.

branch: 160,374,302.050 ± 20,625,286.268  ops/s
branch-free: 102,463,137.106 ± 67,214,204.145  ops/s

As much as I like the idea of this optimization, the sketch hasn’t been a bottleneck in my experience, since it’s operated on asynchronously to the user-facing calls (during cache maintenance rather than on the read/write paths).

andersfylling commented :

What version of Go are you guys using in these tests, 1.12.0?