Badger: Optimize bloomfilter parameters

According to the formula shown in the WIKI:https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions, number of hash functions k is image
, and m is bitsPerKey

so the calculation of bitsPerKey in badger is not right,

func BloomBitsPerKey(numEntries int, fp float64) int {
	size := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)
	locs := math.Ceil(float64(0.69314718056) * size / float64(numEntries))
	return int(locs)
}

func appendFilter(buf []byte, keys []uint32, bitsPerKey int) []byte {
	if bitsPerKey < 0 {
		bitsPerKey = 0
	}
	// 0.69 is approximately ln(2).
	k := uint32(float64(bitsPerKey) * 0.69)
	if k < 1 {
		k = 1
	}
	if k > 30 {
		k = 30
	}

	nBits := len(keys) * int(bitsPerKey)
	// For small len(keys), we can see a very high false positive rate. Fix it
	// by enforcing a minimum bloom filter length.
	if nBits < 64 {
		nBits = 64
	}
	......

In BloomBitsPerKey(numEntries int, fp float64),

func BloomBitsPerKey(numEntries int, fp float64) int {
	size := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)
	locs := math.Ceil(float64(0.69314718056) * size / float64(numEntries))
	return int(locs)
}

The locs is ln2 * m / n, but in func appendFilter(), calculate the number of hash functions,ln2 was calculated again

It should be

func BloomBitsPerKey(numEntries int, fp float64) int {
	size := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)
	locs := math.Ceil(size / float64(numEntries))
	return int(locs)
}

Here is the pr: https://github.com/dgraph-io/badger/pull/1763

1 Like