According to the formula shown in the WIKI:https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions, number of hash functions k is
, and m is bitsPerKey
so the calculation of bitsPerKey in badger is not right,
func BloomBitsPerKey(numEntries int, fp float64) int {
size := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)
locs := math.Ceil(float64(0.69314718056) * size / float64(numEntries))
return int(locs)
}
func appendFilter(buf []byte, keys []uint32, bitsPerKey int) []byte {
if bitsPerKey < 0 {
bitsPerKey = 0
}
// 0.69 is approximately ln(2).
k := uint32(float64(bitsPerKey) * 0.69)
if k < 1 {
k = 1
}
if k > 30 {
k = 30
}
nBits := len(keys) * int(bitsPerKey)
// For small len(keys), we can see a very high false positive rate. Fix it
// by enforcing a minimum bloom filter length.
if nBits < 64 {
nBits = 64
}
......
In BloomBitsPerKey(numEntries int, fp float64),
func BloomBitsPerKey(numEntries int, fp float64) int {
size := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)
locs := math.Ceil(float64(0.69314718056) * size / float64(numEntries))
return int(locs)
}
The locs
is ln2 * m / n, but in func appendFilter(), calculate the number of hash functions,ln2 was calculated again
It should be
func BloomBitsPerKey(numEntries int, fp float64) int {
size := -1 * float64(numEntries) * math.Log(fp) / math.Pow(float64(0.69314718056), 2)
locs := math.Ceil(size / float64(numEntries))
return int(locs)
}
Here is the pr: https://github.com/dgraph-io/badger/pull/1763