Iterators with bloom filter


(Kevin Yang) #1

Hey, Badger team:

I have a use case where I need to scan the whole badger database, stop at ~10K guessed locations. If the guessed location actually exists, stop there and read some stuff.

I can have 2 possible solutions:

  1. txn.NewIterator, and Seek to any of the guess locations, verify it’s really the key I want.
  2. txn.NewKeyIterator, and supply the guessed location.

Solution 2 seems better as (according to documentation), badger can take advantage of internal bloom filter and make quick decisions. However, there is currently no way to reuse an iterator (created by NewKeyIterator), as it’s not possible to change the Prefix and the opt.prefixIsKey on-the-fly.

So my question:

  1. how much performance difference is solution 1 and 2?
  2. can a regular iterator (created by txn.NewIterator) take advantage of bloom filter somehow? I know internally the prefix cannot include the timestamp (for bloom filter to work), and there is no good way to enforce that.
  3. is there a way to re-use an iterator (created by txn.NewKeyIterator), with different KeyPrefix? that would be ideal.

Thanks.


(Ashish) #2

@yangzh Please find the answers below:

  1. As such we don’t have any numbers. But solution 2 should be more performant as it significantly reduces number of SSTables to be searched for key. Solution 1 searches in all the tables.
  2. KeyIterator is nothing but a regular iterator with Prefix and PrefixIsKey fields set in iterator options. If both of the options are set then regular iterator will behave as keyIterator.
  3. No. Iterator in general has list of SSTables to be searched for any key/prefix. If Prefix and PrefixIsKey fields changes then tables to be searched might be different.

(Kevin Yang) #3

@ashishgoswami, thanks for the response.

I understand that a different Prefix implies a different list of sstables to search for, but I guess iter.Rewind() can do some initialization work? or the initialization is done in the NewIterator (which is less flexible, but I understand and can relate)?

Alternatively, are you guys open to the idea of having another iterator type, that optimized for my use case (or maybe use cases for others): frequent stop, and do something only when the key actually exists.

In the worse case, I can stick with existing NewKeyIterator solution, as there is conceptually no way around having to preparing for different set of sstables when Prefix changes. However, I don’t like the fact that iterators can not be re-used, and we need Go GC to create/destroy many many iterators, but I guess it might not be a big deal after all.

Thanks!


(Kevin Yang) #4

Thought about this for a while, I guess another good solution would be to always use bloom filter (do you really have to try all tables) during Seek, for regular iterators (created by NewIterator): it makes very little code change, fit well with existing design / code, and no performance degradation (only potential gains).

Is that possible?


(Ashish) #5

Regular Iterator has all the tables in its list. Rewind() on it, will work same as reinitialisation of it. You can change its prefix anytime and rewind it. It will point to the first key which is >= its prefix.
KeyIterator is an optimisation over regular iterator to narrow down the search space, when you are certain Prefix is also the key.

Normally Iterators are used in getting range of keys. When you want to search exact key, Get() is the best choice, which does exactly what you have described.


(Kevin Yang) #6

@ashishgoswami Thanks, didn’t pay attention to Get(), (I assume), each Get() will consult bloom filter to locate the exacts sstable, if that’s the case, yes, it solves exactly my problem.

Thanks!