Sorry, should have done this earlier.
This pertains to indexing predicates specified by the user.
We use Bleve indices: http://www.blevesearch.com/
Indexing will run in two modes: backfill and frontfill.
Backfill
For backfill, the user will give us a posting store (which could, in future, be a snapshot with a timestamp). Then, for each predicate that the user wants indexed, we will seek to the right place in the posting store and add its postings to the indices. We also shard the indices by predicate, but for now the user has to decide the number of shards at the outset. Note: we do not deal with gotomic hashes / dirty mutations; we only consume from RocksDB.
One may ask: instead of seeking to each predicate, why not just read every row of RocksDB and add it to the index? That code could be reused for frontfill as well. However, we think it will be quite common for users to add an index for a new predicate over time, and we would like that operation to be efficient. Iterating over the whole table each time seems too time-consuming.
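To make this concrete, here is a minimal sketch of the per-predicate backfill pass. It assumes keys in RocksDB are laid out as "predicate|entity" so that all postings for a predicate are contiguous; the key layout, the value decoding, and the shard-picking scheme are my assumptions for illustration, not the actual dgraph format.

```go
package index

import (
	"hash/fnv"

	"github.com/blevesearch/bleve"
	"github.com/tecbot/gorocksdb"
)

// shardFor picks one of numShards Bleve indices for an entity; the shard
// count must stay fixed once chosen, hence deciding it at the outset.
func shardFor(entity []byte, numShards int) int {
	h := fnv.New32a()
	h.Write(entity)
	return int(h.Sum32()) % numShards
}

// backfillPredicate seeks straight to the first key of pred and walks only
// that predicate's range, instead of scanning the whole table.
func backfillPredicate(db *gorocksdb.DB, shards []bleve.Index, pred string) error {
	ro := gorocksdb.NewDefaultReadOptions()
	defer ro.Destroy()

	prefix := []byte(pred + "|")
	it := db.NewIterator(ro)
	defer it.Close()

	for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
		k, v := it.Key(), it.Value()
		entity := string(k.Data()[len(prefix):])
		shard := shards[shardFor([]byte(entity), len(shards))]
		// Assumed: the posting value is usable as a plain text field.
		err := shard.Index(entity, map[string]interface{}{"value": string(v.Data())})
		k.Free()
		v.Free()
		if err != nil {
			return err
		}
	}
	return it.Err()
}
```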
Frontfill
For frontfill, the design is not fixed. Ideally, you want a log (maybe not the commit logs) that keeps the mutation changes so that we can replay them. This would be useful if we want to add an index while keeping dgraph running: make a snapshot, remember its timestamp, copy the snapshot, and replay mutations from the logs from that timestamp onwards.
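If we did go this route, the flow might look like the sketch below, reusing the backfill helpers above. Mutation, MutationLog, and ReadFrom are hypothetical placeholders, since no such replayable log exists yet.

```go
// Mutation is a hypothetical stand-in for one logged edge change.
type Mutation struct {
	Predicate string
	Entity    string
	Value     string
	Del       bool // true for a delete, false for a set
}

// MutationLog is a hypothetical stand-in for a replayable mutation log.
type MutationLog interface {
	// ReadFrom yields every mutation applied at or after timestamp ts.
	ReadFrom(ts int64) <-chan Mutation
}

// buildIndexLive backfills pred from a snapshot copy of the posting store
// taken at ts, then replays the log from ts onwards, so the index catches
// up without stopping dgraph.
func buildIndexLive(snapshotDB *gorocksdb.DB, log MutationLog, ts int64,
	pred string, shards []bleve.Index) error {
	if err := backfillPredicate(snapshotDB, shards, pred); err != nil {
		return err
	}
	for m := range log.ReadFrom(ts) {
		if m.Predicate != pred {
			continue
		}
		shard := shards[shardFor([]byte(m.Entity), len(shards))]
		if m.Del {
			if err := shard.Delete(m.Entity); err != nil {
				return err
			}
			continue
		}
		if err := shard.Index(m.Entity, map[string]interface{}{"value": m.Value}); err != nil {
			return err
		}
	}
	return nil
}
```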
However, as warned, the commit logs will be under heavy construction due to the RAFT work, so it is best not to meddle with them for now.
Hence, I propose a very simple frontfill: as mutations come in, we update the indices directly, without trying to read from any logs. This would not support replay, but it handles incoming mutations for now.
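Concretely, the simple version could be as small as a hook like this in the mutation path, reusing the Mutation type and shard helpers from the sketches above; the hook point itself is hypothetical.

```go
// A minimal sketch of the simple frontfill: as each mutation is applied,
// update the matching Bleve shard in place. No log, no replay.
func onMutation(indexed map[string][]bleve.Index, m Mutation) error {
	shards, ok := indexed[m.Predicate]
	if !ok {
		return nil // predicate is not indexed; nothing to do
	}
	shard := shards[shardFor([]byte(m.Entity), len(shards))]
	if m.Del {
		return shard.Delete(m.Entity)
	}
	return shard.Index(m.Entity, map[string]interface{}{"value": m.Value})
}
```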
Wonder what you all think, @core-devs? Thanks.