Hey @jchiu,
Here’s a working solution to our indexing problem.
To build the index, we derive deltas from the current version of the data.
So, when we get:
Current Val: uid -name-> A
Update: uid -name-> B
We’ll create two updates to the index:
A -DEL-> uid
B -SET-> uid
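The delta derivation above could be sketched roughly like this in Go. The names here (`IndexUpdate`, `deriveDeltas`, the `Op` constants) are made up for illustration and are not taken from the dgraph codebase; the only thing the sketch assumes is that we can read the current value before applying the update.

```go
package main

// Op is the kind of change applied to an index entry (hypothetical type).
type Op int

const (
	Del Op = iota // remove uid from the token's index entry
	Set           // add uid to the token's index entry
)

// IndexUpdate is one instruction for the index: apply Op to Token -> UID.
type IndexUpdate struct {
	Token string
	Op    Op
	UID   uint64
}

// deriveDeltas compares the stored value with the incoming one and returns
// the index updates needed to move uid from the old token to the new one.
// hasOld is false when uid had no previous value for the predicate.
func deriveDeltas(uid uint64, oldVal string, hasOld bool, newVal string) []IndexUpdate {
	if hasOld && oldVal == newVal {
		// The data does not change, so no delta is generated.
		return nil
	}
	var updates []IndexUpdate
	if hasOld {
		updates = append(updates, IndexUpdate{Token: oldVal, Op: Del, UID: uid})
	}
	updates = append(updates, IndexUpdate{Token: newVal, Op: Set, UID: uid})
	return updates
}
```

Calling `deriveDeltas(uid, "A", true, "B")` yields the DEL on A followed by the SET on B, while calling it with the same old and new value yields nothing.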
If there are no crashes, this works great. But if we crash partway through, the original posting list (PL) might have been updated to B while the index was not.
The data left would be something like this:
Data: uid -name-> B
Index: A → uid
Problem
On a mutation replay (uid -name-> B), nothing would happen to Data, because it already holds B. So no delta would be generated, and the index would stay out of sync.
Solution
Irrespective of the existing value of the data, we always send a SET instruction to the index. Hence, doing:
B -SET-> uid
This would leave the index with both:
A → uid, and
B → uid
Periodically, we run a thread which walks over all the indices and verifies that each uid an index entry points to still has that index token. So, we encounter A → uid, and check whether (uid, name) has A. It won’t, since it now holds B. In that case, we delete the uid from A.
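Here is a minimal sketch of the always-SET write plus the periodic cleanup pass, using a toy in-memory store. `Store`, `Apply` and `Sweep` are hypothetical names for illustration; a real posting list and index would live on disk, and this version ignores locking entirely.

```go
package main

// Store is a toy in-memory stand-in for the data and one index.
// data maps uid -> current value of the predicate;
// index maps token -> set of uids.
type Store struct {
	data  map[uint64]string
	index map[string]map[uint64]bool
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{
		data:  map[uint64]string{},
		index: map[string]map[uint64]bool{},
	}
}

// Apply writes the value and always sends a SET to the index,
// irrespective of the existing value. It never deletes the old token.
func (s *Store) Apply(uid uint64, val string) {
	s.data[uid] = val
	if s.index[val] == nil {
		s.index[val] = map[uint64]bool{}
	}
	s.index[val][uid] = true
}

// Sweep walks every index entry and removes uids whose current value
// no longer matches the token. It returns the number of entries removed.
func (s *Store) Sweep() int {
	removed := 0
	for token, uids := range s.index {
		for uid := range uids {
			if v, ok := s.data[uid]; !ok || v != token {
				delete(uids, uid) // stale entry: the data no longer has this token
				removed++
			}
		}
		if len(uids) == 0 {
			delete(s.index, token)
		}
	}
	return removed
}
```

Starting from the crashed state (data holds B, index holds only A), replaying `Apply(uid, "B")` adds B to the index, and the next `Sweep` drops the stale A entry.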
Race condition
There might be a rare race condition here. Say the periodic thread has read (uid, name), found that it does not hold B, and is about to run the B -DEL-> uid instruction. Before it can, another data update comes in, which does uid -name-> B and goes on to set B → uid. The thread’s delete would then remove an index entry that is now valid. However, this is really rare, because we acquire mutex locks over the posting list data, and therefore the uid -name-> B instruction can only happen after we’ve already read (uid, name). Otherwise, we’d have read B in (uid, name).
So, after our read and before our write to the index, this new update would have to read the data, update the data, generate the delta, and apply both the DEL and SET to the index, all before our delete reaches the index. That would be very rare.
Potential solution to rare race condition
We could acquire one lock over the original data and all the indices before doing any reads or writes. This would solve the above rare race condition, because any new update would have to wait until the periodic thread finishes its operation. Alternatively, the update would have already happened, and then the periodic thread would see A as the extra token, and not B.
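A rough sketch of the single-lock idea, again on a toy in-memory store. One `sync.Mutex` covers both the data and the index, so an update and a sweep can never interleave. All names (`LockedStore`, `runConcurrently`, etc.) are hypothetical, and one coarse lock is a simplification; it says nothing about how the locking would be laid out in dgraph itself.

```go
package main

import "sync"

// LockedStore guards the data and all index entries with one mutex.
type LockedStore struct {
	mu    sync.Mutex
	data  map[uint64]string
	index map[string]map[uint64]bool
}

// NewLockedStore returns an empty store.
func NewLockedStore() *LockedStore {
	return &LockedStore{
		data:  map[uint64]string{},
		index: map[string]map[uint64]bool{},
	}
}

// Apply updates the data and sets the index entry under the same lock.
func (s *LockedStore) Apply(uid uint64, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[uid] = val
	if s.index[val] == nil {
		s.index[val] = map[uint64]bool{}
	}
	s.index[val][uid] = true
}

// Sweep removes stale index entries. Since it holds the lock for the whole
// read-check-delete sequence, no update can slip in between.
func (s *LockedStore) Sweep() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for token, uids := range s.index {
		for uid := range uids {
			if v, ok := s.data[uid]; !ok || v != token {
				delete(uids, uid)
			}
		}
		if len(uids) == 0 {
			delete(s.index, token)
		}
	}
}

// Value returns the current value stored for uid.
func (s *LockedStore) Value(uid uint64) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[uid]
}

// Tokens returns the index tokens that currently point at uid.
func (s *LockedStore) Tokens(uid uint64) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []string
	for token, uids := range s.index {
		if uids[uid] {
			out = append(out, token)
		}
	}
	return out
}

// runConcurrently applies vals to uid from separate goroutines while a
// sweeper runs alongside, then does one final sweep once all have finished.
func runConcurrently(s *LockedStore, uid uint64, vals []string) {
	var wg sync.WaitGroup
	for _, v := range vals {
		wg.Add(1)
		go func(v string) {
			defer wg.Done()
			s.Apply(uid, v)
		}(v)
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.Sweep()
	}()
	wg.Wait()
	s.Sweep()
}
```

Whichever update wins, after `runConcurrently` returns the index should hold exactly one token for the uid, and it should match `Value(uid)`.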
Generating new indices
We still need a way to generate new indices once we start having dynamic schemas, or if we introduce a new index to an existing dgraph instance. That is a separate use case though, and we need a way to detect that a new index has been added and needs to be generated.
Alternatively, the user may give us a regenerate instruction for a particular predicate, in which case we should do it.
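For the regenerate instruction, one simple approach would be a full scan of the data for that predicate, rebuilding the index from scratch. The sketch below assumes the data can be iterated as (uid, predicate) → value pairs; `Key` and `RegenerateIndex` are hypothetical names, not an existing dgraph API.

```go
package main

// Key identifies one value in the data: the (uid, predicate) pair.
type Key struct {
	UID  uint64
	Pred string
}

// RegenerateIndex rebuilds the index for a single predicate by scanning
// all the data. It returns a fresh map of token -> set of uids and leaves
// values of other predicates untouched.
func RegenerateIndex(data map[Key]string, pred string) map[string]map[uint64]bool {
	index := map[string]map[uint64]bool{}
	for k, val := range data {
		if k.Pred != pred {
			continue // not the predicate we were asked to regenerate
		}
		if index[val] == nil {
			index[val] = map[uint64]bool{}
		}
		index[val][k.UID] = true
	}
	return index
}
```

Because the index is rebuilt purely from the data, this would also clear out any stale tokens left behind by a crash, at the cost of scanning everything for that predicate.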