Some possibilities for posting list count

Method 1

Suppose most of the time we only need to get the count of the whole UID list, i.e., afterUID=0.

In this method, we will maintain one additional int to store the size of the posting list. We will update this var as we insert and delete elements. We just have to be a bit more careful. When asked for the count of the whole UID list, we return this var. Otherwise, we have to iterate as before.

Method 2

Keep one list for deleted elements and one list for inserted elements. They do not intersect with each other. When you delete an element, you will check the list for inserted elements and remove an element if it is found. Same goes for inserting an element: you may need to remove an element from the list for deleted elements.

When asked for the size of the array with some afterUID, we will do binary search on all three lists: the two lists above and the posting list. Return A+B-C where A, B, C are respectively number of elements after UID in posting list, list for inserted elements, list for deleted elements.

The downside is that we need to do more work when removing elements from the lists of inserted and deleted elements.

Method 3

Extend Method 1. Instead of maintaining only one var for the total size, we will maintain sizes of ranges. We will do this in a multiscale way so that each insert/delete will update the var only in O(log n) time.

Method 4

Like Method 3, but do it in a more straightforward way, not in a multiscale way. This will somewhat resembles the very original method and feel as if you are maintaining some mindex. Basically, when you insert or delete, you might need to update up to O(n) counts.

My preference

I am slightly leaning towards Method 2, then Method 1. I love to hear your opinions @core-devs

1 Like

Method 2 in general, sounds good, but there are edge cases that need to be handled.

Replace: A UID might already be present, but gets replaced by another. Most common scenario is a Value posting, but there could be others as well.
Del then Set
Set then Del
Del of non-existent uid
Del of different posting (with the same uid) → no change

Think through these – it seems like you’ll need 3 different mutation arrays, 2 as you proposed, and then 1 for replaces. And while merging, we could use a heap of size 4 to do merging of these 4 arrays (3 mutations, and one PL).

Alternatively, you could extend the existing impl, do a binary search, but then just iterate over the mutation layer. Note that mutation layer is not going to be very big, because we’ll be committing it to RocksDB pretty fast - so we can trade for simplicity here, at the cost of theoretically worse time complexity - but in practice, this would still give you the binary search based count that you intend to do with method 2.

My recommendation would be to extend the existing impl.

I do like your recommendation. Since we are foresaking skip lists, balanced trees because they are not good in practice for our setup, method 2 won’t achieve good theoretical properties either.

Ok, I will try to extend the existing impl. All we need to do is to maintain accurate count information in the mutation layer, so that we do not have to merge it with the posting list.

With iteration over mutation layer, you wouldn’t need to, right? You do a binary search, then just start iterating from that point onwards.

I was thinking of doing binary search when we insert/delete elements but yeah, the additional code complexity and memory is probably not worth it.

We do that already. Nvm, once you start coding, you’ll figure out the best way to do this.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.