Recall we have a UID matrix: a list of posting lists, with A_i denoting the i-th posting list. We want to sort all of them by some attribute, say dob. Let B be the merged list of unique UIDs. Let N be the total number of elements in the UID matrix, let |A_i| be the size of the i-th posting list, and let |B| be the size of B, i.e., the number of unique values. Note N = sum_i |A_i|. Let |A| be the number of posting lists.
There are two extreme scenarios to keep in mind. The first is that the posting lists overlap completely, so |B| = N/|A|. The second is that the posting lists do not overlap at all, so |B| = N.
There are many ways we can sort. First, don’t worry about pagination.
Method 1: Worker sorts
B and returns permutation. Build posting lists using permutation.
Here we do only one network call instead of |A| network calls, where |A| is the number of posting lists. The complexity lies in how we use the ordering of B to order each posting list A_i.
Here is one way to do it. As we merge the posting lists to form B, keep enough information to be able to go from B back to each A_i. One way is, for each element of B, to store which posting lists this element comes from. For example, if A_0=(1,2,3) and A_1=(2,3,4), then B=(1,2,3,4), and we will have a list of lists (or something equivalent and more memory-efficient) that goes like Q:=((0), (0,1), (0,1), (1)). The first element of Q is just (0) because B_0=1 is contained in posting list A_0 only. On the other hand, the second element of Q is (0,1) because B_1=2 is contained in both A_0 and A_1.

Now send B to the worker to be sorted. Say the final order sorted by dob goes like (4,3,2,1). Then the worker will return the permutation (3,2,1,0), meaning B_3 is the first element in the sorted B. Now, do a loop over (3,2,1,0). First, we see a 3. We look up Q and see that B_3=4 is contained only in A_1. Hence, we append B_3 to C_1, where C is like A (the UID matrix) but will contain the posting lists sorted by dob. Continue scanning (3,2,1,0). The next element we see is 2. Look up Q and we see that B_2=3 is contained in both A_0 and A_1. Hence, we append B_2 to both C_0 and C_1, so that C_1=(B_3,B_2)=(4,3). We can continue this process and obtain all of C.
Total running time is
O(|B| log |B| + N). The first term is due to sorting of
B in worker. The second term is due to “coordinator” sorting each posting list using the returned permutation.
One downside is that you need to allocate extra memory to store Q, and garbage collection is not cheap.
Method 2: Like Method 1 but do binary search instead of keeping Q.
You could try this. Say we have
B sorted by
dob, which is
(4,3,2,1). For each element, we run through each posting list and do a binary search to see whether it is inside. If it is, append it to the corresponding C_i. This is actually quite expensive: the amount of work needed to sort all the posting lists (after getting the sorted B) is O(|A| N), ignoring log factors for the binary search. This looks bad. Maybe you all have some other ways of using binary search?
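For concreteness, the binary-search variant might look like the sketch below (Python, invented names). Every element of sorted B probes every posting list, which is where the roughly O(|A| N) cost comes from:

```python
from bisect import bisect_left

def method2(A, B_sorted):
    """For each element of sorted B, binary-search every posting list."""
    C = [[] for _ in A]
    for uid in B_sorted:
        for i, a in enumerate(A):        # |A| probes per element of B
            pos = bisect_left(a, uid)    # posting lists are sorted by UID
            if pos < len(a) and a[pos] == uid:
                C[i].append(uid)
    return C

A = [[1, 2, 3], [2, 3, 4]]
C = method2(A, [4, 3, 2, 1])  # B sorted by dob, as in the text
# C == [[3, 2, 1], [4, 3, 2]]
```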
Method 3: Worker sorts
B, then coordinator sorts each posting list by their position in sorted B
Each posting list will need to store its elements' positions in B. In the above example, A_0=(1,2,3) will store (0,1,2) and A_1=(2,3,4) will store (1,2,3). Then the worker returns the position of each B_i in the sorted list. In this case, it will return (3,2,1,0) like before, because B_0=1 is at position 3 of the sorted list, B_1=2 is at position 2 of the sorted list, and so on. Now, sort each posting list by this mapping.
There is actually no memory overhead here because after this, we have the ordering of each posting list. The disadvantage is that the coordinator still needs to do sorting per posting list, which removes the advantage of sorting
B in the case of a lot of overlap between posting lists. Running time is
O(|B| log |B| + N log (N/|A|)), roughly.
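Here is a sketch of Method 3 in Python (helper names are mine): each posting list keeps its elements' positions in B, the worker returns each B_j's rank in the sorted order, and the coordinator sorts every posting list by those ranks:

```python
def method3(A, attr):
    # Merge into unique sorted B, and record each element's position in B.
    B = sorted(set(uid for a in A for uid in a))
    pos_in_B = {uid: j for j, uid in enumerate(B)}
    positions = [[pos_in_B[uid] for uid in a] for a in A]  # per-list positions

    # Worker: rank[j] = position of B_j in the list sorted by the attribute.
    order = sorted(range(len(B)), key=lambda j: attr[B[j]])
    rank = [0] * len(B)
    for r, j in enumerate(order):
        rank[j] = r

    # Coordinator: sort each posting list by the ranks of its elements.
    C = []
    for a, ps in zip(A, positions):
        C.append([uid for _, uid in sorted(zip((rank[p] for p in ps), a))])
    return C, rank

A = [[1, 2, 3], [2, 3, 4]]
dob = {1: 40, 2: 30, 3: 20, 4: 10}  # hypothetical: dob order is (4,3,2,1)
C, rank = method3(A, dob)
# rank == [3, 2, 1, 0], the mapping described in the text
# C == [[3, 2, 1], [4, 3, 2]]
```

The per-list sorts in the coordinator are the N log(N/|A|) term.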
Method 4: Worker sorts each posting list (simplest)
This is the simplest method. Running time is
O(N log (N/|A|)) roughly. The downside is that we need to make
|A| network calls instead of 1, and time spent doing actual sorting can be worse by a factor of
|A| in the case of max overlap. And we can no longer tell people that Dgraph is more efficient because we make one network call per predicate… However, in terms of asymptotic behavior, it is actually not that bad, as method 1 still has that
O(N) term to build sorted posting lists in the coordinator.
Say we want just the top few entries of each posting list. If we use Method 4, the worker can trim the results and return early: it can iterate over RocksDB until it gets enough results per posting list, and return.
However, for the other methods that sort
B, this is a lot harder. You can't really do any trimming. For example, suppose A_0=(0,1,...,10) and A_1=(11), and we ask for just one result per posting list. Then B=(0,1,2,3,4,5,6,7,8,9,10,11). Say that when sorted by dob, the order doesn't change. If we want the top result for A_1, we still have to return the full sorted B, because A_1's only element sits at the very end of it. If the worker sorts each posting list instead, it could have just returned (0) for sorting A_0 and (11) for sorting A_1.
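To make the trimming concrete, here is a sketch of Method 4 with a per-list top-k cutoff (Python; in the real system the worker would iterate RocksDB instead of an in-memory list, so this is only illustrative):

```python
import heapq

def method4_topk(A, attr, k=None):
    """Worker sorts each posting list by the attribute; with k set, it trims early."""
    C = []
    for a in A:
        if k is None:
            C.append(sorted(a, key=lambda uid: attr[uid]))
        else:
            # Return only the top k per list; no global B is ever built.
            C.append(heapq.nsmallest(k, a, key=lambda uid: attr[uid]))
    return C

A = [list(range(11)), [11]]            # A_0=(0..10), A_1=(11), as in the text
dob = {uid: uid for uid in range(12)}  # hypothetical: dob order == UID order
C = method4_topk(A, dob, k=1)
# C == [[0], [11]] -- exactly the trimmed results the text describes
```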
Am I missing something basic here? Do you all have some other approaches in mind? I am leaning towards Method 4. It can be worse by a log factor, which I thought is palatable given that pagination can sometimes save a lot of work.
What do you all think? @core-devs
(In another post, I can write about whether we want to keep keys of index in memory.)