Recall we have a UID matrix `A`. Denote `A_i` as the `i`-th posting list. We want to sort all of them by some attribute, say `dob`. Let `B` be the merged list. Let `N` be the total number of elements in the UID matrix. Let `|A_i|` be the size of the `i`-th posting list. Let `|B|` be the size of `B`, i.e., the number of unique values. Note `N = sum_i |A_i|`. Let `|A|` be the number of posting lists.

There are two extreme scenarios to keep in mind. In the first, the posting lists overlap completely, and `|B| = N/|A|`. In the second, the posting lists do not overlap at all, and `|B| = N`.

There are many ways we can sort. For now, don't worry about pagination.

# Method 1: Worker sorts `B` and returns a permutation; build posting lists using the permutation

Here we do only one network call instead of `|A|` network calls, where `|A|` is the number of posting lists. The complexity lies in how we use the ordering of `B` to order each posting list `A_i`.

Here is one way to do it. As we merge the posting lists to form `B`, keep enough information to be able to go from `B` back to each `A_i`. One way is, for each element of `B`, to store which posting lists that element comes from. For example, say `A_0 = (1, 2, 3)`, `A_1 = (2, 3, 4)`, and `B = (1, 2, 3, 4)`. Then we will have a list of lists (or something equivalent and more memory-efficient) that goes like `Q := ((0), (0,1), (0,1), (1))`. The first element of `Q` is just `(0)` because `B_0 = 1` is contained in posting list `A_0` only. On the other hand, the second element of `Q` is `(0,1)` because `B_1 = 2` is contained in both `A_0` and `A_1`.

Now, send `B` to be sorted. Say the final `B` sorted by `dob` goes like `(4, 3, 2, 1)`. Then the worker will return `(3, 2, 1, 0)`, as `B_3` is the first element in the sorted `B`. Now, do a loop over `(3, 2, 1, 0)`. First, we see a `3`. We look up `Q` and see that `B_3` is contained only in `A_1`. Hence, we append `B_3 = 4` to `C_1`, where `C` is like `A` (a UID matrix) but will contain the posting lists sorted by `dob`. Continue scanning `(3, 2, 1, 0)`. The next element we see is `2`. Look up `Q` and we see that `B_2` is contained in both `A_0` and `A_1`. Hence, we append `B_2` to both `C_0` and `C_1`. Now, `C_0 = (B_2) = (3)` while `C_1 = (B_3, B_2) = (4, 3)`. We can continue this process and obtain `C_0 = (3, 2, 1)` and `C_1 = (4, 3, 2)`.

Total running time is `O(|B| log |B| + N)`. The first term is due to the sorting of `B` in the worker. The second term is due to the coordinator building each sorted posting list using the returned permutation.

One downside is that you need to allocate more memory to store `Q`, and garbage collection is not cheap.

# Method 2: Like Method 1 but do binary search instead of keeping `Q`

You could try this. Say we have `B` sorted by `dob`, which is `(4, 3, 2, 1)`. For each element, we can run through each posting list and do a binary search to see if the element is inside; if it is, append it to `C_i`. This is actually quite expensive: the amount of work needed to sort each posting list (after getting the sorted `B`) is `O(|A| N)`, ignoring log factors for the binary search. This looks bad. Maybe you all have some other ways of using binary search?

# Method 3: Worker sorts `B`, then coordinator sorts each posting list by its position in sorted `B`

Each posting list will need to store its elements' positions in `B`. In the above example, `A_0 = (1, 2, 3)` will store `(0, 1, 2)` while `A_1` will store `(1, 2, 3)`. Then the worker returns the position of each `B_i` in the sorted list. In this case, it will return `(3, 2, 1, 0)` like before, because `B_0 = 1` is at position 3 of the sorted list `(4, 3, 2, 1)`, and `B_1 = 2` is at position 2. Now, sort each posting list by this mapping.

There is actually no memory overhead here, because after this we have the ordering of each posting list. The disadvantage is that the coordinator still needs to do a sort per posting list, which removes the advantage of sorting `B` once in the case of heavy overlap between posting lists. Running time is `O(|B| log |B| + N log(N/|A|))`, roughly.

# Method 4: Worker sorts each posting list (simplest)

This is the simplest method. Running time is `O(N log(N/|A|))`, roughly. The downside is that we need to make `|A|` network calls instead of 1, and the time spent doing the actual sorting can be worse by a factor of `|A|` in the case of maximum overlap. And we can no longer tell people that Dgraph is more efficient because we make one network call per predicate… However, in terms of asymptotic behavior, it is actually not that bad, as Method 1 still has that `O(N)` term to build the sorted posting lists in the coordinator.

# Pagination

Say we want just the top few entries of each posting list. If we use Method 4, the worker can trim the results and return early: it can iterate over RocksDB until it gets enough results per posting list, then return.

However, for the other methods, which sort `B`, this is a lot harder. You can't really do any trimming. For example, say `A_0 = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)` and `A_1 = (11)`, and we ask for just one result. `B = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)`. Say that when sorted by `dob`, the order doesn't change. If we want the top result for `A_1`, we still have to return the full sorted `B`. If the worker sorts each posting list instead, it could have just returned `(0)` for `A_0` and `(11)` for `A_1`.

# Questions

Am I missing something basic here? Do you all have some other approaches in mind? I am leaning towards Method 4. It can be worse by a log factor, which I think is palatable given that pagination can sometimes save a lot of work.

What do you all think? @core-devs

(In another post, I can write about whether we want to keep keys of index in memory.)