Geo Indexing: parent and cover prefixes

This topic is to briefly summarize why we need parent and cover prefixes for geo indices.

S2 cells (Intro):

S2 cells are how the region on a sphere (earth) are identified. There are different levels (say 1 to 20) of cells with 1 being the largest size (covering area of say 100000 sq km) and 20 being smallest (covering area of say 1sq metre). The cells are arranged in quad-tree form, with each cell being split into 4 in the subsequent level and hence the area covered by a cell would also be reduced by 4 in each level.

In Dgraph we allow cells from range 5 (Approx 250km x 380km) to 16 (Approx 120m x 180m)

Indices:

  • Point: A point can be represented by a cell of the lowest allowed size.
  • Region: A region is represented by a collection of cells at different levels (Imagine a large region, the middle parts can be covered by larger cells and the edges might need smaller cells for accurate covering).

Parent and cover cells:

  • Cover: For points, cover would be the smallest cell which contains the point. And for regions it’d be the set of cells at different levels that cover the region.

  • Parent: All the upper-level cells of a given cell are the parent cells. For a point, it’d be all the cells from lowest to highest level. For regions, it would be the parents at all the levels of all the cells present in the cover.

Queries

  • Contains: Here we need to find all the points which are contained by a given region. On converting the region to s2 cells, we’d get a set of cells at different level. So, Only if we had indexed the points at all the levels, it could be retrieved. We then add the cover cells with the parent prefix and query the index to get the appropriate cells.

  • Intersects: Think about the cases when the larger region is indexed, we’d need to query using the parent cells of the smaller region but using the cover prefix. Conversly, If the smaller region is indexed, we’d have to query with the cover cells of the larger region but by using the parent prexif (As we have indexed all the parent cells of the smaller region)

For other query types, we’d use appropriate combination of parents and cover to retrieve the required indices.

Hope this gives an idea about why we index the parent and cover cells with prefixes.

Note that prefix-matching could be used to identify the parent cells (As S2 tokenizes the cells in that manner) but the in-memory maps wouldn’t support prefix matching and hence to avoid iterating through all the keys, we store the keys at all the levels (Parents) and generate appropriate keys while querying.

Feel free to discuss in case something is not clear or needs more explanation. @core-devs
@kostub: Please have a read and let me know in case something is incorrect or can be better explained.

Thanks

Sorry, it’s hard to follow the queries logic. Can you explain with an example?

My thinking for doing geo-indexing would be this:

  • For each point, generate all the covers, from level X to level Y.
  • For each region, generating the best effort coverage, which involves cells at different levels.
    These two form the keys. For point and for region, multiple keys would be formed.

If a nearby query comes, we generate the S2 cells between [X, Y]. This gives us a list of tokens. We then query for these tokens, and merge the results – these form all the geo-locations which fall in that spherical region. We can then query for their exact geo-points, and eliminate the ones that are not exactly within.

I’m not sure what the intersect query is. Can you explain?

The way to understand about this is as follows:

S2 cells can be thought of as a quad-tree. Each S2 cell has 4 children at the next level. In the case of DGraph, the cells are from level 5 to level 16, so there are multiple trees with root nodes being at level 5 and the leaf level is level 16.

Indexing Points
When indexing a point, we index it at the leaf level and then add it to every ancestor cell all the way to the root. The reason for doing this is that at query time we can quickly check if a point is in a region by querying any of the ancestor nodes.

Indexing Regions
A region is represented as an arbitrary closed polygon. To index a region we find a covering for the region, that is a set of S2 cells (which may be at different levels) which completely cover the polygon. Note: that this covering is approximate (i.e. the covering may contain points that are not in the region).

Once we have a covering, indexing a region works the exact same as indexing points. i.e. the region is indexed at the covering S2 cell levels (which may not be the leaf level) and then added to every ancestor cell all the way to the root.

I’ll illustrate why regions are indexed this way using intersects queries. Note that queries of the form contains, within, and near are all special cases of intersects with additional restrictions applied.

Intersects Queries
An intersect query is something of the form:

SELECT g FROM table WHERE INTERSECTS(g, h)

This query returns all geometries in the table which intersect with the given geometry h. We assume here that h is provided by the user and could be either a region or a point.

Now consider the following geometries indexed the database:

  1. Region for City of San Francisco
  2. Region representing the Google campus in Mountain View.
  3. Point representing Google building 43.
  4. Point representing the Golden gate bridge.
  5. Region representing the SF Bay area.

Note: I’ve used real world examples for illustration here, if you are not familiar with the geography of the area consider them as follows:

  • r1 and r2 are smallish regions which are not very close but share an ancestor cell in the S2 tree.
  • p3 is a point inside r2
  • p4 is a point inside r1
  • r5 is a large region that contains both r1 and r2.

Now the query is:
find me all objects that intersect with 6. the region representing the City of Mountain View
Note that, r6 is a larger region that completely contains r2. Now, while r5 contains r6 which contains r2, they are regions of vastly different sizes, and when we generate a covering for each, the coverings are at different levels in our tree.

The result of this query should be r2, r5 and p3. I’ll illustrate how the index helps us find all these 3 objects.

Finding p3
Finding p3 is easiest case. We query the index using the cover for r6, and then find all objects whose ancestor is that cover. As points are indexed in all ancestors cells, and one S2 cell in the cover for r6 is a ancestor of p3, this query will find p3.

Finding r2
Finding r2 is identical to finding p3. We query the index using the cover for r6, and then find all objects whose ancestor is that cover. In this case, the fact that we are indexing the region in all ancestor cells is important (note: this is the same as points). As there is a S2 cell in the cover of r6 that is the ancestor of a cell in the cover of r2, this query will find r2.

Finding r5
For finding r5 the approach we used earlier does not work. The cover of r5 has a lower S2 cell level than r6 and hence the index for the cover of r6 will not contain r5. So in this case, we flip the approach, we query the index using the ancestors of the cells for r6 and find all objects which cover those cells. Since there is a cell in the cover of r5 which is an ancestor of a cell in the cover of r6, this query will find r5.

Ancestor and Cover Prefixes
One thing to note is that when finding r5 we want to avoid finding r1 and p4. The reason we will pick them up in the approach illustrated so far is that r1 and p4 are added to all the ancestor cells when they are indexed. Our root level S2 cell is quite large and so when we are querying all the ancestor cells of r6, we will find a common ancestor of r1, r6 and p4.

To find r5, we only needed to look at objects whose cover contains one of the ancestor cells of r6. To distinguish these two cases, we index covers and ancestors using different prefixes. (Here a cover means the highest S2 level at which the geometry is indexed). Now, when we are searching the index using the ancestors of r6, we will only look at objects whose cover is one of the ancestor cells and thus only find r5 and exclude r1 and p4.

Other Queries
I’ll quickly illustrate why the other queries are just special cases of the more generalized intersects query described above:

Within
A within query is your typical “find all restaurants in my zip code” query. A within query can be represented as:

SELECT g FROM table WHERE (g WITHIN h)

We restrict that g can only be points. So this is the same as saying INTERSECTS(g, h) && IS_POINT(g). On the execution side, we don’t need to query for ancestors of h since points cannot cover regions.

Contains
This is your typical geofencing query, e.g. “find the zip code for my location”. A contains query is of the form:

SELECT g FROM table WHERE (g CONTAINS h)

We restrict that h can only be a point. So this is the same as saying INTERSECTS(g, h) && ASSERT(IS_POINT(h)). On the execution side, we only need to query for ancestors of h since points cannot contain regions.

Near
A near query is your typical, “find all restaurants within 2km of my location” query. It can be represented as:

SELECT g from table WHERE (g WITHIN CIRCLE(h,d))

As you can see it is just a special case of the WITHIN query where instead of providing a region the parameters are a point and a radius representing a circle to search within.

3 Likes

Thanks, @kostub. This is a very detailed explanation of a pretty intricate problem. Great explanation!

So, to summarize this:

For r6, we do:

  • Get all S2 cells which cover r6 [rcov]
  • Get all ancestor cells for r6 [rans]

Using these, we find:

  • All points whose ancestors are [rcov].
  • All regions whose ancestors are [rcov].
  • All regions whose covers are [rans].

So, in terms of seeks, we’re doing:

  • K1: Seek all keys = (ancestorprefix, [rcov])
  • K2: Seek all keys = (coverprefix, [rans])

Is that correct? This would find all points or regions which fall entirely within the requested region, r6; and would also find all regions which contain r6. If we don’t care about finding regions which contain r6, we can just skip looking for K2; and then we’ll only find the points or regions contained by r6.

Thanks @kostub for the great example covering the edge cases.

Yes @mrjn, K1 would give us the points/regions contained by the region (r6). And K2 would give us the regions that contain r6. For intersects, we’d take a union of them.

Yes @mrjn that is exactly right.

So when we run a WITHIN query we only look at K1, when we run a CONTAINS query we only look at K2 and for an INTERSECTS query we look at both K1 and K2.

1 Like

Thanks, @kostub – this makes a lot of sense!

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.