The way to understand about this is as follows:
S2 cells can be thought of as a quad-tree. Each S2 cell has 4 children at the next level. In the case of DGraph, the cells are from level 5 to level 16, so there are multiple trees with root nodes being at level 5 and the leaf level is level 16.
Indexing Points
When indexing a point, we index it at the leaf level and then add it to every ancestor cell all the way to the root. The reason for doing this is that at query time we can quickly check if a point is in a region by querying any of the ancestor nodes.
Indexing Regions
A region is represented as an arbitrary closed polygon. To index a region we find a covering for the region, that is a set of S2 cells (which may be at different levels) which completely cover the polygon. Note: that this covering is approximate (i.e. the covering may contain points that are not in the region).
Once we have a covering, indexing a region works the exact same as indexing points. i.e. the region is indexed at the covering S2 cell levels (which may not be the leaf level) and then added to every ancestor cell all the way to the root.
I’ll illustrate why regions are indexed this way using intersects queries. Note that queries of the form contains, within, and near are all special cases of intersects with additional restrictions applied.
Intersects Queries
An intersect query is something of the form:
SELECT g FROM table WHERE INTERSECTS(g, h)
This query returns all geometries in the table which intersect with the given geometry h
. We assume here that h
is provided by the user and could be either a region or a point.
Now consider the following geometries indexed the database:
- Region for City of San Francisco
- Region representing the Google campus in Mountain View.
- Point representing Google building 43.
- Point representing the Golden gate bridge.
- Region representing the SF Bay area.
Note: I’ve used real world examples for illustration here, if you are not familiar with the geography of the area consider them as follows:
- r1 and r2 are smallish regions which are not very close but share an ancestor cell in the S2 tree.
- p3 is a point inside r2
- p4 is a point inside r1
- r5 is a large region that contains both r1 and r2.
Now the query is:
find me all objects that intersect with 6. the region representing the City of Mountain View
Note that, r6 is a larger region that completely contains r2. Now, while r5 contains r6 which contains r2, they are regions of vastly different sizes, and when we generate a covering for each, the coverings are at different levels in our tree.
The result of this query should be r2, r5 and p3. I’ll illustrate how the index helps us find all these 3 objects.
Finding p3
Finding p3 is easiest case. We query the index using the cover for r6, and then find all objects whose ancestor is that cover. As points are indexed in all ancestors cells, and one S2 cell in the cover for r6 is a ancestor of p3, this query will find p3.
Finding r2
Finding r2 is identical to finding p3. We query the index using the cover for r6, and then find all objects whose ancestor is that cover. In this case, the fact that we are indexing the region in all ancestor cells is important (note: this is the same as points). As there is a S2 cell in the cover of r6 that is the ancestor of a cell in the cover of r2, this query will find r2.
Finding r5
For finding r5 the approach we used earlier does not work. The cover of r5 has a lower S2 cell level than r6 and hence the index for the cover of r6 will not contain r5. So in this case, we flip the approach, we query the index using the ancestors of the cells for r6 and find all objects which cover those cells. Since there is a cell in the cover of r5 which is an ancestor of a cell in the cover of r6, this query will find r5.
Ancestor and Cover Prefixes
One thing to note is that when finding r5 we want to avoid finding r1 and p4. The reason we will pick them up in the approach illustrated so far is that r1 and p4 are added to all the ancestor cells when they are indexed. Our root level S2 cell is quite large and so when we are querying all the ancestor cells of r6, we will find a common ancestor of r1, r6 and p4.
To find r5, we only needed to look at objects whose cover contains one of the ancestor cells of r6. To distinguish these two cases, we index covers and ancestors using different prefixes. (Here a cover means the highest S2 level at which the geometry is indexed). Now, when we are searching the index using the ancestors of r6, we will only look at objects whose cover is one of the ancestor cells and thus only find r5 and exclude r1 and p4.
Other Queries
I’ll quickly illustrate why the other queries are just special cases of the more generalized intersects query described above:
Within
A within query is your typical “find all restaurants in my zip code” query. A within query can be represented as:
SELECT g FROM table WHERE (g WITHIN h)
We restrict that g
can only be points. So this is the same as saying INTERSECTS(g, h) && IS_POINT(g)
. On the execution side, we don’t need to query for ancestors of h
since points cannot cover regions.
Contains
This is your typical geofencing query, e.g. “find the zip code for my location”. A contains query is of the form:
SELECT g FROM table WHERE (g CONTAINS h)
We restrict that h
can only be a point. So this is the same as saying INTERSECTS(g, h) && ASSERT(IS_POINT(h))
. On the execution side, we only need to query for ancestors of h
since points cannot contain regions.
Near
A near query is your typical, “find all restaurants within 2km of my location” query. It can be represented as:
SELECT g from table WHERE (g WITHIN CIRCLE(h,d))
As you can see it is just a special case of the WITHIN query where instead of providing a region the parameters are a point and a radius representing a circle to search within.