Query Planner

(Note: I’m not involved with writing the query planner, nor do I have visibility into its current state. But I have experience ripping out the Postgres QP and playing around with it, and I enjoy inventing new programming languages as a hobby.)

Here are some brief and unrigorous thoughts on the subject.

Pieces of the Puzzle

A query is mostly declarative. Given a schema

type Person {
     name
     friends
}

name: string . 
friends: [uid] . 

and the following example dataset depicting an (undirected) cycle of friends,

set {
    _:a <dgraph.type> "Person" . 
    _:a <name> "we live in a society" .
    _:b <dgraph.type> "Person" .
    _:b <name> "69420" .
    _:c <dgraph.type> "Person" .
    _:c <name> "Wololo" . 

    _:a <friends> _:b .
    _:b <friends> _:c .
    _:a <friends> _:c .
}

This query

q (func: type(Person)) {
     name 
     friends { name }
}

is pretty straightforward in showing you what you are asking for:

Show me all nodes that are marked Person. I want their names and their friends’ names.

The result is as follows:

      {
        "name": "69420",
        "friends": [
          {
            "name": "Wololo"
          }
        ]
      },
      {
        "name": "Wololo"
      },
      {
        "name": "we live in a society",
        "friends": [
          {
            "name": "69420"
          },
          {
            "name": "Wololo"
          }
        ]
      }

Now consider this query:

r(func: has(name)) {
     name
     friends { name }
}

Written out in English, we can read it as:

Show me all nodes that have a predicate name. I want their name and friends’ names. (IMPLICIT: if there are no friends, then just show me the name)

The results are as follows:

      {
        "name": "69420",
        "friends": [
          {
            "name": "Wololo"
          }
        ]
      },
      {
        "name": "Wololo"
      },
      {
        "name": "we live in a society",
        "friends": [
          {
            "name": "69420"
          },
          {
            "name": "Wololo"
          }
        ]
      }

Observe that queries q and r return the same results. To philosophers, this can be quite vexing. What does it mean when two queries give the same result?

Does it mean:

  1. That queries r and q denote the same thing?
  2. That queries q and r mean the same thing?
  3. That queries r and q perform the same operations to get the same result?

We can instantly rule out 2: we can simply compare the written-out English sentences and conclude, from their meanings, that queries q and r mean different things.

There are “official” names for the study of these things, of course; fancy-schmancy-sounding things like:

  1. Denotational Semantics [1]
  2. (Abstract) Semantics
  3. Operational Semantics

But we’re not really interested in any of that, except perhaps Operational Semantics.

Detour on Operational Semantics and Confluence

The big picture view of Operational Semantics is that the operations of a query/expression/function/program/etc are codified and verifiable.

Consider the following functions:

$$
\begin{aligned}
f(x) &= x * x / x \\
g(x) &= 1 * x
\end{aligned}
$$

These two functions, $f$ and $g$, will always return the same answer. However, when we consider their operational semantics, they are different: $g$ performs only a single multiplication, while $f$ performs a multiplication and a division.

And now for the trick question. Consider the function $h$, defined as follows:

$$
h(x) = x / x * x
$$

Does $h$ do the same things as $f$ ?

To answer this we need the idea of confluence, or the diamond property.

Assuming that we evaluate the steps of the function from left to right, we arrive at the following confluence diagram:

[image: confluence diagram]

No matter which way we group the operations, we come to the same result. We can extend this notion when comparing $f$ and $h$. BTW, this “leap of logic” is something we can only get away with in this specific example by cheating (i.e. division is essentially multiplication in disguise, and multiplication is associative and commutative). However, it does give us a clue about what can be done, which I shall come back to later.
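
As a tiny worked instance of the diamond, take $f$ at $x = 3$; both ways of grouping the operations meet at the same value:

$$
\begin{aligned}
(3 * 3) / 3 &= 9 / 3 = 3 \\
3 * (3 / 3) &= 3 * 1 = 3
\end{aligned}
$$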

How Dgraph Currently Performs Queries

I am now going to describe how Dgraph performs queries in broad strokes. The finer details are unimportant and can be elided. But the general structure of how it’s done is important.

When a query comes in, it first gets parsed. Parsing is boring, so let’s ignore that. OK, maybe not. Parsing a query transforms the query into an AST[2]. Like any good system[3], this AST is transformed into a graph of operationed data[4], which is then executed by ProcessGraph.

Here I want to clarify what I mean by “operationed” data. A *SubGraph holds data. How is the data acquired? It’s acquired through ProcessGraph.

ProcessGraph is particularly interesting because it’s a mega-function with the sequence of events hard-coded. To be sure, it is a concurrent function. But within the function, the order of operations is hard-coded as follows (in broad strokes):

  1. Collect data from all the Alphas around the world.
  2. If there are @filter clauses on any predicate, apply them concurrently.
  3. Do facet stuff, if any.
  4. Do sort and pagination stuff, if any.
  5. Do counts and groupbys, if any.
  6. Return the results.

Bear in mind that because this is a graph of operationed data, there is some recursion going on: for each subgraph within the graph of operationed data, steps 1–6 happen, and the results are collected and passed on.
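
To make that shape concrete, here is a heavily simplified sketch of the hard-coded order. This is not the real ProcessGraph; SubGraph here is pared down, and the helper names are made-up stand-ins for the steps above:

// Illustrative sketch only - not the actual ProcessGraph.
type SubGraph struct {
    Children []*SubGraph
    // ... the data and parameters the real *SubGraph carries
}

func processGraph(sg *SubGraph) error {
    collectFromAlphas(sg) // 1. gather the data from the Alphas
    applyFilters(sg)      // 2. apply @filter clauses (concurrently, in reality)
    applyFacets(sg)       // 3. facet work, if any
    sortAndPaginate(sg)   // 4. sort and pagination, if any
    countAndGroupBy(sg)   // 5. counts and groupbys, if any

    // The recursion: every child subgraph runs through steps 1-5 too,
    // and its results are folded back into the parent.
    for _, child := range sg.Children {
        if err := processGraph(child); err != nil {
            return err
        }
    }
    return nil // 6. sg now holds the results to return
}

The point is that the sequence itself lives inside the function body; nothing outside it gets a say in the order.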

Isomorphism(ish) to Relational Algebra

If steps 1 to 6 seem awfully familiar, it’s because you’ve implemented Codd’s relational algebra at least once in your life. There is some sort of similarity, if not isomorphism, between what Dgraph does and relational algebra, enumerated in the table below:

| Dgraph | Codd |
| --- | --- |
| Collect data | Π and ⋈ |
| Filter | σ |
| Facet | σ |
| Sort and Paginate | n/a (sorting and pagination are not part of Codd’s algebra, but are part of every database’s operations; a “postprocessing” step, if you will) |
| Counts and GroupBys | |

Most databases (except ones that lose data) generally have to follow Codd’s algebra (more or less). The innovations in databases are in the details of how the algebra is implemented.

For example, Dgraph’s main contribution is not treating projections and joins as separate operations, but treating them as one single, unified operation. In essence, the “join” operation disappears. Well, kinda. I’ll explain further.

Detour on Query Rewriting and Query Optimization

OK, now that all of that has been set out, let’s see what interesting things we can do.

First, recall q and r. Both are queries that have different meanings but return the same results.

Question: q vs r, which is faster?

Answer: q is faster.

Why? The has function is slow because it needs to look everywhere for data that has the given predicate. The type function, by contrast, acts as an index: look up which nodes have a <dgraph.type> of Person, then follow the pointers to retrieve the correct data. It turns out that, in this instance, following pointers is faster than performing a full scan of all the tablets.

Given that q is faster than r, might it be a sane idea to actually run q when the input r is given? This is the basis of query optimization: we simply rewrite the query into a more performant one under the hood.

However, achieving this requires a very strong grasp of the semantics of the query language. As enumerated above, we really have two options. Here I shall enumerate the considerations we have to weigh in order to say that two queries A and B are equal.

Considerations

Denotational Semantics

If we are to use denotational semantics to decide whether two queries are the same thing, then we will have to rely heavily on the provided schema (see Potential Pitfalls below) to “guess”, and apply heuristics to perform the query rewriting.
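
As a sketch of what such a schema-guided rewrite could look like (the types, names, and schema representation here are hypothetical, not Dgraph’s actual AST or API):

// Hypothetical stand-ins for the parsed query - not Dgraph's real AST.
type FuncCall struct {
    Name string
    Args []string
}

type Query struct {
    Func FuncCall
}

// rewriteHasToType replaces has(<pred>) with type(<T>) when exactly one type
// in the schema contains <pred>. This is the "guess from the schema" idea;
// it is only sound for as long as the schema stays that way.
func rewriteHasToType(q *Query, schema map[string][]string) {
    if q.Func.Name != "has" || len(q.Func.Args) != 1 {
        return
    }
    pred := q.Func.Args[0]
    var owners []string
    for typ, preds := range schema { // schema: type name -> its predicates
        for _, p := range preds {
            if p == pred {
                owners = append(owners, typ)
                break
            }
        }
    }
    if len(owners) == 1 { // unambiguous, so the type index can stand in for the scan
        q.Func = FuncCall{Name: "type", Args: []string{owners[0]}}
    }
}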

Operational Semantics

If we are to use operational semantics to decide whether two queries are the same thing, then we just have to consider the list of operations in each query and, component by component, determine whether they are equal. We will also have to account for the commutativity of the operations.
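
A loose sketch of that comparison, with made-up operation names (the Op type, the commutes table, and the function below are all hypothetical):

// Op names one operation in a query's operational semantics.
type Op string

// commutes records which adjacent pairs of operations can be swapped without
// changing the result. The entries are illustrative, not a verified table.
var commutes = map[[2]Op]bool{
    {"filter", "sort"}: true,
    {"sort", "filter"}: true,
}

// sameUpToCommutativity reports whether two operation lists are equal once
// adjacent commutative operations are allowed to swap. Only a sketch: it
// tries single adjacent swaps rather than every permutation the table allows.
func sameUpToCommutativity(a, b []Op) bool {
    if len(a) != len(b) {
        return false
    }
    for i := 0; i < len(a); i++ {
        if a[i] == b[i] {
            continue
        }
        if i+1 < len(a) && commutes[[2]Op{a[i], a[i+1]}] &&
            a[i] == b[i+1] && a[i+1] == b[i] {
            i++ // the swapped pair accounts for both positions
            continue
        }
        return false
    }
    return true
}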

As an example, let’s compare a “loose” operational semantics of the following compositions:

  • filter then sort
  • sort then filter

In this specific example, the composition of functions is commutative (very rare!).

However, it’s trivial to see that filter-then-sort is the superior approach: filtering is an O(N) operation, while sorting is an O(N log N) operation.

Filter-then-sort: filtering takes N operations; if it keeps M elements, where M < N, the sort that follows costs roughly M log(M) operations.
Sort-then-filter: sorting takes O(N log(N)) operations, and filtering afterwards takes N operations to yield the same M elements.

The total number of operations is smaller for filter-then-sort. This makes for a good heuristic.
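
To put rough, purely illustrative numbers on it, suppose N = 1,000,000 elements of which M = 10,000 survive the filter:

$$
\begin{aligned}
\text{filter-then-sort} &\approx N + M\log_2 M \approx 10^6 + 10^4 \times 13.3 \approx 1.1 \times 10^6 \\
\text{sort-then-filter} &\approx N\log_2 N + N \approx 10^6 \times 20 + 10^6 \approx 2.1 \times 10^7
\end{aligned}
$$

That’s roughly a 20x difference, just from reordering two operations.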

So with this heuristic in hand, we can confidently rewrite every query with a sort-then-filter ordering into a filter-then-sort ordering.

Potential Pitfalls

q and r return the same results under the condition that the schema is as given above. Assume now that the schema has changed to the following:

type Person {
     name
     friends
}

type Animal {
    name
    friends
}

name: string . 
friends: [uid] . 

Now the fact that the semantics of r differ from the semantics of q really matters: q and r can now return different results!
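
For instance, suppose we now add a single hypothetical Animal (this node is not in the original dataset):

set {
    _:d <dgraph.type> "Animal" .
    _:d <name> "Rex" .
}

Query r (has(name)) would now also return Rex, while query q (type(Person)) would not; the rewrite from r to q is no longer sound.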

What is a Query Planner?

And now we can ask the question: what is a query planner?

Recall from the earlier section that queries are declarative. That means that an implementor is free to implement whatever they want, as long as it returns the results the query is supposed to return.

Thus, a query planner is one that determines what operations (and in what order) should take place.

In Dgraph, there is a general “framework” of how a query is processed, as laid out in the section How Dgraph Currently Performs Queries. And indeed, right now, due to the concurrent nature of Dgraph’s query processing, if you run the same query twice, the order of operations may not necessarily be the same in each run.

However we are not going to talk about query planning on that level. We simply trust the Go scheduler’s correctness.

There are two kinds of query planners we can write:

  1. A query planner that changes the order of operations in ProcessGraph based on heuristics (dynamic/“runtime” optimization).
  2. A query planner that rewrites queries into more performant ones and executes them (static/“compile-time” optimization).

Both of these query planners require some sort of understanding of the semantics of the query language. We can then leverage the algebraic properties of the operations to perform the work of both kinds of query planners.

Just as with programming languages, we may prefer one over the other; but every programming language has a runtime and a compile-time component, and an efficient one will optimize both phases.

The Order of Operations for Writing a Query Planner

Assuming we do finally go ahead and decide to write a query planner (again, I say this as a guy who isn’t writing the query planner and loves posturing), this is my humble opinion on how to proceed.

On the software planning end we have the following:

  1. Decide on the notion of equality in the semantics of a query. What kind of semantics are we comparing? I strongly propose we consider the Operational Semantics.
  2. Create a table that is a cross product of all pairs of unary operations - call this the composition table.
  3. Mark all unary operation pairs as either commutative or not.
  4. Assign costs to the pairs (a sketch of such a table follows this list).
  5. ???
  6. Profit!
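
As a rough illustration, building on the Op type from the earlier sketch, the composition table could look something like this in Go (the commutativity flags and costs below are made up, not measured):

// Each ordered pair of operations carries a commutativity flag and a cost
// estimate. Placeholder numbers, not measurements.
type Composition struct {
    Commutes bool    // may the pair be reordered without changing the result?
    Cost     float64 // relative cost of running the pair in this order
}

var compositionTable = map[[2]Op]Composition{
    {"filter", "sort"}:   {Commutes: true, Cost: 1},  // cheap: only sort the survivors
    {"sort", "filter"}:   {Commutes: true, Cost: 20}, // expensive: sort everything first
    {"sort", "paginate"}: {Commutes: false, Cost: 1}, // order matters, never swap
}

// cheaperOrder picks whichever ordering of a commuting pair the table says
// costs less.
func cheaperOrder(a, b Op) (Op, Op) {
    fwd := compositionTable[[2]Op{a, b}]
    rev := compositionTable[[2]Op{b, a}]
    if fwd.Commutes && rev.Commutes && rev.Cost < fwd.Cost {
        return b, a
    }
    return a, b
}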

On the actual programming end we have the following:

  1. Create a new package (presumably called analysis or something like that). The goal of this package is to convert the AST into a graph of operations. By separating out the operations from the operationed data, we can proceed with better optimizations.
  2. Symbolize all the operations in package query - this means breaking up ProcessGraph into named, bite-sized chunks, where each name corresponds to an operation (a function is good because Go has first-class functions, but we might also want to protobuf that shit).
  3. Modify SubGraph in query to also take a list of operations that it needs to perform. This list of operations is a topologically sorted graph of operations (see the sketch after this list).
  4. ??? (rewrite/reorder operations based on the costs in the heuristics from the process above)
  5. Profit!
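
Purely as a sketch of what steps 2 and 3 could look like (none of these names exist in package query today, and SubGraph is again pared down to the relevant bits):

// Operation is one named, bite-sized chunk of what ProcessGraph does inline
// today: collect, filter, facet, sort, paginate, count, ...
type Operation struct {
    Name string
    Run  func(sg *SubGraph) error
}

// SubGraph would, in addition to the data it already holds, carry the
// topologically sorted list of operations it must perform, produced by the
// hypothetical analysis package.
type SubGraph struct {
    Children []*SubGraph
    Ops      []Operation
    // ... existing fields elided
}

// execute replaces the hard-coded body of ProcessGraph: walk the operation
// list, then recurse into the children.
func execute(sg *SubGraph) error {
    for _, op := range sg.Ops {
        if err := op.Run(sg); err != nil {
            return err
        }
    }
    for _, child := range sg.Children {
        if err := execute(child); err != nil {
            return err
        }
    }
    return nil
}

With the operation list pulled out of the function body, a planner can reorder or rewrite it before execution, which is exactly where the composition table from the previous list plugs in.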

Footnotes


  1. With databases, denotational semantics can get quite hairy, specifically if you ask the question with regard to models of the world. In that case, denotational semantics and the unlabelled semantics (#2 from the list) might be the same thing … But this is not the time or place to discuss it. Maybe on April 20th, and with something burning. ↩︎

  2. seriously, the naming convention sucks. Why is it called gql? Because once upon a time DQL used to be called GraphQL±. ↩︎

  3. self plug. Duh. Any good complicated library will eventually need to build up a graph of operations. ↩︎

  4. I am not joking. The naming convention sucks. The data type is called SubGraph but the package is called query. How frustrating! But the way the data structure is used to orchestrate operations and collect data over a concurrent process is chef’s kiss. Just beautiful. ↩︎
