(Note: I’m not involved with writing the query planner, nor do I have visibility into the current state. But I’ve experience ripping out the Postgres QP and playing around with it, also I enjoy inventing new programming languages as a hobby)
Here are some brief and unrigourous thoughts on the subject.
Pieces of the Puzzle
A query is mostly declarative. Given a schema
type Person {
name
friends
}
name: string .
friends: [uid] .
and the following example dataset depicting a (undirected) cycle of friends,
set {
_:a <dgraph.type> "Person" .
_:a <name> "we live in a society" .
_:b <dgraph.type> "Person" .
_:b <name> "69420" .
_:c <dgraph.type> "Person" .
_:c <name> "Wololo" .
_:a <friends> _:b .
_:b <friends> _:c .
_:a <friends> _:c .
}
This query
q (func: type(Person)) {
name
friends { name }
}
is pretty straightforwards in showing you what you are asking for:
Show me all nodes that are marked
Person
. I want their names, and their friends’ names
The result is as follows:
{
"name": "69420",
"friends": [
{
"name": "Wololo"
}
]
},
{
"name": "Wololo"
},
{
"name": "we live in a society",
"friends": [
{
"name": "69420"
},
{
"name": "Wololo"
}
]
}
Now consider this query:
r(func: has(name)){
name
friends { name }
}
Written out in English, we can read it as:
Show me all nodes that have a predicate
name
. I want their name and friends’ names. (IMPLICIT: if there are nofriends
, then just show me thename
)
The results are as follows:
{
"name": "69420",
"friends": [
{
"name": "Wololo"
}
]
},
{
"name": "Wololo"
},
{
"name": "we live in a society",
"friends": [
{
"name": "69420"
},
{
"name": "Wololo"
}
]
}
Observe that query q
and r
returns the same results. To philosophers, this can be quite vexing. What does it mean when two queries give the same result?
Does it mean:
- That queries
r
andq
denote the same thing? - That queries
q
andr
mean the same thing? - That queries
r
andq
perform the same operations to get the same result.
We can instantly rule out 2. We can simply compare the written-out English sentences, and conclude from the meaning of the English sentence that queries q
and r
mean different things.
There are “official” names for the study of these things of course, fancy schmancy sounding things like:
- Denotational Semantics [1]
- (Abstract) Semantics
- Operational Semantics
But we’re not really interested in any of that, except perhaps Operational Semantics.
Detour on Operational Semantics and Confluence
The big picture view of Operational Semantics is that the operations of a query/expression/function/program/etc are codified and verifiable.
Consider the following functions:
$$
\begin{aligned}
f(x) &= x * x / x \
g(x) &= 1*x
\end{aligned}
$$
These two functions, $f$ and $g$ will always return the same answer. However, when we consider the operational semantics, they are different. $g$ performs one multiplication operation only while $f$ performs a multiplication operation and a division operation.
And now for the trick question. Consider the function $h$, defined as follows
$$
h(x) = x / x * x
$$
Does $h$ do the same things as $f$ ?
To answer this we need the idea of confluence, or the diamond property.
Assuming that we evaluate the steps of the function from left to right, we come to the following confluence diagram
No matter which way we group the operations, we come to the same result. We can extend this notion when comparing $f$ and $h$. BTW this “leap of logic” is something we can only do in this specific example by cheating (i.e division is essentially multiplication in disguise and that multiplication is associative and commutative). However it does give us a clue on what can be done, which I shall come back to later.
How Dgraph Currently Performs Queries
I am now going to describe how Dgraph performs queries in broad strokes. The finer details are unimportant and can be elided. But the general structure of how it’s done is important.
When a query comes in, it first gets parsed. Parsing is boring, so let’s ignore that. OK, maybe not. Parsing a query transforms the query into an AST[2]. Like any good system[3], this AST is transformed into a graph of operationed data[4], which is then executed by ProcessGraph
.
Here I want to clarify what I mean by “operationed” data. A *SubGraph
holds data. How is the data acquired? It’s acquired through ProcessGraph
.
ProcessGraph
is particularly interesting because it’s a mega function with the sequence of events hard-coded. To be sure, this function is a concurrent function. But within the function, the order of operations are hard coded to be as follows (in broad strokes):
- Collect data from all the Alphas around the world.
- If there are
@filter
clauses in each predicate, apply them concurrently. - Do facet stuff if any
- Do sort and pagination stuff if any
- Do counts and groupbys if any
- Return results
Bear in mind that because this is a graph of operationed data, there is some recursiveness going on. So for each subgraph in the subgraph of operationed data, steps 1 - 6 happens, and the results are collected and passed on.
Isometry(ish) to Relational Algebra
If steps 1 to 6 seems awfully familiar, you know you’ve implemented Codd’s relational algebra at least once in your life. There is some sort of similarity if not isometry, of what Dgraph does with relational algebra, enumerated in the table below
Dgraph | Codd |
---|---|
Collect data | Π and ⋈ |
Filter | σ |
Facet | σ |
Sort and Paginate | n/a (sorting and pagination are not part of the Codd algebra, but is part of all databases’ operations - it’s a “postprocessing” step, if you will) |
Counts and GroupBys | XΠ |
Most databases (except ones that lose data) generally have to follow Codd’s algebra (more or less). The innovations in databases are in the details of how the algebra is implemented.
For example, Dgraph’s main contribution is not treating projections and joins as separate operations, but as one single operation that is unified. In essence the “join” operation disappears. Well, kinda. I’ll explain further.
Detour on Query Rewriting and Query Optimization
OK, all that has been set out, now let’s see what interesting things we can do.
First, recall q
and r
. Both are queries that have different meanings but return the same results.
Question: q
vs r
, which is faster?
Answer: q
is faster.
Why? The has
function is slow because it needs to look everywhere for data that has a particular predicate. The type
function by contrast, acts as an index. Look up which nodes has the <dgraph.type>
of Person
, then follow the pointers to retrieve the correct data. It turns out in this instance, following pointers is faster than performing a full scan of all the tablets.
Given q
is faster than r
, might it be a sane idea to actually run q
when the input r
is given? This is the basis of query optimization. We simply rewrite the query to one that is more performant under the hood.
However, to achieve this requires a very strong grasp of the semantics of the query language. As enumerated above, we have really two options. Here I shall enumerate the considerations that we have to give in order for us to say that two queries A and B are equal.
Considerations
Denotational Semantics
If we are to use denotational semantics to identify if two queries are the same thing, then we will have to rely heavily on the provided schema (see Potential Pitfalls below) to “guess” and perform heuristics to perform the query rewriting
Operational Semantics
If we are to use operational semantics to identify if two queries are the same thing then we just have to consider the list of operations in each queries. And then component by component, identify if they are equal. We will also have to account for the commutativity of the operations.
As an example, let’s compare a “loose” operational semantics of the following compositions:
- filter then sort
- sort then filter
In this specific example, the composition of functions is commutative (very rare!).
However, it’s trivial to see that filter-then-sort is a superior approach - filtering is a O(N) operation. Sorting is a O(N log(N)) operation.
Filter-then-sort: If the result of filtering is M elements, where M < N, then following with sorts would result in M(log(M)) operations.
Sort-then-filter: Sorting take O(N(log(N)), and filtering after would take N operations to yield M elements as a result.
The total number of operations would be smaller in filter-then-sort. This forms a good heuristics.
So with this heuristics in hand, we can now confidently re-write all queries that have a sort-then-filter operation to be a filter-then-sort operation order.
Potential Pitfalls
q
and r
returns the same results under the condition that the schema was as given. Assume now that the schema has changed to the following:
type Person {
name
friends
}
type Animal {
name
friends
}
name: string .
friends: [uid] .
Now the semantics of r
being different to the semantics of q
really matters! q
and r
now returns different results!
What is a Query Planner?
And now we can ask the question: what is a query planner?
Recall from the earlier section that the queries are declarative. That means that an implementor is free too implement whatever they want as long as it returns the results the query is supposed to return.
Thus, a query planner is one that determines what operations (and in what order) should take place.
In Dgraph, there is a general “framework” of how a query is processed, as laid out in the section How Dgraph Currently Performs Queries. And indeed, right now, due to the concurrent nature of Dgraph’s query processing, if you run the same query twice, the order of operations may not necessarily be the same in each run.
However we are not going to talk about query planning on that level. We simply trust the Go scheduler’s correctness.
There are two kinds of query planners we can write:
- Query planner that changes the order of
ProcessGraph
based on heuristics (Dynamic/“runtime” optimization). - Query planner that rewrites queries into more performant ones and execute them (Static/“compile time” optimization).
Both of these query planners require some sort of understanding of the semantics of the query language. We can then leverage the algebraic properties of the operations to perform the work of both kinds of query planners.
Just like programming languages, we may prefer one over the other, but all programming languages have a runtime and a compiletime component. A efficient programming language will optimize both phases.
The Order of Operation of Writing a Query Planner
Assuming we do finally go ahead and decide to write a query planner (again, I say this as a guy that isn’t writing the query planner, and loves posturing), this is my humble opinion of how to proceed.
On the software planning end we have the following:
- Decide on the notion of equality in the semantics of a query. What kind of semantics are we comparing? I strongly propose we consider the Operational Semantics.
- Create a table that is a cross product of all pairs of unary operations - call this the composition table.
- Mark all unary operation pairs as either commutative or not.
- Assign costs to the pairs
- ???
- Profit!
On the actual programming end we have the following:
- Create a new package (presumably called
analysis
or something like that). The goal of this package is to convert the AST into a graph of operations. By separating out the operations from the operationed data, we can proceed with better optimizations. - Symbolize all the operations in package
query
- this means breaking upProcessGraph
into named bitesized chunks, where each name corresponds to an operation (a function is good because Go has first class functions, but we might also want to protobuf that shit). - Modify
SubGraph
inquery
to also take a list of operations that it needs to perform. This list of operations is a topologically sorted graph of operations. - ??? (rewrite/reorder operations based on the costs in the heuristics from the process above)
- Profit!
Footnotes
with databases denotational semantics can get quite hairy. Specifically if you ask the question in regards to models of the world. In that case, denotational semantics and the unlabelled semantics(#2 from the list) might be the same thing … But this is not the time or place to discuss it. Maybe on April 20th and with somethings burning. ↩︎
seriously, the naming convention sucks. Why is it called
gql
? Because once upon a time DQL used to be called GraphQL±. ↩︎self plug. Duh. Any good complicated library will eventually need to build up a graph of operations. ↩︎
I am not joking. the naming convention sucks. The data type is called
SubGraph
but the package name is calledquery
. How frustrating! But the way the data structure is used to orchestrate operations and collect data over a concurrent process is chef’s kiss. Just beautiful. ↩︎