Dgraph keeps hinting at a query planner, and it is on the roadmap. Are there any public discussions or think tanks on how to make a query planner in Dgraph?
please
I still do not know what form a proposed query planner would take. In several RDBMSes they:
- silently work all the time, on every query, to optimize execution patterns based on index size/contents
- power tooling to print detailed execution plans for human consumption (e.g. `EXPLAIN`)
- must be explicitly bypassed (e.g. `STRAIGHT_JOIN`)
Can we use this post as a prompt to start an RFC? Or has one been compiled internally at Dgraph that we can peek at?
(Note: I'm not involved with writing the query planner, nor do I have visibility into its current state. But I have experience ripping out the Postgres QP and playing around with it, and I enjoy inventing new programming languages as a hobby.)
Here are some brief and unrigorous thoughts on the subject.
Pieces of the Puzzle
A query is mostly declarative. Given a schema

```
type Person {
  name
  friends
}

name: string .
friends: [uid] .
```
and the following example dataset depicting an (undirected) cycle of friends,

```
set {
  _:a <dgraph.type> "Person" .
  _:a <name> "we live in a society" .
  _:b <dgraph.type> "Person" .
  _:b <name> "69420" .
  _:c <dgraph.type> "Person" .
  _:c <name> "Wololo" .
  _:a <friends> _:b .
  _:b <friends> _:c .
  _:a <friends> _:c .
}
```
This query

```
q(func: type(Person)) {
  name
  friends { name }
}
```
is pretty straightforward in showing you what you are asking for:

> Show me all nodes that are marked `Person`. I want their names, and their friends' names.
The result is as follows:

```json
[
  {
    "name": "69420",
    "friends": [
      { "name": "Wololo" }
    ]
  },
  {
    "name": "Wololo"
  },
  {
    "name": "we live in a society",
    "friends": [
      { "name": "69420" },
      { "name": "Wololo" }
    ]
  }
]
```
Now consider this query:

```
r(func: has(name)) {
  name
  friends { name }
}
```
Written out in English, we can read it as:

> Show me all nodes that have a predicate `name`. I want their name and their friends' names. (IMPLICIT: if there are no `friends`, then just show me the `name`.)
The results are as follows:

```json
[
  {
    "name": "69420",
    "friends": [
      { "name": "Wololo" }
    ]
  },
  {
    "name": "Wololo"
  },
  {
    "name": "we live in a society",
    "friends": [
      { "name": "69420" },
      { "name": "Wololo" }
    ]
  }
]
```
Observe that queries `q` and `r` return the same results. To philosophers, this can be quite vexing. What does it mean when two queries give the same result?

Does it mean:
1. That queries `r` and `q` denote the same thing?
2. That queries `q` and `r` mean the same thing?
3. That queries `r` and `q` perform the same operations to get the same result?

We can instantly rule out 2: we can simply compare the written-out English sentences, and conclude from their meanings that queries `q` and `r` mean different things.
There are "official" names for the study of these things of course, fancy-schmancy-sounding things like:
1. Denotational Semantics [1]
2. (Abstract) Semantics
3. Operational Semantics

But we're not really interested in any of that, except perhaps Operational Semantics.
Detour on Operational Semantics and Confluence
The big picture view of Operational Semantics is that the operations of a query/expression/function/program/etc are codified and verifiable.
Consider the following functions:
$$
\begin{aligned}
f(x) &= x * x / x \\
g(x) &= 1 * x
\end{aligned}
$$
These two functions, $f$ and $g$, will always return the same answer. However, when we consider the operational semantics, they are different: $g$ performs only one multiplication operation, while $f$ performs a multiplication operation and a division operation.
And now for the trick question. Consider the function $h$, defined as follows
$$
h(x) = x / x * x
$$
Does $h$ do the same things as $f$ ?
To answer this we need the idea of confluence, or the diamond property.
Assuming that we evaluate the steps of the function from left to right, we come to the following confluence diagram:
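For $f$, both ways of grouping the operations converge on the same value (a sketch of the diamond):

$$
\begin{array}{ccc}
 & x * x / x & \\
\swarrow & & \searrow \\
(x * x) / x & & x * (x / x) \\
\searrow & & \swarrow \\
 & x &
\end{array}
$$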
No matter which way we group the operations, we come to the same result. We can extend this notion when comparing $f$ and $h$. BTW, this "leap of logic" is something we can only do in this specific example by cheating (i.e. division is essentially multiplication in disguise, and multiplication is associative and commutative). However, it does give us a clue about what can be done, which I shall come back to later.
How Dgraph Currently Performs Queries
I am now going to describe how Dgraph performs queries in broad strokes. The finer details are unimportant and can be elided, but the general structure of how it's done is important.
When a query comes in, it first gets parsed. Parsing is boring, so let's ignore that. OK, maybe not: parsing transforms the query into an AST [2]. Like any good system [3], this AST is transformed into a graph of operationed data [4], which is then executed by `ProcessGraph`.
Here I want to clarify what I mean by "operationed" data. A `*SubGraph` holds data. How is the data acquired? It's acquired through `ProcessGraph`.

`ProcessGraph` is particularly interesting because it's a mega-function with the sequence of events hard-coded. To be sure, it is a concurrent function. But within the function, the order of operations is hard-coded as follows (in broad strokes):
1. Collect data from all the Alphas around the world.
2. If there are `@filter` clauses on each predicate, apply them concurrently.
3. Do facet stuff, if any.
4. Do sort and pagination stuff, if any.
5. Do counts and groupbys, if any.
6. Return results.
Bear in mind that because this is a graph of operationed data, there is some recursion going on: for each subgraph within the graph of operationed data, steps 1-6 happen, and the results are collected and passed on.
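To make that shape concrete, here is a minimal sketch in Go. Every name below is an illustrative stand-in rather than Dgraph's actual API; the point is only the hard-coded order plus the recursion:

```go
package query

import "context"

// SubGraph is a stand-in for Dgraph's query.SubGraph. Everything
// here is a sketch of the hard-coded order, not the real API.
type SubGraph struct {
	Children []*SubGraph
}

func (sg *SubGraph) fetchFromAlphas(ctx context.Context) error { return nil } // step 1
func (sg *SubGraph) applyFilters(ctx context.Context)          {}             // step 2
func (sg *SubGraph) applyFacets()                              {}             // step 3
func (sg *SubGraph) sortAndPaginate()                          {}             // step 4
func (sg *SubGraph) aggregate()                                {}             // step 5

// ProcessGraph runs steps 1-5 on sg in a fixed order, then recurses
// into each child subgraph before the results are passed back up
// (step 6).
func ProcessGraph(ctx context.Context, sg *SubGraph) error {
	if err := sg.fetchFromAlphas(ctx); err != nil {
		return err
	}
	sg.applyFilters(ctx)
	sg.applyFacets()
	sg.sortAndPaginate()
	sg.aggregate()
	for _, child := range sg.Children {
		if err := ProcessGraph(ctx, child); err != nil {
			return err
		}
	}
	return nil
}
```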
Isometry(ish) to Relational Algebra
If steps 1 to 6 seem awfully familiar, it's because you've implemented Codd's relational algebra at least once in your life. There is some sort of similarity, if not isometry, between what Dgraph does and relational algebra, enumerated in the table below:
| Dgraph | Codd |
|---|---|
| Collect data | Π (project) and ⋈ (join) |
| Filter | σ (select) |
| Facet | π (project) |
| Sort and Paginate | n/a (sorting and pagination are not part of the Codd algebra, but are part of all databases' operations; it's a "postprocessing" step, if you will) |
| Counts and GroupBys | Γ (group/aggregate) |
Most databases (except ones that lose data) generally have to follow Coddâs algebra (more or less). The innovations in databases are in the details of how the algebra is implemented.
For example, Dgraph's main contribution is not treating projections and joins as separate operations, but as one single, unified operation. In essence, the "join" operation disappears. Well, kinda. I'll explain further.
Detour on Query Rewriting and Query Optimization
OK, now that all of that has been set out, let's see what interesting things we can do.
First, recall `q` and `r`. Both are queries that have different meanings but return the same results.

Question: `q` vs `r`, which is faster?

Answer: `q` is faster.
Why? The `has` function is slow because it needs to look everywhere for data that has a particular predicate. The `type` function, by contrast, acts as an index: look up which nodes have a `<dgraph.type>` of `Person`, then follow the pointers to retrieve the correct data. It turns out that in this instance, following pointers is faster than performing a full scan of all the tablets.
Given that `q` is faster than `r`, might it be a sane idea to actually run `q` when the input `r` is given? This is the basis of query optimization: we simply rewrite the query into a more performant one, under the hood.
However, achieving this requires a very strong grasp of the semantics of the query language. As enumerated above, we really have two options. Here I shall enumerate the considerations we have to weigh in order to say that two queries A and B are equal.
Considerations
Denotational Semantics
If we are to use denotational semantics to decide whether two queries are the same thing, then we will have to rely heavily on the provided schema (see Potential Pitfalls below) to "guess", and perform heuristics to do the query rewriting.
Operational Semantics
If we are to use operational semantics to decide whether two queries are the same thing, then we just have to consider the list of operations in each query and, component by component, check whether they are equal. We will also have to account for the commutativity of the operations.
As an example, let's compare a "loose" operational semantics of the following compositions:
- filter then sort
- sort then filter

In this specific example, the composition of functions is commutative (very rare!).
However, it's trivial to see that filter-then-sort is the superior approach: filtering is an O(N) operation, while sorting is an O(N log N) operation.

Filter-then-sort: if filtering yields M elements, where M < N, then sorting what remains costs O(M log M), for roughly N + M log M operations in total.

Sort-then-filter: sorting takes O(N log N), and filtering afterwards takes N operations to yield the same M elements, for roughly N log N + N operations in total.
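To put rough numbers on it, suppose (purely for illustration) that $N = 1{,}000{,}000$ and the filter keeps $M = 1{,}000$ elements:

$$
\begin{aligned}
\text{filter-then-sort:} &\quad N + M\log_2 M \approx 10^6 + 10^4 \approx 1.01 \times 10^6 \\
\text{sort-then-filter:} &\quad N\log_2 N + N \approx 2 \times 10^7 + 10^6 \approx 2.1 \times 10^7
\end{aligned}
$$

That is roughly a 20x difference, and the gap widens as $N$ grows.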
The total number of operations is smaller for filter-then-sort, which makes for a good heuristic. With this heuristic in hand, we can now confidently rewrite all queries that have a sort-then-filter operation order into a filter-then-sort operation order.
Potential Pitfalls
`q` and `r` return the same results under the condition that the schema was as given. Assume now that the schema has changed to the following:
```
type Person {
  name
  friends
}

type Animal {
  name
  friends
}

name: string .
friends: [uid] .
```
Now the fact that the semantics of `r` differ from the semantics of `q` really matters: `q` and `r` now return different results!
What is a Query Planner?
And now we can ask the question: what is a query planner?
Recall from the earlier section that queries are declarative. That means that an implementor is free to implement whatever they want, as long as it returns the results the query is supposed to return.
Thus, a query planner is one that determines what operations (and in what order) should take place.
In Dgraph, there is a general "framework" for how a query is processed, as laid out in the section How Dgraph Currently Performs Queries. And indeed, right now, due to the concurrent nature of Dgraph's query processing, if you run the same query twice, the order of operations may not necessarily be the same in each run.
However, we are not going to talk about query planning on that level. We simply trust the Go scheduler's correctness.
There are two kinds of query planners we can write:
- A query planner that changes the order of operations in `ProcessGraph` based on heuristics (dynamic/"runtime" optimization).
- A query planner that rewrites queries into more performant ones and executes those (static/"compile time" optimization).
Both of these query planners require some sort of understanding of the semantics of the query language. We can then leverage the algebraic properties of the operations to perform the work of both kinds of query planners.
Just like with programming languages, we may prefer one over the other, but all programming languages have a runtime and a compile-time component. An efficient programming language will optimize both phases.
The Order of Operation of Writing a Query Planner
Assuming we do finally go ahead and decide to write a query planner (again, I say this as a guy who isn't writing the query planner, and loves posturing), this is my humble opinion on how to proceed.
On the software planning end we have the following:
1. Decide on the notion of equality in the semantics of a query. What kind of semantics are we comparing? I strongly propose we consider the Operational Semantics.
2. Create a table that is a cross product of all pairs of unary operations; call this the composition table (a sketch follows this list).
3. Mark all unary operation pairs as either commutative or not.
4. Assign costs to the pairs.
5. ???
6. Profit!
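To make the composition table concrete, here is a minimal sketch in Go. All names and costs below are invented for illustration; real costs would have to come from profiling Dgraph's operations:

```go
package planner

// Op names a unary operation in the query pipeline.
type Op int

const (
	Filter Op = iota
	Sort
	Paginate
	Facet
	Aggregate
)

// Entry records whether a pair of operations commutes, and a rough
// cost of running them in the given order. The numbers here are
// placeholders, not measurements.
type Entry struct {
	Commutes bool
	Cost     float64
}

// compositions is the cross product of pairs of unary operations.
// Only a few pairs are shown; the real table covers every pair.
var compositions = map[[2]Op]Entry{
	{Filter, Sort}:     {Commutes: true, Cost: 1.0},  // filter-then-sort: cheap
	{Sort, Filter}:     {Commutes: true, Cost: 20.0}, // sort-then-filter: expensive
	{Paginate, Filter}: {Commutes: false},            // swapping changes the results
}

// preferSwap reports whether a-then-b should be rewritten to
// b-then-a: only when the pair commutes and the swap is cheaper.
func preferSwap(a, b Op) bool {
	fwd, ok1 := compositions[[2]Op{a, b}]
	rev, ok2 := compositions[[2]Op{b, a}]
	return ok1 && ok2 && fwd.Commutes && rev.Cost < fwd.Cost
}
```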
On the actual programming end we have the following:
1. Create a new package (presumably called `analysis` or something like that). The goal of this package is to convert the AST into a graph of operations. By separating the operations from the operationed data, we can proceed with better optimizations.
2. Symbolize all the operations in package `query`. This means breaking up `ProcessGraph` into named, bite-sized chunks, where each name corresponds to an operation (a function is good because Go has first-class functions, but we might also want to protobuf that shit).
3. Modify `SubGraph` in `query` to also take a list of operations that it needs to perform. This list of operations is a topologically sorted graph of operations (a sketch follows this list).
4. ??? (rewrite/reorder operations based on the costs in the heuristics from the process above)
5. Profit!
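And a hedged sketch of what steps 2 and 3 might look like, again with invented names rather than Dgraph's actual types:

```go
package query

// Operation is one named, bite-sized chunk of the old ProcessGraph.
// Because Go has first-class functions, the step itself can be a value.
type Operation struct {
	Name string
	Run  func(sg *SubGraph) error
}

// SubGraph now carries a topologically sorted list of the operations
// it must perform, instead of a hard-coded sequence.
type SubGraph struct {
	Ops      []Operation
	Children []*SubGraph
}

// Execute runs the (possibly planner-reordered) operations in order,
// then recurses into each child subgraph.
func (sg *SubGraph) Execute() error {
	for _, op := range sg.Ops {
		if err := op.Run(sg); err != nil {
			return err
		}
	}
	for _, c := range sg.Children {
		if err := c.Execute(); err != nil {
			return err
		}
	}
	return nil
}
```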
Footnotes
1. With databases, denotational semantics can get quite hairy, specifically if you ask the question in regards to models of the world. In that case, denotational semantics and the unlabelled semantics (#2 from the list) might be the same thing... But this is not the time or place to discuss it. Maybe on April 20th, and with some things burning.
2. Seriously, the naming convention sucks. Why is it called `gql`? Because once upon a time DQL used to be called GraphQL±.
3. Self plug. Duh. Any good complicated library will eventually need to build up a graph of operations.
4. I am not joking, the naming convention sucks. The data type is called `SubGraph` but the package is called `query`. How frustrating! But the way the data structure is used to orchestrate operations and collect data over a concurrent process is chef's kiss. Just beautiful.
Wow! What a thorough reply! Thank you for your time and thoughts!!
> The Postgres query planner is pure magic. Still blows my mind that this all works, and is automatic, and you benefit from it even if you don't know it exists. If RDBMSes didn't exist today and someone described Postgres query optimization, it would sound impossibly ambitious.

https://twitter.com/garybernhardt/status/1397236117032869894
^ That was what spurred this topic. Trying to figure out if this is a reality or really just something that would be nice to have but is an ambitious impossibility.
And that leads to something I would also like to point out: even with the very simple explanation you gave, in my opinion very, very few pairs of queries would be the same in "Operational Semantics", and I foresee a query planner only being effective if it can somehow see some data without querying the data, which is an ambitious impossibility.
To understand my thoughts, let's look at another potential pitfall, given even this very small sample of schema and these two queries. They return the same results only for the data that you provided as an example. Let's add another piece of data into the mix:
```
set {
  _:d <dgraph.type> "Person" .
  _:d <friends> _:a .
  _:e <name> "Lorem Ipsum" .
}
```
Here we added a node having the type Person with a friend but no name, and another node with a name predicate but no type. This mutation would be valid given the original first schema, which disproves the following point:

> `q` and `r` return the same results under the condition that the schema was as given

The schema is not what made the two queries return the same results; rather, it was the data that yielded the same results. What is even more frustrating is that Dgraph can be utilized in a schema-less style, sort of building a schema as it goes: when first seeing a predicate, it guesses at its type from the received value. That is great for flexibility, but makes things difficult when trying to implement a query planner. But that is just a side point.
And here are some of my ideas about what might be possible with a query planner, though it kinda gets dicey and, again, seems an ambitious impossibility.
1. Switching root and filter functions to reduce the known galaxy

```
{x(func: type(Person)) @filter(uid(0x2a)) { uid }}
```

and

```
{y(func: uid(0x2a)) @filter(type(Person)) { uid }}
```

will have the same results every time, no matter the schema or the data it contains. The difference is just optimization of the query, putting the more efficient function in the root. This gets further complicated, though, as the filter directive accepts logical combinators but the root function does not. Which leads to another idea:
2. Processing logic before the query

```
{m(func: type(Person)) @filter(uid(0x2a) OR uid(0x2b)) { ... }}
{n(func: uid(0x2a, 0x2b)) @filter(type(Person)) { ... }}
```

This example again uses the strategy in #1, but before that is applied, it processes the logic to know that `uid(0x2a) OR uid(0x2b)` can be reduced to `uid(0x2a, 0x2b)`.
I just recently finished writing an algorithm to logically reduce arrays of strings combined with AND, OR, and NOT down to a single array, either `[...]` or `NOT [...]`. As I worked through that, I also thought about a query planner and how it might be helpful, but only if the logic could be reduced in a way that led to an optimized query. My purpose was to simplify the application layer as a whole when performing multiple parallel queries and combining the responses into one final result.
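As a rough sketch of that reduction in Go (string-level and with invented names; a real implementation would operate on the parsed AST, and handle AND and NOT as well):

```go
package rewrite

import (
	"fmt"
	"strings"
)

// mergeUIDOr collapses a disjunction of uid() filters into a single
// uid() call with the union of the arguments, e.g.
//
//	uid(0x2a) OR uid(0x2b)  ->  uid(0x2a, 0x2b)
//
// Purely illustrative; Dgraph's own rewriting would not be string-based.
func mergeUIDOr(uidLists ...[]string) string {
	seen := map[string]bool{}
	var merged []string
	for _, list := range uidLists {
		for _, u := range list {
			if !seen[u] {
				seen[u] = true
				merged = append(merged, u)
			}
		}
	}
	return fmt.Sprintf("uid(%s)", strings.Join(merged, ", "))
}
```

Calling `mergeUIDOr([]string{"0x2a"}, []string{"0x2b"})` yields `uid(0x2a, 0x2b)`, which can then be hoisted into the root function as in #1.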
3. Rewriting queries with cascade

A longer discussion can be found over at V21.03: After pagination+cascade change, queries are too slow to finish. But a lot of this optimization can only be done if you have an idea, in advance, of how many nodes or predicates a query might return. Something like optimizing a cascade by adding `has` functions to a rewritten query might make it more performant if there are 6M nodes and only a small percentage have the predicates being cascaded; but if you know in advance that all of the nodes will have the edges being cascaded and just some of them will match the filter being applied, then adding a `has` filter reduces performance instead of increasing it.
4. Reversing a query and then recompiling the results returned

I see this as being the key to a graph query planner, but it involves the most work and requires some preconditions, such as every predicate having a reverse directive, and being able to quickly and efficiently get counts of predicates even before running the query that is being optimized (ambitious impossibility, idk?).

```
{a(func: type(Person)) { hobbies { name } }}
{b(func: type(Hobbies)) @filter(has(~hobbies)) { name }}
```

These two queries could return the same nodes of data, if this syntax is/were even possible for reverse predicates. But this really depends on knowing that no fields of the Person were returned, and that the query was really only getting which hobbies people have. This also brings in some ideas from graph theory about two graphs containing the same nodes with the same data being equivalent, even if those graphs have a different structure and repeating nodes. This is basically the theory of converting two subgraphs into n-quads, seeing whether the two sets of n-quads are equivalent, and deducing that the subgraphs contain the same exact data, just represented differently.
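A minimal sketch of that n-quad equivalence check in Go, with invented types (nothing like this exists in Dgraph today): flatten both subgraphs to n-quads and compare them as sets.

```go
package nquads

// Quad is a single flattened n-quad (subject, predicate, object).
type Quad struct {
	Subject, Predicate, Object string
}

// sameData reports whether two flattened subgraphs carry the same
// set of n-quads, ignoring the shape of the graphs they came from.
func sameData(a, b []Quad) bool {
	if len(a) != len(b) { // fast path; assumes quads are deduplicated
		return false
	}
	set := make(map[Quad]bool, len(a))
	for _, q := range a {
		set[q] = true
	}
	for _, q := range b {
		if !set[q] {
			return false
		}
	}
	return true
}
```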
At the root of a query planner is touching the fewest predicates, which this sort of handles, but only depending on the number of predicates. If there are millions of `type(Person)` nodes but vastly fewer `type(Hobbies)` nodes, then reversing the query would be more performant and would return the same graph data, but it would be impossible to return the same graph structure; again, this is why it seems impossibly ambitious.
And herein lies almost every problem with any kind of query planner/optimization that would really be performant:

> The shape of the graph the client asks for is the shape of the graph the response should return.
If I were to ask for query `a` above, I might get:
```
[
  { hobbies: [{ name: "hiking" }, { name: "programming" }] },
  { hobbies: [{ name: "camping" }] },
  { hobbies: [{ name: "running" }, { name: "hiking" }] },
  { hobbies: [{ name: "programming" }, { name: "running" }] },
  { hobbies: [{ name: "programming" }] },
  { hobbies: [{ name: "hiking" }] }
]
```
While query `b` might respond with:
```
[
  { name: "hiking" },
  { name: "programming" },
  { name: "running" },
  { name: "hiking" }
]
```
That contains the same nodes of data, if hobbies are what you really want, but a query planner cannot make assumptions about what you really want vs. what you asked for.

Now, if a table were returned instead of a graph, then maybe this would be more possible. But in DQL the query and the response are the same shape. It is not like other graph languages that let you form a query and then pick what you want back out of that query (Cypher/Gremlin); rather, forming the query and the response happens in a single operation.
Hence, this leads me to believe that a query planner might not really work well after all in a graph database; instead, a query planner would really just be a query rewriter/optimizer to do 1, 2, and 3, but nothing as complex as 4, the way an RDBMS query planner might do joins differently from how the client asked for them.
A factor in this might be the normalize directive, which I still do not understand fluently enough to make educated assumptions about how to optimize.
IMHO the biggest part of a query planner would be #3: rewriting queries that have post-query logic (cascade, pagination if cascaded, normalize?). This is the same as @chewxy's simple example of changing the order of sort/filter, which returns the same exact graph either way it is applied.
An Edit Mistake
I miswrote:

> `q` and `r` return the same results under the condition that the schema was as given

It should be:

> `q` and `r` return the same results under the condition that the schema and data were as given

I think an earlier version of the draft did use "data".
Yes and no. Under a strict schema regime, you would not have data that does not fit the schema. But as you say:

> Dgraph can be utilized in a schema-less style
On The Concrete Steps Listed
Your concrete steps (1, 2, 3, 4) are very, very excellent and specific elaborations/examples of the table-of-compositions process I listed above. I have some comments:
A key thing I feel has been slightly missed is that the `@cascade` directive or `~` are actually elements of the query language, and thus can be boiled down to operations. I do think this is jumping to solutions before a blueprint is even created (e.g. the table of compositions).
On Denotational Semantics
That's what the whole denotational semantics shebang is all about. You can only make a rewrite if you know that the result of the rewrite will yield the same data (the same denotation).

The main issue is that you run into "runtime" issues, requiring you to look at the data before you look at the data. Enforcing a strict schema would allow you to avoid doing that, so that you look at the metadata (the schema) before you look at the data.
On the Suitability of a Query Planner for DQL
I think the flexibility of the query language specifically allows for a query planner to be written! Other than directives (which are "action" words, or "commands" if you are used to 1960s-style PLT lingo), everything else is declarative. That leaves a LOT of freedom to reorder the operations into whatever best fits the performance characteristics. A key is to lean on Codd's relational algebra, or to create something similar.
On @normalize
My understanding is that `@normalize` is just an `fmap`. It's just an operation.