Query Planner

Dgraph keeps hinting at a query planner, and it is on the roadmap. Are there any public discussions or think tanks on how to build a query planner in Dgraph?

2 Likes

I still do not know what form a proposed query planner would take. In several RDBMSes, query planners:

  • silently work all the time, on every query, to optimize execution patterns based on index size/contents
  • power tooling that prints detailed execution plans for human consumption (e.g. EXPLAIN)
  • must be explicitly bypassed (e.g. MySQL’s STRAIGHT_JOIN)

Can we use this post as a prompt to start an RFC? Or has one already been compiled internally at Dgraph that we can peek at?

1 Like

(Note: I’m not involved with writing the query planner, nor do I have visibility into its current state. But I have experience ripping out the Postgres query planner and playing around with it, and I also enjoy inventing new programming languages as a hobby.)

Here are some brief and unrigorous thoughts on the subject.

Pieces of the Puzzle

A query is mostly declarative. Given a schema

type Person {
     name
     friends
}

name: string . 
friends: [uid] . 

and the following example dataset depicting an (undirected) cycle of friends,

set {
    _:a <dgraph.type> "Person" . 
    _:a <name> "we live in a society" .
    _:b <dgraph.type> "Person" .
    _:b <name> "69420" .
    _:c <dgraph.type> "Person" .
    _:c <name> "Wololo" . 

     _:a <friends> _:b .
     _:b <friends> _:c .
     _:a <friends> _:c .
}

This query

q (func: type(Person)) {
     name 
     friends { name }
}

is pretty straightforward in showing what you are asking for:

Show me all nodes that are marked Person. I want their names, and their friends’ names

The result is as follows:

[
  {
    "name": "69420",
    "friends": [
      {
        "name": "Wololo"
      }
    ]
  },
  {
    "name": "Wololo"
  },
  {
    "name": "we live in a society",
    "friends": [
      {
        "name": "69420"
      },
      {
        "name": "Wololo"
      }
    ]
  }
]

Now consider this query:

r(func: has(name)){
     name
     friends { name }
}

Written out in English, we can read it as:

Show me all nodes that have a predicate name. I want their name and friends’ names. (IMPLICIT: if there are no friends, then just show me the name)

The results are as follows:

[
  {
    "name": "69420",
    "friends": [
      {
        "name": "Wololo"
      }
    ]
  },
  {
    "name": "Wololo"
  },
  {
    "name": "we live in a society",
    "friends": [
      {
        "name": "69420"
      },
      {
        "name": "Wololo"
      }
    ]
  }
]

Observe that queries q and r return the same results. To philosophers, this can be quite vexing. What does it mean when two queries give the same result?

Does it mean:

  1. That queries q and r denote the same thing?
  2. That queries q and r mean the same thing?
  3. That queries q and r perform the same operations to get the same result?

We can instantly rule out 2. We can simply compare the written-out English sentences, and conclude from the meaning of the English sentence that queries q and r mean different things.

There are “official” names for the study of these things of course, fancy schmancy sounding things like:

  1. Denotational Semantics [^denotation]
  2. (Abstract) Semantics
  3. Operational Semantics

But we’re not really interested in any of that, except perhaps Operational Semantics.

Detour on Operational Semantics and Confluence

The big picture view of Operational Semantics is that the operations of a query/expression/function/program/etc are codified and verifiable.

Consider the following functions:

$$
\begin{aligned}
f(x) &= x \cdot x / x \\
g(x) &= 1 \cdot x
\end{aligned}
$$

These two functions, $f$ and $g$, will always return the same answer (at least for nonzero $x$). However, when we consider the operational semantics, they are different: $g$ performs only one multiplication, while $f$ performs a multiplication and a division.

And now for the trick question. Consider the function $h$, defined as follows

$$
h(x) = x / x * x
$$

Does $h$ do the same things as $f$ ?

To answer this we need the idea of confluence, or the diamond property.

Assuming that we evaluate the steps of the function from left to right, we arrive at the following confluence diagram:

[image: confluence diagram showing that both groupings of $h$’s operations reduce to the same result]

No matter which way we group the operations, we come to the same result. We can extend this notion when comparing $f$ and $h$. By the way, this “leap of logic” is something we can only do in this specific example by cheating (i.e. division is essentially multiplication in disguise, and multiplication is associative and commutative). However, it does give us a clue about what can be done, which I shall come back to later.
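As a toy illustration (hypothetical code, nothing to do with Dgraph internals), we can check by brute force that $f$, $g$, and $h$ agree on every nonzero input, even though the operations they perform differ:

```go
package main

import "fmt"

// f(x) = x * x / x : one multiplication, then one division
func f(x float64) float64 { return x * x / x }

// g(x) = 1 * x : a single multiplication
func g(x float64) float64 { return 1 * x }

// h(x) = x / x * x : the same two operations as f, in the other order
func h(x float64) float64 { return x / x * x }

func main() {
	// Denotationally equal on nonzero inputs, operationally different.
	for _, x := range []float64{1, 2, 3.5, -7} {
		fmt.Println(f(x) == g(x) && g(x) == h(x)) // true for each x
	}
}
```

The interesting case the code avoids is x = 0: f and h divide by zero while g does not, which is exactly the kind of edge case a rewriting optimizer has to prove safe before swapping one form for another.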

How Dgraph Currently Performs Queries

I am now going to describe how Dgraph performs queries in broad strokes. The finer details are unimportant and can be elided. But the general structure of how it’s done is important.

When a query comes in, it first gets parsed. Parsing is boring, so let’s ignore that. OK, maybe not. Parsing a query transforms the query into an AST[^naming1]. Like any good system[^self], this AST is transformed into a graph of operationed data[^naming2], which is then executed by ProcessGraph.

Here I want to clarify what I mean by “operationed” data. A *SubGraph holds data. How is the data acquired? It’s acquired through ProcessGraph.

ProcessGraph is particularly interesting because it’s a mega-function with the sequence of events hard-coded. To be sure, it is a concurrent function. But within it, the order of operations is hard-coded as follows (in broad strokes):

  1. Collect data from all the Alphas around the world.
  2. If there are @filter clauses on any predicate, apply them concurrently.
  3. Do facet stuff, if any.
  4. Do sort and pagination stuff, if any.
  5. Do counts and groupbys, if any.
  6. Return results.

Bear in mind that because this is a graph of operationed data, there is some recursion going on. So for each subgraph in the graph of operationed data, steps 1-6 happen, and the results are collected and passed on.
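A heavily simplified sketch of that hard-coded ordering (all type and function names here are hypothetical stand-ins; the real ProcessGraph in package query is far more involved, and runs its steps and children concurrently):

```go
package main

import "fmt"

// SubGraph is a stand-in for Dgraph's query.SubGraph: it holds both the
// operation parameters (filters, ordering, pagination) and the fetched data.
type SubGraph struct {
	Attr     string
	Children []*SubGraph
	uids     []uint64
}

// processGraph mirrors the hard-coded order of operations described above.
func processGraph(sg *SubGraph) {
	fetchFromAlphas(sg) // 1. collect data from the Alphas
	applyFilters(sg)    // 2. apply @filter clauses
	applyFacets(sg)     // 3. facet stuff, if any
	sortAndPaginate(sg) // 4. sort and pagination, if any
	countAndGroupBy(sg) // 5. counts and groupbys, if any
	// 6. results now live in sg; recurse into each child subgraph
	for _, child := range sg.Children {
		processGraph(child)
	}
}

// Dummy step implementations, just to make the pipeline shape concrete.
func fetchFromAlphas(sg *SubGraph) { sg.uids = []uint64{0x2a, 0x2b} }
func applyFilters(sg *SubGraph)    {}
func applyFacets(sg *SubGraph)     {}
func sortAndPaginate(sg *SubGraph) {}
func countAndGroupBy(sg *SubGraph) {}

func main() {
	root := &SubGraph{Attr: "name", Children: []*SubGraph{{Attr: "friends"}}}
	processGraph(root)
	fmt.Println(len(root.uids)) // 2
}
```

The point to take away is that the *sequence* is baked into the function body: there is no data structure a planner could inspect or reorder.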

Isomorphism(ish) to Relational Algebra

If steps 1 to 6 seem awfully familiar, you know you’ve implemented Codd’s relational algebra at least once in your life. There is some sort of similarity, if not isomorphism, between what Dgraph does and relational algebra, enumerated in the table below:

Dgraph              | Codd
------------------- | ----
Collect data        | Π and ⋈
Filter              | σ
Facet               | σ
Sort and Paginate   | n/a (sorting and pagination are not part of Codd’s algebra, but are part of every database’s operations - a “postprocessing” step, if you will)
Counts and GroupBys | n/a (aggregation and grouping are likewise later extensions to the relational algebra)

Most databases (except ones that lose data) generally have to follow Codd’s algebra (more or less). The innovations in databases are in the details of how the algebra is implemented.

For example, Dgraph’s main contribution is not treating projections and joins as separate operations, but as one unified operation. In essence, the “join” operation disappears. Well, kinda. I’ll explain further.

Detour on Query Rewriting and Query Optimization

OK, now that all that has been set out, let’s see what interesting things we can do.

First, recall q and r. Both are queries that have different meanings but return the same results.

Question: q vs r, which is faster?

Answer: q is faster.

Why? The has function is slow because it needs to look everywhere for data that has a particular predicate. The type function, by contrast, acts like an index: look up which nodes have a <dgraph.type> of Person, then follow the pointers to retrieve the correct data. It turns out that, in this instance, following pointers is faster than performing a full scan of all the tablets.
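The difference is essentially an index lookup versus a full scan. In toy form (a sketch with made-up types; this is not how tablets or indices are actually laid out in Dgraph):

```go
package main

import "fmt"

type node struct {
	uid   uint64
	preds map[string]string
}

// typeIndex plays the role of the <dgraph.type> index: type name -> uids.
var typeIndex = map[string][]uint64{"Person": {1, 2}, "Animal": {3}}

// byType models type(Person): one index lookup, then follow the pointers.
func byType(t string) []uint64 { return typeIndex[t] }

// hasPred models has(name): no index consulted, so scan every node.
func hasPred(nodes []node, pred string) []uint64 {
	var out []uint64
	for _, n := range nodes { // O(N) full scan of all the "tablets"
		if _, ok := n.preds[pred]; ok {
			out = append(out, n.uid)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{1, map[string]string{"name": "we live in a society"}},
		{2, map[string]string{"name": "69420"}},
		{3, map[string]string{"name": "Wololo"}},
	}
	fmt.Println(byType("Person"))       // [1 2]
	fmt.Println(hasPred(nodes, "name")) // [1 2 3]
}
```

The lookup cost of byType is independent of the total node count; hasPred grows linearly with it, which is why the rewrite pays off as the dataset grows.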

Given that q is faster than r, might it be a sane idea to actually run q when the input r is given? This is the basis of query optimization: we simply rewrite the query into a more performant one under the hood.

However, achieving this requires a very strong grasp of the semantics of the query language. Of the semantics enumerated above, we really have two workable options: denotational and operational. Here I shall enumerate the considerations we have to weigh in order to say that two queries A and B are equal.

Considerations

Denotational Semantics

If we are to use denotational semantics to identify whether two queries are the same thing, then we will have to rely heavily on the provided schema (see Potential Pitfalls below) to “guess”, and perform heuristics to do the query rewriting.

Operational Semantics

If we are to use operational semantics to identify whether two queries are the same thing, then we just have to consider the list of operations in each query and, component by component, identify whether they are equal. We will also have to account for the commutativity of the operations.

As an example, let’s compare a “loose” operational semantics of the following compositions:

  • filter then sort
  • sort then filter

In this specific example, the composition of functions is commutative (very rare!).

However, it’s easy to see that filter-then-sort is the superior approach: filtering is an O(N) operation, while sorting is an O(N log N) operation.

Filter-then-sort: if filtering yields M elements, where M < N, then sorting afterwards costs O(M log M), for O(N + M log M) total.
Sort-then-filter: sorting takes O(N log N), and filtering afterwards takes N operations to yield the same M elements, for O(N log N + N) total.

The total number of operations is smaller for filter-then-sort. This makes a good heuristic.

So with this heuristic in hand, we can now confidently rewrite all queries that have a sort-then-filter operation order into a filter-then-sort order.

Potential Pitfalls

q and r return the same results under the condition that the schema was as given. Assume now that the schema has changed to the following:

type Person {
     name
     friends
}

type Animal {
    name
    friends
}

name: string . 
friends: [uid] . 

Now the fact that the semantics of r differ from the semantics of q really matters! q and r now return different results!

What is a Query Planner?

And now we can ask the question: what is a query planner?

Recall from the earlier section that queries are declarative. That means an implementor is free to implement whatever they want, as long as it returns the results the query is supposed to return.

Thus, a query planner is one that determines what operations (and in what order) should take place.

In Dgraph, there is a general “framework” of how a query is processed, as laid out in the section How Dgraph Currently Performs Queries. And indeed, right now, due to the concurrent nature of Dgraph’s query processing, if you run the same query twice, the order of operations may not necessarily be the same in each run.

However we are not going to talk about query planning on that level. We simply trust the Go scheduler’s correctness.

There are two kinds of query planners we can write:

  1. Query planner that changes the order of ProcessGraph based on heuristics (Dynamic/“runtime” optimization).
  2. Query planner that rewrites queries into more performant ones and execute them (Static/“compile time” optimization).

Both of these query planners require some sort of understanding of the semantics of the query language. We can then leverage the algebraic properties of the operations to perform the work of both kinds of query planners.

Just as with programming languages, we may prefer one over the other, but all programming languages have a runtime and a compile-time component. An efficient language implementation will optimize both phases.

The Order of Operation of Writing a Query Planner

Assuming we do finally go ahead and decide to write a query planner (again, I say this as a guy who isn’t writing the query planner and loves posturing), this is my humble opinion on how to proceed.

On the software planning end we have the following:

  1. Decide on the notion of equality in the semantics of a query. What kind of semantics are we comparing? I strongly propose we consider the Operational Semantics.
  2. Create a table that is a cross product of all pairs of unary operations - call this the composition table.
  3. Mark all unary operation pairs as either commutative or not.
  4. Assign costs to the pairs.
  5. ???
  6. Profit!
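The composition table could start life as plain data. A sketch with made-up operation names and properties (none of these identifiers exist in Dgraph today):

```go
package main

import "fmt"

// Op names a unary operation from the query language.
type Op string

const (
	Filter   Op = "filter"
	Sort     Op = "sort"
	Paginate Op = "paginate"
)

// pair records how two unary operations compose.
type pair struct {
	commutative bool // can the two ops be swapped without changing results?
	preferFirst bool // if commutative, is running the first-listed op first cheaper?
}

// compositions is the cross product of unary operation pairs: the
// "composition table" from step 2, with steps 3 and 4 filled in.
var compositions = map[[2]Op]pair{
	{Filter, Sort}:     {commutative: true, preferFirst: true}, // filter-then-sort is cheaper
	{Sort, Paginate}:   {commutative: false},                   // paginating before sorting changes results
	{Filter, Paginate}: {commutative: false},
}

func main() {
	p := compositions[[2]Op{Filter, Sort}]
	fmt.Println(p.commutative, p.preferFirst) // true true
}
```

A rewriter would then walk a query’s operation list, look up each adjacent pair, and swap whenever the table says the pair commutes and the other order is cheaper.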

On the actual programming end we have the following:

  1. Create a new package (presumably called analysis or something like that). The goal of this package is to convert the AST into a graph of operations. By separating out the operations from the operationed data, we can proceed with better optimizations.
  2. Symbolize all the operations in package query - this means breaking up ProcessGraph into named, bite-sized chunks, where each name corresponds to an operation (a function is good because Go has first-class functions, but we might also want to protobuf that shit).
  3. Modify SubGraph in query to also take a list of operations that it needs to perform. This list of operations is a topologically sorted graph of operations.
  4. ??? (rewrite/reorder operations based on the costs in the heuristics from the process above)
  5. Profit!
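Steps 2 and 3 might look something like the following. Everything here is hypothetical: SubGraph is a stand-in for the real type in package query, and Operation/Plan are names I just made up:

```go
package main

import "fmt"

// SubGraph stands in for query.SubGraph: parameters plus fetched data.
type SubGraph struct {
	Attr string
	data []uint64
}

// Operation is a named, bite-sized chunk of what ProcessGraph does today.
// A first-class function works well in Go; a serializable (protobuf)
// representation would let plans cross the wire between nodes.
type Operation struct {
	Name string
	Run  func(*SubGraph) error
}

// Plan is the topologically sorted list of operations attached to a SubGraph.
type Plan []Operation

// Execute runs the operations in order, naming the failing step on error.
func (p Plan) Execute(sg *SubGraph) error {
	for _, op := range p {
		if err := op.Run(sg); err != nil {
			return fmt.Errorf("%s: %w", op.Name, err)
		}
	}
	return nil
}

func main() {
	plan := Plan{
		{Name: "fetch", Run: func(sg *SubGraph) error { sg.data = []uint64{1, 2, 3}; return nil }},
		{Name: "filter", Run: func(sg *SubGraph) error { sg.data = sg.data[:1]; return nil }},
	}
	sg := &SubGraph{Attr: "name"}
	if err := plan.Execute(sg); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(sg.data)) // 1
}
```

Once the plan is data rather than hard-coded control flow, step 4 (reordering based on the composition table’s costs) becomes a transformation over a slice instead of surgery on a mega-function.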

Footnotes

[^denotation]: with databases, denotational semantics can get quite hairy - specifically if you ask questions with regard to models of the world. In that case, denotational semantics and the abstract semantics (#2 from the list) might be the same thing… But this is not the time or place to discuss it. Maybe on April 20th, with something burning.
[^self]: self plug. Duh. Any good complicated library will eventually need to build up a graph of operations.
[^naming1]: seriously, the naming convention sucks. Why is it called gql? Because once upon a time DQL used to be called GraphQL±.
[^naming2]: I am not joking. The naming convention sucks. The data type is called SubGraph, but the package is called query. How frustrating! But the way the data structure is used to orchestrate operations and collect data over a concurrent process is chef’s kiss. Just beautiful.

3 Likes

Wow! What a thorough reply! Thank you for your time and thoughts!!

The Postgres query planner is pure magic. Still blows my mind that this all works, and is automatic, and you benefit from it even if you don’t know it exists. If RDBMSes didn’t exist today and someone described Postgres query optimization, it would sound impossibly ambitious. https://twitter.com/garybernhardt/status/1397236117032869894

^ That was what spurred this topic. Trying to figure out whether this is a reality, or really just something that would be nice to have but is an ambitious impossibility.

And that leads to something I would also like to point out, even with the very simple explanation you gave: why, IMO, very few pairs of queries would be the same under “Operational Semantics”, and why I foresee a query planner only being effective if it can somehow see some data without querying the data, which is an ambitious impossibility.

To understand my thoughts, let’s look at another potential pitfall, given even this very small sample schema and the two queries q and r.

This holds only for the data that you provided as an example. Let’s add another piece of data into the mix:

set {
  _:d <dgraph.type> "Person" .
  _:d <friends> _:a .
  _:e <name> "Lorem Ipsum" .
}

Here we added a node having the type Person with a friend but no name, and another node with a name predicate but no type. This mutation would be valid given the original schema, which disproves the earlier claim that q and r return the same results so long as the schema is as given.

The schema is not what made the two queries return the same results; rather, it was the data that yielded the same results. What is even more frustrating is that Dgraph can be utilized in a schema-less style, sort of building a schema as it goes: when it first sees a predicate, it guesses at its type from the received value. That is great for flexibility, but makes things difficult when trying to implement a query planner. But that is just a side point.

And here are some of my ideas about what might be possible with a query planner, though it kinda gets dicey and again seems an ambitious impossibility.

  1. Switching root and filter functions to reduce the known galaxy
{x(func: type(Person)) @filter(uid(0x2a)) { uid }}

and

{y(func: uid(0x2a)) @filter(type(Person)) { uid }}

These will have the same results every time, no matter the schema or the data it contains. The difference is just query optimization: putting the more efficient function in the root. This gets further complicated, though, as the @filter directive accepts logical combinators but the root function does not. Which leads to another idea:

  2. Processing logic before the query
{m(func: type(Person)) @filter(uid(0x2a) OR uid(0x2b)){ ... }}
{n(func: uid(0x2a, 0x2b)) @filter(type(Person)){ ... }}

This example again uses the strategy in #1, but before that is applied, it processes the logic to know that uid(0x2a) OR uid(0x2b) can be reduced to uid(0x2a, 0x2b).
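That uid(0x2a) OR uid(0x2b) → uid(0x2a, 0x2b) reduction could be sketched as a rewrite over a tiny filter AST. These types are hypothetical, not Dgraph’s actual parser output:

```go
package main

import "fmt"

// FilterNode is a toy filter expression: either a uid(...) leaf
// or an OR over child expressions.
type FilterNode struct {
	Op       string // "uid" or "or"
	UIDs     []uint64
	Children []*FilterNode
}

// mergeUIDOr rewrites an OR whose children are all uid(...) leaves
// into a single uid(...) leaf with the union of the uid lists.
func mergeUIDOr(n *FilterNode) *FilterNode {
	if n.Op != "or" {
		return n
	}
	var merged []uint64
	for _, c := range n.Children {
		c = mergeUIDOr(c) // reduce nested ORs first
		if c.Op != "uid" {
			return n // a non-uid child: leave this OR alone
		}
		merged = append(merged, c.UIDs...)
	}
	return &FilterNode{Op: "uid", UIDs: merged}
}

func main() {
	// uid(0x2a) OR uid(0x2b)
	f := &FilterNode{Op: "or", Children: []*FilterNode{
		{Op: "uid", UIDs: []uint64{0x2a}},
		{Op: "uid", UIDs: []uint64{0x2b}},
	}}
	g := mergeUIDOr(f)
	fmt.Println(g.Op, g.UIDs) // uid [42 43]
}
```

The merged uid(...) leaf is then eligible for the root/filter swap from #1, since a bare uid list is something the root function can accept.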

I recently finished writing an algorithm to logically reduce arrays of strings combined with AND, OR, and NOT down to a single array, either [...] or NOT [...]. As I worked through that, I also thought about a query planner and how it might be helpful, but only if the logic could be processed in a way that led to an optimized query. My purpose was to simplify the application layer as a whole when performing multiple parallel queries and combining their responses into one final result.

  3. Rewriting queries with cascade

There is another, longer discussion over at V21.03: After pagination+cascade change, queries are too slow to finish. A lot of this optimization can only be done if you have an idea in advance of how many nodes or predicates a query might return. Something like optimizing a cascade by adding has functions to a rewritten query might make it more performant if there are 6M nodes and only a small percentage have the predicates being cascaded. But if you know in advance that all of the nodes will have the edges being cascaded, and just some of them will match the filter being applied, then adding a has filter reduces performance instead of increasing it.

  4. Reversing a query and then recompiling the results returned

I see this as the key to a graph query planner, but it involves the most work and requires some prerequisites, such as every predicate having a @reverse directive, and being able to quickly and efficiently get counts of predicates even before running the query being optimized (ambitious impossibility, idk?).

{a(func: type(Person)) { hobbies { name } }}
{b(func: type(Hobbies)) @filter(has(~hobbies)) { name }}

These two queries could return the same nodes of data, if this syntax is/were even possible for reverse predicates. But this really depends on knowing that no fields of the Person were wanted, and that the query was really only after which hobbies people have. This also brings in some ideas from graph theory: two graphs containing the same nodes with the same data can be considered equivalent even if they have a different structure and repeated nodes. This is basically the theory of converting two subgraphs into N-Quads and seeing whether the two sets of N-Quads are equivalent, deducing that the subgraphs contain the exact same data, just represented differently.

At the root of a query planner is touching the fewest predicates, which this sort of handles, but only depending on the number of predicates. If there are millions of type(Person) but orders of magnitude fewer type(Hobbies), then reversing the query would be more performant and return the same graph data, but it would be impossible to return the same graph structure, which again is why this seems impossibly ambitious.


And herein lies almost every problem with any kind of query planner/optimization that would really be performant:

The shape of the graph the client asks for is the shape of the graph the response should return

If I were to ask for query a above I might get:

[
  { hobbies: [{ name: "hiking" },{ name: "programming" }] },
  { hobbies: [{ name: "camping" }] },
  { hobbies: [{ name: "running" }, { name: "hiking" }] },
  { hobbies: [{ name: "programming" }, { name: "running" }] },
  { hobbies: [{ name: "programming" }] },
  { hobbies: [{ name: "hiking" }] }
]

While query b might respond with:

[
  { name: "hiking" },
  { name: "programming" },
  { name: "running" },
  { name: "hiking" }
]

Which contains the same nodes of data, if hobbies are what you really want; but a query planner cannot make assumptions about what you really want vs. what you asked for.

Now if a table were returned instead of a graph, then maybe this would be more possible. But in DQL, the query and the response share the same shape. It is not like other graph languages (Cypher/Gremlin) that let you form a query and then pick what you want back out of it; rather, DQL forms the query and the response in a single operation.

Hence I am led to believe that a query planner might not really work well after all in a graph database; instead, a query planner would really just be a query rewriter/optimizer doing 1, 2, and 3, but nothing as complex as 4, the way an RDBMS query planner might perform joins differently from how the client asked for them.

A factor in this might be the @normalize directive, which I still do not understand fluently enough to make educated assumptions about how to optimize.

IMHO, the biggest part of a query planner would be #3: rewriting queries that have post-query logic (cascade, pagination if cascaded, normalize?). This is the same as @chewxy’s simple example of changing the order of sort/filter, since it returns the same exact graph either way.

1 Like

An Edit Mistake

I miswrote:

q and r return the same results under the condition that the schema was as given

should be

q and r return the same results under the condition that the schema and data were as given

I think an earlier version of the draft did use data.

Yes and no. Under a strict schema regime, you would not have data that does not fit the schema. But, as you say, Dgraph can also be used schema-less, which breaks that assumption.

On The Concrete Steps Listed

Your concrete steps (1, 2, 3, 4) are excellent and specific elaborations/examples of the table-of-compositions process I listed above. I have some comments:

A key thing I feel has been slightly missed is that the @cascade directive and ~ (reverse) are actually elements of the query language, and thus can be boiled down to operations. I do think this is jumping to solutions before a blueprint is even created (e.g. the table of compositions).

On Denotational Semantics

That’s what the whole denotational semantics shebang is all about. You can only make a rewrite if you know that the result of the rewrite will yield the same data (same denotation).

The main issue is that you run into “runtime” issues, requiring you to look at the data before you look at the data. Enforcing a strict schema would allow you to avoid that: you look at the metadata (the schema) before you look at the data.

On the Suitability of a Query Planner for DQL

I think the flexibility of the query language specifically allows for a query planner to be written! Other than directives (which are “action” words, or “commands” if you are used to 1960s-style PLT lingo), everything else is declarative. That leaves a LOT of freedom to reorder the operations into whatever best fits the performance characteristics. A key is to lean on Codd’s relational algebra, or to create something similar.

On @normalize

My understanding is that @normalize is just an fmap. It’s just an operation.

1 Like