Query Planner

Wow! What a thorough reply! Thank you for your time and thoughts!!

The Postgres query planner is pure magic. Still blows my mind that this all works, and is automatic, and you benefit from it even if you don’t know it exists. If RDBMSes didn’t exist today and someone described Postgres query optimization, it would sound impossibly ambitious. https://twitter.com/garybernhardt/status/1397236117032869894

^ That was what spurred this topic: trying to figure out whether this is a reality or really just something that would be nice to have but is an ambitious impossibility.

And that leads to something I would also like to point out, even with the very simple explanation you gave below: IMO, very few pairs of queries would be the same in "Operational Semantics", and I foresee a query planner only being effective if it can somehow see some of the data without querying the data, which is an ambitious impossibility.

To understand my thoughts, let's look at another potential pitfall, given even this very small sample of schema and these two queries.

That only holds for the data that you provided as an example. Let's add another piece of data into the mix:

set {
  _:d <dgraph.type> "Person" .
  _:d <friends> _:a .
  _:e <name> "Lorem Ipsum" .
}

Here we added a node having the type Person with a friend but no name, and another node with a name predicate but no type. This mutation would be valid given the original schema, which disproves the following point:

The schema is not what made the two queries return the same results; rather, it was the data that yielded the same results. What is even more frustrating is that Dgraph can be utilized in a schema-less style, building a schema as it goes: when it first sees a predicate, it guesses its type from the value it receives. That is great for flexibility, but it makes things difficult when trying to implement a query planner. But that is just a side point.

And here are some of my ideas about what might be possible with a query planner, though it gets dicey and again seems an ambitious impossibility.

  1. Switching root and filter functions to reduce the known galaxy
{x(func: type(Person)) @filter(uid(0x2a)) { uid }}

and

{y(func: uid(0x2a)) @filter(type(Person)) { uid }}

These two queries will have the same results every time, no matter the schema or the data. The difference is just query optimization, putting the more efficient function in the root (a sketch of this simple swap follows below). This gets further complicated, though, because the @filter directive accepts logical combinators while the root function does not, which leads to another idea.
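To make the mechanical part of #1 concrete before moving on: here is a minimal sketch of such a rewrite rule, assuming a toy AST and a made-up cost heuristic. None of this is Dgraph's actual planner; the shapes and numbers are my own invention for illustration.

// Hypothetical, simplified AST for a single-block DQL query.
// Real DQL is far richer; this models only `func` plus one @filter.
type Func =
  | { kind: "type"; value: string }
  | { kind: "uid"; uids: string[] };

interface QueryBlock {
  root: Func;
  filter?: Func;
}

// Invented cost heuristic: uid() touches a fixed, known set of nodes,
// while type() may have to touch every node of that type.
function cost(f: Func): number {
  return f.kind === "uid" ? f.uids.length : Number.MAX_SAFE_INTEGER;
}

// Swap root and filter when seeding the traversal from the filter is
// cheaper. Safe here because root + filter is an intersection either way.
function swapRootAndFilter(q: QueryBlock): QueryBlock {
  if (q.filter && cost(q.filter) < cost(q.root)) {
    return { root: q.filter, filter: q.root };
  }
  return q;
}

// {x(func: type(Person)) @filter(uid(0x2a))} becomes
// {x(func: uid(0x2a)) @filter(type(Person))}
console.log(swapRootAndFilter({
  root: { kind: "type", value: "Person" },
  filter: { kind: "uid", uids: ["0x2a"] },
}));

The heuristic only encodes the intuition above: uid() starts from a fixed, known set of nodes, while type() may have to walk everything of that type.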

  2. Processing logic before the query
{m(func: type(Person)) @filter(uid(0x2a) OR uid(0x2b)){ ... }}
{n(func: uid(0x2a, 0x2b)) @filter(type(Person)){ ... }}

This example again uses the strategy in #1, but before that is applied, it processes the logic to know that uid(0x2a) OR uid(0x2b) can be reduced to uid(0x2a, 0x2b).

I recently finished writing an algorithm to logically reduce arrays of strings combined with AND, OR, and NOT down to a single array, either [...] or NOT [...]. As I worked through that, I also thought about a query planner and how it might be helpful, but only if the logic could be worked out in a way that led to an optimized query. My purpose was to simplify the application layer as a whole when performing multiple parallel queries and combining the responses into one final result.
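Here is a minimal sketch of that kind of reduction (my own reconstruction for illustration, not the exact algorithm I wrote): a filter is either an include-list or an exclude-list, and AND, OR, and NOT always collapse two of them back into one using plain set operations. The same shape of rewrite is what turns uid(0x2a) OR uid(0x2b) into uid(0x2a, 0x2b):

// A filter is either "keep only these" or "keep everything except these".
type Filter =
  | { mode: "include"; items: Set<string> }
  | { mode: "exclude"; items: Set<string> };

const inter = (a: Set<string>, b: Set<string>) =>
  new Set([...a].filter((x) => b.has(x)));
const union = (a: Set<string>, b: Set<string>) => new Set([...a, ...b]);
const diff = (a: Set<string>, b: Set<string>) =>
  new Set([...a].filter((x) => !b.has(x)));

function not(f: Filter): Filter {
  return f.mode === "include"
    ? { mode: "exclude", items: f.items }
    : { mode: "include", items: f.items };
}

function and(a: Filter, b: Filter): Filter {
  if (a.mode === "include" && b.mode === "include")
    return { mode: "include", items: inter(a.items, b.items) };
  if (a.mode === "exclude" && b.mode === "exclude")
    return { mode: "exclude", items: union(a.items, b.items) };
  // Mixed case: keep the include set, minus the excluded items.
  const inc = a.mode === "include" ? a : b;
  const exc = a.mode === "include" ? b : a;
  return { mode: "include", items: diff(inc.items, exc.items) };
}

// De Morgan: a OR b == NOT (NOT a AND NOT b)
function or(a: Filter, b: Filter): Filter {
  return not(and(not(a), not(b)));
}

// uid(0x2a) OR uid(0x2b)  ->  include {0x2a, 0x2b}, i.e. uid(0x2a, 0x2b)
console.log(or(
  { mode: "include", items: new Set(["0x2a"]) },
  { mode: "include", items: new Set(["0x2b"]) },
));

Because every combination collapses back into a single include or exclude set, arbitrarily nested boolean logic reduces to one array, which is exactly the kind of thing a root function can consume.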

  3. Rewriting queries with cascade

Another, longer discussion can be found over at V21.03: After pagination+cascade change, queries are too slow to finish. But a lot of this optimization can only be done if you have an idea in advance of how many nodes or predicates a query might return. Rewriting a cascade query to add has functions might make it more performant if there are 6M nodes and only a small percentage have the predicates being cascaded; but if you know in advance that all of the nodes have the edges being cascaded and just some of them match the filter being applied, then adding a has filter reduces performance instead of increasing it.
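To make that trade-off concrete, the decision comes down to something like the toy selectivity check below. Everything here is hypothetical, especially the assumption that per-predicate counts are available before running the query, which is exactly the "seeing data without querying it" problem from above:

// Hypothetical statistics a planner would need ahead of time.
interface Stats {
  totalNodes: number;         // nodes matched by the root function
  nodesWithPredicate: number; // of those, how many have the cascaded edge
}

// Adding has(<predicate>) to a rewritten cascade query only pays off
// when it would prune a large share of nodes; otherwise it is overhead.
function shouldAddHasFilter(s: Stats, threshold = 0.5): boolean {
  return s.nodesWithPredicate / s.totalNodes < threshold;
}

// 6M nodes, only a small fraction with the edge: the rewrite helps.
console.log(shouldAddHasFilter({ totalNodes: 6_000_000, nodesWithPredicate: 300_000 })); // true
// Every node has the edge: has() just adds work.
console.log(shouldAddHasFilter({ totalNodes: 6_000_000, nodesWithPredicate: 6_000_000 })); // false

The threshold itself would have to be tuned; the point is only that this rewrite is data-dependent, not schema-dependent.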

  4. Reversing a query and then recompiling the results returned

I see this as being the key to a graph query planner, but it involves the most work and requires some things to hold, such as every predicate having a @reverse directive, and being able to quickly and efficiently get counts of predicates even before running the query that is being optimized (an ambitious impossibility, I don't know?).

{a(func: type(Person)) { hobbies { name } }}
{b(func: type(Hobbies)) @filter(has(~hobbies)) { name }}

These two queries could return the same nodes of data, if this syntax is/were even possible for reverse predicates. But this really depends on knowing that no fields of the Person were requested and that the query was really only asking which hobbies people have. This also brings in some ideas from graph theory: two graphs containing the same nodes with the same data are equivalent even if those graphs have a different structure and repeated nodes. This is basically the theory of converting two subgraphs into n-quads, seeing if the two sets of n-quads are equivalent, and deducing that the subgraphs contain the same exact data just represented differently.
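A minimal sketch of that n-quad comparison (a toy encoding of my own, not real RDF/n-quad handling): flatten each result subgraph into a set of (subject, predicate, object) triples and compare the sets, so two responses with different shapes that flatten to the same set carry the same data:

// Toy triple: subject uid, predicate name, object (uid or literal).
type Triple = [string, string, string];

// Canonical string key so triples can live in a Set.
const key = (t: Triple) => t.join("\u0000");

function sameData(a: Triple[], b: Triple[]): boolean {
  const sa = new Set(a.map(key));
  const sb = new Set(b.map(key));
  return sa.size === sb.size && [...sa].every((k) => sb.has(k));
}

// The same subgraph reached from the Person side or the Hobby side:
const fromPerson: Triple[] = [
  ["0x1", "hobbies", "0x10"],
  ["0x10", "name", "hiking"],
];
const fromHobby: Triple[] = [
  ["0x10", "name", "hiking"],
  ["0x1", "hobbies", "0x10"], // the ~hobbies edge, stored forward
];
console.log(sameData(fromPerson, fromHobby)); // true

Of course, this only works when both responses actually carry the uids on both sides of the edge; a response like query b's (shown further down) carries no Person uids at all, which is part of why this equivalence is so hard for a planner to prove.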

At the root of a query planner is touching the fewest predicates possible, which this sort of handles, but only depending on the number of predicates. If there are millions of type(Person) nodes but orders of magnitude fewer type(Hobbies) nodes, then reversing the query would be more performant and would return the same graph data, but it would be impossible to return the same graph structure, which again is why this seems impossibly ambitious.


And herein lies almost every problem with any kind of query planner/optimization that would really be performant:

The shape of the graph the client asks for is the shape of the graph the response should return

If I were to ask for query a above, I might get:

[
  { hobbies: [{ name: "hiking" },{ name: "programming" }] },
  { hobbies: [{ name: "camping" }] },
  { hobbies: [{ name: "running" }, { name: "hiking" }] },
  { hobbies: [{ name: "programming" }, { name: "running" }] },
  { hobbies: [{ name: "programming" }] },
  { hobbies: [{ name: "hiking" }] }
]

While query b might respond with:

[
  { name: "hiking" },
  { name: "programming" },
  { name: "camping" },
  { name: "running" }
]

This contains the same nodes of data, if hobbies are what you really want, but a query planner cannot make assumptions about what you really want vs. what you asked for.

Now, if a table were returned instead of a graph, then maybe this would be more possible. But in DQL, the query and the response have the same shape. It is not like other graph languages (Cypher/Gremlin) that let you form a query and then pick what you want back out of that query; rather, DQL forms the query and the response in a single operation.

Hence, I am led to believe that a query planner might not really work well after all in a graph database; instead, a query planner would really just be a query rewriter/optimizer doing #1, #2, and #3, but nothing as complex as #4, the way an RDBMS query planner might perform joins differently from how the client asked for them.

A factor in this might be the @normalize directive, which I still do not understand well enough to make educated assumptions about how to optimize it.

IMHO, the biggest part of a query planner would be #3: rewriting queries that have post-query logic (cascade, pagination if cascaded, normalize?). This is the same as @chewxy's simple example of changing the order of sort/filter, since it returns the same exact graph either way it is applied.
