This filter says: get everything except me. So the result set is huge.
If we combine this filter with other conditions, the query becomes an expensive one.
I wonder if there is a way to improve the efficiency of such queries.
@filter(eq(city, "Berlin") AND NOT uid(me))
If I run the query above, what is the query execution plan?
If this is already optimized, we are good.
If it is not, there is massive room for improvement.
It will check the index, get the matching uids, and make a call to the group that holds that predicate. Then it will check the given uid and apply the logic. Finally, it continues on to the other predicates in the body of the query.
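To make that reading of the plan concrete, here is a minimal Go sketch (this illustrates my understanding only, not Dgraph's actual code; the uid values and the `excludeUid` helper are made up for the example):

```go
package main

import "fmt"

// excludeUid drops one uid from a list, mimicking NOT uid(me) applied to
// the uids returned by an index lookup.
func excludeUid(uids []uint64, me uint64) []uint64 {
	var out []uint64
	for _, u := range uids {
		if u != me {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	// Hypothetical uids returned by the eq(city, "Berlin") index lookup.
	berlin := []uint64{0x3, 0x7, 0x9, 0x12}
	me := uint64(0x7)

	// NOT uid(me) only needs to scan the uids the index returned,
	// not every node in the database.
	fmt.Println(excludeUid(berlin, me)) // prints [3 9 18]
}
```

If this is roughly how it works, the NOT condition is cheap because it operates on the already-reduced uid list.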
This was not a real use case. I just gave a sample filter condition. It can be anything.
According to the paper, if I understand correctly, Dgraph's query execution engine is smart enough to reduce the filtering cost of multiple AND-connected conditions.
If that is the case, Dgraph is really good at filter execution.
I just want confirmation of my understanding, if possible.
It would also be nice to have a blog post / doc page revealing the internals of query execution plans.
It is always important for developers to understand how a query is executed internally.
Thank you.
The Dgraph paper itself says that we can optimize queries by ordering the filters.
For example:
Query 1: @filter(NOT uid_in(<0x7>) AND eq(city, "Berlin"))
Query 2: @filter(eq(city, "Berlin") AND NOT uid_in(<0x7>))
Query 2 will be much faster than Query 1, because its first condition reduces the dataset and the second condition then operates on the reduced dataset.
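Here is a toy Go model of that argument. It relies on my assumptions, not on Dgraph internals: filters run left to right, the intermediate result is materialized, and each later filter only sees what the earlier one kept (the `buildNodes` and `countChecks` helpers are mine):

```go
package main

import "fmt"

type node struct {
	uid  int
	city string
}

// buildNodes creates a toy dataset in which only 10 of 1000 nodes
// are in Berlin.
func buildNodes() []node {
	nodes := make([]node, 1000)
	for i := range nodes {
		nodes[i] = node{uid: i, city: "Munich"}
		if i%100 == 0 {
			nodes[i].city = "Berlin"
		}
	}
	return nodes
}

// countChecks applies two predicates in order, materializing the
// intermediate result, and returns the total number of per-node checks.
func countChecks(nodes []node, first, second func(node) bool) int {
	checks := 0
	var kept []node
	for _, n := range nodes {
		checks++
		if first(n) {
			kept = append(kept, n)
		}
	}
	for _, n := range kept {
		checks++
		_ = second(n)
	}
	return checks
}

func isBerlin(n node) bool { return n.city == "Berlin" }
func notMe(n node) bool    { return n.uid != 7 }

func main() {
	nodes := buildNodes()
	fmt.Println("eq first: ", countChecks(nodes, isBerlin, notMe)) // 1000 + 10 = 1010
	fmt.Println("NOT first:", countChecks(nodes, notMe, isBerlin)) // 1000 + 999 = 1999
}
```

Under this model the selective-first order does 1,010 checks versus 1,999 for the broad-first order. Whether Dgraph actually evaluates filters this way is exactly what I am asking.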
This is an important fact for us as DQL users.
Can you confirm whether my understanding is correct so far?
I am not sure the second will be faster, since anything querying by UID is the fastest way of querying. The other condition needs to apply eq() on the edge or check the index table.
Not the entire database, just all of the nodes that either came from the parent edge or from the root function. So it is only every node except one if your root function selects every node, but that would be a poorly designed query. Start small and expand outward as often as you can.
So logically, if you want to use the uid_in function, which says "get the nodes that are linked to the given uid(s) along <predicate>", then at most the root function should be has(predicate).
It is still not clear what your query root is. A filter directive is applied to the nodes coming from that query root, and it is the query root that determines whether you get a huge response and poor performance.
Also, the order of the filter params doesn't make much of a difference. It is just that UIDs are a thousand times easier to process than an inequality. If you have the UID in hand, always use the UID as the main param. And use multiple query blocks, since they run concurrently, which boosts the performance of the query.
If uid_in were supported at the root, it would use the has filter under the hood.
What the query root is doesn't matter here; I am talking about the filter execution plan.
Let's say the query root function reduces the dataset to N nodes.
Then the first filter reduces those N records to M.
The second filter reduces the M records to Q records.
The final result size is Q.
If M >> Q (M is much greater than Q), the filter execution order changes the performance dramatically.
That is why the filtering order is important.
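Plugging toy numbers into the N → M → Q argument, under the assumption that each filter inspects every node handed to it (the numbers and the `filterCost` helper are made up for illustration):

```go
package main

import "fmt"

// filterCost models two filters applied in sequence: the first inspects
// all n nodes, the second inspects only the firstKept survivors.
func filterCost(n, firstKept int) int {
	return n + firstKept
}

func main() {
	N, M, Q := 1000, 950, 12 // root result, broad filter's keep, final result

	// Broad filter first: the second filter must inspect all M survivors.
	fmt.Println("broad first:    ", filterCost(N, M)) // 1950 checks
	// Selective filter first: the second filter inspects only ~Q survivors.
	fmt.Println("selective first:", filterCost(N, Q)) // 1012 checks
}
```

The gap grows with M, so for M >> Q the ordering dominates the cost in this model.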
I am trying to learn what Dgraph does in this regard. Should we care about this fact while writing queries?
I’m not sure about that statement. Look, in math, the order of the factors doesn’t change the product.
What will actually change is the processing cost. Say that processing each of the factors takes X time: no matter the order, it will take the same time overall. If you switch the sides, in my opinion, the final time does not change at all.
And look, let’s hypothetically assume that processing UIDs takes 1 ns (nanosecond) and that processing an inequality takes 2 ns. Which order takes longer? If the UID is the first param in the filter directive, it will always finish twice as fast. UID is direct processing; it has a small footprint in RAM and CPU. But it does not matter.
In the end you would be using both, UID and inequality, and no matter the order you will have the same result. It would be interesting to find a way to prove it, maybe by tracing this with Jaeger to see the final time and even the time of the individual calls.
If processing an inequality is faster than, or the same as, processing a UID, maybe you are right about that. But still, the end result would be the same regardless of the order.
And if your case were choosing between only UID or only inequality, UID wins.
@MichelDiz
I think the following code example shows exactly what I am trying to understand.
If you look at the two LINQ queries over people below, the first one is about 80 times more efficient than the second one on average.
Of course, if we omit the ToList() calls after the first filter condition, this will not be the case and both queries will perform equally.
However, in Dgraph every predicate has its own Badger store. This means there should be a ToList()-like materialization step in DQL query processing.
I am trying to understand what is happening on that side.
Does Dgraph respect the data sizes and select the best-performing filter condition first? (Apparently, according to the paper, it does not!)
If not, we should care about this when we write DQL queries.
public class Person
{
    public int Uid { get; set; }
    public int Age { get; set; }
    public int Payment { get; set; }
}

void Main()
{
    var people = CreatePeople(1000);

    people
        .Where(x => x.Age == 41)
        .ToList() // 12 people found with the age query
        .Where(x => x.Payment > 100)
        .Count();

    people
        .Where(x => x.Payment > 100)
        .ToList() // 952 people found with the payment query
        .Where(x => x.Age == 41)
        .Count();
}

private List<Person> CreatePeople(int count)
{
    var random = new Random();
    var result = new List<Person>();
    for (var i = 0; i < count; ++i)
    {
        result.Add(new Person {
            Uid = i,
            Age = random.Next(1, 70),
            Payment = random.Next(1, 2000)
        });
    }
    return result;
}