Uid_not_in query filter to exclude some ids

Hi Dgraph lovers!

I am trying to fetch people except myself with a DQL query and I am not able to do so.
Is there a good way to do such a query?

It would be really great if there was an

uid_not_in(predicate,[uids])

or

uid_in_except(predicate1, [included_uids], predicate2, [excluded_uids])

functions.

It is important to do this at database level. With client side workarounds, pagination gets problems on big data sets.

The syntax is

func: eq(...)) @filter(NOT uid_in(pred, 0x1) AND uid_in(pred, $restUIDS))

See more at https://dgraph.io/docs/query-language/connecting-filters/#and-or-and-not

If you just wanna take off you from the query(not as a root param) you would do like

func: eq(...)) @filter(uid_in(pred, $restUIDS)) {
   pred @filter(NOT uid(me)) { 
    uid 
    name 
    ... 
  } 
}

more here https://dgraph.io/docs/query-language/functions/#uid

One more thing

To the last example you can also use ignorereflex directive https://dgraph.io/docs/query-language/ignorereflex-directive/

1 Like

@MichelDiz This is awesome. Thank you so much.

1 Like

@MichelDiz
Can you please share your opinions about the performance impact of above query?
What does Dgraph do internally to process filter below?

@filter(NOT uid_in(pred1, [excluded_uids]) AND uid_in(pred2, [included_uids]))

If the query executed in such a way that:

  • get all uids belong to included uids
  • get all uids that do not belong to excluded uids.
  • intersect two lists

Then this would be not an efficient query execution.
If this is the case, may I ask for a feature request to do efficient uid_in_except queries?

Thank you.

PS: There is a condition that pred1 != pred2, otherwise the query would be simplified by removing excluded uids before the query preparation.

This filter only checks ahead of a block’s traverse (one level). It does nothing extra. I don’t know what it does under the hood.

I think @filter(NOT uid(me)) or ignorereflex directive might fit better for your case.

@filter(NOT uid(me)) 

filter tells that get everything except me. So the result is huge.
If we connect this filter with other conditions then the query will be expensive one.
I wonder if there is a way to improve efficiency of such queries.

@filter(eq(city, "Berlin") AND NOT uid(me))

If I do above query, what is the query execution plan?
If this is already optimized, we are good.
If it is not we would find a massive improvement room.

Looks like your intentions are bigger than this.

What is the root query?

If it is like this

(func: eq(city, "Berlin") ) @filter(NOT uid(me))

It will check the indexation, get the uid of it make a call to the group that predicate is. Then it will check for the given uid and apply the logic. Finally continues to call the other predicates in the body of the query.

If you wanna understand better how it works. Take a read in the paper Dgraph Whitepapers: In-Depth Insights and Analysis

BTW, this filter doesn’t looks right. Show me the Graph Model you are using. Cuz if you are a user, you are not “Berlin” right?

This was not a real use case. I just gave a sample filter condition. It can be anything.
According to the paper, if I understand correctly, Dgraph query executor engine is smart enough to reduce filtering of multiple conditions with AND connections.

If this is the case Dgraph is really good on the filtering execution.
I just want a confirmation -if possible- of my understanding.
It would be also nice to write a blog / doc page to reveal internals of query execution plans.
It is always important for developers to understand how the query executed internally.
Thank you.

There is an execution plan feature in the roadmap. Soon you gonna see it. Dunno when.

1 Like

From the paper:

Dgraph paper itself tells that we can optimize queries by ordering the filters.

For example:
Query 1:
@filter(NOT uid_in(<0x7>) AND eq(city, "Berlin"))

Query 2:
@filter(eq(city, "Berlin") AND NOT uid_in(<0x7>) )

Query 2 will be way more faster than the Query 1.
Because the first condition reduces the dataset and the second condition can operate on reduced dataset.

This is important fact for us as DQL users.

Can you confirm my understanding if it is true so far?

Thank you.

Not sure if the second will be fast. As any query with UID is the fastest way of querying. The other needs to apply eq() in the edge or check the index table.

NOT uid_in(<0x7>) )
result is huge. entire dataset except 1 node. That is why it would affect the performance.

Not the entire database, just all of the nodes that eother came from the parent edge or root function. So it is only EVERY node except one, if in your root function you are selecting EVERY node. But that would be a poorly designed query. Start smaller and expand larger as often as you can.

There are cases that you cannot reduce on root level. For ex uid_in function is not supported in root function.

So logically, if you want to use the uid_in function, which says gets the nodes that are linked to uid(s) along <predicate>, so at most the root function should be has(predicate)

Has(predicate) retrieves entire dataset.

Still not clear what is your query root. A filter directive will be applied to the nodes coming from that query root. It is the query root that will determine whether it is a huge response and poor performance.

Also, the order of the filter params doesn’t have much difference. Just the fact that UIDs are a thousand times easier to process than inequality. If you have the UID in hand, always use the UID as the main param. And multiple query blocks, cuz it runs concurrently. So it boosts the performance of the query.

If uid_in would be supported in the root. It would use has filter under the hood.

It is not important what the query root is. I am talking about query filter execution plan.

Lets say the query root function reduces the dataset to N nodes.
Then first filter reduces the N records to M.
Second filter reduces the M records to Q records.
Final result size is Q.

if M >> Q (M is way greater than Q) executing the filter order changes the performance dramatically.

That is why the filtering order is important.
I try to learn what Dgraph does in this regard? Should we care this fact while writing queries?

I’m not sure about that statement. Look, in Math, the order of the factors doesn’t change the product.

What will actually change is the processing factor. Whether to process each of the factors takes X time. No matter the order, it will take at the same time. If you switch sides, in my opinion, the final time does not change at all.

And look, let’s hypothetically assume that to process UIDs it takes 1ns (nanosecond) and that to process inequality it takes 2ns. Which takes longer in order? If the UID is the first param in the filter directive, it will always end twice as fast. UID is direct processing, It has a small footprint in the RAM and CPU. But it does not matter.

In the end, you would be using both. UID and inequality, no matter the order, you will have the same result. It would be interesting to find a way to prove it. Maybe using Jaeger tracing this to see the final time and even the time for individual calls.

If processing inequality is faster than processing UID or the same. Maybe you were right about that. But still, the end result regardless of the order would be the same.

Or if your case were be choosing between only UID or only inequality. UID wins.

That’s just theory™

Cheers.

@MichelDiz
I think following code example shows exactly what I am trying to understand.

If you look at following 2 people LINQ queries, the first one is 80 times more efficient than the second one in average.
Of course, if we omit ToList() calls after the first filter condition this will not be the case and both queries will perform equally.
However, on Dgraph, every predicate has their own Badge database. This means there should be a ToList() step in DQL query processing.
I am trying to understand what is happening on that side.
Does Dgraph respect data sizes and selects the best performing filter condition first? (apparently, according to paper it does not!)
If not we should care about this when we write DQL queries.

public class Person
{
	public int Uid { get; set; }
	public int Age { get; set; }
	public int Payment { get; set; }
}

void Main()
{
	var people = CreatePeople(1000);
	people
		.Where(x => x.Age == 41)
		.ToList() // 12 people found with age query
		.Where(x => x.Payment > 100)
		.Count();
		
	people
		.Where(x => x.Payment > 100)
		.ToList() // 952 people found with payment query
		.Where(x => x.Age == 41)
		.Count();
}

private List<Person> CreatePeople(int count)
{
	var random = new Random();
	var result = new List<Person>();
	for (var i = 0; i < count; ++i)
	{
		result.Add(new Person {
			Uid = i,
			Age = random.Next(1,70),
			Payment = random.Next(1,2000)
		});
	}	
	return result;
}