Filtering is slow on large amount of data

diggy · October 31, 2018, 9:18am

Moved from GitHub dgraph/2713

Posted by makitka2007:

dgraph 1.0.9

due to required index lookup for every @filter condition, filtering is slow on large amounts of data.

test data: 10m entities created with “sex” predicate having random value “m” or “f”.
since index is required for filtering, it’s created both for entity_key and sex predicates.

entity_key: string @index(exact) .
sex: string @index(exact) .

10m entities loaded:

_:node$x <entity_key> "entity$x" .
_:node$x <sex> "$sex" .

query to get 1 entity takes 1ms:

{
	get_entity(func: eq(entity_key, "entity600000")) {
		uid
	}
}

adding filter by “sex” predicate slows it down to 7 seconds:

{
	get_entity(func: eq(entity_key, "entity900000")) @filter(eq(sex, "f")) {
		uid
	}
}

because filter loads all ~5m entities having sex="f" into memory.

need to improve filters not to use index when index doesn’t exist or by some special directive.
if I use filtering on edge facet it works fast as expected (1 ms):

{
	get_entity(func: eq(entity_key, "entity800000")) @cascade {
		uid
		attrs @facets(eq(sex, "f"))
	}
}

so, predicate filters should use the same logic as edge facets filters (if index is not created or there is a special directive not to use index on this predicate).

diggy · July 13, 2019, 1:16am

campoy commented :

Problem still occurs and seems like should be an obvious optimization.

Adding it for our Dgraph v1.2!

diggy · September 13, 2019, 3:47pm

campoy commented :

Marking it as popular, after second report appeared #3945.

diggy · May 12, 2020, 2:51am

JimWen commented :

Happy to notice this, i used to fiexed this problem by not set the index of “sex” because filter can support nonindex filed now.

diggy · May 12, 2020, 2:54am

JimWen commented :

And using the block varaiable can also fix this problem, but all these workaround are not so good beacause user should know more details of drgaph.

diggy · June 15, 2020, 5:51am

antblood commented :

If any predicate is indexed then it is stored in the following way:

<predicate, predicate_value> => [uid1, uid2, uid3 .....]

hence in this case predicate sex will be stored as:

<sex, "f"> => [0x1, 0x3, 0x5, ....]
<sex, "m"> => [0x2, 0x4, 0x6 ....]

and predicate entity_key as:

<entity_key, "entity1"> => [0x1]
<entity_key, "entity2"> => [0x2]
<entity_key, "entity3"> => [0x3]
...

When we try to find out if the person with “entity800000” has sex predicate as “f”. First we get it’s uid from <entity_key, "entity800000"> => [0xC3500], then we traverse over the list <sex, "f"> => [0x1, 0x3, 0x5, ....] to check if the same uid is present in this list. Traversing over a long list makes this operation very slow.

In the case when we don’t index a predicate then it is stored in the following way:

<predicate, uid> => [value1, value2 ....]

hence in this case:

<sex, "0x1"> => ["f"]
<sex, "0x2"> => ["m"]
<sex, "0x3"> => ["f"]
...

in this case, checking the value of sex predicate for a node is very fast as we only need to traverse over one value.

Hence, it’s better not to index the predicates that can only have a few different values. Like in this case sex predicate has only two values “f” and “m”.

Time taken for query when we index sex predicate : 300 ms
Time taken for query when we don’t index sex predicate : 3 ms

Topic		Replies	Views
Filter performance Dgraph	15	863	March 27, 2020
Complex edge filtering with large data is too slow. I'm not sure what mistake i make.Can you give me some advice and help? Users	1	603	September 28, 2019
Significant Performance Degradation with More Conditions Dgraph	4	680	October 23, 2018
Sharing a little trick Dgraph	4	260	March 20, 2024
Query performance of large database (over 12g edges) Dgraph	5	1782	July 2, 2019

Filtering is slow on large amount of data

Related topics