Filtering is slow on large amount of data

Moved from GitHub dgraph/2713

Posted by makitka2007:

dgraph 1.0.9

due to required index lookup for every @filter condition, filtering is slow on large amounts of data.

test data: 10m entities created with “sex” predicate having random value “m” or “f”.
since index is required for filtering, it’s created both for entity_key and sex predicates.

entity_key: string @index(exact) .
sex: string @index(exact) .

10m entities loaded:

_:node$x <entity_key> "entity$x" .
_:node$x <sex> "$sex" .

query to get 1 entity takes 1ms:

{
	get_entity(func: eq(entity_key, "entity600000")) {
		uid
	}
}

adding filter by “sex” predicate slows it down to 7 seconds:

{
	get_entity(func: eq(entity_key, "entity900000")) @filter(eq(sex, "f")) {
		uid
	}
}

because filter loads all ~5m entities having sex="f" into memory.

need to improve filters not to use index when index doesn’t exist or by some special directive.
if I use filtering on edge facet it works fast as expected (1 ms):

{
	get_entity(func: eq(entity_key, "entity800000")) @cascade {
		uid
		attrs @facets(eq(sex, "f"))
	}
}

so, predicate filters should use the same logic as edge facets filters (if index is not created or there is a special directive not to use index on this predicate).

campoy commented :

Problem still occurs and seems like should be an obvious optimization.

Adding it for our Dgraph v1.2!

campoy commented :

Marking it as popular, after second report appeared #3945.

JimWen commented :

Happy to notice this, i used to fiexed this problem by not set the index of “sex” because filter can support nonindex filed now.

JimWen commented :

And using the block varaiable can also fix this problem, but all these workaround are not so good beacause user should know more details of drgaph.

antblood commented :

If any predicate is indexed then it is stored in the following way:

<predicate, predicate_value> => [uid1, uid2, uid3 .....]

hence in this case predicate sex will be stored as:

<sex, "f"> => [0x1, 0x3, 0x5, ....]
<sex, "m"> => [0x2, 0x4, 0x6 ....]

and predicate entity_key as:

<entity_key, "entity1"> => [0x1]
<entity_key, "entity2"> => [0x2]
<entity_key, "entity3"> => [0x3]
...

When we try to find out if the person with “entity800000” has sex predicate as “f”. First we get it’s uid from <entity_key, "entity800000"> => [0xC3500], then we traverse over the list <sex, "f"> => [0x1, 0x3, 0x5, ....] to check if the same uid is present in this list. Traversing over a long list makes this operation very slow.

In the case when we don’t index a predicate then it is stored in the following way:

<predicate, uid> => [value1, value2 ....]

hence in this case:

<sex, "0x1"> => ["f"]
<sex, "0x2"> => ["m"]
<sex, "0x3"> => ["f"]
...

in this case, checking the value of sex predicate for a node is very fast as we only need to traverse over one value.

Hence, it’s better not to index the predicates that can only have a few different values. Like in this case sex predicate has only two values “f” and “m”.

Time taken for query when we index sex predicate : 300 ms
Time taken for query when we don’t index sex predicate : 3 ms