Interesting performance issue with high cardinality indexing

We have an interesting performance issue hopefully someone can help shed some light on.
In our use case we have nodes which are unique by the sum of their taxonomy, which are arbitrary key/value pairs.
Our schema is such that

key_nodes
|__ value_nodes
    |__ item_nodes

The item nodes could have any amount of key/values associated with them. Our queries are based on the key/values, but our returned result set includes all of the key/values which is what led us to dgraph as we can easily traverse these relationships backwards in our result set.

We did notice quickly however that when using multiple filters to intersect the result sets the ordering mattered for performance.
In order to try and solve this we started indexing the counts of edges. So we would run two queries, the first to get the indexed count of the number of results in each query, then the second ordering these for the actual query with results in a sensible manner.

For context, our current dataset can consist of hundreds of thousands of value_nodes each related to potentially thousands of item_nodes. Although performing the indexed count query beforehand improves the overall performance, the count query itself is pretty slow (in some cases still taking seconds to execute) and looks to be overwhelming dgraph-server. We are wondering if there is anything you can suggest we do to optimise this?

Here’s an example of these queries:

Hey! Just bumping this with the hope someone will be able to help. We’re still quite stumped on how to make these queries less intensive.

It’s a bit unclear what you mean here. Are you using the @count index in your schema? https://docs.dgraph.io/query-language/#count-index

Can you share an example of a mutation in your cluster so we can get a better idea of the structure of your graph?

Hey @dmai!

Yep we do have the @count index in our schema.

To clarify, here’s what our actual schema looks like, please note I’ve also updated the posted github gist of our queries to match the below:

item_canonicalhash      string  exact
tagkey_name             string  exact
tagkey_value            uid     reverse
tagvalue_name           string  exact, trigram
tagvalue_item           uid     reverse         count

A typical nquad mutation for the above would be:

_:item <item_canonicalhash> "90126705c8a72b7978b61be654da1fa4" .

_:tagserver <tagkey_name> "server" .
_:tagserver <tagkey_value> _:value _:valueserver-0001 .
_:valueserver-0001 <tagvalue_name> "server-0001" .
_:valueserver-0001 <tagvalue_item> _:item .

_:tagdc <tagkey_name> "dc" .
_:tagdc <tagkey_value> _:value _:valuedc1 .
_:valuedc1 <tagvalue_name> "dc1" .
_:valuedc1 <tagvalue_item> _:item .

_:tagrack <tagkey_name> "rack" .
_:tagrack <tagkey_value> _:value _:valuerack-0001 .
_:valuerack-0001 <tagvalue_name> "rack-0001" .
_:valuerack-0001 <tagvalue_item> _:item .

We have three “types” of node:

  • tagkeys
  • tagvalues
  • items

tagkeys to tagvalues are 1..N
tagvalues to items aslo 1..N

We search for items by intersecting sets of items found by tagkey/tagvalue patterns.
However if a tagkey/tagvalue pair was not used for filtering that does not matter, we still return all pairs associated with the items in the result set.

When using multiple filters to intersect the result sets the ordering mattered for performance. We’re looking for a way to ensure that the ordering of our tag_value filters results in the most optimal query, as choosing the wrong “root” tag_value filter (eg. the @filter(regexp(tagvalue_name, /web1000.*/)) from our query example) can result in extremely slow response times.

We are currently using the “indexed count query” as a way of figuring out beforehand which filter matches the least items to ensure we build the optimal query but this is still quite slow, we’re pretty sure we’re doing something wrong :wink:

If there is anything you can suggest we do to optimise this that would be so helpful! We’re currently evaluating dgraph and it looks very promising apart from this one stumbling block.

Try these tweaks to improve the regexp:

  1. Put anchors in your expression: /^web10009.*$/
  2. If you can’t use block anchors, use lazy quantifiers: /.*?web10009.*/
  3. Negate the regexp instead of the result inside filter: @filter(regexp(tagvalue_name,/^((?!web10009).)*$/)) {}

Let me know if these changes make any difference.

3 Likes