We have an interesting performance issue hopefully someone can help shed some light on.
In our use case we have nodes which are unique by the sum of their taxonomy, which are arbitrary key/value pairs.
Our schema is such that
key_nodes
|__ value_nodes
|__ item_nodes
The item nodes could have any amount of key/values associated with them. Our queries are based on the key/values, but our returned result set includes all of the key/values which is what led us to dgraph as we can easily traverse these relationships backwards in our result set.
We did notice quickly however that when using multiple filters to intersect the result sets the ordering mattered for performance.
In order to try and solve this we started indexing the counts of edges. So we would run two queries, the first to get the indexed count of the number of results in each query, then the second ordering these for the actual query with results in a sensible manner.
For context, our current dataset can consist of hundreds of thousands of value_nodes each related to potentially thousands of item_nodes. Although performing the indexed count query beforehand improves the overall performance, the count query itself is pretty slow (in some cases still taking seconds to execute) and looks to be overwhelming dgraph-server. We are wondering if there is anything you can suggest we do to optimise this?
tagkeys to tagvalues are 1..N
tagvalues to items aslo 1..N
We search for items by intersecting sets of items found by tagkey/tagvalue patterns.
However if a tagkey/tagvalue pair was not used for filtering that does not matter, we still return all pairs associated with the items in the result set.
When using multiple filters to intersect the result sets the ordering mattered for performance. We’re looking for a way to ensure that the ordering of our tag_value filters results in the most optimal query, as choosing the wrong “root” tag_value filter (eg. the @filter(regexp(tagvalue_name, /web1000.*/)) from our query example) can result in extremely slow response times.
We are currently using the “indexed count query” as a way of figuring out beforehand which filter matches the least items to ensure we build the optimal query but this is still quite slow, we’re pretty sure we’re doing something wrong
If there is anything you can suggest we do to optimise this that would be so helpful! We’re currently evaluating dgraph and it looks very promising apart from this one stumbling block.