What I want to do
Filter a large dataset, with pagination and @cascade on v21.03
What I did
I could not; it is too slow.
Dgraph metadata
dgraph version
Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true
I have a query pattern for my application's use of Dgraph that relies on @cascade to filter paths: a node is kept only if at least one of its filtered edges exists. This works great in v20.11, with the caveat that @cascade and pagination did not work together as expected.
So I upgraded our staging system to v21.03 in order to get the pagination+cascade fix in this PR. However, by design, the fix ignores pagination during traversal, calculates the full possible response, and only then applies pagination. This breaks my query pattern so badly that I cannot use v21.03.
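A toy Python model of the two evaluation orders may make the cost difference concrete. This is only a sketch, not Dgraph's implementation: the function names, the `passes_cascade` predicate, and the `touched` counter (standing in for uids accessed) are all assumptions made for illustration.

```python
def cascade_then_paginate(nodes, passes_cascade, first):
    """Check cascade on every candidate, then cut the result to `first`."""
    touched = 0
    kept = []
    for node in nodes:
        touched += 1
        if passes_cascade(node):
            kept.append(node)
    return kept[:first], touched


def paginate_while_cascading(nodes, passes_cascade, first):
    """Stop scanning as soon as `first` candidates have passed cascade."""
    touched = 0
    kept = []
    for node in nodes:
        if len(kept) == first:
            break
        touched += 1
        if passes_cascade(node):
            kept.append(node)
    return kept, touched


if __name__ == "__main__":
    devices = range(100)                   # hypothetical uids for 100 Devices
    has_edge = lambda uid: uid % 3 == 0    # assumed stand-in for the cascade check
    print(cascade_then_paginate(devices, has_edge, 2))     # ([0, 3], 100)
    print(paginate_while_cascading(devices, has_edge, 2))  # ([0, 3], 4)
```

Both orders return the same page, but the first one touches every candidate at that level, and the cost multiplies at each nested level.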
For example, imagine I have a dataset shaped like this: (Device[#100])-[:has_object]->(Object[#3000eachDevice])-[:has_indicator]->(Indicator[#20eachObject]), that is, 100 Devices, 3,000 Objects per Device, and 20 Indicators per Object.
Each node has a [uid] edge out to other nodes, and I need to filter on the existence of that edge. Here is a made-up query that illustrates the issue (this is not a real query from my application, it just has the same idea):
q(func: type(Device), first:2) @cascade(myEdgeThatIsUsedToFilter) {
  ...fields
  has_object @filter(type(Object)) (first:2) {
    ...fields
    has_indicator @filter(type(Indicator)) (first:2) {
      ...fields
    }
  }
}
fragment fields {
  uid, name
  myEdgeThatIsUsedToFilter @filter(...)
  # With @cascade, if at least one of these edges passes the filter, the node is included.
  # If none of these edges pass the filter, then the parent node is not in the result
  # and will not be traversed to the next level.
}
In v20.11, this finished quite quickly (it's only asking for 2 Devices, 4 Objects, and 8 Indicators in total).
In v21.03, this does not finish after 200s, and trimming it down to just the first level shows me debug metrics that confirm it: the query had to access 6,300,100 uids (100 Devices, 300k Objects, 6M Indicators) to get the result, instead of the 14 nodes I had asked for.
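The two totals above follow directly from the dataset shape. A small sketch of the arithmetic, with hypothetical helper names and assuming every Device has exactly 3,000 Objects and every Object exactly 20 Indicators:

```python
def full_traversal_uids(devices, objects_per_device, indicators_per_object):
    """Nodes visited when every level is expanded before pagination applies."""
    objects = devices * objects_per_device
    indicators = objects * indicators_per_object
    return devices + objects + indicators


def paginated_uids(first, levels):
    """Nodes returned when each level keeps only `first` children per parent."""
    return sum(first ** level for level in range(1, levels + 1))


print(full_traversal_uids(100, 3000, 20))  # 6300100
print(paginated_uids(2, 3))                # 14
```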
So… I won't be able to use Dgraph v21.03 without a dramatic redesign of our application. Would it be possible to get something into the next release that applies pagination while the @cascade parameters are being evaluated?