V21.03: After pagination+cascade change, queries are too slow to finish

What I want to do

Filter a large dataset, with pagination and @cascade on v21.03

What I did

I could not, its too slow.

Dgraph metadata

dgraph version

Dgraph version : v21.03.0
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true

I have a query pattern for my application to use dgraph that requires using @cascade to filter paths on the existence of at least one edge that is being filtered. This works great in v20.11, with the caveat that @cascade and pagination did not work as expected.

So I upgraded our staging system to v21.03 in order to get the pagination+cascade fix in this pr. However, as per the design of this fix, it removes pagination completely and calculates the full possible response, then applies pagination. This seriously breaks my query pattern to the point where I cannot use v21.03.

EG: Imagine, if you will, I have a dataset with: (Device[#100])-[:has_object]->(Object[#3000eachDevice])-[:has_indicator]->(Indicator[#20eachObject])

Each node has a [uid] edge out to other nodes, the existence of which I need to filter on. Here is a made up query explaining the issue (this is not a real query of my application, just has the same idea)

q(func: type(Device), first:2) @cascade(myEdgeThatIsUsedToFilter) {
  ...fields
  has_object @filter(type(Object)) (first:2) {
    ...fields
    has_indicator @filter(type(Indicator)) (first:2) {
      ...fields
    }
  }
}
fragment fields {
  uid, name
  myEdgeThatIsUsedToFilter @filter(...) 
  #with @cascade, if at least one of these edges pass the filter, the node is included. 
  # If none of these edges pass the filter, then the parent node is not in the result
  # and will not be traversed to the next level
}

In v20.11, this finished quite quickly (its only asking for 2 Devices, 4 Objects, 8 Indicators total)
In v21.03, this does not finish after 200s, an trimming it down to just the first level shows me the debug metrics that say yes, it had to access 6,300,100 uids (100 devices, 300k objects, 6M indicators) to get the result, instead of the 14 nodes I had asked for.

So… I wont be able to use dgraph v21.03 without a dramatic redesign of our application. Would it be possible to get something into the next release that will evaluate paging while evaluating cascade parameters?

4 Likes

As another point, it seems like there are 2 distinct asks buried in this:

  1. stop looking for results when the pagination limit is hit, even while evaluating that all result nodes have every cascade field
  2. if the above isn’t easy then at least do not follow edges out of a level that would be cut by the pagination of that level (after it does get applied)

The latter alone would bring the hit UIDs of my example query to 100+6000+40=6140 instead of 100+300000+6000000=6300100.

But the first point would bring the total UIDs hit to 14, which is even better than 6140. :slight_smile:

@hardik maybe we had a regression here?

Hey @iluminae, since @cascade is a post-processing step, it doesn’t know about the pagination limit at the top level since a node can be filtered out due to some children node being filtered out so the first approach suggested by you cannot be implemented.
However, the second approach can optimize the current solution which is equivalent to having has filters on each level along with the @cascade directive.
So the optimized query will be:-

q(func: type(Device), first:2) @cascade(myEdgeThatIsUsedToFilter)  @filter(has(myEdgeThatIsUsedToFilter)) {
  ...fields
  has_object @filter(type(Object)) (first:2) @filter(has(myEdgeThatIsUsedToFilter)) {
    ...fields
    has_indicator @filter(type(Indicator)) (first:2) @filter(has(myEdgeThatIsUsedToFilter)) {
      ...fields
    }
  }
}

You can use this query and tell us about the latency and correctness of the result. Meanwhile, we discuss internally about the possible optimisations to it.

@minhaj Your suggested query still has @cascade in it, and the moment I add any @cascade, the query does not finish, which is my original issue.

But let me assume that was a mistake and that you meant to omit @cascade. The has() filter will not remove nodes based on no edges passing their own filters. Boiled down to an example query:

{
  use_has(func: eq(<qa.type>,"Device"),first:2) @filter(has(<qa.has_object>)) {
    <qa.name>
    <qa.has_object> @filter(eq(<qa.name>,"impossible")) # results in 0 qa.has_object
  }
  use_cascade(func: eq(<qa.type>,"Device"),first:2) @cascade(<qa.has_object>) {
    <qa.name>
    <qa.has_object> @filter(eq(<qa.name>,"impossible")) # results in 0 qa.has_object
  }
}

results in:

{
    "use_has": [
      {
        "qa.name": "IBM-now-4"
      },
      {
        "qa.name": "IBM-now-8"
      }
    ],
    "use_cascade": []
}

So I cannot use has() to filter a level based on the inclusion of any edges post filter. With @cascade forcing the query to now perform 100k as much work as in v20.11, I also cannot use that. Do you have any other suggestions?

sorry @minhaj I realize I misread your statement and you did mean to leave @cascade in there. To reiterate, this causes the issue and cannot be done with v21.03 (with my dataset).

If you mean that is what the query will have to look like after you implement point 2 in my above post, I understand - very verbose but if it works, great.

In that case then it seems I will not be able to upgrade to v21.03 - if you have any other way for me to accomplish this in the meantime I am open.

Hi Kenan,
Here is a thought (ignoring the indicator type for now):

{
  devices as var(func: type(Device), first: 100) @filter(gt(count,4))
  
  q(func: uid(devices), first:2) @cascade(count) {
    name
    count
    has_object @filter(type(Object)) (first:2) {
      name
      count
    }
  }
}

The devices block helps narrow down the universe. You can put in a higher number in the pagination filter to expand, and of course add more filters. With a smaller universe, the q block is efficient and likely to hit the data you need. You can play with pagination on the devices query block to increase the probability of hitting the nodes you need.

Please give this a try and let us know.

@anand To boil down the solution you are suggesting:

  • run the query without @cascade so pagination takes effect, save the results in uid variables
  • run the query again with @cascade and use the uid vars as the filters

Note that my queries are very deep on average, so it would come out to something like this:

query q($start: int, $end: int, $limit: int = 0) {
  var_a as var(func: eq(<qa.type>, "Device"), first: $limit) {
    var_b as <qa.has_object> @filter(eq(<qa.type>, "Object"))  (first: $limit) {
      var_c as <qa.has_indicator> @filter(eq(<qa.type>, "Indicator")) (first: $limit)
    }
  }
  a(func: uid(var_a), first: $limit) @cascade(<qa.has_timerange>,<qa.has_object>,<qa.has_indicator>) {
    ...commonFields
    b: <qa.has_object> @filter(uid(var_b)) (first: $limit) {
      ...commonFields
      c: <qa.has_indicator> @filter(uid(var_c))  (first: $limit) {
        ...commonFields
      }
    }
  }
}
fragment commonFields {
  <qa.name>
  uid
  timeranges: <qa.has_timerange> @filter(
    le(<qa.timerange_start>, $end) AND ge(<qa.timerange_end>, $start)
  ){
    start: <qa.timerange_start>
    end: <qa.timerange_end>
  }
}

Which… I guess puts us back into the semantics of @cascade we had in v20.11. (It seems pretty slow @>2s execution time but I do not have the execution time on v20.11) The UIDs hit in the debug metrics returned show a better #UIDs hit of "_total": 6499. Sure I could over page the var block and tighten up the paging in the second one, but that will obviously make performance worse.

I do not know why this would be slower (again thats subjective, I do not have v20.11 numbers) than the same dataset on v20.11, so I am not putting too much weight onto that.

Is this going to be the only solution for cascade going forward, or have you as a team discussed a way to optimize @cascade this in a future release?

Hi Kenan,
Can we try without the var_b and var_c constraints? If we just take the first 1000 or 5000 Device nodes (var_a), what’s the chance that we will find Object and Indicator as per your needs in query block a?

If you find that a certain number of devices satisfies your need consistently, you could get reasonable query performance. Query block a would not have to open the entire Device universe and would perform efficiently. The idea is to limit this Device universe as efficiently as possible, as that forms the most expensive and time consuming part of the query.

@anand - Our queries are built from a user query, these are just example queries where pagination is used to make the query bearable - I am basically converting a user-input cypher MATCH syntax into a dgraph query, so I can’t just optimize this one case because I happen to know the size of the levels in this case - I need to handle whatever the customer wants to query. This one represents (a:Device)-[:has_object]->(b:Object)-[:has_indicator]->(c:Indicator), but it could be anything the user wanted to access.

Here are debug metrics from all nested variables vs just the root variable. (note you will see some other predicates with .next I have cut out of previous examples that represent our virtual edges that have an intermediate node)

with nested variables (1.38s, 6499 uids)

  "extensions": {
    "server_latency": {
      "parsing_ns": 380350,
      "processing_ns": 1380744741,
      "encoding_ns": 239606,
      "total_ns": 1381527330
    },
    "txn": {
      "start_ts": 2920081
    },
    "metrics": {
      "num_uids": {
        "": 6183,
        "_total": 6499,
        "qa.has_indicator": 8,
        "qa.has_indicator.next": 16,
        "qa.has_object": 4,
        "qa.has_object.next": 8,
        "qa.has_timerange": 26,
        "qa.name": 14,
        "qa.timerange_end": 101,
        "qa.timerange_start": 101,
        "qa.type": 12,
        "uid": 26
      }
    }

with one root variable (29.25s, 2896418 uids)

  "extensions": {
    "server_latency": {
      "parsing_ns": 399144,
      "processing_ns": 29251738486,
      "encoding_ns": 502454,
      "total_ns": 29252845323
    },
    "txn": {
      "start_ts": 2967173
    },
    "metrics": {
      "num_uids": {
        "": 365468,
        "_total": 2896418,
        "qa.has_indicator": 6000,
        "qa.has_indicator.next": 156000,
        "qa.has_object": 2,
        "qa.has_object.next": 6000,
        "qa.has_timerange": 324002,
        "qa.name": 162002,
        "qa.timerange_end": 695471,
        "qa.timerange_start": 695471,
        "qa.type": 162000,
        "uid": 324002
      }
    }
  }

I would run one with our actual current query that is ~1s in v20.11 but if I let it run for over 5m in v21.03 it OOMkills our 25GB ram alpha.

Hi Kenan,
In the query block a, can you please try the uid_in function and check if it improves performance.
Here is an example:

{
  devices as var(func: type(Device), first: 1000)  @filter(has(has_object)){
			h as has_object (first:10)
  }
    
  q(func: uid(devices)) @filter(uid_in(has_object,uid(h) )) {
    name
    count
    has_object   {
      name
      count
    }
  }
}

Please check if adding the has query also improves the overall performance.

sure:

with has and uid_in() (29.05s, 2896522 uids)

  "extensions": {
    "server_latency": {
      "parsing_ns": 400591,
      "processing_ns": 29048607400,
      "encoding_ns": 474601,
      "total_ns": 29049622369
    },
    "txn": {
      "start_ts": 2995240
    },
    "metrics": {
      "num_uids": {
        "": 365468,
        "_total": 2896522,
        "qa.has_indicator": 6000,
        "qa.has_indicator.next": 156000,
        "qa.has_object": 106,
        "qa.has_object.next": 6000,
        "qa.has_timerange": 324002,
        "qa.name": 162002,
        "qa.timerange_end": 695471,
        "qa.timerange_start": 695471,
        "qa.type": 162000,
        "uid": 324002
      }
    }
  }

but in general the uid_in() function seems not useful to me to filter on my has_timerange edge - note my earlier example that expressed that. I would have to query the universe of timeranges, which will be billions, and use that as the uid variable argument: uid_in(<qa.has_timerange>,uid(allvalidtimeranges)) Maybe there is a way to use it for the other edges I want to filter on but on in the main path of the query, but the timeranges one especially, it seems like a no-go.

Hi Kenan,
Perhaps we can further improve on 1.38 seconds option. In the query below,

q($start: int, $end: int, $limit: int = 0) {
  var_a as var(func: eq(<qa.type>, "Device"), first: $limit) {
    var_b as <qa.has_object> @filter(eq(<qa.type>, "Object"))  (first: $limit) {
      var_c as <qa.has_indicator> @filter(eq(<qa.type>, "Indicator")) (first: $limit)
    }
  }

You can start with an Indicator type and go reverse up to Device. You can start with a smaller number of Indicator types ($limit). There are better chances of getting a couple of hits and this may improve the query performance. The query segment would look as below

q($start: int, $end: int, $limit: int = 0) {
  var_c as var(func: eq(eq(<qa.type>, "Indicator") {
    var_b as <~qa.has_indicator> @filter(eq(<qa.type>, "Object"))  (first: $limit) {
      var_a as q1<~qa.has_object> @filter()<qa.type>, "Device"), first: $limit) (first: $limit)
    }
  }

Also, thanks for the feedback you have provided. A component called Query Planner is being worked on which will address some of the suggestions you have provided.

Thanks @anand, the question was not really about optimizing the query - though I appreciate the effort there. (remember these are user-input queries translated into dql). The issue was just that a query in v20.11 would be quick, and in v21.03 would be so complex it would take minutes and/or crash alphas.

So the solution here seems to be to sidestep the change made in v21.03, and perform my own pagination before using @cascade at all. If that is the only solution I can mark it as such, but that is not a very satisfying conclusion. To re-ask my question from a previous reply:

ps: I have been intrigued about the query planner roadmap item, but I am not sure what form that will take - so we will have to wait and see what that piece can do.

Hi @iluminae we are not actively discussing any optimization on @cascade at the moment. If you have a suggestion or an idea on how @cascade can handle things differently, please do share it with us.

I only ask because @minhaj seemed to elude to a internal conversation about it above.

However, here is an idea: How about iteratively performing paging and cascade together on the database end. Like say ok, he wants 100 of these and is using cascade, let’s paginate then apply the cascade filter - oh I removed 10 due to cascade filtering, but I know there are more results - grab more and repeat the cascade filtering on that new set.

Seems like that would be better than completely removing paging altogether just to apply it later - in my case increasing the complexity of the query by 1 million times. In the happy path where cascade does not do much filtering (or any), the above is extremely fast, single pass - but currently would take many minutes.

Maybe even a buffer to get some extra internally if cascade is anywhere in the query, so it is not getting all, but a little more than first just in case.


But the main problem lies in the offset though. The only way to really fix that is to use after instead of offset when using cascade, but after is not available in GraphQL :frowning:

To illustrate why this is problematic to do paginate with cascade without after

Let’s assume a query gets users that have a certain payment plan and we need the query to start at the root of the user instead of refactoring for a better query, so we use cascade.

Schema
type User {
  username: String! @id
  plan: Plan! @hasInverse(field: "usedBy")
}
type Plan {
  id: ID
  usedBy: [User]
}
Example Data
  <0x1> <dgraph.type> "User" .
  <0x1> <User.username> "user1" .
  <0x1> <User.plan> <0x5> .
  <0x2> <dgraph.type> "User" .
  <0x2> <User.username> "user2" .
  <0x2> <User.plan> <0x6> .
  <0x3> <dgraph.type> "User" .
  <0x3> <User.username> "user3" .
  <0x3> <User.plan> <0x5> .
  <0x4> <dgraph.type> "User" .
  <0x4> <User.username> "user4" .
  <0x4> <User.plan> <0x6> .
  <0x5> <dgraph.type> "Plan" .
  <0x5> <Plan.usedBy> <0x1> .
  <0x5> <Plan.usedBy> <0x3> .
  <0x6> <dgraph.type> "Plan" .
  <0x6> <Plan.usedBy> <0x2> .
  <0x6> <Plan.usedBy> <0x4> .
Example Query with cascade
{
  queryUser @cascade {
    username
    plan(filter: { id: ["0x6"] }) {
      id
    }
  }
}

Cascade being a post-query and pre-pagination and pre-return process this would currently get all 4 users and then weed out the two we don’t want.

Now let’s apply pagination getting the first 1

Example Query with first:1
{
  queryUser(first:1) {
    username
    plan(filter: { id: ["0x6"] }) {
      id
    }
  }
}

This still gets ALL 4 users and then cascades and then returns the first 1.

This logic works the same for first:1, offset: 1 to get ALL then cascade down to the 2 and then removes the offset slice and limits to the first 1.

The problem stated above in this performance is what if there were millions or billions of users and there were only a very small set fitting the cascade post process?

Every query would then need to touch ALL users (even if there are billions of them) and then apply the cascade and then the pagination.

It was suggested to use a has filer to help reduce the galaxy at the root, but if every user has a plan, then a has filter is of no effect in this scenario.

So a user requests the first user out of the 10 that match the cascade out of the billions of users. The performance makes no difference Dgraph side if there is pagination or not applied.

This change was made in v21.03 to fix the problem when cascade was applied after pagination and incomplete results were being returned.

The suggestion above was to do a recursive paginating getting more until the pagination was met. But the question must be answered logically where to start the pagination from if you only can provide offset?

If we paginate any at all before cascade then we will possibly run into incomplete results and need to fetch more before the response.

If we offset at all in our first query then we risk not passing over enough and not knowing how many we have actually passed over that matched the cascade.

So let’s look at how this suggested process might work:

  1. Query with paginating getting the first 1
  2. Apply Cascade
  3. Do we have limit? && are we at the end? if both no, then go to step 1
  4. Return results to client

But when we include an offset it goes something like this:

  1. Query with paginating (without offset) getting the first 1
  2. Apply Cascade
  3. Do we have (offset amount + limit)? && Are we at the end? If both no, then go to step 1
  4. Return results to client

So the first method without an offset works well, but think again if we have billion of nodes, and we offset by 5. Consider that the first 5 are within the first few hundred nodes but the 6th one is the very last node.

  • Limiting to 6 would still need to read all billion nodes, but now instead of one chunk of billion we are caught in a very long loop
  • Offsetting does no good to decrease this loop, but rather makes the loop even worse as now you have to loop at least twice to get a complete result.
  • If we wanted the first:1, offset: 5 knowing ourselves that the 6th node is the billionth node, then we would be caught running one billion queries if we did not buffer the pagination some to get any more than one on each loop
  • A Higher offset acts as a multiplying factor to the number of loops needed to get a complete result set.

Solution: DO NOT USE CASCADE AND PAGINATION TOGETHER if you are experiencing performance issues. Instead refactor the query to have the best performance. If you are cascading over multiple fields, then it would be better to run the query in DQL (with var blocks) instead of GraphQL for better performance, but then you will need to apply any auth rules as well in var blocks.

1 Like

To clarify, none of my issues are with the GraphQL™ layer, I am not using it at all.

The issue is:
v20.11: query runs fast
v21.03: same query runs forever

@amaster507 you have definitely outlined a worst case scenario which would cause… the query to run forever - which it already does. So, which pattern takes longer - a query of a billion that will never finish, or a billion smaller queries that will also never finish?

Ofcourse any happier path will actually return results with the correct pagination+filtering semantics, even among billions of results. Currently, that is only possible if you are ok processing the entire dataset.

Running the entire query without cascade and then again with cascade returns us to semantics present in v20.11, which is the only saving grace here.

Just trying to help better understand the scope here. Any change in DQL also effects GraphQL so I try to think at the higher level down to the lower level.

Right, just showing that even with the proposed solution of doing some kind of pagination prior to cascade and then applying pagination again after will not lead to a perfect solution and may sometimes be a worse solution depending on the effects of

a query of a billion… or a billion smaller queries

I don’t know what a billion smaller queries might cause but possibly a scenario worse then the first, idk?


My previous reply was just my brainstorming, and the problems involved with any kind of “fix”


@iluminae, To Continue since I can’t reply to self:

I am interested though in using DQL prior to v21.03 how do you know if you are at the end of a pagination result without doing an additional count every step of the way?

In the higher GraphQL before we had aggregate edges and queries, there was no way to know how many there might be, so when I got an empty or incomplete result I assumed we were at the end. But if I used cascade then even the first many pages could be empty while the remaining were not. And furthermore it would be easy to assume that when I applied cascade and the first: 1 returned [] then any page offset after that would also be empty, but that was not true because I could have many even hundreds of pages before I find one that cascaded out and remained. This lead me to make incorrect conclusions from my data before that was not accurate. For myself, I just simply refuse to use cascade and pagination anywhere together.

I think the ultimate decision is the difference of use cases. You use pagination to make the query bearable, where others are using pagination to actually paginate the results.

I wonder if your end users are understanding of the fact that an empty result set might not actually be an empty result since you are applying some kind of pagination on user queries? If I would allow a user to build a query to find all of the users with a plan (given my simple illustration above) and then I paginated it to say 1,000 to make it bearable and it returned no responses, would that user might believe that there are actually no responses or would he correctly know that the first 1,000 users did not match the pattern queried? (I don’t know anything about your application, but…) This could be dangerous to let an end user believe something to be true when it is not. For instance if the query was made to look for fraudulent activity and the first 100,000 accounts checked out good, then it might be assumed there were no fraudulent activity but maybe the next 100,001 - 200,000 were all fraudulent. That goes along the lines of bad polling data assuming that a small percentage of the population correctly or even somewhat closely resembles the whole

True, it will (probably) be rare for my users to request a ‘second page’ as opposed to ‘give me a consistent sample of my results’ meaning of using first: without ever requesting a second page.

how do you know if you are at the end of a pagination result without doing an additional count every step of the way?

Maybe this is something only the database itself could do efficiently, since it could know if it reached the end of the tablet or ditched early (not saying they track that now but it could). Or, stream back results to the originating peer and client cancel the grpc stream when pagination requirements are met - or the stream terminates on the far side upon tablet completion.

A some level, I have to think that my example dataset of over 6M truly did not have to be completely iterated to get 12 results. After all, Only a Sith Deals in Absolutes.