Not able to query across tree structure with millions of links

LCL · August 23, 2024, 10:46am

What I want to do

I want to run a basic query. The schema is the following:

type File {
  id: ID!
  name: String! @search(by: [hash, regexp])
  buckets: [Bucket] @hasInverse(field: inBucket)
}

type Bucket {
  id: ID!
  name: String! @search(by: [hash, regexp])
  inFiles: [File]
  parameters: [Parameter] @hasInverse(field: inBuckets)
}

type Parameter {
  id: ID!
  key: String! @search(by: [hash, regexp])
  value: String @search(by: [hash, regexp])
  inBuckets: [Bucket]
}

As an example, there is one parameter which is linked to ~2 million Buckets, which are again linked to roughly 40000 files.

The following query is taking many many minutes to finish:

query {
	var(func: type(Parameter)) @filter(eq(Parameter.key, "someKey"))  {
		Parameter.inBuckets {
			Bucket.inFiles {
				files as uid
			}
		}
	}
  
	query(func: uid(files)) {
		count(uid)
	}
}

Question

Is there a way to speed this up. Would it make sense to have links direktly from parameters to files to improve performance?
In other cases with way less links, the performance is as expected.

vnium · August 23, 2024, 11:27am

try

query {
	var(func: eq(Parameter.key, "someKey"))  {
		Parameter.inBuckets {
			Bucket.inFiles {
				files as uid
			}
		}
	}
  
	query(func: uid(files)) {
		count(uid)
	}
}

LCL · August 23, 2024, 12:01pm

Filtering the parameters themselves is no issue at all, thats blazingly fast. Even aggregating the buckets is quite performant. But as soon i go one level further its super slow…

Raphael · August 23, 2024, 3:21pm

@LCL ,
1- you mentioned a slow query and @vnium suggested a way to re-write the same query. Could you confirm, that the suggested approach is working for you?
Searching for type “Parameter” and then filtering on key value is more CPU intensive than directly using the key index and searching using eq() function. @vnium suggestion should really improve your query perf.

2-Could you elaborate on your other queries when you mentioned “I go one level further”? (A sample querie may help). Thanks

LCL · August 23, 2024, 7:08pm

@Raphael @vnium , sure I can elaborate further.

Running the original query takes 160475274744 ns. Modifying the query as suggested takes 147264831335 ns which is marginally better. I tested it again and it took 172891140041 ns, so a bit worse than the original query. Thats some variance which is to be expected i guess.

Just querying for the parameter “the original way”

query {
    query(func: type(Parameter)) @filter(eq(Parameter.key, "someKey")) {
        count(uid)
    }
}

takes 59506088 ns, querying it the suggested way

query {
    query(func: eq(Parameter.key, "someKey")) {
        count(uid)
    }
}

takes 1159657 ns. So yes, as expected this way of querying parameters is faster, but that does not make up for the remaining time spend for the full query.

What i meant by going one level further is the following. I just queried parameters and counted just the parameters which is super quick (less than a second). The next level would be aggregating the linked buckets, so

query {
    var(func: eq(Parameter.key, "someKey")) {
        Parameter.inBuckets {
	        buckets as uid
        }
    }
  
    query(func: uid(buckets)) {
        count(uid)
    }
}

which takes 777101367 ns, still more than quick enough. The next level is aggregating the linked files of the buckets (original query) which is already stated above, taking critically longer.

Coming to a conclusion: ill for sure start writing queries the suggested way, but this does not help in this specific case. Yes its more performant, but it does not solve the described issue.

Maybe some context to my db. There are 313707 files, 2885203 buckets and 1189707 parameters. The relation can be seen in the schema snippet in the original post. The Parameter.key “someKey” relates to 2 parameters. Linked to those two parameters are 2483757 buckets. Linked to the buckets are 41505 files.

LCL · September 2, 2024, 7:33am

@Raphael could you support here please?

Raphael · September 5, 2024, 4:11pm

@LCL I have no definite answer here. You could explore the following options:

Query optimization:

1- you don’t need to ‘enter’ inBuckets and create a map (variable) of uids.
2- you can count the inBuckets in the first var
So you can use

   var(func: eq(Parameter.key, "someKey")) {
        buckets as count(Parameter.inBuckets)
    }

Now buckets is a map of uid(key) → count of buckets. You can sum the values

query {
    var(func: eq(Parameter.key, "someKey")) {
        buckets as count(Parameter.inBuckets)
    }

    result() {
      count: sum(val(buckets))
     }
}

It should be faster, but I did not test on realistic data set, I just recreated you schema and played with few parameters to test the queries. If you have an extract of your dataset to share in RDF we can do some testing.

Architecture

The other option to explore is the architecture to leverage predicate sharding with multiple Alphas so that the relationships inBuckets, inFIles, and parameters will get a chance to be served by different Alpha nodes. I’ll ask engineers about their thoughts on this use case.
Let me know if the updated query runs faster.

Topic		Replies	Views
Two equivalent queries, one is slow and the other one is fast Dgraph	3	823	January 6, 2020
Query to slow, how to optimize query Dgraph	5	471	April 25, 2021
Query performance of large database (over 12g edges) Dgraph	5	1782	July 2, 2019
How to improve query response time? Dgraph	6	533	November 11, 2019
Understanding why `count(~fooPredicate) @filter(eq(dgraph.type, "fooType"))` is slow Dgraph	1	532	May 11, 2020

Not able to query across tree structure with millions of links

What I want to do

Question

Query optimization:

Architecture

Related topics