We created a very large Dgraph database using the bulk loader; it has over 12 billion edges, and the p directory is about 1.2 TB in size. We then realized that even a very simple query on an indexed predicate takes over 4 minutes.
Is that time reasonable given the data size (i.e., a limitation of Dgraph), or is there something wrong with our query or database?
HW: 64 vCPUs, 416 GB memory (n1-highmem-64 on Google Cloud Platform)
Single instance
CPU usage: around 200% (2 CPUs fully used out of 64)
Memory usage: around 19%
Disk I/O: almost none, according to the GCP instance monitor
→ It looks like no resource is exhausted.
Dgraph version: v1.0.15, Commit SHA-1: ff5ee1e2
go tool pprof output PDFs:
cpu_1.pdf (40.4 KB): 30 s CPU profile taken just after the query started
cpu_2.pdf (22.0 KB): 30 s CPU profile taken a few minutes after the query started
heap.pdf (27.1 KB): heap profile
If you need additional information, please ask.
Thank you in advance for your reply.
The has() function does not use an index. has() iterates over the data to find the nodes that have the specified edge.
In this case, I just need the 10 highest prices, so has() is essentially redundant; I added it only because a root function is required in a query. Is there another way to write a query that returns the same result?
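For reference, the slow query was shaped roughly like this (the exact query wasn't included above, so this is a reconstruction; only the predicate name price comes from the discussion):

```
{
  # has(price) iterates over every node that carries a price edge;
  # the entire result set is then sorted before taking the top 10.
  top10(func: has(price), orderdesc: price, first: 10) {
    uid
    price
  }
}
```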
Anyway, after reading the pprof results and a bit of the Dgraph source, I learned the following:
To write fast queries against a large database, we must narrow the result set sufficiently in the query's root function.
Ordered pagination does not use an index, so we also need to narrow the result set before paginating (see the sketch after this list).
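Putting both points together, here is a minimal sketch of a narrowed version of the query. It assumes price has an int index so that ge() can use it at the root, and it assumes a lower bound of 100000 that is known to still contain the top 10; both the index type and the bound are my assumptions, not something established above:

```
{
  # ge(price, 100000) uses the price index, so only the narrowed
  # set is sorted. The bound 100000 is a hypothetical cutoff that
  # must be low enough to keep at least 10 results.
  top10(func: ge(price, 100000), orderdesc: price, first: 10) {
    uid
    price
  }
}
```

The trade-off is that you need an approximate lower bound in advance; if the bound is too high, the query returns fewer than 10 results.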
I don't have a plan yet, but here is a feature request: a B-tree index for Dgraph, which should make sorted queries fast. Getting the 10 highest prices out of a B-tree index is very quick, a matter of milliseconds (maybe a couple of seconds on our data in the worst case, I don't know, but it should be really fast).