Slow performance on a single node with millions of documents

I’m using Dgraph to import some arbitrary data sets and join them together on certain matching keys.

In my tests, I’ve got four large data sets ranging from 700k to 3 million documents each. They’re joined across the sets on predicates like email or UUID.

Ratel tells me this is the data usage across the tablets:

last_name, 1.5TB
full_name, 320.0GB
_system_mapped_title, 50.7GB
first_name, 27.5GB
user_id, 18.7GB
email, 18.6GB
phone, 18.0GB
created_at, 5.7GB
mobile, 5.2GB
dgraph.type, 4.6GB
id, 164.3MB
address, < 64MB
city, < 64MB
company_name, < 64MB
... and 15 more ..., < 64MB

Each predicate has both “hash” and “fulltext” indices enabled. The tool searches through all of the predicates using alloftext() queries joined with an “OR”.
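
For reference, the relevant schema lines look roughly like this (a simplified sketch showing only a few of the predicates):

email: string @index(hash, fulltext) .
first_name: string @index(hash, fulltext) .
last_name: string @index(hash, fulltext) .
user_id: string @index(hash, fulltext) .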

During testing with smaller data sets (~50k documents on average), query times were very fast. Now that it has these larger data sets, the query time has increased massively to 10 seconds or longer.

Will a clustered setup get the query time back down, or should I be doing something else here? I was contemplating a 3- or 6-node cluster, but I would like some advice on how much RAM each node should have and whether this would definitely fix the problem.

Thanks in advance for any help or advice you can give :slightly_smiling_face:

Hi @Jameyg,

I’m curious about this too. I’m also new to Dgraph, but out of curiosity, do you have a query-time chart? I’m thinking X axis = number of docs in the DB, and Y axis = average query time. This would give us an idea of whether something funny is happening (i.e. whether query time is linearly proportional to the number of docs, or exponential, or something else).

Hi @chewxy,

It was due to using the has() filter to filter by type. I had one large query composed of a long series of alloftext() calls separated by “OR”, plus the same again with has() filtering on the type, also separated by “OR”.

From what I’ve read, it seems Dgraph currently evaluates has() by scanning through the records sequentially and filtering them out, rather than using an index?
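
To illustrate (a minimal sketch, not the exact query our tool generates; “Person” is just a placeholder type name): a has() root function has to walk every node carrying the predicate, whereas type() can use the built-in index on dgraph.type:

{
  # slow: has() scans every node that has the predicate at all
  slow(func: has(email), first: 10) @filter(alloftext(email, "gmail.com")) {
    uid
  }
  # faster: type() uses the index on dgraph.type
  fast(func: type(Person), first: 10) @filter(alloftext(email, "gmail.com")) {
    uid
  }
}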

The query time has gone from 30 seconds (!) to 0.1 seconds :slightly_smiling_face: The only functionality I’ve lost is the ability to segregate groups of data sets from one another, but in the context of this project we could just deploy a separate version of the application for each group, perhaps under its own subdomain.

You can also divide up the predicates by node type. That would split the predicates up better if and when you shard the data by adding more Alpha groups.
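
For example (just a sketch, with made-up type names), instead of one shared email predicate you would declare per-type predicates:

Person.email: string @index(hash, fulltext) .
Company.email: string @index(hash, fulltext) .

Each predicate is stored as its own tablet, so Zero can rebalance them onto different Alpha groups independently.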

Cheers, Manish.

I didn’t mention it in my reply above, but that’s what I ended up doing:

{
  # collect the uids matching the search term on each predicate into variables
  var(func: alloftext(<email>, "gmail.com")) {
    email as uid
  },
  var(func: alloftext(<first_name>, "gmail.com")) {
    first_name as uid
  },
  var(func: alloftext(<last_name>, "gmail.com")) {
    last_name as uid
  },
  # union of the three uid variables, with pagination
  results(first: 10, offset: 0, func: uid(email, first_name, last_name)) {
    uid
    expand(_all_)
    _system_mapped_title
    dgraph.type
    created_at
  }
}

That is, assigning the query blocks to variables and then collecting them together in the results block. It’s not a relevance-ranked search, but it’s definitely good enough for our use case.

Hi @jameyg

Just to follow up on the questions at the end of your post: after doing some research and getting a lot of great help from @omar, I’d like to answer them:

  1. Will a clustered setup help with query time? Not necessarily. It’s more useful to think of a clustered setup as a way to achieve high availability (HA). I’m currently finding out whether splitting the predicates across groups will help with query times, and will update this section when I have an answer.
  2. Should I be doing something else? It seems you already did, by optimizing the query. As you noted, has() is slow, presumably because it needs to look through all the nodes. Picking out a single node (perhaps using eq()) and then expanding your query from there is the preferred way to go with graph databases (see the sketch after this list). Note that this is quite different from the usual way of doing things in the relational world, where the preferred method is to first filter and then select your columns (in Codd’s codification of relational algebra, we say we first select then project; SQL mixed up the terms).
  3. How much RAM should each node have? Each Zero node should have at least 8GB of RAM, and each Alpha node should have at least 16GB.
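
As a sketch of point 2 (using the email predicate from your schema and a made-up address), the graph-native pattern is to anchor on an indexed value first and expand outward from there:

{
  # eq() uses the hash index on email to find the matching node(s) directly,
  # so the query starts from those uids instead of scanning everything
  person(func: eq(email, "someone@gmail.com")) {
    uid
    expand(_all_)
  }
}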

Hey @jameyg - this is a good resource on arranging your clusters - we chatted about it briefly on the call, and it turns out @pawan had done a lunch-and-learn about it.

How was this determined? Based on data size, or on some other metric?