Sharing a little trick

Hi everyone, long time no see.

I’m here to share a small experience. I’ve been using Dgraph in two real-world projects with minimal hardware resources. I’ve noticed that tweaking the data model significantly boosts query performance. The little trick is to transition from approaches like “myEdge @filter(eq(something, 'great'))” to something along the lines of “myEdge.something.great”.
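As a rough sketch, here are the two styles side by side (the predicate names are the placeholders from above, not a real schema):

```
# Filter style: the value lives on the child nodes and must be evaluated per node
{
  q(func: uid(0x1)) {
    myEdge @filter(eq(something, "great")) {
      uid
    }
  }
}

# Edge style: the state is encoded in a dedicated predicate, so the
# traversal only touches nodes that are already in that state
{
  q(func: uid(0x1)) {
    myEdge.something.great {
      uid
    }
  }
}
```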

It’s quite straightforward. This avoids evaluating the value on the edge, which was causing a ridiculous bottleneck on an underpowered EC2 machine. In one project, each chain of data in my model had around 50,000 nodes/objects times 19 parents. Filtering 50k nodes through the filter made everything painfully slow — and then multiply that by 19.

The approach is simple, but you need to create an edge for each state. In my case, I had “myEdge.something.great” and “myEdge.something.notgreat”, among other variations. And in the application, I handled each case individually, carefully modifying each of my mutations.
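A minimal sketch of what that looks like in practice, assuming the predicate names from above and made-up uids. The schema gets one predicate per state:

```
myEdge.something.great:    [uid] .
myEdge.something.notgreat: [uid] .
```

And a mutation that changes an object’s state has to delete the old edge and set the new one:

```
{
  delete {
    <0x31> <myEdge.something.notgreat> <0x11> .
  }
  set {
    <0x31> <myEdge.something.great> <0x11> .
  }
}
```

This is the per-mutation bookkeeping I mean: every state transition in the app must keep these edges in sync.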

It was a lot of work to migrate the model, so I think it’s worth starting with this approach from the beginning. Think of this edge/predicate modeling as similar to what is done in Redis, where keys like “example:test:test2” create prefixes. Dgraph inherits this from Badger, which works similarly to Redis in this respect. The benefits are substantial; even on low-performance machines I achieved good results.

I’m not saying you should abandon filters completely. I’m saying that using plain edges as if they were parameters is faster. The trade-off is that you will accumulate a significant number of predicates in your schema, which some may find confusing or messy.

Update: Another advantage of this approach becomes evident when you implement sharding. It can significantly accelerate performance in systems with millions upon millions of objects, as each predicate will be segmented by shard. The performance improvement can be simply astronomical.

Cheers!


@MichelDiz, would you be willing to share some more detailed examples of your little trick?

I can’t share details because they’re private, but the idea is simple. Here’s a pseudo-example.

Alice and Bob on a social network can give thousands of likes to a series of films.

The approach below is the obvious one. But there are other cases where it is not so simple, such as managing object states.

{
  "users": [
    {
      "uid": "0x31",
      "name": "Alice",
      "evaluations": [
        {
          "uid": "0x11",
          "liked": [{ "uid": 1 }, { "uid": 3 }],  // Alice liked Inception and Interstellar
          "disliked": [{ "uid": 2 }]  // Alice didn't like The Matrix
        }
      ]
    },
    {
      "uid": "0x41",
      "name": "Bob",
      "evaluations": [
        {
          "uid": "0x12",
          "liked": [{ "uid": 2 }],  // Bob liked The Matrix
          "disliked": [{ "uid": 1 }, { "uid": 3 }]  // Bob didn't like Inception and Interstellar
        }
      ]
    }
  ],
  "movies": [
    {
      "uid": 1,
      "title": "Inception",
      "year": 2010
    },
    {
      "uid": 2,
      "title": "The Matrix",
      "year": 1999
    },
    {
      "uid": 3,
      "title": "Interstellar",
      "year": 2014
    }
  ]
}
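With this shape, reading Alice’s opinions is a plain traversal over the liked/disliked edges, no filter needed. A sketch against the pseudo-data above:

```
{
  alice(func: uid(0x31)) {
    name
    evaluations {
      liked { title year }
      disliked { title }
    }
  }
}
```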

In some scenarios, a node may hold state values within itself. Consider the attribute “age” as an example. In a node, you might have:

{
  "uid": "0x31",
  "name": "Alice",
  "age": 30
}

And in the user list, you would perform users @filter(gt(age, 30)) to filter based on the index. This process is generally very fast, and it’s a no-brainer to use it when you have a small dataset. However, you can instead create a dedicated edge for the age group. For instance:

users.above.30 { expand(_all_) }

This method is incredibly fast because it segments the user list by edge instead of relying on the index. This is advantageous especially when the index could contain millions of users that would otherwise have to be scanned to check who is over 30. Do you see the difference?
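To make the contrast concrete, here are the two variants as sketched queries (the type name User and the parent lookup via has() are my assumptions, not from the original):

```
# Index style: every User is a candidate; the age index is consulted per user
{
  q(func: type(User)) @filter(gt(age, 30)) {
    name
  }
}

# Edge style: the over-30 group is already materialized as its own predicate
{
  q(func: has(users.above.30)) {
    users.above.30 { expand(_all_) }
  }
}
```

The cost of the edge style is that the application must maintain membership itself — e.g., attach a user to the edge when they cross the threshold.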

It is more work for you to deal with, but as your app grows, this approach pays off.



Hey @MichelDiz, this is basically the same theory as how to optimize the type system itself. Right now Dgraph holds every type as a value in a predicate. If the types were subjects or predicates themselves, the type system would be so much better!


Absolutely! It’s basically the same thing. Ever since I discovered that this approach is better, I’ve been trying to evangelize it at Dgraph (back when I was there). I pitched the idea, and they liked it, but it never really took off. Having this approach natively would be incredibly beneficial for the type system.
