Dgraph performance for hierarchical data

I compared the performance of dgraph for hierarchical data for developing a multi-level comments sections. I inserted the following data:

1000000 → Root Nodes
1000 comments nested to node 2
500 comments nested to node Path: 2->5->6
500 comments nested to node Path: 2->5->6->10039121
10500 → comments to node Path: 2->5->6->10039121->10034896
1000 —> comments to node Path: 2->5->6->10039121->10034896->10015766
1000 —> comments to node Path: 2->5->6->10039121->10034896->10015766->10195896

We compared the performance of postgres and Dgraph for the same data. Each node contains information like: Id, Name, Email_Id, Content, CreatedAt and UpdatedAt.
Following are the response:
For Complete data:
Dgraph: Docker itself got closed.
Postgres: 5s for this, this includes processing for formatting the data too in nested json format.

For fetching all the children of Root 2:
dgraph: 544.1775ms
postgres: 106.285167ms

Can you pls tell us what we might be doing wrong?

You’re going to have to provide more data including at least the query you are running and preferrably your hardware and runtime environments

Hi @amaster507
This is the 1st query to fetch the complete data:

query findNode() {
		node(func: has(name))  @recurse (depth: 100000) {
			email_id
			name
			content
			created_at
			updated_at
			replies
		}
	}

This is the 2nd query to fetch all the child nodes of a node 2:

query findNode($id: string) {
		node(func: eq(id, $id)) @recurse (depth: 100000) {
			email_id
			id
			name
			content
			created_at
			updated_at
			replies
		}
	}

Screenshot of the schema:

Hardware specs:

  • Apple M1 chip 8GB RAM.
  • Dgraph was running on docker with no limitation on resources

Disclose everything not just abstract ideas or queries. Without it, it’s impossible to know what you’re doing wrong. It’s also important to know what you’re doing in Postgress.

Hi @MichelDiz, we are trying to build a multi level comment section, like redit. In this, we can have any level of nesting for comments.
A sample json would be something like this:

[
  {
    "email_id": "abc",
    "id": "132",
    "name": "132",
    "content": "1231",
    "created_at": "2023-05-17T06:01:17.482Z",
    "updated_at": "2023-05-17T06:01:17.482Z",
    "children": [
      {
        "email_id": "test@gmail.com",
        "id": "1",
        "name": "fwdf",
        "content": "e2we2",
        "created_at": "2023-05-17T06:01:17.482Z",
        "updated_at": "2023-05-17T06:01:17.482Z",
        "children": [....]
      }
    ]
  }
]

We seeded the data using the same from a json of similar format.

For postgres: The following schema was used.

Create table comments (
    id                      ltree PRIMARY KEY,
    name 					VARCHAR(255) NOT NULL,
    email_id				VARCHAR(255) NOT NULL,
    content					TEXT,
    path                    ltree,
    created_at              TIMESTAMP WITH TIME zone DEFAULT Now(),
    updated_at              TIMESTAMP WITH TIME zone DEFAULT Now()
);

Hey Trivedi,

The comparison between PostgreSQL and Dgraph for handling hierarchical data, as in a multi-level comments section, is fair to an extent, but specific system factors must be taken into consideration. Both databases have strengths in different areas due to their underlying structures and use cases.

The observed performance difference could be due to factors such as data storage, computing design and indexing, queries, hardware setup, and Docker resource limits(Which in non-linux systems exists no matter what you do - Docker will never work “natively” in macOS. It is a VM, only in Linux it runs close to the kernel which Apple would never let Docker or others free open source to run something close to the kernel. Only VMWare, Paralels may touch Apple Hypervisor). However, an additional factor to consider in this comparison is the operating system and hardware environment. As you’re using a macOS M1, this could potentially impact the results. Dgraph doesn’t have a build for macOS and likely isn’t running on the recent (and not yet fully tested) ARM version. On the other hand, PostgreSQL has support for Apple ARM. In your case, you executed Dgraph in a Docker container, and for a fair comparison, PostgreSQL should be run in the same manner.

Furthermore, when dealing with more than 1 million nodes, it’s beneficial to have additional memory. Dgraph would need to expand its use of RAM for larger queries, which could also affect performance.

Despite PostgreSQL’s superior performance in this specific task, the choice of database should take into account other factors like ease of use, community support, project maturity(which is a win for Postgres), and specific features. Database optimization and correct setup are also key when comparing different databases.

It is nice to conduct such comparisons in fair scenarios. As these not only provide valuable insights into the performance characteristics of different systems, but also pave the way for improvements. It’s important to remember, however, that the suitability of a database system depends on the specific use case, and performance is just one of many factors to consider.

As you may not doing anything serious, just evaluating the product. Maybe you can call Postgres a winner for your case. Cuz if you are comparing a Graph Database with Postgres in performance. You are not interested in Graph characteristics right? Otherwise you would be comparing with Neo, Tiger, or even RedisGraph.

Cheers.

Echoing some of @MichelDiz points above - Docker Desktop on MacOS runs on a virtual machine which impacts performance. In production one would use Dgraph via Docker on Linux, which as Michel mentioned runs closer to the kernel. Personally I have a 16 GB M1 and I have run into performance bottlenecks, so it doesn’t surprise me that 8 GB is not sufficient. :slight_smile:

Having said this - I would be very curious about the benchmark results above if they were rerun in a linux environment for a more fair comparison.

Hey @MichelDiz @joshua ,

We tried it on an EC2 instance (m6a.large). Both Postgres and dgraph were running on docker.
Yet, the results are similar only. Postgres was ~3.5 times faster for our use case.

Hardware specs of EC2 instance:
CPU: 2 core
RAM: 8 GB
Kernel: Linux/Ubuntu

For this root, dgraph took 1.03276245s
For this root, postgres took 279.365192ms

This is a tree example which is a graph. So wanted to understand this behavior. Shouldn’t dgraph outperform Postgres in this case?
Or maybe we can change the database schema to optimize dgraph’s performance.

In theory, yes. But need to check the methods. Compare the tricks postgres does and what we do.

If you use has() this will do a complete scan over the predicate name. Mean, all nodes with name will be expanded in RAM. And then for each node found it will run a recurse. This is unecessary. Why not find the root and from the root expand?

I have a internal project that deals with millions of nodes every week. And it is pretty fast all the time. I’m using a tree to do a pseudo-indexation. And also running recurse. It has like 50GB of disk consumed so far, 4 weeks running I think. Still running fine every time I query it.

PS. It is running the basic machine in our Cloud. Wich has 32GB of RAM.