Neo4j vs Dgraph - The numbers speak for themselves - Dgraph Blog

As Dgraph is nearing its v0.8 release, we wanted to spend some time comparing it against Neo4j, which is the most popular graph database. We have divided this post into five parts:

  1. Loading data
  2. Querying
  3. Issues faced
  4. Features
  5. Principles behind Dgraph

Set up

  • Thinkpad T460 laptop running Ubuntu Linux, Intel Core i7, with 16 GB RAM and SSD storage.
  • Neo4j v3.1.0
  • Dgraph from master branch (commit: 100c104a)

1. Loading Data

We wanted to load a dense, real-world graph data set. At Dgraph, we have been using the Freebase film data for our development and testing. This data is highly interconnected, which makes it a good candidate for storing in a graph database.

The first problem we faced was that Neo4j doesn’t accept data in RDF format directly (see issue 1 in section 3). The Neo4j loader accepts data in CSV format, which is essentially the shape of SQL tables. In our 21 million N-Quad dataset, we have 50 distinct types of entities and 132 types of relationships between these entities. If we were to convert it to CSV format, we would end up with hundreds of CSV files: one file for each type of entity, and one file per relationship between two types of entities. While this is fine for relational data, it doesn’t work for graph data sets, where each entity can be of multiple types and relationships between entities are fluid.

So, we looked into the next best option to load graph data into Neo4j. We wrote a small program similar to the Dgraphloader which reads N-Quads, batches them, and loads them concurrently into Neo4j. This program used Bolt, Neo4j’s new binary protocol. It was the fastest way we could find to load RDF data into Neo4j. In the video below, you can see a comparison of loading 1.1 million N-Quads on Dgraph vs. Neo4j.

Note that we used only 20 concurrent connections and batched 200 N-Quads per request, because Neo4j doesn’t cope well if either the number of connections or the N-Quads per request is increased beyond this. In fact, doing so is a sure way to corrupt Neo4j’s data and hang the system (see issue 2 in section 3). For Dgraph, we typically send 1000 N-Quads per request over 500 concurrent connections.
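For illustration, here is a minimal sketch of what such a concurrent, batched loader might look like. This is not the loader used for these benchmarks: it assumes the current neo4j-go-driver (v5) rather than the driver available at the time, a naive whitespace-based N-Quad parser, and a generic Entity/EDGE Cypher pattern, all of which are simplifications.

```go
// Sketch of a concurrent, batched N-Quad loader for Neo4j over Bolt.
// Hypothetical illustration: the driver version, parser and data model
// do not match the original benchmark loader.
package main

import (
	"bufio"
	"context"
	"log"
	"os"
	"strings"
	"sync"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

const (
	concurrency = 20  // more than ~20 connections destabilized Neo4j in our runs
	batchSize   = 200 // N-Quads sent per request
)

// parseNQuad naively splits a line into subject, predicate and object.
// A real loader would use a proper N-Quad parser.
func parseNQuad(line string) map[string]any {
	f := strings.Fields(line)
	for len(f) < 3 {
		f = append(f, "")
	}
	return map[string]any{"subject": f[0], "predicate": f[1], "object": f[2]}
}

func main() {
	ctx := context.Background()
	driver, err := neo4j.NewDriverWithContext("bolt://localhost:7687",
		neo4j.BasicAuth("neo4j", "password", ""))
	if err != nil {
		log.Fatal(err)
	}
	defer driver.Close(ctx)

	batches := make(chan []map[string]any)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sess := driver.NewSession(ctx, neo4j.SessionConfig{})
			defer sess.Close(ctx)
			for batch := range batches {
				// MERGE subject and object nodes, then the edge, for each quad.
				if _, err := sess.Run(ctx, `
					UNWIND $quads AS q
					MERGE (s:Entity {xid: q.subject})
					MERGE (o:Entity {xid: q.object})
					MERGE (s)-[:EDGE {predicate: q.predicate}]->(o)`,
					map[string]any{"quads": batch}); err != nil {
					log.Println("batch failed:", err)
				}
			}
		}()
	}

	// Read N-Quads from stdin, batch them, and feed the workers.
	scanner := bufio.NewScanner(os.Stdin)
	batch := make([]map[string]any, 0, batchSize)
	for scanner.Scan() {
		batch = append(batch, parseNQuad(scanner.Text()))
		if len(batch) == batchSize {
			batches <- batch
			batch = make([]map[string]any, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		batches <- batch
	}
	close(batches)
	wg.Wait()
}
```

Dgraphloader plays the same role on the Dgraph side, just with much larger batches and far more concurrent connections.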

With the golden data set of 1.1 million N-Quads, Dgraph outperformed Neo4j at 46.7k versus 280 N-Quads per second. In fact, the Neo4j loader process never finished (we killed it after a considerable wait).

Dgraph is 160x faster than Neo4j for loading graph data.

2. Querying

We would ideally have liked to load the entire 21 million N-Quad dataset so that we could compare the performance of both databases at scale. But given the difficulties we faced loading large amounts of data into Neo4j, we resorted to a subset of 1.3 million N-Quads containing only certain types of entities and relationships. After a painful process, we converted this data into five CSV files, one for each type of entity (film, director, and genre) and two for the relationships between them, so that we could run some queries. We loaded these files into Neo4j using their import tool.

./neo4j start
./neo4j-admin import --database film.db --id-type string --nodes:Film $DATA/films.csv --nodes:Genre $DATA/genres.csv --nodes:Director $DATA/directors.csv --relationships:GENRE $DATA/filmgenre.csv --relationships:FILMS $DATA/directorfilm.csv

Then we created some indexes in Neo4j for the best query performance.

CREATE INDEX ON :Director(directorId)
CREATE INDEX ON :Director(name)
CREATE INDEX ON :Film(release_date)

We tested Neo4j twice: once with query caching turned off, and once with query caching turned on. Generally, it does not make sense to benchmark queries with caching turned on, but we included it because that’s the default behavior Neo4j users see. You can turn query caching off by setting the following variable in conf/neo4j.conf.

dbms.query_cache_size=0

Dgraph does not do any query caching. We loaded an equivalent data set into Dgraph using the following schema and the commands below.

scalar (
    type.object.name.en: string @index
    film.film.initial_release_date: date @index
)

The schema file specifies creation of an index on the two predicates.

# Start Dgraph with a schema which specifies the predicates to index.
dgraph --schema ~/work/src/github.com/dgraph-io/benchmarks/data/goldendata.schema
# Load the data
dgraphloader -r ~/work/src/github.com/dgraph-io/benchmarks/data/neo4j/neo.rdf.gz

With data loaded into both databases, we benchmarked both simple and complex queries. The results didn’t surprise us.
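To give a flavor of what these queries look like, here is a rough sketch of sending the simplest one (all films and genres of films directed by Steven Spielberg, the SQ query in the table below) to Dgraph’s HTTP endpoint from Go. The query text is only an approximation of Dgraph’s GraphQL-like language using Freebase-style predicate names; the exact predicates, functions and endpoint used in the benchmark code may differ.

```go
// Sketch: POST a GraphQL-like query to Dgraph's HTTP endpoint.
// The endpoint, content type and query text are approximations,
// not copied from the benchmark code.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Roughly: films directed by Steven Spielberg, with their genres.
	query := `{
	  director(func: allofterms(type.object.name.en, "steven spielberg")) {
	    type.object.name.en
	    film.director.film {
	      type.object.name.en
	      film.film.genre {
	        type.object.name.en
	      }
	    }
	  }
	}`

	resp, err := http.Post("http://localhost:8080/query", "text/plain",
		strings.NewReader(query))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // Dgraph responds with JSON
}
```

On the Neo4j side, the equivalent queries were written in Cypher against the Director, Film and Genre nodes and the FILMS and GENRE relationships created during the CSV import.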

Benchmarking process

The benchmarks for Neo4j and Dgraph were run separately so that each process could utilize the machine’s full CPU and RAM. Each sub-benchmark was run for 10s so that enough iterations were executed. We also monitored the memory usage of both processes using a simple shell script.

go test -v -bench=Dgraph -benchtime=10s .
go test -v -bench=Neo -benchtime=10s .
Queries

| Id | Description |
|---|---|
| SQ | Get all films and genres of films directed by Steven Spielberg. |
| SQM | Runs the query above and changes the name of one of the films. |
| GS1Q | Search for directors with name Steven Spielberg and get their films sorted by release date. |
| GS1QM | Runs the query above and also changes the name of one of the films. |
| GS2Q | Search for directors with name Steven Spielberg and only their films released after 1984-08, sorted by release date. |
| GS2QM | Runs the query above and also changes the name of one of the films. |
| GS3Q | Search for directors with name Steven Spielberg and only their films released between 1984-08 and 2000, sorted by release date. |
| GS3QM | Runs the query above and also changes the name of one of the films. |

Note: If the test id has a P suffix, it was run in parallel.
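For reference, each sub-benchmark is a standard Go benchmark: go test runs the body repeatedly for the requested -benchtime (10s here) and reports the average latency per iteration. Below is a minimal sketch of that structure; runQuery and the query names are placeholders, not the actual benchmark code.

```go
// Sketch of the benchmark structure used with `go test -bench`.
// runQuery is a placeholder for issuing one of the SQ/GS1Q/... queries
// against the database under test and reading the response.
package bench

import "testing"

func runQuery(q string) {
	// placeholder: send q to Dgraph or Neo4j and consume the result
}

// A sequential sub-benchmark: the body runs b.N times within -benchtime.
func BenchmarkDgraphGS1Q(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runQuery("GS1Q")
	}
}

// The "P" suffix in the result tables means the body ran in parallel
// across goroutines, which in Go's testing package looks like this.
func BenchmarkDgraphGS1QP(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			runQuery("GS1Q")
		}
	})
}
```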

Read-only benchmarks

Query caching turned off for Neo4j ![](upload://11AbRpUgn3JO1z1jnxXFc4wJ8O3.png)

Query caching on for Neo4j ![](upload://urn5N71ROfzmkpqtFEdlASksRyr.png)

Queries to Dgraph took the same amount of time in both cases, as expected. But Neo4j’s query latency was at least halved with caching on. That’s not surprising, given that all subsequent runs were served from the query cache. Hence, Neo4j’s latency was better than Dgraph’s with query caching turned on, and worse with it turned off.

Read-write benchmarks

Query caching turned off for Neo4j ![](upload://f9V28sIiNBGBKJ58VYGfp4vD6v2.png)

Query caching on for Neo4j ![](upload://xqCIVgeWDsf4a6KvU2NHgluUlku.png)

For intertwined reads and writes, Dgraph is at least 3x to 6x faster.

We can see that Neo4j is even slower with query caching on, because it has to do the extra work of invalidating the cache on writes. Dgraph was designed to achieve low-latency querying for real-world use cases, where reads are typically followed by writes and vice versa, and the performance benefits show in the numbers.

Not just that, Neo4j takes up much more memory. At the start of the benchmarks, Dgraph consumed around 20 MB, which increased to 600 MB by the end. In comparison, Neo4j was already consuming 550 MB at the start, which increased to 3.2 GB by the end of the benchmarks.

Dgraph consumes 5x less memory than Neo4j and is at least 3x faster for intertwined reads and writes.
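The memory numbers above came from the simple monitoring script mentioned in the benchmarking process. That script was a shell script; purely as an illustration, a rough Go equivalent that samples a process’s resident memory (VmRSS) from /proc on Linux could look like this.

```go
// Sketch: periodically sample a process's resident memory (VmRSS) from
// /proc/<pid>/status on Linux. A hypothetical stand-in for the shell
// script mentioned above; the pid is passed as the first argument.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"
)

func rss(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found for pid %s", pid)
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: memmon <pid>")
	}
	for {
		mem, err := rss(os.Args[1])
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(time.Now().Format(time.RFC3339), mem)
		time.Sleep(5 * time.Second)
	}
}
```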

3. Issues faced

  • We couldn’t find a convenient way to load large amounts of interconnected graph data into Neo4j apart from breaking it into CSV files. We had to write a loader that could concurrently load RDF data into Neo4j.
  • We hit data corruption issues on sending more than 20 concurrent requests, which the database could not recover from. In comparison, we typically send 500 concurrent requests to Dgraph, each request batching 1000 N-Quads.
  • While loading data concurrently over 100 connections, Neo4j started returning bad connection errors because it hit the maximum open file descriptor limit, which was set to 1024 (the default). We have never witnessed such a problem with Dgraph.

4. Features

We talked about performance and issues. Now, let’s see how Dgraph compares against Neo4j in terms of features.

| Feature | Dgraph | Neo4j |
|---|---|---|
| Production features | Highly available, consistent, fault tolerant | Master-slave architecture (only full data replicas) |
| Data sharding | Yes. Data sharded and replicated across servers, using consensus for writes. | No data sharding. |
| Horizontal scalability | Yes. Add servers to the cluster on the fly to distribute data better. | Supports only full data replicas. |
| Transactional model | Linearizability aka atomic consistency | ACID transactions |
| Backups | Hot backups in RDF format available using the HTTP interface | Hot full and incremental backups available only as part of the paid enterprise edition |
| HTTP API for queries and mutations | Yes | Yes |
| Communication with clients over a binary protocol | Yes, using gRPC | Yes, using the Bolt protocol |
| Bulk loading of graph data | Yes, can load arbitrarily connected RDF data using dgraphloader | Only supports loading relational data in CSV format using the loader |
| Schema | Optional (supports int, float, string, bool, date, datetime and geo types) | Optional (supports byte, short, int, long, float, double, char and string types) |
| Geospatial queries | Yes. Supports near, within, contains, intersects | No. Not part of the core database |
| Query language | GraphQL-like, responds in JSON | Cypher |
| Order by, limit, skip and filter queries | Yes | Yes |
| Authorization and authentication | SSL/TLS and auth-token-based security (support by v1.0) | Supports basic user authentication and authorization |
| Aggregation queries | Work in progress (support by v1.0) | Supports count(), sum(), avg(), distinct and other aggregation queries |
| Access control lists | Work in progress (support by v1.0) | Available based on roles as part of the enterprise edition |
| Support for plugins and user-defined functions | Work in progress (support by v1.0) | Yes |
| Browser interface for visualization | Work in progress (support by v1.0) | Yes |

Dgraph (2016) is a much younger project than Neo4j (2007), so reaching feature parity quickly was a tall order.

Dgraph supports most of the functionality that one needs to get the job done; though it doesn’t have all the functionality one might want.

5. Principles behind Dgraph

While Dgraph runs very well on our (Linux) Thinkpads, it is designed to be a graph database for production. As such, it can shard and distribute data over many servers. Consistency and fault tolerance are baked deep into Dgraph, to the point where even our tests need to start a single-node Raft cluster. All writes, irrespective of which replica they end up on, can be read back instantaneously, i.e., reads are linearizable (work in progress, ETA v0.8). A few server crashes would not lose data or affect end-user queries, making the system highly available.

Such features have traditionally been the domain of NoSQL databases or Spanner, not of graph databases. But we think any production system on which the entire application stack depends must stay up, perform, and scale well. The system must utilize the server it runs on well, process a lot of queries per second, and provide consistent latency.

Also, given we’re building a graph database, the system should be able to handle arbitrarily dense interconnected data and complex queries. It should not be confined by pre-optimization of certain edges, or other tricks to make certain queries run fast.

The speed achieved should be due to a better design and across the entire spectrum.

Finally, running and maintaining such a system should be easy for engineers. And that’s only possible if the system is as simple as it can be, and every piece of complexity introduced to it is carefully weighed.

These are the principles which guide us in building Dgraph. And we’re glad that in a short period, we’ve been able to achieve many of them. Now we leave it to you, our users, to try out Dgraph and let us know what you think.

Criticism of these benchmarks (Updated Feb 1, 2017)
  • This reads like marketing material: You’re on the Dgraph blog! Having said that, the benchmarking code is open source and available to anyone willing to put in some time to verify these benchmarks.
  • Query caching was turned off: The benchmarks above showcase results for Neo4j with both caching turned on and off.
  • Neo4j only uses query plan cache, not result cache: That’s not what we observed. In fact, for the same read-write query, Neo4j latency increased when caching was turned on, compared to when off (as you can see in the read-write benchmarks above). A pure query plan cache shouldn’t be affected by data changes.
  • JVM takes time to warm up: Each benchmark was run for 10 seconds by Go, which ran thousands of iterations for the same query to get accurate latency per iteration. We think the JVM should be able to warm up by then.
  • Neo4j queries could be optimized: Just mentioning it doesn’t help. Please send us a pull request to optimize Neo4j data loading and queries. Or, send us a mail.
  • Benchmarking on a laptop isn’t right: We are just looking for relative performance, not absolute performance. It shouldn’t matter which machine you run them on.

Manish will be giving a talk about Dgraph at Gophercon India on 24-25th Feb. If you’re attending the conference, find him to talk about all things Dgraph.

Note: We are happy to accept feedback about any improvements to the loader or the benchmark tests to get better results for Neo4j.

If you haven’t already tried Dgraph, try out the 5 step tutorial to get started with Dgraph. Let us know what you think!

We are building an open source, real time, horizontally scalable and distributed graph database.

We're starting to support enterprises in deploying Dgraph in production. Talk to us, if you want us to help you try out Dgraph at your organization.

Top image: Hubble Gazes at a Cosmic Megamaser


This is a companion discussion topic for the original entry at https://blog.dgraph.io/post/benchmark-neo4j/

Hi, nice post. Although I noticed the Neo4j version in the Dockerfile is actually 2.3.1 community edition, which is quite old; I would be happy to see results with the latest Neo4j.

If I find time, I will try to optimize things a little bit. Especially when loading data initially, Neo4j is fastest with the offline loader, which on the other hand cannot be used on an existing database (it creates a new one), but for this kind of benchmark it is sufficient.

Q: Do you use indexes on Neo4j?

These benchmarks look quite old.
Do you have any newer benchmarks?
This only compares with Neo4j; have you compared with other graph databases which claim to be faster than Neo4j, like TigerGraph or RedisGraph?

Hi there,

We are working on new benchmarks, the work is still in progress.


Benchmarks of Dgraph, Cassandra, AWS Neptune, TigerGraph, and Neo4j would be nice, but so would a TCO comparison. TigerGraph costs more on AWS than Neptune, so is it worth the price? Dgraph Starter is USD 9750 per node per year; maybe running more AWS Neptune nodes is better. A ten-ton truck is better than a six-ton truck, unless the six-ton truck has half the TCO. Benchmarks are nice, but not the bottom line.

@djbushby: Not sure this is a good comparison. Amazon Neptune is a closed-source paid service; Dgraph is an Apache-licensed open source project. You can run Dgraph anywhere for free, so I’m not sure what you mean by “Dgraph starter price”.

For the USD 9750 I was referring to the Starter Package (Support | Dgraph), to compare with Neptune, which has ACLs and backups. The Dgraph Enterprise Package (? $$$) has/will have encryption, which Neptune already has. If you want ACLs and encryption on Dgraph, then it is a closed-source paid service. A db.r5.xlarge costs $0.84 per hour and an r5.xlarge costs $0.302 per hour, although the db instance price includes database management. If Dgraph offered Enterprise on the AWS Marketplace, like TigerGraph (AWS Marketplace: TigerGraph Version 2 (Enterprise Edition)), what would be the cost (p.a., per hour, etc.) and recommendations?

Just to clarify, the code for ACL and backups in Dgraph enterprise version is available under the DCL license and is not closed source. We’ll look into offering the enterprise version on AWS marketplace and keep you informed about the pricing once we decide it.

Thanks for the clarification. I thought closed-source and paid-service were identical, which they are not. It would be good to see the Enterprise $/node/year price on your website and AWS Marketplace $/node/hour pricing (or equivalent), particularly useful for initial (poc) projects.

I find that RDF data in the deep learning age is usually in this format:
entity, relation, entity

This can be imported into Neo4j with the neo4j import --relations relationships.csv command.
But how can we import this data into Dgraph?

You can use this guide: https://docs.dgraph.io/howto/#loading-csv-data
and for the relations you can use https://docs.dgraph.io/mutations/#upsert-block
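If it helps, here is a rough, hypothetical sketch of converting such entity, relation, entity CSV rows into N-Quads that Dgraph’s live loader can ingest. It uses blank nodes so the same entity name maps to the same node; it is an illustration, not taken from the docs linked above.

```go
// Sketch: convert "entity,relation,entity" CSV rows into N-Quads (RDF)
// suitable for Dgraph's bulk/live loader. Hypothetical illustration only.
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"strings"
)

// sanitize makes a name safe to use as a blank-node label or predicate.
func sanitize(s string) string {
	return strings.ReplaceAll(strings.TrimSpace(s), " ", "_")
}

// blankNode turns an entity name into a stable blank-node label so that
// the same entity maps to the same node across rows.
func blankNode(name string) string {
	return "_:" + sanitize(name)
}

func main() {
	r := csv.NewReader(os.Stdin)
	for {
		row, err := r.Read()
		if err != nil {
			break // io.EOF or a parse error ends the loop
		}
		if len(row) != 3 {
			log.Printf("skipping malformed row: %v", row)
			continue
		}
		subject, relation, object := row[0], row[1], row[2]
		// Emit one triple per row, using the relation as the predicate.
		fmt.Printf("%s <%s> %s .\n",
			blankNode(subject), sanitize(relation), blankNode(object))
	}
}
```

The output can be saved as an .rdf file and fed to the loader; as noted above, upsert blocks are needed if the triples must merge with data already in the cluster.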

Can someone make a simple tutorial on loading entity, relation, attribute data into Dgraph and then running a query to search for an entity-relation?


An updated benchmark is certainly welcome, especially with the following DB’s:

  • TigerGraph - They claim substantial performance gains
  • ArangoDB - They claim to be ~2x faster than Neo4j
  • OrientDB - They claim to be 10x faster than Neo4j

While these benchmarks cannot be compared to each other, they give a fairly clear picture that Neo4j isn’t exactly setting the standard.

It would also be very meaningful to add a cost ratio, say total cost per 100k ops/sec, because some of these systems require a cluster of substantial size (600 nodes) to reach a million ops per second. Naturally, a very significant question is: which of these systems runs the most workload on the fewest nodes, and what exactly would it cost to run?

If a system is roughly equal in performance but costs only a fraction at scale, it is clearly the more economical choice, and usually these win the race in the long run.


Neo4j and RedisGraph use a beautiful query language. “Graph”QL is a mess of nested curly brackets and obscurely named functions to learn. ArangoDB similarly has a nice query language but not as intuitive.

Graph databases today are all terrible from a devops perspective because no one offers infrastructure-as-code examples. Just walls of “tutorial”… you want to leverage graphs? Great, hope you like managing servers.

I just want to make Cypher queries on a serverless graph deployed with infrastructure as code. But for whatever reason the various graph DB companies don’t like money. It’s all a benchmark measuring contest, while the fundamental process of deployment and CRUD is an embarrassing fill-in-the-code clusterf**k. I don’t care about your benchmarks; I care how much work it is to deploy your DB and how weird the queries are.

It is silly to expect objectivity on a competitor’s site, but at least you should have tried…

  1. Under the hood both databases store key-values, so CSV is really much more suitable than JSON. I’ve been using Neo4j’s offline bulk import for ages and it is blazing fast (100 million in a couple of minutes). So Neo4j is quite capable.

  2. The versions are too outdated. When you have a tenth of the features, it is silly to boast about speed.

  3. No advanced tests, like pathfinding on medium and huge weighted graphs. None of your test cases really need a graph database - even an RDBMS could cope fine.

Not really. Dgraph uses Badger, which indeed stores key-values, but Dgraph has its own structure with a somewhat complex abstraction on top of the KV layer. It doesn’t work in a plain KV manner.

I disagree. CSV was not made with graphs and complex relations in mind. There is no solid CSV standard that takes relationships into account. You have to do “tricks” to represent them.

JSON is perfect because it reflects the graph structure. That’s also why GraphQL returns JSON instead of tables or something similar.

If you remove all the tiers from Dgraph, you can get 4 times the speed you mentioned. But you will forgo certain guarantees.

Not sure what you mean. Can you show me what is outdated?

There are a ton of tests that run on every single release and also on new PRs, I think. If you have suggestions on how to improve those, feel free to open a PR.

This topic is STILL linked as a benchmark on the official Dgraph site… Neo4j 3.1 was outdated even in 2017. Dgraph 0.8 is not relevant anymore either.

I do not understand what tests and PRs you mean. It WOULD have been a good idea to benchmark pathfinding, community detection, etc. on large graphs.

I have been using Neo4j for several years and never thought I needed JSON. All input and output is FLAT )

I see, this bench is really old.

Where? I tried to find it. No luck.

Never mind, if you didn’t understand it is not important.

Do you have examples? Has Neo4j done it publicly? That might help also to do comparisons.

IMHO, that’s the problem with Neo4j. They keep insisting on mixing two different paradigms, just to keep the SQL user comfortable. I think it’s a bad move, because you don’t think with a graph mindset; you think “FLAT”, and that’s not how graphs work. It is a choice: be popular with SQL-like syntax, or be a real graph DB from top to bottom.

This discussion is the “comments” section of the blog post.

Yeah, we don’t delete old blog posts. It is there because it is a blog post. I thought you saw it somewhere else. It is fine, the date and context are there. It would be “bad” if we had new content pointing to it.

Cheers.