Neo4j vs Dgraph - The numbers speak for themselves - Dgraph Blog


(Manish R Jain) #1

As Dgraph is nearing its v0.8 release, we wanted to spend some time comparing it against Neo4j, which is the most popular graph database. We have divided this post into five parts:

  1. Loading data
  2. Querying
  3. Issues faced
  4. Features
  5. Principles behind Dgraph

Set up

  • Thinkpad T460 laptop running Intel Core i7, with 16 GB RAM and SSD storage.
  • Neo4j v3.1.0
  • Dgraph from master branch (commit: 100c104a)

1. Loading Data

We wanted to load a dense graph data set involving real world data. We at Dgraph have been using the Freebase film data for our development and testing. We feel this data is highly interconnected and makes a good use case for storing in a graph database.

The first problem we faced was that Neo4j doesn’t accept data in RDF format directly (see the first item under Issues faced). The loader for Neo4j accepts data in CSV format, which is essentially what SQL tables hold. In our 21 million RDF dataset, we have 50 distinct types of entities and 132 types of relationships between these entities. If we were to convert it to CSV format, we would end up with hundreds of CSV files: one file for each type of entity, and one file per relationship between two types of entities. While this is okay for relational data, it doesn’t work for graph data sets, where each entity can be of multiple types and relationships between entities are fluid.

So, we looked into the next best option to load graph data into Neo4j. We wrote a small program similar to the Dgraphloader which reads N-Quads, batches them and tries to load them concurrently into Neo4j. This program used Bolt, a new protocol by Neo4j. It is the fastest way we could find to load RDF data into Neo4j. In the video below, you can see a comparison of loading 1.1 million N-Quads on Dgraph vs. Neo4j.

Note that we only used 20 concurrent connections and batched 200 N-Quads for each request because Neo4j doesn’t work well if we increase either the number of connections or the number of N-Quads per request beyond this. In fact, that’s a sure way to corrupt the Neo4j data and hang the system (see the second item under Issues faced). For Dgraph, we typically send 1000 N-Quads per request and have 500 concurrent connections.
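The batching scheme described above can be sketched roughly as follows. This is a minimal illustration, not the loader used for the benchmark: `batches`, `load_concurrently` and `fake_send` are hypothetical names, and `fake_send` merely stands in for one request over Bolt or gRPC. The defaults of 200 N-Quads per request and 20 connections are the Neo4j settings quoted in the post.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(lines, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_concurrently(lines, send_batch, batch_size=200, connections=20):
    """Send batches through a fixed-size worker pool; return N-Quads sent."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return sum(pool.map(send_batch, batches(lines, batch_size)))


def fake_send(batch):
    # Hypothetical stand-in for one request to the database.
    return len(batch)


# Synthetic N-Quad-like lines, only to exercise the sketch.
nquads = [f'<s{i}> <name> "n{i}" .' for i in range(1000)]
total_sent = load_concurrently(nquads, fake_send)
```

Note that `pool.map` queues every batch up front, so a loader for a very large file would probably want to bound the number of in-flight batches instead.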

With the golden data set of 1.1 million N-Quads, Dgraph outperformed Neo4j 46.7k to 280 N-Quads per second. In fact, the Neo4j loader process never finished (we killed it after a considerable wait).

Dgraph is 160x faster than Neo4j for loading graph data.
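As a rough check on the headline figure, the two throughput numbers quoted above can be divided directly. This is only arithmetic on the rates reported in the post; the variable names are illustrative, and the Neo4j rate is the one observed before the load was killed, so the projected load time is an extrapolation.

```python
# Throughput figures quoted in the post, in N-Quads per second.
dgraph_rate = 46_700
neo4j_rate = 280
dataset_size = 1_100_000  # the golden data set

speedup = dgraph_rate / neo4j_rate
dgraph_seconds = dataset_size / dgraph_rate
neo4j_minutes = dataset_size / neo4j_rate / 60

print(f"speedup: {speedup:.0f}x")
print(f"Dgraph load time: {dgraph_seconds:.0f} s")
print(f"Neo4j projected load time: {neo4j_minutes:.0f} min")
```

The ratio comes out near 167, which the post rounds down to 160x.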

2. Querying

Ideally, we would have liked to load the entire 21 million RDF dataset so that we could compare the performance of both databases at scale. But given the difficulties we faced loading large amounts of data into Neo4j, we resorted to a subset of 1.3 million N-Quads containing only certain types of entities and relationships. After a painful process, we converted our data into five CSV files, one for each type of entity (film, director, and genre) and two for the relationships between them, so that we could run some queries. We loaded these files into Neo4j using its import tool and moved on to query benchmarking.

./neo4j start
./neo4j-admin import --database film.db --id-type string --nodes:Film $DATA/films.csv --nodes:Genre $DATA/genres.csv --nodes:Director $DATA/directors.csv --relationships:GENRE $DATA/filmgenre.csv --relationships:FILMS $DATA/directorfilm.csv

Then we created some indexes in Neo4j for the best query performance.

CREATE INDEX ON :Director(directorId)
CREATE INDEX ON :Director(name)
CREATE INDEX ON :Film(release_date)

We tested Neo4j twice: once with query caching turned off, and once with query caching turned on. Generally, it does not make sense to benchmark queries with caching turned on, but we decided to include that run because it is the default behavior Neo4j users see. You can turn caching off by setting the following variable in conf/neo4j.conf.

dbms.query_cache_size=0

Dgraph does not do any query caching. We loaded an equivalent data set into Dgraph using the following schema and the commands below.

scalar (
    type.object.name.en: string @index
    film.film.initial_release_date: date @index
)

The schema file specifies creation of an index on the two predicates.

# Start Dgraph with a schema which specifies the predicates to index.
dgraph --schema ~/work/src/github.com/dgraph-io/benchmarks/data/goldendata.schema
# Load the data
dgraphloader -r ~/work/src/github.com/dgraph-io/benchmarks/data/neo4j/neo.rdf.gz

With data loaded up into both the databases, we did some benchmarking for simple and some complex queries. The results didn’t surprise us.

Benchmarking process

The benchmarks for Neo4j and Dgraph were run separately so that both processes could utilize full CPU and RAM resources. Each sub-benchmark was run for 10s so that sufficient iterations could be run. We also monitored the memory usage for both the processes using a simple shell script.

go test -v -bench=Dgraph -benchtime=10s .
go test -v -bench=Neo -benchtime=10s .
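The memory monitoring mentioned above was done with a shell script that is not shown in the post. As a stand-in, here is a small Python sketch of one way to sample a process's resident memory on Linux by reading `/proc/<pid>/status`; the function names and the sample text are assumptions made for illustration.

```python
import re


def rss_mb(status_text):
    """Extract the resident set size in MB from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) / 1024 if match else None


def sample(pid):
    """Read the current resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return rss_mb(f.read())


# Hypothetical status excerpt, used only to exercise the parser.
example = "Name:\tdgraph\nVmRSS:\t  614400 kB\n"
example_mb = rss_mb(example)
```

Calling `sample(pid)` once a second while a benchmark runs would give a series comparable to the start and end figures reported later in the post.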
Queries

  Id      Description
  SQ      Get all films and genres of films directed by Steven Spielberg.
  SQM     Runs the query above and changes the name of one of the films.
  GS1Q    Search for directors with name Steven Spielberg and get their films sorted by release date.
  GS1QM   Runs the query above and also changes the name of one of the films.
  GS2Q    Search for directors with name Steven Spielberg and only their films released after 1984-08 sorted by release date.
  GS2QM   Runs the query above and also changes the name of one of the films.
  GS3Q    Search for directors with name Steven Spielberg and only their movies released between 1984-08 and 2000 sorted by release date.
  GS3QM   Runs the query above and also changes the name of one of the films.

Note: If the test id has a P suffix, it was run in parallel.

Read-only benchmarks

Query caching turned off for Neo4j

Query caching on for Neo4j

Neo4j does pretty aggressive query result caching. Dgraph, in contrast, does none. With Dgraph, we wanted to build a database that would perform well on arbitrarily complex queries without resorting to caching queries or results. Caching is cheap and misleading when it comes to testing database performance. User queries are not predictable, and data is not static, so query caching would make a database’s performance look better than it is and become a hindrance to achieving consistent 95th-percentile query latency.

Queries to Dgraph took the same amount of time in both cases, as expected. But Neo4j query latency was at least halved. In particular, queries with filters had the biggest performance gain. That’s not surprising given that all the subsequent runs are just returning results from the query cache. Hence, Neo4j latency was better than Dgraph’s with query caching turned on, and worse with it off.

Read-write benchmarks

Query caching turned off for Neo4j

Query caching on for Neo4j

For intertwined reads and writes, Dgraph is 3x to 6x faster.

We can see that Neo4j is even slower with query caching on because it has to do the extra work of cache invalidation on writes. Dgraph was designed to achieve low-latency querying in real-world use cases, where reads are typically followed by writes and vice versa, and the performance benefits show in the numbers.

Not just that, Neo4j takes up much more memory. At the start of the benchmarks Dgraph consumed around 20 MB which increased to 600 MB at the end. In comparison, Neo4j was already consuming 550 MB at the start which increased to 3.2 GB at the end of the benchmarks.

Dgraph consumes 5x less memory than Neo4j and is at least 3x faster when it comes to the combination of reads and writes with no query caching.

3. Issues faced

  • We couldn’t find a convenient way to load large amounts of interconnected graph data into Neo4j apart from breaking it into CSV files. We had to write a loader which could concurrently load RDF data into Neo4j.
  • We hit data corruption issues on sending more than 20 requests concurrently, which the database could not recover from. In comparison, we typically send 500 concurrent requests to Dgraph, each request batching 1000 N-Quads.
  • While loading data concurrently with 100 open connections, Neo4j started returning bad connection errors because it hit the limit on open file descriptors, which was set to 1024 (the default). We have never witnessed such a problem with Dgraph.

4. Features

We talked about performance and issues. Now, let’s see how Dgraph compares against Neo4j regarding features.

Feature comparison (each feature is followed by the Dgraph and Neo4j entries):

Production Features
    Dgraph: Highly available, Consistent, Fault tolerant
    Neo4j:  Single server architecture

Distributed
    Dgraph: Yes. Data sharded and replicated across servers, using consensus for writes.
    Neo4j:  No. Only distributed query caching

Horizontal Scalability
    Dgraph: Yes. Add servers to cluster on the fly to distribute data better.
    Neo4j:  No. Supports only full data replicas

Transactional Model
    Dgraph: Linearizability aka Atomic Consistency
    Neo4j:  ACID transactions

Backups
    Dgraph: Hot backups in RDF format available using the HTTP interface
    Neo4j:  Hot full and incremental backups available only as part of paid enterprise edition

HTTP API for queries and mutations
    Dgraph: Yes
    Neo4j:  Yes

Communication using clients over binary protocol
    Dgraph: Yes, using grpc
    Neo4j:  Yes, using bolt protocol

Bulk loading of graph data
    Dgraph: Yes, can load arbitrarily connected RDF data using dgraphloader
    Neo4j:  Only supports loading relational data in CSV format using the loader

Schema
    Dgraph: Optional (supports int, float, string, bool, date, datetime and geo types)
    Neo4j:  Optional (supports byte, short, int, long, float, double, char and string types)

Geospatial Queries
    Dgraph: Yes. Supports near, within, contains, intersects
    Neo4j:  No. Not part of core database

Query language
    Dgraph: GraphQL like which responds in JSON
    Neo4j:  Cypher

Aggregation queries
    Dgraph: No
    Neo4j:  Supports count(), sum(), avg(), distinct and other aggregation queries

Order by, limit, skip and filter queries
    Dgraph: Yes
    Neo4j:  Yes

Authorization and authentication
    Dgraph: SSL/TLS and auth token based security (support by v1.0)
    Neo4j:  Supports basic user authentication and authorization

Access Control Lists
    Dgraph: Work in progress (support by v1.0)
    Neo4j:  Available based on roles as part of enterprise edition

Support for plugins and user defined functions
    Dgraph: Work in progress (support by v1.0)
    Neo4j:  Yes

Browser interface for visualization
    Dgraph: Work in progress (support by v1.0)
    Neo4j:  Yes

Dgraph (2016) is a much younger project than Neo4j (2007), so reaching feature parity quickly was a tough job. Dgraph supports most of the functionality one needs to get the job done, though it doesn’t have all the functionality one might want.

5. Principles behind Dgraph

While Dgraph runs very well on our (Linux) Thinkpads, it is designed to be a graph database for production. As such, it can shard and distribute data over many servers. Consistency and fault tolerance are baked deep into Dgraph, to the point where even our tests need to start a single-node Raft cluster. All writes, irrespective of which replica they end up on, can be read back immediately, i.e. they are linearizable (work in progress, ETA v0.8). A few server crashes or losses would not affect end-user queries, making the system highly available.

Such features have traditionally been talked about for NoSQL databases or Spanner, not for graph databases. But we think any production system on which the entire application stack is based must stay up, perform and scale well. The system must be able to make good use of the server running it, process a lot of queries per second, and provide consistent latency.

Also, given we’re building a graph database, the system should be able to handle arbitrarily dense interconnected data and complex queries. It should not be confined by pre-optimization of certain edges, or other tricks to make queries run fast.

Finally, running and maintaining such a system should be easy for engineers. And that’s only possible if the system is as simple as it can be, and every piece of complexity introduced to the system is carefully weighed.

These are the principles which guide us towards building Dgraph. And we’re glad that in a short period, we’ve been able to achieve many of these. Now we leave it to you, our users, to try out Dgraph and let us know what you think.

I’ll be giving a talk about Dgraph at Gophercon India on 24-25th Feb. So, if you’re interested in learning about it, come find me.

Note: We are not Neo4j experts and are happy to accept feedback about any improvements to the loader or the benchmark tests to get better results for Neo4j.

If you haven’t already tried Dgraph, try out the 5 step tutorial to get started. Let us know what you think!

Top image: Hubble Gazes at a Cosmic Megamaser


This is a companion discussion topic for the original entry at https://open.dgraph.io/post/benchmark-neo4j/

(Michael Hunger) #2

Just really quickly. Unfortunately, your benchmark has a number of issues that invalidate all its Neo4j measurements.

We recommend users in general to ignore vendor benchmarks and test with their own hardware, data, use-cases for relevant and reliable results.

Here is a quick list from just skimming over, I didn’t measure or test anything so don’t assume it is all correct / working:

General:

  • Neo4j is no RDF database, so RDF data model makes no sense
  • using a community provided Go Driver for which performance has not been validated, not an official driver like JS, Java, .Net
  • incorrect information in feature table
  • memory usage can be configured

Writes:

  • merge query without labels (:Node) each statement does 2 full all-node-scans
  • no constraint for :Node(xid)
  • no timing published for csv import (11s on my mac 1.1M triples)
  • doesn’t use transactions of e.g. 50k or 100k updates per request
  • which can be best achieved with a single query per tx and UNWIND of a payload of an array of structs
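To make the last two points concrete, here is a hedged sketch of how such payloads might be assembled. It only builds Cypher strings and parameter maps and does not talk to a database. The `:Node` label and `xid` property come from the list above; the `props` field, the function names and the sample records are hypothetical, and the `$rows` parameter syntax is an assumption (some older Neo4j releases expect `{rows}` instead).

```python
# Cypher statements as plain strings; nothing here talks to a database.
CONSTRAINT = "CREATE CONSTRAINT ON (n:Node) ASSERT n.xid IS UNIQUE"
UPSERT = (
    "UNWIND $rows AS row "
    "MERGE (n:Node {xid: row.xid}) "
    "SET n += row.props"
)


def to_rows(records):
    """Shape (xid, props) pairs as the list of maps that UNWIND iterates."""
    return [{"xid": xid, "props": dict(props)} for xid, props in records]


def transactions(records, tx_size=50_000):
    """Yield (statement, parameters) pairs, one per transaction."""
    rows = to_rows(records)
    for start in range(0, len(rows), tx_size):
        yield UPSERT, {"rows": rows[start:start + tx_size]}


# Synthetic records, only to show how the payload is split.
records = [(f"m.{i}", {"name": f"film {i}"}) for i in range(120_000)]
payloads = list(transactions(records))
```

Each pair would be run as a single parameterized statement in its own transaction, after the constraint has been created once.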

Reads:

  • no use of parameters in reads
  • disabled query plan cache (which was incorrectly understood as query result cache, which doesn’t exist)
  • no constraint for :Film(filmId)

Good luck with your development of dgraph, it looks like a good technology for RDF use-cases.

Cheers, Michael@neo4j


(Manish R Jain) #3

Thanks Michael for your comments.

Just to clarify, we ran the tests twice both read-only and r-w workloads – once with Neo4j query cache enabled and once disabled. All four results are presented.

The Go driver that we used did little more than just call Neo4j over Bolt. So, we determined it safe to be used for benchmarking. Also, if you think any information in the feature table was incorrect, can you please send a mail to contact@dgraph.io with the correct version?

The particular issues you raised about optimizing reads and writes seem valid. Would you mind sending a PR to fix the way we query Neo4j? The code is here: https://github.com/dgraph-io/benchmarks/tree/master/data/neo4j

We’ll be happy to re-run the numbers and update our post accordingly.


(Michael Hunger) #4

Hi Manish,

I am not a benchmarking expert (and not a fan either) however, I had to point out the mistakes in your understanding and use of Neo4j.
Through the years, I’ve seen that most benchmarks - especially ones by people not familiar with all technologies tested - have problems that make them unreliable.
Unfortunately, I don’t have the capacity to fix the issues in your test code, as our large community keeps me busy.

I wish you guys the best!!

Go graphs!!!

Cheers, Michael@neo4j


(Patrick Mualaba) #5

Hello Manish,

Thanks for the magnificent work you are doing with Dgraph. I believe it is on its way to becoming a beautiful gem for the database world.

Concerning this benchmark, I was wondering how the predicate instances were stored in Neo4j. Did you model all predicates as edges in Neo4j and their values as new nodes (RDF data modeling), or did you model the predicates as properties on the nodes in Neo4j where possible (property graph modeling)?

Cheers,

Patrick


(Pawan Rawal) #6

Hi @pmualaba

Thank you. We modelled the predicates as properties on the nodes. So name and release date were properties for the film entities, and name was a property for director and genre.


(Kisung Kim) #7

Hi, I agree with @jexp in that there’s no query result cache in Neo4j.
So I think the part of your post mentioning that query_cache is the query result caching should be modified.
And it is not unusual to use the query plan cache in DBMS when benchmarking it.


(Manish R Jain) #8

Hi @kskim80,

That’s not what we observed. Every time we wrote data to Neo4j, it invalidated the query cache. This is mentioned in Neo4j logs. If it was purely just query plan cache, adding data shouldn’t invalidate it. In fact, if you look at the read-write benchmark numbers, Neo4j latency jumped up when we turned on the cache, due to this extra cache invalidation step. You can observe it yourself by running Neo4j, adding some data, querying it, and then adding data again.


(Kisung Kim) #9

Can you explain how you observed the query cache being invalidated in the Neo4j logs?
I changed dbms.logs.debug.level to ‘DEBUG’ but I could not see any log entry mentioning cache invalidation.
Sorry for asking about Neo4j in your blog, but I am also curious about the existence of the query result cache in Neo4j.

And one more question, about Dgraph.
There’s a graph database benchmark called LDBC. I’m sure you already know it.
How expressive is Dgraph?
I am curious whether Dgraph can answer the complex queries of LDBC, which have somewhat complex graph patterns.

Thank you.


(Manish R Jain) #10

Hey @kskim80,

This is our query language spec page: https://wiki.dgraph.io/Query_Language. I haven’t looked at LDBC deeply, but if you wish to investigate if Dgraph would be compatible with LDBC, we can definitely help out.


(Martin Rauscher) #11

Although I’m not a Neo4j fan, I have to disagree with saying that only read-and-write workloads are “real world” usage. Your sample in particular (movies and directors) is not something that changes very rapidly…

So implementing some caching might not be the worst idea of all…