Strange result in comparison with neo4j

Hi everybody!

First of all, I like what you guys are doing with Dgraph! I’ve been following the progress for a year and I’m absolutely impressed.

I have a product that uses a relational DB with a huge number of links inside, and now I’m researching whether graph databases can help in some cases or not.

To do that, I exported a data sample and tried to bulk-upload it to Neo4j and Dgraph.

About the data

I selected one type of entity:

type Individual {
  id Int
  name String
  surname String
  patronymic String
  dt DateTime
}
  • I did not use the type in the Dgraph schema.

And three types of relations. Each relation type has four properties (facets, in Dgraph terms).

I exported 150M Individuals and 50M of each type of relation.
The schema in Dgraph was:

<id>: int @index(int) .
<name>: string @index(trigram) .
<surname>: string @index(trigram) .
<patronymic>: string @index(trigram) .
<dt>: datetime .
<rel_1>: [uid] @reverse .
<rel_2>: [uid] .
<rel_3>: [uid] @reverse .

About the server

I used a server with 16 cores, 64 GB RAM, and SSD disks.

Dgraph upload results

I used the bulk loader as described here: https://blog.dgraph.io/post/bulkloader/

RDF examples:

_:individual.{{.ID}} <id> "{{.Hid}}"^^<xs:int> .
_:individual.{{.ID}} <dt> "{{.Dt}}"^^<xs:dateTime> .
_:individual.{{.ID}} <name> "{{.Name}}"^^<xs:string> .
_:individual.{{.ID}} <surname> "{{.Surname}}"^^<xs:string> .
_:individual.{{.ID}} <patronymic> "{{.Patronymic}}"^^<xs:string> .

and for each type of relation:

_:individual.{{.ID1}} <rel_1> _:individual.{{.ID2}} (prop1="{{.Prop1}}", prop2={{.Prop2}}, prop3={{.Prop3}}) .
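
The {{.Field}} placeholders above are Go text/template syntax. For context, here is a minimal sketch of the kind of generator that renders them (the struct and the sample values are illustrative, not the actual export code):

package main

import (
	"os"
	"text/template"
)

// Individual mirrors one exported row; the field names match the
// {{.Field}} placeholders in the RDF templates above.
type Individual struct {
	ID         int
	Hid        int
	Dt         string
	Name       string
	Surname    string
	Patronymic string
}

// rdfTmpl renders one Individual as N-Quad lines for the bulk loader.
var rdfTmpl = template.Must(template.New("rdf").Parse(
	`_:individual.{{.ID}} <id> "{{.Hid}}"^^<xs:int> .
_:individual.{{.ID}} <dt> "{{.Dt}}"^^<xs:dateTime> .
_:individual.{{.ID}} <name> "{{.Name}}"^^<xs:string> .
_:individual.{{.ID}} <surname> "{{.Surname}}"^^<xs:string> .
_:individual.{{.ID}} <patronymic> "{{.Patronymic}}"^^<xs:string> .
`))

func main() {
	// One illustrative row; a real export loops over the source table
	// and writes to a gzipped .rdf file instead of stdout.
	ind := Individual{ID: 1, Hid: 42, Dt: "2001-01-02T15:04:05Z",
		Name: "Ivan", Surname: "Ivanov", Patronymic: "Ivanovich"}
	if err := rdfTmpl.Execute(os.Stdout, ind); err != nil {
		panic(err)
	}
}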

The bulk loader finished in 2.5 hours and loaded 900M edges. There were a lot of failures with OOM, but after some tuning of the bulk loader parameters the process finished properly.
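
For reference, the bulk loader invocation looks roughly like this (a sketch only: flag names and tuning values vary between Dgraph versions, and the values here are illustrative):

# Offline bulk load: reads gzipped RDF plus the schema, needs a running Zero.
dgraph bulk -r individuals.rdf.gz,rel_1.rdf.gz,rel_2.rdf.gz,rel_3.rdf.gz \
  -s dgraph.schema \
  --map_shards=4 --reduce_shards=1 \
  --zero=localhost:5080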

Neo4j upload results

I used neo4j-admin bulk import from CSV files. There were four different CSV files: one for the nodes and three for the relations, one per relation type.
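
Roughly, the import command and CSV headers look like this (a sketch in Neo4j 3.x syntax; the file names, relationship type names, and headers are illustrative):

# Offline import into a fresh database; one nodes file, three relationship files.
neo4j-admin import \
  --nodes:Individual individuals.csv \
  --relationships:REL_1 rel_1.csv \
  --relationships:REL_2 rel_2.csv \
  --relationships:REL_3 rel_3.csv
# individuals.csv header:  id:ID,name,surname,patronymic,dt
# rel_*.csv header:        :START_ID,:END_ID,prop1,prop2,prop3,prop4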

Neo4j loaded this amount of data in 6 minutes! In the Neo4j case, name, surname, patronymic, and dt were not edges, as in Dgraph, but properties of the node. So the actual counts were: 150M nodes, 150M relations, 1.2B properties.

Query performance results

I used an analytic query to test and compare performance.
Something like:

  • find all nodes by a name and surname filter
  • recursively find all related (bidirectional) nodes, with a filter on edge properties (facets; see the sketch below)
  • return them
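
For a sense of the query shape only, a GraphQL+- sketch (the names, values, and depth are placeholders, not the actual query; regexp is the filter that the trigram index in the schema above supports):

{
  result(func: regexp(name, /^Ivan$/)) @filter(regexp(surname, /^Petrov$/))
      @recurse(depth: 3, loop: false) {
    uid
    name
    surname
    rel_1 @facets(eq(prop1, "some-value"))
    ~rel_1
    rel_2
    rel_3
    ~rel_3
  }
}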

The Dgraph result was 16 seconds (with indexes).
The Neo4j result was 88 seconds (without indexes).
The Neo4j result was <1 second after proper indexes were built.
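
For reference, “proper indexes” in Neo4j means something like this 3.x-era Cypher (the label and properties here assume the Individual node shape described above):

CREATE INDEX ON :Individual(name);
CREATE INDEX ON :Individual(surname);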

Questions

I think that I did something really wrong, or I do not understand how Dgraph works. I cannot believe these are the real results.

Could you help me explain these results or point me in the right direction?

Hi,

Thank you for the post. We will treat it as a high-priority item and respond with our observations and results promptly.

Hi Mike Berezin, can you clarify some points?

I have never used Neo4j for real; I just tested it a long, long time ago. So I’m a “newbie” with that DB.

It is very complicated to make a direct comparison between Dgraph and Neo4j. It seems to me that Neo4j is more CSV-oriented; I was looking into how to use JSON imports, and it seems you need more preprocessing than usual to import JSON.

Does this tool insert the data while the Neo4j cluster is running, or is it a preparation step before a cluster is created?
That behavior would be equivalent to the Bulk Loader we have, and that tool is extremely fast: loading 21 million RDFs takes an average of 4 minutes with it.

Is the dataset you’re using public? I intend to use the Yelp dataset (it seems public to me) that the folks at Neo4j use.

Are you sure it was the bulk loader? What is the size in MB of this dataset?

I skipped this part unintentionally.

Hi, Michel. Thanks for the answer.

Sorry for the long delay.

Does this tool insert the data while the Neo4J Cluster is running or is it a preparation to create a cluster?

Yes, the Neo4j admin tool works like the Bulk Loader: it loads the data before the Neo4j server starts.

The load of 21 million RDFs takes an average of 4 minutes with it.

It looks about the same. I had 900M RDFs and loaded that amount in 2.5 hours:
4 min × 900 million / 21 million ≈ 171 min ≈ 2 hours 51 min

Are you sure it is bulkloader? What is the size in MB of this dataset?

Yes, I’m sure. The dataset is 5.5 GB in .gzip format.

Hey Berezin,

We are working on it. We have a few engineers getting to the bottom of this.
We already have some progress in terms of the Bulk Loader, but we’re going to push even further.

Cheers.
