Differences between DGraph and Cayley


(Manish R Jain) #1

Because of the similarities between the two projects, I get asked this question a lot. So, I decided to make this a wiki page, for easy reference and to allow multiple viewpoints.

Note: This article is based on my response on Reddit to a similar question.

I haven’t looked too deeply into Cayley, but based on my understanding, it’s a hybrid document-graph engine, sort of like a graph layer on top of an existing database, and it supports several such backends. You can put a distributed database underneath it to make Cayley distributed. This means Cayley itself doesn’t need to tackle data distribution, snapshots, machine failures, etc., and can rely on the database for these features. But it also means that Cayley’s query performance is bound by how the underlying database divides the data, and is affected by the fan-out of intermediate steps, in terms of the number of results.

For DGraph, low latency for query execution is the prime goal. In a distributed system, this largely equates to minimizing the number of network calls. For graph processing systems, doing that is really hard. If data is distributed across machines using standard key-based sharding, a single graph query could end up hitting many, if not all, of the machines when the intermediate or final result set gets large.
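
To make the fan-out concrete, here is a minimal sketch. It assumes each entity lives on machine `uid % num_machines`, which is only a stand-in for key-based sharding in general; the function name and the numbers are hypothetical.

```python
def machines_touched(uids, num_machines):
    """Return the set of machines that must be contacted for these uids.

    Assumption: an entity with a given uid is stored on machine
    `uid % num_machines` (a simple stand-in for key-based sharding).
    """
    return {uid % num_machines for uid in uids}


# A small intermediate result touches only a few machines...
small = machines_touched(range(1, 4), num_machines=16)
# ...but a large one ends up touching every machine in the cluster.
large = machines_touched(range(1, 1001), num_machines=16)

print(len(small), len(large))
```

Under this assumption the number of machines contacted grows with the size of the result set until it saturates at the whole cluster.
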
DGraph tackles this problem by dividing up the triple data (subject S, predicate P, object O) so that all the (S, O) pairs for a given P are colocated on the same machine (possibly sharding further if P is too big). Furthermore, it stores the data in sorted lists (O1 … Oi), to allow for really cheap list intersections (think of queries like [movies starring X and Y]).
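
The idea can be sketched roughly as follows. This is not DGraph's actual code: the function names, the `acted_in` predicate, and the ids are hypothetical, and a plain dictionary keyed by predicate stands in for "everything for P lives on one machine".

```python
from collections import defaultdict


def build_posting_lists(triples):
    """Group triples by predicate, then subject, keeping each object list sorted."""
    by_predicate = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        by_predicate[predicate][subject].append(obj)
    for subjects in by_predicate.values():
        for objects in subjects.values():
            objects.sort()
    return by_predicate


def intersect_sorted(a, b):
    """Intersect two sorted lists with a single two-pointer pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical data: actors 1 and 2, movies 100 to 103.
triples = [
    (1, "acted_in", 103), (1, "acted_in", 100), (1, "acted_in", 101),
    (2, "acted_in", 102), (2, "acted_in", 101), (2, "acted_in", 103),
]

# All (S, O) pairs for one predicate sit under one key, i.e. on one machine.
acted_in = build_posting_lists(triples)["acted_in"]
common = intersect_sorted(acted_in[1], acted_in[2])
print(common)  # [101, 103]
```

Because both lists are already sorted, the intersection takes one linear pass and needs no extra lookups on other machines.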

This keeps the total number of network calls required to process a query linear in the complexity of the query, not in the number of results. In addition, all the entities (S, O) are converted to uint64 numbers, because they are a lot more efficient to work on (CPU-wise) and to pass around (network-wise).
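
As a rough illustration of the uint64 point, the sketch below maps entity names to integer ids and packs them as fixed-width 8-byte values. The sequential id assignment, the `UidAssigner` class, and the entity names are assumptions made for the example, not a description of how DGraph assigns ids.

```python
import struct


class UidAssigner:
    """Map entity names to integer ids (hypothetical: ids are handed out sequentially)."""

    def __init__(self):
        self._uids = {}
        self._next = 1

    def uid(self, name):
        # Reuse the id if the name was seen before, otherwise assign a new one.
        if name not in self._uids:
            self._uids[name] = self._next
            self._next += 1
        return self._uids[name]


def encode_uids(uids):
    """Pack ids as little-endian unsigned 64-bit integers, 8 bytes each."""
    return struct.pack(f"<{len(uids)}Q", *uids)


assigner = UidAssigner()
names = ["m.some_long_movie_identifier", "m.another_long_actor_identifier"]
uids = [assigner.uid(n) for n in names]

as_strings = sum(len(n.encode("utf-8")) for n in names)
as_uint64 = len(encode_uids(uids))
print(as_strings, as_uint64)
```

Each id costs exactly 8 bytes on the wire regardless of how long the original name was, and comparing two ids is a single integer comparison.
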
DGraph is aimed at squeezing out great performance, so that one could use it in production, directly for user-facing queries. It’s built with a very different design and ideology than Cayley.
Btw, do have a look at the product roadmap to get a better understanding of where DGraph is headed: https://github.com/dgraph-io/dgraph/issues/1

Update: Note that a fair comparison won’t be possible without deeply understanding the internal workings of Cayley. So, take the above differences with a grain of salt. I have a lot of respect for Barak, its prime author and my ex-colleague at Google. In fact, I’m really happy that there are multiple open source graph database projects working to solve the graph serving problem.


(Manish R Jain) #2

I’ve added an issue on Github, if someone wants to help us do real benchmarks against Cayley:


(Manish R Jain) #3

Update: Cayley benchmarks are here

Based on the analysis so far, Dgraph is approximately 10x faster than Cayley at data loading and ~37x faster at querying. This was measured with all the data on a single server. Given the design of these two products, the numbers would be even more in favor of Dgraph in a distributed environment.


(Barak Michener) #4

I’ll give you the approximate load time - we’re working on that at present - but the query is hardly comparable. The given benchmark does a lot of JS interpreting, which is super slow.


(Manish R Jain) #5

Happy to get a PR to make the queries faster. As I mentioned in the GitHub issue, the idea is to show both Dgraph and Cayley at their best. We want these benchmarks to be unbiased.

