Millions (eventually maybe billions?) of documents. Each document is a mix of attributes (simple values or arrays) + full text + an optional location, and so forth.
From these documents we are extracting entities & relationships and performing various enrichments. Generally we expect the number of entities to grow quickly at first, then level off.
Relationships can be between entities, between documents, and across the two. Queries include full-text search, geographic queries, graph traversals, simple attribute comparisons, etc.
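Just to make the shape of the data concrete, here is a minimal sketch of how this kind of model might map onto a Dgraph schema via the pydgraph client. The predicate and type names (`doc.title`, `mentions`, `entity.name`, etc.) are made up for illustration, and it assumes a Dgraph Alpha listening on `localhost:9080`:

```python
import pydgraph

# Hypothetical single-store model: documents, entities, and the edges between
# them all live in Dgraph, with fulltext and geo indexes on the document side.
client_stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(client_stub)

schema = """
doc.title:    string   @index(fulltext) .
doc.tags:     [string] @index(exact) .
doc.location: geo      @index(geo) .
entity.name:  string   @index(term) .
mentions:     [uid]    @reverse .

type Document {
  doc.title
  doc.tags
  doc.location
  mentions
}

type Entity {
  entity.name
}
"""
client.alter(pydgraph.Operation(schema=schema))
client_stub.close()
```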
One way to do this, of course, is to store the documents in something like Cassandra and store the graph in… a graph database. In that case we basically store a pointer to the document in the graph database.
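A rough sketch of that pointer approach, using Dgraph as the graph side purely for illustration: the full document body stays in Cassandra under its row key, and the graph node keeps only that key plus the extracted entities. The predicate names and key value are hypothetical:

```python
import pydgraph

# Hypothetical two-store layout: Cassandra holds the document body; the graph
# node carries only the Cassandra row key and edges to extracted entities.
client_stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(client_stub)

txn = client.txn()
try:
    txn.mutate(
        set_obj={
            "dgraph.type": "Document",
            "doc.cassandra_key": "some-cassandra-row-key",  # pointer back to the Cassandra row (placeholder)
            "mentions": [
                {"uid": "_:acme", "dgraph.type": "Entity", "entity.name": "Acme Corp"},
            ],
        },
        commit_now=True,
    )
finally:
    txn.discard()
client_stub.close()
```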
Another is to store everything in one database.
My question is how well DGraph supports this latter use case and, if it does, what gotchas/design suggestions you'd recommend to minimize refactoring downstream.
My frame of reference on this sort of problem is ArangoDB & Neo4J.
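For reference, this is the kind of mixed query the one-database approach would need to serve, sketched in DQL through pydgraph. It reuses the made-up predicates from the schema sketch above; the search terms, coordinates, and radius are arbitrary:

```python
import pydgraph

# Sketch of a single query hitting fulltext, geo filtering, and graph
# traversal in one round trip.
client_stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(client_stub)

query = """
{
  docs(func: anyoftext(doc.title, "merger acquisition"), first: 10)
       @filter(near(doc.location, [-122.42, 37.77], 10000)) {
    doc.title
    mentions {
      entity.name
      ~mentions { doc.title }   # reverse edge: other documents mentioning the same entity
    }
  }
}
"""
resp = client.txn(read_only=True).query(query)
print(resp.json)
client_stub.close()
```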
@briantrusso I have a similar use case although on a different scale.
Have you looked at SciDB? It’s used at CERN to store a few hundred petabytes while enabling in-DB cluster processing & machine learning.
I did a long evaluation across different systems and here are my insights:
ArangoDB fell flat because of missing GraphQL support; otherwise, it would be a contender. We did a test with Neo4J, but concurrency performance was very problematic, so we dropped it after countless issues. TigerGraph is my personal preference and certainly has the best overall package, but it also lacks GraphQL support. DGraph fell flat as well because it caused way too many problems during a test deployment, so we were never really able to do a proper test. We are not using it.
Were you able to get DGraph to a usable state?
Currently, FaunaDB and SciDB are the closest contenders for building my system.
@marvin-hansen could you share what the root of the problem was with your original installation/trial of DGraph? It might be a problem for others as well.