I’m evaluating Dgraph for a use case involving billions of nodes and edges with high ingestion and querying requirements. I’d like to know if Dgraph is suitable for the following scenario:
Ingestion Requirements:
Continuous online ingestion at a rate of approximately 100,000 writes per second.
Bulk ingestion of hundreds of millions of requests every 2-3 weeks. This bulk load needs to complete in a reasonable time frame without disrupting the online ingestion.
Query and Scalability Needs:
The system needs to scale effectively as the dataset grows to billions of nodes and edges.
What are the limitations in terms of cluster size, replication, and horizontal scaling?
How does query performance hold up at this scale for both complex graph traversals and simple lookups?
Additional Considerations:
Are there any specific limitations or bottlenecks I should be aware of with Dgraph for this kind of workload?
What configurations or optimizations are recommended to meet these requirements?
I’m particularly concerned about scaling the system while maintaining ingestion performance, query efficiency, and reliability. If Dgraph is not a suitable choice for this use case, I’d appreciate recommendations or insights on what alternatives might work better.
I'm wondering why no one from the Dgraph team has answered yet. I'd be very happy to get a response, including from users who have experience with a similar setup.
Hey all-- sorry for the delay. Just had this thread sent over to me. I’ll try to answer as best I can, but unfortunately, as with many database things, it’ll depend heavily on your specific situation.
tl;dr - Yes, Dgraph can scale to that sort of workload. At 100k writes per second and 10BN+ records, you’ll need to be really thoughtful about infra, query optimization, admin, etc. The primary consideration is scaling up alpha nodes, and then load-balancing writes across the cluster.
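To make the load-balancing part concrete: with the official Go client (dgo) you can open one gRPC connection per alpha and let the client pick a connection per transaction, so mutations spread across the cluster. A rough sketch, with placeholder addresses, predicate names, and package versions (adjust to whatever you’re actually running):

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newClient opens one gRPC connection per alpha; dgo picks a connection
// per request, which spreads mutations across the cluster.
func newClient(alphaAddrs []string) (*dgo.Dgraph, error) {
	var clients []api.DgraphClient
	for _, addr := range alphaAddrs {
		conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			return nil, err
		}
		clients = append(clients, api.NewDgraphClient(conn))
	}
	return dgo.NewDgraphClient(clients...), nil
}

func main() {
	dg, err := newClient([]string{"alpha1:9080", "alpha2:9080", "alpha3:9080"})
	if err != nil {
		log.Fatal(err)
	}

	// Commit immediately to keep per-transaction overhead low.
	txn := dg.NewTxn()
	defer txn.Discard(context.Background())
	_, err = txn.Mutate(context.Background(), &api.Mutation{
		SetNquads: []byte(`_:e <event.id> "123" .`), // placeholder predicate
		CommitNow: true,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

In practice you’d batch many N-Quads per CommitNow mutation rather than committing one triple at a time, and run multiple concurrent writers to push throughput up.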
I’ve asked a few of our engineers to run some tests so I can give you more specific sizing guidance. I should have more for you next week.
In my experience, what’s going to work best for your use case will vary a lot based on what you’ll do with the data. For example, I’ve had customers end up offloading time-series data to another store and loading only a rolled-up subset into Dgraph, or using some sort of stream processing to write only the “interesting” bits. A rough sketch of that pattern is below.
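Here the event type and filter are entirely made up, just to show the shape of it:

```go
package ingest

// Event is a stand-in for whatever your pipeline produces.
type Event struct {
	Kind    string
	Payload []byte
}

// interesting decides which events are worth materializing as graph data;
// the full raw stream stays in the time-series store.
func interesting(e Event) bool {
	return e.Kind == "alert" || e.Kind == "relationship_change"
}

// filterToGraph writes every event to the time-series store, but forwards
// only the interesting subset to the graph writer (e.g. a batched dgo loop).
func filterToGraph(in <-chan Event, writeTSDB, writeGraph func(Event)) {
	for e := range in {
		writeTSDB(e)
		if interesting(e) {
			writeGraph(e)
		}
	}
}
```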
10BN+ nodes with 100k writes per second on Dgraph alone is definitely possible, but you’ll need several alphas and will need to work to balance writes across the cluster. The number of alphas you need and their sizing will vary depending on the complexity of the updates (am I doing a complex search to find where to insert the update?) and on whether you’re writing many of the same types of data or hitting lots of different predicates.
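On the “complex search to find where to insert the update” point: that usually means an upsert, where every write also carries a query, and that query cost is a big part of what drives alpha sizing. A rough sketch using dgo’s request API (the predicate names are invented, and eq() assumes an index on device.id):

```go
package ingest

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
)

// upsertDevice finds a node by a unique predicate and updates it, creating it
// if it does not exist. The query half is the "search before insert" cost.
func upsertDevice(ctx context.Context, dg *dgo.Dgraph, deviceID, lastSeen string) error {
	req := &api.Request{
		// Find the existing node, if any, and bind it to the variable d.
		Query: fmt.Sprintf(`query { d as var(func: eq(device.id, %q)) }`, deviceID),
		Mutations: []*api.Mutation{{
			// uid(d) updates the matched node, or creates a new one if d is empty.
			SetNquads: []byte(fmt.Sprintf(
				"uid(d) <device.id> %q .\nuid(d) <device.last_seen> %q .",
				deviceID, lastSeen)),
		}},
		CommitNow: true,
	}
	_, err := dg.NewTxn().Do(ctx, req)
	return err
}
```

Blind writes to distinct predicates are much cheaper than upserts like this, which is why the write mix matters so much for sizing.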
Can you share a bit more about what you’re building?
Thank you so much for the response. I would be more than happy to elaborate on the use case in a meeting if you are available; I can’t share more details on a public thread.