I’m evaluating Dgraph for a use case involving billions of nodes and edges with high ingestion and querying requirements. I’d like to know if Dgraph is suitable for the following scenario:
Ingestion Requirements:
Continuous online ingestion at a rate of approximately 100,000 writes per second.
Bulk ingestion of hundreds of millions of requests every 2-3 weeks. This bulk load needs to complete in a reasonable time frame without disrupting the online ingestion.
Query and Scalability Needs:
The system needs to scale effectively as the dataset grows to billions of nodes and edges.
What are the limitations in terms of cluster size, replication, and horizontal scaling?
How does query performance hold up at this scale for both complex graph traversals and simple lookups?
Additional Considerations:
Are there any specific limitations or bottlenecks I should be aware of with Dgraph for this kind of workload?
What configurations or optimizations are recommended to meet these requirements?
I’m particularly concerned about scaling the system while maintaining ingestion performance, query efficiency, and reliability. If Dgraph is not a suitable choice for this use case, I’d appreciate recommendations or insights on what alternatives might work better.
I'm wondering why no one from the Dgraph team has answered yet. I'd be very happy to get a response, including from users who have experience with a similar setup.
Hey all-- sorry for the delay. Just had this thread sent over to me. I’ll try to answer as best I can, but unfortunately, as with many database things, it’ll depend heavily on your specific situation.
tl;dr - Yes, Dgraph can scale to that sort of workload. At 100k writes per second and 10BN+ records, you’ll need to be really thoughtful about infra, query optimization, admin, etc. The primary consideration is scaling up alpha nodes, and then load-balancing writes across the cluster.
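To make the load-balancing part concrete: with the official Go client (dgo) you can open one gRPC connection per alpha and let the client pick a connection per transaction, so mutations spread across the cluster. A rough sketch, with placeholder addresses, predicate names, and package versions (adjust to whatever you’re actually running):

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newClient opens one gRPC connection per alpha; dgo picks a connection
// per request, which spreads mutations across the cluster.
func newClient(alphaAddrs []string) (*dgo.Dgraph, error) {
	var clients []api.DgraphClient
	for _, addr := range alphaAddrs {
		conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			return nil, err
		}
		clients = append(clients, api.NewDgraphClient(conn))
	}
	return dgo.NewDgraphClient(clients...), nil
}

func main() {
	dg, err := newClient([]string{"alpha1:9080", "alpha2:9080", "alpha3:9080"})
	if err != nil {
		log.Fatal(err)
	}

	// Commit immediately to keep per-transaction overhead low.
	txn := dg.NewTxn()
	defer txn.Discard(context.Background())
	_, err = txn.Mutate(context.Background(), &api.Mutation{
		SetNquads: []byte(`_:e <event.id> "123" .`), // placeholder predicate
		CommitNow: true,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

In practice you’d batch many N-Quads per CommitNow mutation rather than committing one triple at a time, and run multiple concurrent writers to push throughput up.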
I’ve asked a few of our engineers to run some tests so I can give you more specific sizing guidance. I should have more for you next week.
In my experience, what’s going to work best for your use case will vary a lot based on what you’ll do with the data. For example, I’ve had customers end up offloading time-series data to another store and loading only a rolled-up subset into Dgraph, or using some sort of stream processing to write only the “interesting” bits. A rough sketch of that pattern is below.
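Here the event type and filter are entirely made up, just to show the shape of it:

```go
package ingest

// Event is a stand-in for whatever your pipeline produces.
type Event struct {
	Kind    string
	Payload []byte
}

// interesting decides which events are worth materializing as graph data;
// the full raw stream stays in the time-series store.
func interesting(e Event) bool {
	return e.Kind == "alert" || e.Kind == "relationship_change"
}

// filterToGraph writes every event to the time-series store, but forwards
// only the interesting subset to the graph writer (e.g. a batched dgo loop).
func filterToGraph(in <-chan Event, writeTSDB, writeGraph func(Event)) {
	for e := range in {
		writeTSDB(e)
		if interesting(e) {
			writeGraph(e)
		}
	}
}
```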
10BN+ nodes with 100k writes per second on Dgraph alone is definitely possible, but you’ll need several alphas and will need to work to balance writes across the cluster. The number of alphas you need and their sizing will vary depending on the complexity of the updates (am I doing a complex search to find where to insert the update?) and on whether you’re writing many of the same types of data or hitting lots of different predicates.
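On the “complex search to find where to insert the update” point: that usually means an upsert, where every write also carries a query, and that query cost is a big part of what drives alpha sizing. A rough sketch using dgo’s request API (the predicate names are invented, and eq() assumes an index on device.id):

```go
package ingest

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
)

// upsertDevice finds a node by a unique predicate and updates it, creating it
// if it does not exist. The query half is the "search before insert" cost.
func upsertDevice(ctx context.Context, dg *dgo.Dgraph, deviceID, lastSeen string) error {
	req := &api.Request{
		// Find the existing node, if any, and bind it to the variable d.
		Query: fmt.Sprintf(`query { d as var(func: eq(device.id, %q)) }`, deviceID),
		Mutations: []*api.Mutation{{
			// uid(d) updates the matched node, or creates a new one if d is empty.
			SetNquads: []byte(fmt.Sprintf(
				"uid(d) <device.id> %q .\nuid(d) <device.last_seen> %q .",
				deviceID, lastSeen)),
		}},
		CommitNow: true,
	}
	_, err := dg.NewTxn().Do(ctx, req)
	return err
}
```

Blind writes to distinct predicates are much cheaper than upserts like this, which is why the write mix matters so much for sizing.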
Can you share a bit more about what you’re building?
Thank you so much for the response. I would be more than happy to elaborate on the use case in a meeting if you are available; I can’t share more details on a public thread.