Dgraph Scalability

We’re currently planning out a system that uses Dgraph as the source of truth to store blockchain-related data.

The raw data we are importing is on the order of 5 terabytes when serialized in the native blockchain data structures, which are pretty compact.

We expect edges to outnumber nodes by about 1000x.

From my understanding, sharding in a cluster is done per predicate - if we prefix each predicate with the source chain (btc_blocknumber and ltc_blocknumber, for example), Dgraph could place them on separate servers.
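
As a rough sketch of what I mean (the predicate names here are only illustrative), the schema would carry the chain prefix on every predicate:

```
btc_blocknumber: int    @index(int)   .
ltc_blocknumber: int    @index(int)   .
btc_hash:        string @index(exact) .
ltc_hash:        string @index(exact) .
```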

It is also my understanding that sharding within a single predicate is currently not available - the entire predicate must live on each server it is replicated to.

In light of the above, where some predicates may have billions of entries, which of the following scaling options would be ideal:

A small number of large servers

We deploy a cluster of servers with 128-256 GB of memory, ~35 TB of usable storage, and 6-12 core CPUs.

In this scenario, we would be looking at 3-5 servers to begin with.

A large number of small servers

We deploy a larger cluster of small servers, with 32-96 GB of memory, ~8-12 TB of usable storage, and 4-8 core CPUs.

Alternatively, is Dgraph not a good option for this kind of dataset at the moment?

That’s correct. Dgraph Zero keeps track of shard locations and group sizes across the cluster, and it periodically rebalances the data evenly by size across the Alpha groups (controlled by the --rebalance_interval flag).
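
For reference, that interval is set on the Zero process when you start it; something like this (the 8m value is only an example to illustrate the flag):

```
dgraph zero --rebalance_interval 8m
```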

A single predicate would wholly live in a single Alpha group. For list predicates (e.g., type [uid]), the internal posting list would be as large as the number of edges for a particular <subject, predicate> pair. In Dgraph v1.1.0, large posting lists will be split into parts, and only the parts needed to process a request would be fetched.
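
For example, a chain-prefixed list predicate declared like this (the name is hypothetical) keeps every outgoing edge for a given subject in one posting list:

```
btc_output: [uid] @count @reverse .
```

A subject with millions of such edges therefore produces a single very large posting list, which is what the splitting in v1.1.0 addresses.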

5 TB is a lot of data. I don’t have a clear answer for either configuration; I’d be interested in hearing benchmark numbers from your experience. It sounds like you’re still in the testing phase, so I’d start with a small number of the smaller servers and go from there. Note that Dgraph is highly concurrent, so the more cores you give it, the faster it can process requests.

We’d be happy to have a deeper conversation about your use case and how we can help. Feel free to DM me and we can set up a call.

I can’t seem to DM you (maybe my account is too new?).

Happy to talk over a call if you have some time next week. I reckon you might be able to DM me here, or email me at [username]@gmail.com.


I’ll send you an email. Thanks!
