Dgraph Scalability

We’re currently planning out a system that uses Dgraph as the source of truth to store blockchain-related data.

The raw data we are importing is on the order of 5 terabytes when serialized in the native blockchain data structures, which are pretty compact.

We expect edges to outnumber nodes by about 1000x.

From my understanding, sharding in a cluster is done per predicate - if we prefix each predicate with the source chain (btc_blocknumber and ltc_blocknumber, for example), Dgraph could place them on separate servers.
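
As a rough sketch of what I mean (the predicate names here are only illustrative), the schema would carry the chain prefix on every predicate:

```
btc_blocknumber: int    @index(int)   .
ltc_blocknumber: int    @index(int)   .
btc_hash:        string @index(exact) .
ltc_hash:        string @index(exact) .
```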

It is also my understanding that sharding within a single predicate is currently not available - the entire predicate must live on each server it is replicated to.

In light of the above, where some predicates may have billions of entries, which of the following scaling options would be ideal:

A small number of large servers

We deploy a cluster of servers with 128-256 GB of memory, ~35 TB of usable storage, and 6-12 core CPUs.

In this scenario, we would be looking at 3-5 servers to begin with.

A large number of small servers

We deploy a larger cluster of small servers, with 32-96 GB of memory, ~8-12 TB of usable storage, and 4-8 core CPUs.

Alternatively, is Dgraph not a good option for this kind of dataset at the moment?

That’s correct. Dgraph Zero keeps track of shard locations and group sizes across the cluster, and it periodically rebalances the data evenly by size across the Alpha groups (controlled by the --rebalance_interval flag).
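
For reference, that interval is set on the Zero process when you start it; something like this (the 8m value is only an example to illustrate the flag):

```
dgraph zero --rebalance_interval 8m
```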

A single predicate would wholly live in a single Alpha group. For list predicates (e.g., type [uid]), the internal posting list would be as large as the number of edges for a particular <subject, predicate> pair. In Dgraph v1.1.0, large posting lists will be split into parts, and only the parts needed to process a request would be fetched.
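
For example, a chain-prefixed list predicate declared like this (the name is hypothetical) keeps every outgoing edge for a given subject in one posting list:

```
btc_output: [uid] @count @reverse .
```

A subject with millions of such edges therefore produces a single very large posting list, which is what the splitting in v1.1.0 addresses.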

5 TB is a lot of data. I don’t have a clear answer for either configuration; I’d be interested in hearing benchmark numbers from your experience. It sounds like you’re still in the testing phase, so I’d start with a small number of the smaller servers and go from there. Note that Dgraph is highly concurrent, so the more cores you give it, the faster it can process requests.

We’d be happy to have a deeper conversation about your use case and how we can help. Feel free to DM me and we can set up a call.

I can’t seem to DM you (maybe my account is too new?).

Happy to talk over a call if you have some time next week. I reckon you might be able to DM me here, or email me at [username]@gmail.com.


I’ll send you an email. Thanks!
