We’re currently planning out a system that uses Dgraph as the source of truth for blockchain-related data.
The raw data we are importing is on the order of 5 terabytes when serialized in the native blockchain data structures, which are fairly compact.
We expect edges to outnumber nodes by about 1000x.
From my understanding, sharding in a cluster is done by predicate - if we prefix each predicate with the source chain (btc_blocknumber and ltc_blocknumber, for example), Dgraph could place them on separate servers.
It is also my understanding that splitting a single predicate across servers is not currently supported - the entire predicate must live on every server it is replicated to.
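To make the prefixing idea concrete, here is a sketch of what such a schema might look like. The predicate names and types are our own invention for illustration, not an existing convention:

```
# Hypothetical chain-prefixed predicates, so that each chain's
# data can end up in a different shard (Alpha group).
btc_blocknumber: int @index(int) .
ltc_blocknumber: int @index(int) .

# Edges are also per-chain, e.g. a block pointing at its parent.
btc_parent: [uid] @reverse .
ltc_parent: [uid] @reverse .
```

Since Dgraph assigns whole predicates to groups, every btc_* predicate would move and replicate as one unit.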
In light of the above, and given that some predicates may have billions of entries, which of the following scaling options would be ideal:
Option 1: A small number of large servers
We deploy a cluster of servers, each with 128-256 GB of memory, ~35 TB of usable storage, and a 6-12 core CPU.
In this scenario, we would be looking at 3-5 servers to begin with.
Option 2: A large number of small servers
We deploy a larger cluster of small servers, each with 32-96 GB of memory, ~8-12 TB of usable storage, and a 4-8 core CPU.
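For our own back-of-envelope comparison we have been using something like the sketch below. The replication factor of 3 and the small-server count of 12 are assumptions on our part, not settled numbers:

```python
def effective_capacity_tb(servers: int, storage_tb_each: float,
                          replication_factor: int = 3) -> float:
    """Aggregate storage usable for unique data once every
    predicate group is replicated `replication_factor` times."""
    return servers * storage_tb_each / replication_factor

# Option 1: 3 large servers at ~35 TB each.
large = effective_capacity_tb(3, 35)    # 35.0 TB of unique data
# Option 2: a hypothetical 12 small servers at ~8 TB each.
small = effective_capacity_tb(12, 8)    # 32.0 TB of unique data

print(large, small)
```

Under those assumptions both options land in the same ballpark for raw capacity, so the real question is which layout Dgraph handles better when individual predicates have billions of entries.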
Alternatively, is Dgraph simply not a good fit for this kind of dataset at the moment?