"really" large datasets in dgraph


(Lars) #1

Hi everyone,

I’m looking for experience reports on running dgraph nodes in a typical cloud environment, to figure out whether dgraph is the right tool for me. Sure, dgraph calls itself distributed, open source and production ready. However, I’d like to run it on an AWS cluster, preferably with k8s, with a dataset of at least a few billion nodes, potentially distributed across the cluster.
Has anyone already set up something similar (maybe for testing purposes) on EC2 or EKS, or tackled problems like backups or scaling up and down? If it is true that dgraph scales better than other open source solutions, I would expect some papers, tutorials or experience reports to give me some hints. I’m already lost choosing the right EC2 instance type and size, an appropriate number of instances, and the best way to set up a large, resilient cluster. The question is: is it worth the work at all?

What is the largest cluster you’ve run in a distributed environment, and what pitfalls came up (possibly related to running it in a cloud environment)? Are there any docs available somewhere?

Any thoughts might be helpful. Thanks in advance.


#2

Hi Lars.

What do you mean by “really” large datasets?
We’re running two Dgraph clusters, one for prod and one for development, both on EC2 servers: one with 5 nodes and one with 3 nodes, all i3en.6xlarge. They run smoothly (~2-3 billion edges), but we have not done any up- or downscaling until now. You must consider the density of your data, though. If you have one big predicate with tons of data, it is stored on only one server (splitting a single predicate across servers is not supported yet). In that case you have to run a bigger EC2 instance.
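
To make the density point concrete, here is a rough sizing sketch in Python. It is not Dgraph's actual rebalancing logic. It is just a greedy largest-first placement of whole predicates onto servers, and the predicate names and sizes are made up for illustration. It shows that when a predicate cannot be split, the busiest server can never hold less than the largest predicate, no matter how many servers you add.

```python
import heapq

def place_predicates(sizes_gb, num_servers):
    """Greedy largest-first placement of whole (unsplittable) predicates.

    Illustrative only: this is NOT how Dgraph decides placement.
    Returns (placement, loads) keyed by server index.
    """
    # Min-heap of (current load in GB, server index)
    heap = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(heap)
    placement = {s: [] for s in range(num_servers)}

    # Put the biggest predicates first, each on the least loaded server
    for name, size in sorted(sizes_gb.items(), key=lambda kv: kv[1], reverse=True):
        load, server = heapq.heappop(heap)
        placement[server].append(name)
        heapq.heappush(heap, (load + size, server))

    loads = {s: sum(sizes_gb[p] for p in preds) for s, preds in placement.items()}
    return placement, loads

# Hypothetical on-disk sizes per predicate, in GB
sizes_gb = {"follows": 900.0, "name": 120.0, "email": 80.0, "created_at": 60.0}
placement, loads = place_predicates(sizes_gb, 3)
print(placement)
print(loads)
```

With numbers like these, the server that gets the hypothetical "follows" predicate needs roughly 900 GB regardless of cluster size, so the instance type has to be chosen for the largest predicate, not for the average.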