My team and I are looking to run some tests on our sample data using the Dgraph database, but we are unclear how much compute, memory, and disk space we would need for a 100 GB dataset. We are currently considering >16 CPU cores, >32 GB of RAM, and 1 TB of SSD. We're not sure we would actually need this much memory and disk space to handle a dataset of this size; our calculations suggest that 1 TB is a good amount of disk space, but it seems rather large. Please advise.
Sounds like a good start. I might give it 64 GB RAM to be on the safe side.
For 100 GB of data, would we really need 1 TB of disk space, and why?
An outside data point here, but I have ~100 GB per group, and I have to use 2 TB disks on each server in Google Cloud to get the disk-throughput quota required to ingest into Dgraph quickly.
If your data changes often, the MVCC architecture will also use a lot more disk than what is literally in the graph 'right now', though it will later compact back down. For example, here is one of my Alphas' disk usage over time, as reported by the Prometheus metrics within Dgraph:
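If you want to watch the same thing yourself: each Alpha exposes Prometheus metrics over HTTP, and you can scrape that text output and sum up the on-disk Badger sizes. A minimal sketch follows; the specific metric names (`badger_lsm_size_bytes`, `badger_vlog_size_bytes`) have changed across Dgraph versions, so treat them as assumptions and check what your build actually exports:

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into {metric_name: value}.

    Skips HELP/TYPE comment lines and keeps the last sample seen per
    metric name (labels are stripped for this rough sketch).
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        # Strip any label set, e.g. metric{dir="p"} -> metric
        name = name_part.split("{", 1)[0]
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


def disk_usage_gb(samples):
    """Sum the (assumed) Badger LSM-tree and value-log sizes, in GB."""
    total = samples.get("badger_lsm_size_bytes", 0.0) \
          + samples.get("badger_vlog_size_bytes", 0.0)
    return total / 1e9


# Fabricated scrape output, just to show the shape of the data:
sample_text = """
# HELP badger_lsm_size_bytes Size of the LSM tree on disk
badger_lsm_size_bytes{dir="p"} 42e9
badger_vlog_size_bytes{dir="p"} 108e9
"""
print(disk_usage_gb(parse_prometheus_text(sample_text)))  # prints 150.0
```

In practice you'd feed this from the Alpha's metrics endpoint (or just point Prometheus + Grafana at it) and graph the value over time to see the MVCC growth-then-compaction sawtooth.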
You probably won’t need a TB of disk, but it’s hard to estimate without knowing how you’re calculating disk usage currently. Is it compressed raw data? Is it stored in some other DB that already includes all the indices? How many indices would you have in Dgraph, and so on.
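To put rough numbers on that reasoning, here's a back-of-envelope estimator. The multipliers below (index overhead, MVCC churn headroom, free-space margin) are assumptions you'd tune from your own workload, not Dgraph-published constants:

```python
def estimate_disk_gb(raw_data_gb,
                     index_factor=1.5,       # extra space for Dgraph indices (assumption)
                     mvcc_headroom=2.0,      # transient MVCC versions before compaction (assumption)
                     free_space_margin=1.3): # keep ~30% free for compactions and safety
    """Rough per-Alpha disk-size estimate for raw_data_gb of data."""
    return raw_data_gb * index_factor * mvcc_headroom * free_space_margin


# 100 GB raw -> roughly 390 GB provisioned under these assumptions
print(round(estimate_disk_gb(100)))
```

Under those (deliberately generous) factors, 100 GB of raw data lands in the high hundreds of GB, which lines up with the point above: likely less than 1 TB, but well above the raw size.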
The defaults in Dgraph Cloud are 16 cores, 64 GB RAM with 320 GB disk. You could go with that and iterate from there.