Can Dgraph do 10 Billion Nodes?


(Michael Burbidge) #1

We have a very large graph that we are looking to build in a graph database. The graph will grow incrementally to around 10 billion nodes over a period of a few months.

Is anyone running a graph of that size in Dgraph? Is it capable of that?

Thanks,
Michael-


(Michael Burbidge) #2

Do folks from Dgraph monitor and respond on this Forum?

I’m curious why there has been no response to this and the other question I’ve asked.

Is this an inappropriate question? Is more information needed to respond to this question?

Michael-


(Michel Conrado) #3

Yes, but I think the question is about third-party experience, so it is good for someone in the community to see it and join the conversation. The question seemed to be asking for opinions more than for anything I could help with.

However, the number of nodes does not matter much. What matters is the load. The more workload your cluster has, the more resources (CPU, RAM, IOPS) it will need. This is kind of obvious. A low-resource instance will not work miracles under an astronomical load with billions of nodes.

A well-planned schema is ideal for such cases. If you push too hard without a strategy, it can be bad in the long run. So ideally, study and understand how Dgraph works so that you can plan better according to your needs.

See, BadgerDB today is being used for huge volumes of data, some at the petabyte scale. Dgraph uses BadgerDB under the hood. So it has the capacity, but it will all depend on planning, because Dgraph is a bit more complex than BadgerDB “physically” speaking, due to its graph nature.

Cheers.


(Igor Miletic) #4

Michael,

I can share the experience that we have.

We are trying to get it running for our production needs (some kind of user profiling). At the moment we are facing serious issues with how Dgraph behaves with memory usage. I’ve tried to summarize it here

In general our feeling is that Dgraph is consuming a lot of RAM and we do not know why or for what.

E.g. for only about 3,000,000 nodes it consumes 15GB of RAM on each Alpha node, and we have 3 nodes. So 45GB of RAM is not able to consistently handle 3,000,000 nodes.

It is important to say that our use case is not a one-off bulk load of 21M movie nodes; we constantly query and insert data.

I’m afraid that when we reach 21M nodes we will need 200GB of RAM.
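For reference, the figures above can be turned into a rough projection. This is only a sketch that assumes memory grows linearly with node count, which is an assumption for illustration and not something measured in this thread; the function name `linear_ram_estimate` is hypothetical.

```python
def linear_ram_estimate(observed_nodes, observed_gb_per_alpha, alphas, target_nodes):
    """Naive projection: assume RAM per Alpha scales linearly with node count."""
    scale = target_nodes / observed_nodes
    per_alpha = observed_gb_per_alpha * scale
    # Total across the cluster, assuming every Alpha grows the same way.
    return per_alpha, per_alpha * alphas


# Figures reported in this post: 3,000,000 nodes, 15GB per Alpha, 3 Alphas.
per_alpha, total = linear_ram_estimate(3_000_000, 15, 3, 21_000_000)
print(f"{per_alpha:.0f} GB per Alpha, {total:.0f} GB in total")
```

Under that linear assumption the projection is about 105 GB per Alpha for 21M nodes. Whether memory really scales this way depends on the workload and on garbage-collection behavior, so treat the number as a worst-case guess rather than a measurement.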

This is just our observation and experience. I’ve asked people from Dgraph to take a look at the example and try profiling on their side in case we are doing something wrong, which I don’t think we are.

Please keep me posted if you are doing some tests and have some results and experiences.

Cheers,

Igor.


(Michael Burbidge) #5

Thanks @pjolep. That’s the kind of experience we were hoping to learn from. Please keep us posted.

I would suggest that this is also a great opportunity for Dgraph to jump in and help @pjolep figure out what is going on and keep the rest of the community posted. It would be a great blog post. Dgraph claims to be a horizontally scalable, distributed graph database. But frankly 21M nodes is nothing.

Whether this ends up being a usage problem or a problem in Dgraph, resolving this and documenting it for the community could help build confidence in Dgraph.

@MichelDiz I understand what you’re saying. Sure, load and usage patterns are going to dictate the capacity required from Dgraph. But I think it is safe to say that there might be either practical or fundamental limits to the size of graph that can be supported, particularly with new technology.

But you’re certainly right in that I was also looking for practical experiences that others are having with Dgraph in that regard. @pjolep’s post is very helpful.


(Michel Conrado) #6

Hey Igor, Manish in his response to you said what this was about.

Manish Jain 6 days ago
Go GC isn’t the best. In latest versions we are running manual GC.

Manish Jain 5 days ago
I doubt the dataset is using that much. Go can be slow in giving the space back. But the only way to know is to run memory profiles.

I realize this happens, but you can mitigate this by creating a balanced cluster with balanced loads.

This is not accurate. I have already run the live loader (which is the same as using a client) several times with larger datasets, with each instance having 21GB of RAM (3 Zeros, 6 Alphas - 129 GB of RAM totaling all instances). It takes a while to load, certainly, but it does not consume 200GB of RAM. I always do balanced loads; I don’t send everything to one instance only. As Manish said, the dataset inside Dgraph does not consume this RAM. Any RAM usage above 10GB is the result of problems with GC. However, this has to do with writing only. Queries are safe and you can use best-effort queries.

I talked to some devs and we are going to prioritize an analysis of that.

If your 3 million node inserts are consuming 15GB of RAM, it must be because your entities carry heavy data. They are probably not small and have lots of indexes. Making 3 million inserts per second of large entities is not done in simple clusters. I think even MySQL would have a problem with that (MySQL writing is at most 10K per second).

To do that with MySQL you might need a graph DB that uses MySQL, but with 300 MySQL instances. It would be possible, but whether it would be practical I don’t know. It is best to load this while respecting the DB write limits of your current cluster configuration. Or use the bulk loader, with which you can load over a billion nodes in less than 2 hours (with a good instance, e.g. https://blog.dgraph.io/post/bulkloader/).
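The idea of a balanced load, where write batches are spread across all Alphas instead of going to a single instance, can be sketched as a simple round-robin plan. This is a minimal illustration only: the endpoint addresses and the helper names `chunk` and `assign_batches` are hypothetical, and the sketch deliberately makes no Dgraph client calls.

```python
from collections import Counter
from itertools import cycle

# Hypothetical Alpha addresses; replace with the endpoints of your own cluster.
ALPHA_ENDPOINTS = ["alpha1:9080", "alpha2:9080", "alpha3:9080"]


def chunk(items, size):
    """Split a list of records into write batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def assign_batches(batches, endpoints):
    """Pair each batch with an endpoint in round-robin order."""
    return list(zip(cycle(endpoints), batches))


records = [f"node-{i}" for i in range(10)]
plan = assign_batches(chunk(records, 2), ALPHA_ENDPOINTS)

# Count how many batches each Alpha would receive.
load = Counter(endpoint for endpoint, _ in plan)
for endpoint, count in load.items():
    print(endpoint, count)
```

In a real loader, each `(endpoint, batch)` pair would be sent through a client connected to that Alpha; the point of the sketch is only that no single instance receives all of the writes.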

Another example

Take the example of this benchmark here https://github.com/linuxerwang/dgraph-bench
It’s a bit old (one year). He works with 10,000,000 person nodes. The total number of edges exceeds 500,000,000. It uses only one Alpha and one Zero, with 64G of memory and a 500GB SATA SSD.

That is, 64G of RAM can easily handle 10 million nodes and 500 million edges. However, he used the bulk loader. This means that the GC effect does not happen, because it is avoided when using this tool.

In his upgrade from v1.0.9 to v1.0.10 the throughput increased by about 50% (One-Hop Friends).

That is, the simplest configuration of Dgraph can handle millions of nodes, and a well-balanced cluster can go much further. It is a matter of planning. In the meantime, we will analyze this.

Cheers.


Extreme memory usage when constantly query and mutate data