If Google used Dgraph, should it use one single Dgraph DB for everything, or a separate Dgraph DB for each service (Maps, YouTube, Gmail...)?

Hi!
Google has: YouTube, Google Translate, Google Maps, Google+ (it's offline, but please ignore that for this example), Google Drive, Google Calendar, Google Search, Google Mail, Google Play, Google Photos, and many more.

All these services are linked together through your Google account.

If Google were to use Dgraph, should it use one Dgraph DB for all products, or a separate Dgraph DB for each product?

At such a massive scale, is it better to use a separate Dgraph DB per product, or not? Or is Dgraph so awesomely smart about horizontal scaling, with its sharding algorithm, that it automatically pushes all nodes belonging to one Google product onto their own Alphas? I mean, does it notice that some nodes are always used together and therefore group them together? We as humans know which schema types belong to which product, but can Dgraph figure that out by itself and arrange the Alphas in a smart, performance-friendly way?

If Google were to put everything into one single Dgraph DB: should it use the same schema type for YouTube comments and Google+ comments? And wouldn't it be a pain in the ass to manage all these 99999 Google services within one Dgraph DB (also in terms of maintenance)?

When building ecosystems like Google's or Amazon's, what rule of thumb should I use, and where should I draw the line, to decide what goes into one single DB with everything else and what gets its own dedicated DB? (I know, I know, I'm a small kid and there's no way I'll reach scales where I need dedicated DBs, but I just wanna know.)

Thank you very much!

Who knows? That's a hard thing to say. Both strategies can work. I'd guess a Dgraph Cluster for each service is better, as the auth system can link them up.

That's very much a matter of opinion. For me, a Dgraph Cluster per service. But at Dgraph we have Dgraph Cloud (https://dgraph.io/products/), which hosts many multi-tenant instances. And it works just fine; people can't even tell it is shared.

Dgraph will balance the predicates based on disk usage.

I personally think it is a huge/massive/tremendous mistake to put all eggs in the same basket, especially at such a huge scale as Google's or even Facebook's. It doesn't matter how good the tool is; I have to have redundancy and avoid a single point of failure.

That's a complex topic that only the people who work there can really answer. We never know all the approaches companies use internally. Some of them, like Netflix, do expose their ideas in public talks and blog posts. All of these approaches have their pros and cons, and no one does exactly the same thing.

1 Like

Thanks a lot, senpai!!! Some last questions ._.

What exactly do you mean by that? Do you mean Dgraph's own auth system? If I use multiple Clusters (= databases??), the data within the different Clusters/databases can't interact with each other. Or can it??

I saw that you also offer multi-tenant instancing as an enterprise solution, that's magnificent!! If I as an enterprise use that feature, then thanks to multi-tenant instancing and Dgraph's horizontal scaling, Dgraph will automatically spread the data over different disks, right? So that means the eggs won't all be in the same basket, correct? But then, once again, joins/edges won't work anymore between the clusters (to link user data/profiles)… right? Because, e.g., it was possible AFAIK to link Google+ stuff and YouTube together, and Gmail is also linked with Google Calendar (if you get an email about your flight, Google inserts it into your calendar); everything is linked in Google's ecosystem.

I have been working for years on a big project; I am building a big ecosystem too: a social network, a local eCommerce platform, and some more stuff. I think it's also better to use a separate Cluster for each service, and then store a copy of the user profile in every database/cluster. It isn't that much of a problem if the user changes their email or postal address; my backend would just perform 3 updates, one against each database, instead of only 1 (see the sketch below). I think this denormalization won't hurt, or?? Or is there another/better way to manage that?
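Something like this is what I have in mind, just as a sketch (the predicate names User.email and User.postalAddress are made up for illustration); my backend would send the same upsert to each of the 3 clusters:

# assumes User.email has an index so eq() can find the user
upsert {
  query {
    u as var(func: eq(User.email, "old@example.com"))
  }
  mutation {
    set {
      uid(u) <User.email> "new@example.com" .
      uid(u) <User.postalAddress> "New Street 1" .
    }
  }
}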

How should I start? E.g., I need 3 databases/clusters for 3 services now, but multi-tenant instancing is only available for enterprise. Should I just start with 3 Shared/Dedicated plans, and then, if things go well, upgrade to the Enterprise level and merge everything together into one multi-tenant instancing setup? Or can I also just keep 3 independent 'projects' running at the Enterprise level? (BTW, upgrading from Shared to Dedicated is no problem either, right?)

Thanks a lot!! Sorry for so many questions .-. But don't worry, I'm lurking on this forum and on the subreddit too; I'll keep forwarding the knowledge you pass on to me.

Interesting discussion!

"All eggs in one basket" = all data on Dgraph. Is that what you meant, @MichelDiz?

@Juri Graph DBs are the future, and horizontal scaling with HA still allows you to link data/edges across clusters, just not across tenants. Multi-tenant data is logically separated and cannot be joined across in a single query. And I believe HA allows multiple failures without any data loss and, if properly managed, could be set up to autoscale on failure.

But it is still all one codebase: if there is a fault in the code, it could be on every instance. I think that is what Michel meant by all eggs in one basket.

But overall I foresee, and even predict, that different big data projects will unite into a single worldwide data graph with SSO and granular ACLs controlled with RBAC and ABAC.

2 Likes

I'm going with your assumption about Google's tech environment. They have their own auth system. If they were to adopt Dgraph, I'd prefer to keep that auth system and have separate Dgraph Cluster environments. Again, my opinion.

Yes, and not necessarily only when needed; the predicates will be split up per tenant. So, if you have several hosts with single Alphas and healthy disks, Dgraph will balance all tenants.

When I say "eggs in the same basket", I mean that I personally don't like the concept of having a single Dgraph Cluster for everything, and that includes tenants. BUT, if you are a great ninja at deploying Dgraph and master it completely, you can certainly handle a tenancy setup. Dgraph engineers know how to handle tenants and every issue related to that. If you are just starting out or on a low budget, I recommend a Cluster per service.

BUT if you don't have Google's scale, just using a single Cluster for 3 services is fine.

Of course, Google has some of the best engineers. If they wanted to run a single Dgraph Cluster for 99999 services, I believe they would be confident enough to do so.

Also, Dgraph is built to be resistant to failure: other instances can take the place of those that have failed. To get that safety, a large and well-managed cluster does the job. But at Google's scale, I don't know what it would look like.

Yep. And it doesn't work with tenants either.

BUT you could potentially use GraphQL and an Apollo Federation architecture, so you can "merge" several services into a single GraphQL server. That won't work for DQL, of course, but you can use Lambdas or custom DQL behind the GraphQL federation and all is good. It's just more work.

I don't believe this is handled at the database level. This is application level.

I'm pretty sure Google uses a mix of different solutions and even different DBs, and they are merged via APIs. I never worked at Google, but when a company is that big, it gets hard to make the whole company always follow the same principles. Things start to become "silos" fast.

Why not a single service that takes care of profiles? Just add a reference to it from the other services. The raw profile data could live in that profile service. I think some companies, like Stack Exchange, do something like this (rough sketch below).
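A very rough sketch of the idea (the predicate names and UIDs are made up for illustration, not a schema recommendation): the profile cluster owns the raw profile data, and the other services only keep an opaque reference string, since real edges can't cross clusters.

# profile cluster: owns the raw profile data
0xa1 <Profile.email> "user@example.com" .
0xa1 <Profile.name> "Juri" .

# eCommerce cluster: only an external reference to the profile, not a real edge
0xb1 <Order.total> "42.0" .
0xb1 <Order.profileRef> "profiles/0xa1" .

Your backend (or a GraphQL layer) then resolves Order.profileRef against the profile service whenever it needs the actual profile data.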

If you don't want to worry about infrastructure and cluster administration, go to Dgraph Cloud with tenants. If your team is small and your budget is short, consider going to the Cloud, because you will have a team taking care of the Cluster for you. To compare costs, think about the salary you would pay someone to run the Cluster 24 hours a day, or about your own time if you are the one who has to do everything, from project administration to coding to database administration. With a small budget, you need to buy time and get the project out there, live.

More or less. I say this in case you don't yet know how to properly manage a Cluster. An error can be fatal (downtime), so having multiple clusters is better: you isolate the problem but keep the rest up and secure.

1 Like

Thanks a lot!!!

The problem is, as our buddies said:

If I put everything into one database, that will result in terrible performance at scale, or not? Because Dgraph doesn't know that I, e.g., want the social media stuff on one Alpha node and the eCommerce stuff on the other Alpha node. Dgraph will just balance them based on disk usage, which means everything will be mixed up. Then my joins have to cross over to multiple Alpha nodes (= machines), and that will create terrible latency. Or not?

It would be really awesome to have different big projects in one single worldwide Dgraph database; that would allow awesome query capabilities. But Dgraph won't shard things logically (recognize which data is frequently used together, so it can shard between services like social media and eCommerce); instead it shards based on disk usage, which will result in a mix that leads to terrible performance, latency, and so on. Or not?

Well, if you provide enough resources, Dgraph will do its best to balance the predicates. Dgraph is distributed horizontally for exactly that reason: performance.

Again, we have shared instances on Dgraph Cloud and it just works.

It is not like that. You can't choose where the data goes, not for now. And the data is distributed among Groups (a Group is a set of Alphas/nodes).

Well, yes. But why is that a problem, if you are not separating things into services?

Nope. See this video to understand how it works:

1 Like

Yes, yes, but I'm not talking about multi-tenant instancing. I am talking about an HA multi-node cluster setup (where you have ONE single Dgraph DB, but multiple Alpha nodes (for sharding?? is that wrong?) and high availability). My question is whether such an HA setup won't have bad latency, since queries travel across multiple machines and that causes latency. Because I thought, like you said, that Dgraph balances/shards predicates based on disk usage, which leads to my next question:

I have now watched the video:
Every group is its own machine, right? So we have 3 machines?
Group 1 = name
Group 2 = director.film + initial release date
Group 3 = netflix data

I thought Dgraph shards/balances predicates based on disk usage. Why are they now sharded/balanced into name, director.film + release date, and netflix data? I think that's exactly where my knowledge gap is; can you please explain that to me, bro?

The brother in the video also said at 13:00: "if one machine goes down, then you still have the other Alpha nodes of your group". So, is every Alpha node on ONE machine, or every group on ONE machine? And with "one machine goes down", does he mean the actual hardware machine (e.g., the RAM or CPU or SSD breaks, or there's a power outage), or the VM?

One last question: the setup in the picture is an HA (high availability) setup, correct? Is every group one machine? My assumption: no, because that doesn't make any sense. I think that Z1, A1, A4, A7 are one machine; Z2, A2, A5, A8 one machine; and Z3, A3, A6, A9 one machine. Is that true? But what do the groups mean then? Because from the video I understood that every group is one machine, but that doesn't make sense. I am confused. (My English comprehension is quite bad, and the auto-generated English subtitles are, as you know, not the best, so maybe it's a misunderstanding on my part. I am sorry.)

Thank you very much and kind regards!!!

For the best performance in HA, every node is its own machine. So Z1, A1, A4, A7 are four separate machines, and Z2, A2, A5, A8 are four other separate machines. Groups are logical containers of machines.

1 Like

Thanks a lot, now I understand that setup! The photo shows a 3x4-node HA (high availability) Dgraph cluster setup. Every Alpha node is its own machine. A2, A5, A8 are replicas (Z2 manages them); A1, A4, A7 are replicas (Z1 manages them); A3, A6, A9 are replicas (Z3 manages them). The groups are, as you said, logical containers for these machines, because they are replicas of each other: they all hold the same data.

But that means, if we now forget multi-tenant instancing and HA (it doesn't really matter whether we have multi-tenancy or HA or not) and take a basic sharding setup, Dgraph will still perform badly with lots of different data (because of the different services). Because when it comes to sharding to scale horizontally, Dgraph will shard/balance predicates based on disk usage. That will create a big mix, because Dgraph doesn't care whether the data belongs together or not (so it will mix different services). So we will have many hops between disks, and that will cause latency. Is that true or not? Because I remember Dgraph had some kind of "3-predicate" architecture (which solves this issue, as I read in the Dgraph introduction on the website), but it wasn't explained much further (it's explained in the whitepaper, but I only understand banana when reading the whitepaper).

Can you guys maybe shed some light on this for me? (._.)

Also, you said "Dgraph will balance the predicates", NOT "the data". That means there is some more logic behind it for better performance, so that one Dgraph database is able to manage multiple services. Can you maybe explain that "3-predicate" thing? .-.

Data takes the shape of S → P → O.

Subject
Predicate
Object

So, looking at user data, we can see this in RDF format:

0x1 <User.name> "foo" .
0x2 <User.name> "bar" .
0x3 <User.name> "baz" .

Everything follows this SPO model. So a friend relationship has the same shape:

0x1 <User.friend> 0x2 .
0x2 <User.friend> 0x1 .

The terms "predicate" and "edge" get used interchangeably a lot with Dgraph; context is the key to understanding the difference. The context here is that the predicates are what gets sharded. So if we had this data on an HA cluster, it could be sharded with all "User.name" triples on one Alpha and all "User.friend" triples on the other.

Data is queried by predicate. The fastest query is to ask for an object by UID (the subject), but usually you want something more than the UID, so you request a field:

{ n(func: uid(0x1)) { User.name } }

This would return “foo” structured in JSON for “n” and the field requested.
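Roughly, the response shape would look like this (Dgraph also returns an "extensions" block with query metrics, which is omitted here):

{
  "data": {
    "n": [
      { "User.name": "foo" }
    ]
  }
}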

This is quick because the Zero node knows who holds the predicate User.name, so the query goes there and asks for the specific triple.

Now, if you were to ask for the friends and their names, the query could span multiple Alphas to gather the data that is needed. Data is stored in posting lists, so the query first gets only keys, then filters those keys, and then fetches the fields requested, with Zero managing this process. Data is only read once, even if it is repeated multiple times deep in the query. It is the job of one Alpha to respond to the request after Zero has coordinated all of the Alphas to work together to retrieve the data needed for the response.
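For example, extending the query above, something like this would touch both the User.name and the User.friend predicates, and therefore potentially two different Alphas:

{
  n(func: uid(0x1)) {
    User.name
    User.friend {
      User.name
    }
  }
}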

1 Like