@amaster507 I don’t think Dgraph’s approach to HA and multi-tenancy (either poor man’s MT or enterprise MT) is sufficient here. The goal is to avoid cascading failures.
We have to accept that all software is written by humans, and humans have a reputation for being idiots. Imagine I have two teams (T1 and T2) operating on the same Dgraph cluster. With Dgraph multi-tenancy, each team gets its own prefix for its 2 predicates (1-a, 1-b and 2-a, 2-b). Now, Dgraph does not guarantee that groups are aligned along the multi-tenancy boundary, so group 1 in Dgraph may get 1-a and 2-b, while group 2 gets 1-b and 2-a.
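To make the placement concrete, here is a minimal sketch in plain Python (not Dgraph's API). The `placement` dict and both helper names are made up for illustration, and it assumes the tenant is whatever comes before the first `-` in the predicate name, as in the example above. It just lists which groups end up serving more than one tenant:

```python
from collections import defaultdict

# Hypothetical placement from the example above: predicate -> Dgraph group.
placement = {
    "1-a": 1,
    "2-b": 1,
    "1-b": 2,
    "2-a": 2,
}

def tenants_per_group(placement):
    """Return {group: set of tenant prefixes with a predicate in that group}."""
    groups = defaultdict(set)
    for predicate, group in placement.items():
        # Assumption: the tenant prefix is the part before the first "-".
        tenant = predicate.split("-", 1)[0]
        groups[group].add(tenant)
    return dict(groups)

def shared_groups(placement):
    """Groups serving more than one tenant, where tenants compete for resources."""
    by_group = tenants_per_group(placement)
    return sorted(g for g, tenants in by_group.items() if len(tenants) > 1)

print(shared_groups(placement))  # [1, 2]
```

With this placement both groups are shared, so neither team has a group to itself.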
Now the result: team T1, which is seeing only normal load, starts seeing more and more latency on its API response times, even though its load hasn’t changed. Dgraph gives more and more CPU over to the bad queries that team T2 writes, until eventually T1 just stops working altogether.
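A toy model of that effect, as a rough sketch only. It assumes a shared group hands out CPU in proportion to demand with no per-tenant isolation, which is my simplification and not a description of how Dgraph actually schedules work. The function name and the numbers are made up:

```python
def slowdown(capacity, t1_demand, t2_demand):
    """Latency multiplier every tenant sees on a shared group.

    Assumption: CPU is split in proportion to demand, so once total demand
    exceeds capacity, everything slows down by total_demand / capacity.
    """
    total = t1_demand + t2_demand
    if total <= capacity:
        return 1.0
    return total / capacity

# T1's load stays constant at 2 cores while T2's bad queries grow.
for t2_demand in (2, 14, 38):
    print(t2_demand, slowdown(capacity=8, t1_demand=2, t2_demand=t2_demand))
```

In this model T1's slowdown depends entirely on what T2 does, which is the cascading failure I'm describing.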
Sure, you could add a query limit that caps all queries at 500ms or whatever, but that’s an arbitrary limit, and it affects all namespaces, so now you need buy-in from multiple different teams, not just T1 and T2.
This problem is by no means unique to Dgraph, as all data stores suffer from the same issue. Hardware isolation (i.e., separate hardware for each database) is the only solution I know of.
ETA/TL;DR: The point I’m trying to make is that most HA solutions focus on hardware or network failure. Resource exhaustion / starvation due to human error is a much more common problem, at least in my experience.