Splitting predicates into multiple groups

A given predicate always belongs to same group, i.e will always go same node (and its replicas)
So this node/server will eventually fill-up and there is no way to horizontally scale the storage.

Has any work been done on splitting the predicate into multiple groups? ( https://docs.dgraph.io/design-concepts/#group)

If not, what solutions are offered, without vertically scaling (which has limits and be expensive) ?

Vik - Thanks for starting thread and capturing all of my notes in a one sentence succinct fashion.

Here’s my unabridged questions and notes from Slack, a little tuned up. Please let me know if I misunderstand the magnitude of a predicate being restricted to a single server. Are the predicates stored so small it doesn’t matter? If so how would one calculate a server size anyway?

Thanks,
Ryan

Questions:

  1. When you run out of space on a server, due to one predicate - how do you add a new node and rebalance the shards?
  2. How do you calculate an estimated size for a predicate group, such that you can size a server to handle the anticipated growth of a predicate group.
  3. How do you calculate the size of a node to accommodate the size of the predicate
  4. Specifically, I’m looking for more details on the sharding for part of a cluster design - If I design small economy sized nodes and need to add more in the future, how do the predicate groups split themselves to accommodate more data than initially planned for? How do you calculate an estimation of how large the predicate group will be?

The information for this is scattered, here’s a list of references I’ve found:

  1. From the dgraph documentation on Group design (https://docs.dgraph.io/design-concepts/#group), dating back to the 0.8.0 release in July 17, 2017, "In a future version, if a group gets too big, it could be split further. In this case, a single Predicate essentially gets divided across two groups.”

  2. Sydney 5 Min agenda, June 22, 2016: “Distributed: Automatically distribute data to and serve from provided servers. Handle shard splits, shard merges, and shard movement.” (https://github.com/dgraph-io/dgraph/blob/7851bb75ca753e6fe98ad504dc53cc629ed0573a/present/sydney5mins/g.slide#L36)

  3. From GitHub Issue on Dgraph vocabulary (https://github.com/dgraph-io/dgraph/issues/2171) “A predicate is just the middle part in a <Subject> <Predicate> <Object>/ObjectValue . RDF NQuad. Dgraph shards data by a predicate. Therefore all data for a predicate must reside on the same server.” If you were modeling Twitter followers, several hops out - wouldn’t the predicate “follows” eventually fill the server?

  4. From the Dgraph documentation on High Availability (https://docs.dgraph.io/deploy/#ha-cluster-setup), “Over time, the data would be evenly split across all the groups. So, it’s important to ensure that the number of Dgraph alphas is a multiple of the replication setting. For e.g., if you set --replicas=3 in Zero, then run three Dgraph alphas for no sharding, but 3x replication. Run six Dgraph alphas, for sharding the data into two groups, with 3x replication.” - but how does that actually work?
    I can’t seem to find more details on it.

  5. From the Go Time #108: Graph Databases podcast on Dgraph (https://www.youtube.com/watch?v=Mc_oJ9z3mp4) - “partition data set on the predicate… there are many predicate - like name or age or friends, that information is separate so they could be on multiple machines”… “separating the data like that, no mater how much data you’re fetching the number of network requests you’re sending is proportional to the number of predicates you’re fetching not the amount of data.” - Francesc
    Can a predicate overrun the memory of a single node?

  6. From the GoSF Badger talk (https://www.youtube.com/watch?v=VftmLgwk_cY), talking about storing only keys vs storing values with the keys… this makes sense and certainly reduces the size necessary to store the information, but again - how’s the predicate sharding work, especially if additional nodes are added to the cluster to support more data? Is there a slide on this, or a talk someplace?
    During the talk, the following stats were mentioned:
    16-byte keys, 10-byte value pointer to SSD, and 64MB table - 2.5 mil keys per table
    Does this mean for every 2.5 million keys (predicate + subject = group), 64MB is needed? Is this the estimation for how large to build a server? 25 million keys would require 640MB…

  7. June 2016 - Discussion on dynamic sharding. RAFT is labeled as a solution to moving shards around, but I don’t think that includes splitting predicates.

  8. From the Dgraph documentation on New Servers & Discovery (https://docs.dgraph.io/design-concepts/#new-server-and-discovery) I see this about transferring data to new machines, but is that a replica transferred to balance total data, or splitting a shard to add total capacity to the cluster?

  9. From the Dgraph documentation on Client Implementation (https://docs.dgraph.io/clients/#implementation)

    For multi-node setups, predicates are assigned to the group that first sees that predicate. Dgraph also automatically moves predicate data to different groups in order to balance predicate distribution. This occurs automatically every 10 minutes.

We still do not shard the predicate across multiple groups but we plan to add this feature into Dgraph in the year of 2020. We are also working on releasing a paper that provides details of the internals of Dgraph, keep an eye out.

Do you have any suggestions for back-of-napkin math on how to know what the right size of RAM and SSD drive is for the amount of predicates?

As far as I understand, RAM requirements depend on how often you query (read or write) into the database.

Disk requirements are tricky to calculate beforehand, it really depends on your dataset. We perform various encoding and compression of the data that we store and that could vary from dataset to dataset. Though, you could perform a small experiment with a decently large dataset (say 2GB) and extrapolate your disk requirements.

Does this reflect how Predicates, Groups, and Shards relate on a node?

I am not sure I follow everything in the slides, but more details are available here https://docs.dgraph.io/deploy/#understanding-dgraph-cluster. As far as I understand, we have groups and each group can contain one or more than one predicates which will be replicated to each alpha in the group depending upon the replication factor.