Unbalanced disk usage

Hello, I’d like to better understand how Dgraph distributes data across shards.

I have 6 Alpha shards on 6 separate machines. I also have Zero and Ratel running on a separate machine. My data is mostly a constant stream of new nodes of many types, with occasional updates to just one node type.

Here is the current distribution of data across the shards, with ~6m nodes:

```
9.6G    data/dgraph/p
47M     data/dgraph/w

509M    data/dgraph/p
69M     data/dgraph/w

521M    data/dgraph/p
74M     data/dgraph/w

7.3G    data/dgraph/p
68M     data/dgraph/w

443M    data/dgraph/p
83M     data/dgraph/w

895M    data/dgraph/p
72M     data/dgraph/w
```

As you can see, two of them hold significantly more data than the others. IIUC, the data is split across shards by predicate, correct? Is there an easy way to see which predicates are stored on each shard? What would be the best way to get the data more evenly distributed? Would I need to split the larger predicates into sub-predicates?

My other concern is that Dgraph is using far more space than I expected. My raw data (JSON) is under 5 GB. Could this be because I am indexing too many predicates?

Thanks for your help.

If anyone finds this later: I found some answers on my own in this thread: Splitting predicates into multiple groups.

Dgraph will do its best to rebalance predicates across groups (every --rebalance-interval) based on data size. The data and indices for a predicate always live in the same Alpha group, and indices are stored on disk, so an indexed predicate will use noticeably more disk space than the raw data alone.
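If the automatic rebalance never evens things out, the Dgraph deploy docs also describe a /moveTablet admin endpoint on Zero for moving one predicate to a specific group by hand. A minimal sketch of building that request (the endpoint name and query parameters are taken from the docs as I read them, so double-check them against your Dgraph version before relying on this):

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually send the request

def move_tablet_url(zero_addr, predicate, group):
    """Build the Zero admin URL asking for `predicate` to move to `group`.

    The /moveTablet path and the tablet/group parameters are assumptions
    based on the Dgraph deploy docs; verify them for your version.
    """
    return f"http://{zero_addr}/moveTablet?" + urlencode(
        {"tablet": predicate, "group": group})

url = move_tablet_url("localhost:6080", "name", 2)
print(url)  # http://localhost:6080/moveTablet?tablet=name&group=2
# urlopen(url)  # would trigger the move against a running Zero
```

Note this only constructs the URL; the actual GET (commented out) needs a running Zero on that address.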

You can check Zero's /state endpoint to see which predicates belong to each group: https://docs.dgraph.io/deploy/#more-about-state-endpoint
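In case it helps the next reader, here is a small sketch of pulling the predicate-to-group mapping out of a /state response and summing per-group disk usage. The payload below is made up for illustration, and the field names (`groups`, `tablets`, `predicate`, `space`) are assumptions based on the docs page above, so check them against what your Zero actually returns:

```python
import json

# Illustrative /state payload -- the real response is much larger and its
# exact shape may differ by Dgraph version.
state = json.loads("""
{
  "groups": {
    "1": {"tablets": {"name":   {"predicate": "name",   "space": "5200000000"},
                      "friend": {"predicate": "friend", "space": "4400000000"}}},
    "2": {"tablets": {"age":    {"predicate": "age",    "space": "300000000"}}}
  }
}
""")

# Sum the reported on-disk size of each group's tablets (predicates).
def group_sizes(state):
    sizes = {}
    for gid, group in state["groups"].items():
        tablets = group.get("tablets", {})
        sizes[gid] = sum(int(t.get("space", 0)) for t in tablets.values())
    return sizes

for gid, total in sorted(group_sizes(state).items()):
    preds = ", ".join(sorted(state["groups"][gid].get("tablets", {})))
    print(f"group {gid}: {total / 1e9:.1f} GB across [{preds}]")
```

Against a live cluster you would fetch the JSON from Zero's HTTP port (6080 by default) instead of the inline sample; the per-group totals should then line up with the `p` directory sizes above.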