Alpha nodes stuck in "opPredMove"

Hi there,
So we have a situation where the health check on our Dgraph cluster (running on GKE) gives:

"ongoing": [
      "opPredMove"
    ]

I’m wondering what this means exactly and how long the cluster will stay in this state. Also, would this affect Dgraph operations? We have been facing several errors while writing data to Dgraph.

Since we are facing issues in production, any suggestion to fix this ASAP would be really helpful. Thanks!

The cluster is moving a predicate tablet. You should check the health of your disk and its size. Moves happen when the disk is full or slow.
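To see which nodes are reporting the move, here is a minimal sketch that scans a health response for "opPredMove". It assumes the response is a JSON array of node objects that each carry an "address" and an optional "ongoing" list, as in the snippet from the question; the sample payload and the function name are made up for illustration.

```python
import json


def nodes_moving_predicates(health_json: str) -> list[str]:
    """Return the addresses of nodes whose "ongoing" list contains opPredMove."""
    nodes = json.loads(health_json)
    moving = []
    for node in nodes:
        # "ongoing" may be missing or null when no operation is running.
        ongoing = node.get("ongoing") or []
        if "opPredMove" in ongoing:
            moving.append(node.get("address", "<unknown>"))
    return moving


# Hypothetical sample payload, not real cluster output.
sample = json.dumps([
    {"address": "alpha-0:7080", "ongoing": ["opPredMove"]},
    {"address": "alpha-1:7080"},
])
print(nodes_moving_predicates(sample))
```

Running this periodically against the real health output should tell you whether the move ever finishes or keeps restarting.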

In the documentation it says that:

Dgraph Zero tries to rebalance the cluster based on the disk usage in each group. If Zero detects an imbalance, it will try to move a predicate along with its indices to a group that has lower disk usage. This can make the predicate temporarily read-only. Queries for the predicate will still be serviced, but any mutations for the predicate will be rejected and should be retried after the move is finished.

Zero would continuously try to keep the amount of data on each server even, typically running this check on a 10-min frequency. Thus, each additional Dgraph Alpha instance would allow Zero to further split the predicates from groups and move them to the new node.
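Since the documentation says rejected mutations should be retried after the move finishes, a retry wrapper on the client side might look roughly like this. It is only a sketch: `PredicateMovingError` is a hypothetical stand-in for whatever error your client raises when a mutation is rejected, and `mutate` is any callable that performs the write.

```python
import time


class PredicateMovingError(Exception):
    """Hypothetical stand-in for the error raised when a mutation is rejected."""


def mutate_with_retry(mutate, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call mutate(), retrying with exponential backoff while the move is ongoing."""
    for attempt in range(attempts):
        try:
            return mutate()
        except PredicateMovingError:
            # Give up after the last attempt and surface the error to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

In practice the delays would need to be sized against how long a move takes on your data, which can be much longer than a few seconds for a large predicate.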

Is there a way to set this to a different frequency, or to stop predicate moves altogether and make it a manual process?

You can’t.

You can only make the interval longer. If you set a huge interval, it will effectively never move any predicate.

# dgraph zero -h | grep rebalance_interval
      --rebalance_interval duration   Interval for trying a predicate move. (default 8m0s)

And to move it manually, you can use Zero's HTTP endpoint or Ratel. But you'll have to move it predicate by predicate. There’s no bulk move.
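For moving a predicate manually over HTTP, here is a small sketch that builds the request URL. It assumes Zero exposes a `/moveTablet` endpoint taking `tablet` and `group` query parameters on its HTTP port (6080 by default), which is how I remember it from the Dgraph docs; check it against your version. The host, predicate name and group number below are placeholders.

```python
from urllib.parse import urlencode


def move_tablet_url(zero_http: str, predicate: str, group: int) -> str:
    """Build the URL that asks Zero to move one predicate to the given group."""
    # Assumed endpoint and parameter names; verify against your Dgraph version.
    query = urlencode({"tablet": predicate, "group": group})
    return f"{zero_http.rstrip('/')}/moveTablet?{query}"


# Placeholder values for illustration only.
print(move_tablet_url("http://localhost:6080", "name", 2))
# prints http://localhost:6080/moveTablet?tablet=name&group=2
```

You would then issue a GET against that URL (with curl or `urllib.request.urlopen`), once per predicate you want to move.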