Scale down cluster

I'm testing a Dgraph instance. Creating a cluster and scaling it up just works without much attention. But then I started testing the situation where a new Alpha gets added unintentionally or for any other reason. With 3 Zeros, 3 Alphas and 3 replicas, I added new Alpha(s) and a new group appeared. No data migrated there, just an empty group, based on what Ratel showed.
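For reference, the empty group can also be checked from Zero's /state endpoint (a rough sketch; Zero's default HTTP port 6080 and jq are assumptions):

# Inspect group membership and tablets as reported by Zero.
curl -s localhost:6080/state | jq '.groups'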
Now I did a stupid thing and deleted all of the Alphas from the new group in Ratel. The group remains and the cluster is not responsive for some things. I don’t see an option to clear this group, neither in Ratel nor via the API.
Is there a proper way to scale down a cluster?

Dgraph metadata

dgraph version

Dgraph version : v22.0.2
Dgraph codename : dgraph
Dgraph SHA-256 : a11258bf3352eff0521bc68983a5aedb9316414947719920d75f12143dd368bd
Commit SHA-1 : 55697a4
Commit timestamp : 2022-12-16 23:03:35 +0000
Branch : release/v22.0.2
Go version : go1.18.5
jemalloc enabled : true

For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph, visit http://discuss.dgraph.io.
For fully-managed Dgraph Cloud, visit https://dgraph.cloud.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2022 Dgraph Labs, Inc.

Yep, that’s right. That is how Dgraph works. If you add a new Alpha, you have to increase the replication factor too, not just add more nodes. If you add more nodes with a replication factor of 3 and you already have 3 Alphas, Dgraph will create a new group. If you still want replication, you have to alter the replica factor along with adding more nodes (Alphas).

The math is the number of Alphas divided by the replica factor. If your replica factor is set to 6, for example, and you have 3 Alphas, the cluster will expect you to add more Alphas later until you reach 6.

If you have replica set to 6 and 18 Alphas you have 3 groups.

In short, in order to scale you need to alter the replica factor too.
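As a hedged illustration (the address is a placeholder; the flags are from the Zero CLI), the replication factor is a Zero flag, so resizing groups means changing it there and matching the Alpha count:

# Zero decides group membership: with --replicas 3, every 3 Alphas form one group,
# so a lone 4th Alpha ends up starting group 2 on its own.
dgraph zero --my=zero1:5080 --replicas 3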

For this case you should use “remove node” before anything else, and also not reuse the RAFT ID. You may just have messed up your RAFT context. If so, shut down the cluster and delete all Dgraph directories in there.
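A minimal sketch of that clean-up, assuming the default directory names and that the data is disposable (the actual paths depend on your --postings and --wal flags):

# Stop every Alpha and Zero first, then wipe their state from each node's data dir.
rm -rf p w    # Alpha: postings (p) and write-ahead log (w)
rm -rf zw     # Zero: write-ahead log (zw)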

Thanks for the reply.
The concept of the replication factor is familiar to me.
What do you mean by “remove node before anything”? And which directories should I delete?
The Zero follower nodes complain about the (deleted) Alpha from the new group.

E0403 10:49:47.730246      15 pool.go:311] CONN: Unable to connect with alpha4:7080 : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha4 on 10.89.0.1:53: no such host"

The Zero leader doesn’t complain about the Alpha, only logs info about skipping the snapshot for the unwanted group.

I0403 10:38:26.818905      17 raft.go:807] Skipping creating a snapshot. Num groups: 2, Num checkpoints: 1

Output from /state.
Here is the group I want to delete completely:

    "2": {
      "members": {},
      "tablets": {},
      "snapshotTs": "0",
      "checksum": "0",
      "checkpointTs": "0"
    }
  },

Here is the Alpha I deleted:

  "maxRaftId": "4",
  "removed": [
    {
      "id": "4",
      "groupId": 2,
      "addr": "alpha4:7080",
      "leader": true,
      "amDead": false,
      "lastUpdate": "1680071246",
      "learner": false,
      "clusterInfoOnly": false,
      "forceGroupId": false
    }
  ],

The other Alphas with IDs 1, 2, 3 are attached to group “1”.
The three Zeros are attached to group “0”.

Zero has an endpoint to remove nodes - that’s what I mentioned:
/removeNode?id=3&group=2
https://dgraph.io/docs/deploy/dgraph-zero/#endpoints

Cuz you need to remove the unwanted node from the RAFT context. It will keep trying to connect until you remove it.

Alphas communicate with each other. They ask the Zero group who is participating in the cluster, and after that they communicate directly. So Zero will not always be trying to communicate with a node; it waits for Alpha nodes to request things, like a “pull”.

If this was a leader, you should first move its tablets (predicates, indexes) to another group before removing it.
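For reference, a rough sketch of that step using Zero’s HTTP endpoint (the port 6080 and the tablet name are placeholders):

# Ask Zero to move one tablet (predicate) off the group you want to drain.
curl "localhost:6080/moveTablet?tablet=name&group=1"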

Zero has an endpoint to remove nodes - that’s what I mentioned:
/removeNode?id=3&group=2
https://dgraph.io/docs/deploy/dgraph-zero/#endpoints

Yes, I know. Ratel does the same removeNode call, as I saw in the logs.

Cuz you need to remove the unwanted node from the RAFT context. It will keep trying to connect until you remove it.

removeNode should remove it from the RAFT context, as I understand?

If this was a leader, you should first move its tablets (predicates, indexes) to another group before removing it.

I did. I repeated everything on a clean cluster. Created an HA cluster (RF 3), then added one new node. No data in this new group.

    "2": {
      "members": {
        "4": {
          "id": "4",
          "groupId": 2,
          "addr": "alpha4:7080",
          "leader": true,
          "amDead": false,
          "lastUpdate": "1680590254",
          "learner": false,
          "clusterInfoOnly": false,
          "forceGroupId": false
        }
      },
      "tablets": {},
      "snapshotTs": "0",
      "checksum": "0",
      "checkpointTs": "0"
    }
  },

Then I ran removeNode:

curl "localhost:6081/removeNode?id=4&group=2"
Removed node with group: 2, idx: 4

And the same problem as I described above. Group 2 remains, spam in the logs, etc.
It looks like there should be an endpoint like remove[Empty]Group.

Just tried to run a cluster backup and, as expected, it ended with:

E0404 07:10:04.874895      15 backup_ee.go:209] Error received during backup: Couldn't find a server in group 2
E0404 07:10:04.874951      15 queue.go:237] task 0x108256006: failed: Couldn't find a server in group 2
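For reference, a minimal sketch of how such a binary backup is typically triggered via the /admin GraphQL endpoint (the Alpha port and destination path here are assumptions):

# Enterprise binary backup request against one Alpha's /admin endpoint.
curl localhost:8080/admin -H 'Content-Type: application/graphql' -d '
mutation {
  backup(input: { destination: "/dgraph/backups" }) {
    response { message code }
  }
}'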