I'm testing a Dgraph instance. Creating a cluster and scaling it up just works without much attention. But then I started testing the situation where a new Alpha gets added, unintentionally or for any other reason. The setup is 3 Zeros and 3 Alphas with 3 replicas. I added new Alpha(s) and a new group appeared. No data was migrated there; it is just an empty group, based on what Ratel showed.
Then I did a stupid thing and deleted all of the Alphas of the new group in Ratel. The group remains and the cluster is unresponsive for some things. I don't see an option to clear this group, neither in Ratel nor in the API.
Is there a proper way to scale down a cluster?
Yep, that's right. That is how Dgraph works. If you add a new Alpha, you have to increase the replication factor too, not just add more nodes. If you add more nodes with a replication factor of 3 and you already have 3 Alphas, Dgraph will create a new group. If you still want to replicate, you have to alter the replication factor along with adding more nodes (Alphas).
The math is number of Alphas / replication factor = number of groups. If your replication factor is set to 6, for example, and you have 3 Alphas, the cluster will expect you to add more Alphas later until it reaches 6.
If you have the replication factor set to 6 and 18 Alphas, you have 3 groups.
In short, in order to scale you need to alter the replication factor too.
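To make the arithmetic concrete, here is a minimal sketch. It assumes Zero fills a group up to the replication factor before starting a new one; the function names are made up for illustration and are not part of Dgraph.

```python
import math


def expected_groups(num_alphas: int, replicas: int) -> int:
    """Number of Alpha groups, assuming each group is filled up to
    `replicas` members before a new group is started."""
    if num_alphas < 0 or replicas < 1:
        raise ValueError("num_alphas must be >= 0 and replicas >= 1")
    return math.ceil(num_alphas / replicas)


def group_sizes(num_alphas: int, replicas: int) -> list:
    """Member count per group under the same assumption."""
    full, rest = divmod(num_alphas, replicas)
    return [replicas] * full + ([rest] if rest else [])


if __name__ == "__main__":
    print(expected_groups(18, 6))  # the 18 Alphas / 6 replicas case
    print(group_sizes(4, 3))       # a 4th Alpha added to a 3-replica cluster
```

Under that assumption, a fourth Alpha in a 3-replica cluster ends up alone in a second group, which matches the empty new group described in the question.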
In this case you should use “remove node” before anything else, and also not reuse the RAFT ID. You may just have messed up your RAFT context. Just shut down the cluster and delete all the Dgraph directories in there.
Thanks for the reply.
The concept of the replication factor is familiar to me.
What do you mean by “remove node before anything”? And which directories should I delete?
The Zero follower nodes complain about the (deleted) Alpha from the new group.
E0403 10:49:47.730246 15 pool.go:311] CONN: Unable to connect with alpha4:7080 : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup alpha4 on 10.89.0.1:53: no such host"
The Zero leader doesn't complain about the Alpha. It only logs info about skipping the snapshot for the unwanted group.
I0403 10:38:26.818905 17 raft.go:807] Skipping creating a snapshot. Num groups: 2, Num checkpoints: 1
Output from state
Here is the group that I want to delete completely.
Because you need to remove the unwanted node from the RAFT context. It will keep trying to connect until you remove it.
Alphas communicate with each other, so they ask the Zero group who is participating in the cluster. After that, they communicate directly. So Zero will not always be trying to communicate with a node. It waits for Alpha nodes to request things, like a “pull”.
If this was a leader, you should first move its tablets (predicates, indexes) to another group before removing it.
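As a rough sketch of that order of operations, the snippet below builds the requests for Zero's HTTP admin endpoints `/moveTablet` and `/removeNode`. The base address `http://localhost:6080` is an assumption (Zero's default HTTP port on a local machine), and the helper names are made up for illustration. Check the parameters against the docs for your Dgraph version before running anything against a real cluster.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Zero's HTTP endpoint is reachable here; adjust for your cluster.
ZERO_HTTP = "http://localhost:6080"


def move_tablet_url(tablet: str, dst_group: int, base: str = ZERO_HTTP) -> str:
    """URL asking Zero to move one tablet (predicate) to another group."""
    return f"{base}/moveTablet?{urlencode({'tablet': tablet, 'group': dst_group})}"


def remove_node_url(raft_id: int, group: int, base: str = ZERO_HTTP) -> str:
    """URL asking Zero to remove the node with this RAFT ID from a group."""
    return f"{base}/removeNode?{urlencode({'id': raft_id, 'group': group})}"


def call(url: str, timeout: float = 10.0) -> str:
    """Send the GET request and return Zero's plain-text response."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; pass them to call() once you are sure.
    print(move_tablet_url("name", 1))  # "name" is a placeholder predicate
    print(remove_node_url(4, 2))       # group 2, idx 4, as in this thread
```

Moving tablets first means the group you are emptying no longer serves any predicates by the time its members are removed.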
Removed node with group: 2, idx: 4
And it is the same problem as I described above. Group 2 remains, spams the logs, etc.
It looks like there should be an endpoint like remove[Empty]Group.
I just tried to do a cluster backup and, as expected, it ended with
E0404 07:10:04.874895 15 backup_ee.go:209] Error received during backup: Couldn't find a server in group 2
E0404 07:10:04.874951 15 queue.go:237] task 0x108256006: failed: Couldn't find a server in group 2
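A pre-check along these lines could flag such a group before a backup is started. It is only a sketch: it assumes the JSON from Zero's `/state` endpoint has a top-level `groups` object keyed by group ID, each with a `members` object. The sample document below is invented for illustration and is not the output from this cluster.

```python
import json


def groups_without_members(state: dict) -> list:
    """Return the IDs of groups that have no member Alphas listed."""
    empty = []
    for gid, group in (state.get("groups") or {}).items():
        if not (group or {}).get("members"):
            empty.append(gid)
    return sorted(empty, key=int)


# Invented sample in the assumed shape of /state output.
SAMPLE_STATE = """
{
  "groups": {
    "1": {"members": {"1": {"id": "1", "groupId": 1, "addr": "alpha1:7080"}},
          "tablets": {"name": {"groupId": 1, "predicate": "name"}}},
    "2": {"members": {}, "tablets": {}}
  }
}
"""

if __name__ == "__main__":
    print(groups_without_members(json.loads(SAMPLE_STATE)))
```

This only detects the situation. It does not remove the group, which is the part that seems to be missing.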