Advice on handling corrupt alpha pod

Hello,

We had an incident today where an alpha pod started throwing an error similar to the one in this topic: LOG Compact FAILED with error: MANIFEST removes non-existing table 15777621.

After restarting the cluster, most of our replicas began to stabilize. However, the alpha pod originally impacted keeps restarting with the following error:

12:27:33.499 2023/01/05 11:27:33 file does not exist for table 17230931
12:27:33.499 Error while creating badger KV posting store 

My assumption is that we need to remove this pod and its PVC by following these steps: https://dgraph.io/docs/deploy/kubernetes/#removing-a-dgraph-pod

However, after reviewing the Zero endpoints: https://dgraph.io/docs/v21.03/deploy/dgraph-zero/#endpoints

The docs mention that you cannot use the same idx on the restarted alpha pod. Does this mean we simply cannot restart the alpha pod (in this case dgraph-alpha-1) after removing the PVC?

Unfortunately, it will be rather difficult for us to change the idx value of that pod. We are also unable to delete the PVC without first removing the pod entirely.

Our idea is to scale the Dgraph cluster down to 1 alpha and 1 zero, remove the PVC, remove the alpha from Zero via the endpoint, and then scale back up to 3 alphas and 3 zeros. Are there any issues with doing it this way?

Currently our leaders are on alpha-2 and zero-2. If we scale down to 1 of each, will the leaders be re-elected accordingly?
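For reference, removing an alpha from Zero via the endpoint could look roughly like the sketch below. It assumes Zero's HTTP port (6080 by default) has been made reachable on localhost, for example with kubectl port-forward. The ZERO_URL variable and the function names are hypothetical, and the id and group values must be read from the /state output rather than guessed.

```shell
# Assumption: Zero's HTTP port is reachable here, e.g. after
#   kubectl port-forward pod/dgraph-zero-0 6080:6080
# (the pod name is an assumption; adjust to your cluster).
ZERO_URL="${ZERO_URL:-http://localhost:6080}"

# Build the removeNode URL for a given Raft id and group.
remove_node_url() {
  printf '%s/removeNode?id=%s&group=%s\n' "$ZERO_URL" "$1" "$2"
}

# Print cluster membership so the id and group of the broken alpha
# can be looked up before anything is removed.
show_state() {
  curl -s "$ZERO_URL/state"
}

# Ask Zero to remove the node with the given id from the given group.
remove_node() {
  curl -s "$(remove_node_url "$1" "$2")"
}

# Example (not run automatically):
#   show_state
#   remove_node <id> <group>
```

Keeping the URL construction in its own function makes it easy to print and double-check the request before sending it, since a removed id cannot be reused afterwards.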

Please let me know your thoughts.

It looks like we were able to resolve the issue by doing the following:

- Cordon the node which runs dgraph-alpha-1 and dgraph-zero-2: kubectl cordon $NODE_NAME

- Delete the faulty, corrupted pod: kubectl delete pod dgraph-alpha-1

- Delete the PVC of dgraph-alpha-1, which held the corrupted data: kubectl delete pvc datadir-dgraph-alpha-1

- Uncordon the node from step 1.

- This relaunched dgraph-alpha-1 (the missing pod) on the node, with a clean PVC created upon launch.

- The join process started, and data began to rebuild from a snapshot of the other alpha nodes.
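
The steps above can be collected into a small script. This is only a sketch of what worked for us, not an official procedure: the function names and the DRY_RUN switch are my own additions, and the pod and PVC names are the ones from our cluster.

```shell
# Run a command, or only print it when DRY_RUN is set to a non-empty value.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Recovery sequence, in the same order as the steps above:
#   1. cordon the node so the replacement pod is not scheduled yet
#   2. delete the faulty pod
#   3. delete its PVC
#   4. uncordon the node so the pod relaunches with a fresh PVC
# Arguments: node name, pod name, PVC name.
# The chain stops at the first command that fails.
recover_alpha() {
  node="$1"
  pod="$2"
  pvc="$3"
  run kubectl cordon "$node" &&
    run kubectl delete pod "$pod" &&
    run kubectl delete pvc "$pvc" &&
    run kubectl uncordon "$node"
}

# Example (not run automatically):
#   DRY_RUN=1 recover_alpha "$NODE_NAME" dgraph-alpha-1 datadir-dgraph-alpha-1
```

Running it once with DRY_RUN=1 prints the four kubectl commands without touching the cluster, which is a cheap way to confirm the names before doing it for real.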