Advice on handling corrupt alpha pod

xbreid · January 5, 2023, 11:57am

Hello,

We had an incident today where an alpha pod started throwing an error similar to the one in this topic: LOG Compact FAILED with error: MANIFEST removes non-existing table 15777621

After restarting the cluster, most of our replicas began to stabilize. However, the alpha pod originally impacted is restarting constantly and throwing the following error:

12:27:33.499 2023/01/05 11:27:33 file does not exist for table 17230931
12:27:33.499 Error while creating badger KV posting store

My assumption is that we need to remove this pod and its PVC, following these steps: https://dgraph.io/docs/deploy/kubernetes/#removing-a-dgraph-pod

However, after reviewing the Zero endpoints: https://dgraph.io/docs/v21.03/deploy/dgraph-zero/#endpoints

It’s mentioned that you cannot use the same idx on the restarted alpha pod. Does this mean we simply cannot restart the alpha pod after removing the PVC, in this case dgraph-alpha-1?

Unfortunately, it will be rather difficult for us to change the idx value of that pod. We are also unable to delete the PVC without first removing the pod entirely.

We had the idea to scale down our dgraph cluster to 1 alpha and 1 zero, remove the PVC, and remove the alpha from Zero via the endpoint, then scale it back up to 3 alphas and 3 zeros. Are there any issues with doing it this way?
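
For reference, the Zero calls this plan relies on might look like the sketch below. The /state and /removeNode paths come from the endpoint docs linked above; the Zero address (localhost:6080, e.g. after a port-forward), the function names, and the id and group values are placeholders and assumptions, not values from our cluster. The real id and group should be visible in the /state output.

```shell
# Sketch only: thin wrappers around the Zero HTTP endpoints.
# CURL can be overridden (e.g. CURL=echo) to print the request instead of sending it.

# Show cluster membership as reported by Zero.
zero_state() {
  local zero="$1"                      # host:port of a Zero HTTP endpoint (assumed 6080)
  "${CURL:-curl}" --silent "http://${zero}/state"
}

# Ask Zero to remove a dead node from its group.
zero_remove_node() {
  local zero="$1" id="$2" group="$3"   # id and group of the node, as listed by /state
  "${CURL:-curl}" --silent "http://${zero}/removeNode?id=${id}&group=${group}"
}
```

Usage would be something like zero_state localhost:6080 first, then zero_remove_node localhost:6080 3 1 with the id and group of the dead alpha. Per the docs, the node must already be down before it is removed.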

Currently our leaders are on alpha-2 and zero-2. If we scale down to 1 of each, will the leaders be re-elected accordingly?

Please let me know your thoughts.

xbreid · January 5, 2023, 1:20pm

It looks like we were able to resolve the issue by doing the following:

- Cordon the node which runs dgraph-alpha-1 and dgraph-zero-2: kubectl cordon $NODE_NAME

- Delete the pod: kubectl delete pod dgraph-alpha-1 (this was the corrupted Dgraph pod)

- Delete the PVC of dgraph-alpha-1: kubectl delete pvc datadir-dgraph-alpha-1 (this was the corrupted PVC)

- Uncordon the node from the first step.

- That relaunched dgraph-alpha-1 (the missing pod) on the node, with a clean PVC created upon launch.

- The join process started, and data started to rebuild from a snapshot of the other alpha nodes.
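
The steps above can be collected into a small script, roughly as sketched here. The function name recover_alpha and the KUBECTL override are made up for illustration, and the script assumes the current kubectl context and namespace already point at the Dgraph cluster. It only wraps the same cordon, delete, and uncordon commands listed above.

```shell
# Sketch only: replays the manual recovery steps for one corrupted alpha.
# Set KUBECTL=echo to print the commands instead of running them (dry run).
recover_alpha() {
  if [ "$#" -ne 3 ]; then
    echo "usage: recover_alpha NODE_NAME POD_NAME PVC_NAME" >&2
    return 2
  fi
  local node="$1" pod="$2" pvc="$3"
  local kubectl="${KUBECTL:-kubectl}"

  # Mark the node unschedulable before touching the pod.
  "$kubectl" cordon "$node" || return 1
  # Delete the corrupted alpha pod.
  "$kubectl" delete pod "$pod" || return 1
  # Delete the PVC that held the corrupted data directory.
  "$kubectl" delete pvc "$pvc" || return 1
  # Make the node schedulable again so the pod can come back with a new PVC.
  "$kubectl" uncordon "$node" || return 1
}
```

In our case the call would have been recover_alpha "$NODE_NAME" dgraph-alpha-1 datadir-dgraph-alpha-1. Running it once with KUBECTL=echo first shows exactly which commands would be issued.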
