We have a 6-node cluster: 3 Zeros and 3 Alphas.
There’s an Alpha instance in our Test environment that’s non-responsive; the other two Alphas are fine. I logged onto the instance, and the dgraph process was still up and running. When I stop and restart the dgraph process on the bad Alpha, I can see the other nodes disconnect from it and then reconnect. However, when I run a query against it (using curl against localhost), the query never returns. This has happened once before. That time I just terminated and rebuilt the instance because I didn’t have time to investigate, but I’m curious what causes it.
I tried restarting the dgraph process, but it didn’t help.
The ERROR log has:
Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>
When I look at localhost:8080/state, it shows the correct cluster metadata. The other five nodes are up and reachable.
In the INFO log after a restart, I can see it successfully receive its first state update from the Zero and list the schema predicates.
Is there any other debugging I can try to fix the instance before I just terminate it and let it rebuild?
Thanks!
Dgraph version : v21.03.2
Dgraph codename : rocket-2
Dgraph SHA-256 : 00a53ef6d874e376d5a53740341be9b822ef1721a4980e6e2fcb60986b3abfbf
Commit SHA-1 : b17395d33
Commit timestamp : 2021-08-26 01:11:38 -0700
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true