Nabil and Anthony - sorry we did not have more info sooner on the timeouts (“context deadline exceeded”) in ap-south-1. As I just wrote above, that’s a very broad message that bubbles up whenever a Go HTTP call times out, so it can mean many things - often just a slow query or an overloaded system.
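To illustrate how generic that message is, here is a minimal Go sketch. It is not Dgraph code: the slow test server, the `timedOut` helper, and the durations are all assumptions made up for the example. Any HTTP call whose context deadline expires yields the same error, whatever the server was actually doing.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// timedOut issues a GET with a deadline and reports whether the call
// failed because that deadline expired. (Hypothetical helper.)
func timedOut(url string, timeout time.Duration) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// The client wraps the context error, so errors.Is can find it.
		return errors.Is(err, context.DeadlineExceeded), err
	}
	resp.Body.Close()
	return false, nil
}

// demo runs timedOut against a deliberately slow local test server.
func demo() (bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for a slow query or an overloaded system.
		select {
		case <-time.After(200 * time.Millisecond):
		case <-r.Context().Done():
		}
	}))
	defer srv.Close()
	return timedOut(srv.URL, 20*time.Millisecond)
}

func main() {
	hit, err := demo()
	fmt.Println("deadline exceeded:", hit)
	fmt.Println("error:", err)
}
```

The error text says nothing about why the server was slow, which is why the message alone could not tell us what was wrong in ap-south-1.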
But the issue in ap-south-1 was worse and also unique. The “context deadline exceeded” errors correlated with a drop-off in mutations (as seen by monitoring the rate of increase of the cluster’s max timestamp). We also saw “num pending txns: 1” many times in the logs. We are pretty sure a problematic mutation transaction was submitted on two successive days, at 9:15 and 9:30 UTC respectively, causing two partial outages during which only (read) queries were still working.
To get the cluster healthy again, both times we shut down the alpha, cleared the queued updates from the write-ahead log (the w directory), and brought the alpha back up. Note there was no data loss, because these updates were queued in a submitted state and never committed. Queries kept processing during the partial outages. We have saved the WAL that we suspect contains the mutation behind the problem, and we are working to pin down the root cause. Because this has only happened twice, both times in ap-south-1 (nowhere else and at no other time), it will probably be handled as a normal-priority bug. While we don’t have a root cause yet, I hope this helps clarify what we did, the workaround, and what we know so far.
At this point, we have an alert set for a halt in the max timestamp (specifically, we watch for the ApplyCh queue size rising too high), and we now know how to work around the issue by clearing the WAL. I hope this workaround helps anyone who encounters the issue. We are also still working to find the root cause, and once we do we will share the nature of the triggering query or other cause. But again, it seems very rare, so based on what we know so far I would not say this is something everyone needs to document or monitor for.