Over the past few days we have noticed that users are facing connectivity issues with Dgraph Cloud. Users are mainly seeing two errors: 502 Bad Gateway and 504 Gateway Timeout. This document provides a root cause analysis (RCA) for both issues, the steps we are taking to fix them, and our current progress.
Issue 1: 502 Bad Gateway Errors
Root Cause:
Dgraph Cloud infrastructure is built on Kubernetes. All dedicated deployments are part of a single Kubernetes cluster. Occasionally, because of node movements, Alpha or Zero pods are relocated from one machine to another. While a pod is being relocated, the affected Dgraph instance is unavailable for a short period. Non-HA clusters in Dgraph Cloud are more prone to these errors because the movement of even a single Alpha can make Dgraph unavailable.
Fix:
During our observation we found that a pod movement takes about 15 seconds on average. We are adding internal retries for requests that are impacted by 502 errors, to provide a more seamless experience to Dgraph Cloud users. Dgraph HA clusters are less affected by these errors, but we have added the retries for HA clusters as well. We are currently testing this change in staging to make sure it does not introduce any side effects, and we will release the fix to production soon. I will keep this thread updated and will post here once the fix is deployed in production.
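To illustrate the idea of retrying through a short pod movement, here is a minimal Python sketch. It is not the code running in Dgraph Cloud: the function name `send_with_retry`, the 20-second wait budget (chosen only to sit above the roughly 15 seconds observed), and the exponential backoff are all assumptions made for the example.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def send_with_retry(
    send: Callable[[], Response],
    max_wait_seconds: float = 20.0,   # assumed budget, a bit above the ~15 s pod movement
    initial_delay: float = 0.5,       # assumed first pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and retry while it returns 502, until the wait budget is spent."""
    delay = initial_delay
    waited = 0.0
    while True:
        status, body = send()
        # Anything other than 502 is returned as-is; so is a 502 once the budget is used up.
        if status != 502 or waited >= max_wait_seconds:
            return status, body
        pause = min(delay, max_wait_seconds - waited)
        sleep(pause)
        waited += pause
        delay *= 2  # exponential backoff between attempts
```

A caller would wrap the function that forwards the request, for example `send_with_retry(lambda: forward(request))`, where `forward` is a hypothetical helper. A real implementation would also need to consider whether a request is safe to repeat, since blindly retrying a mutation that partially reached the backend could apply it twice.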
Issue 2: 504 Gateway Timeout Errors
Root Cause:
User requests are routed through a proxy to their Dgraph instances. We observed that when a proxy server restarts because of a release or a node movement, the proxy does not properly drain its already-established connections. Once the proxy pod is deleted, the connection destination is no longer available, and clients therefore see gateway timeout errors.
Fix:
- On further investigation we found a memory leak in our proxy servers that was causing frequent restarts and therefore increasing the number of 504 errors. We have already fixed this memory leak, and the change has been running in production since 3 July. Users should already have seen a decline in the frequency of 504 errors.
- We are still working on gracefully draining connections when a proxy server restarts for a release or because of a node movement. I will keep this thread updated and will post here once the fix is deployed in production.
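For readers unfamiliar with connection draining, the general technique can be sketched as follows: on shutdown, stop accepting new requests, then wait (up to a deadline) for the requests already in flight to finish before exiting. This is a simplified Python illustration, not the proxy's actual implementation; the class `DrainingServer` and its method names are hypothetical.

```python
import threading


class DrainingServer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def try_begin(self) -> bool:
        """Register a new request; refused once draining has started."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished and wake up a waiting drain()."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop accepting work, then wait up to `timeout` seconds for in-flight requests.

        Returns True if everything finished in time, False if the deadline passed.
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

In a Kubernetes setting, a process would typically call something like `drain()` from its SIGTERM handler, with a timeout shorter than the pod's termination grace period, so that established connections complete before the pod is removed.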