Connectivity issues in Cloud: 502 (Bad Gateway) and 504 (Gateway Timeout) Errors

aman-bansal · July 6, 2021, 4:48pm

In recent days we noticed that users are facing connectivity issues with Dgraph Cloud. Two errors account for most of them: 502 Bad Gateway and 504 Gateway Timeout. This post provides a root cause analysis (RCA) for both, the detailed steps we are taking to fix them, and where we currently stand.

Issue 1: 502 Bad Gateway Errors

Root Cause:

Dgraph Cloud infrastructure is built on Kubernetes. All dedicated deployments are part of one Kubernetes cluster. Sometimes, because of node movements, alpha or zero pods are relocated from one machine to another. While a pod is being moved, the affected Dgraph instance is unavailable for a short time. Non-HA clusters in Dgraph Cloud are more prone to these errors because even a single alpha movement can make the backend unavailable.

Fix:

During our observation we found that a pod movement takes 15 seconds on average. We are adding internal retries for requests that hit 502 errors, to give Dgraph Cloud users a more seamless experience. Dgraph HA clusters are less susceptible to these errors, but we have added the same retries for HA clusters too. We are currently testing this change in staging to make sure it doesn't introduce any side effects, and we will release the fix to production soon. I will keep this thread updated and will post here once the fix is deployed in production.
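To illustrate the idea of retrying through a short outage, here is a minimal Python sketch of a bounded retry on 502 responses. It is not the code running in Dgraph Cloud: the function name `send_with_retry`, the 20-second budget (chosen only to sit a little above the observed 15-second average), and the backoff values are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, bytes]  # (HTTP status code, body)


def send_with_retry(
    send: Callable[[], Response],
    budget_seconds: float = 20.0,   # assumed: a bit above the ~15 s pod movement
    initial_delay: float = 0.5,     # assumed backoff parameters
    max_delay: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Response:
    """Call send() until it returns a non-502 status or the time budget runs out."""
    deadline = clock() + budget_seconds
    delay = initial_delay
    while True:
        status, body = send()
        if status != 502:
            return status, body
        remaining = deadline - clock()
        if remaining <= 0:
            # Budget exhausted: surface the last 502 to the caller.
            return status, body
        # Never sleep past the deadline; double the delay up to max_delay.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

A caller would pass a zero-argument function that performs one request and returns its status and body. Retrying is only safe when the request can be repeated without side effects, so a real implementation would need to decide how to treat mutations.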

Issue 2: 504 Gateway Timeout Errors

Root Cause:

User requests are routed through a proxy to their Dgraph instances. We observed that when our proxy server restarts, because of a release or a node movement, it does not properly drain the connections that are already established. Once the proxy pod is deleted, the destination of those connections is no longer available, so clients see gateway timeout errors.

Fix:

On further investigation we found a memory leak in our proxy servers that was causing frequent restarts and therefore increasing the number of 504 errors. We have already fixed this memory leak, and the change has been running in production since 3 July, so users should already have seen a decline in the frequency of 504 errors.
We are still working on graceful draining of connections when the proxy server is released or a node movement occurs. I will keep this thread updated and will post here once the fix is deployed in production.
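For readers unfamiliar with graceful draining, the following Python sketch shows the general pattern: stop admitting new requests, then wait a bounded time for in-flight ones to finish before exiting. It is only an illustration under assumptions, not our proxy's implementation; the class `DrainingGate` and its method names are hypothetical.

```python
import threading


class DrainingGate:
    """Tracks in-flight requests so that a shutdown can wait for them."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_enter(self) -> bool:
        """Register a new request; refuse it once draining has begun."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Mark one request as finished and wake a waiting drain() if idle."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop admitting requests, then wait up to `timeout` seconds.

        Returns True if every in-flight request finished in time.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )
```

In a Kubernetes setting, `drain` would typically be called from the handler for the SIGTERM that precedes pod deletion, with a timeout shorter than the pod's `terminationGracePeriodSeconds`, so that the process exits only after established connections have completed or the timeout has passed.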

