Dgraph Cloud Postmortem: Backends Inaccessible on April 14, 2021

  • Date: 2021-04-14
  • Authors: @dmai
  • Subsystem: Dgraph Cloud
  • Impact: Full outage for multiple customers lasting roughly 10 to 35 minutes.
  • Action-Item Status: In Progress

Issue Summary:

The Dgraph Cloud team performed a planned deployment that caused downtime for requests coming into backends. Multiple customers experienced 10 to 35 minutes of downtime and intermittent request timeouts over the course of the day.

Outage: Backend middle-tier requests

  • Time: 14th April 2021, 13:32 - 13:52 UTC
  • Impact: All existing backends

The Dgraph Cloud team was deploying a new release that greatly reduces the number of moving parts in the internal architecture and streamlines many of the out-of-band processes required to service a Dgraph backend. The new architecture also expands Dgraph Cloud availability to all AWS regions.

During this deployment, the Dgraph team upgraded the middle-tier services and load balancers that handle all backend requests. In the process, traffic was routed to new services before they were fully configured. Once configuration completed at the end of the rollout, the outage was resolved and requests were processed successfully.
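
One way to avoid routing traffic to a service before it is fully configured is to gate the cutover on a readiness check. Below is a minimal sketch in Go, assuming a hypothetical /health endpoint on the new middle-tier service; it is illustrative only, not the actual Dgraph Cloud deployment tooling.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntilReady polls the new service's health endpoint until it reports
// ready, so the load balancer cutover only happens after configuration is
// complete. The /health endpoint and timings here are assumptions.
func waitUntilReady(healthURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	client := &http.Client{Timeout: 5 * time.Second}
	for time.Now().Before(deadline) {
		resp, err := client.Get(healthURL)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // fully configured; safe to shift traffic
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("service at %s not ready within %s", healthURL, timeout)
}

func main() {
	// Hypothetical endpoint for the new middle-tier service.
	if err := waitUntilReady("https://new-middle-tier.internal/health", 10*time.Minute); err != nil {
		fmt.Println("cutover aborted:", err) // abort instead of routing traffic to an unready service
		return
	}
	fmt.Println("health check passed; shifting load balancer traffic")
}
```

With a gate like this in the rollout pipeline, the load balancer keeps serving the old services until the new ones pass their checks.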

Outage: Dashboard and request validation

  • Time: 14th April 2021, 22:00 UTC - 15th April 2021, 00:30 UTC
  • Impact: All existing backends and the Dgraph Cloud dashboard

Middle-tier services also collect usage statistics for every backend, and this upgrade included a database migration to streamline usage-stats collection within Dgraph Cloud. During the migration, all of the historical usage-statistics data was loaded into the new database instance in a short span of time. The database did not have enough resources to handle the added load, and requests began exceeding their configured timeouts.
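
Loading all of the historical data at once is what overwhelmed the database. A common mitigation is to backfill in small, throttled batches; the sketch below assumes hypothetical loadBatch and writeBatch helpers standing in for the real migration's read and write steps.

```go
package main

import (
	"fmt"
	"time"
)

// usageRow is a stand-in for one historical usage-statistics record.
type usageRow struct {
	Backend string
	Count   int
}

// loadBatch and writeBatch are stubs for the real migration's reads from
// the old stats store and writes to the new database.
func loadBatch(offset, limit int) ([]usageRow, error) { return nil, nil }
func writeBatch(rows []usageRow) error                { return nil }

// migrateUsageStats backfills historical rows in small batches with a pause
// between batches, capping the write rate so the target database is not
// saturated while it also serves live traffic.
func migrateUsageStats(batchSize int, pause time.Duration) error {
	for offset := 0; ; offset += batchSize {
		rows, err := loadBatch(offset, batchSize)
		if err != nil {
			return err
		}
		if len(rows) == 0 {
			return nil // backfill complete
		}
		if err := writeBatch(rows); err != nil {
			return err
		}
		time.Sleep(pause) // throttle between batches
	}
}

func main() {
	if err := migrateUsageStats(500, 200*time.Millisecond); err != nil {
		fmt.Println("migration failed:", err)
	}
}
```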

This database is also part of the API key validation path for backend requests, so the slowdown caused API key validation to fail for some backends. Because validation results are cached, not every backend was affected during this window.
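
The caching behavior explains the partial impact: backends with a fresh cache entry kept working, while expired entries fell through to the slow database. Below is a minimal sketch of a TTL validation cache, assuming a hypothetical slowLookup function for the database round trip; it is not Dgraph Cloud's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type cacheEntry struct {
	valid   bool
	expires time.Time
}

// keyCache caches API key validation results for a TTL so most requests
// skip the validation database entirely.
type keyCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func newKeyCache(ttl time.Duration) *keyCache {
	return &keyCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// Validate returns the cached result while it is still fresh; otherwise it
// calls slowLookup, which hits the database. A database timeout in the slow
// path surfaces as a validation failure for that backend.
func (c *keyCache) Validate(apiKey string, slowLookup func(string) (bool, error)) (bool, error) {
	c.mu.Lock()
	e, ok := c.entries[apiKey]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.valid, nil // cache hit: no database round trip
	}
	valid, err := slowLookup(apiKey)
	if err != nil {
		return false, err
	}
	c.mu.Lock()
	c.entries[apiKey] = cacheEntry{valid: valid, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return valid, nil
}

func main() {
	cache := newKeyCache(5 * time.Minute)
	lookup := func(key string) (bool, error) {
		if key == "" {
			return false, errors.New("database timeout") // simulated slow-path failure
		}
		return true, nil
	}
	ok, err := cache.Validate("example-key", lookup)
	fmt.Println(ok, err)
}
```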

To remedy this, the Dgraph team allocated more resources (CPU and memory) to the database, which resolved the issue.

This was not caught in the staging environment because the staging dataset was much smaller than the production dataset, which led to overly optimistic expectations for the production run.

Corrective and Preventative Measures

  • Set up a Dgraph Cloud status page to communicate uptime and downtime to all customers.
  • Communicate with all customers ahead of time about any planned downtime, and schedule Dgraph Cloud re-architecture deployments outside of business hours when possible to reduce impact.
  • Perform regular end-to-end health checks to monitor uptime for Dgraph Cloud backends (a minimal probe is sketched after this list).
  • For all future releases, run load tests against production-level workloads in staging environments to set predictable load expectations.
  • Undertake a performance QA exercise so the team is fully aware of the performance limits of the Dgraph Cloud services.
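
As referenced in the health-check item above, here is a minimal sketch of a periodic end-to-end probe. The endpoint and the alert hook are placeholders for illustration, not the actual monitoring setup.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// alert is a placeholder for paging on-call or updating the status page.
func alert(url string, err error) {
	fmt.Printf("health check failed for %s: %v\n", url, err)
}

// probeBackends issues a request against each backend endpoint on a fixed
// interval and alerts on any failure, so an outage like this one is
// detected within one interval rather than via customer reports.
func probeBackends(endpoints []string, interval time.Duration) {
	client := &http.Client{Timeout: 10 * time.Second}
	for range time.Tick(interval) {
		for _, url := range endpoints {
			resp, err := client.Get(url)
			if err != nil {
				alert(url, err)
				continue
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				alert(url, fmt.Errorf("unexpected status %d", resp.StatusCode))
			}
		}
	}
}

func main() {
	// Hypothetical backend endpoint; real checks would cover every region.
	probeBackends([]string{"https://example.us-east-1.aws.cloud.dgraph.io/health"}, time.Minute)
}
```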