Dgraph Cloud Postmortem: Backends Inaccessible on April 14, 2021

  • Date: 2021-04-14
  • Authors: @dmai
  • Subsystem: Dgraph Cloud
  • Impact: Full outage for multiple customers lasting roughly 10 to 35 minutes.
  • Action-Item Status: In Progress

Issue Summary:

The Dgraph Cloud team performed a planned deployment that caused downtime for requests coming into backends. Multiple customers experienced 10 to 35 minutes of downtime and intermittent request timeouts over the course of the day.

Outage: Backend middle-tier requests

  • Time: 14th April 2021, 13:32 - 13:52 UTC
  • Impact: All existing backends

The Dgraph Cloud team was deploying a new release that greatly reduces the number of moving parts in the internal architecture and streamlines many of the out-of-band processes required to service a Dgraph backend. The new architecture also expands Dgraph Cloud availability to all AWS regions.

During this deployment, the Dgraph team upgraded the middle-tier services and load balancers that handle all backend requests. In the process, traffic was routed to new services before they were fully configured. Once configuration completed at the end of the rollout, the outage was resolved and requests were processed successfully.
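
One way to avoid routing traffic to a service before it is fully configured is to gate the cutover on a readiness check. Below is a minimal sketch in Go, assuming a hypothetical /health endpoint on the new middle-tier service; it is illustrative only, not the actual Dgraph Cloud deployment tooling.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntilReady polls the new service's health endpoint until it reports
// ready, so the load balancer cutover only happens after configuration is
// complete. The /health endpoint and timings here are assumptions.
func waitUntilReady(healthURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	client := &http.Client{Timeout: 5 * time.Second}
	for time.Now().Before(deadline) {
		resp, err := client.Get(healthURL)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // fully configured; safe to shift traffic
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("service at %s not ready within %s", healthURL, timeout)
}

func main() {
	// Hypothetical endpoint for the new middle-tier service.
	if err := waitUntilReady("https://new-middle-tier.internal/health", 10*time.Minute); err != nil {
		fmt.Println("cutover aborted:", err) // abort instead of routing traffic to an unready service
		return
	}
	fmt.Println("health check passed; shifting load balancer traffic")
}
```

With a gate like this in the rollout pipeline, the load balancer keeps serving the old services until the new ones pass their checks.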

Outage: Dashboard and request validation

  • Time: 14th April 2021, 22:00 UTC - 15th April 2021, 00:30 UTC
  • Impact: All existing backends and the Dgraph Cloud dashboard

Middle-tier services also collect usage statistics for every backend, and this upgrade included a database migration to streamline usage-stats collection within Dgraph Cloud. During the migration, all of the historical usage-statistics data was loaded into the new database instance in a short span of time. The database did not have enough resources to handle the added load, and requests began exceeding their configured timeouts.
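
Loading all of the historical data at once is what overwhelmed the database. A common mitigation is to backfill in small, throttled batches; the sketch below assumes hypothetical loadBatch and writeBatch helpers standing in for the real migration's read and write steps.

```go
package main

import (
	"fmt"
	"time"
)

// usageRow is a stand-in for one historical usage-statistics record.
type usageRow struct {
	Backend string
	Count   int
}

// loadBatch and writeBatch are stubs for the real migration's reads from
// the old stats store and writes to the new database.
func loadBatch(offset, limit int) ([]usageRow, error) { return nil, nil }
func writeBatch(rows []usageRow) error                { return nil }

// migrateUsageStats backfills historical rows in small batches with a pause
// between batches, capping the write rate so the target database is not
// saturated while it also serves live traffic.
func migrateUsageStats(batchSize int, pause time.Duration) error {
	for offset := 0; ; offset += batchSize {
		rows, err := loadBatch(offset, batchSize)
		if err != nil {
			return err
		}
		if len(rows) == 0 {
			return nil // backfill complete
		}
		if err := writeBatch(rows); err != nil {
			return err
		}
		time.Sleep(pause) // throttle between batches
	}
}

func main() {
	if err := migrateUsageStats(500, 200*time.Millisecond); err != nil {
		fmt.Println("migration failed:", err)
	}
}
```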

This database is also part of the API key validation path for backend requests, so the slowdown caused API key validation to fail for some backends. Because validation results are cached, not every backend was affected during this window.
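
The caching behavior explains the partial impact: backends with a fresh cache entry kept working, while expired entries fell through to the slow database. Below is a minimal sketch of a TTL validation cache, assuming a hypothetical slowLookup function for the database round trip; it is not Dgraph Cloud's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type cacheEntry struct {
	valid   bool
	expires time.Time
}

// keyCache caches API key validation results for a TTL so most requests
// skip the validation database entirely.
type keyCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func newKeyCache(ttl time.Duration) *keyCache {
	return &keyCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// Validate returns the cached result while it is still fresh; otherwise it
// calls slowLookup, which hits the database. A database timeout in the slow
// path surfaces as a validation failure for that backend.
func (c *keyCache) Validate(apiKey string, slowLookup func(string) (bool, error)) (bool, error) {
	c.mu.Lock()
	e, ok := c.entries[apiKey]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.valid, nil // cache hit: no database round trip
	}
	valid, err := slowLookup(apiKey)
	if err != nil {
		return false, err
	}
	c.mu.Lock()
	c.entries[apiKey] = cacheEntry{valid: valid, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return valid, nil
}

func main() {
	cache := newKeyCache(5 * time.Minute)
	lookup := func(key string) (bool, error) {
		if key == "" {
			return false, errors.New("database timeout") // simulated slow-path failure
		}
		return true, nil
	}
	ok, err := cache.Validate("example-key", lookup)
	fmt.Println(ok, err)
}
```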

To remedy this, the Dgraph team allocated more resources (CPU and memory) to the database, which resolved the issue.

This was not caught in the staging environment because the staging dataset was much smaller than the production dataset, which led to overly optimistic expectations for the production run.

Corrective and Preventative Measures

  • Set up a Dgraph Cloud status page to communicate uptime and downtime to all customers.
  • Communicate with all customers ahead of time about any planned downtime, and schedule Dgraph Cloud re-architecture deployments outside of business hours when possible to reduce impact.
  • Perform regular end-to-end health checks to monitor uptime for Dgraph Cloud backends (a minimal probe is sketched after this list).
  • For all future releases, run load tests against production-level workloads in staging environments to set predictable load expectations.
  • Undertake a performance QA exercise so the team is fully aware of the performance limits of the Dgraph Cloud services.
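
As referenced in the health-check item above, here is a minimal sketch of a periodic end-to-end probe. The endpoint and the alert hook are placeholders for illustration, not the actual monitoring setup.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// alert is a placeholder for paging on-call or updating the status page.
func alert(url string, err error) {
	fmt.Printf("health check failed for %s: %v\n", url, err)
}

// probeBackends issues a request against each backend endpoint on a fixed
// interval and alerts on any failure, so an outage like this one is
// detected within one interval rather than via customer reports.
func probeBackends(endpoints []string, interval time.Duration) {
	client := &http.Client{Timeout: 10 * time.Second}
	for range time.Tick(interval) {
		for _, url := range endpoints {
			resp, err := client.Get(url)
			if err != nil {
				alert(url, err)
				continue
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				alert(url, fmt.Errorf("unexpected status %d", resp.StatusCode))
			}
		}
	}
}

func main() {
	// Hypothetical backend endpoint; real checks would cover every region.
	probeBackends([]string{"https://example.us-east-1.aws.cloud.dgraph.io/health"}, time.Minute)
}
```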