Dgraph Cloud us-west-2 Outage on July 8th 11 pm PDT

mrjn · July 9, 2021, 7:23am

There was a major outage in the Dgraph Cloud AWS us-west-2 region starting at 2021-07-09T06:00:00Z and lasting around 45 minutes.

The outage happened when we were trying to launch our k8s node/instance allocator in Dgraph’s internal cluster, which happened to be located in the same AWS region, us-west-2, as the customer-facing cluster. We had already deployed the node allocator in all the user-facing AWS regions around the world. The last thing pending was to deploy it in Dgraph’s internal cluster.

There are two k8s clusters running in the us-west-2 region: the user-facing cluster (say K1) and Dgraph’s internal cluster (say K2). K1 already had a node allocator running, so it was controlling a bunch of EC2 instances.

As we ran another node allocator for the other k8s cluster (K2), it saw a bunch of EC2 instances running in AWS, which were not registered with K2. Thinking they were left behind, it promptly went on to terminate those instances. This caused all K1 nodes to be terminated in a very short period of time.
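To make the failure mode concrete, here is a minimal sketch of the kind of check described above. It is not the actual allocator code: the `Instance` type, the `find_unregistered` function, and the instance IDs are all hypothetical, made up for illustration. The point is only that a region-wide "not registered with me" test cannot tell an orphan from a node that belongs to another cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for an EC2 instance seen in the region."""
    instance_id: str
    cluster: str  # which cluster really owns it; the naive check never looks at this


def find_unregistered(region_instances, registered_ids):
    """Naive check: every instance in the region that is not registered
    with *this* cluster is treated as left behind."""
    return [i for i in region_instances if i.instance_id not in registered_ids]


# Everything running in the region, across both clusters (made-up IDs).
region = [
    Instance("i-k1-a", "K1"),
    Instance("i-k1-b", "K1"),
    Instance("i-k2-a", "K2"),
]

# The K2 allocator only knows about K2's own nodes.
k2_registered = {"i-k2-a"}

orphans = find_unregistered(region, k2_registered)
print([i.instance_id for i in orphans])  # ['i-k1-a', 'i-k1-b']
```

From K2's point of view, every healthy K1 node looks like a leftover instance, which matches what we saw.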

We immediately saw the node allocator terminating the instances, and realized what had happened. We killed the allocator in K2 and manually brought up the nodes in K1, while letting the K1 allocator bring up more nodes as needed.

It took us some more time to stabilize K2. Everything was back up and online by [2021-07-09T06:45:00Z].

Steps to avoid this

We’re going to make the node allocator understand multiple k8s clusters in the same region, so that it only acts on instances that belong to its own cluster.
We will also add a rate limiter in the allocator for terminating unregistered EC2 nodes, limiting it to 1 termination per N minutes. Had this been in place, it would have given us time to react.
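The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: the `cluster` ownership label, `find_orphans`, and `TerminationLimiter` are hypothetical names, and the 10-minute interval is an arbitrary stand-in for N.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for an EC2 instance seen in the region."""
    instance_id: str
    cluster: str  # assumed ownership label, e.g. derived from an instance tag


def find_orphans(region_instances, registered_ids, cluster):
    """Cluster-aware check: only instances labelled as belonging to this
    cluster can ever be considered left behind."""
    return [
        i for i in region_instances
        if i.cluster == cluster and i.instance_id not in registered_ids
    ]


class TerminationLimiter:
    """Allows at most one termination per min_interval_seconds."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last_allowed = None

    def allow(self):
        now = self.clock()
        if self.last_allowed is not None and now - self.last_allowed < self.min_interval:
            return False  # too soon since the last termination; skip this round
        self.last_allowed = now
        return True


# Same made-up region as in the incident: two K1 nodes, one registered K2 node,
# plus one genuinely orphaned K2 instance.
region = [
    Instance("i-k1-a", "K1"),
    Instance("i-k1-b", "K1"),
    Instance("i-k2-a", "K2"),
    Instance("i-k2-old", "K2"),
]
k2_orphans = find_orphans(region, {"i-k2-a"}, cluster="K2")
print([i.instance_id for i in k2_orphans])  # ['i-k2-old']

# Fake clock so the example runs instantly; N = 10 minutes is an assumption.
fake_now = [0.0]
limiter = TerminationLimiter(600, clock=lambda: fake_now[0])
decisions = [limiter.allow(), limiter.allow()]  # second call is inside the window
fake_now[0] = 600.0
decisions.append(limiter.allow())  # window has elapsed
print(decisions)  # [True, False, True]
```

With the ownership filter, the K2 allocator never considers K1 nodes at all. The limiter is a second line of defence: even if the filter were wrong, instances would go down one at a time instead of all at once.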
