There was a major outage in Dgraph Cloud's AWS us-west-2 region starting at 2021-07-09T06:00:00Z, lasting around 45 minutes.
The outage happened when we were launching our k8s node/instance allocator in Dgraph's internal cluster, which happens to be located in the same AWS region as the customer-facing us-west-2 deployment. We had already deployed the node allocator in all user-facing AWS regions around the world; the last thing pending was to deploy it in Dgraph's internal cluster.
There are two k8s clusters running in the us-west-2 region. The user-facing cluster (call it K1) already had a node allocator running, which was managing a set of EC2 instances.
When we started another node allocator for the other k8s cluster (K2), it saw a number of EC2 instances running in AWS that were not registered with K2. Treating them as left behind, it promptly terminated those instances, which took down all of K1's nodes within a very short period of time.
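To make the failure mode concrete, here is a minimal Go sketch of the kind of reconciliation loop described above, assuming the allocator simply terminates any instance it cannot find in its own cluster. The helper functions are hypothetical stand-ins for the real EC2 and k8s lookups, not the actual allocator code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the allocator's real EC2 and k8s clients.
type instance struct{ ID string }

// listEC2Instances would call DescribeInstances for the region.
func listEC2Instances() []instance { return nil }

// isRegisteredWithCluster would look the instance up among this k8s cluster's nodes.
func isRegisteredWithCluster(id string) bool { return false }

func terminate(i instance) { fmt.Println("terminating", i.ID) }

// reconcile sketches the flawed behavior: any instance in the region that is
// not registered with *this* cluster is assumed to be orphaned and terminated,
// even if it belongs to a different cluster in the same region.
func reconcile() {
	for _, inst := range listEC2Instances() {
		if !isRegisteredWithCluster(inst.ID) {
			terminate(inst) // K1's nodes hit this branch when the loop runs for K2
		}
	}
}

func main() {
	for {
		reconcile()
		time.Sleep(time.Minute)
	}
}
```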
We immediately saw the node allocator terminating the instances and realized what had happened. We killed the allocator in K2 and manually brought up the nodes in K1, while letting the K1 allocator bring up more nodes as needed.
It took us some more time to stabilize K2. Everything was back up and online by 2021-07-09T06:45:00Z.
Steps to avoid this
- We’re going to make the node allocator aware of multiple k8s clusters in the same region, so it only acts on EC2 instances that belong to its own cluster.
- We’d also add a rate limiter in the allocator when it comes to terminating unregistered EC2 nodes, limiting it to 1 termination per N minutes. Had this been in place, it would have given us time to react. A sketch of both changes is below.
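Here is a minimal Go sketch of what the two changes could look like together. It is not the allocator's actual code; the cluster tag, the 10-minute interval, and the helper functions are hypothetical stand-ins for the real EC2 and k8s lookups.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins; the tag name, interval, and helpers are illustrative.
type instance struct {
	ID         string
	ClusterTag string // EC2 tag recording which k8s cluster owns the node
}

func listEC2Instances() []instance           { return nil }
func isRegisteredWithCluster(id string) bool { return false }
func terminate(i instance)                   { fmt.Println("terminating", i.ID) }

const (
	myCluster           = "K2"
	terminationInterval = 10 * time.Minute // "1 termination per N minutes"
)

var lastTermination time.Time

func reconcile() {
	for _, inst := range listEC2Instances() {
		// Fix 1: ignore instances owned by other clusters in the same region.
		if inst.ClusterTag != myCluster {
			continue
		}
		if isRegisteredWithCluster(inst.ID) {
			continue
		}
		// Fix 2: rate-limit terminations of unregistered instances so a bad
		// decision costs at most one node per interval, leaving time to react.
		if time.Since(lastTermination) < terminationInterval {
			continue
		}
		terminate(inst)
		lastTermination = time.Now()
	}
}

func main() {
	for {
		reconcile()
		time.Sleep(time.Minute)
	}
}
```

Tag-based ownership means each allocator only ever considers its own instances, and the rate limiter acts as a backstop in case that filter is ever wrong.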