Dynamic AutoScaling of GitHub Runners - Dgraph Blog

In this article we explain our transition to GitHub Actions for our CI/CD needs at Dgraph Labs Inc. As part of this effort, we built and implemented an in-house architecture for “Dynamic AutoScaling of GitHub Runners” to power this setup.

In the past, our CI/CD was powered by a self-hosted on-prem TeamCity setup, which turned out to be difficult to operate & manage in a startup setting like ours. Transitioning to GitHub Actions & implementing our in-house “Dynamic AutoScaling of GitHub Runners” has helped us reduce Compute Costs, Maintenance Efforts & Configuration Time across our repositories (with improved security).

Background

Before we begin we would like to give you an overview of CI/CD & explain our needs for it at Dgraph Labs Inc.

![](upload://aScNgdFl8rzNSPFBXL4hbWzR5GP.png)

CI/CD DevOps Infinite Loop image source credits

CI/CD is a two-step process that dramatically streamlines code development and delivery using the power of automation. CI (Continuous Integration) makes developer tasks around source code integration, testing and version control more efficient, so software can be built with higher quality. CD (Continuous Deployment) automates software testing, release & deployment. CI/CD is often referred to as the DevOps Infinity Loop (as illustrated in the image above).

Why is CI/CD important to us?

At Dgraph Labs Inc, we use CI/CD to facilitate our SDLC (Software Development Life Cycle) for our Dgraph Database and our Dgraph DBaaS (Cloud Offering) components. Like any tech company we want to minimize bugs & deliver high quality products. Testing standards are even stricter for a database company like ours, since the database is often the most critical component of a software stack. To facilitate this, we follow the Practical Test Pyramid model along with other kinds of measurements instrumented into our CI/CD. To summarize, CI/CD helps us with:

  • Ensuring higher code quality
  • Obtaining continuous feedback
  • Shipping efficiently (with higher confidence)

How do we use CI/CD?

![](upload://pA3zxhEvDYcYssDPWSdI2Q48HlD.png)

CI/CD at Dgraph

Our CI/CD use-cases mostly revolve around the following areas: Building for Multi-Architectures (amd64/arm64), Testing (Unit/Integration & Load), Deployments, Security Audits, Code Linting, Benchmarking & Code Coverage. We will cover some of these topics in our future blog posts.

Old Setup (TeamCity)

As described above, Dgraph Labs Inc ran a self-managed on-prem TeamCity setup for CI/CD in the past. The setup looked similar to the image below.

![](upload://yz4zXv5bxTm8KUKrMYc3WBo85IQ.png)

TeamCity Architecture image source credits

This setup was quite difficult for a small team like ours to manage, monitor & keep highly available. The work involved infrastructure setup, ensuring the right security posture and instrumenting observability on these systems. There were 3 issues here for us: Compute Costs, Maintenance Efforts & Configuration Time.

Firstly, the Compute Costs were additive because we not only needed Server & Agent compute machines, but also an Observability Stack (& Instrumentation) for these critical systems. Secondly, the Maintenance Efforts on the issues we encountered (like Security Patching, Upgrades, Disk Issues, Inconsistent Test Result Reporting etc.) were taxing the team and taking time away from our development cycles. Lastly, the Configuration Time was also a problem because job configurations lived outside our codebase (on the Server), VCS configurations for new repos needed instrumentation & we had to write custom install/cleanup steps for basic setup tasks in the job definitions.

As a result, Compute Costs ⬆⬆, Maintenance Efforts ⬆⬆ & Configuration Time ⬆⬆ were all high. This led us to re-think how we could transition to a new system that solved these issues and offered support for both Public & Private repositories.

NOTE: TeamCity is a great product. As explained above, time & ease-of-use were the driving factors that made us transition out.

New Setup (GitHub Actions)

Our research led us to GitHub Actions. Given that we were already on GitHub for our VCS, we explored it further. We were quite content with what it had to offer, as it came with immediate benefits. Notable architecture differences were the fully managed Server (unlike the previous setup) & the semi-managed Runners (a.k.a. Agents). Below we show an example CD run for a multi-architecture release at Dgraph Labs Inc on GitHub Actions.

![](upload://v2CIZDaq2E9jlTMvcF47u8lIdH1.png)

GitHub Actions CD steps for Dgraph (everything well integrated into the GitHub eco-system)

Firstly, the Compute Costs were lower because we only had to manage the Self-Hosted Runners. We used the free GitHub-hosted runners wherever possible, and ran jobs that required higher resource specs on Self-Hosted Runners. Secondly, the Maintenance Efforts reduced because we had fewer components to manage. Although GitHub had done a great job of simplifying the Runner setup steps, the Runners still needed manual management, which remained a concern. Lastly, the Configuration Time reduced drastically because of the Actions Marketplace, which provided pre-templated tasks for pre-setup and post-cleanup on the Runners. And with this transition, the code & job definitions lived together (unlike our old setup).

As a result, our Compute Costs (⬆) and Maintenance Efforts (⬆) both came down, & Configuration Time (⬇) was a great win. There was still room to improve, because we were manually attaching & detaching Runners on an as-needed basis. There was another problem here - “Idle Runners” - which led to wasted resource spend.

Dynamic AutoScaling Of GitHub Runners

As described in the previous section, our remaining problems were Compute Costs (idle Runners) & Maintenance Efforts (manual Runner attach/detach). We started exploring potential solutions for “Dynamic AutoScaling of GitHub Runners”. We found 2 existing solutions: ARC & Philips-Scalable-Runner. ARC was specific to the container eco-system and did not apply to us. Philips-Scalable-Runner had too many components for our needs. This led us to build our in-house solution.

Our design needed to solve these:

  • minimal AWS service use
  • support different labels (i.e. Runner types like arm64 / amd64)
  • support different repositories

This led us to the below architecture.

![](upload://eRgThUtpzLUOYRs8gQNzBV5ZFoX.jpeg)

Dynamic AutoScaling of GitHub Runners

There are 3 logical pieces in this architecture, and they are:

  • VM Images: We bake custom AMIs with a specialized startup script in them. The startup script has logic to connect to the SSM Parameter Store & read its configuration at startup. When the AMI comes up as an EC2 instance, it reads its config from the SSM Parameter Store & configures itself as a GitHub Runner to service our Jobs.
  • SSM: We use the SSM Parameter Store to store Runner configurations. This is essentially a KV store for configs. We store each Runner configuration as a Value, keyed per EC2 instance.
  • Orchestrator: This is essentially our Controller (written in Python). The Orchestrator monitors GitHub events and dynamically Scales Up or Down based on the Job Queue & the available Runner count. It has hooks into GitHub & AWS (SSM Parameter Store & EC2) to facilitate this. In the Scale Up phase, the Orchestrator creates an SSM Parameter Store entry and follows it up by launching an EC2 instance from the AMI (through Launch Templates). In the Scale Down phase, the Orchestrator deletes the SSM Parameter Store entry and follows it up by terminating the corresponding EC2 instance. A minimal sketch of this flow follows after this list.
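Below is a minimal sketch of what these Scale Up / Scale Down hooks could look like, assuming boto3 and the GitHub REST API. It is illustrative only, not our production Orchestrator: the parameter path `/gh-runners/...`, the `gh-runner` Launch Template name, the comma-separated config format and the `runner-id` tag are all hypothetical.

```python
"""Illustrative sketch of the Orchestrator's scale hooks (not the production code)."""
import uuid

import boto3
import requests

ssm = boto3.client("ssm")
ec2 = boto3.client("ec2")


def queued_job_count(owner: str, repo: str, token: str) -> int:
    """The scale signal: how many workflow runs are currently waiting in the Job Queue."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runs",
        params={"status": "queued"},
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()["total_count"]


def scale_up(owner: str, repo: str, labels: str, reg_token: str) -> str:
    """Scale Up: create the SSM Parameter Store entry, then launch an EC2 from the AMI."""
    runner_id = str(uuid.uuid4())  # hypothetical key scheme for the parameter
    # The startup script baked into the AMI reads this parameter at boot and
    # registers the instance as a GitHub Runner with the given repo & labels.
    ssm.put_parameter(
        Name=f"/gh-runners/{runner_id}",
        Value=f"{owner}/{repo},{labels},{reg_token}",
        Type="SecureString",
    )
    # The Launch Template points at the custom AMI; the tag lets the instance
    # locate its own SSM entry when it comes up.
    ec2.run_instances(
        LaunchTemplate={"LaunchTemplateName": "gh-runner"},
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "runner-id", "Value": runner_id}],
        }],
    )
    return runner_id


def scale_down(runner_id: str, instance_id: str) -> None:
    """Scale Down: delete the SSM Parameter Store entry, then terminate the EC2 instance."""
    ssm.delete_parameter(Name=f"/gh-runners/{runner_id}")
    ec2.terminate_instances(InstanceIds=[instance_id])
```

In practice, the Orchestrator would compare the queued Job count against the available Runner count for each label before deciding to scale, as described above.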

Note: We are considering open sourcing this project. For that reason we have only given an overview and skipped the full implementation details. If you are interested in discussing this further, do hit us up. We would love to partner.

Financial Analysis

![](upload://5Mn38DMCRM2hdyu1IHT55u9QMYm.png)

CI/CD Cost Graph

We enabled “Dynamic AutoScaling of GitHub Runners” on 2023-Jan-06, and we have seen a drastic reduction in our spend since then. Not only has it saved us money, it has also saved us Engineering time by eliminating Maintenance Efforts. The “Idle Runner” problem was real, and it would have affected us in different ways had we not addressed it.

High Level Analysis

  • Before AutoScaling
    • our costs increased as we got closer to our release cycles because we attached more Runners
    • the increase in costs was primarily because of beefy Idle Runners
  • After enabling AutoScaling
    • costs shrunk drastically
    • our weekend costs touched ~$0 (as weekends are break days)
    • we serviced almost triple the runs compared to our previous release
    • no manual attach/detach of Runners required by Engineers

Average Daily Runner Cost (dropped by ⬇~87%)

  • Before AutoScaling ~$63.36/day (or $1,900.8/month) ⬆⬆
  • After enabling AutoScaling ~$8.12/day (or ~$243.6/month) ⬇⬇
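
For reference, the ~87% drop quoted above follows directly from these two daily averages:

$$\frac{63.36 - 8.12}{63.36} \approx 0.87$$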

Average Per PR cost (after AutoScaling)

NOTE: We will continue to see more savings over time (as the savings compound here).

Conclusion

To conclude, the table below shows our OKRs and how we went about solving for them. The last column is where we are today in our journey.

| OKRs | Old Setup (TeamCity) | New Setup (GitHub Actions) | New Setup (GitHub Actions w/ AutoScaling) |
| --- | --- | --- | --- |
| Compute Costs | ⬆⬆ | ⬆ | ⬇ |
| Maintenance Costs | ⬆⬆ | ⬆ | ⬇ |
| Configuration Time Costs | ⬆⬆ | ⬇ | ⬇ |

Acknowledgements

Like any project, this was a team effort. We would like to thank all our internal contributors who have helped make this a reality. Thanks Aditya (co-author), Anurag, Dilip, Joshua & Kevin.


This is a companion discussion topic for the original entry at https://dgraph.io/blog/post/20230217_dynamic-autoscaling-of-github-runners/

Is it possible to open source this project? This was a great read, and the current solutions around Philips runners are quite hard to set up and use.

Another question: the SSM solution seems like a clean way to manage the dynamic configs needed for the node. Do you have a solution for this with Vault or Consul integration that we could go with? We could add this piece if the support is not available (once this is open sourced).

Happy to report that this has been open-sourced, @honeybadger. Feel free to check out the repository here.
