At Wego, we are continuously modernizing our tech stack. As part of our multi-region infrastructure efforts, we decided to revamp our continuous delivery pipeline by adopting Spinnaker as our deployment platform, which allows us to perform multi-region deployments with ease. Along the way, Spinnaker has continued to delight us with the numerous benefits it brings to the table. This article, the third in our series on multi-region infrastructure at Wego, details why we chose Spinnaker as our deployment platform for the cloud-native era, as well as our experience adapting it to our needs.
Prior to Spinnaker, we were using Jenkins to deploy our applications. Jenkins is an incredibly mature tool, with hundreds of plugins available and ample community support. However, it is built primarily to be a build server, not a deployment platform; its main purpose is to run integration tests and build application artifacts. Like most companies that are moving from the data center to the cloud, we extended Jenkins with extra jobs that handle deployment by running configuration management tools such as Ansible. As a result, we ended up with custom glue scripts that orchestrate deployments by calling the cloud provider APIs.
There are several problems with this approach. The first problem is that during deployment, we are mutating our servers in place. Ansible and similar tools were designed to automate system administration of machines in physical data centers. It makes sense to mutate physical machines in place since it takes a very long time to provision new machines by installing servers into racks and setting up network configuration.
In the cloud-native age, however, we work with virtual machines instead of physical ones. The former can be spun up and down almost instantaneously by calling the cloud APIs. While we can continue to run configuration updates on a virtual machine during its brief lifetime, there is less value in doing so. In fact, there is considerable value in not doing so, since any untested change to a running system is a recipe for unplanned downtime.
This brings us to the second problem: non-trivial rollbacks. Following the Anna Karenina principle, successful deployments are all alike; every failed deployment is unique in its own way. Since failed mutable deployments can happen for a variety of reasons, heavy manual intervention is required to restore the server to its last stable state. This makes rolling back much more difficult than it should be. Non-trivial rollbacks make deployment a stressful affair and decrease engineering velocity.
Fortunately, the aforementioned problems can be eliminated by using the immutable server pattern. An immutable server is a server that, once deployed, is never modified, but merely replaced with a new, updated instance. By avoiding mutation of servers in place, we also avoid unplanned downtime and non-trivial rollbacks. So far, Spinnaker is the only deployment platform we know of that enforces the immutable server pattern.
To understand how Spinnaker performs immutable deployments, we have to understand the concepts of server groups and load balancers. A server group, known as an autoscaling group in AWS or a managed instance group in GCP, contains instances that are spawned from a machine image. It is attached to a load balancer, which accepts requests and distributes them among the instances in the server group.
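As a rough mental model (not Spinnaker's actual implementation; all names here are illustrative), a server group and its load balancer can be sketched in a few lines of Python:

```python
from itertools import cycle

class ServerGroup:
    """A set of identical instances spawned from one machine image."""
    def __init__(self, name, image, size):
        self.name = name
        self.image = image
        self.instances = [f"{name}-i{n}" for n in range(size)]

class LoadBalancer:
    """Distributes incoming requests across attached server groups."""
    def __init__(self):
        self.groups = []
        self._targets = None

    def attach(self, group):
        self.groups.append(group)
        self._rebalance()

    def detach(self, group):
        self.groups.remove(group)
        self._rebalance()

    def _rebalance(self):
        # Flatten all instances from attached groups into one rotation.
        targets = [i for g in self.groups for i in g.instances]
        self._targets = cycle(targets) if targets else None

    def route(self):
        if self._targets is None:
            raise RuntimeError("no instances attached")
        return next(self._targets)

lb = LoadBalancer()
v1 = ServerGroup("app-v001", image="ami-v1", size=2)
lb.attach(v1)
print([lb.route() for _ in range(4)])  # requests alternate between v1's two instances
```

Deployments then become operations on this pair: attach a new server group, detach the old one, and the load balancer takes care of where requests land.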
While mutable deployments mutate an existing server group, immutable deployments build a new machine image from a base image, onto which the new version of the application is installed. A new server group is then spawned from the new machine image and must pass health checks before taking traffic. Only after the new server group is fully deployed and healthy does the old server group get disabled and stop taking traffic.
Assuming the application is built to cope with overlapping versions during the brief window where both old and new server groups are taking traffic, this procedure means that deployments can proceed without any application downtime. If there is something wrong with the new server group, rollbacks are trivial to perform: re-enable the old server group and disable the new one.
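The deploy-and-rollback flow described above can be sketched as follows. This is a toy model with hypothetical names, not Spinnaker's API; it only illustrates why replacing server groups, rather than mutating them, makes rollback a matter of flipping traffic:

```python
def bake(base_image, app_version):
    """Build an immutable machine image: base OS plus the app version."""
    return f"{base_image}+{app_version}"

class Deployer:
    """Immutable deployment: replace server groups, never mutate them."""
    def __init__(self):
        self.active = None      # server group currently taking traffic
        self.previous = None    # last known-good group, kept for rollback

    def deploy(self, image, healthy=True):
        candidate = {"image": image, "enabled": False}
        # The new group must pass health checks before it takes traffic.
        if not healthy:
            return False        # blast radius: nothing live has changed
        candidate["enabled"] = True
        if self.active:
            self.active["enabled"] = False   # disable, but keep, the old group
        self.previous, self.active = self.active, candidate
        return True

    def rollback(self):
        """Trivial rollback: re-enable the old group, disable the new one."""
        self.active["enabled"] = False
        self.previous["enabled"] = True
        self.active, self.previous = self.previous, self.active

d = Deployer()
d.deploy(bake("ubuntu-18.04", "app-v1"))
d.deploy(bake("ubuntu-18.04", "app-v2"))
print(d.active["image"])    # ubuntu-18.04+app-v2
d.rollback()                # v2 misbehaves: flip traffic back to v1
print(d.active["image"])    # ubuntu-18.04+app-v1
```

Nothing is ever edited in place: a bad release is abandoned, not repaired, and the last good server group is always one switch away.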
As a result, the blast radius of bad deployments is isolated to only the machine image creation step. Deployment is now a breeze with little to no operational burden.
Under the hood, Spinnaker is not a single monolithic service but a collection of independent microservices working in concert to ensure a smooth deployment experience. Following the UNIX philosophy, each service is built to do one thing and do it well.
Depending on your scalability requirements, there are various ways to operationalize Spinnaker. For companies like Netflix, which have hundreds of engineers, the ideal setup is to run each service in its own cluster. If a service has a data store dependency, it gets its own data store, which is not shared across services.
Obviously, for companies with a much leaner engineering team, it is not feasible to scale out to that extent. In our case, we went with the simpler approach of running all services on one EC2 instance, with lifecycle management automated by the autoscaling group. We also use external data stores: ElastiCache Redis for storing execution history and an S3 bucket for storing pipeline and application configuration. This ensures the data remains intact in the event of instance failure.
Automating Pipeline Creation
Every engineering team has different requirements when it comes to deployments, and Wego is no different. As our Spinnaker pipelines have converged on a set of best practices over time, it is essential that we implement their configurations as code, both to automate the creation of compliant deployment pipelines and to streamline the onboarding process for new hires.
When it comes to automating Spinnaker pipeline creation, it is important to know what a pipeline can do. Spinnaker pipelines can be divided into blocks of common operations called stages, which fall into four broad categories.
The first category is the infrastructure stages, which comprise CRUD operations on cloud resources (autoscaling groups, load balancers, etc.). The second category provides stages to integrate Spinnaker with external systems, e.g., integrating with Jenkins to trigger pipelines. The third category is the testing stages, which can perform automated canary analysis and chaos engineering. The last category is flow control, which provides stages for manual judgment as well as branching logic.
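To make this concrete, a single pipeline definition typically mixes stages from several categories. The sketch below is illustrative only: the field names follow Spinnaker's pipeline JSON (stage `type`, `refId`, `requisiteStageRefIds`), but the application and job names are hypothetical, and the exact schema depends on the tooling used:

```yaml
name: deploy-to-production
application: flights-api            # hypothetical application name
triggers:
  - type: jenkins                   # category 2: external system integration
    master: ci
    job: flights-api-build          # hypothetical Jenkins job
stages:
  - type: manualJudgment            # category 4: flow control
    name: Approve release
    refId: "1"
  - type: deploy                    # roll out the new server group
    name: Deploy
    refId: "2"
    requisiteStageRefIds: ["1"]     # runs only after approval
```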
All stages can be implemented as code in Spinnaker via tools like Roer and Spin. So the question is not how to code-ify, but what to code-ify. At Wego, we have invested heavily in Terraform to implement our infrastructure as code. Given that Spinnaker allows us to do the same, there is an overlap between Spinnaker and Terraform in their functionality.
Terraform is built specifically for managing infrastructure as code: it can generate execution plans and provides a custom configuration language (HCL) that is easy to work with. The Spinnaker tools for code-ifying infrastructure, on the other hand, only work with YAML files. By managing infrastructure using YAML, we would be forgoing the maintainability and expressiveness of Terraform's configuration language, as well as the code reusability that comes with hundreds of high-quality Terraform modules. Simply put, Terraform is unparalleled at what it does: implementing infrastructure as code.
With this in mind, we decided to continue using Terraform to manage our infrastructure. In Spinnaker, all stages but those in the first category will be implemented as code. This prevents Spinnaker from doing Terraform's job and establishes a clear boundary between the two tools. Granted, there is some manual work involved in linking Spinnaker pipelines with the Terraformed server groups, but this is not a big deal since setting up Spinnaker pipelines is a one-time task.
There are still a lot of things we are planning to do with Spinnaker. We hope to integrate the Kayenta service for automated canary analysis to make our deployments much more robust against performance regressions. By retrieving system metrics from Datadog and comparing the current cluster's metrics with those of the newly deployed one, we can be sure that the system is performing according to a baseline. This will ensure that newly released features and configurations do not have a negative impact on the user experience.
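As an illustration of the idea only (this is a naive threshold check, not Kayenta's actual statistical judge, and the numbers are made up), comparing a canary cluster's latency against the baseline might look like:

```python
from statistics import mean

def canary_score(baseline, canary, tolerance=0.10):
    """Naive canary check: fail the canary if its mean latency
    exceeds the baseline's mean by more than `tolerance` (10%)."""
    b, c = mean(baseline), mean(canary)
    regression = (c - b) / b
    return "pass" if regression <= tolerance else "fail"

# p99 latencies (ms) sampled for both clusters -- illustrative numbers
baseline_latency = [120, 118, 125, 122]
good_canary      = [121, 119, 126, 124]
bad_canary       = [160, 170, 165, 168]

print(canary_score(baseline_latency, good_canary))  # pass
print(canary_score(baseline_latency, bad_canary))   # fail
```

A real canary judge would weigh many metrics with proper statistics, but the principle is the same: promote the new server group only when it performs comparably to the baseline.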
Apart from that, we are also impressed by the vibrant Spinnaker community. More and more companies are now rallying behind Spinnaker by adopting it as the deployment platform for the cloud-native age. It is used in production by companies big and small. We could foresee it being a dominant player in the continuous delivery space, if not already. We will continue to invest in Spinnaker and adapt it to our needs. If you’re interested in pushing the frontiers of continuous delivery, reach out to us, we’re hiring!