Infrastructure as Code (IaC) - We-go fully automated

At Wego we love automation, so we use lots of tools to automate our work and make our lives easier. Following the latest DevOps trends, we decided to automate our infrastructure as well. The initial constraint was that the tool had to work across different public clouds, because we don't want to be locked in to any specific vendor. So we chose Terraform, which has been gaining serious traction lately and seems a very capable tool. We use Packer to build images for different clouds, Terraform to deploy stateful infrastructure, and Consul to manage the services on the servers.

Deployment process in 4 stages:
  • Golden Image building. We use Packer to prepare golden images on different clouds - AWS, GCE, Azure, Aliyun. Within Packer we use Ansible for initial provisioning (see Ansible-local provisioner).
  • Infrastructure creation. Terraform comes into action here. We define our stacks, VPCs, subnets, security groups, etc., and deploy them. If something needs to change, no problem: we change it in the code, deploy to the staging environment for testing, then apply the same configuration to production. As easy as it sounds. We can also review the changes between deployments in GitHub. Nice!
  • Service auto-discovery. Once a new server starts, it runs a Consul agent (the agent is part of the Golden Image, hence part of the static configuration, see below). A deployment server listens for changes in Consul state and automatically runs Ansible tasks to provision and configure newly appeared servers (we use the Ansible inventory script for Consul). This works nicely with auto-scaling groups: whenever an auto-scaling policy launches a new server, it gets its configuration automatically. When we want to change server configuration, we just run Ansible tasks on all servers. No need to rebuild the Golden Image for configuration changes. (Figure: Auto-discovery and deployment.)
  • Post-deployment monitoring and management. We collect all logs in central storage (StackDriver Logging), ready to browse, search, and create alerts on.
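The auto-discovery stage above can be sketched roughly as follows. This is a minimal illustration, not our actual deployment server: it assumes node snapshots shaped like the items returned by Consul's /v1/catalog/nodes endpoint (each with a "Node" key), and the playbook name `site.yml` and inventory script name `consul_inventory.py` are hypothetical placeholders.

```python
from __future__ import annotations

from typing import Iterable


def new_nodes(previous: Iterable[dict], current: Iterable[dict]) -> list[str]:
    """Return names of nodes present in `current` but not in `previous`.

    Each entry mimics an item from Consul's /v1/catalog/nodes response,
    which carries the node name under the "Node" key.
    """
    known = {node["Node"] for node in previous}
    return sorted(node["Node"] for node in current if node["Node"] not in known)


def ansible_command(hosts: list[str],
                    playbook: str = "site.yml",              # hypothetical playbook name
                    inventory: str = "consul_inventory.py",  # hypothetical inventory script
                    ) -> list[str]:
    """Build an ansible-playbook invocation limited to the given hosts."""
    return ["ansible-playbook", "-i", inventory, playbook,
            "--limit", ",".join(hosts)]


if __name__ == "__main__":
    # Two catalog snapshots: web-2 appeared between them.
    before = [{"Node": "web-1", "Address": "10.0.0.11"}]
    after = before + [{"Node": "web-2", "Address": "10.0.0.12"}]

    fresh = new_nodes(before, after)
    if fresh:
        # A real deployment server would hand this list to subprocess.run().
        print(" ".join(ansible_command(fresh)))
```

Limiting the run to the newly appeared hosts keeps provisioning fast; a full run over all servers is still available by dropping the `--limit` argument.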

This is a simplified diagram of what our deployment looks like (single region example):
[Figure: Simplified diagram of the deployment]

Configuration definitions

We have defined 2 types of configuration:

  • Static - this is what goes into the Golden Image, also referred to as immutable configuration. It holds everything that is not application-specific, such as basic logging setup, required packages, and basic server configuration.
  • Dynamic - this is everything else, related to the application that runs on the server: application configuration, secrets, log capture and forwarding, monitoring agents, etc.

Changes in the Static configuration require a new image build and a re-deploy of all servers that are based on it.
Changes to the Dynamic configuration are performed on running servers. That's why we chose Ansible: it provides idempotency for the tasks it executes, and it makes configuration changes traceable and easy to revert or inspect.
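To illustrate what idempotency means for changes on running servers, here is a small sketch in Python. It is not Ansible code; the `ensure_line` helper and the configuration keys are made up for the example. The point is that a task describes a desired state, so running it a second time changes nothing and reports that nothing changed.

```python
from __future__ import annotations


def ensure_line(lines: list[str], wanted: str) -> tuple[list[str], bool]:
    """Make sure `wanted` is present in `lines`.

    Returns (new_lines, changed). The line is appended only if it is
    missing, so repeated calls converge on the same result.
    """
    if wanted in lines:
        return lines, False
    return lines + [wanted], True


if __name__ == "__main__":
    config = ["log_level=info"]  # hypothetical configuration content

    config, changed_first = ensure_line(config, "forward_logs=true")
    config, changed_second = ensure_line(config, "forward_logs=true")

    # The first run applies the change, the second run is a no-op.
    print(changed_first, changed_second, config)
```

Because each run reports whether it changed anything, re-applying the same tasks to every server is safe, and the "changed" flags give a trace of what actually happened.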

Once we have the infrastructure deployed, we use StackDriver to monitor it and keep all logs (plus some Google BigQuery for log analysis). We use our enhanced Fluentd -> Kafka -> BigQuery ELT mechanism here.

For application deployment, we rely on Jenkins to start Ansible tasks, and all of this is wired to GitHub hooks. Each commit in GitHub triggers an automatic build and deployment to the related environment. We use the Build-once, Deploy-many approach, which gives us faster deployments and fast feedback if something goes wrong with a build or deploy.
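The Build-once, Deploy-many idea can be sketched like this. The `ArtifactStore` class is a hypothetical in-memory stand-in for a real artifact repository, and the "build" is a placeholder; the sketch only shows that one commit produces one artifact, and every environment receives that same artifact.

```python
from __future__ import annotations

import hashlib


class ArtifactStore:
    """Hypothetical in-memory stand-in for an artifact repository."""

    def __init__(self) -> None:
        self._artifacts: dict[str, bytes] = {}
        self.builds = 0  # counts how many real builds were performed

    def build_once(self, commit: str, source: bytes) -> str:
        """Build the artifact for `commit` unless it already exists.

        Returns the SHA-256 checksum of the stored artifact.
        """
        if commit not in self._artifacts:
            self.builds += 1
            # Placeholder for a real build step (compile, package, ...).
            self._artifacts[commit] = b"built:" + source
        return hashlib.sha256(self._artifacts[commit]).hexdigest()


def deploy(store: ArtifactStore, commit: str, source: bytes,
           environments: list[str]) -> dict[str, str]:
    """Deploy the same artifact everywhere; returns environment -> checksum."""
    return {env: store.build_once(commit, source) for env in environments}


if __name__ == "__main__":
    store = ArtifactStore()
    result = deploy(store, "abc123", b"app source", ["staging", "production"])
    print(store.builds, result["staging"] == result["production"])
```

Since staging and production get a byte-identical artifact, whatever was tested in staging is exactly what runs in production.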

Service management

Consul gives us very powerful tools for service auto-discovery. Paired with Ansible for inventory management, it lets us easily run configuration tasks on just the nodes we need, with total control over the inventory attributes. For example, when a new server is created, it is initially just a blank instance with no attributes. Once it is running, we apply some configuration to it, which includes setting Consul tags and service definitions. Reloading the Consul agent on this server changes its service definition in Consul, which then triggers the next set of configuration tasks for this server. This way we can split configuration into different parts and apply only what is needed for that server. One server can also serve different roles, if needed, and we can control that in an easy and maintainable way.
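As a rough sketch of this tag-driven approach, the snippet below maps Consul service tags to the playbooks that should run next. The tag names and playbook file names are invented for illustration and are not our actual configuration.

```python
from __future__ import annotations

# Hypothetical mapping from Consul service tags to Ansible playbooks.
PLAYBOOKS_BY_TAG = {
    "base": "base.yml",
    "web": "web.yml",
    "worker": "worker.yml",
}


def playbooks_for(tags: list[str]) -> list[str]:
    """Return the playbooks to run for a node, in the order of its tags.

    Unknown tags are ignored, so a blank instance with no tags gets
    no playbooks at all.
    """
    return [PLAYBOOKS_BY_TAG[tag] for tag in tags if tag in PLAYBOOKS_BY_TAG]


if __name__ == "__main__":
    print(playbooks_for([]))                 # blank instance
    print(playbooks_for(["base", "web"]))    # a web server
    print(playbooks_for(["web", "worker"]))  # one server, two roles
```

Keeping the mapping in one place means a server changes role simply by changing its tags, with no role-specific logic on the deployment server.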

Auto-scaling

Let's say a few words on auto-scaling. This is a must for any modern app. So many businesses suffer from paying for unused capacity, or from not having enough capacity when needed. Auto-scaling makes life easier. We use the features available from the cloud providers; most public clouds have quite powerful auto-scaling offerings. We feed signals such as demand, load, and external metrics to the auto-scaling groups, which then adjust capacity accordingly. We also use auto-scaling for auto-healing: when a server node becomes unresponsive, it gets terminated and replaced with a fresh copy. Since the whole deployment is automated, this process doesn't involve human interaction, but is triggered by rules we have defined.
[Figure: Deployment diagram]
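To make the scaling and healing rules more concrete, here is a simplified sketch. It uses a plain proportional rule (scale capacity by the ratio of the observed metric to its target, clamped to a minimum and maximum), which is an assumption for illustration and not the exact algorithm of any cloud provider. The function names and thresholds are hypothetical.

```python
from __future__ import annotations

import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Scale proportionally to metric/target, clamped to [minimum, maximum]."""
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(current * metric / target)
    return max(minimum, min(maximum, wanted))


def nodes_to_replace(health: dict[str, bool]) -> list[str]:
    """Return the unresponsive nodes that should be terminated and replaced."""
    return sorted(name for name, healthy in health.items() if not healthy)


if __name__ == "__main__":
    # 4 servers at 90% CPU with a 60% target, group limited to 2..10 servers.
    print(desired_capacity(4, 90.0, 60.0, minimum=2, maximum=10))
    # web-2 failed its health check, so it gets replaced.
    print(nodes_to_replace({"web-1": True, "web-2": False}))
```

The minimum and maximum bounds matter as much as the rule itself: they cap the cost of a runaway trigger and guarantee a baseline of capacity.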

Monitoring

We use StackDriver to monitor servers. It gives us nice dashboards with all important metrics, uptime checks, and an incident overview. Its most valuable feature is access to all logs for all servers at the time of an incident, just a single click away. No more browsing logs and matching timestamps. It also integrates quite well with uptime management software like Pingdom, PagerDuty, etc. We get all important alerts in Slack as well.

Summary

Of course, we take all this automation seriously. Too much automation can sometimes become overkill: badly defined rules and triggers can lead to undesired loops and end up exhausting resources. We also add alerts for everything. The 4 key factors of a good deployment process are efficiency, auditability, maintainability, and predictability. Using the right tools gives you better control over them and makes it easier to define KPIs and evaluate overall performance and improvements over time.
