
Releasing to production is increasingly encumbered by fear.

Universities fear that rolling out the latest image will cause every computer in the network to reformat itself spontaneously.

Retail companies fear that their next system update will take down their entire point-of-sale network and lose them millions of dollars in sales.

Most recently, airlines fear that a few faulty IT processes will come to a head and cost the company hundreds of millions of dollars by rendering core reservation systems unavailable.

Regardless of the story, the outcome is the same: the more complex IT infrastructure becomes, the more process-driven its maintenance becomes as well. More processes introduce more opportunities for error. More errors create more process as a remediation measure. More process means more complexity, and so on.

While it sometimes isn’t possible to reduce the complexity of a system, it *is* possible to automate much of that process in unified and accessible ways. Migrating the scattered runbooks and scripts that outline the creation, maintenance and destruction of infrastructure into code within a single codebase is the easiest and fastest way of doing this. This post will explain why.

The Scenario

You are an engineering manager for one of the world’s largest airlines. The systems powering your operation are vast and complex: global distribution system (GDS) integrations that ensure that your flights are at the top of the list on Google Flights or Expedia, passenger service systems (PSS) that ensure a smooth experience for your passengers from origin to destination, crew management systems that keep the scheduling and operations of your flight crews in top shape; the list goes on. You are also responsible for the vast amount of infrastructure that most take for granted: the API gateways for your mobile app, the many JBoss or Tomcat servers powering the thousands of pages on your website, the F5s, Palo Altos and Bluecoat appliances that keep the load on your web servers evenly distributed and secure from outside traffic, and so on. Then there are the regulatory obligations from the FAA, EASA, CISA and all sorts of other A’s that your code and systems need to adhere to (and that you can’t forget about).

You have a tough job. Your job is made even tougher whenever a change needs to be made anywhere within this web of devices.

Why? Several reasons.

Engineers on your teams need to know where the documents are that describe how this change works. They might have tried the change in their lab, but because their lab’s setup diverges so much from production, there is no real guarantee that the change will work as advertised. Some parts of the documentation are outdated or missing, and the engineer who wrote the latest revision is no longer with the company. Another engineer wrote some scripts to automate this change some time ago, but that code hasn’t been tested, and given that many high-impact changes happen during the weekends, the change window isn’t a good time to find out how well it holds up. Your change also needs the help of networking and security to get done, but since every team operates largely independently of the others, plenty of tickets will need to be passed back and forth to complete the prerequisite work for your change.

From what I’ve seen, this is usually the anatomy of how a change that takes minutes in practice turns into a change that takes days. Lagging change approvals from busy managers who are drowning in a sea of change approval requests turn those days into weeks. Feature code deployments dependent on this change begin to queue, slowing them down in the process. Product managers and business development grow upset by “our” inability to push changes fast enough, and lose trust in the process.

The worst part? If that change fails, then we roll back (or try to roll back) and restart the entire process all over again. Except this time, the upper management responsible for answering to shareholders will be looking for names. Consequently, the fear of touching production gets even greater and nothing ever changes.

It doesn’t have to be this way. This is exactly the problem that DevOps was meant to solve.

The Solution? Infrastructure as Code

Infrastructure as code is a paradigm in which servers, networking and security devices are managed exactly like software features. The creation, administration and decommissioning of servers, routers, switches, storage devices, firewalls, load balancers, or any other appliances are written entirely in an easy-to-read, domain-specific language, tested and documented using the software testing pyramid, and deployed via continuous delivery pipelines.

There are several tools on the market that accomplish this goal. On the provisioning side of the equation, Terraform by HashiCorp provides a domain-specific language for provisioning infrastructure components. Configuration management tools address the configuration required post-provisioning by providing domain-specific languages and helper tools that express how infrastructure components should be configured. The most popular tools on the market are currently Chef by Chef Software (formerly Opscode), Puppet by Puppet Labs, Ansible by Red Hat and Salt by SaltStack.

This post will explain what a mature implementation of a code-driven infrastructure looks like. I will use Terraform and Chef as examples for what our fictional world-class airline would look like in this ideal state. Containers and container orchestration as well as implementing continuous delivery pipelines will be covered in separate posts, as they are heavy topics that warrant their own posts.

A Model for an Ideal Code-Driven Infrastructure

Moving your teams’ runbook documents and scripts into a single codebase composed of Terraform provisioning configurations and configuration management code, by way of Chef cookbooks or Puppet manifests, solves these problems in three key ways.

1. Clearly defined environments and environment relationships

Terraform modules enable engineers and developers to define reusable pieces of infrastructure. Modules can also contain other modules to express dependency relationships. These can be used to define an environment in an easily-readable and easily-consumable way.

Let’s say that an engineer wanted to migrate a runbook that creates an environment for a business application. The application consists of a web tier, an application tier and a persistence tier. It also interacts with your enterprise load balancer and IPAM appliances. Using pseudocode, this Terraform module would look something like this:

        
# git://path/modules/application_environment/main.tf

# This is a variable that consumers of this module will need to provide.
# Because it has a default value already set, this variable is optional.
variable "number_of_servers" {
  description = "The number of servers to provision within this environment."
  default     = 1
}

# This is an example of a mandatory variable.
# If it is missing, your Terraform run will fail.
variable "ssh_key_location" {
  description = "The path to the SSH key to provision onto instances within this environment."
}

# Every resource can provide "outputs" with useful information about itself.
# This resource will output an IP address that is
# accessible through vm.web_server.ip_address.
resource "vm" "web_server" {
  number_of_servers = "${var.number_of_servers}"
  vm_memory_gbs     = 32
  vm_cpu_count      = 4

  instance_tags = {
    name = "name_of_instance"
    group = [
      "group_1",
      "group_2"
    ]
  }

  provisioner "chef" {
    node_name = "web_server"
    run_list  = ["base::default", "web_server::default", "ipam::registration"]
    # rest of Chef settings
  }
}

# Notice the 'depends_on' parameter. It tells Terraform which resources
# are related to each other and imposes an order in which infrastructure
# must be provisioned during a deployment.
resource "vm" "app_server" {
  depends_on        = ["vm.web_server"]
  number_of_servers = "${var.number_of_servers}"
  vm_memory_gbs     = 32
  vm_cpu_count      = 4

  instance_tags = {
    name = "name_of_instance"
    group = [
      "group_1",
      "group_2"
    ]
  }

  provisioner "chef" {
    attributes_json = <<-EOF
      {
        "ssh_key_location": "${var.ssh_key_location}"
      }
    EOF
    node_name = "app_server"
    run_list  = ["base::default", "app_server::default", "ipam::registration"]
    # rest of Chef settings
  }
}

resource "vm" "database_server" {
  depends_on        = ["vm.web_server", "vm.app_server"]
  number_of_servers = "${var.number_of_servers}"
  vm_memory_gbs     = 32
  vm_cpu_count      = 4

  instance_tags = {
    name = "name_of_instance"
    group = [
      "group_1",
      "group_2"
    ]
  }

  provisioner "chef" {
    attributes_json = <<-EOF
      {
        "ssh_key_location": "${var.ssh_key_location}"
      }
    EOF
    node_name = "database_server"
    run_list  = ["base::default", "database_server::default", "ipam::registration"]
    # rest of Chef settings
  }
}

resource "dns_record" "web_server" {
  depends_on  = ["vm.web_server"]
  record_type = "A"
  record_name = "webserver1.example.com"
  ip_address  = "${vm.web_server.ip_address}"
}

resource "dns_record" "app_server" {
  depends_on  = ["vm.app_server"]
  record_type = "A"
  record_name = "appserver1.example.com"
  ip_address  = "${vm.app_server.ip_address}"
}

resource "dns_record" "db_server" {
  depends_on  = ["vm.database_server"]
  record_type = "A"
  record_name = "dbserver1.example.com"
  ip_address  = "${vm.database_server.ip_address}"
}

# webapp.example.com
resource "load_balancer_entry" "web_server" {
  depends_on = ["vm.web_server"]
  entry_name = "webapp"
  location   = "${dns_record.web_server.record_name}"
}

# app-backend.example.com
resource "load_balancer_entry" "app_server" {
  depends_on = ["vm.app_server"]
  entry_name = "app-backend"
  location   = "${dns_record.app_server.record_name}"
}

While this might seem like a lot of code, in practice it only needs to be written once. After that, whenever an engineer wants to define a new environment, all they have to do is write:

        
# main.tf
module "dev_application_environment" {
  source            = "git://path/modules/application_environment"
  number_of_servers = 4
  ssh_key_location  = "path/to/dev/key"
}
        
      

in any directory, and then run terraform get; terraform apply to create it.
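And because the module is parameterized, standing up another copy of the environment is just another module block with different inputs. A hypothetical production instantiation, for example (the inputs shown here are invented for illustration):

```hcl
# main.tf
# A hypothetical production environment reusing the same module;
# only the input values differ from the dev environment above.
module "prod_application_environment" {
  source            = "git://path/modules/application_environment"
  number_of_servers = 12
  ssh_key_location  = "path/to/prod/key"
}
```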

No more SharePoint documents or runbooks. No more inconsistent environments.

There are two other advantages gained by using Terraform for this:

  1. Terraform keeps track of its state. This allows engineers to make changes against any infrastructure component without having to redeploy an entire environment. Terraform will only reprovision what's changed and will automatically know which parameters will require a complete recreation of a component.

    What this means is this: if I change the number_of_servers to 6 instead of 4, Terraform will only deploy two additional servers instead of adding six additional servers or destroying all four existing servers and recreating six new ones. This saves a lot of time and minimizes the need for downtime.

  2. Terraform can tell you what will happen before it happens. If an engineer makes the above change but wants to actually confirm that two servers will be created, she can run terraform plan to get a summary of what will happen next. Terraform will then show you what will be added, removed and recreated.

    This removes the chances of deploying changes blind and provides a basis from which Terraform configurations can be tested. (This is discussed in more detail in a later section.)
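The state-diffing idea underneath terraform plan can be sketched in a few lines of Ruby. This is a toy illustration only; Terraform's actual planner is far more sophisticated, and every name below is made up:

```ruby
# A toy sketch of the state-diffing idea behind "terraform plan":
# compare the state Terraform tracks with the desired configuration
# and report only the delta.
def plan(current_state, desired_config)
  to_add     = desired_config.reject { |name, _| current_state.key?(name) }
  to_destroy = current_state.reject  { |name, _| desired_config.key?(name) }
  to_change  = desired_config.select do |name, attrs|
    current_state.key?(name) && current_state[name] != attrs
  end
  { add: to_add.keys, change: to_change.keys, destroy: to_destroy.keys }
end

# Four web servers are tracked in state; the configuration now asks for six.
state  = (1..4).map { |i| ["web-#{i}", { memory_gbs: 32 }] }.to_h
config = (1..6).map { |i| ["web-#{i}", { memory_gbs: 32 }] }.to_h

diff = plan(state, config)
puts "Plan: #{diff[:add].size} to add, #{diff[:change].size} to change, " \
     "#{diff[:destroy].size} to destroy."
# Plan: 2 to add, 0 to change, 0 to destroy.
```

Only the two missing servers show up in the plan; the four existing ones are left alone, which is exactly the behavior described above.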

What's more: whenever this module changes, you can see exactly what changed and by whom. Most Git servers or collaborative version control systems also enable you to apply permissions onto repositories so that you can control who can create, modify or remove modules or Terraform configurations.

2. Natural change control through code reviews and tests

As mentioned earlier, in a world without infrastructure as code, changes are coordinated largely through tickets in change management software such as ServiceNow or Remedy. Additionally, testing the efficacy of a change is usually done manually in a lab (that is usually an older and scaled-down version of production with lots of ‘fixes’ here and there) or, worse, with production servers during or outside of business hours.

Testing your infrastructure becomes a lot easier in an infrastructure-as-code world for two reasons.

Firstly, as with any application source code, you can write unit tests that verify the correctness of your code and serve as a de facto contract that your infrastructure must adhere to. Given this, unit tests also serve as documentation that is always up to date and, in a mature organization, will always be correct.

Chef unit tests can be written with the ChefSpec testing framework. While Terraform testing frameworks are still in their early days, you can test your Terraform configuration code against the calculated state from a Terraform plan run. You can see an example of this approach in the tests behind the infrastructure code powering carlosnunez.me, my personal domain, which you can find here: https://github.com/carlosonunez/carlosnunez-me-infrastructure/spec.
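As a sketch of what such a unit test looks like, here is a minimal ChefSpec example for a hypothetical web_server cookbook. The cookbook name, platform and the assumption that its default recipe installs and starts nginx are all invented for illustration:

```ruby
# spec/unit/recipes/default_spec.rb
# A minimal ChefSpec example for a hypothetical web_server cookbook.
require 'chefspec'

describe 'web_server::default' do
  # SoloRunner converges the recipe in memory against fake node data
  # instead of touching a real server.
  let(:chef_run) do
    ChefSpec::SoloRunner.new(platform: 'centos', version: '7')
                        .converge(described_recipe)
  end

  it 'installs the web server package' do
    expect(chef_run).to install_package('nginx')
  end

  it 'enables and starts the web server service' do
    expect(chef_run).to enable_service('nginx')
    expect(chef_run).to start_service('nginx')
  end
end
```

If someone later removes the nginx package resource from the recipe, this test fails in the pipeline long before any change window, which is the "contract" role described above.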

Secondly, when the code for your infrastructure resides in version control, you can apply the same forking and branching strategies used by application deployments to create approval gates. These gates ensure that the infrastructure defined in master is known to be valid and suitable for production workloads. The approvals between those gates are enacted through pull requests, or requests to merge code from one branch into another. An example of this deployment workflow could look something like this:

  1. Engineer forks a copy of the modules codebase onto their own account (as they won't have write permission to the parent modules repository).
  2. Engineer clones the forked repository that they just created.
  3. Engineer makes changes to the application_environment module (to, say, create a CNAME record alongside an A record).
  4. Engineer uses kitchen-terraform to test the changes locally.
  5. Engineer commits the change to their fork and pushes it upstream to the server.
  6. Engineer submits a pull request to merge their fork with its parent.
  7. Lead engineer and the module maintainer review the request.
  8. Lead engineer or the module maintainer approves the request and merges the code.
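The Git mechanics behind steps 1 through 5 can be sketched with a fully local toy repository standing in for the parent; all paths, names and file contents below are illustrative:

```shell
# A toy, fully local version of the fork-and-branch flow above.
# A real setup would use a Git server; paths here are illustrative.
rm -rf /tmp/modules-parent.git /tmp/modules-fork
git init --bare /tmp/modules-parent.git              # stands in for the parent repository
git clone /tmp/modules-parent.git /tmp/modules-fork  # steps 1-2: "fork" and clone it
cd /tmp/modules-fork
git config user.email "engineer@example.com"         # identity needed for commits
git config user.name "Engineer"
git checkout -b add-cname-record                     # do the work on a feature branch
echo '# add a CNAME record alongside the A record' >> main.tf
git add main.tf
git commit -m "Add CNAME record alongside A record"  # step 5: commit the change...
git push origin add-cname-record                     # ...and push it upstream
# Steps 6-8 (pull request, review, merge) happen in your Git server's UI.
```

The pull request, review and merge are where your Git server's branch protections enforce the approval gates described above.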

Using this approach does three things for your organization:

  1. It establishes the infrastructure code within master as the canonical source of truth for your entire organization's technology,
  2. It protects that source of truth from being modified without going through a proper review cycle, and
  3. It removes your engineering team's dependency on change management software to execute simple changes against your infrastructure.

3. Unifies teams through a single codebase and common languages

As an engineering manager, you probably do not expect your systems administrators to understand how to configure Juniper routers and switches like your network engineers can. You also probably don't expect your software developers to become systems administrators and configure iptables on their application servers. You hire people to excel at what they're good at.

However, it is easy to see how this naturally creates knowledge and cultural silos. Software engineers prefer to work with other software engineers, systems administrators prefer to work with other systems administrators, and so on. As an engineering organization grows and becomes more specialized, the proliferation of proverbial lunch tables increases in tandem. Forbes wrote a nice piece on this in 2013 that you can read here.

One can consider building a tiger team of specialists from different parts of the organization as an attempt to break down these silos. While this is well-intentioned in theory, in practice it often leads to a lot of friction and confusion. Trying to turn the network engineer into a systems administrator without a proper transition plan can leave that engineer feeling like she just got burdened with an additional job and a whole lot of tickets. I wrote a bit about this a few months ago in my post about Getting Into DevOps.

Terraform and Chef are to groups of co-working engineers what English is to most people: the universal lingua franca.

Using a common set of languages on which infrastructure code is built allows teams to work together and understand each other in ways that weren't easily possible in the past. Terraform and Chef are particularly good at this: both provide a simple, clear domain-specific language that is readable by nearly anyone and requires no formal programming background to understand. Consequently, while the systems administrator doesn't have to understand how to configure a Juniper switch, he can now see how servers get configured on the network through Terraform configurations, and he can learn why it's done that way by finding the authors of that configuration in its commit history.

Make Your Job Easier

To return to the scenario we posited at the beginning: you have a difficult job as an engineering manager of an airline. In this context, infrastructure as code makes your job easier.

Instead of relying on slow change approvals and broken documentation, and living in fear of change, your engineers can rely on a single-source-of-truth codebase that has been tested and vetted through unit and integration tests and can be modified and experimented with at any time through repository forking. No longer do your engineers have to fear production; in fact, they can use infrastructure as code to run an exact (albeit smaller) replica of production right on their workstations. No longer do your software developers have to "blame IT" for being unable to release features to production; they can create environments themselves whenever they need to and run nearly full-scale integration tests without impacting customer workloads.

Ultimately, the goal of DevOps is to make your company a software-first company. Infrastructure as code gets your airline a big step closer to that goal.

Your passengers and customers will thank you for it!

  • Carlos Nunez

    Technical Principal

    Carlos Nunez is an Amazon Web Services certified Technical Principal with over 10 years of technology experience across the Financial, Insurance, and IT verticals. For the last five years, Carlos's mission has been to help transition every company into a software company through highly automated infrastructure, bridging the gap between software and systems, and enabling enterprises to drive their decision making through fast, data-driven feedback. He has executed on this through migrating enterprises onto public cloud providers and private clouds, using continuous delivery and infrastructure as code to test and deploy physical and virtual infrastructure just like any other software project and evangelizing the DevOps culture and Site Reliability Engineering principles through blog posts and talks.