Although SRE toolsets vary from one team to another, there is one type of tool, Infrastructure-as-Code (IaC), that virtually every SRE needs to manage reliability at scale. If you’re not using IaC, you’re probably spending more manual effort on reliability than you need to.
Keep reading for a breakdown of how IaC works, why it’s so important to SRE, and how SREs can add IaC to their reliability engineering strategy.
What is Infrastructure-as-Code?
Infrastructure-as-Code is the use of computer code to set up and manage infrastructure. In other words, under an IaC approach, engineers write machine-readable code that defines how a server, virtual machine, container, or other type of infrastructure should be configured. Then, they apply the configuration using an IaC automation tool that reads the files and applies the specified configuration to each machine.
IaC can update existing infrastructure in the same way: engineers change the IaC files, then reapply them to the infrastructure that needs to be modified.
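The define-then-apply workflow described above can be sketched in a few lines of Python. This is only an illustration of the declarative idea, not how any particular IaC tool is implemented: the `ServerSpec` fields, the `plan` and `apply` functions, and the in-memory dictionary standing in for real infrastructure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Desired configuration for one machine (hypothetical fields)."""
    name: str
    open_ports: frozenset
    image: str


def plan(desired, actual):
    """Return the actions needed to make `actual` match `desired`.

    Both arguments map a server name to its ServerSpec.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired, actual):
    """Carry out the plan against an in-memory stand-in for real machines."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            # "create" and "update" both end with the desired spec in place.
            actual[name] = desired[name]
    return actual
```

Because the files describe the end state rather than the steps to reach it, running `apply` a second time with the same inputs produces an empty plan and changes nothing.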
Why Is Infrastructure-as-Code Important for an SRE?
There is no shortage of articles out there on the benefits of IaC in general. Usually, they boil down to the idea that IaC saves teams time and effort by making it possible to automate the configuration of large-scale infrastructure.
These benefits apply to SREs in addition to almost any type of IT or development team. However, for SREs in particular, IaC offers some critical advantages when it comes to engineering reliability:
Applying Reliability Configurations Efficiently
Using IaC, SREs can define infrastructure configurations that maximize reliability, then apply them in an efficient way. This is much simpler than having to consult with IT operations teams about how to configure infrastructure to achieve reliability goals, and then having to count on the IT team to implement those configurations manually. In this respect, IaC helps SREs collaborate more effectively with other types of teams because it eliminates the risk that reliability guidance will be lost in translation or forgotten when it comes time to apply it.
Tracking Reliability Issues Over Time
In addition to automating infrastructure provisioning, IaC files serve as a record of exactly how infrastructure has been configured. If you version-control your IaC files, you can also use the version history to identify how configurations have changed over time. This becomes very valuable when an outage or other reliability issue occurs and SREs want to know whether a change in infrastructure configuration correlates with the incident. This data can be crucial both for remediating the problem and for performing incident postmortems.
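As a rough sketch of how a version history can be correlated with an incident, the Python below compares successive configuration revisions and reports the most recent change made at or before a given time. The function names, the list-of-tuples history format, and the use of plain dictionaries for configurations are assumptions made for illustration. In practice the history would come from your version-control system.

```python
from datetime import datetime


def diff_config(old, new):
    """Return {setting: (old_value, new_value)} for every setting that changed.

    A value of None means the setting was absent in that revision.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


def last_change_before(history, incident_time):
    """Find the latest revision at or before `incident_time` that changed something.

    `history` is a list of (timestamp, config) pairs sorted oldest first.
    Returns (timestamp, changes), or None if no earlier change exists.
    """
    for i in range(len(history) - 1, 0, -1):
        timestamp, config = history[i]
        if timestamp <= incident_time:
            changes = diff_config(history[i - 1][1], config)
            if changes:
                return timestamp, changes
    return None
```

A result such as `{"max_connections": (500, 50)}` from shortly before an outage would give the postmortem a concrete change to investigate, though correlation alone does not prove the change caused the incident.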
Lower Risk of Human Error
One of the greatest enemies of site reliability is human error. If an engineer who is setting up infrastructure manually accidentally opens the wrong port or deploys the wrong container image, major reliability problems could result. IaC significantly reduces risks like these by allowing teams to apply configurations automatically, so engineers do not have to retype settings by hand on every machine, which is where typos and other mistakes tend to creep in. As long as your IaC files themselves are correct, your infrastructure will be, too.
Validate Reliability Configurations Before Deployment
On that note, you may be thinking: “OK, but what if the IaC files themselves contain typos or other problems?” That can certainly happen. However, another benefit of IaC for SREs is that it makes it possible to scan IaC configurations automatically before deploying them. That way, SREs can validate their configurations before they go live. You can’t do that when you configure systems manually.
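To make the idea of scanning concrete, here is a minimal sketch of a pre-deployment check written in Python. The policy it enforces (an allowed-port list and a ban on unpinned container images), the `ALLOWED_PORTS` constant, the `validate` function, and the dictionary layout of the configuration are all hypothetical. Real scanners work on the tool's own file format and ship with much larger rule sets.

```python
# Hypothetical policy: only these ports may be opened on any server.
ALLOWED_PORTS = {80, 443}


def validate(config):
    """Return a list of human-readable problems; an empty list means the config passes.

    `config` maps a server name to a dict with "image" and "open_ports" keys.
    """
    problems = []
    for name, server in sorted(config.items()):
        image = server.get("image", "")
        if not image:
            problems.append(f"{name}: no container image specified")
        elif ":" not in image or image.endswith(":latest"):
            # Simplified check: treats a missing tag or ":latest" as unpinned.
            problems.append(f"{name}: image '{image}' is not pinned to a version")
        for port in sorted(server.get("open_ports", [])):
            if port not in ALLOWED_PORTS:
                problems.append(f"{name}: port {port} is not in the allowed set")
    return problems
```

A check like this would typically run in a CI pipeline, which blocks the deployment whenever the returned list is not empty.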
How SREs Can Adopt IaC
The wide availability of IaC tools makes it easy for SREs to take advantage of IaC. Popular IaC tools today include Terraform, Ansible, and CloudFormation, to name just a few. All of these are production-ready and don’t have a particularly steep learning curve. If you can code, as most SREs can, you can probably learn to use IaC fairly quickly.
The best IaC platform for a given SRE team will depend largely on which types of environments the team manages. Some IaC tools only support certain public clouds (CloudFormation, for example, is specific to AWS), while others can work across many providers. SREs should also consider which configuration languages the tools support and whether they enjoy working with those languages. The way you scan IaC files may also depend on which IaC tool you use, so SREs should research which scanning and validation processes each IaC platform supports before choosing one.
IaC is useful for engineering teams of all types, but for SREs in particular, it offers special advantages for enforcing configurations that maximize reliability across all IT assets. It also helps minimize the risk of human error and makes it possible to validate configurations before taking them live.