Site reliability engineering (SRE) is the practice of applying software engineering expertise to DevOps and operations problems. SRE, which was popularized by the 2016 publication of Site Reliability Engineering: How Google Runs Production Systems, often means proactively writing code and developing internal applications to combat reliability and performance concerns.
In SRE, service levels describe, in measurable terms, the services provided to users within a given period of time. Service level objectives (SLOs) are the goals set for the availability expected out of a system. Service level indicators (SLIs) are the key measurements and metrics used to determine the availability of a system. Service level agreements (SLAs) are the legal contracts that explain what is agreed upon and what happens if systems don’t meet SLOs.
For example, an SLO for a web application might be that videos must start playing in less than two seconds, 99% of the time, during a one-week period. The SLI measures the proportion of videos on the site that start playing in less than two seconds. The SLA includes this SLO and any other SLOs agreed upon by the customer and the service provider, the scope of services that will be covered, and the SLIs, which are the metrics that will be used to measure performance.
But how do SLOs, SLIs, and SLAs relate to each other? How do these concepts help you manage the service levels that your users expect? Let’s look at each in more detail.
What Are SLOs?
SLOs are the goals you set for how much availability you expect out of your system, expressed as a percentage over a period of time.
Service level objectives give teams a shared meaning of “availability” and “uptime.” You use SLOs as the standard against which you measure your reliability and availability. In the earlier example, the SLO states that videos in the web application must start playing in less than two seconds, 99% of the time, over a one-week period.
What Are SLIs?
SLIs are the quantitative measurements of how users experience the availability of a system. They represent a proportion of successful outputs for a level of service, expressed as a percentage.
Service level indicators are defined in relation to SLOs, and they provide real-time signals about system reliability. An SLI can measure the proportion of requests that were faster than a threshold, or the proportion of records coming into a pipeline that result in the correct value coming out. In the earlier example, the SLI measures the proportion of videos on the website that start playing in less than two seconds. Comparing the SLI with the SLO target tells you how far you are from the objective.
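The video example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `startup_sli` and the sample startup times are hypothetical, and a real system would read these measurements from its monitoring data over the SLO window.

```python
from typing import Sequence


def startup_sli(start_times_s: Sequence[float], threshold_s: float = 2.0) -> float:
    """Return the percentage of video starts that were faster than threshold_s."""
    if not start_times_s:
        raise ValueError("no measurements in the window")
    # A "good" event is a video that started playing in under the threshold.
    good = sum(1 for t in start_times_s if t < threshold_s)
    return 100.0 * good / len(start_times_s)


# Hypothetical startup times, in seconds, collected during the window.
samples = [0.8, 1.2, 1.9, 2.4, 0.5, 1.1, 3.0, 1.7, 0.9, 1.3]

slo_target = 99.0  # the 99% objective from the example
sli = startup_sli(samples)
print(f"SLI: {sli:.1f}%, gap to SLO: {slo_target - sli:.1f} points")
```

Here 8 of the 10 sampled videos start in under two seconds, so the SLI is 80%, well short of the 99% objective.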
What Are SLAs?
SLAs define the level of service your customers expect when they use your service.
These service level agreements are contracts between service providers and their customers that document what services the provider will furnish and define the service standards the provider is obligated to meet. SLAs also describe the remedies or penalties that apply when the SLO commitments are broken.
For the earlier example, the SLA will include all the SLOs for the web application, as well as the scope of services that will be covered, and all the SLIs, which are the metrics that will be used to measure performance against the SLOs. The agreement also sets out the responsibilities of both the service provider and the customer.
Who Uses Service Levels, SLOs, SLIs, and SLAs?
While SRE teams and reliability engineers aren’t always responsible for managing service levels, it often falls within their purview. By tracking SLIs and tying them to SLOs, you can set goals around the performance of a system. Google’s SRE book names latency, traffic, errors, and saturation as the four golden signals of monitoring, and these are often used as a starting point when choosing SLIs. So, for example, you could look at an API call and track the proportion of its requests that succeed (the SLI) against the percentage of requests (the SLO, for example, 95%) that need to be successful for customers to have a good experience.
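As a rough sketch of the API example, the following Python compares a request-success SLI with a 95% SLO. The function names `request_sli` and `meets_slo` and the request counts are hypothetical, chosen only for illustration.

```python
def request_sli(successful: int, failed: int) -> float:
    """Return the percentage of API requests that succeeded."""
    total = successful + failed
    if total == 0:
        raise ValueError("no requests in the window")
    return 100.0 * successful / total


def meets_slo(sli_percent: float, slo_percent: float = 95.0) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


# Hypothetical counts for one measurement window.
sli = request_sli(successful=970, failed=30)
print(f"SLI: {sli:.1f}%, SLO met: {meets_slo(sli)}")
```

With 970 successful requests out of 1,000, the SLI is 97%, which satisfies the 95% objective.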
SRE teams often set strict SLOs on critical components within their applications and services to better understand how strict an SLA they can agree to with customers. From here, the team can apply error budgets, the amount of unreliability an SLO still permits, as a way to understand how quickly they must resolve issues in order to stay compliant with their SLOs. Service levels allow teams to aggregate metrics and create a transparent view of uptime, performance, and reliability across the entire organization. At a glance, business leaders can use service levels to monitor compliance across multiple teams, applications, and services, and to gain a comprehensive understanding of their system’s health.
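An error budget is conventionally the complement of the SLO: a 99% objective leaves 1% of events that may fail within the window. The sketch below shows that arithmetic in Python; the function names and the event counts are hypothetical, and it assumes an SLO below 100% so that the budget is not zero.

```python
def error_budget(slo_percent: float, total_events: int) -> float:
    """Return how many bad events the SLO allows in the window."""
    return (100.0 - slo_percent) * total_events / 100.0


def budget_remaining(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_percent, total_events)
    if budget == 0:
        raise ValueError("a 100% SLO or an empty window leaves no error budget")
    return 1.0 - bad_events / budget


# Hypothetical week: 1,000,000 video starts, 2,500 of them slower than two seconds.
allowed = error_budget(99.0, 1_000_000)
remaining = budget_remaining(99.0, 1_000_000, 2_500)
print(f"allowed bad events: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

In this hypothetical week the SLO allows 10,000 slow starts, and 2,500 have occurred, so three quarters of the budget remains. A team that is spending its budget quickly has less room for further incidents before the SLO is breached.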