What Is SRE? Site Reliability Engineering in a Nutshell
If World War III broke out and you wanted to check whether the Internet was still operational, the first thing you would do is try Google.com.
Why? Because Google is so agonisingly reliable.
They realised one thing: the prerequisite to success is reliability.
Faced with the challenge of increasingly complex and distributed systems operating at scale, they came up with a data-driven approach to IT centered around improving the reliability, efficiency and scalability of Google’s own technical environment.
And Site Reliability Engineering (SRE) was the result.
SRE takes a data-centric approach that focuses on creating systems that learn from the errors and outages that inevitably arise.
It helps turn fragile systems into antifragile ones.
What does that mean?
‘Antifragile’ was coined by Nassim Nicholas Taleb in his book of the same name. He defines it as follows: “Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”
That’s exactly what SRE does.
It helps you to create systems that get better the more they are shocked.
(And that, we can guarantee, is a massive advantage over your competitors! Just ask Google.)
What exactly is SRE?
SRE is the brainchild of Ben Treynor, the Senior VP overseeing technical operations at Google.
He famously described it as “what happens when you ask a software engineer to design an operations team.”
Accordingly, the key issue that SRE exists to solve is that of organisational silos. It works to break down the traditional barriers between Dev and Ops by pulling together their roles into a new hybrid role.
The result is a Site Reliability Engineer whose job is balanced between developing new features on the one hand and ensuring that production systems run smoothly and reliably on the other.
SREs enable development teams to deploy faster, while using any failures that occur as pointers towards relentlessly improving the overall health of their system.
How do they achieve this?
What makes SRE so effective?
Three key principles make SRE so effective.
1. Relentlessly Data-driven and Learning-focused
“The antifragile loves randomness and uncertainty, which also means—crucially—a love of errors, a certain class of errors.”
Nassim Nicholas Taleb, Antifragile.
SRE is driven by a love of errors. It’s about relentlessly pursuing knowledge of the health of your wildly complex IT system.
SRE harnesses the power of data to determine whether the choices you are making are hurting or helping your business. By relentlessly collecting large amounts of data, SRE teams are able to assess and report on the reliability and availability of your system at every stage of the development cycle.
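To make the idea of reporting on availability at each stage concrete, here is a minimal sketch. It assumes you already collect one record per request; the `Request` type, its `stage` and `ok` fields, and the stage names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    stage: str  # hypothetical stage label, e.g. "staging" or "production"
    ok: bool    # True if the request was served successfully


def availability_by_stage(requests):
    """Return the fraction of successful requests for each stage."""
    totals = {}
    successes = {}
    for r in requests:
        totals[r.stage] = totals.get(r.stage, 0) + 1
        successes[r.stage] = successes.get(r.stage, 0) + (1 if r.ok else 0)
    return {stage: successes[stage] / totals[stage] for stage in totals}


# Illustrative data only, not real measurements.
sample = [
    Request("staging", True), Request("staging", False),
    Request("production", True), Request("production", True),
    Request("production", True), Request("production", False),
]
report = availability_by_stage(sample)
```

In practice the records would come from your monitoring or logging pipeline rather than a hand-written list, but the calculation stays the same: successes divided by total requests, grouped by whatever dimension you want to report on.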
2. Linked to Business Objectives
It’s all very well gathering knowledge about the health of your system, but how does this translate into something useful that the business can understand?
The problem is that business leaders and tech teams measure their success differently. The business cares about Service Level Agreements (SLAs), their contractual agreement with the customer, while the technologists care about Service Level Indicators (SLIs), the raw metrics.
In SRE, everything is tied to strategic objectives. It incorporates a cunning chain of data indicators which forge a valuable link between tech and the business. Bridging the gap between SLAs and SLIs is the Service Level Objective (SLO). This is an important internal tracking target that’s tougher than the SLA and tracked using the SLIs. It represents the goal that the business as a whole wants to reach.
SLOs are continuously monitored and reported on by SRE teams in order to measure a system’s reliability. With this information, they can ensure that even wildly complex distributed IT systems are healthy.
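The relationship between the three can be sketched in a few lines. This is an illustrative example only: the function name `service_level_status` and the threshold values in the usage lines are assumptions, not figures from any real agreement.

```python
def service_level_status(sli, slo, sla):
    """Classify a measured SLI against an internal SLO and a contractual SLA.

    All three values are availability fractions, e.g. 0.999 for 99.9%.
    """
    if slo < sla:
        # The SLO is meant to be the tougher, internal target.
        raise ValueError("the SLO should be at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical targets: a 99.9% internal SLO and a 99.5% contractual SLA.
healthy = service_level_status(0.9995, slo=0.999, sla=0.995)
warning = service_level_status(0.997, slo=0.999, sla=0.995)
breach = service_level_status(0.99, slo=0.999, sla=0.995)
```

The middle case is why the SLO is set tougher than the SLA: missing the SLO gives the team an early warning while the customer-facing commitment is still intact.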
3. Balances Releasing New Features with Improving System Health
At any given time, you have a choice between working on deploying new code, or working to reduce toil in the system.
Toil is a term originating from Google which refers to the manual, repetitive work involved in running a production service. A key focus of SRE is on reducing toil through automation.
In every project, you’ll have an agreed error budget: the amount of downtime you are prepared to accept according to your SLOs. In SRE, the idea is to optimise workflows so that code is deployed right up to the point where the error budget runs out. At that point, you pause to reduce toil so that the system can continue to scale, before starting to deploy all over again.
These cycles of deployment and toil reduction are the most effective way to deploy work in a complex system, while scaling and improving it.
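As a rough sketch of how that cycle could be driven by numbers, the example below derives an error budget from an availability SLO and uses it to decide whether to keep deploying. The function names, the 30-day window and the sample figures are all assumptions made for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by an availability SLO over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def can_deploy(slo, downtime_minutes, window_days=30):
    """Return True while some error budget remains for the window."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


# Hypothetical 99.9% SLO over an assumed 30-day window.
budget = error_budget_minutes(0.999)  # roughly 43.2 minutes
keep_shipping = can_deploy(0.999, downtime_minutes=30)
pause_and_fix = not can_deploy(0.999, downtime_minutes=50)
```

With a 99.9% SLO, the budget for 30 days comes to roughly 43 minutes of downtime. While recorded downtime stays below that, the team keeps deploying; once it is used up, the team switches to toil reduction.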
SRE was designed to overcome key challenges in large-scale, complex computing systems that typically undermine reliability.
It worked wonders at Google, but you don’t have to be a seasoned cloud-native tech company to implement it. In fact, many of the same SRE principles, like relentless learning, SLOs and automation, can also be applied to the enterprise. Hooray!
SRE is a core differentiator in the enterprise space. In a market of fragile competition, it makes you antifragile, and that puts you ahead of the game.
Contino has developed Enterprise SRE (eSRE) to anticipate the key challenges in enterprise IT. It takes the guiding principles of Google-style SRE and adapts them to the enterprise context.
For a closer look at the principles and practices underlying eSRE as well as the potential technology and business benefits, download our white paper Creating Antifragile Systems: Site Reliability Engineering for the Enterprise.