Although site reliability engineering has been around for a while, it has only recently gained attention in general software circles. Even so, there are still a lot of questions about what a site reliability engineer (SRE) actually does. Much of what we know comes from Google’s book Site Reliability Engineering, which we’ll refer to a few times in this post.
SREs have been compared to operations groups, system admins, and more. But these comparisons fall short of capturing their role in modern software environments. SREs cover more responsibilities than operations. And though they usually have a background in system administration, they also bring software development skills to the role. SREs combine all these skills to ensure that complex distributed systems run smoothly.
So how do they do all this? Read on to see the responsibilities SREs take on to make it happen.
Automate All the Things
One difference between the SRE role and the traditional operations team involves automation. In the past, operations folks would keep things running by executing scripts, pushing buttons, and carrying out other manual endeavors. However, in the SRE world, there’s a heavy emphasis on automation. Where did this drive come from? The engineering aspect of the SRE role.
When you put software developers in a position where the same functions repeat day in and day out, they’ll be driven to automate. That’s what software developers do best. And automation doesn’t stop at automating a software build and some acceptance tests. Their automation includes CI/CD and infrastructure creation and patching, as well as monitoring, alerting, and automating responses to certain incidents. In Google’s SRE book, this is also referred to as eliminating toil:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
—Site Reliability Engineering
But why do we focus so much on reducing toil? Not only does reducing toil make the processes repeatable and automated, but it also increases the amount of time SREs have to build tools and investigate infrastructure changes that further improve site reliability. In summary, the less toil there is, the more time and resources are dedicated to making sure your software ecosystem runs reliably and the faster you can deliver business value.
Monitor Distributed Systems
With the popularity of distributed systems, there’s a greater need for increased monitoring. It’s not enough that your application is up and running. We also need to ensure that infrastructure works properly and that all our other internal dependencies are accessible and functioning. Additionally, business functions of the application should have proper monitoring to validate they’re working properly.
For this portion of the job, SREs can use a product like Scalyr to monitor and alert on any potential issues. This allows them to monitor the system in real time as well as track long-term trends that may indicate reduced reliability.
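To make the three layers above concrete (infrastructure, internal dependencies, and business functions), here is a minimal sketch of a health-check runner. The dependency names and probe functions are hypothetical stand-ins for real network calls, and this is not the API of Scalyr or any other monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every probe; a probe that raises is recorded as unhealthy."""
    results = []
    for name, probe in checks.items():
        try:
            healthy = bool(probe())
            detail = "ok" if healthy else "probe returned False"
        except Exception as exc:  # one failing probe must not stop the others
            healthy = False
            detail = f"probe raised {type(exc).__name__}: {exc}"
        results.append(CheckResult(name, healthy, detail))
    return results


def unhealthy_names(results: List[CheckResult]) -> List[str]:
    """Names of the checks that should trigger an alert."""
    return [r.name for r in results if not r.healthy]


# Hypothetical probes; real ones would query a database, call an API, etc.
def database_probe() -> bool:
    return True


def payments_api_probe() -> bool:
    raise TimeoutError("no response within 2s")


def recent_orders_probe() -> bool:
    # Business-level check: pretend no order was placed in the last hour.
    return False


checks = {
    "database": database_probe,
    "payments-api": payments_api_probe,
    "recent-orders": recent_orders_probe,
}
results = run_checks(checks)
for result in results:
    status = "OK" if result.healthy else "ALERT"
    print(f"{result.name}: {status} ({result.detail})")
```

Note that the business-level probe can fail even when every server responds, which is why availability checks alone are not enough.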
Provide On-Call Support
Similar to traditional operations roles, SREs spend time rotating on-call responsibilities. In addition to monitoring infrastructure and their own services, they also make themselves accessible to development teams for consultation and troubleshooting.
What does on call look like? Typically, SREs rotate the on-call role on a schedule that frees the other SREs to focus on engineering work and keeps the on-call engineer from burning out. On-call shifts vary from a couple of days in a row to a week or more.
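A rotation like this can be sketched as a simple round-robin schedule. The engineer names, start date, and shift length below are made-up values for illustration, and real scheduling tools also handle swaps, holidays, and time zones.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

Shift = Tuple[date, date, str]  # (first day, last day, engineer)


def build_rotation(engineers: List[str], start: date,
                   shift_days: int, shifts: int) -> List[Shift]:
    """Assign consecutive shifts to engineers in round-robin order."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    if shift_days < 1:
        raise ValueError("shift_days must be at least 1")
    schedule = []
    for i in range(shifts):
        first = start + timedelta(days=i * shift_days)
        last = first + timedelta(days=shift_days - 1)
        schedule.append((first, last, engineers[i % len(engineers)]))
    return schedule


def on_call_for(schedule: List[Shift], day: date) -> Optional[str]:
    """Return who is on call on the given day, or None if it is unscheduled."""
    for first, last, engineer in schedule:
        if first <= day <= last:
            return engineer
    return None


# Hypothetical team with week-long shifts starting on a Monday.
schedule = build_rotation(["ana", "ben", "chi"], date(2024, 1, 1),
                          shift_days=7, shifts=6)
for first, last, engineer in schedule:
    print(f"{first} to {last}: {engineer}")
```

With three engineers and week-long shifts, each person is on call one week in three and has the other two free for engineering work.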
When a high-priority page triggers, the engineer will investigate and diagnose the issue. The SRE might also pull in additional engineers or software developers to resolve the issue. Depending on the system’s SLA, they may all need to work together to solve the issue in a matter of minutes. For low-priority issues, the SRE typically handles them during business hours. And that’s great news for the engineers who don’t like to jump out of bed at 3 a.m. for every little thing that happens.
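The split between high-priority pages and low-priority issues can be expressed as a small routing rule. This is only a sketch: the priority labels, the 9-to-5 weekday business hours, and the returned action names are assumptions, not the behavior of any particular paging tool.

```python
from datetime import datetime, time

# Assumed business hours: weekdays, 09:00 up to (but not including) 17:00.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def route_alert(priority: str, now: datetime) -> str:
    """Decide what to do with an alert based on its priority and the time."""
    if priority == "high":
        return "page"  # wake the on-call engineer at any hour
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = is_weekday and BUSINESS_START <= now.time() < BUSINESS_END
    return "ticket-now" if in_hours else "ticket-next-business-day"


# The same 3 a.m. alert is handled differently depending on its priority.
three_am = datetime(2024, 1, 3, 3, 0)
print(route_alert("high", three_am))
print(route_alert("low", three_am))
```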
Manage Incidents
An important part of the SRE role involves managing incidents. Now you may say this is no different from the on-call responsibility. You find a problem and then fix it. How hard could it be?
Well, for managing incidents, SREs need to employ additional professional skills to make sure everything goes smoothly. When an outage occurs, for example, there could be dozens of ways to diagnose and attempt to resolve the issue. Therefore, to manage the incident properly, someone must monitor and facilitate the actions of all involved. And that requires clearly defined roles.
Though not all companies include these Google-recommended incident roles, we should at least consider them. These roles include the following:
- An incident commander who maintains a high-level view of everything occurring
- The engineers who execute processes or modifications to the infrastructure or systems
- A communication role for relaying the right message to customers and management
- A planning role in charge of planning any meetings, handoffs, and logistical needs
Without clearly defined roles, SREs could step on each other’s toes as they try different solutions without up-front coordination and communication.
Facilitate Postmortems
Now that we’ve lived through an incident and resolved it in the sections above, we’re ready for the postmortem. Typically, an SRE facilitates or participates in these postmortems.
A postmortem brings together all relevant parties for analysis of the incident. The goal is to analyze what occurred during the incident and find the root cause. The participants also determine how the incident can be prevented or fixed in the future. Some of the items that come out of a postmortem are listed below:
- Stories to improve reliability or monitoring
- Additional documentation to assist with future incidents
- Further investigation or testing to confirm or rule out any hypotheses about the incident
Track Outages
Another SRE responsibility is tracking outages. Over time, this helps identify long-term trends and set reasonable SLOs and SLAs.
One use of tracking is monitoring low-priority incidents. These incidents may not cause real issues for consumers, but looking at their long-term trends and timing can help isolate and resolve pesky bugs that have no obvious root cause.
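As a rough illustration of spotting timing patterns, the sketch below groups low-priority incidents by hour of day and by service. The incident records are invented sample data; a real analysis would read them from an incident tracker.

```python
from collections import Counter
from datetime import datetime
from typing import List, Tuple

Incident = Tuple[datetime, str, str]  # (timestamp, service, priority)

# Invented sample data for illustration only.
incidents: List[Incident] = [
    (datetime(2024, 1, 1, 2, 5), "checkout", "low"),
    (datetime(2024, 1, 3, 2, 40), "checkout", "low"),
    (datetime(2024, 1, 4, 14, 10), "search", "low"),
    (datetime(2024, 1, 8, 2, 15), "checkout", "low"),
    (datetime(2024, 1, 9, 11, 0), "checkout", "high"),
    (datetime(2024, 1, 12, 2, 30), "search", "high"),
]


def incidents_by_hour(records: List[Incident], priority: str = "low") -> Counter:
    """Count incidents of one priority by the hour of day they started."""
    return Counter(ts.hour for ts, _service, prio in records if prio == priority)


def incidents_by_service(records: List[Incident], priority: str = "low") -> Counter:
    """Count incidents of one priority by the affected service."""
    return Counter(service for _ts, service, prio in records if prio == priority)


by_hour = incidents_by_hour(incidents)
by_service = incidents_by_service(incidents)
print("Busiest hour:", by_hour.most_common(1))
print("Noisiest service:", by_service.most_common(1))
```

In this sample, the low-priority incidents cluster around 2 a.m. on one service, which would point an investigation toward something scheduled at that hour, such as a nightly job.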
Work With SRE and Development Teams
In addition to supporting development teams during on call, SREs also provide consulting and troubleshooting. This assists both other SRE teams and software development teams that struggle with operational or reliability issues.
In this scenario, the SRE will assess current issues and determine which can be improved with automation or engineering effort. The SRE may also suggest solutions to reliability problems. And perhaps most importantly, the SRE will drive changes to team processes. These changes will ensure that site reliability engineering enhances the team’s ability to deliver value.
Create Service Level Indicators and Objectives
When you hear that a service has attained or is striving toward an uptime of 99.99%, you’re hearing about a service level objective (SLO). A service level indicator (SLI) is the measurement behind that objective, such as the fraction of time the service was available. In other words, the SLI defines how progress against the SLO will be measured. SREs assist with both by providing data on historical service performance. They also help set realistic objectives for the future and might advise on proper SLAs for customers.
The SRE then works to make sure your application meets, though does not exceed, the stated SLO. Now you may think that it’s odd to not work to exceed an SLO. However, it would be a waste of resources to make something more reliable than it needs to be. And SREs balance the needs of the customer with the goals of the services provided.
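The arithmetic behind an objective like 99.99% is worth seeing once. The sketch below assumes a request-based availability SLI and a 30-day measurement window; both are illustrative choices, and the function names are hypothetical. The allowed downtime it computes is what is often called an error budget.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (one common way to define an SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the measurement window."""
    return (1.0 - slo) * window_days * 24 * 60


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo


slo = 0.9999  # the 99.99% objective from the text
print(f"Allowed downtime in 30 days: {allowed_downtime_minutes(slo):.2f} minutes")

# Hypothetical traffic numbers for one measurement window.
sli = availability_sli(good_requests=999_950, total_requests=1_000_000)
print(f"Measured SLI: {sli:.5f}, meets SLO: {meets_slo(sli, slo)}")
```

At 99.99%, the service can be down for only a little over four minutes in 30 days. Each additional nine shrinks that allowance tenfold, which is why pushing reliability past what customers need gets expensive quickly.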
Responsibilities May Vary
In this post, we’ve discussed various activities that site reliability engineers participate in. Although these activities are done by many SREs, they aren’t set in stone. Companies do vary their SRE roles and responsibilities based on need. In general, companies that are at different points in the SRE journey may have different needs.
For example, a newer company may need SRE support in getting general outages under control, so most energy goes toward that base level of reliability. However, other companies that are further along in the journey may have eliminated company-wide outages. They may spend more time on improving or validating service metrics that are business related. For example, your pizza shop application may need new monitoring on its pizza recommendations once the site’s general availability is stable and reliable.
As you have read, SREs spend time on both technical and process-oriented responsibilities. They do more than an operations or system administration team. They employ their engineering skills to automate and reduce the manual intervention necessary for administration tasks. Additionally, they work with other engineering teams to provide proper monitoring, incident response, and management.
Over time, these functions improve the reliability and reduce the maintenance costs of your distributed systems. And finally, SREs spread the culture of site reliability engineering through your organization so that all teams learn to make decisions with reliability in mind.
This post was written by Sylvia Fronczak. Sylvia is a software developer who has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.