This is an article from DZone’s 2022 Performance and Site Reliability Trend Report.
Site reliability engineering aims to keep servers and services running, ideally with zero downtime. However, outages and incidents are inevitable, especially in a complex system that is constantly updated. Most companies follow a broadly similar process to manage incidents, mitigate risks, and analyze root causes. That process is an opportunity to identify issues and prevent them from recurring, but not every company succeeds in making it constructive.
In this article, I will discuss the advantages of the blameless postmortem process and how it can foster a culture of change in a company: a culture of change for the better, not of blame.
An SRE’s Role in Postmortem
A postmortem is a process in which a site reliability engineer (SRE) records an incident in detail. The record includes the incident description, its impact on the system, and the actions taken to mitigate it. Because SREs are the engineers responsible for handling incidents, they prepare most of the postmortem report, which not only addresses the root cause but also suggests actions to prevent the same incident from occurring again. For SREs, the postmortem process is therefore an opportunity to improve the system.
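As a rough illustration, the information a postmortem records could be captured in a small data structure. The sketch below is only one possible shape: the class name, field names, and the completeness rule are assumptions made for this example, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PostmortemReport:
    """Hypothetical container for the information a postmortem records."""

    title: str
    description: str  # what happened
    impact: str  # effect of the incident on the system
    mitigation_actions: List[str] = field(default_factory=list)  # steps taken to mitigate
    root_cause: Optional[str] = None  # stays None while the cause is unknown
    preventive_actions: List[str] = field(default_factory=list)  # how to avoid a repeat

    def is_complete(self) -> bool:
        """Assumed rule: complete once a root cause and a preventive action exist."""
        return self.root_cause is not None and bool(self.preventive_actions)


# Illustrative values only; they do not describe a real incident.
report = PostmortemReport(
    title="Checkout service outage",
    description="Requests to the checkout service failed after a deployment.",
    impact="Customers could not complete orders.",
    mitigation_actions=["Rolled back the deployment"],
)
print(report.is_complete())  # False: no root cause recorded yet
```

Keeping the root cause optional reflects the point that a report can be started before the cause is known and completed later.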
The Standard Postmortem Meeting Structure
A postmortem meeting is usually arranged days after a team handles an incident. Let’s look at the typical format for this meeting:
- Keep to a small group. Only the people involved, across various roles and responsibilities, are invited to this meeting. The group stays small to ensure that the meeting is short and productive.
- Start with facts. One important thing about this meeting is that there is no time for guessing. Instead, facts are shared with the team to help people understand the issue and perhaps identify the root cause.
- Listen to stories. After highlighting the facts, there might be some extra discussion from team members who were either involved in the incident process or might have some knowledge about that particular issue.
- Find out the reasons. Most of the time, the root cause is found before this meeting. When it is still unknown, the team discusses a plan for further investigation, perhaps involving a third party. Because the incident might occur again while the root cause remains unknown, extra measures are taken to prepare for that possibility.
- Create action points. The actions depend on the outcome of the discussion. If the root cause is known, actions are taken to prevent the incident from recurring. Otherwise, further investigation is planned and assigned to a team to find the root cause.
Why You Should Have a Blameless Postmortem
Traditionally, the postmortem process focused on who made a mistake, and if there was a meeting, the manager would use it to warn individuals about the consequences of their mistakes. Such an attitude eliminates opportunities to learn from mistakes, and the facts get replaced by the question of who was behind the failure.
Sometimes a postmortem meeting turns into another retrospective in which team members argue with each other or discuss issues outside the scope of the incident, so people end up pointing at each other rather than discussing the root cause. This damages team morale, and such unproductive meetings lead to more failures in the future.
IT practitioners have learned that failures are inevitable, but that it is possible to learn from mistakes and improve both the way we work and the way we design systems. That’s why the focus has turned to design and processes instead of people. Today, most companies are trying to move away from a conservative approach and create an environment where people learn from failures rather than assign blame.
That’s why it is essential to have a blameless postmortem meeting: it ensures people feel comfortable sharing their opinions and keeps the focus on improving the process. Now the question is, what does a blameless postmortem look like? Here is my recipe for arranging a productive blameless postmortem process.
How To Conduct a Blameless Postmortem Process
Suppose an incident occurred in your company, and your team handled it. Let’s look at the steps you need to take for the postmortem process.
Figure 1: Blameless postmortem process
Prepare Before the Meeting
Collect as much information as possible about the incident. Identify the people involved and any third parties, and add their names to the report. You could also collect notes from engineers who worked on the issue or commented on it in different channels.
Schedule a Meeting With a Small Group
Arrange a meeting with the people involved, and perhaps include stakeholders such as the project manager, delivery manager, or whoever should be informed or consulted about this particular issue. Keep the group small to increase the meeting’s productivity.
Highlight What Went Right
Now that you are in the meeting, the best thing to do is to start with a brief introduction to ensure everyone knows the incident’s story. Although this meeting is about failures, you need to highlight positive parts if there are any. Positives could be good communication between team members, quick responses from engineers, etc.
Focus on the Incident Facts
To have a clear picture of what happened, you don’t want to guess or tell a story. Instead, focus on the precise information you have. That’s why it is recommended to draw attention to facts, such as the order of events and how the incident was mitigated at the end.
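One way to keep the discussion anchored to facts is to assemble the recorded events into a chronological timeline before the meeting. The following is a minimal sketch under that assumption; the `TimelineEvent` type, the helper functions, and the sample events are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class TimelineEvent:
    """Hypothetical record of one observed fact about the incident."""

    timestamp: datetime
    source: str  # where the fact came from, e.g. monitoring or a chat log
    summary: str


def build_timeline(events: List[TimelineEvent]) -> List[TimelineEvent]:
    """Return the events in chronological order."""
    return sorted(events, key=lambda event: event.timestamp)


def incident_duration(events: List[TimelineEvent]) -> timedelta:
    """Elapsed time between the first and the last recorded event."""
    ordered = build_timeline(events)
    return ordered[-1].timestamp - ordered[0].timestamp


# Made-up sample data, deliberately out of order.
events = [
    TimelineEvent(datetime(2022, 1, 10, 9, 42), "deploy log", "Rollback completed"),
    TimelineEvent(datetime(2022, 1, 10, 9, 5), "monitoring", "Error-rate alert fired"),
    TimelineEvent(datetime(2022, 1, 10, 9, 12), "chat", "On-call engineer acknowledged"),
]

for event in build_timeline(events):
    print(event.timestamp.strftime("%H:%M"), event.source, "-", event.summary)
```

Recording the source next to each event makes it easier to separate what was observed from what someone remembers or assumes.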
Hear Stories From Related People
There might be other versions of the incident’s story. Set aside time for people with comments or opinions about it to speak, and keep the discussion productive and focused on the incident.
Dig Deeper Into the Actual Root Cause
After discussing all ideas and considering the facts, you can discuss the possible root cause. In many cases, the root cause might have been found before this meeting, but you can still discuss it here.
If the root cause is known, you can plan with the team to implement a solution to prevent this incident from happening again. If it is not known, it would be best to spend more time on the investigation to find the root cause and take extra measures or workarounds to prepare for possible similar incidents.
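The branching described here, with one set of actions when the root cause is known and another when it is not, can be sketched as a small function. This is an illustrative assumption about how a team might record action points; `ActionItem`, `plan_actions`, and the owner name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    """Hypothetical action point produced by the postmortem meeting."""

    description: str
    owner: str


def plan_actions(root_cause: Optional[str], owner: str) -> List[ActionItem]:
    """Choose action points depending on whether the root cause is known."""
    if root_cause:
        # Known cause: plan a solution that prevents the incident from recurring.
        return [ActionItem(f"Implement a fix that prevents: {root_cause}", owner)]
    # Unknown cause: keep investigating and prepare for similar incidents.
    return [
        ActionItem("Continue the investigation to find the root cause", owner),
        ActionItem("Add extra measures or workarounds for similar incidents", owner),
    ]


for item in plan_actions(None, "platform-team"):
    print(f"{item.owner}: {item.description}")
```

Assigning an owner to every action point is a design choice in this sketch; it keeps follow-up work from being left without a responsible team.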
Document the Meeting
A good practice is to document the meeting and share the notes with the rest of the company so that everyone is aware of the incident and other teams can learn from the experience.
Best Practices From Google
In modern companies, a blameless postmortem is a culture that involves more activities than the traditional postmortem process. SREs at Google have done a great job implementing this culture by ensuring that the postmortem process is more than a single event. Let’s review some best practices from Google that can complement your current postmortem process:
- No postmortem is left unreviewed. Regular review sessions make it possible to look into outstanding postmortems, close the discussions, collect ideas, and define actions. As a result, all postmortems are taken seriously and processed.
- Introduce a postmortem culture. A collaborative approach with teams helps introduce postmortem culture to an organization more easily and quickly through various programs, including:
  - Postmortem of the month: This event motivates teams to conduct a better postmortem process. Every month, the best and most well-written postmortem is shared with the rest of the organization.
  - Postmortem reading clubs: Regular sessions are conducted to review past postmortems. Engineers can see what other teams faced in previous postmortems and learn from the lessons.
- Ask for feedback on postmortem effectiveness. From time to time, there is a survey for teams to share their experiences and the feedback they have about the postmortem process. This helps evaluate the postmortem culture and increase its effectiveness.
If you are interested in learning more about Google’s postmortem culture, check out Chapter 15 of Google’s book, Site Reliability Engineering.
Site reliability engineers play an essential role in ensuring that systems are reliable, and maintaining that reliability is a continuous job. While developers think about new features, SREs think about a better and smoother process for releasing them. Incidents are part of the software development lifecycle, but modern teams such as SRE teams define processes that turn those incidents into opportunities to improve their systems. SREs know the importance of blameless postmortem meetings, where failures are accepted as part of development and the focus stays on reliability.
The future of incident management will likely involve more automation and perhaps artificial intelligence, with systems that can fix most issues themselves. For now, SREs are using blameless postmortems to improve uptime, productivity, and the quality of team relationships.