Understanding Postmortem Culture: Learning from Failure
Recently, while studying SRE, I learned for the first time about a practice called the “post-mortem.” I am summarizing it here for the benefit of my organization.
Gmail once had an incident in which mail data was deleted because of a bug in the software of the servers that store it. At that time Gmail had an estimated 200 million users, and about 40,000 of them (0.02% of the total) lost some of their emails. However, Google had backed up users’ email data to magnetic tape, so the deleted data was restored within four days of the failure. The details of this incident were later published as a post-mortem.
What is a post-mortem?
A post-mortem is a document that summarizes the effects of an incident on a system, actions taken to mitigate or resolve it, the cause of the incident, and measures to prevent recurrence.
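As a rough illustration, the four parts of that definition can be captured in a small record. The sketch below is in Python; the class name `PostMortem`, its field names, and the sample values are assumptions made for this example, not a standard template.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical record mirroring the four parts of the definition above."""

    title: str
    impact: str                                              # effects of the incident on the system
    actions_taken: list[str] = field(default_factory=list)   # mitigation / resolution steps
    root_cause: str = ""                                     # cause of the incident
    prevention: list[str] = field(default_factory=list)      # measures to prevent recurrence

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        sections = {
            "impact": self.impact,
            "actions_taken": self.actions_taken,
            "root_cause": self.root_cause,
            "prevention": self.prevention,
        }
        return [name for name, value in sections.items() if not value]


# Made-up draft: the cause and the prevention measures are not written yet.
draft = PostMortem(
    title="Example incident (made-up data)",
    impact="Some users could not read part of their mail",
    actions_taken=["Restored the data from backup"],
)
print(draft.missing_sections())  # ['root_cause', 'prevention']
```

A check like `missing_sections` is one possible way to notice that a draft stops at "what happened" and has not yet reached the cause or the prevention measures.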
Difference from a trouble report
A post-mortem is similar in content to a trouble report, but its purpose and its readers are different.
A trouble report explains a failure to the users who were disadvantaged by it. A post-mortem, on the other hand, is written in order to learn from failures and improve the service, so its readers are the engineers involved. In addition to an incident summary, it covers:
・ What worked well
・ What didn’t work
Important points in a post-mortem
Do not criticize
This is the most important point. When people are criticized, the person who made the mistake fears punishment and stays silent, and the true cause behind the mistake never comes to light. You then cannot learn from the failure, which is the whole purpose of a post-mortem. It is important to focus on cause analysis and improvement rather than on criticism.
Thoroughly analyze and understand the root cause of the failure
Inadequate analysis of a failure can lead to the wrong recurrence prevention measures, and the same failure then happens again. When you analyze, it is important to keep asking simple, objective questions.
Solve by mechanism, not by people
If the recurrence prevention measure is “be careful” or “do your best,” the solution depends on people, and people cannot easily be “fixed.” Instead, it is important to modify systems and processes so that they help people make the right choices.
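For example, instead of a rule such as “be careful when deleting data,” the tool itself can make the safe path the default. The sketch below is a hypothetical Python helper (`delete_records` and its parameters are invented for this illustration): it only reports what would be deleted unless the caller explicitly confirms the exact number of records.

```python
from typing import Optional


def delete_records(
    records: list[str],
    targets: set[str],
    confirm_count: Optional[int] = None,
) -> list[str]:
    """Return the records left after deleting `targets` (hypothetical helper).

    Without `confirm_count` this is a dry run: it prints what would be
    deleted and returns the records unchanged. The deletion happens only
    when `confirm_count` equals the number of records that would be removed.
    """
    to_delete = [r for r in records if r in targets]

    if confirm_count is None:
        print(f"dry run: would delete {len(to_delete)} record(s): {to_delete}")
        return list(records)

    if confirm_count != len(to_delete):
        raise ValueError(
            f"expected to delete {confirm_count} record(s), "
            f"but {len(to_delete)} match; nothing was deleted"
        )

    return [r for r in records if r not in targets]


mailboxes = ["alice", "bob", "carol"]
after_dry_run = delete_records(mailboxes, {"bob"})                  # dry run, nothing removed
after_delete = delete_records(mailboxes, {"bob"}, confirm_count=1)  # explicit confirmation
```

The point of the design is that a mistaken target set fails loudly instead of relying on the operator to notice it.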
Share knowledge with many people
Cause analysis and recurrence prevention measures are a body of knowledge learned from a failure. That knowledge is know-how that other teams can use as well. Since the purpose of a post-mortem is to improve services, it is important to share it as widely as possible.
How to practice
Determine the criteria for writing post-mortems
Writing a post-mortem takes a fair amount of time and effort, so it is impractical to write one for every failure. You therefore need to decide on criteria for when to write one, for example:
・ Downtime exceeds a certain threshold
・ Data loss occurred
・ An on-call engineer had to intervene
・ Resolution took longer than a certain amount of time
・ The monitoring itself failed
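The criteria above can be written down as an explicit check so that the decision does not depend on individual judgment. The following is a minimal Python sketch; the `Incident` fields, the function name `postmortem_reasons`, and both threshold values are placeholders assumed for this example, and each team would choose its own.

```python
from dataclasses import dataclass

# Placeholder thresholds for this sketch; not recommended values.
MAX_DOWNTIME_MINUTES = 30
MAX_RESOLUTION_MINUTES = 120


@dataclass
class Incident:
    downtime_minutes: int
    resolution_minutes: int
    data_loss: bool = False
    on_call_intervened: bool = False
    monitoring_failed: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every criterion the incident meets.

    An empty list means no post-mortem is required under these criteria.
    """
    reasons = []
    if incident.downtime_minutes > MAX_DOWNTIME_MINUTES:
        reasons.append("downtime exceeded threshold")
    if incident.data_loss:
        reasons.append("data loss occurred")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervened")
    if incident.resolution_minutes > MAX_RESOLUTION_MINUTES:
        reasons.append("resolution took too long")
    if incident.monitoring_failed:
        reasons.append("monitoring itself failed")
    return reasons


# Made-up incident: short downtime, but data was lost and the fix was slow.
incident = Incident(downtime_minutes=5, resolution_minutes=200, data_loss=True)
print(postmortem_reasons(incident))  # ['data loss occurred', 'resolution took too long']
```

Returning the list of matched criteria, rather than a bare yes or no, also records why a post-mortem was required.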
Ongoing activities to make it take root in the culture
To make writing post-mortems take root in the culture, it helps to hold a post-mortem reading club that looks back on past post-mortems, or a “Wheel of Misfortune” exercise in which past post-mortems are role-played.