Postmortems: not for root causes and action items
It’s becoming commonplace that after a production incident, a postmortem meeting is organized. But what is the goal of this meeting? Once it’s over, what should you have achieved?
According to many people, the goal is twofold:
- Find the root cause of the incident.
- Agree on action items that prevent this kind of incident in the future.
Both of these activities are useful, but they shouldn’t be the main goals of your postmortem. Learning and knowledge sharing are far more valuable things to pursue. Before I get to those, though, let me say something about root cause analyses and action items.
Why a postmortem is a bad place for root-cause analysis
There are two kinds of root causes you can look for: the technical root cause and the systemic one. The difference is as follows:
- Technical root cause: the problem or bug that caused the outage; fixing it is what resolves the incident.
- Systemic root cause: the reason that problem or bug made its way into production in the first place.
Ideally, the technical root cause has already been found before the postmortem meeting takes place. If it hasn’t, you can’t be sure the incident is fully resolved: the symptoms may have gone away, but if you don’t know what caused them, they can come back at any time. Organizing a meeting just to comb through logs and monitoring metrics isn’t a great use of anyone’s time. You’re better off having a couple of people investigate the issue until they’re reasonably sure they have a good picture of what happened.
A systemic cause is something you can discuss in a meeting, though. But why look for only a single cause? Most likely, there’s a whole chain of events that led to the problem unfolding in your production environment, and if any step along the chain had gone differently, the problem might never have occurred. An example:
- A developer writes a DB migration that triggers a full table rewrite (a concrete sketch follows this list).
- The problem is not caught in code review.
- The problem is not caught in the test environments, because of the delay between deployment and actual testing: the table was locked there too, but nobody noticed.
- Because the production database is a couple of major versions behind, the migration locks the table in production for two minutes. A newer version would have handled the same migration more gracefully.
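To make the first step of that chain concrete, here’s a minimal sketch of the kind of migration that can cause this, assuming PostgreSQL and an Alembic migration. The `orders` table, the `is_archived` column, and the revision identifiers are hypothetical, purely for illustration.

```python
"""Hypothetical Alembic migration sketching step 1 of the chain above.

Adding a NOT NULL column with a default forces a full table rewrite on
older PostgreSQL versions (before 11), holding an ACCESS EXCLUSIVE lock
on the table for the duration of the rewrite.
"""
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic (placeholder values).
revision = "a1b2c3d4e5f6"
down_revision = "f6e5d4c3b2a1"


def upgrade():
    # On PostgreSQL < 11 this rewrites every row of `orders` to fill in the
    # default, blocking reads and writes until the rewrite finishes.
    # On PostgreSQL 11+ the default is stored in the catalog instead, and
    # the column is added almost instantly.
    op.add_column(
        "orders",
        sa.Column(
            "is_archived",
            sa.Boolean(),
            nullable=False,
            server_default=sa.text("false"),
        ),
    )


def downgrade():
    op.drop_column("orders", "is_archived")
```

On an older database, a safer route would have been to add the column as nullable, backfill it in batches, and only then add the NOT NULL constraint; that’s exactly the kind of knowledge that tends to surface in a postmortem.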
Is there a single “root” of the problem here? If any of these steps had played out differently, the outage could have been avoided.
Don’t look for the root cause; look for the event chain.
Don’t mess up your priorities
Now that you know the event chain, you might want to make changes at any of these steps to prevent future outages of the same kind.
That sounds like a good plan; just weigh it against the improvements you and your team already have planned. Most likely, there’s already work lined up to make your development process better. Don’t let availability bias get the better of you: think hard before committing to big changes as a direct result of an outage. Quick-win action items, of course, should simply be implemented.
Treat the outage as a learning experience that you can use later to make better decisions about where your priorities lie.
Learning and knowledge sharing
That brings us to where I think the true value of a postmortem lies: learning. People working on an incident gain a lot of experience and insight into how the system works. Because of that experience, they’ll also be the ones pulled into future outages, which creates a cycle in which the same few people build up far more experience and understanding than the rest of the team.
Breaking this cycle is one of the goals of a postmortem. You want to make sure that all the people who dealt with the outage put their knowledge together, and that this combined knowledge is then spread to the rest of the team. You can do this in two ways:
- Making the postmortem meeting open for anyone to join.
- Writing a clear account of the outage, aimed at people who were not involved in it.
By doing this, you’ll spread valuable information about how your system behaves in production and what its typical failure modes are. In the long term, the benefits are:
- Better code quality, because developers get a better sense of where they need to pay extra attention.
- Better priority setting, because other teams will also have a feel for the kinds of problems occurring in your production environment.