Post-Mortem Analysis :: Mateusz Jabłoński - blog, podcast, courses on programming and development

Post-Mortem Analysis

Oops. We took the app down. The client is calling: the website has been down for an hour, and no new orders are coming in. It’s stressful, but eventually the situation is brought under control. The question is: how can we prevent similar incidents in the future? Maybe a post-mortem analysis will help.

Publication date

05 May 2025

Mateusz Jabłoński

Frontend Developer with a passion. Husband in love. Proud father. Gamer by choice.

From this article you will learn:

What is post-mortem analysis?
When should we perform a post-mortem analysis?
What tools can we use to create a post-mortem?
Where should we store a post-mortem?
What is a positive post-mortem?

Everything has a limit. In the case of human life, that limit is death. As dramatic as it may sound, it’s simply a certain end: the conclusion of a particular stage. Endings are often a time for reflection. They help us understand what we could do differently in the future to avoid such outcomes, at least the ones we are not willing to accept.

Reflection is a crucial aspect of our existence. In literature, the term post mortem (Latin for “after death”) often refers to an analysis of the causes of death, both in a symbolic and literal sense. Unsurprisingly, the term has also been adopted in the IT world. Fortunately, it doesn’t involve someone actually dying. In this context, a post-mortem refers to the process of analyzing an incident that led to a major system failure. The goal of such an analysis is to help us understand what exactly happened, why it happened, and how we can avoid similar situations in the future.

Why Is Post-Mortem Analysis Important?

IT systems are prone to various types of failures—sometimes related to security, sometimes caused by bugs in the code or incorrect data passed to the application. In other words, failures are any undesired situations affecting our application. Identifying the root causes of such incidents is crucial not only for understanding their origin but also for preventing them in the future. Post-mortem analysis is a tool designed exactly for that purpose. Technically speaking, it’s a control mechanism whose main goal is to ensure the stability and reliability of systems.

Business applications that are continuously developed are especially vulnerable to unexpected issues, often caused by the deployment of new functionality. This is particularly true when a new feature replaces a previous solution or conflicts with existing ones. In such cases, effective management of the software delivery process becomes essential. Post-mortems help us find the weak points in that process and reduce the risk of the same issues recurring.

What’s especially important is that a well-executed post-mortem analysis does not aim to assign blame or point fingers. Its primary purpose is to foster a safe environment in which every team member feels comfortable sharing their observations and concerns about the application or the processes involved.

To sum up, a post-mortem is a retrospective analysis of an incident (e.g., a system failure) aimed at understanding what happened and why, as well as drawing conclusions to help prevent similar issues in the future.

Structure

A critical part of conducting an effective post-mortem is bringing together all teams or individuals who may have been involved in the incident. It’s essential to determine what happened, why it happened, how the team responded, and what should be done differently in a similar situation in the future. Assigning blame and punishing people is counterproductive and strongly discouraged—it can lead to hiding key facts, shifting responsibility, and ultimately drawing incorrect conclusions.

Organizations like Google emphasize that eliminating the “blame game” cultivates a learning culture, improves performance, and helps teams focus on preventing similar mistakes in the future.

Critics of such tools argue that conducting a post-mortem without identifying who was at fault is unrealistic, mainly because humans naturally tend to judge. They claim that forbidding blame creates an unhealthy and uncomfortable situation that stifles open communication and falsely boosts the morale of those who were genuinely responsible.

However, the idea behind a post-mortem is that removing blame helps uncover the actual areas that need improvement—instead of resorting to quick fixes. For example, if employee A made a mistake that caused the app to be down for several hours, firing that person won’t fix the situation. In fact, without properly analyzing what happened and why, there’s no guarantee that employee B won’t make the same or a similar mistake in the future. In my opinion, dismissal or punishment is just a temporary and ineffective solution.

Failure is part of the software development process. It’s unrealistic to expect that developers (or, nowadays, AI systems) will never make mistakes. Errors often arise because certain factors can’t be predicted during the early stages of development.

Incidents are also opportunities to learn.

Golden Rules of Writing a Post-Mortem

Like any tool, a post-mortem can be used well—or poorly. There are a few fundamental principles we should follow when preparing this kind of analysis. First and foremost, we should focus on facts, not opinions. Opinions won’t help us understand the incident; on the contrary, they can introduce noise and unnecessary variables. While opinions can be valuable when choosing a solution, they are not helpful when analyzing what went wrong.

To fully understand the situation, what matters is not who caused the error or failure but why it happened. Today it might have been a newly hired junior, tomorrow it could be a senior developer with years of experience. Since the incident has already occurred, let’s focus on the why. It’s the only way to find effective solutions.

We should go through the entire situation step by step, from the first event to the last. This timeline lets us walk through the whole incident and identify the weakest points in the process.

We must document everything. And by everything, we mean every decision and every action taken in response to the incident. Properly describing each step will help us better understand how to prevent similar situations in the future.
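To make the timeline idea more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `TimelineEntry` type, the `kind` labels, the helper names, and the sample entries are made up and do not come from a real incident. It orders the recorded facts from first to last and points at the longest gap between two consecutive entries, which is often a good place to start asking "why".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEntry:
    """One fact on the incident timeline: when it happened and what happened."""
    at: datetime
    kind: str          # hypothetical labels: "event", "decision", "action"
    description: str   # a fact, not an opinion, with no names attached


def build_timeline(entries: list[TimelineEntry]) -> list[TimelineEntry]:
    """Order entries from the first thing that happened to the last."""
    return sorted(entries, key=lambda entry: entry.at)


def longest_gap(
    timeline: list[TimelineEntry],
) -> tuple[TimelineEntry, TimelineEntry, timedelta]:
    """Return the two consecutive entries with the most time between them."""
    if len(timeline) < 2:
        raise ValueError("need at least two entries to measure a gap")
    pairs = zip(timeline, timeline[1:])
    before, after = max(pairs, key=lambda pair: pair[1].at - pair[0].at)
    return before, after, after.at - before.at


# Made-up entries, deliberately out of order, for illustration only.
entries = [
    TimelineEntry(datetime(2025, 1, 10, 10, 45), "action", "Previous release redeployed"),
    TimelineEntry(datetime(2025, 1, 10, 10, 0), "event", "New release deployed"),
    TimelineEntry(datetime(2025, 1, 10, 10, 40), "decision", "Team decides to roll back"),
    TimelineEntry(datetime(2025, 1, 10, 10, 5), "event", "Order form starts returning errors"),
]

timeline = build_timeline(entries)
before, after, gap = longest_gap(timeline)
print(f"Longest gap: {gap} between '{before.description}' and '{after.description}'")
```

In this made-up example the longest gap sits between the first errors and the decision to roll back, so the post-mortem would ask what slowed down detection or decision-making, not who was on call.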

And to end on a more positive note—don’t forget to include a section about what went well.

Where Should You Store a Post-Mortem?

There are two main schools of thought when it comes to storing post-mortems. The first one—my personal preference—suggests keeping post-mortems close to the code, typically in the code repository. In my opinion, a well-written post-mortem is also a part of the documentation. To ensure easy access, it makes sense to store it alongside the source code.

The second approach recommends using dedicated documentation or note-taking tools such as Confluence, Google Docs, or Notion. There are also specialized tools like Jellyfish, Blameless, or Incident.io designed specifically for incident management.

Ultimately, it doesn’t really matter which method you choose. What matters is that you start tracking and documenting incidents. When creating these documents, it’s worth applying best practices. For example, categorize post-mortems by the type of issue or the affected system. It’s also a good idea to keep them under version control. Consider linking each post-mortem to the relevant dashboards or incident records in your infrastructure monitoring tools, such as Grafana or Datadog.
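If you keep post-mortems in the repository, a predictable file layout makes categorizing and versioning them almost free. The sketch below shows one possible convention. The `docs/post-mortems` directory, the category folder, and the date-plus-slug file name are my assumptions, not a standard, and the sample values are made up.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def post_mortem_path(incident_date: date, category: str, title: str) -> PurePosixPath:
    """Build a path of the form docs/post-mortems/<category>/<date>-<title>.md."""
    filename = f"{incident_date.isoformat()}-{slugify(title)}.md"
    # Hypothetical base directory; use whatever fits your repository.
    return PurePosixPath("docs/post-mortems") / slugify(category) / filename


path = post_mortem_path(date(2025, 1, 10), "Service outage", "Order form unavailable")
print(path)  # docs/post-mortems/service-outage/2025-01-10-order-form-unavailable.md
```

With the ISO date first, files inside each category sort chronologically, and the regular commit history covers the versioning.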

Post-mortem reports should be accessible to everyone involved in developing and maintaining the application—developers, QA teams, security engineers, and product managers.

A sample post-mortem template might look like this:
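The sketch below is one hypothetical layout, assembled from the questions and rules discussed earlier in this article: what happened, why it happened, the timeline, how the team responded, what went well, and what to do differently. The section order, the `render_post_mortem` helper, and the sample values are assumptions for illustration, so adapt them to your team.

```python
from string import Template

# Section names follow the questions raised in this article; the layout is an assumption.
POST_MORTEM_TEMPLATE = Template("""\
Post-mortem: $title
Date of incident: $incident_date

What happened
$what_happened

Why it happened
$why_it_happened

Timeline
$timeline

How the team responded
$response

What went well
$went_well

What we will do differently
$next_steps
""")


def render_post_mortem(**fields: str) -> str:
    """Fill the template; a missing section raises KeyError instead of being skipped."""
    return POST_MORTEM_TEMPLATE.substitute(**fields)


# Made-up values for illustration only.
report = render_post_mortem(
    title="Order form unavailable",
    incident_date="2025-01-10",
    what_happened="The order form returned errors after a release.",
    why_it_happened="The release changed a payload format the form relied on.",
    timeline="10:00 Release deployed\n10:05 First errors\n10:45 Rollback finished",
    response="The team rolled back to the previous release.",
    went_well="The rollback procedure worked on the first attempt.",
    next_steps="Add a contract test for the order payload.",
)
print(report)
```

Failing on a missing section is a deliberate choice here: it keeps "what went well" and "why it happened" from being quietly left out.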

When Should You Create a Post-Mortem?

Every major incident should be followed by a post-mortem. This includes data breaches, service outages, and regressions. A post-mortem is also recommended whenever a client reports a serious issue or when an SLA (Service Level Agreement) is breached. An SLA defines the expected quality of service, such as response time, time to resolution, or maximum allowable downtime.
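Checking the "maximum allowable downtime" part of an SLA is simple arithmetic, sketched below. The 99.9% availability target and the 30-day period are assumptions chosen for illustration, and real contracts define their own numbers and measurement rules. The one-hour outage is the scenario from the beginning of this article.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Maximum downtime that a given availability target permits over the period."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be greater than 0 and at most 1")
    return period * (1.0 - availability_target)


def breaches_sla(downtime: timedelta, availability_target: float, period: timedelta) -> bool:
    """True when the observed downtime exceeds what the target allows."""
    return downtime > allowed_downtime(availability_target, period)


# Assumed SLA: 99.9% availability measured over 30 days.
period = timedelta(days=30)
budget = allowed_downtime(0.999, period)  # roughly 43 minutes
outage = timedelta(hours=1)               # the one-hour outage from the intro

print(f"Allowed downtime: {budget}")
print(f"SLA breached: {breaches_sla(outage, 0.999, period)}")
```

Under these assumed numbers, a single one-hour outage already exceeds the monthly budget, so it would call for a post-mortem on SLA grounds alone.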

There is also the concept of a positive post-mortem. It is written after an action that was risky from a development or operational perspective but turned out to be a success, so that the team can understand why it went well.

Summary

A post-mortem is a powerful tool for consciously preventing future incidents. It helps improve the quality of the product and encourages the team to approach future problems with more insight and responsibility. However, implementing the tool is only half the battle—you need to use it regularly and draw meaningful conclusions after each incident to better prepare for future challenges. Document, don’t blame, and keep growing—both as a team and in terms of the product itself.