Failure as a Clue: What Postmortems Get Wrong
Most postmortems feel performative.
Not on purpose.
Not because people don’t care.
But because the structure of the meeting forces everyone into a narrow, safe narrative:
“Let’s explain what happened without offending anyone, making too many waves, or triggering a political problem.”
When you do that, you get a timeline, not a diagnosis.
A sequence of events, not a pattern.
A story, not a signal.
And then—surprise—nothing changes.
The same issues return, the same failures repeat, and the organization chalks it up to “bad luck” or “another outage” instead of acknowledging the structural roots that made the failure inevitable.
This is the part most people miss:
Failures aren’t anomalies. They’re clues.
They’re diagnostic data points revealing how the system actually behaves—regardless of how leaders believe it behaves.
And when you treat failure as a clue instead of a crime scene, everything changes.
The Three Conversations Happening in Every Postmortem
There are always three layers swirling under the surface:
1. The technical sequence
What happened in what order.
2. The organizational forces
Incentives, shortcuts, political pressures, time constraints, and hidden rules.
3. The human layer
Fear, expectations, risk appetite, communication gaps, trust levels.
Most companies focus only on #1, politely avoid #3, and never touch #2.
But the truth is simple:
Most failures originate in layer #2 and only express themselves in layer #1.
If you don’t look at incentives, constraints, and misalignment, you’re not doing a postmortem.
You’re just documenting an autopsy.
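One way to make the three layers visible in a report is to tag each finding with the layer it belongs to, then check which layers nobody wrote about. The sketch below is only an illustration: the `Layer` and `Finding` names and the sample report are assumptions invented for this example, not part of any standard postmortem tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    TECHNICAL = 1       # what happened, in what order
    ORGANIZATIONAL = 2  # incentives, shortcuts, pressures, hidden rules
    HUMAN = 3           # fear, expectations, trust, communication gaps


@dataclass(frozen=True)
class Finding:
    layer: Layer
    summary: str


def missing_layers(findings):
    """Return the layers that no finding in the report addresses."""
    covered = {f.layer for f in findings}
    return [layer for layer in Layer if layer not in covered]


# Hypothetical report that contains only a timeline.
report = [
    Finding(Layer.TECHNICAL, "Deploy started; error rate rose; rollback completed."),
]

print([layer.name for layer in missing_layers(report)])
# prints ['ORGANIZATIONAL', 'HUMAN']
```

A report that comes back with `ORGANIZATIONAL` missing is a timeline, not a diagnosis.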
Why the Most Important Clues Never Make It into the Report
When a system fails, people instinctively try to:
- soften blame
- avoid sounding critical
- protect coworkers
- avoid managerial consequences
- minimize the perceived scope
- restore confidence
So the real signals—the structural ones—get deleted:
- “We’ve been avoiding this migration because the team is underwater.”
- “We rely on a person, not a process, for this decision.”
- “Everyone knows this part of the system is fragile; we just hope it holds.”
- “Product promised something engineering couldn’t deliver.”
- “We cut corners because leadership needed a demo.”
- “We’ve normalized this failure mode for so long it doesn’t even register.”
These are the actual causes.
But they’re almost never written down.
Why?
Because they implicate structures, not individuals.
And structures don’t defend themselves—but the people inside them do.
The Difference Between Postmortems That Change Something and Postmortems That Don’t
The teams that get better follow a simple rule:
Postmortems are not about blame. They are about reality.
And reality doesn’t care about feelings, titles, departments, or the organizational chart.
Reality cares about:
- constraints
- incentives
- throughput
- cognitive load
- process mismatches
- communication structure
- system design
- error pathways
If you want real improvement, you investigate the system, not the people.
You treat a failure as a symptom, not a sin.
And you ask the most powerful question in any diagnostic investigation:
“What made this the predictable outcome?”
Because it was predictable.
Failures always follow the path of least resistance.
A Better Way: The Diagnostic Postmortem
The diagnostic postmortem focuses on structures, incentives, and design, not blame.
Here’s the approach I’ve seen transform organizations:
1. Start with truth, not comfort
Reward clarity, not self-protection.
2. Look for incentive fingerprints
Work backward from the outcome: it shows which incentives were misaligned.
3. Identify structural weaknesses, not individual mistakes
Most “errors” were set in motion long before a human touched anything.
4. Ask what the system allowed—not what the engineer did
If a person could make a catastrophic mistake, the system was designed to allow it.
5. Extract the reusable pattern
Every major failure has a sibling hiding somewhere else.
6. Make the system safer for the next engineer
If success depends on heroics, you don’t have reliability—you have gambling.
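Steps 4 and 6 can be made concrete in code. As a minimal sketch, assume a hypothetical `delete_dataset` operation: rather than trusting every engineer to be careful, the function itself refuses a destructive action in production unless the caller retypes the dataset name. Every name here is invented for illustration.

```python
class UnsafeOperationError(Exception):
    """Raised when a destructive action lacks the required confirmation."""


def delete_dataset(name, environment, confirm_name=None):
    """Delete a dataset, requiring explicit confirmation in production.

    Hypothetical example: the guard lives in the system, so safety does
    not depend on one person remembering to double-check.
    """
    if environment == "production" and confirm_name != name:
        raise UnsafeOperationError(
            f"Refusing to delete {name!r} in production: "
            "pass confirm_name matching the dataset name."
        )
    # A real implementation would perform the deletion here.
    return f"deleted {name} from {environment}"


# Staging needs no confirmation; production requires the name to be retyped.
print(delete_dataset("scratch", "staging"))
print(delete_dataset("orders", "production", confirm_name="orders"))
```

The design choice is that the catastrophic path now takes deliberate effort, so a tired engineer running the wrong command gets an error instead of an outage.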
Why This Matters
Any organization can build dashboards.
Any team can build automation.
Any stack can scale with the right budget.
But very few teams build the ability to learn from failure.
The companies that do become resilient.
Everyone else becomes lucky.
The difference is whether leaders treat failure as:
- evidence, or
- embarrassment
Because here’s the truth:
Systems don’t break because people are bad.
They break because the system was designed in a way that made failure easy.
And that is the most actionable clue you’ll ever get.
If You Want to Go Deeper
This post is the written foundation of one of my most requested talks:
“Postmortems Are Autopsies: How to Actually Learn From Failure.”
If you’re interested in having me speak at your engineering organization, SRE team, or conference, you can find more details at:
Failures aren’t the enemy.
They’re messages.
And they’re telling you something important—if you’re willing to listen.