When Everything Breaks: A Manager's Guide to Incident Response
The two-phase framework that turns crisis into learning (and saves your sanity)
The system is down. Nothing works.
Hannah is trying to figure out what's wrong, but she's stuck and needs help.
What do you do?
I learned this lesson the hard way.
There are two completely different jobs when things break:
• Managing the incident (get the system back up)
• Learning from the incident (prevent future problems)
What you do as a manager, and how you do it, is different in each case.
During the Crisis: One Goal Only
When your system is down, you have exactly one mission.
Get it working again. Fast.
That's it.
Two metrics matter right now:
• MTTR (Mean Time to Recovery)
• MTBF (Mean Time Between Failures)
During the incident, only MTTR counts. Everything else is a distraction that will slow you down.
Once the system works again, the focus shifts to MTBF through proper investigation.
The separation is critical.
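To make the two numbers concrete, here's a rough sketch of how both fall out of the same incident log. The timestamps below are invented:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage started, fully restored), oldest first.
incidents = [
    (datetime(2024, 3, 1, 9, 15), datetime(2024, 3, 1, 10, 5)),
    (datetime(2024, 3, 20, 22, 40), datetime(2024, 3, 21, 0, 10)),
    (datetime(2024, 4, 2, 14, 0), datetime(2024, 4, 2, 14, 35)),
]

# MTTR: average time from "it broke" to "it works again".
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# MTBF: average time from the end of one incident to the start of the next.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

print("MTTR:", mttr)  # the only number that matters during the incident
print("MTBF:", mtbf)  # the number we improve afterwards, through investigation
```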
The Technical Truth Nobody Talks About
Despite what millions of programmers believe, computers are deterministic machines.
If something worked before and doesn't work now, something changed.
Could be:
• Inputs (sudden user spike)
• Environment (server hardware failed)
• Code (new deployment)
• Configuration (timeout settings changed)
The fastest way to restore service?
Map all changes between "working" and "broken" time periods. Then roll back everything possible.
If code shipped to production, roll back to the last working version. If config changed, revert it.
Go backwards. Don't try to fix forward. It's the right call 95% of the time.
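Here's a minimal sketch of what "map all changes" can look like in practice. The change feed, timestamps, and labels below are invented; in reality you'd pull them from your deploy tool, config history, and infrastructure logs:

```python
from datetime import datetime

# Hypothetical window between "last known good" and "first known broken".
LAST_KNOWN_GOOD = datetime(2024, 5, 10, 8, 0)
FIRST_KNOWN_BROKEN = datetime(2024, 5, 10, 11, 30)

# Hypothetical change feed gathered from deploys, config edits, and traffic.
changes = [
    {"when": datetime(2024, 5, 10, 9, 12), "kind": "deploy", "what": "api v2.41.0", "revertable": True},
    {"when": datetime(2024, 5, 10, 10, 3), "kind": "config", "what": "db timeout 30s -> 5s", "revertable": True},
    {"when": datetime(2024, 5, 10, 10, 45), "kind": "traffic", "what": "2x request spike", "revertable": False},
]

# Everything that changed in the suspect window, newest change first.
suspects = sorted(
    (c for c in changes if LAST_KNOWN_GOOD < c["when"] < FIRST_KNOWN_BROKEN),
    key=lambda c: c["when"],
    reverse=True,
)

for c in suspects:
    action = "ROLL BACK" if c["revertable"] else "mitigate / monitor"
    print(f"{c['when']:%H:%M}  {c['kind']:8} {c['what']:25} -> {action}")
```

The script isn't the point; the discipline is: list every change in the window, then decide, change by change, whether it can be rolled back.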
Your Real Job During the Call
You're not there to solve the technical problem.
You're there to help the engineers solve it faster.
Three specific ways:
Get the right people on the call. You need to know both the official org chart and the real one. Who's actually the expert when database queries slow down? Who do people really call when the API breaks? Bring them in.
Keep the discussion productive. Engineers naturally want to investigate root causes while fixing. Your job is to notice when the team drifts from "restore service" to "understand why." Pull them back. Investigation comes later.
Enable parallel work. Every 15 minutes, stop and share status. What happened? What did we try? Who's working on what? Use a shared doc if you don't have incident management tools. This lets multiple people work simultaneously and helps new people join the call quickly.
When You're Both Manager and Tech Lead
Sometimes you're the most technical person on the team.
If that's you, contribute technically, but set a recurring 30-minute timer. When it goes off, stop coding and do the management work above.
Better yet, send technical insights privately to individual engineers instead of discussing in the group call. You're building an organization that needs to function without you.
After the Fire is Out: Time to Learn
Once the system works and stress levels drop, switch modes completely.
Now we optimize for learning.
Same two metrics, different focus:
• MTTR (how to recover faster next time)
• MTBF (how to prevent similar incidents)
But add a third one:
• MTTD (Mean Time to Detection - how quickly we notice problems)
The 5-Why Investigation
Collect the raw data before you start asking why.
• When exactly did the incident begin?
• When did we first detect it?
• How did we detect it? (alert vs angry customer)
• When did we identify the fix?
• How much time did we lose going in the wrong directions?
• When did the fix reach production?
• When was the system fully restored?
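Once those timestamps are written down, the per-incident numbers fall out with simple subtraction. A small sketch, with placeholder values:

```python
from datetime import datetime

# Hypothetical timeline for a single incident, taken from the questions above.
began     = datetime(2024, 6, 3, 14, 2)   # when the incident actually began
detected  = datetime(2024, 6, 3, 14, 27)  # first alert (or angry customer)
fix_found = datetime(2024, 6, 3, 15, 10)  # when we identified the fix
restored  = datetime(2024, 6, 3, 15, 40)  # system fully restored

print("Time to detection (feeds MTTD):  ", detected - began)
print("Time spent diagnosing:           ", fix_found - detected)
print("Time to ship the fix and recover:", restored - fix_found)
print("Total recovery time (feeds MTTR):", restored - began)
```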
Facts first. Emotions and blame second (actually, never).
Then start asking why. But here's what most people miss...
The answer to each "why" is rarely a single thing. You're not building a list. You're building a tree that branches with every question you ask.
Example:
The system crashed because a config parameter was wrong. Why?
The config was wrong because Module X didn't match Module Y's version. Why?
The modules didn't match because the dependency file had an error. Why?
Dependencies are set manually, and the test environment was updated but production wasn't. Why?
Test and production use different dependency management systems.
See how we went from "someone made a config mistake" to "our environments are inconsistent"?
The first conclusion leads to finger-pointing. The second leads to systemic improvement.
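If it helps to see the "tree, not list" idea concretely, here's a minimal sketch of the example above as nested data. The structure is the point, not the tooling:

```python
# A "why tree" for the example above. Each node is a finding plus the answers
# to "why?"; real investigations grow sibling branches at every level.
why_tree = {
    "finding": "System crashed because a config parameter was wrong",
    "whys": [
        {
            "finding": "Module X didn't match Module Y's version",
            "whys": [
                {
                    "finding": "The dependency file had an error",
                    "whys": [
                        {
                            "finding": "Dependencies are set manually; test was updated, production wasn't",
                            "whys": [
                                {
                                    "finding": "Test and production use different dependency management systems",
                                    "whys": [],  # a systemic cause worth acting on
                                },
                            ],
                        },
                    ],
                },
            ],
        },
        # Sibling branches go here: each "why" can have several answers.
    ],
}


def deepest_findings(node):
    """Collect the leaves of the tree: the systemic causes, not the people."""
    if not node["whys"]:
        return [node["finding"]]
    return [leaf for child in node["whys"] for leaf in deepest_findings(child)]


print(deepest_findings(why_tree))
```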
Three Manager Rules for Investigation
No names in the investigation. The moment people start blaming each other, learning stops. Jump in immediately when this happens. Explain we're focused on learning, not finding fault.
Stay focused on the tree. The investigation reveals multiple branches to explore. Help the team decide which branches to follow and which to abandon. This is time management - make the investigation time count.
Human error is never the root cause. If your investigation ends with "someone made a mistake," you're not done. There's always a system that should catch human errors. What broke in that system?
The Follow-Through Problem
Here's where most organizations fail.
Every investigation ends with action items. But if those action items never get implemented, you've done worse than nothing. You've created cynical engineers who see incident reviews as bureaucratic theater.
You need two things:
Culture and mechanisms that demand follow-through. Track the action items. Review them in team meetings. Make implementation visible.
Business context for the team. Not every action item is worth doing. Help engineers understand why some fixes get prioritized and others don't. Build an open prioritization system that explains how decisions get made.
The Real Lesson
Your job as a manager is to build the team, not to build what the team builds.
But sometimes you really do need to roll up your sleeves and dive into the tech.
The real lesson is that there are no absolutes in management. In each and every moment you need to be whatever your team needs you to be.
I learned the hard way that I struggle to keep others in sync during an incident.
In my mind, I'm focused and I'm working hard to fix the problem.
But from the outside, others aren't sure what I'm doing or how far along I am.
I've "paid the price" for it on several occasions. I guess it's important to take step back every once in a while and deliberately do a status check with the others.
I think the separation between when to resolve and when to reflect is so critical here. When tensions are high, it's easy for people to start digging into why it happened and who was at fault, to the point that it distracts from resolving the outage.
Great post!