Building Incident Response Workflows That Actually Get Followed


I wrote previously about monitoring as a foundation for platform maintenance. That piece covers the "what's happening" side: uptime checks, log aggregation, APM, system telemetry. But knowing something is broken is only half the problem. The other half is what your team does about it.
Most incident response documentation I've seen falls into one of two categories: either it doesn't exist, or it exists as a 40-page PDF that nobody has read since it was written. Neither is useful at 2 AM when your primary database is unresponsive and Slack is lighting up.
This is what I've found actually works.
The runbook that gets used
A runbook is only useful if someone under stress can follow it. That means:
Keep it short. Each runbook should fit on one screen. If it requires scrolling, it's too long. Break it into linked sub-runbooks instead.
Start with triage, not diagnosis. The first thing an on-call engineer needs to know is "how bad is this?", not "why is this happening?" Structure every runbook in five phases (a full skeleton follows at the end of this section):
1. Severity assessment - Is this user-facing? How many users? Is data at risk?
2. Immediate mitigation - What can we do right now to reduce impact? (Restart the service, scale up, enable maintenance mode, roll back.)
3. Diagnosis - Now that the bleeding is controlled, figure out the root cause.
4. Resolution - Fix it properly.
5. Verification - Confirm the fix holds.
Use commands, not descriptions. Don't write "check the database connection pool." Write the actual command:

```bash
# Check active connections against the pool limit
psql -c "SHOW max_connections;"
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# If active connections exceed ~80% of max_connections, look for long-running queries
psql -c "SELECT pid, now() - query_start AS duration, query
         FROM pg_stat_activity
         WHERE state = 'active' AND now() - query_start > interval '30 seconds'
         ORDER BY duration DESC;"
```

Copy-paste beats memory every time, especially under pressure.
Include the "who to call" section. Every runbook should list:
- Who owns this service
- How to reach them (not just a Slack channel, an actual phone number for P1s)
- When to escalate (specific thresholds, not "when it seems bad")
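Putting those pieces together, here's a minimal runbook skeleton. It's a sketch, not a mandate; the service name, commands, and contacts are placeholders to replace with your own:

```markdown
# Runbook: [Service name]

## 1. Severity assessment
- User-facing? How many users? Is data at risk?

## 2. Immediate mitigation
- [Exact commands: restart, scale up, maintenance mode, roll back]

## 3. Diagnosis
- [Exact commands to inspect logs, metrics, connections]

## 4. Resolution
- [Steps for the proper fix]

## 5. Verification
- [Exact commands or checks that confirm the fix holds]

## Who to call
- Owner: [Name / team]
- P1 contact: [Phone number, not just a Slack channel]
- Escalate when: [Specific threshold, e.g. "not mitigated in 15 minutes"]
```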
Escalation paths that scale
The simplest escalation model that works:
P1 (service down, data at risk): On-call engineer has 15 minutes to assess. If not mitigated, it auto-escalates to the engineering lead. If not resolved in 1 hour, it goes to the director. Every 30 minutes without resolution, the next level up gets notified.
P2 (degraded performance, partial outage): On-call engineer owns it. Escalate to the team lead if no progress in 2 hours. Daily check-ins until resolved.
P3 (non-urgent, no user impact): Ticket it. Handle it during business hours. No pager, no escalation.
The key insight: escalation is not punishment. It's bringing more resources to bear. Engineers who hesitate to escalate because they don't want to "bother" someone are the ones who turn 30-minute incidents into 3-hour incidents. Make this explicit in your culture.
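This policy is simple enough to codify rather than leave as tribal knowledge. Here's a minimal sketch in Python; the tier timings mirror the ones above, and the role names are illustrative, not a standard:

```python
# Escalation chain per severity: (minutes elapsed, role to engage).
# Timings mirror the policy described above.
POLICY = {
    "P1": [(0, "on-call engineer"),
           (15, "engineering lead"),   # not mitigated within 15 minutes
           (60, "director")],          # not resolved within 1 hour
    "P2": [(0, "on-call engineer"),
           (120, "team lead")],        # no progress in 2 hours
    "P3": [],                          # ticket only, no pager
}

def who_is_engaged(severity: str, minutes_elapsed: int) -> list[str]:
    """Everyone who should be involved by this point in the incident."""
    return [role for threshold, role in POLICY.get(severity, [])
            if minutes_elapsed >= threshold]

# 70 minutes into an unresolved P1, three levels are engaged:
print(who_is_engaged("P1", 70))
# ['on-call engineer', 'engineering lead', 'director']
```

Wiring this into your paging tool is the real work, but even a table like this, checked into the repo, beats a policy that lives in someone's head.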
Communication during incidents
The worst thing about a major incident isn't the technical problem. It's the communication chaos. Three rules:
1. Designate an incident commander immediately. This person does not debug. They coordinate. They keep the status page updated, field questions from stakeholders, and make sure the people doing the actual work aren't interrupted.
2. Use a single channel. Create a dedicated Slack channel (or equivalent) for each P1 incident. All communication happens there. No DMs, no side threads, no "I'll just call them quick." Everything in one place, timestamped, searchable.
3. Update stakeholders on a cadence, not on demand. "We will post updates every 30 minutes" stops the "any update?" messages that distract the team. Even if the update is "still investigating, no new information," post it.
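The cadence itself is easy to automate so nobody has to watch the clock. A minimal sketch, assuming a Slack incoming webhook for the incident channel (the URL is a placeholder, and `requests` is a third-party dependency):

```python
import time
import requests

# Placeholder: the incoming webhook for this incident's dedicated channel.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
CADENCE_MINUTES = 30

def post_update(text: str) -> None:
    """Post a status update to the incident channel."""
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=10)

def run_cadence(incident_id: str) -> None:
    """Remind the incident commander every CADENCE_MINUTES until stopped."""
    while True:
        post_update(
            f"[{incident_id}] Stakeholder update due. If nothing has "
            "changed, post 'still investigating, no new information'."
        )
        time.sleep(CADENCE_MINUTES * 60)
```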
Post-mortems that prevent recurrence
I've sat through quite a few post-mortems. The ones that actually prevent recurrence share three traits:
They're blameless, genuinely. Not "blameless but we all know who screwed up." If your post-mortem makes someone feel bad about themselves, your process is broken. The question is never "who caused this?" It's "what about our system made this failure possible?"
They produce specific action items with owners and deadlines. "We should improve our monitoring" is not an action item. "Add connection pool utilization alert at 80% threshold, owned by Sarah, due Friday" is an action item. If it doesn't have an owner and a date, it won't happen.
They get reviewed. Schedule a 30-day follow-up to check whether action items were completed. If they weren't, that tells you something important about your team's capacity or your prioritization.
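That follow-up review is worth scripting too. Here's a small sketch that flags overdue action items, assuming they live in the markdown table format shown in the template below, with due dates written as YYYY-MM-DD:

```python
from datetime import date

def parse_action_items(markdown: str) -> list[dict]:
    """Extract rows from the post-mortem action-item table."""
    items = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip non-table lines, the header row, and the separator row.
        if (len(cells) == 4 and cells[0]
                and cells[0] != "Action" and not cells[0].startswith("-")):
            items.append(dict(zip(["action", "owner", "due", "status"], cells)))
    return items

def overdue(items: list[dict], today: date) -> list[str]:
    """Action items still open past their due date."""
    return [i["action"] for i in items
            if i["status"].lower() == "open"
            and date.fromisoformat(i["due"]) < today]
```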
A template that works
Here's the post-mortem template I use:
```markdown
## Incident: [Title]

**Date:** [Date]
**Duration:** [Start time] to [Resolution time]
**Severity:** P1/P2/P3
**Impact:** [Who was affected, how many, what they experienced]

## Timeline

- HH:MM - Alert fired / report received
- HH:MM - On-call engineer acknowledged
- HH:MM - [Key actions taken]
- HH:MM - Mitigation applied
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Verified resolved

## Root Cause

[Clear, specific explanation. Not "the server crashed" but
"the connection pool was exhausted because a schema migration
took a full table lock during peak traffic."]

## What Went Well

- [Things that worked as designed]

## What Went Poorly

- [Things that slowed detection or resolution]

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Specific action] | [Name] | [Date] | Open |
```

Making it stick
The hardest part of incident response isn't writing the docs. It's keeping them alive. Three practices that help:
Run game days. Once a quarter, simulate an incident. Walk through the runbook. If someone can't follow it, fix it right then. This is also how you onboard new team members to on-call.
Review runbooks when you touch the service. If you ship a change to a service, check whether its runbook still applies. Add it to your PR checklist.
Measure time-to-mitigate, not time-to-resolve. Resolution can take days for complex issues. Mitigation (reducing user impact) should happen in minutes. Track that number. It tells you whether your runbooks and escalation paths are working.
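If your post-mortem timelines are machine-readable, this number falls out directly. A sketch, assuming each incident records ISO 8601 timestamps for the first alert and the moment mitigation landed:

```python
from datetime import datetime

def time_to_mitigate(alert_at: str, mitigated_at: str) -> float:
    """Minutes from first alert to mitigation (ISO 8601 timestamps)."""
    start = datetime.fromisoformat(alert_at)
    end = datetime.fromisoformat(mitigated_at)
    return (end - start).total_seconds() / 60

# Illustrative data: watch the trend across incidents, not single values.
incidents = [
    ("2024-03-01T02:14:00", "2024-03-01T02:31:00"),  # 17 minutes
    ("2024-04-12T14:02:00", "2024-04-12T14:10:00"),  # 8 minutes
]
ttm = [time_to_mitigate(a, m) for a, m in incidents]
print(f"Mean time-to-mitigate: {sum(ttm) / len(ttm):.0f} min")
```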
Good incident response isn't about preventing all failures. It's about failing well: detecting quickly, mitigating faster, and learning from every one.