Incident Response & Disaster Recovery
Throughout this page, we will follow a single scenario: your team operates a production web application that processes customer orders. On a busy Friday afternoon, the service goes down during peak hours. We will walk through the complete incident lifecycle, from the first alert firing to the postmortem that prevents it from happening again, using this scenario to ground every concept in practice.
What Is an Incident?
An incident is any unplanned disruption to a service that requires a coordinated human response. Not every alert is an incident; a single transient error that self-heals is just noise. An incident begins when the disruption is significant enough that someone needs to stop what they are doing and respond.
The key distinction is between a problem you can fix in the normal course of work (a bug you triage into the next sprint) and a problem that demands immediate, coordinated action (a service is down and customers cannot place orders). Incidents sit firmly in the second category. They are time-sensitive, they affect real users, and they benefit enormously from a structured response.
Severity Levels
Severity levels give your team a shared vocabulary for describing how bad things are. Without them, one engineer’s “the site is slow” might be another’s “we are losing thousands of dollars per minute.” A common scheme uses three or four tiers.
SEV-1 (Critical) means a core business function is completely unavailable or data integrity is at risk. In our scenario, a SEV-1 would be the order processing pipeline returning 500 errors to every customer. Revenue is directly affected, and customer trust is eroding by the minute.
SEV-2 (Major) means the service is degraded but partially functional. Perhaps the checkout page loads but takes 15 seconds instead of 2, or a subset of users in one geographic region cannot connect. The business impact is real but not total.
SEV-3 (Minor) means there is elevated risk or a non-critical feature is broken. The admin dashboard might be down, or a background job queue is backing up. Customers are not yet affected, but the situation could escalate if left unattended.
SEV-4 (Low) covers cosmetic issues or minor inconveniences that do not require an urgent response but should be tracked so they do not accumulate into something worse.
Examples of Incidents
Incidents come in many forms. A failed deployment that introduces a crashing bug is one of the most common. A database running out of disk space and refusing writes is another. A misconfigured firewall rule that blocks all inbound traffic, a certificate that expired overnight, a dependency on a third-party API that starts returning errors: all of these are incidents. What they share is urgency and the need for coordination.
The Incident Lifecycle
Every incident, regardless of severity, follows a natural lifecycle. Understanding this lifecycle helps your team move through it deliberately rather than thrashing in panic. The five phases are: detect, triage, mitigate, resolve, and learn.
Phase 1: Detect
You cannot respond to what you do not know about. Detection is the phase where someone or something notices that things are wrong. The faster you detect, the less damage accumulates.
In our scenario, detection might happen in several ways. An automated alert fires because the error rate on the order service has exceeded 5% for three consecutive minutes. A customer support agent notices a surge in complaints. A synthetic monitoring check (a script that places a test order every 60 seconds) fails five times in a row. Or an engineer glances at a Grafana dashboard and sees a wall of red.
The best detection systems are layered. No single method catches everything, so you want multiple independent signals.
Alerting on metrics is the most common approach. A monitoring system like Prometheus scrapes your application’s health endpoint every 15 seconds and evaluates alert rules against the collected data. When the error rate exceeds a threshold, it sends a notification through a tool like PagerDuty or Opsgenie, which pages the on-call engineer.
Synthetic monitoring simulates real user behavior. A probe running outside your infrastructure attempts to load the homepage, log in, and complete a purchase every minute. If any step fails, the probe raises an alert. Synthetic checks are especially valuable because they catch problems that internal metrics might miss, such as DNS failures or CDN misconfigurations.
Log anomaly detection watches for unusual patterns in your application and system logs. A sudden spike in “connection refused” errors from the database, or a flood of “out of memory” kernel messages, can signal trouble before it becomes user-visible.
User reports are the detection method of last resort. If customers are telling you the site is down before your monitoring does, your observability stack has a gap that needs to be addressed in the postmortem.
Phase 2: Triage
Once you know something is wrong, you need to quickly determine how bad it is, what is affected, and who needs to be involved. This is triage.
In our scenario, the on-call engineer receives a page at 2:47 PM. Their first action is to open the monitoring dashboard and confirm the alert is real, not a false positive caused by a flaky metric. They check several signals in rapid succession.
# Check the HTTP error rate from the load balancer logs
curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])" | jq '.data.result[].value[1]'
# Verify the application process is running
ssh web-prod-01 'systemctl status order-service'
# Check if the database is accepting connections
ssh db-prod-01 'pg_isready -h localhost -p 5432'

The engineer sees that the order service is returning 502 errors and the database is not accepting connections. This is a SEV-1: the core ordering pipeline is completely down during peak hours. They escalate immediately.
Impact assessment answers the question “who is affected and how badly?” Check your analytics to see how many users are currently active. Look at the error rate as a percentage of total traffic. Determine whether the problem is global or regional. In our case, the database being down means every user on every endpoint that touches the database is affected, which is nearly the entire application.
Deciding who to page depends on severity and the nature of the problem. For a SEV-1 with a database component, the on-call engineer pages the database specialist and the engineering manager. They also notify the customer support lead so the support team can prepare for an influx of tickets.
Phase 3: Mitigate
Mitigation is about reducing customer impact as fast as possible. This is the most counterintuitive phase for engineers, because the instinct is to diagnose the root cause. Resist that instinct. At this stage, your goal is to stop the bleeding, not to understand why the patient is bleeding.
Think of mitigation strategies as a toolkit. You reach for whichever tool can restore service fastest, even if the fix is temporary or inelegant.
Rollback is the single most powerful mitigation strategy. If the problem started shortly after a deployment, rolling back to the previous known-good version is almost always the right first move. Modern deployment pipelines make this straightforward.
# Roll back to the previous release using your deployment tool
kubectl rollout undo deployment/order-service
# Or if using a blue-green deployment, switch traffic back to the blue environment
aws elbv2 modify-listener --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$BLUE_TARGET_GROUP

Feature flags let you disable a specific feature without redeploying the entire application. If the crash is caused by a newly launched recommendation engine, you can flip a flag to disable it while keeping the rest of the order flow intact.
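In its simplest form, a feature flag is just a value the application checks at runtime. The sketch below uses a flat key=value file; real systems typically use a flag service (LaunchDarkly, Unleash, or a config store), and the file path and flag name here are hypothetical.

```shell
#!/usr/bin/env bash
# Minimal file-based feature-flag sketch. FLAG_FILE and the flag name
# "recommendations" are placeholder assumptions, not a real API.
FLAG_FILE="${FLAG_FILE:-/etc/order-service/flags.conf}"

flag_enabled() {  # usage: flag_enabled <flag_name>
  # A flag is on only if the file contains an exact "<name>=on" line.
  grep -q "^$1=on$" "$FLAG_FILE" 2>/dev/null
}

# During an incident, flip the flag off without redeploying:
# sed -i 's/^recommendations=on$/recommendations=off/' "$FLAG_FILE"
```

The key property is that flipping the flag is a data change, not a code change, so it takes effect in seconds rather than a full deploy cycle.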
Traffic shifting moves users away from a broken component. If one availability zone is experiencing problems, you can drain it from the load balancer and send all traffic to healthy zones. If a single database replica is corrupt, you can remove it from the read pool.
Scaling helps when the root cause is a capacity problem. If traffic has spiked beyond what your current fleet can handle, adding more instances buys time while you investigate the underlying cause.
Graceful degradation means intentionally reducing functionality to preserve the core experience. If the recommendation service is overloading the database, you can serve a static “popular items” list instead. The experience is worse, but customers can still place orders.
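The fallback pattern behind graceful degradation can be sketched in a few lines of shell. The endpoint URL and cache path below are assumptions for illustration; the generic wrapper is the part that matters.

```shell
#!/usr/bin/env bash
# Sketch: run a primary command; if it fails, run a cheaper fallback.
with_fallback() {  # usage: with_fallback <primary_cmd> <fallback_cmd>
  "$1" 2>/dev/null || "$2"
}

# Hypothetical concrete use: live recommendations with a static fallback.
fetch_live()    { curl -sf --max-time 2 "https://api.example.internal/recommendations"; }
serve_static()  { cat /var/cache/order-service/popular-items.json; }

# with_fallback fetch_live serve_static
```

The degraded path must be strictly cheaper than the primary one (static file versus database query here), or the fallback itself can collapse under the same load.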
In our scenario, the on-call engineer discovers that the database server’s disk is 100% full. The transaction log has grown unchecked because a misconfigured backup job was writing snapshots to the database server’s local disk instead of an external volume. The fastest mitigation is to free disk space.
# Check disk usage on the database server
ssh db-prod-01 'df -h'
# Identify and remove the errant backup files
ssh db-prod-01 'ls -lhS /var/lib/postgresql/backups/ | head -20'
ssh db-prod-01 'sudo rm /var/lib/postgresql/backups/snapshot-2026031*.sql.gz'
# Verify the database can accept connections again
ssh db-prod-01 'pg_isready -h localhost -p 5432'

After clearing the backup files, the database comes back online and the order service begins processing requests again. The error rate drops from 100% to near zero within two minutes.
Phase 4: Resolve
Resolution is distinct from mitigation. Mitigation stops the immediate harm; resolution addresses the underlying cause so the same failure does not recur. In our scenario, deleting the backup files was mitigation. Resolution requires fixing the backup configuration so snapshots write to the correct location, and adding a disk usage alert so the team is warned long before the disk fills up again.
Resolution often happens after the incident is over, during normal working hours. Once the service is stable and customers are no longer affected, it is acceptable (and often wise) to defer the permanent fix to the next business day rather than making complex changes under pressure at 3 AM.
Verifying the fix is a critical step that teams sometimes skip in their relief at restoring service. After mitigation, monitor the system closely for at least 30 minutes. Watch the error rate, latency percentiles, and any metrics related to the failed component. Run your synthetic checks manually. Check that downstream systems (payment processing, email notifications, inventory updates) are all functioning correctly.
# Watch the error rate in real time
watch -n 5 'curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~\"5..\"}[1m])" | jq ".data.result[].value[1]"'
# Confirm orders are flowing through the pipeline
ssh web-prod-01 'journalctl -u order-service --since "5 minutes ago" | grep "order completed" | wc -l'

Monitoring for recurrence means keeping a closer-than-usual watch on the system for the next 24 to 48 hours. Set a temporary tighter alert threshold if your monitoring system supports it. Have the team check in at the start of the next business day to confirm that the system remained healthy overnight.
Phase 5: Learn
The final phase of the incident lifecycle is learning. This is where you extract lasting value from a painful experience. Without this phase, you are doomed to repeat the same failures.
Learning takes the form of a postmortem (sometimes called a retrospective or incident review). We will cover postmortems in more detail later in this chapter, but the key principle is that they must be blameless. The goal is to understand the conditions that allowed the failure to happen, not to find someone to punish.
Incident Roles
When an incident is declared, confusion is the enemy. People need to know who is making decisions, who is executing technical work, and who is communicating with stakeholders. Defining roles in advance eliminates the “who’s doing what?” chaos that wastes precious minutes.
Incident Commander (IC)
The Incident Commander owns the overall response. They do not necessarily do the technical work themselves; instead, they coordinate. The IC declares the severity level, assembles the right people, makes decisions about mitigation strategies, and keeps the response focused.
A good IC asks questions like: “What is the current customer impact?” “What have we tried so far?” “What is our next action and who is doing it?” “When is our next status update due?” The IC also time-boxes investigation efforts. If an engineer has been debugging for 15 minutes without progress, the IC may redirect them to a different approach or bring in additional help.
When you take on the IC role, announce it explicitly in the incident channel: “I am acting as Incident Commander for this incident.” This removes ambiguity and gives everyone a clear point of contact for decisions.
Operations Lead
The Operations Lead (sometimes called the Technical Lead) does the hands-on diagnostic and mitigation work. They run commands, check logs, deploy fixes, and report findings back to the IC. In a small team, this might be the on-call engineer who received the initial page.
The Operations Lead should narrate their actions in the incident channel as they work. “Checking database connectivity on db-prod-01.” “Disk is 100% full. Investigating which files are consuming space.” “Clearing backup files now.” This running commentary serves two purposes: it keeps the IC informed, and it creates a real-time timeline that will be invaluable during the postmortem.
Communications Lead
The Communications Lead manages all messaging to stakeholders outside the immediate response team. This includes updating the public status page, sending internal notifications to leadership, and coordinating with customer support.
The Communications Lead translates technical details into audience-appropriate language. The incident channel might say “PostgreSQL is OOM-killed due to shared_buffers misconfiguration,” but the status page should say “Some customers may experience errors when placing orders. Our team is actively working on a fix and we expect to have an update within 30 minutes.”
Scribe
The Scribe records the timeline of events during the incident. They note when alerts fired, when people joined the response, what actions were taken, and what the results were. This role is especially important for complex incidents that span hours, where memory becomes unreliable.
Detection Methods in Depth
We touched on detection during the lifecycle overview, but it deserves a closer look because the quality of your detection directly determines how quickly you can respond.
Threshold-Based Alerts
The simplest form of alerting compares a metric against a fixed threshold. “If the error rate exceeds 1% for more than 3 minutes, fire a SEV-2 alert.” Threshold alerts are easy to reason about and easy to configure. Their weakness is that they require you to know what “normal” looks like in advance. A threshold set for average traffic will fire too often during a traffic spike, and one set for peak traffic will miss problems during quiet periods.
To handle this, use percentage-based thresholds (error rate rather than error count) and build in hold-down periods (the condition must persist for several minutes before the alert fires). This reduces false positives without significantly delaying detection.
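Both ideas can be seen in a Prometheus alerting rule: the expression is a ratio rather than a raw count, and the `for:` clause is the hold-down period. The metric and label names below are assumptions matching the examples elsewhere on this page; adjust them to your own instrumentation.

```yaml
# Sketch of a Prometheus alerting rule (assumed metric/label names).
groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        # Percentage-based: 5xx responses as a fraction of all responses.
        expr: |
          sum(rate(http_requests_total{job="order-service", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="order-service"}[5m])) > 0.01
        # Hold-down: the condition must persist for 3 minutes before firing.
        for: 3m
        labels:
          severity: sev2
        annotations:
          summary: "Order service 5xx rate above 1% for 3 minutes"
```
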
SLO-Based Alerts
A more sophisticated approach ties alerts to your Service Level Objectives. Instead of asking “is the error rate above 1%?”, you ask “at the current error rate, will we burn through our monthly error budget within the next hour?” This is called burn-rate alerting, and it naturally adapts to traffic volume. During peak hours, a 0.5% error rate might be alarming because the sheer volume of errors will exhaust your budget quickly. During quiet hours, the same rate is less urgent.
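The arithmetic behind burn-rate alerting is simple enough to sketch. Assuming a 99.9% SLO (an error budget of 0.1% of requests) over a 30-day, 720-hour window:

```shell
#!/usr/bin/env bash
# Burn-rate arithmetic for an assumed 99.9% SLO (0.1% error budget)
# over a 30-day (720-hour) window.

burn_rate() {  # usage: burn_rate <observed_error_rate> <error_budget_rate>
  awk -v r="$1" -v b="$2" 'BEGIN { printf "%.1f", r / b }'
}

hours_to_exhaust() {  # usage: hours_to_exhaust <burn_rate>
  awk -v br="$1" 'BEGIN { printf "%.1f", 720 / br }'
}

# A 0.5% error rate against a 0.1% budget is a burn rate of 5:
# the entire month's budget is gone in 720 / 5 = 144 hours (six days).
```

A common practice is to page only when the burn rate is high enough to exhaust the budget within hours, and merely ticket slow burns.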
Synthetic Monitoring
Synthetic monitors are automated scripts that perform actions a real user would perform, running at regular intervals from locations outside your infrastructure. They are your early warning system for problems that internal metrics cannot see.
A synthetic check for our order service might look like this: load the homepage, search for a product, add it to the cart, proceed to checkout (stopping short of actually placing an order). If any step fails or takes longer than a defined threshold, the check raises an alert.
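A minimal runner for such a check might look like the sketch below. `BASE_URL` and the request paths are placeholders; the structure (each step must succeed and finish within a time budget) is the point.

```shell
#!/usr/bin/env bash
# Sketch of a synthetic check runner. BASE_URL and the paths in the
# commented usage lines are hypothetical.
BASE_URL="${BASE_URL:-https://shop.example.com}"

check_step() {  # usage: check_step <name> <max_seconds> <command...>
  local name="$1" max="$2" start end
  shift 2
  start=$(date +%s)
  if ! "$@" >/dev/null 2>&1; then
    echo "FAIL $name"
    return 1
  fi
  end=$(date +%s)
  if [ $((end - start)) -gt "$max" ]; then
    echo "SLOW $name"
    return 1
  fi
  echo "OK $name"
}

# check_step homepage 3 curl -sf "$BASE_URL/"
# check_step search   3 curl -sf "$BASE_URL/search?q=widget"
# check_step cart     5 curl -sf -X POST "$BASE_URL/cart" -d 'sku=123'
```

Wire the script's exit status into your alerting system so a failing step pages the on-call engineer just like a metric threshold would.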
Log-Based Detection
Application and system logs contain a wealth of information that metrics alone may not capture. A sudden increase in “connection timeout” log entries from the application, or “out of memory” entries from the kernel, can signal an impending failure before it becomes user-visible.
Modern log aggregation systems (such as Loki, Elasticsearch, or CloudWatch Logs) let you define alerting rules over log patterns, effectively turning qualitative log data into quantitative signals.
Triage: Classifying and Assessing
Triage is the bridge between detection and action. Its purpose is to answer three questions quickly: how severe is this, what is affected, and who needs to respond.
Severity Classification in Practice
When an alert fires, the on-call engineer’s first job is to confirm it is real and assign a severity. They do this by cross-referencing multiple signals. A single metric spike might be a collection artifact, but if the error rate is up, latency is climbing, and the synthetic monitor is failing, you have a real problem.
Use your predefined severity criteria. If the order pipeline is completely down (matching your SEV-1 definition of “core business function unavailable”), declare SEV-1 immediately. Do not waste time debating; you can always downgrade later. It is far better to over-classify and stand down than to under-classify and let damage accumulate.
Impact Assessment
Impact assessment quantifies the blast radius. How many users are affected? What percentage of requests are failing? Is the problem global or isolated to a specific region, service, or customer segment?
# Count unique affected users in the last 10 minutes from access logs
ssh web-prod-01 'zcat /var/log/nginx/access.log.gz | awk -v cutoff=$(date -d "10 minutes ago" +%s) "\$4 > cutoff && \$9 >= 500 {print \$1}" | sort -u | wc -l'
# Check if the problem is isolated to one availability zone
curl -s "http://prometheus:9090/api/v1/query?query=sum%20by%20(zone)%20(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[] | {zone: .metric.zone, error_rate: .value[1]}'

Escalation
Your severity levels should map directly to escalation paths. A SEV-1 pages the on-call engineer immediately (phone call, not just a Slack message). A SEV-2 sends a push notification. A SEV-3 creates a ticket for the next business day.
Know your escalation paths before you need them. At 3 AM, you should not be searching Confluence for the database team’s on-call number.
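One way to keep that mapping from living only in people's heads is to encode it where tooling can use it. A trivial sketch (the wording of each action is illustrative, not a real paging API):

```shell
#!/usr/bin/env bash
# Sketch: severity-to-escalation mapping as executable policy.
# The action strings are placeholders for real notification commands.

escalation_for() {  # usage: escalation_for <SEV-N>
  case "$1" in
    SEV-1) echo "phone-call on-call engineer now" ;;
    SEV-2) echo "push notification to on-call" ;;
    SEV-3) echo "ticket for next business day" ;;
    SEV-4) echo "backlog item" ;;
    *)     echo "unknown severity"; return 1 ;;
  esac
}
```

In practice this logic usually lives in your paging tool's routing rules rather than a script, but the principle is the same: the mapping is written down once and applied mechanically.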
Communication During Incidents
Poor communication during an incident amplifies the damage. Stakeholders who do not know what is happening will start asking questions, pulling responders away from the technical work. Customers who see no acknowledgment on the status page will flood support channels. Leadership who learn about an outage from Twitter instead of from their own team will lose trust in the engineering organization.
Internal Communication
Open a dedicated incident channel (for example, #inc-2026-03-15-order-outage in Slack) as soon as the incident is declared. Pin a message at the top with the current severity, the IC’s name, and the next update time. All incident-related discussion should happen in this channel, not in DMs or side conversations.
Post structured updates at regular intervals. For a SEV-1, update every 15 minutes; for a SEV-2, every 30 minutes. Each update should follow a simple template.
Status Update - 15:15
Severity: SEV-1
Impact: Order processing is fully down. ~2,400 active users affected.
Current Status: Database server disk was full. Cleared 45 GB of errant backup files. Database is back online. Order service is recovering. Error rate dropping.
Next Steps: Monitoring error rate. Expecting full recovery within 5 minutes.
Next Update: 15:30 or sooner if status changes.

External Communication
Your public status page is the single most important external communication channel during an incident. Update it early, update it honestly, and update it regularly. Customers are far more forgiving of downtime when they can see that you know about it and are working on it.
A good status page update follows this structure: acknowledge the problem, describe the customer-visible impact, state that you are actively working on it, and give a time for the next update. When the incident is resolved, post a final update confirming that service has been restored and that you will be publishing a postmortem.
Bridge Calls
For complex SEV-1 incidents involving multiple teams, a voice bridge (video call or phone conference) can be more effective than a text channel. Voice communication is higher bandwidth and allows for faster back-and-forth during time-critical decisions. Keep the bridge focused: the IC runs the call, mutes participants who are not actively speaking, and periodically summarizes the current state for anyone who has just joined.
Recovery and Verification
Recovery is the transition from “the fire is out” to “we are confident the system is healthy.” It is tempting to declare victory as soon as the error rate drops to zero, but premature celebration has burned many teams.
- Confirm the mitigation is holding. Monitor key metrics (error rate, latency, throughput) for at least 30 minutes after the fix. If any metric shows instability, do not close the incident.
- Verify downstream systems. In our scenario, orders depend on payment processing, inventory management, and email notifications. Check that all of these are functioning correctly. A fix that restores the order form but breaks payment confirmation is not a fix.
- Test with synthetic checks. Run your synthetic monitoring manually to confirm that the full user journey works end to end.
- Check for data inconsistencies. If the outage caused failed writes or partial transactions, you may need to reconcile data. In our scenario, check for orders that were partially processed (payment charged but inventory not updated) and resolve them.
- Communicate resolution. Update the incident channel, the status page, and any stakeholders who were notified during the incident. Include a brief summary of what happened and what was done to fix it.
- Schedule the postmortem. While the incident is still fresh in everyone's memory, schedule the postmortem meeting for within the next 48 hours.
Disaster Recovery: Planning for the Worst
Not every incident is a single-server disk filling up. Some failures are catastrophic: a data center loses power, a cloud region goes offline, ransomware encrypts your database, or a critical third-party service shuts down without warning. Disaster recovery (DR) planning prepares you for these scenarios.
RTO and RPO
Two metrics anchor every DR plan.
Recovery Time Objective (RTO) is the maximum acceptable duration of an outage. If your RTO is four hours, your DR plan must be capable of restoring service within four hours of a disaster being declared. RTO drives architectural decisions: a four-hour RTO might be achievable with a cold standby and manual failover, but a 15-minute RTO probably requires automated failover to a hot standby.
Recovery Point Objective (RPO) is the maximum acceptable amount of data loss, measured in time. If your RPO is one hour, you can afford to lose at most one hour of data. RPO drives your backup and replication strategy: an RPO of one hour means backups every hour at minimum, while an RPO of zero requires synchronous replication to a secondary site.
These two metrics are driven by business requirements, not by engineering preferences. The cost of achieving a lower RTO or RPO increases dramatically, so it is essential to have a conversation with business stakeholders about what level of availability and data durability they actually need, and what they are willing to pay for.
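An RPO is only met if backups actually keep arriving, so it is worth checking mechanically. The sketch below treats "newest backup file is younger than the RPO" as the test; the directory path and threshold are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: verify the newest backup is within the RPO. Run from cron and
# alert when it fails. The directory and RPO value are placeholders.

backup_within_rpo() {  # usage: backup_within_rpo <backup_dir> <rpo_minutes>
  local dir="$1" rpo="$2"
  # Any file modified within the last <rpo> minutes satisfies the objective.
  [ -n "$(find "$dir" -maxdepth 1 -type f -mmin "-$rpo" 2>/dev/null | head -n 1)" ]
}

# backup_within_rpo /mnt/backups/orders-db 60 || echo "ALERT: RPO at risk"
```

A check like this would have caught the misconfigured backup job in our scenario long before the disk filled, because the offsite destination would have had no fresh files.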
Backup Strategies
Backups are the foundation of disaster recovery, but they are only useful if they actually work. A backup you have never tested is a hypothesis, not a safety net.
Full backups capture the complete state of a system at a point in time. They are simple to reason about but consume significant storage and take a long time to create. Running a full backup daily is common for databases of moderate size.
Incremental backups capture only the changes since the last backup (full or incremental). They are faster to create and consume less storage, but restoring from them requires replaying the full backup plus every subsequent incremental, which takes longer and introduces more points of failure.
Continuous replication streams changes from the primary to a replica in near-real-time. PostgreSQL’s streaming replication, MySQL’s binary log replication, and cloud-native options like AWS RDS Multi-AZ all fall into this category. Continuous replication provides the lowest RPO (potentially zero) but does not protect against logical errors like accidentally dropping a table, because the drop command replicates too.
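The usual complement to replication for logical errors is point-in-time recovery from an archived write-ahead log: restore a base backup, then replay WAL up to a moment just before the mistake. A PostgreSQL sketch (the archive path and target time are placeholders):

```ini
# postgresql.conf on the primary: archive WAL for point-in-time recovery
archive_mode = on
archive_command = 'cp %p /mnt/wal-archive/%f'   # placeholder destination

# Recovery settings on the restore server (PostgreSQL 12+):
restore_command = 'cp /mnt/wal-archive/%f %p'
recovery_target_time = '2026-03-15 14:40:00'    # just before the bad command
```
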
Failover Architectures
Cold standby means you have backups stored offsite and a documented procedure for provisioning new infrastructure and restoring from those backups. This is the cheapest option but has the longest RTO, typically measured in hours.
Warm standby maintains a secondary environment that is running but not serving traffic. Data is replicated to the standby (often asynchronously), and failover involves promoting the standby and redirecting traffic. RTO is measured in minutes to tens of minutes.
Hot standby (active-active) runs your application in two or more independent sites simultaneously, with traffic distributed across all of them. If one site fails, the others absorb its traffic automatically. This provides the lowest RTO (potentially seconds) but is the most complex and expensive to operate.
Failover in Practice
A failover typically involves two actions: promoting the standby database to primary, and redirecting traffic to the surviving site. Traffic redirection can happen at the DNS level (updating a DNS record to point to the new site) or at the load balancer level (removing the failed site from the target group).
# Promote a PostgreSQL standby to primary
ssh db-standby-01 'sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main'
# Update the DNS record to point to the standby site
aws route53 change-resource-record-sets --hosted-zone-id $ZONE_ID \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"orders.example.com","Type":"A","TTL":60,"ResourceRecords":[{"Value":"10.0.2.50"}]}}]}'

DNS-based failover is simple but limited by DNS TTL: even with a 60-second TTL, some clients will cache the old address for longer. Load-balancer-based failover is faster and more reliable but requires both sites to be behind the same load balancer or a global traffic manager.
Building an Incident Response Program
An incident response program is not something you build during an outage. It is something you build during calm times so that it is ready when you need it.
On-Call Rotations
Someone must always be reachable when things break. An on-call rotation distributes this responsibility fairly across the team, typically in one-week shifts. A good on-call program has several properties.
Clear expectations. On-call engineers should know their response time SLA (for example, acknowledge a SEV-1 page within 5 minutes), what tools they need access to, and who to escalate to if they cannot resolve the issue alone.
Reasonable workload. If the on-call engineer is getting paged multiple times per night, the system has too many reliability problems to sustain a healthy rotation. Track the number of pages per shift and treat a high page count as a signal to invest in reliability improvements.
Compensation and time off. On-call work is real work. Whether your organization compensates it with extra pay, time off in lieu, or some other mechanism, acknowledge that being on call is a burden and share it equitably.
Runbooks
A runbook is a document that tells the on-call engineer exactly what to do when a specific alert fires. Good runbooks are written for someone who is groggy, stressed, and unfamiliar with the specific subsystem. They should be concise, step-oriented, and testable.
A runbook for our “database disk full” scenario might look like this:
Alert: PostgreSQL disk usage > 90%
Severity: SEV-2 (escalate to SEV-1 if database stops accepting writes)
Prerequisites:
- SSH access to the database server
- sudo privileges
Steps:
1. SSH to the affected database server.
2. Run: df -h /var/lib/postgresql
   Expected: usage above 90%.
3. Identify largest files: du -sh /var/lib/postgresql/* | sort -rh | head -10
4. Common culprits:
   - WAL files in pg_wal/: run a manual checkpoint (psql -c "CHECKPOINT;") and verify pg_wal shrinks.
   - Backup files in /var/lib/postgresql/backups/: move or delete stale snapshots (confirm they exist in offsite storage first).
   - Temp files from long-running queries: identify and terminate the query (pg_terminate_backend).
5. Verify: df -h shows usage below 80%.
6. Verify: pg_isready returns "accepting connections."
Escalation:
- If disk cannot be freed, contact the DBA on-call.
- If the database has stopped accepting writes, escalate to SEV-1 and page the IC.

Game Days
A game day is a planned exercise where you intentionally inject a failure into your system and practice responding to it. Think of it as a fire drill for your infrastructure. Game days build muscle memory, reveal gaps in your runbooks, and give engineers a chance to practice incident roles without the pressure of a real outage.
Start small. Your first game day might be as simple as having one engineer secretly stop the application process on a staging server and seeing how long it takes the team to detect and recover. As your program matures, you can progress to more ambitious exercises: failing over to a standby database, simulating a cloud region outage, or running a tabletop exercise where the team walks through a catastrophic scenario on a whiteboard without actually touching any systems.
Postmortems That Drive Change
A postmortem (also called an incident review or retrospective) is the mechanism by which your team converts a painful incident into lasting improvement. The most important attribute of a postmortem is that it is blameless. Blaming individuals discourages honesty, and without honesty, you cannot learn.
A good postmortem document includes the following sections.
Summary. One paragraph describing what happened, when, and how it was resolved. “On March 15, 2026, the order processing service was unavailable for 23 minutes due to the database server running out of disk space. The root cause was a misconfigured backup job writing snapshots to the database server’s local disk. The on-call engineer freed disk space by removing the errant backups, restoring service. The backup configuration has been corrected and a disk usage alert has been added.”
Impact. Quantify the damage. How many users were affected? How many orders were lost or delayed? What was the duration of user-facing impact? “Approximately 2,400 active users experienced errors for 23 minutes. An estimated 340 orders failed and needed to be retried by customers.”
Timeline. A timestamped record of key events: when the alert fired, when the IC was paged, what actions were taken, and when service was restored.
Root cause analysis. What conditions allowed this failure to happen? The “5 Whys” technique works well here. Why was the disk full? Because backup files were stored locally. Why were they stored locally? Because the backup script’s destination path was misconfigured. Why was it misconfigured? Because the configuration was never reviewed after the server was migrated to new hardware. Why was it not reviewed? Because the migration checklist did not include a step to verify backup destinations.
Action items. A short, prioritized list of changes that will reduce the likelihood or impact of a recurrence. Each action item should have an owner and a due date. “Fix the backup script destination (owner: J. Chen, due: March 18). Add a monitoring alert for disk usage above 80% on all database servers (owner: S. Patel, due: March 20). Add backup destination verification to the server migration checklist (owner: M. Torres, due: March 25).”
What went well. Acknowledge what worked. “The on-call engineer responded within 3 minutes of the page. The runbook for database issues was helpful. The synthetic monitor detected the outage 45 seconds before the first customer report.”
Putting It All Together
Incident response is not a one-time project; it is a practice that improves through repetition. Start by defining your severity levels and mapping them to response expectations. Assign roles, even if your team is small enough that one person fills several. Write runbooks for your most common failure modes. Set up on-call rotations with clear expectations and fair compensation. Run game days to test your processes. Conduct blameless postmortems after every significant incident. Review your DR plan at least annually and test your backups monthly.
The goal is not to eliminate all incidents (that is impossible) but to detect them quickly, respond to them calmly, resolve them efficiently, and learn from them systematically. The team that practices these fundamentals will handle a 2 AM SEV-1 with composure rather than chaos, and their systems will get more reliable over time, not less.