Incident Response Case Study

This activity puts into practice the concepts from the Incident Response and Postmortems lecture. Instead of analyzing a fictional scenario, you will walk through the GitLab.com database outage of January 31, 2017, using GitLab’s initial incident report and public postmortem as your sources. By the end, you will have a working incident-analysis document that records every framework decision this exercise asks you to make: severity, phases, mitigations, recovery criteria, communications, and a blameless postmortem skeleton with owned action items.

What You Will Need

A blank document in your preferred editor, note-taking app, or word processor.
The Incident Response and Postmortems lecture, open in another tab or on a second screen for quick reference. You will look at it often.

Before you begin, set up six numbered sections in your document, one per step below. Leave a clearly separated space at the end for the final postmortem.

The Case Study

On January 31, 2017, GitLab.com experienced a major outage that lasted over eighteen hours and resulted in permanent data loss. The team published an unusually candid public postmortem afterward, which is part of why this incident is still used to teach incident response years later.

The Situation

GitLab.com runs on a PostgreSQL database with a primary-replica setup. The primary handles all write traffic. The replica, referred to as the secondary in the rest of this case study, is kept in sync via streaming replication: PostgreSQL continuously ships the primary’s write-ahead log (WAL), the ordered record of database changes, to the secondary, which replays it so the team can fail over if the primary is lost. On the afternoon of January 31, the team had been investigating recurring load issues they initially suspected were driven by spam activity. GitLab later concluded that a background job deleting a mis-flagged employee account also contributed to the spike.

The Timeline

The following events are reconstructed from GitLab’s initial incident report and later public postmortem. Some timestamps are approximate.

Time (UTC)	Event
17:20	An engineer manually takes an LVM (Logical Volume Manager) snapshot, a point-in-time disk copy, of the primary so a fresh copy can be loaded into staging. It is not intended as a disaster-recovery backup, but it will become the newest usable recovery artifact.
~19:00	GitLab.com begins experiencing elevated database load initially suspected to be spam. Comments and merge request actions slow noticeably for users.
~23:00	An engineer notices that the PostgreSQL secondary is severely lagging. The WAL segments, the chunked files that carry the write-ahead log used for replication, needed to catch up have already been recycled on the primary, so replication cannot recover on its own.
~23:00 to ~23:30	Engineers attempt to rebuild the secondary. They wipe its data directory and run `pg_basebackup`, PostgreSQL’s built-in tool for copying a full base backup from a primary to seed a secondary, which appears to hang. They raise `max_wal_senders`, the limit on replication connections from the primary, from 3 to 32. PostgreSQL refuses to restart because `max_connections` is set to 8000 (too high for the host’s resources); they reduce it to 2000 and try again.
~23:30	An engineer intends to clear the data directory on the secondary in preparation for another rebuild attempt. They accidentally run the command on the primary instead. The process is terminated within a few seconds, but by then about 300 GB of data has already been removed.
~23:30 onward	Engineers search for a backup. They discover that the regular `pg_dump` job, PostgreSQL’s logical backup tool, has been silently failing because it is using PostgreSQL 9.2 against a PostgreSQL 9.6 database, and that the failure emails had been rejected by the receiving mail server for DMARC, an email-authentication policy. Azure disk snapshots were not enabled on the database hosts. The only recent usable recovery artifact is the 17:20 LVM snapshot that had been copied into staging.
+18 hours	The team restores from the 17:20 LVM snapshot. Recovery is slow because the usable copy lives on lower-performance staging storage.
February 1, ~18:00	GitLab.com service is restored.
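
The ~23:00 row turns on one mechanical fact: a lagging secondary can only catch up while the primary still holds the WAL segments it needs. The toy model below sketches that condition. The `Primary` class, the segment numbers, and the fixed retention count are illustrative assumptions, not GitLab’s configuration; in real PostgreSQL, retention depends on server settings and checkpoint activity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primary:
    """Toy primary that keeps only its most recent WAL segments (hypothetical model)."""
    newest_segment: int     # number of the segment currently being written
    segments_retained: int  # assumed fixed retention; older segments are recycled

    @property
    def oldest_retained(self) -> int:
        return self.newest_segment - self.segments_retained + 1


def secondary_can_catch_up(primary: Primary, next_needed: int) -> bool:
    """True if the segment the secondary needs next still exists on the primary."""
    return next_needed >= primary.oldest_retained


primary = Primary(newest_segment=1000, segments_retained=64)  # made-up numbers
print(secondary_can_catch_up(primary, next_needed=990))  # slightly behind
print(secondary_can_catch_up(primary, next_needed=900))  # too far behind
```

The second call returns `False`, which corresponds to the situation in the timeline: once the needed segments are gone, the secondary has to be reseeded from a fresh base backup.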

What Was Lost

Data created between 17:20 and about 23:30 UTC on January 31 was permanently lost: approximately 5,000 projects, 5,000 comments, and 700 new user accounts. Git repositories and wikis were stored separately and survived intact.
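
The two durations that matter most in this case study follow directly from the timeline. A minimal sketch of the arithmetic, using the approximate timestamps given above (the variable names are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Approximate timestamps from the timeline above (all UTC).
snapshot_taken = datetime(2017, 1, 31, 17, 20, tzinfo=timezone.utc)
primary_wiped = datetime(2017, 1, 31, 23, 30, tzinfo=timezone.utc)
service_restored = datetime(2017, 2, 1, 18, 0, tzinfo=timezone.utc)

# Everything written after the snapshot and before the deletion was lost.
data_loss_window = primary_wiped - snapshot_taken
# Time from the deletion until GitLab.com was serving again.
time_to_restore = service_restored - primary_wiped


def in_hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


print(f"Data-loss window: about {in_hours(data_loss_window):.1f} hours")
print(f"Deletion to restored service: about {in_hours(time_to_restore):.1f} hours")
```

Because the inputs are approximate, treat the results as roughly six hours of lost writes and roughly eighteen hours to restore, not as exact figures.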

Step 1: Establish Severity

Severity is not a fixed property of an incident; it changes as new information arrives. Your first job is to assign severity at two different moments and notice how it shifts.

In section 1, label the heading Severity. Underneath, write two rows: At ~19:00 and At ~23:30.
Use the severity scheme (SEV-1 through SEV-4). For the ~19:00 row, write the severity you would assign at the moment GitLab.com first shows user-visible database slowness, and a one-sentence justification grounded in user impact and SLO burn.
For the ~23:30 row, write the severity you would assign in the minute after the primary deletion is discovered, again with a one-sentence justification. Pay attention to the data-integrity dimension; it is different from latency.
Underneath both rows, write a single sentence answering: when in the timeline did severity need to change, and what signal would have triggered the change in a well-run response?
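
One way to internalize that severity is re-assessed, not assigned once, is to treat it as a function of what responders currently know. The sketch below does that with an invented, simplified rubric: the `Signals` fields and the mapping to SEV levels are assumptions for illustration, SLO burn is left out for brevity, and the lecture’s SEV definitions take precedence. The sample observations are generic and are not an answer key for this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """What responders know right now (hypothetical fields)."""
    user_visible_impact: bool
    core_workflows_blocked: bool
    data_loss_suspected: bool


def assess_severity(signals: Signals) -> str:
    """Map current signals to a SEV level using an assumed, simplified rubric."""
    if signals.data_loss_suspected:
        return "SEV-1"  # in this sketch, data integrity outranks latency
    if signals.core_workflows_blocked:
        return "SEV-2"
    if signals.user_visible_impact:
        return "SEV-3"
    return "SEV-4"


# Re-run the assessment whenever a new signal arrives, not once per incident.
history = [
    Signals(user_visible_impact=False, core_workflows_blocked=False, data_loss_suspected=False),
    Signals(user_visible_impact=True, core_workflows_blocked=False, data_loss_suspected=False),
    Signals(user_visible_impact=True, core_workflows_blocked=True, data_loss_suspected=True),
]
for observed in history:
    print(assess_severity(observed))
```

The design point is that the function is cheap to call again: each new observation is a prompt to re-run it, which is the habit the final question in this step is probing.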

Step 2: Map the Timeline to Phases

The five phases (detect, triage, mitigate, resolve, learn) are postures, not stopwatch segments. Your second job is to mark when the response shifts from one posture to another, and to notice where it should have shifted but did not.

In section 2, add the heading Phases. Create a table with two columns: Time and Phase. Copy the timestamps from the case study into the Time column.

For each timestamp, write the phase the response was in (or should have been in) given the posture definitions. The first two rows are filled in below as a worked example:

Time (UTC)	Phase
17:20	(pre-incident: manual snapshot for staging; not yet a response)
~19:00 (load spike noticed)	Detect
~23:00 (replication lag confirmed)	?
~23:00 to ~23:30 (rebuild attempts)	?
~23:30 (primary deletion)	?
~23:30 onward (search for backups)	?
+18 hours (LVM restore in progress)	?
Feb 1, ~18:00 (service restored)	?

Underneath your table, mark with arrows the two points where the team’s posture should have shifted but did not. For each, write a one-sentence note explaining what the team kept doing when they should have stopped doing it.

Step 3: Choose a Mitigation

You are now the on-call engineer at 19:00, when GitLab.com first shows user-visible database slowness. You hold the Incident Commander role until further notice. For this exercise, the Operations Lead and Communications Lead exist only in your head: you must decide what each of them would be doing.

In section 3, add the heading Mitigation Plan. List the five mitigation tools on separate lines: rollback, feature flags, traffic shifting, scaling, and graceful degradation.
For each of the five tools, write a one-line answer to the question: does this tool apply here, and if so, what specifically would it look like in this incident? Some of the answers will be “no, because…” and that answer is valuable; it tells you what part of the mitigation toolkit was simply unavailable to the team.
Of the tools that do apply, mark the one you would try first at 19:00. For that choice, write its reversibility, blast radius, and time-to-undo in three short phrases. Use those three principles to judge whether a mitigation is sound.
Before executing your chosen mitigation, you need to define what success looks like. Write your recovery criteria in one sentence: a specific, measurable statement of when you would consider this incident mitigated. A recovery criterion that says “things look better” is not a recovery criterion.
Finally, write one sentence comparing your plan to what actually happened. Did the GitLab team define recovery criteria before they committed to the secondary rebuild path? What evidence in the timeline tells you?
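
A useful test of whether a recovery criterion is measurable is whether it can be written as a predicate over a metric. A minimal sketch, assuming a hypothetical feed of per-minute p95 latency readings; the 500 ms threshold and the 15-reading window are placeholders, not GitLab’s values or SLOs:

```python
from typing import Sequence


def is_mitigated(
    p95_latency_ms: Sequence[float],
    threshold_ms: float = 500.0,   # placeholder threshold
    sustained_readings: int = 15,  # placeholder: 15 consecutive per-minute readings
) -> bool:
    """True when the most recent readings are all under the threshold."""
    if len(p95_latency_ms) < sustained_readings:
        return False  # not enough evidence yet
    recent = p95_latency_ms[-sustained_readings:]
    return all(value < threshold_ms for value in recent)


still_recovering = [900.0] * 5 + [300.0] * 10
recovered = [900.0] * 5 + [300.0] * 15
print(is_mitigated(still_recovering))
print(is_mitigated(recovered))
```

If your one-sentence criterion cannot be turned into something of this shape, with a named metric, a threshold, and a duration, it is probably still closer to “things look better” than to a criterion.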

Step 4: Draft the Communications

By 19:30, the response has been running for thirty minutes and the Communications Lead role needs output. You are still the IC, so the communications below are drafts you would hand to a Comms Lead. The discipline is to produce them anyway, in writing, before you need them.

In section 4, add the heading Comms. Under it, create two boxes, one per audience: External (status page) and Internal (incident channel).
In the External box, write a 2 to 4 sentence status page update suitable for posting 30 minutes into the response, when you know there is elevated database load and user-visible slowness but do not yet know the cause. Apply four rules: describe what customers observe, do not speculate about cause, say what you are doing about it, and set the time of the next update.
Read your draft aloud to yourself once. Revise any phrase that violates one of the four rules. Notice especially any internal vocabulary (service names, hostnames, or replication jargon such as “WAL segments”) that leaked into the customer-facing version.
In the Internal box, write a 3 to 5 sentence pinned channel summary suitable for the same moment, this time for engineers. Include the current severity, the IC’s name (yours), the current customer impact, the leading hypothesis, the mitigation in progress, and the time of the next update.
Underneath both boxes, write one sentence answering: what fact appears in your internal version that you intentionally kept out of the external version, and why?
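
Two of the checks in this step are mechanical enough to automate: leaked internal vocabulary and a missing next-update time. The sketch below is a rough linter along those lines; it also flags a few speculation phrases. The word lists and the `lint_status_update` helper are hypothetical, and the remaining rules (describing what customers observe, saying what you are doing) still need a human reader.

```python
import re

# Hypothetical word lists; extend them with your own team's internal vocabulary.
INTERNAL_JARGON = ("wal", "pg_basebackup", "replication", "secondary", "primary")
SPECULATION = ("probably", "likely", "we think", "we suspect", "might be")
# Looks for "next update" followed somewhere by a clock time such as 20:00.
NEXT_UPDATE = re.compile(r"next update\b.*?\b\d{1,2}:\d{2}\b", re.IGNORECASE | re.DOTALL)


def contains_phrase(text: str, phrase: str) -> bool:
    """Whole-word, case-insensitive match."""
    return re.search(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE) is not None


def lint_status_update(draft: str) -> list[str]:
    problems = [f"internal jargon: {term}" for term in INTERNAL_JARGON if contains_phrase(draft, term)]
    problems += [f"speculation: {phrase}" for phrase in SPECULATION if contains_phrase(draft, phrase)]
    if not NEXT_UPDATE.search(draft):
        problems.append("no time given for the next update")
    return problems


# A deliberately weak draft, to show what the linter reports.
bad_draft = "The secondary is lagging, probably because of spam. We are looking into it."
for problem in lint_status_update(bad_draft):
    print(problem)
```

Running a draft through a check like this does not replace reading it aloud; it only catches the slips that are easiest to miss when you already know the internal vocabulary.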

Step 5: Time-Box Under Uncertainty

Time-boxing is a structural check against the System 1 failure mode of staying anchored on a hypothesis that is not yielding. In this incident, the team had already been working under load pressure for hours when the secondary rebuild began around 23:00, and the `pg_basebackup` path ran on long enough to set up the deletion mistake at about 23:30. A different IC posture could have forced an earlier re-evaluation.

In section 5, add the heading Time-Box. Before answering, remind yourself what a time-box does: it sets a fixed deadline for an investigation so the team must stop and re-evaluate at the boundary.
Imagine yourself as IC at 23:10, ten minutes into the rebuild attempt that is not progressing. Write down the explicit time-box you would have set for that investigation: a specific clock time after which you would force a re-evaluation regardless of progress.
Below your time-box, write the next action that would have been taken when the box expired. Pre-commit to that next action before the box starts, so the responders are not deciding under sunk-cost pressure at the moment it expires.
Now identify one earlier moment in the timeline where a time-box would have prevented escalation. Record the time in your document and write a sentence describing what would have changed.
Finally, write one sentence answering: what cognitive trap is time-boxing protecting against, and why is the IC the right person to enforce it rather than the Operations Lead? Use the System 1 / System 2 vocabulary here.
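
The two parts of a time-box, the deadline and the pre-committed next action, can be captured in a small record so that neither is renegotiated when the box expires. A minimal sketch; the `TimeBox` class, the example hypothesis, and the clock times are all hypothetical and unrelated to the GitLab timeline:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimeBox:
    hypothesis: str
    started_at: datetime
    duration: timedelta
    next_action: str  # decided before the box starts, not when it expires

    @property
    def deadline(self) -> datetime:
        return self.started_at + self.duration

    def check(self, now: datetime) -> str:
        if now < self.deadline:
            minutes_left = int((self.deadline - now).total_seconds() // 60)
            return f"Continue: {minutes_left} min left on '{self.hypothesis}'"
        return f"Expired: stop and re-evaluate. Next action: {self.next_action}"


start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)  # arbitrary example time
box = TimeBox(
    hypothesis="slow queries come from one noisy client",
    started_at=start,
    duration=timedelta(minutes=20),
    next_action="escalate to the database on-call and revisit the mitigation list",
)
print(box.check(start + timedelta(minutes=5)))
print(box.check(start + timedelta(minutes=20)))
```

The record is frozen on purpose: once the box has started, neither the deadline nor the next action can be edited, which mirrors the pre-commitment this step asks for.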

Step 6: The Blameless Postmortem

The technical incident ended on February 1. The learning work began afterward. This is the part of incident response that distinguishes teams that improve from teams that simply survive. Your final task is to produce the skeleton of the postmortem the GitLab team would have written, in your document, with your name on it.

6a: Five Whys

The 5 Whys technique is simple to describe and hard to do well. The discipline is to keep asking “why?” until you reach a cause that is structural (something about the system, the tools, the process, or the culture), not merely proximate (“the engineer made a mistake”). A 5 Whys that stops at human error has not yet found anything actionable.

In section 6a, add the heading 5 Whys. Create two columns or clearly separated lists: “Chain A: The primary was wiped” and “Chain B: Recovery took 18 hours.”
In Chain A, start with the symptom and ask “why?” five times. The first level is given: the primary was wiped because the engineer ran `rm -rf` on the wrong server. Continue from there. Resist the urge to end at “the engineer was tired” or “they made a mistake”; push past those answers to what the system, the tooling, or the operational practice allowed.
In Chain B, do the same, starting from “recovery took 18 hours because the only usable backup was about 6 hours old and lived on slow storage.” Again, push past “backups were broken” to what allowed the broken state to persist undetected for so long.
Highlight or otherwise mark the final answer in each chain. The marked answers should be statements about the system, not about an individual. If your final answer names a person, push one level further.

6b: Action Items

A postmortem with elegant analysis and no action items has produced nothing. Each action item should be specific enough to be a ticket, small enough to plausibly complete, and pointed at the structural causes you just identified.

Below your 5 Whys, label section 6b Action Items. Number 1 through 5. For each item, leave room to write one owner and one due date.
Write three action items pointed at your Chain A structural cause. For each, label it as a system change (a change to code, tooling, or infrastructure), a process change (a change to runbooks, checklists, or workflow), or a cultural change (a change to norms, expectations, or training). Then assign exactly one owner role or team and give it a due date.
Write two more action items pointed at your Chain B structural cause, again labeled, each with exactly one owner and a due date.
Look at the distribution of system / process / cultural labels in your five action items. Write one sentence answering: if this distribution describes how your team would actually invest in preventing the next incident, where is the weight, and is that the right place for it?
At the end of the section, write one sentence naming who will check these five items in three weeks. Then sign your name and write today’s date. This is now your postmortem skeleton: a written artifact you can return to and refine. The signature marks it as real work; the three-week follow-up is what keeps the action items from becoming theatrical.
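
If you keep your action items in a structured form, the label distribution and the one-owner rule can be checked mechanically. A minimal sketch, assuming a hypothetical `ActionItem` record; the titles, owners, and dates are placeholders to replace with your own, and the single-owner check is a rough heuristic:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

ALLOWED_LABELS = {"system", "process", "cultural"}


@dataclass(frozen=True)
class ActionItem:
    title: str
    label: str  # one of ALLOWED_LABELS
    owner: str  # exactly one role or team
    due: date


def validate(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if item.label not in ALLOWED_LABELS:
        problems.append(f"{item.title}: unknown label {item.label!r}")
    # Heuristic only: a comma or " and " usually means more than one owner.
    if not item.owner.strip() or "," in item.owner or " and " in item.owner:
        problems.append(f"{item.title}: needs exactly one owner")
    return problems


# Placeholder items; replace titles, owners, and dates with your own.
items = [
    ActionItem("Chain A item 1", "system", "Team X", date(2030, 1, 15)),
    ActionItem("Chain A item 2", "process", "Team Y", date(2030, 1, 22)),
    ActionItem("Chain A item 3", "system", "Team X", date(2030, 2, 1)),
    ActionItem("Chain B item 1", "system", "Team Z", date(2030, 1, 15)),
    ActionItem("Chain B item 2", "cultural", "Team Y", date(2030, 2, 15)),
]

for item in items:
    for problem in validate(item):
        print(problem)

label_counts = Counter(item.label for item in items)
print(dict(label_counts))
```

The counts only describe where the weight sits; whether that is the right place for it is still the judgment call this section asks you to write down.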

Going Further

You have walked the GitLab incident through every framework in this activity in one working document. The natural next step is to read the source material in full: GitLab’s initial incident report captures the live response, and the later official postmortem reconstructs the outage and recovery in more detail. Reading them after doing this analysis will let you compare your framework decisions to the ones the actual team made. Plan about thirty minutes for a careful read.

After that, the most direct way to build on this exercise is to pick a different public postmortem and run the same six-step framework against it in a fresh document. The danluu/post-mortems repository is a widely used index of public postmortems; the Cloudflare 2019 regex outage is rich enough to support the same analysis and short enough to read in one sitting. The framework is the same; the failure modes are different, and the contrast is where the concepts settle into operational instinct.

PagerDuty’s Incident Response guide is also a great next step. It is more prescriptive than the Google SRE book, with specific checklists for each role and phase.