Imagine you are three weeks into a new job on a platform team. At 2 AM, your pager fires: the payment service is returning 503 errors and the on-call runbook says “ask Jordan.” Jordan left the company last month. You have no architecture diagram, no record of why the service was deployed the way it was, and no step-by-step guide for restarting it safely. You are, in the language of operations, on your own.
This chapter is about making sure that situation never happens on your team. Documentation is not paperwork; it is the connective tissue that lets a group of people operate a system reliably over time, across shift changes, vacations, and departures. We will examine the types of documentation that matter most, how to write them well, how teams collaborate around them, and where to go next in your career.
Every operations team has a “bus factor,” the number of people who could be suddenly unavailable before critical knowledge is lost. If only one person knows how the database failover works, the bus factor for that procedure is one. Documentation raises the bus factor by externalizing knowledge from individuals into shared, searchable, version-controlled artifacts.
Three scenarios make the case clearly. First, on-call handoffs: when one engineer ends a rotation and another begins, the incoming person inherits every alert, every half-finished migration, and every known fragility. Without written context, the handoff becomes a game of telephone. Second, onboarding: a new team member should be able to set up a development environment, understand the production architecture, and handle common alerts within their first week. If that process lives entirely in someone’s head, onboarding stretches into months. Third, incident response: during an outage at 3 AM, you do not rise to the level of your expertise; you fall to the level of your documentation. A clear runbook is the difference between a five-minute fix and a two-hour escalation.
Not all documentation serves the same purpose. Operations teams typically maintain four categories, each with a distinct audience and shelf life.
Runbooks are step-by-step procedures for recurring tasks or known incident scenarios. They answer the question “how do I do this specific thing right now?” A runbook for restarting a queue worker, for example, should include prerequisites, exact commands, expected output at each step, and rollback instructions. Runbooks are high-urgency, low-context documents: they assume the reader already knows why they need to act and just needs the mechanics.
Playbooks are broader than runbooks. A playbook describes how to respond to a class of problems (such as “elevated error rate on the API gateway”) and includes decision trees that branch based on symptoms. Where a runbook says “run this command,” a playbook says “if you see X, try A; if you see Y, try B instead.”
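A playbook's branching logic can be sketched as a small diagnostic helper. This is a minimal illustration only — the symptom names and suggested actions below are hypothetical, not a real gateway playbook:

```shell
# Minimal sketch of a playbook decision tree for "elevated API error rate".
# The symptom names and remediation steps are hypothetical examples.
triage() {
  case "$1" in
    5xx-spike)     echo "Check recent deploys; roll back if one correlates." ;;
    timeouts)      echo "Check upstream dependency latency and connection pools." ;;
    auth-failures) echo "Check token service health and certificate expiry." ;;
    *)             echo "Unknown symptom: escalate to the service owner." ;;
  esac
}

triage 5xx-spike
```

The value of the structure is that the branches are explicit: the responder matches the symptom they see and gets a concrete next action, rather than a single fixed procedure.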
Architecture documentation describes what exists and how the pieces fit together. It includes component diagrams, data flow descriptions, network topology, and dependency maps. Architecture docs answer the question “what is this system and how does it work?” They change less frequently than runbooks, but they must be kept current or they become actively misleading.
Decision records capture why a choice was made. They are sometimes called Architecture Decision Records (ADRs). A decision record explains the context at the time of the decision, the options considered, the choice made, and the expected consequences. Decision records answer the question “why is it this way?” and prevent future teams from relitigating settled questions or repeating past mistakes.
A runbook succeeds when someone who has never performed the procedure can follow it under pressure and get the expected result. That is a high bar, and meeting it requires deliberate writing.
Start with a title that names the task precisely. “Database Failover” is better than “DB Stuff.” Follow with a brief statement of when and why someone would use this runbook. List prerequisites: access credentials, VPN connections, required CLI tools and their versions. Then present the steps.
Each step should include the exact command to run (copy-pasteable, with no placeholder values that the reader must guess), the expected output or behavior, and what to do if the output differs. Here is an example that illustrates the difference between a weak and a strong runbook step.
Weak step:

```
Restart the worker. Check that it’s running.
```

Strong step:

Connect to the queue host and restart the worker process:

```sh
ssh queue-prod-01.internal
sudo systemctl restart celery-worker
```

Verify the service is active. You should see “active (running)” in the output:

```sh
sudo systemctl status celery-worker
```

If the status shows “failed,” check the journal for the most recent error:

```sh
sudo journalctl -u celery-worker -n 50 --no-pager
```
The weak version assumes familiarity and leaves the reader guessing about which host to connect to, which service manager to use, and what “check that it’s running” actually means. The strong version eliminates ambiguity.
For complex procedures, include a rollback section at the end. If the runbook describes a database migration, the rollback section should describe how to reverse it. This transforms a one-way procedure into a safe, reversible one, which dramatically lowers the stress of executing it during an incident.
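The simplest rollback mechanism is a snapshot taken before the change. A minimal sketch of the pattern — the filename and values below are hypothetical stand-ins, not a real migration:

```shell
# Minimal sketch: snapshot a config file before changing it so the change
# can be undone with one copy. Filenames and values are hypothetical.
CONF=app.conf
printf 'pool_size = 10\n' > "$CONF"    # stand-in for the live config

STAMP=$(date +%Y%m%d)
cp "$CONF" "$CONF.bak.$STAMP"          # snapshot before touching anything

printf 'pool_size = 50\n' > "$CONF"    # apply the risky change

# Rollback section of the runbook: restore the snapshot.
cp "$CONF.bak.$STAMP" "$CONF"
```

The same idea scales up: a database dump before a schema migration, or a tagged commit before a deploy, gives the rollback section something concrete to restore.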
Architecture docs give the reader a mental model of the system. The most useful format combines a diagram with concise prose that describes each component and the connections between them.
A good architecture document typically includes three layers. The first is a high-level overview: a diagram showing the major components (load balancer, application servers, database, cache, message queue, external APIs) and the arrows between them. Label the arrows with protocols and ports. The second layer is a component description for each box in the diagram: what it does, where it runs (which hosts, which cloud region), how it is deployed, and who owns it. The third layer is a data flow narrative: trace a typical request from the user’s browser through the system and back, noting where data is transformed, stored, or handed off.
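Laid out as a document, the three layers can fit in a short skeleton. Everything below is a hypothetical example, not a real system:

```markdown
# Payments Platform — Architecture Overview
Last verified: 2024-06-01 · Owner: platform-team

## 1. High-level overview
(diagram: browser → load balancer → API servers → PostgreSQL / Redis;
arrows labeled HTTPS:443, TCP:5432, TCP:6379)

## 2. Components
- **API servers** — stateless app, 3 hosts in us-east-1, deployed via CI,
  owned by platform-team.
- **PostgreSQL** — primary billing store, managed instance, nightly snapshots.

## 3. Data flow
A checkout request enters at the load balancer, is routed to an API server,
which validates the session against Redis, writes the order to PostgreSQL,
and publishes an event to the queue for fulfillment.
```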
Architecture docs go stale quickly, so attach a “last verified” date and assign an owner responsible for reviewing them quarterly. A diagram from two years ago that no longer matches reality is worse than no diagram at all, because it builds false confidence.
Decisions are easy to make and hard to remember. Six months from now, nobody will recall why the team chose PostgreSQL over MySQL, or why the deployment pipeline pushes to a staging environment before production. Decision records preserve that reasoning.
The ADR format, popularized by Michael Nygard, is intentionally lightweight:
Title. A short noun phrase: “Use PostgreSQL for the billing database.”
Status. Proposed, accepted, deprecated, or superseded.
Context. What situation or problem prompted this decision? What constraints applied?
Decision. What was chosen, stated in active voice: “We will use PostgreSQL 15 on RDS.”
Consequences. What follows from this decision, both positive and negative? For example: “We gain strong JSON support and mature tooling. We accept the operational cost of managing a relational database and its backups.”
Store ADRs in a docs/decisions/ directory in the relevant repository, numbered sequentially (0001-use-postgresql.md, 0002-adopt-terraform.md). Because they live in version control, they accumulate naturally and provide a searchable history of the team’s reasoning. When a decision is revisited, write a new ADR that references and supersedes the old one rather than editing the original. This preserves the historical context.
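Putting the format together, a complete ADR fits in a dozen lines. The details below are a hypothetical example:

```markdown
# 0001: Use PostgreSQL for the billing database

Status: Accepted

## Context
Billing requires transactional integrity and flexible reporting queries.
The team has prior operational experience with relational databases.

## Decision
We will use PostgreSQL 15 on RDS.

## Consequences
We gain strong JSON support and mature tooling. We accept the operational
cost of managing a relational database and its backups.
```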
Documentation is a team activity, and teams need structure to collaborate effectively. Four patterns recur across well-run operations teams.
Tickets and issues are the unit of trackable work. Whether you use GitHub Issues, Jira, or a simple shared board, the principle is the same: every piece of planned work should have a ticket that states what needs to happen, why, and what “done” looks like. Tickets create accountability and make it possible to see, at a glance, what the team is working on. They also create a searchable archive: six months from now, you can find the ticket where the team discussed and resolved a tricky networking problem.
Pull requests are the mechanism for reviewing changes before they land. A good pull request is small (under 400 lines of diff when possible), has a descriptive title, and includes a summary that explains the motivation for the change. Reviewers should focus on correctness, clarity, and maintainability. Code review is not adversarial; it is a collaborative process that catches mistakes, spreads knowledge, and improves the overall quality of the codebase and its documentation.
Pair work (sometimes called pair programming, but equally applicable to operations tasks) puts two people on the same problem at the same time. One person drives while the other navigates, asking questions and catching errors. Pair work is especially valuable for high-risk operations (database migrations, network changes) and for onboarding, where the new person drives and the experienced person navigates.
Change logs and handoff notes bridge the gap between shifts or rotations. At the end of an on-call rotation, the outgoing engineer should leave a brief summary: what alerts fired, what was done about them, what is still in progress, and what the incoming engineer should watch for. This can be as simple as a few paragraphs in a shared document or a pinned message in a chat channel. The key is consistency: if the team expects a handoff note every rotation, it becomes a habit rather than an afterthought.
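A handoff note does not need to be elaborate to be useful. A sketch of one — all names, dates, and ticket numbers below are hypothetical:

```markdown
## On-call handoff — week of 2024-06-03 (hypothetical example)

**Alerts that fired:** two pages for api-gateway 5xx (Tue, Thu); both resolved
by restarting a stuck worker — see docs/runbooks/restart-celery-worker.md.

**In progress:** the orders-table migration is half-applied; rollback steps
are in ticket OPS-1234 (example ticket number).

**Watch for:** disk usage on queue-prod-01 is at 78% and climbing slowly.
```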
Documentation should live alongside the code it describes. This approach, often called “docs-as-code,” means writing documentation in Markdown (or a similar plain-text format), storing it in Git, and reviewing changes through pull requests just like any other code change.
The benefits are significant. Version control gives you a history of every change: who wrote it, when, and why (via the commit message). Pull requests give you a review process: someone else reads the documentation before it merges, catching errors, ambiguities, and gaps. Branching lets you draft large documentation changes without disrupting the current version. And co-locating docs with code means they are more likely to be updated when the code changes, because the developer is already in the repository.
Organize documentation predictably. A common structure places runbooks in docs/runbooks/, architecture docs in docs/architecture/, and decision records in docs/decisions/. Use descriptive filenames (restart-celery-worker.md, not runbook3.md) so that readers can find what they need without opening every file.
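The layout described above can be scaffolded in a couple of commands inside a repository; the runbook filename is a hypothetical example:

```shell
# Create the documentation layout described above.
mkdir -p docs/runbooks docs/architecture docs/decisions

# A descriptive filename for the first runbook (hypothetical example):
touch docs/runbooks/restart-celery-worker.md

ls docs
```

Because the directories live in the repository, the structure is versioned along with everything else and shows up in every clone.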
GitOps is a methodology for managing infrastructure and application deployments using Git as the single source of truth. In a GitOps workflow, all changes to the system — whether to application code, configuration files, or infrastructure definitions — are made through Git commits and pull requests. An automated agent (such as Argo CD or Flux) continuously reconciles the live system state with the desired state stored in the repository, applying changes automatically when the repository is updated.
The practical consequence is that every change is reviewed, versioned, and auditable before it reaches production. Rolling back means reverting a commit, not manually undoing changes on a server. GitOps is a natural extension of the infrastructure-as-code and CI/CD practices covered earlier in the course.
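As an illustration of the pattern, an Argo CD `Application` manifest points the reconciliation agent at a path in Git; the repository URL, names, and namespaces below are hypothetical:

```yaml
# Hypothetical Argo CD Application: keep the cluster in sync with the
# manifests stored under deploy/payments in the Git repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infra.git
    targetRevision: main
    path: deploy/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift on the cluster
```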
As architectures decompose into microservices, inter-service communication becomes a significant operational challenge. A service mesh is a dedicated infrastructure layer — deployed as lightweight sidecar proxies alongside each service — that handles communication concerns including load balancing, retries, circuit breaking, mutual TLS between services, and distributed tracing.
Istio and Linkerd are the most widely used open-source service meshes. Rather than requiring each development team to implement these concerns in application code, a service mesh provides them as platform-level features, which makes the system easier to observe and operate at scale.
Monitoring tells you what your system is doing under real load. Performance testing tells you what it can handle before it breaks. Common approaches include:
Load testing: simulate realistic traffic levels to verify that the system meets its SLOs under expected conditions. Tools include k6, Locust, and Apache JMeter.
Stress testing: push the system beyond its expected limits to identify the failure mode and the breaking point.
Soak testing: run the system at moderate load for an extended period (hours to days) to catch problems like memory leaks, connection pool exhaustion, or disk growth that only manifest over time.
Running performance tests before a major release, and comparing results against a baseline, is one of the most effective ways to catch regressions before users experience them.
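Whatever tool generates the load, the analysis step is the same: aggregate the measured latencies and compare them against the objective. A minimal sketch using awk over a file of recorded response times — the values and the 500 ms threshold are hypothetical:

```shell
# Minimal sketch: check recorded request latencies (in seconds) against
# a 500 ms objective. The data file and values are hypothetical examples.
printf '%s\n' 0.12 0.34 0.08 0.61 0.22 > latencies.txt

# Report count, mean, worst case, and how many requests exceeded the SLO.
awk '{ sum += $1; if ($1 > max) max = $1; if ($1 > 0.5) slow++ }
     END { printf "n=%d mean=%.3f max=%.3f over_slo=%d\n", NR, sum/NR, max, slow }' \
    latencies.txt
```

Running the same aggregation against a stored baseline file is what turns a one-off test into a regression check.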
System administration has evolved from a single role into a family of related specializations. Understanding the landscape helps you choose where to invest your energy.
The system administrator role remains the foundation. Sysadmins manage servers, networks, storage, and user accounts. They handle backups, patches, and capacity planning. In smaller organizations, the sysadmin does everything; in larger ones, the role tends to specialize.
DevOps engineers bridge the gap between development and operations. They build and maintain CI/CD pipelines, automate infrastructure provisioning, and work closely with developers to improve deployment velocity and reliability. The DevOps role emphasizes automation, collaboration, and iterative improvement.
Site Reliability Engineers (SREs) apply software engineering practices to operations problems. The SRE model, popularized by Google, defines reliability in terms of Service Level Objectives and error budgets, and treats operational work as engineering work. SREs write code to eliminate manual labor (called “toil”), build monitoring and alerting systems, and participate in on-call rotations.
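An error budget is simple arithmetic: a 99.9% availability SLO over a 30-day month leaves 0.1% of the month, about 43 minutes, as the allowance for downtime. As a quick calculation:

```shell
# Error budget for a 99.9% availability SLO over a 30-day month, in minutes.
# 30 days * 24 h * 60 min = 43,200 minutes; the budget is the 0.1% remainder.
awk 'BEGIN { total = 30 * 24 * 60; slo = 0.999;
             printf "budget_minutes=%.1f\n", total * (1 - slo) }'
```

When the budget is spent, the model says to prioritize reliability work over new features until the service is back within its objective.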
Platform engineers build internal platforms that other teams use to deploy and operate their services. Instead of managing individual applications, platform engineers create self-service tools, shared infrastructure (such as Kubernetes clusters or CI/CD systems), and developer-facing APIs. The goal is to make it easy for application teams to do the right thing by default.
Cloud architects design the overall structure of an organization’s cloud infrastructure. They make decisions about which cloud services to use, how to organize accounts and networks, how to manage costs, and how to meet security and compliance requirements. Cloud architects tend to work at a higher level of abstraction, focusing on patterns and standards rather than individual systems.
These roles are not rigid boxes. In practice, most operations professionals blend skills from several of them, and career paths frequently cross between them. What they share is a commitment to building systems that work reliably, and the documentation practices we have discussed in this chapter are common to all of them.
Technology changes quickly, and a career in system administration requires continuous learning. Several avenues are worth pursuing.
Certifications validate your knowledge and can open doors, especially early in your career. The AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), Red Hat Certified System Administrator (RHCSA), and CompTIA Linux+ are widely recognized. Certifications are most valuable when they align with the technology you actually use; collecting certifications for their own sake is less useful than deepening expertise in your current stack.
Communities provide peer learning and professional connections. Local meetups, online forums (such as the SRE and DevOps subreddits), and Slack or Discord communities (such as the Rands Leadership Slack or the Kubernetes Slack) are places to ask questions, share knowledge, and learn from practitioners at other organizations. Engaging with a community also helps you calibrate your skills against the broader industry.
Conferences concentrate learning into a few intense days. SREcon, KubeCon, LISA, and regional DevOpsDays events feature talks from practitioners solving real problems. Many conferences publish their talks online afterward, making them accessible even if you cannot attend in person.
Open source contribution is one of the most effective ways to deepen your skills. Contributing to projects you use (whether by fixing bugs, improving documentation, or adding features) exposes you to code review, collaboration norms, and engineering standards beyond your own organization. It also creates a public portfolio of your work that is visible to future employers.
Writing and teaching solidify your understanding. Writing a blog post about a problem you solved, giving a talk at a local meetup, or mentoring a junior colleague forces you to organize your thoughts and identify gaps in your own knowledge. The act of explaining something clearly is one of the best tests of whether you truly understand it.
Return to the scenario from the beginning of this chapter. This time, you are three weeks into the job, your pager fires at 2 AM, and things are different. You open the runbook for the payment service, which lists the exact steps to diagnose and restart it. The architecture diagram shows you which upstream services depend on it and which database it connects to. A decision record explains why the service runs on a dedicated host instead of in the shared Kubernetes cluster (it was a latency-sensitive choice made eight months ago). The on-call handoff note from your colleague mentions that this service had a brief hiccup two days ago and suggests checking the connection pool settings first.
You follow the runbook, verify the fix, update the handoff note for the next person, and go back to sleep. The incident that could have been a two-hour escalation took twelve minutes.
That is the power of good documentation. It is not glamorous work, and it is rarely celebrated in the moment, but it is the foundation on which reliable operations are built. Every runbook you write, every decision you record, every diagram you keep current is a gift to your future self and to the teammates who will follow you.