Configuration Management
Imagine you are responsible for three web servers that must serve identical content behind a load balancer. On day one you log into each machine, install nginx, copy over a configuration file, and start the service. Everything works. A month later, a colleague patches one server but forgets the other two. Someone else tweaks a timeout setting on the third machine to debug a problem, then never reverts it. Before long, the three “identical” servers have quietly diverged. This phenomenon is called configuration drift, and it is one of the most common sources of mysterious, hard-to-reproduce bugs in production environments.
A server that has been hand-configured over time, accumulating one-off changes that nobody fully remembers, is sometimes called a snowflake server. Like an actual snowflake, it is unique, and that uniqueness is a liability. If it fails, recreating it from memory is slow and error-prone.
Configuration management is the discipline of defining and repeatedly enforcing the desired state of running systems. Terraform, as covered in the Infrastructure as Code lecture, is excellent at creating infrastructure objects such as virtual machines, networks, and databases. Shell scripts, as covered in the Shell Scripting lecture, are excellent at automating a sequence of commands on one machine. Configuration management sits between those layers. It answers the day-two question: once a server exists, how do you keep it configured correctly next week, next month, and after the third emergency change at 2 AM?
This lecture uses Ansible to explain the broader ideas behind configuration management: desired state, convergence, idempotency, inventories, reconciliation loops, reuse, secrets handling, and the tradeoffs between agentless push tools and agent-based pull tools. If the Shell Scripting lecture made the case for Bash as the right tool for single-machine automation, this lecture explains what takes over when the fleet grows beyond what any script can reliably manage.
Configuration Management as a Discipline
Configuration management exists because infrastructure changes in two directions at once. You intentionally change servers by installing packages, updating configuration files, rotating credentials, and deploying applications. At the same time, servers also change accidentally through manual fixes, forgotten experiments, package upgrades, and partial failures. A configuration management system exists to keep the intentional changes and reject the accidental ones.
The two most important ideas in this discipline are convergence and idempotency. Convergence means the machine moves toward a declared target state each time the automation runs. Idempotency means re-running the automation does not create extra damage, duplicate work, or unintended side effects. Those properties are not unique to Ansible. They are the reason configuration management is a distinct category rather than just “better shell scripts.” A shell script can be written idempotently, but that safety is something you must design carefully every time. A good configuration management system makes that the default shape of the work.
Common tools in this space include Puppet, Chef, Salt, and Ansible. They are all trying to solve the same underlying problem, but they make different tradeoffs about control flow, agent installation, central services, and how often enforcement happens.
The table below is worth reading as a comparison of operating models, not as a scoreboard. Different organizations land on different tools because their operational constraints differ.
| Tool | Typical model | Why teams choose it | Typical tradeoff |
|---|---|---|---|
| Puppet | Pull, agent-based | Continuous enforcement and mature enterprise policy patterns | Requires agents and more supporting infrastructure |
| Chef | Pull, agent-based, Ruby-heavy | Very expressive model and strong ecosystem in some enterprises | Steeper learning curve and more abstraction overhead |
| Salt | Push and pull, flexible transport | Fast remote execution and broad targeting options | Larger conceptual surface area |
| Ansible | Push, mostly agentless over SSH | Low barrier to entry and readable YAML | Enforcement is usually operator-initiated unless you schedule it |
The push versus pull distinction matters operationally. A push system usually starts from a control node that opens connections to targets and applies changes now, because an operator or pipeline told it to. A pull system usually installs an agent that periodically asks a central service, “What should I be doing?” Pull models are attractive when you want constant background enforcement. Push models are attractive when you want minimal software on the target and explicit operator control. Neither is inherently more correct. They simply optimize for different environments.
The push-versus-pull distinction has a direct consequence that goes beyond tool selection: enforcement frequency. A pull-based agent on a 15-minute check interval gets 96 chances per day to detect and silently correct drift without any operator action. A push-based model like Ansible enforces state only when someone, or some scheduled process, actually runs the playbook. Many Ansible shops go days or weeks between full runs, which means drift accumulates in the gaps. Whether that gap is acceptable depends entirely on the workload. A stateless web tier serving read-only traffic can probably tolerate several days of unreviewed drift without meaningful risk. A server handling sensitive transactions in a regulated environment almost certainly cannot, and in that case the “lower barrier to entry” advantage of Ansible starts to look more like “lower barrier to inconsistency.”
If continuous enforcement matters, Ansible has answers. AWX is one of the open-source upstream projects for Red Hat’s Ansible Automation Platform. It adds scheduling, role-based access control, and persistent run history to standard Ansible workflows, turning a collection of playbooks into a managed service that executes on a defined cadence. Cron jobs or CI/CD pipeline triggers are a lower-ceremony alternative for teams that do not need the full AWX feature set. The tradeoff is observability: a playbook run by a cron job at 3 AM on a single control node is easy to configure and easy to forget, while AWX retains job history, notifies on failure, and gives operators a concrete place to investigate when drift is suspected and they want to know exactly when it started.
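As a minimal sketch of the lower-ceremony option, the play below installs a cron entry on the control node that re-runs the site playbook every night at 3 AM. The checkout path `/opt/automation`, the `deploy` user, and the log file location are assumptions made for illustration, not values from this course.

```yaml
# Hypothetical play: schedule a nightly drift-correction run on the control node.
- name: Schedule nightly enforcement
  hosts: localhost
  connection: local
  become: true                      # needed to manage another user's crontab
  tasks:
    - name: Install cron entry for the nightly playbook run
      ansible.builtin.cron:
        name: nightly-drift-correction        # label used to find and update this entry
        user: deploy                          # assumed automation account
        minute: "0"
        hour: "3"
        # Assumed paths; output is appended to a log so a failed run leaves a trace.
        job: "cd /opt/automation && ansible-playbook -i hosts.ini site.yml >> $HOME/ansible-nightly.log 2>&1"
```

The log redirect is the only observability this approach provides, which is the gap that AWX's job history and failure notifications fill.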
The design tension here is between reactive enforcement, running the playbook when you know something needs to change, and proactive enforcement, running it on a schedule to catch changes you did not know had happened. Both have a place. Proactive enforcement catches the forgotten emergency fix and the unannounced package upgrade that happened during a security patch cycle. Reactive enforcement is faster when you are deploying a deliberate change and do not want to wait for the next scheduled window. Mature Ansible environments usually combine them: a nightly or weekly scheduled run for drift correction, plus manual or pipeline-triggered runs for intentional deployments.
Configuration management also sits inside a larger debate about mutable versus immutable infrastructure. In a mutable model, you keep a long-lived server and repeatedly change it in place. In an immutable model, you build a fresh image or container and replace the old instance entirely. Tools such as cloud-init and Packer can push a lot of setup earlier into the lifecycle, and container platforms push even more application state into images and manifests. That does not eliminate configuration management. It narrows where you need it. Long-lived virtual machines, bastion hosts, stateful services, and mixed fleets still need a way to converge back to a known-good configuration over time.
Why This Course Uses Ansible
This course uses Ansible because it exposes the core ideas with relatively little ceremony. If you already understand SSH, YAML, packages, services, files, and Linux users, you can get productive quickly. That makes it a good teaching tool. The important point is not that Ansible is the one true answer. The important point is that Ansible makes the underlying configuration management concepts visible instead of hiding them behind a large amount of platform-specific infrastructure.
In the Ansible documentation, the machine where you run automation is the control node and the targets are managed nodes. For the Linux systems used in this course, Ansible typically needs an SSH-reachable host and a usable Python interpreter on the remote side. The control node connects over SSH, transfers small modules, executes them, collects the results, and cleans up. No long-running daemon is required on each Linux host, and no always-on central configuration database is required just to get started.
Ansible is not magic. It is a coordination layer around YAML, SSH, Python modules, and idempotent resource descriptions. That is exactly why it is useful pedagogically. You can usually see what it is doing, and when something fails, the failure still teaches you something real about the underlying operating system.
The diagram below shows that execution model at a high level. The playbook, inventory, and variables live on the control side. The actual system state lives on the managed side. Much of the craft of configuration management lies in keeping those two views aligned without confusing the description of the system for the system itself.
```mermaid
flowchart TB
    operator["Operator or automation runner"]
    repo["Automation source\nplaybook · inventory · variables"]
    control["Control node\nansible-playbook"]
    nodes["Managed nodes\npackages · files · services"]
    operator --> repo
    repo --> control
    control -- "SSH + modules" --> nodes
    nodes -- "facts + results" --> control
```
Modeling the Fleet with Inventories and Variables
An inventory is more than a host list. It is the first abstraction layer over your fleet. The moment you group hosts as webservers, dbservers, staging, or production, you are telling the automation which machines share responsibilities, which machines should receive the same configuration, and what your blast radius is when you target a group. The Ansible inventory documentation formalizes this idea, but the concept applies far beyond Ansible: any configuration management system needs some model of “which machines are these?” before it can do useful work.
Here is a simple static inventory written in an INI file:
```ini
[webservers]
web1 ansible_host=192.0.2.10
web2 ansible_host=192.0.2.11
web3 ansible_host=192.0.2.12

[webservers:vars]
ansible_user=deploy
ansible_python_interpreter=/usr/bin/python3

[dbservers]
db1 ansible_host=192.0.2.20

[production:children]
webservers
dbservers
```

Three hosts are placed into the webservers group, one host is placed into dbservers, and the parent group production contains both. This is already doing valuable modeling work. It lets you ask focused operational questions: “Apply this web configuration to all webservers,” “restart only the database tier,” or “run a safety check against all production nodes.” The default groups all and ungrouped exist as well, but the real power comes from designing group structure that matches how you actually think about the fleet.
Variables deepen that model. Instead of hardcoding ports, paths, users, and package names inside every task, you define them once at the right scope and let hosts inherit or override them. Ansible commonly stores these values in group_vars/ and host_vars/ directories alongside the inventory:
- project/
  - hosts.ini
  - group_vars/
    - webservers.yml
  - host_vars/
    - web1.yml
  - site.yml
If group_vars/webservers.yml defines nginx_worker_processes: 4 and nginx_listen_port: 80, every web server inherits those values. If host_vars/web1.yml sets nginx_worker_processes: 8, that one machine diverges intentionally, in code, rather than accidentally, through a manual edit at 2 AM. That distinction is operationally crucial. Planned difference is not drift. Unrecorded difference is drift.
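The two files described above might look like the following sketch. The values come from the paragraph; showing both files in one listing, separated by `---`, is only a presentation choice.

```yaml
# group_vars/webservers.yml
# Applies to every host in the webservers group.
nginx_worker_processes: 4
nginx_listen_port: 80

---
# host_vars/web1.yml
# A deliberate, reviewable exception for a single host.
nginx_worker_processes: 8
```

Because the override lives in version control, a reviewer can ask why web1 differs, which is exactly the question nobody can ask of a manual edit.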
Ansible also gathers facts about each host, such as operating system family, hostname, IP addresses, CPU count, and memory. Facts let you write automation that adapts to reality instead of pretending every server is identical. Static inventories are fine for stable fleets, but dynamic infrastructure eventually pushes you toward inventory plugins that query cloud APIs at runtime. The higher-level idea stays the same: your automation needs a reliable, current model of the machines it is about to touch.
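As a small illustration of fact-driven automation, the tasks below pick a package module based on the gathered `os_family` fact. This is a sketch that assumes fact gathering is enabled (the default) and that the package is named `nginx` on both families.

```yaml
# Choose the package module from a gathered fact instead of assuming one OS.
- name: Install nginx on Debian-family hosts
  ansible.builtin.apt:
    name: nginx
    state: present
  when: ansible_facts['os_family'] == "Debian"

- name: Install nginx on Red Hat-family hosts
  ansible.builtin.dnf:
    name: nginx
    state: present
  when: ansible_facts['os_family'] == "RedHat"
```

On any given host, one task reports ok or changed and the other reports skipped, so the recap shows which branch applied.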
Provisioning-to-Configuration Handoff
One subtle design problem appears as soon as you combine provisioning with configuration management: how does the configuration tool learn which machines now exist? A small static fleet can tolerate a hand-maintained inventory for a long time. Disposable cloud instances usually cannot. The handoff between “I created a machine” and “I configured a machine” becomes part of the system design.
The table below shows the common handoff patterns. Notice that they differ less in raw capability than in where truth lives and how quickly that truth goes stale.
| Pattern | How it works | Strength | Risk |
|---|---|---|---|
| Static inventory | A human-maintained hosts file records targets | Simple and explicit | Becomes stale in elastic environments |
| Generated inventory | Provisioning writes an inventory file from outputs | Clean fit for small cloud stacks | Couples tools and file formats |
| Dynamic inventory | Ansible queries the provider API at run time | Sees the current fleet state | Depends on disciplined tagging and cloud authentication |
| Orchestrated invocation | A pipeline or wrapper runs provisioning and configuration back-to-back | Creates a one-command workflow | Can blur tool boundaries if overused |
None of these patterns is universally correct. Generated inventories are often the easiest bridge from Terraform or OpenTofu into Ansible because the flow is explicit and inspectable. Dynamic inventory becomes attractive when instances appear and disappear frequently enough that file generation feels like a cache with a race condition. As covered in the Cloud Networking, Storage, and Identity lecture, the identity used to query cloud APIs is part of this design too. Inventory is not just a file problem; it is a trust problem.
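For a sense of what dynamic inventory looks like in practice, here is a minimal sketch using the `amazon.aws.aws_ec2` inventory plugin. The region and the `Environment` and `Role` tag names are assumptions for illustration; the plugin also needs the `amazon.aws` collection installed and working AWS credentials on the control node.

```yaml
# inventory.aws_ec2.yml
# The plugin only loads files whose names end in aws_ec2.yml or aws_ec2.yaml.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                        # assumed region
filters:
  instance-state-name: running       # ignore stopped or terminated instances
  "tag:Environment": production      # hypothetical tagging convention
keyed_groups:
  # Builds groups such as role_webservers from each instance's Role tag.
  - key: tags.Role
    prefix: role
hostnames:
  - private-ip-address               # connect over private addresses
```

The dependency on disciplined tagging is visible here: an instance without a `Role` tag lands in no role group and silently receives no role-specific configuration.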
Playbooks, Tasks, and Idempotency
Inventories answer “which machines?” Playbooks answer “what should be true about them?” In Ansible’s playbook guide, a playbook is a YAML document containing one or more plays. Each play targets hosts and contains tasks. Each task calls a module, and modules are distributed in collections. The naming sounds mechanical, but the deeper point is conceptual layering: fleet, target set, desired state, concrete action.
This playbook shows the structure clearly:
```yaml
---
- name: Configure web servers
  hosts: webservers
  become: true

  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true
        cache_valid_time: 3600

    - name: Deploy nginx configuration
      ansible.builtin.copy:
        src: files/nginx.conf
        dest: /etc/nginx/nginx.conf
        owner: root
        group: root
        mode: "0644"
      notify: Reload nginx

    - name: Ensure nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

The play targets the webservers group and uses become: true for elevated privileges. The ansible.builtin.apt module manages packages, ansible.builtin.copy manages files, and ansible.builtin.service manages services. The notify line points at a handler, which is just a task that runs only if something changed and only once per play. This pattern matters because configuration management is not simply about reaching the right end state. It is about reaching it without unnecessary churn. A web server that reloads on every run is technically automated, but it is not operationally elegant.
Idempotency is the heart of the whole model. If nginx is already installed, the package task should report ok, not re-install it. If the configuration file on disk already matches the desired contents, the file task should do nothing. If the service is already started and enabled, the service task should leave it alone. That is why running a playbook twice is such a useful mental test. On the second run, most tasks should be boring. Boring is the goal.
This is also where the limits of declarative tooling become visible. Modules such as apt, copy, template, and service know how to inspect state before acting. Raw command execution does not. Ansible’s command and shell modules are powerful escape hatches, but they are closer to scripting than to true state management. The same warning applies in every configuration management system: the moment you drop to arbitrary commands, you are stepping outside the framework’s strongest guarantees and back into a world where idempotency is your responsibility.
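The contrast is easiest to see side by side. In this sketch, the file `/etc/nginx/conf.d/uploads.conf` and the setting it holds are hypothetical; the point is only the difference in behavior on the second run.

```yaml
# Not idempotent: appends another copy of the line on every run,
# and always reports "changed".
- name: Allow larger uploads (shell version)
  ansible.builtin.shell: echo "client_max_body_size 20m;" >> /etc/nginx/conf.d/uploads.conf

# Idempotent: the module inspects the file first and only writes
# when the line is missing, so the second run reports "ok".
- name: Allow larger uploads (module version)
  ansible.builtin.lineinfile:
    path: /etc/nginx/conf.d/uploads.conf
    line: "client_max_body_size 20m;"
    create: true
    mode: "0644"
```

After three runs, the shell version leaves three duplicate lines in the file while the module version leaves one.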
Ansible does support ad-hoc commands, and they are genuinely useful for inspection and quick probes. ansible webservers -i hosts.ini -m ansible.builtin.ping is a fleet connectivity check, not an ICMP ping, and ansible-doc ansible.builtin.apt is a fast way to inspect module behavior locally. The useful rule of thumb is this: if a command matters more than once, it probably belongs in a playbook, because repeatability and review matter more than convenience.
Updates, Upgrades, and Reboots
Using configuration management to update software is entirely normal, and it is one of the most common day-two jobs. The same package task that ensures nginx is present can also keep it on an approved version or move it forward during a maintenance window. Routine patching is the easy case. If the declared state is “install the current approved nginx package,” Ansible can keep converging hosts toward that state every time the play runs.
The harder case is a major version bump or any update that carries a migration with it. A package manager can make PostgreSQL 16 packages available while the real operational work is still pg_upgrade or a logical migration. Likewise, a new nginx release can bring a new module layout or configuration syntax. In both cases, “package upgraded” is not the same thing as “application safely migrated.” Mature playbooks treat those upgrades as explicit change-management workflows: pin the target version, back up or snapshot state, drain or limit traffic, run any required migration or syntax validation step, verify the service, and widen the rollout only after the canary host behaves correctly. Declarative tooling makes the mechanics repeatable. It does not remove the need to understand the upgrade path.
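A minimal sketch of the "pin, validate, then widen" shape follows. The version string in `nginx_approved_version` is a hypothetical placeholder, and the backup, traffic-draining, and migration steps are deliberately omitted; `serial: 1` is used so that a failure on the first host stops the rollout before it reaches the rest.

```yaml
- name: Roll out the approved nginx version one host at a time
  hosts: webservers
  become: true
  serial: 1                                # first host acts as the canary
  vars:
    nginx_approved_version: "1.24.*"       # hypothetical pinned version
  tasks:
    - name: Install the approved nginx version
      ansible.builtin.apt:
        name: "nginx={{ nginx_approved_version }}"
        state: present
        update_cache: true

    - name: Validate configuration syntax against the new binary
      ansible.builtin.command: nginx -t
      changed_when: false                  # a syntax check changes nothing

    - name: Ensure nginx is still running
      ansible.builtin.service:
        name: nginx
        state: started
```

If `nginx -t` fails on the first host, that batch fails and the remaining hosts are never touched, which is the behavior you want from a canary.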
The same boundary explains where kernel and VM-level changes belong. Configuration management is a good fit for kernel packages, sysctl values, loaded modules, systemd units, and reboot coordination, because those live inside the operating system. Virtual machine shape, such as vCPU count, memory size, disk class, attached NICs, or subnet placement, is infrastructure and belongs to provisioning tools such as Terraform. A useful rule is that Ansible changes the inside of the machine; provisioning changes what machine exists in the first place.
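Reboot coordination can be sketched in two tasks. This example assumes a Debian or Ubuntu host, where package updates signal a pending reboot by creating `/var/run/reboot-required`; other distributions use different signals.

```yaml
- name: Check whether the OS has requested a reboot
  ansible.builtin.stat:
    path: /var/run/reboot-required         # Debian/Ubuntu convention (assumption)
  register: reboot_required

- name: Reboot and wait for the host to come back
  ansible.builtin.reboot:
    reboot_timeout: 600                    # seconds to wait for the host to return
  when: reboot_required.stat.exists
```

Paired with a play-level `serial` setting, the same two tasks reboot a tier one host at a time instead of all at once.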
Reading a Playbook Run
Idempotency is easy to describe but worth seeing in terms of what Ansible actually reports. When a play runs, common task outcomes include ok (task succeeded), changed (task succeeded and modified state), skipped (a when condition or other gate kept it from running), and failed (something went wrong and execution stopped on that host). At the end of every run, Ansible also prints a recap with one line of counters per host, not one global total:
```
PLAY RECAP *********************************************************************
web1 : ok=6 changed=2 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
web2 : ok=6 changed=2 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
web3 : ok=6 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
```

The recap fields are easier to read if you treat them as a compact incident summary for each host.
| Field | What it means |
|---|---|
| ok | The task completed successfully. Some of those successful tasks may also contribute to changed. |
| changed | The task successfully modified the host. This is the counter you watch for drift correction or rollout activity. |
| unreachable | Ansible could not connect to the host or start normal module execution, usually because of SSH, authentication, networking, or bootstrap issues. |
| failed | A task failed and the play did not recover from it on that host. |
| skipped | A task was intentionally not run, often because of when, tags, check mode behavior, or host targeting. |
| rescued | A task inside a block failed, but a rescue path handled that failure. |
| ignored | A task failed, but ignore_errors: true let the play continue anyway. |
This recap tells a story. Every host completed six successful tasks. On web1 and web2, two of those successful tasks actually modified state. Zero changes were needed on web3, which indicates that web3 was already in the desired state before the run started. No host became unreachable, no unhandled failure stopped the play, no recovery path was needed, and no failed task was ignored. One task was skipped on every host, probably because a conditional evaluated to false in the current environment.
Two recap fields deserve extra attention. unreachable is different from failed: it means Ansible never got far enough to perform normal task logic on that host. rescued is different from a clean success: it means something did go wrong, but the playbook had an explicit recovery path and used it.
On a well-maintained fleet, the second run of any playbook should produce almost entirely ok lines. A recap showing many changed items on a second run is a diagnostic signal worth investigating: either the tasks are not genuinely idempotent, the system is reverting to a different state between runs, or the definition of desired state is changing between invocations. Each of those causes has a different remedy, but the recap is the right place to start.
The --check flag is a command-line option on ansible-playbook that runs the play in dry-run mode, predicting what would change without applying anything. Combined with --diff, modules that support diff mode can show before-and-after differences, especially for file-oriented changes. Running both together before a maintenance window is one of the cheapest risk-reduction habits available:
```shell
ansible-playbook site.yml -i hosts.ini --check --diff
```

For supporting modules, the output can show which files would change and how. It is not a perfect simulation: tasks that register results from live command output cannot always predict what a downstream conditional will do, modules executing arbitrary shell commands may not be able to pre-evaluate their impact, and some diffs are intentionally suppressed when they would expose sensitive data. For the common work of package management, file management, and service state, the preview is reliable enough to catch obvious mistakes before they reach production. Building the habit of “check and diff before apply” on any playbook targeting more than one host is worth doing early.
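One way to narrow the "registered command output" gap is the task-level `check_mode: false` keyword, which lets a read-only command run for real even during a dry run. The sketch below assumes a hypothetical service named `myapp`; only use this on commands that genuinely change nothing.

```yaml
- name: Read the current service state (safe to run during --check)
  ansible.builtin.command: systemctl is-active myapp
  register: service_state
  changed_when: false      # observational only
  failed_when: false       # "inactive" is an answer, not an error
  check_mode: false        # run for real even under --check

- name: Start the service when it is not active
  ansible.builtin.service:
    name: myapp
    state: started
  when: service_state.stdout != "active"
```

Without `check_mode: false`, the command is skipped in check mode, `service_state.stdout` is never populated, and the conditional on the second task cannot be evaluated.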
Task Flow, Conditions, and Failure Semantics
The Shell Scripting lecture introduced conditionals, exit codes, and one-time setup logic in Bash. Ansible has the same operational needs, but it expresses them as data attached to tasks rather than as control-flow syntax wrapped around whole scripts. That shift matters because it lets you reason about intent one task at a time: why this task runs, what counts as change, and what should be considered a failure.
Several task-flow tools show up constantly in real playbooks. register captures a task result so later tasks can inspect it. when turns that result into a gate. changed_when and failed_when let you correct Ansible’s default interpretation when a raw command does not map neatly onto the framework’s idea of success or change. For multi-step work, block, rescue, and always group tasks into a success path, a recovery path, and a cleanup path. The first example is short, but it captures the pattern.
```yaml
- name: Check whether the bootstrap marker exists
  ansible.builtin.stat:
    path: /srv/myapp/.bootstrapped
  register: bootstrap_marker

- name: Run one-time bootstrap import
  ansible.builtin.command: /usr/local/bin/bootstrap-myapp
  when: not bootstrap_marker.stat.exists
  changed_when: true

- name: Verify the service is active
  ansible.builtin.command: systemctl is-active myapp
  register: service_state
  changed_when: false
  failed_when: service_state.stdout != "active"
```

Read that snippet from top to bottom:
- `ansible.builtin.stat` inspects `/srv/myapp/.bootstrapped` and `register: bootstrap_marker` stores the full result object for later use. That stored object includes fields such as whether the path exists (`stat.exists`) and whether the task changed anything or failed.
- `when: not bootstrap_marker.stat.exists` turns the stored result into a gate. If the marker file already exists, the bootstrap task is skipped.
- `changed_when: true` tells Ansible to count a successful bootstrap as a real change. The `command` module cannot infer whether `/usr/local/bin/bootstrap-myapp` actually altered the system, so you must state that intent yourself.
- The service check uses `register` again, this time to capture the output of `systemctl is-active myapp` in `service_state`.
- `changed_when: false` marks that command as observational. You want the recap to say, “I checked status,” not, “I changed the server.”
- `failed_when: service_state.stdout != "active"` rewrites the success rule for the task. In other words, this command only counts as healthy if it reports exactly `active`.
This is the point where configuration management stops feeling like a prettier package installer and starts feeling like a systems language. A task result is not just pass or fail. It can be “ran and changed,” “ran and did not change,” “should be skipped,” or “ran but its stdout means something is wrong.” When file existence is the natural guard, creates and removes are even cleaner than a separate stat task because they state the gate directly. That is especially useful for imports, archive extraction, key installation, or bootstrap actions that should run once and then become boring.
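As a sketch of that cleaner form, the bootstrap example collapses into a single task when the guard is expressed with `creates`. This assumes, as the earlier version implicitly did, that the bootstrap command itself writes the marker file when it succeeds.

```yaml
- name: Run one-time bootstrap import
  ansible.builtin.command:
    cmd: /usr/local/bin/bootstrap-myapp
    # Skip the command entirely once this file exists.
    # Assumption: the bootstrap command creates the marker on success.
    creates: /srv/myapp/.bootstrapped
```

On the first run the task reports changed; on every later run it reports ok without executing the command.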
Sometimes the unit of failure is larger than one task. You might need to render a configuration file, validate it, and roll back if validation fails. Ansible’s block, rescue, and always keywords exist for exactly that situation.
```yaml
- name: Apply and validate web configuration
  block:
    - name: Render configuration
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf

    - name: Validate configuration
      ansible.builtin.command: nginx -t
      changed_when: false

  rescue:
    - name: Restore last known good configuration
      ansible.builtin.copy:
        src: files/nginx.conf.backup
        dest: /etc/nginx/nginx.conf

  always:
    - name: Probe final nginx state
      ansible.builtin.command: systemctl is-active nginx
      changed_when: false
      failed_when: false
```

This second snippet also reads cleanly as a timeline:
- Ansible enters `block` and runs its tasks in order. Here that means render the configuration, then validate it with `nginx -t`.
- `changed_when: false` on the validation command says that a syntax check is inspection, not configuration drift correction.
- If every task in `block` succeeds, Ansible skips `rescue` entirely.
- If a task in `block` returns a normal `failed` result, Ansible stops the remaining block tasks and jumps to `rescue`. In this example, a failed `nginx -t` would trigger the rollback copy.
- If the rescue tasks succeed, the recap increments `rescued` rather than `failed`. If a rescue task fails, or if you deliberately add an `ansible.builtin.fail` task at the end of rescue, the host still ends the play as failed.
- `always` usually runs after the success path or the rescue path while the host is still executing normal task flow. It is where you put cleanup, notifications, or diagnostic checks that should run after ordinary block processing. Here `failed_when: false` keeps the status probe from masking the original outcome with a second, less important failure.
One subtle boundary matters here: rescue handles normal task failures, not transport failures. If a host is unreachable because SSH or authentication broke, Ansible never gets a normal task result to rescue from. The same limitation applies to always, and invalid task definitions bypass both sections as well.
The reason this matters is not elegance. It is operational safety. A playbook that can distinguish “no change,” “healthy change,” and “change that must be rolled back” is a playbook you can trust during a maintenance window.
Reuse, Templates, and Shared Automation
As the amount of automation grows, the next problem is not syntax. It is structure. A single giant playbook works for a while, then turns into a hard-to-review blob of tasks, variables, and special cases. Configuration management needs the same kind of decomposition that application code needs. In Ansible, the first two major reuse mechanisms are roles and templates.
Roles package related tasks, handlers, files, templates, and defaults into a reusable unit with a predictable directory structure:
- roles/
  - nginx/
    - tasks/
      - main.yml
    - handlers/
      - main.yml
    - templates/
      - nginx.conf.j2
    - files/
      - index.html
    - defaults/
      - main.yml
Once you think in roles, you stop asking “How do I configure this one server?” and start asking “What is the reusable definition of a web tier, a monitoring agent, or a bastion host?” That is a healthier question. It encourages consistent naming, variable defaults, and cleaner separation between generic behavior and environment-specific values.
Roles also force you to think about variable precedence, which is one of the first places a growing Ansible codebase becomes confusing. The override chain is useful because it lets a role publish gentle defaults while inventories or operator-supplied variables tighten those values for an environment. It is dangerous because a value can appear to come from nowhere if the team cannot explain which layer owns it. Role defaults should usually be weak suggestions, group variables should express environment-wide policy, host variables should be rare exceptions, and extra vars should be reserved for deliberate operator overrides rather than daily configuration.
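A short sketch makes the layering concrete. The default values below are hypothetical; the two files are shown in one listing, separated by `---`, purely for presentation.

```yaml
# roles/nginx/defaults/main.yml
# Weak suggestions published by the role (hypothetical values).
nginx_worker_processes: 2
nginx_listen_port: 8080

---
# site.yml
# Applies the role; inventory variables override the defaults above.
- name: Configure the web tier
  hosts: webservers
  become: true
  roles:
    - role: nginx
```

With `group_vars/webservers.yml` setting `nginx_worker_processes: 4` and `host_vars/web1.yml` setting it to 8, web2 and web3 resolve to 4 and web1 resolves to 8, and the role default of 2 is never used. Running `ansible-playbook site.yml -i hosts.ini -e nginx_worker_processes=1` would override all of them for that one invocation, which is why extra vars are best kept for deliberate exceptions.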
Templates solve the corresponding file problem. Static file copies are fine when every target should receive exactly the same bytes. Real fleets rarely stay that uniform. Ports differ, hostnames differ, upstream backends differ, and security settings differ by environment. Ansible uses the Jinja2 templating engine so one configuration file can be rendered differently for each host:
```jinja
worker_processes {{ nginx_worker_processes }};

server {
    listen {{ nginx_listen_port }};
    server_name {{ ansible_hostname }};
    root {{ app_document_root }};
}
```

The template is only half the story. The other half is the variable model behind it. nginx_worker_processes may come from role defaults, a group variable, or a host override. ansible_hostname comes from gathered facts. The rendered file is the place where abstract configuration becomes concrete operating-system state.
Ansible also separates roles from collections. A role is a reusable component inside your codebase or inside a downloaded package. A collection is the broader distribution unit that can contain roles, modules, plugins, and documentation. Ansible Galaxy is the public hub for sharing that content. This is powerful, but it comes with the same supply-chain caution as any package registry: version pinning, maintenance quality, and code review still matter. Reuse is only a win when you trust what you are reusing.
Collections deserve the same dependency discipline as application libraries. A small requirements.yml file records which collections the project expects and what version range the team has actually tested.
```yaml
collections:
  - name: community.docker
    version: ">=4.0.0,<5.0.0"
```

That file does two useful things. First, it makes setup reproducible across laptops, runners, and control nodes. Second, it makes dependency drift visible in code review. Using fully qualified collection names such as community.docker.docker_container or ansible.builtin.copy serves the same goal: the playbook tells you exactly which namespace owns the behavior you are relying on.
Testing and Verifying Automation
A role that works correctly on Ubuntu can fail against Rocky Linux, sometimes silently, if nobody ever tested that combination. A Jinja2 template that renders cleanly with a complete variable set will raise an error during an incident response when one variable is undefined. Testing automation before it reaches production is not optional maturity work reserved for large teams with dedicated platform engineering. It is how you find out whether your assumptions about the target system are actually correct before production finds out for you.
Two tools handle most of this verification work in the Ansible ecosystem. ansible-lint is a static analysis tool that inspects playbooks and roles for style violations, deprecated module usage, and logical mistakes that a plain syntax check would miss. It catches problems like tasks that always report changed regardless of whether anything actually changed, become directives placed in scopes where they should not be inherited, variable names that shadow Ansible built-ins, and module parameters that were renamed or removed in a recent collection release. Running ansible-lint as part of a CI pipeline turns your automation repository into something with a clear pass or fail signal on every push, rather than something that only reveals problems when it runs on a production server:
```shell
ansible-lint site.yml
ansible-lint roles/nginx/
```

Most violations reported by ansible-lint are inexpensive to fix: consistent indentation, fully qualified module names such as ansible.builtin.apt rather than the short-form apt, and explicit state declarations. The violations worth reading carefully are the logic warnings. A task flagged by ansible-lint here is usually a task, often using command or shell, whose change behavior is ambiguous to Ansible. The fix is to describe that behavior explicitly with changed_when or, when appropriate, creates or removes. Otherwise the recap becomes noisy and it becomes much harder to tell whether the playbook actually changed the system.
Molecule goes further by actually running roles inside ephemeral containers or virtual machines. A Molecule scenario provisions a test instance, applies the role under test, reruns it to check idempotence, runs a verifier to assert the expected system state, and destroys the instance when finished. In current Molecule workflows that verification step is commonly Ansible-native, often through a verify.yml playbook. Some teams also use Testinfra, which lets you write Python assertions about files, services, packages, and sockets on the guest system:
```python
def test_nginx_running(host):
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled


def test_nginx_listening(host):
    assert host.socket("tcp://0.0.0.0:80").is_listening
```

Those assertions do not verify every possible edge case. They do verify the thing that matters most: the role actually produces a running, enabled nginx process that is listening on the expected port. A linter cannot confirm that. Only running the role against a real operating system can.
The production tradeoff is coverage versus pipeline speed. Testing every role end-to-end in CI adds real execution time, particularly when Molecule spins up virtual machines for each scenario. Teams that invest in automation testing usually follow a practical progression: linting on every push (fast and nearly free in CPU and clock time), Molecule coverage on the most critical roles first, and lighter coverage for simpler utility roles. The calculus changes quickly the first time a broken role runs across 50 servers during a maintenance window and the only way to assess the damage is to log into each one manually. Consistent, tested roles reduce that kind of uncertainty significantly, and the roles that most need testing are usually the ones you are most confident work correctly because you wrote them and ran them a few times.
Configuration Is Not Data
One of the most important conceptual mistakes in automation is treating configuration and persistent data as if they were the same thing. A package list, a systemd unit file, and an nginx template are configuration. Database rows, uploaded media, queue state, and a game world save are application data. Configuration management can usually recreate the first category cheaply. The second category must be preserved, backed up, restored, or intentionally discarded.
The diagram below shows the basic shape of a stateful rebuild. The runtime and configuration can be reapplied from code. The data path must be mounted or restored before the service is allowed to treat the machine as healthy.
```mermaid
flowchart TB
    host["Fresh host"]
    runtime["Runtime and configuration\npackages · files · services"]
    backup["Backup or snapshot source\nobject storage or snapshot store"]
    data["Persistent data path\n/data or /srv/appdata"]
    service["Stateful service"]
    host --> runtime
    runtime --> data
    backup --> data
    data --> service
```
This is why stateful services demand careful ordering. You provision the host, install the runtime, create or mount the persistent path, restore the data if needed, and only then start the service. If you reverse the last two steps, many applications happily initialize an empty directory and create a brand-new state. From the automation’s perspective the service is “running.” From the operator’s perspective the wrong system is running.
Idempotency looks slightly different in these workflows. For package management, idempotency often means “the task makes no change when the system already matches the desired state.” For data restore, it often means “the task can be safely repeated without corrupting or duplicating state.” Checksums, restore markers, file existence guards, and validation tasks become important because the automation is now protecting information, not just configuration files.
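The restore-side meaning of idempotency can be sketched with a checksum and a restore marker. This is an illustrative model under stated assumptions rather than a production restore procedure: the marker file name .restore-complete, the function names, and the single-file backup are all hypothetical, and a real restore would usually unpack an archive or attach a snapshot.

```python
import hashlib
import shutil
from pathlib import Path

MARKER_NAME = ".restore-complete"  # hypothetical marker file name


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_once(backup_file, data_dir, expected_sha256):
    """Copy backup_file into data_dir unless a marker shows it was already restored.

    Returns "changed" when a restore happened and "ok" when nothing was needed.
    """
    data_dir = Path(data_dir)
    marker = data_dir / MARKER_NAME
    if marker.exists() and marker.read_text().strip() == expected_sha256:
        return "ok"  # already restored from this exact backup
    if sha256_of(backup_file) != expected_sha256:
        raise ValueError("backup checksum mismatch; refusing to restore")
    data_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(backup_file, data_dir / Path(backup_file).name)
    marker.write_text(expected_sha256 + "\n")  # written last, after the copy succeeds
    return "changed"
```

Writing the marker only after the copy succeeds is the design choice that matters here: an interrupted restore leaves no marker, so the next run repeats the restore instead of trusting a partial one.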
When Configuration Management Is the Wrong Tool
Configuration management excels at keeping long-lived servers aligned with a declared state over time. But long-lived servers that you manage in place are not the only way to run infrastructure, and understanding when that model stops being the right choice is as important as knowing how to apply it.
The core tension is between repairability and replaceability. In the mutable model, a server accumulates configuration over its lifetime and configuration management keeps it on track by detecting and repairing drift. That works well when servers are long-lived, carry persistent state, or are expensive to recreate. It works less well when servers are ephemeral, when the configuration is complex enough that convergence cannot fully guarantee identical state on every host, or when the pace of intentional change is high enough that the playbook is always chasing a moving target.
The alternative is to encode configuration into a machine image during a controlled build process rather than applying it at runtime on a live server. Packer is a widely used tool for this on virtual machine infrastructure. Packer provisions a temporary instance, runs provisioners inside it (shell scripts, Ansible playbooks, or cloud-specific bootstrap tools), and captures the result as an AMI, a VM disk snapshot, or a machine image ready for deployment to your cloud provider. When a new server is needed, you launch the pre-built image. There is no post-boot convergence step, no agent polling for drift, and no gap between what the playbook declares and what is actually on disk. The image is the artifact, and it was built and verified before it was ever published.
This approach, often called golden image or baked image infrastructure, reduces configuration drift across instances to nearly zero because every instance launched from the same image contains exactly the same bits. The tradeoff is inertia. Updating a mutable server means running a playbook; the change is live in minutes. Updating an immutable image means rebuilding it, publishing it to an image registry, and replacing every running instance with the new version. For a team deploying a new application version twice a week, that replacement cycle is a manageable cost. For a team that needs to apply emergency security patches quickly across a large fleet, rebuilding and redeploying AMIs can slow the response window enough to matter.
Packer and Ansible are often used together rather than in opposition. Packer calls an Ansible playbook during the image build, capturing the provisioning work into the image at build time rather than applying it at runtime. The result is an immutable image built with a configuration management tool: you get the reproducibility of baked images without giving up the expressiveness of Ansible for the provisioning logic. The image becomes a testable, versioned artifact just like an application binary.
Containers push the immutability model further still. A container image is already a fully specified, reproducible artifact. A container launched from that image behaves identically on any compliant host. In a fully containerized environment, configuration management’s role at the server layer shrinks to the host baseline: the kernel, the container runtime, log forwarding, and security settings. Application configuration moves into environment variables, mounted ConfigMaps, or secrets injected at startup by the orchestrator. That does not eliminate configuration management, but it restructures where it is needed. A team running a fully containerized application tier might have very sparse Ansible automation covering the operating system baseline and a rich set of Kubernetes manifests covering the application layer.
The practically useful way to think about this is a spectrum rather than a binary. At one end, a long-lived database server accumulating years of tuning, schema migrations, and custom software is pure mutable infrastructure: configuration management is essential to keep it honest over time. At the other end, an auto-scaling group of stateless API instances launched from a versioned AMI that is rebuilt on every release is pure immutable infrastructure: configuration management at the server layer is nearly irrelevant. Most real organizations live somewhere between those poles: a containerized application tier on top of VM-based infrastructure nodes that are themselves managed with Ansible or a similar tool. Understanding where on that spectrum a given system sits is the prerequisite to choosing the right approach for it.
Desired State at the Orchestration Layer
Kubernetes manifests describe desired state in terms that should sound familiar: “this Deployment should run three replicas of this image, exposed on this port, with these resource limits.” The reconciliation loop that keeps the cluster in that state is structurally identical to what a configuration management agent does on a single host: observe the actual state, compare it to the declared state, and take actions to close the gap. The difference is the layer of the stack where that reconciliation operates.
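The observe, compare, act loop is small enough to sketch directly. The code below is a toy model that treats state as a flat dictionary. It is not how Kubernetes controllers or configuration management agents are implemented, and the function names plan, apply, and reconcile are hypothetical.

```python
def plan(desired, actual):
    """Compare declared and observed state; return the actions that close the gap."""
    actions = []
    for name, want in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, want))
        elif actual[name] != want:
            actions.append(("update", name, want))
    for name in sorted(set(actual) - set(desired)):
        actions.append(("delete", name, None))
    return actions


def apply(actual, actions):
    """Return a new state dictionary with the planned actions applied."""
    state = dict(actual)
    for verb, name, want in actions:
        if verb == "delete":
            state.pop(name, None)
        else:
            state[name] = want
    return state


def reconcile(desired, actual):
    """One pass of observe, compare, act. A second pass should plan nothing."""
    return apply(actual, plan(desired, actual))
```

The property worth noticing is that planning against an already converged state yields an empty action list, which is the same idempotency idea that the playbook recap expresses with ok versus changed.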
Ansible and Puppet operate at the operating system layer: packages, files, services, users, kernel parameters, and system settings. Kubernetes operates at the workload layer: pods, Deployments, Services, volumes, and network policies. Neither tool replaces the other. A Kubernetes cluster still runs on nodes, and those nodes need a container runtime, a monitoring agent, kernel tuning, and a security baseline. You can manage all of that with Ansible. What Ansible cannot do is schedule pods across nodes, execute a zero-downtime rolling deployment of a container image update, or automatically restart a container that exits unexpectedly. Kubernetes does those things. The layers are designed to be complementary.
The practical consequence is that fully automated infrastructure usually involves two distinct codebases: one that provisions and configures the infrastructure running Kubernetes, typically with Ansible or a similar tool, and one that describes the workloads running on it, using Kubernetes manifests or Helm charts. Keeping that boundary clear matters. The common failure pattern is letting application configuration drift into Ansible variables while operations-level settings leak into ConfigMaps, until nobody is confident which layer is authoritative for a given value. The Container Orchestration lecture covers the workload side of that boundary in more detail.
Artifacts, Versions, and Deployment Boundaries
The next important boundary is between producing software artifacts and deploying them. The CI/CD Pipelines lecture goes deeper on how pipelines build, test, and publish software, but configuration management still has to consume whatever those pipelines produce. That boundary is easy to blur if you treat deployment as “run latest thing.” It becomes much clearer if you divide responsibilities deliberately.
The diagram below shows the cleaner separation. Provisioning creates the host and its surrounding infrastructure. CI/CD produces a versioned artifact. Configuration management takes an already-reviewed artifact and makes a particular environment run it with the right runtime, files, credentials, and policies.
```mermaid
flowchart TB
    source["Source code"]
    ci["CI/CD pipeline\nbuild · test · publish"]
    registry["Registry\nversioned image or package"]
    tf["Provisioning\nTerraform or OpenTofu"]
    host["Running host or cluster"]
    cm["Configuration management\nselect artifact · write config · start service"]
    service["Running application"]
    source --> ci --> registry
    tf --> host
    registry --> cm
    host --> cm
    cm --> service
```
The cleanest model is a division of labor. Provisioning decides that a machine, network path, storage attachment, and identity exist. CI/CD decides that version 1.4.2 or digest sha256:... has been built, tested, and published. Configuration management decides that this environment should run exactly that reviewed artifact and should keep running it after reboot, replacement, or drift correction.
Pinned versions matter because reproducibility is part of trust. The tag latest feels convenient, but it turns every deployment into an implicit decision to accept whatever a registry currently serves. A semantic version tag is better because it communicates intent. An immutable digest is better still when you need exact reproducibility. Whether a pipeline compiles software from source or simply retags a reviewed upstream image, the conceptual boundary is the same: build and publish in one system, deploy the reviewed result in another.
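The three levels of pinning can be told apart mechanically, which makes them easy to enforce in review or CI. The sketch below classifies a container image reference as floating, version-tagged, or digest-pinned. It is a simplified check under stated assumptions: it accepts only three-part version tags such as 1.4.2, treats every other tag as floating, and uses the hypothetical registry name registry.example.com in its examples.

```python
import re

DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")
SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def pin_level(image_ref):
    """Classify an image reference as 'digest', 'version', or 'floating'."""
    if DIGEST.search(image_ref):
        return "digest"
    # The tag is whatever follows the last colon in the final path component,
    # so a registry port such as registry.example.com:5000 is not mistaken for one.
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return "floating"  # no tag means the registry default, usually latest
    tag = last_component.rsplit(":", 1)[1]
    return "version" if SEMVER_TAG.match(tag) else "floating"
```

A deployment gate built on a check like this might reject anything classified as floating for production inventories while still allowing it in development.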
Secrets, Identity, and Safe Rollouts
The word secret is often overloaded in automation conversations. A database password, TLS private key, or API token is a secret value. A cloud instance profile, service account, or workload identity is not a secret value. It is an identity with permissions. Good automation distinguishes those categories because they are distributed, rotated, and audited in different ways.
Ansible Vault still matters because some values really do need to be materialized inside the playbook or on the target. Storing secrets as plain text in a repository is not compatible with responsible automation. Ansible Vault exists because teams need encrypted variables and files that can still travel through version control and review workflows. Behind the scenes, Vault encrypts the file contents locally before you commit them, so Git stores ciphertext beginning with an $ANSIBLE_VAULT header rather than plaintext. A teammate who has the vault password, or access to the vault password file or secret-manager entry used by automation, can decrypt, edit, or run against that file. A teammate without that credential sees only encrypted text. During playbook execution the control node decrypts the data so tasks can use it, but that does not mean plaintext can never touch disk anywhere: ansible-vault edit uses a temporary decrypted file under the hood, and tasks may intentionally render secrets into target configuration files.
```shell
# Create a new encrypted variables file from the start
ansible-vault create group_vars/production/secrets.yml

# Edit an encrypted file without keeping a plaintext copy in git
ansible-vault edit group_vars/production/secrets.yml

# Run a playbook that uses encrypted variables
ansible-playbook site.yml --ask-vault-pass
```

A healthy Vault workflow is usually this simple:
- Create or edit the secrets file with ansible-vault create or ansible-vault edit, so the copy in your working tree stays encrypted at rest.
- Commit the encrypted file like any other source file.
- Give authorized humans and CI the vault password through a password manager, CI secret, or protected --vault-password-file, not through the repository itself.
The easiest way to avoid accidentally pushing plaintext is to avoid ansible-vault decrypt in normal editing workflows. Create and edit encrypted files in place, let secret-scanning or pre-commit hooks reject obvious credentials, and treat permanent decryption as an exceptional recovery step that is followed immediately by re-encryption. If a task handles sensitive values directly, no_log: true and careful review of the rendered target files matter just as much as Vault itself.
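A pre-commit hook for this can be very small, because encrypted files are recognizable by their first line. The sketch below checks whether files with secret-looking names begin with the $ANSIBLE_VAULT header. The file-name convention (secrets.yml, vault.yml) and the function names are assumptions made for illustration, and a check like this complements a general secret scanner rather than replacing one.

```python
from pathlib import Path

VAULT_HEADER = "$ANSIBLE_VAULT;"


def is_vault_encrypted(path):
    """Return True when the file's first line starts with the Ansible Vault header."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.readline().startswith(VAULT_HEADER)


def unencrypted_secret_files(paths, secret_names=("secrets.yml", "vault.yml")):
    """Return the given paths whose file name looks secret but lacks the Vault header."""
    return [str(p) for p in paths
            if Path(p).name in secret_names and not is_vault_encrypted(p)]


def main(argv):
    """Print offending paths and return a shell-style exit code (0 means clean)."""
    offenders = unencrypted_secret_files(argv)
    for name in offenders:
        print(f"refusing to commit plaintext secrets file: {name}")
    return 1 if offenders else 0
```

A hook script would pass the staged file names to main and exit with its return value, so a nonzero result blocks the commit.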
In cloud environments, the cleanest pattern is often to avoid copying long-lived access keys onto the host at all. If a machine only needs permission to pull from a private registry or read backups from object storage, platform identity is usually a better fit: an instance profile, service account, or workload identity grants that permission at run time. That changes the trust boundary in a good way. The playbook still configures the machine, but the machine obtains its credentials from the platform rather than from a file sitting on disk.
This distinction also clarifies what Vault is for. Vault stores secret values that must be rendered somewhere. It is not a substitute for every authorization decision. A host that only needs permission to fetch backups should usually receive that permission from the platform, not from a copied API key. A playbook that needs a database password to write a config file may still need Vault. Those are different problems and better automation treats them differently.
Operational Controls for Safer Change Windows
Once fleets grow, safety features stop being optional conveniences and become part of the design. You need ways to reduce blast radius, preview impact, and debug late-stage failures without rerunning an entire playbook from the top every time. Ansible’s operational controls are valuable not because they are clever, but because they let you narrow the scope of risk.
Most of these controls are things you add to the ansible-playbook command during a maintenance window. The main exception is serial, which is a playbook keyword inside the YAML because it defines how many hosts a play changes at once. The table below shows the controls worth internalizing early.
| Control | Where you use it | What it actually does | Example |
|---|---|---|---|
| --limit | On the ansible-playbook command line | Temporarily narrows host targeting, even if the play itself targets a larger group. | ansible-playbook site.yml -i hosts.ini --limit web1 |
| --tags / --skip-tags | On the ansible-playbook command line | Includes only tagged tasks, or excludes tagged tasks, so you can run just the slice you intend. | ansible-playbook site.yml -i hosts.ini --tags nginx |
| serial | Inside the playbook YAML | Batches the rollout so the play changes only part of the fleet at one time. | serial: 1 or serial: 25% |
| --check --diff | On the ansible-playbook command line | Simulates changes and shows before-and-after diffs for modules that support those modes. | ansible-playbook site.yml -i hosts.ini --check --diff |
| -vvv | On the ansible-playbook command line | Increases logging detail so you can inspect connections, task results, and module behavior. | ansible-playbook site.yml -i hosts.ini -vvv |
| --start-at-task | On the ansible-playbook command line | Restarts the playbook at a named task after you fix a late-stage problem. | ansible-playbook site.yml -i hosts.ini --start-at-task "Validate configuration" |
Read that table as an answer to the practical question, “What do I add to ansible-playbook site.yml -i hosts.ini ... when I want less risk right now?” The one non-command-line answer is serial, which you write into the playbook when you want batching to be a permanent property of the rollout.
These controls are strongest when the playbooks themselves are designed to support them. Tags should map to stable concerns, not ad hoc debugging labels. serial should match the failure tolerance of the service behind the playbook. Check mode is helpful, but it is not a perfect simulation. As covered in the Shell Scripting lecture, a great deal of operational maturity is simply narrowing the surface area of a risky action before you take it.
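To reason about whether a serial value matches a service's failure tolerance, it helps to see the batches it would produce. The sketch below models that batching for an integer or a percentage. It is an approximation, not Ansible's code: the percentage is assumed to round down with a floor of one host per batch, and the sketch rejects values below one rather than modelling how Ansible treats them. The function name serial_batches is hypothetical.

```python
def serial_batches(hosts, serial):
    """Split hosts into rollout batches for an integer or percentage serial value."""
    if isinstance(serial, str) and serial.endswith("%"):
        # Assumed rounding: fraction of the fleet rounded down, never below one host.
        size = int(len(hosts) * float(serial[:-1]) / 100) or 1
    else:
        size = int(serial)
    if size < 1:
        raise ValueError("serial must select at least one host per batch")
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]
```

For a ten-host group, a value of 25% yields five batches of two hosts under these assumptions, so at most two hosts are out of rotation at any moment. Whether that is acceptable depends on how much capacity the service can lose.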
The most common failure modes remain very ordinary. SSH authentication fails. Python is missing or not where Ansible expects it. A task needs elevated privileges and does not have them. A variable name is misspelled. A template renders correctly but points at an application path that does not exist on that operating system. Configuration management reduces toil, but it does not remove the need to understand the underlying system. In fact, it often exposes that understanding more clearly, because the automation fails exactly where your mental model was incomplete.
Configuration Management on Employee Devices
The fleet model in this lecture has assumed servers: machines with stable network reachability, predictable operating systems, and a defined operational role. A different fleet sits inside almost every organization that runs servers, namely the laptops, phones, and tablets that employees use day to day. The same concepts apply, but the operational shape changes enough that a different category of tool emerged to handle it.
This category is usually called endpoint management or mobile device management (MDM), and increasingly unified endpoint management (UEM) when one platform covers laptops, phones, and tablets together. The job is structurally familiar: enroll a device, declare what should be true about it, and reconcile reality with that declaration over time. What changes is everything around that core loop. Endpoints connect from coffee shops and airports rather than from a known subnet. Users own part of the configuration through their accounts and personal data. Compliance posture, including disk encryption, screen lock, OS patch level, and certificate trust, matters at least as much as which packages are installed. And users push back against changes that interrupt their work in a way no stateless web server ever does.
The tooling reflects that. Microsoft Intune is a common choice in Microsoft-centric organizations and manages Windows, macOS, iOS/iPadOS, Android, Linux, and Chrome OS. Jamf Pro and Mosyle are well-known Apple-focused platforms. Omnissa Workspace ONE (formerly VMware Workspace ONE) is a broader cross-platform UEM product, and Iru (formerly Kandji) has expanded from Apple device management into a wider IT and security platform. On the open-source side, Fleet builds on osquery to give visibility and some policy enforcement across mixed fleets, and Munki handles Mac software distribution outside a full MDM stack. Older Windows shops still rely on Group Policy, which is conceptually a configuration management system bolted onto Active Directory, although many organizations now pair it with, or gradually replace parts of it with, cloud endpoint-management platforms.
Most endpoint platforms rely heavily on device-initiated check-ins, but the transport model varies by platform. Apple device management, for example, commonly uses push notifications through Apple Push Notification service (APNs) to tell a device to contact the management service, while other platforms lean more on scheduled sync and periodic polling. That architecture fits intermittent connectivity better than a server-side tool that expects a stable SSH session. The more recent direction, especially from Apple, has been explicitly toward declarative models. Apple’s Declarative Device Management lets the device itself evaluate whether it satisfies a declared state and act autonomously, rather than waiting for the server to push every command. The vocabulary is identical to what this lecture has used throughout: declarations, status reports, predicates, reconciliation. Configuration management did not start at the endpoint, but the endpoint world has now thoroughly absorbed the model.
Endpoint management does not replace server-side configuration management, and Ansible is rarely the right tool to manage a fleet of corporate laptops. The connectivity assumptions, user-experience constraints, and security integrations are different enough that purpose-built endpoint platforms almost always win at that job. The useful insight is that the underlying idea, declared state plus a reconciliation loop, is general. Whether the fleet is web servers in a VPC, Kubernetes nodes in a cluster, or laptops in employees’ backpacks, the question is the same: what should be true here, and how do we keep it that way?
Takeaways
Configuration management is the part of infrastructure automation that keeps mutable systems honest over time. Terraform provisions the server, network, or database. Configuration management brings that running system into alignment with the software, files, services, and policies you actually want, repeatedly and reliably. CI/CD then automates the delivery of application changes onto that prepared foundation. Those layers are related, but they are not interchangeable. Confusing them is one of the fastest ways to build brittle automation that works until it quietly does not.
Within that larger stack, the central mental model is simple: define the desired state, detect drift, and converge back toward the declared outcome safely enough that running the automation repeatedly is unremarkable. Idempotency is what makes that safe. The playbook recap, with its ok and changed lines, is how you verify the system is behaving that way. But idempotency alone is not the whole story. Inventories and handoff patterns tell the tool which machines exist. Task conditions and failure semantics tell it when a one-time action is truly finished. Roles and collections keep the codebase maintainable as it grows. Testing with ansible-lint and Molecule is how you find out whether the automation actually does what you believe before production does. Data restore ordering keeps a stateful service from booting into the wrong reality. And the push-versus-pull choice determines how often convergence actually happens versus how often drift simply accumulates.
The broader lesson is that configuration management is not one thing. It is a set of ideas: desired state, convergence, enforcement frequency, role reuse, and verified correctness. Those ideas apply on a spectrum from long-lived mutable servers managed entirely in place, to golden images baked with Ansible and replaced on every release, to containerized workloads where Kubernetes performs its own reconciliation loop at the workload layer while Ansible manages the OS baseline underneath, and out to the laptops and phones in employees’ hands, where endpoint management platforms apply the same declared-state model to a very different fleet. Understanding where a given system sits on that spectrum, and which tool is appropriate for which layer, is more valuable than deep expertise in any single tool.
Ansible is the concrete vehicle here because it makes the underlying ideas visible rather than hiding them behind platform-specific infrastructure. Puppet, Chef, Salt, Packer, cloud-init, and container orchestration are all answers to overlapping subsets of the same family of questions. The right operational habit is not tool loyalty. It is understanding which layer owns which kind of change: provisioning creates the target, CI/CD publishes the reviewed artifact, configuration management makes the environment run that exact artifact with the correct files, identity, policies, and data in place. The CI/CD Pipelines lecture goes deeper on the artifact side of that boundary next.