Infrastructure as Code

Imagine you need to deploy a web server. You log into the AWS console, click through a dozen screens, choose an AMI (Amazon Machine Image, a pre-built OS snapshot used to launch instances), configure a security group, attach a key pair, and launch the instance. A week later a colleague needs an identical server in a different region. Can you reproduce exactly what you did? Probably not with perfect fidelity, and certainly not quickly.

This scenario illustrates the core problem that Infrastructure as Code (IaC) solves. When infrastructure is provisioned by hand, three risks emerge: configuration drift (two servers that should be identical slowly diverge as engineers apply ad-hoc changes to one but not the other), lack of reproducibility (if a disaster strikes, you are relying on memory and wiki pages to rebuild from scratch), and lack of an audit trail (a manual change in a cloud console leaves no pull request, no diff, and no approval record). IaC addresses all three by expressing infrastructure in source files that live in version control. Changes go through code review, environments can be rebuilt from a single command, and drift can be detected by comparing the live environment against the code. Terraform is one of the most widely adopted IaC tools and the focus of this lecture, but the concepts apply across the entire ecosystem.

The Two Phases of Infrastructure Management

It helps to think of infrastructure work in two distinct phases, each with different concerns.

Initial setup is the provisioning phase: spinning up servers, configuring networks, creating load balancers, setting up databases. For a given environment, this is largely a one-time activity, though it may be repeated many times across dev, staging, and production.

Ongoing maintenance is the operational phase: updating software versions, deploying new application releases, scaling resources up or down, changing network configuration, and recovering from failures. Maintenance is continuous and never fully stops.

The same tools do not always serve both phases equally well, and recognizing why reveals a deeper philosophical split in how teams think about infrastructure. Provisioning tools like Terraform work best when infrastructure is treated as immutable: rather than patching a running server’s configuration, you replace the resource with a correctly configured one. Configuration management tools like Ansible, by contrast, are designed for mutable infrastructure: they connect to running servers and bring them to a desired state incrementally, without replacement. In practice, combining tools is common. Terraform provisions an EC2 instance, Ansible installs and configures the application, and GitHub Actions deploys updates automatically. The Configuration Management with Ansible and CI/CD Pipelines lectures cover those tools in depth; this lecture focuses on the provisioning foundation they build on.

There are two broad philosophies for automating infrastructure. In an imperative approach, you write a sequence of steps: “create a VPC (Virtual Private Cloud, an isolated network you control within AWS), then create a subnet, then launch an instance.” Shell scripts and SDK-based tools work this way; the code describes how to reach the desired state.

In a declarative approach, you describe what the desired state looks like: “there should be a VPC with one public subnet and one EC2 instance.” The tool figures out how to make reality match the declaration.
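
As a minimal sketch, that declaration might look like the following in Terraform, the tool this lecture focuses on. The address ranges and the AMI ID are placeholder assumptions, and a subnet that is genuinely public would also need an internet gateway and a route table, which are omitted here for brevity.

```hcl
# Desired state: one VPC, one subnet, one instance.
# The order of the blocks in the file does not matter.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # assumed address range
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24" # assumed address range
  map_public_ip_on_launch = true
}

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder, not a real image ID
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id
}
```

Nothing in the file says which resource to create first; the tool works that out from the references between the blocks.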

Terraform is declarative. You write configuration files describing the end state of your infrastructure, and Terraform calculates the difference between that desired state and the current state, then applies only the necessary changes. Two properties follow directly from this design.

The first is idempotency: you can run Terraform against an already-correct environment and nothing changes. No resources are duplicated, no services are restarted unnecessarily, and no existing configuration is disrupted. This matters operationally because it means you can re-run Terraform safely at any time, including after a partial failure. A shell script is idempotent only if the author carefully writes every step to check before acting; a declarative tool provides idempotency structurally.

The second is convergence: the tool drives reality toward the declaration regardless of what happened in between. If someone makes a manual change in the AWS console after Terraform’s last run, the next plan will surface the discrepancy. This makes declarative IaC a natural drift-detection mechanism, not just a provisioning tool.

Terraform configurations are written in HCL (HashiCorp Configuration Language), a declarative syntax designed to be both human-readable and machine-parseable. The design is built around four key concepts: providers, resources, data sources, and state. Understanding what each one is and what problem it solves is more valuable than memorizing syntax, because the same conceptual model appears in nearly every other IaC tool, just under different names.

A provider is a plugin that teaches Terraform how to talk to a particular platform or service. The AWS provider knows how to create EC2 instances and S3 buckets; the Google Cloud provider knows how to create Compute Engine VMs and Cloud Storage buckets. Providers are distributed independently from Terraform itself through the Terraform Registry. This separation means any community or organization can publish a provider for any API, which is why Terraform can manage Cloudflare records, GitHub teams, and Datadog monitors with the same tool and the same workflow as cloud servers.
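
A provider is typically declared and configured along these lines. This is a sketch rather than a recommended setup: the version constraint, the regions, and the alias name east are assumptions chosen for illustration.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws" # Registry address of the provider plugin
      version = "~> 5.0"        # assumed constraint; choose one that suits the project
    }
  }
}

# Default configuration for the AWS provider.
provider "aws" {
  region = "us-west-2"
}

# A second, aliased configuration of the same provider for another region.
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}
```

A resource opts into the aliased configuration by setting provider = aws.east, which is one way the "identical server in a different region" problem from the introduction can be handled in a single configuration.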

A resource is a single piece of infrastructure managed by a provider: an EC2 instance, a security group, a DNS record, a database. Resources are the core unit of Terraform’s world. Each resource has a type that identifies what the provider can create (like aws_instance) and a local name you choose (like web), which together form its address: aws_instance.web. When you declare a resource in a configuration file, you are saying: this thing should exist, with these properties.

A data source lets you look up information that already exists outside of your Terraform configuration. Rather than hard-coding your 12-digit AWS account ID in IAM policies and resource ARNs (Amazon Resource Names, the unique identifiers for AWS resources), you can query AWS for the current account’s identity at plan time. The same configuration then deploys correctly to your development account and your production account without modification. Data sources are read-only: Terraform will never try to create or modify them. They are the bridge between what you are managing and what already exists in the world.
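
The account ID lookup described above can be sketched with the AWS provider's aws_caller_identity data source. The role name deploy and the output name are hypothetical; only the account ID comes from the lookup.

```hcl
# Look up the identity behind the credentials Terraform is running with.
data "aws_caller_identity" "current" {}

# Build an ARN without hard-coding the 12-digit account ID.
output "deploy_role_arn" {
  value = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/deploy" # "deploy" is a hypothetical role name
}
```

Run against the development account, the output contains that account's ID; run against production, it contains the production ID, with no change to the file.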

Terraform keeps a record of every resource it manages in a state file. This file maps the resources in your configuration to real objects in the cloud: it knows that aws_instance.web corresponds to a specific running instance in your AWS account, along with all of its current attributes. State is what makes incremental changes possible. Without it, Terraform would have no way to distinguish between “this resource already exists and matches the configuration” and “this resource needs to be created,” so it would attempt to create everything on every run.

The state file is not just a performance optimization: it is the source of truth about what Terraform currently manages. This is why corrupted or lost state is a serious operational problem, and why protecting it is covered in detail later in this lecture.

Configuring Resources: Variables, Outputs, and Locals

Three additional constructs let you write configurations that are readable and reusable across environments rather than littered with hard-coded values.

Input variables parameterize a configuration so the same code can be deployed with different values. A variable for instance_type might default to t3.micro in development but be set to m5.large in production. Variables make the difference between a configuration tied to one specific environment and one that expresses a pattern applicable to many. You can set variables through .tfvars files, environment variables prefixed with TF_VAR_, or command-line flags, which gives you a range of options from checked-in defaults to pipeline-injected secrets.

Outputs expose values from your configuration so that other tools or configurations can consume them. After Terraform provisions a web server, an output named web_public_ip makes that IP available without requiring anyone to look it up manually in the console. Outputs are also documentation: they communicate what a configuration produced.

Locals are named expressions that let you compute or consolidate values once and reference them throughout the configuration. A common_tags local, for example, lets you define a standard set of resource tags in one place. If the tagging convention changes, you update one block rather than every resource that uses it.

Together, these three constructs give a configuration a clean interface: variables are its inputs, outputs are its results, and locals are its internal vocabulary.
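
The three constructs can be seen working together in a short sketch. The environment variable, the tag keys, and the AMI ID are assumptions added for illustration; instance_type, common_tags, and web_public_ip are the names used in the text above.

```hcl
variable "instance_type" {
  description = "EC2 instance size"
  type        = string
  default     = "t3.micro" # development default; production can override it
}

variable "environment" {
  description = "Environment name used in tags (hypothetical variable)"
  type        = string
  default     = "dev"
}

locals {
  # Defined once, reused by every resource that needs tags.
  common_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder, not a real image ID
  instance_type = var.instance_type
  tags          = merge(local.common_tags, { Name = "web" })
}

output "web_public_ip" {
  description = "Public IP of the web server"
  value       = aws_instance.web.public_ip
}
```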

The most important thing to understand about Terraform’s workflow is that it is designed as a checkpoint system, not just a sequence of commands. Each phase exists for a reason, and skipping any phase trades safety for speed in ways that cause real incidents.

terraform init downloads provider plugins and sets up the working directory. It records the exact provider versions in a lock file (.terraform.lock.hcl) so that every team member who runs the same configuration gets identical provider behavior. Committing the lock file to version control is how you prevent a provider update from silently changing behavior on one machine but not another.

terraform validate checks configuration syntax and internal consistency without contacting any remote API. It is fast and catches typos, missing required arguments, and references to undefined variables before any real work begins.

terraform plan is the critical step. It reads the current state, queries the cloud provider for the actual state of managed resources, and produces a detailed execution plan showing exactly what Terraform intends to create, modify, or destroy. Nothing changes in your infrastructure at this stage. The plan output is a diff: resources marked + will be created, resources marked ~ will be modified in place, resources marked - will be destroyed, and resources marked -/+ will be destroyed and recreated (which matters because recreation is not always safe for stateful resources like databases). Read the plan the way you read a code review. A surprise in the plan is a reason to stop and investigate, not to proceed.

terraform apply executes the plan. When called with a saved plan file, it applies the reviewed plan instead of recalculating a new one. If the underlying state changes before apply, Terraform will reject the saved plan as stale and force you to plan again. This distinction matters on teams: saving a plan to a file with terraform plan -out=tfplan and requiring approval before running terraform apply tfplan is the IaC equivalent of requiring a pull request review before merging code. The saved plan file guarantees that the apply step uses the reviewed plan or fails rather than silently applying a different one.

terraform destroy removes all managed resources in reverse dependency order. It uses the same dependency graph that governs creation, so a database is not dropped until the application servers that depend on it have been removed.

State is one of the most important and most misunderstood aspects of Terraform. Understanding it well changes how you think about every Terraform operation, not just the mechanics of where the file lives.

By default, Terraform writes state to a file called terraform.tfstate in the project directory. This works for individual experimentation, but it creates problems in a team setting. Each laptop holds its own copy of the file, so two engineers who run terraform apply from their own machines are working from records that diverge, and passing the file around by hand invites one person overwriting the other’s state and corrupting the environment. There is also no locking across machines: nothing prevents two people from modifying the same infrastructure simultaneously.

For any shared project you should store state in a remote backend. On AWS, a current common pattern is an S3 backend with encryption enabled, S3 bucket versioning turned on for recovery, and native lockfiles enabled with use_lockfile = true. Older Terraform setups often paired S3 with DynamoDB for locking, but DynamoDB-based locking is now deprecated.

terraform {
  backend "s3" {
    bucket       = "my-terraform-state"
    key          = "cs312/web/terraform.tfstate"
    region       = "us-west-2"
    use_lockfile = true
    encrypt      = true
  }
}

Bucket versioning is configured on the S3 bucket itself, not in the backend block, but it is worth treating as part of the same design because it gives you a recovery path if state is accidentally deleted or overwritten. Locking addresses the concurrency problem: if one engineer is running apply, another who tries to run apply at the same time receives a lock error rather than silently corrupting state. This is more than a technical safeguard: it forces infrastructure changes to be serialized, which is a sound practice even when only one engineer is actively making changes at a given moment.
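
A minimal sketch of the bucket side of that design, using the AWS provider's aws_s3_bucket_versioning resource, might look like this. It assumes the bucket is created in a separate bootstrap configuration, since a backend cannot store state in a bucket that the same configuration has yet to create.

```hcl
# The bucket that holds Terraform state (name taken from the backend block above).
resource "aws_s3_bucket" "state" {
  bucket = "my-terraform-state"
}

# Keep every prior version of the state object so an accidental
# overwrite or deletion can be rolled back.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```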

A single state file for all of your infrastructure becomes a liability as a project grows. Every plan queries every resource it manages; as that number climbs into the hundreds, plans slow down, applies fail more often due to cloud API rate limits (providers throttle the number of requests per second, and a plan refreshing hundreds of resources simultaneously can hit those limits), and the blast radius of any change grows. One apply can affect everything because one state file tracks everything.

Teams at scale split state by domain boundary: shared networking infrastructure in one state file, per-application stacks in their own, and per-environment splits on top of that. This reduces blast radius for routine changes: an application deployment cannot accidentally touch the VPC because the two are in separate plans. The networking state itself is still shared infrastructure, and a careless change to it can still break every resource that depends on it. Splitting does not eliminate that risk; it only keeps routine application changes from ever reaching it. Teams treat foundational state files as higher-ceremony to modify and apply stricter IAM access controls on who can run applies against them. Cross-state references are handled through remote state data sources, where one Terraform configuration reads the outputs of another without managing those resources directly.
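
A cross-state reference can be sketched with the built-in terraform_remote_state data source. The state key for the networking stack and the public_subnet_id output are assumptions; the networking configuration would have to declare an output with that name for the reference to resolve.

```hcl
# Read the outputs of the networking stack without managing its resources.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"
    key    = "cs312/network/terraform.tfstate" # hypothetical key for the networking state
    region = "us-west-2"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder, not a real image ID
  instance_type = "t3.micro"

  # Assumes the networking configuration exposes an output named public_subnet_id.
  subnet_id = data.terraform_remote_state.network.outputs.public_subnet_id
}
```

The application stack can read the subnet ID but has no way to plan a change against the subnet itself, which is the blast-radius boundary the split is meant to create.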

Some organizations carry this further by using separate AWS accounts per environment, giving each its own state file, its own permission boundaries, and a stronger administrative isolation boundary enforced by AWS itself. This does not eliminate the risk of targeting the wrong environment, but it reduces the blast radius and makes cross-environment access an explicit IAM and role-assumption decision rather than an accident inside one shared account. That is one reason separate accounts are a common pattern at scale.

One of state’s most valuable operational properties is drift detection. When a manual change is made directly in the AWS console after Terraform’s last apply, the state file no longer matches reality. The next terraform plan will surface this: Terraform queries the live cloud resources and compares their actual attributes against the state file, then shows the discrepancy as a planned change. You can run terraform plan at any time as a health check, not just when you intend to make a change.

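In larger environments, automated drift detection is common practice: run terraform plan on a schedule and alert whenever it reports pending changes. The -detailed-exitcode flag makes this straightforward to script, because plan then exits with code 2 when changes are pending and 0 when there are none. Any environment that people also modify manually will drift; the question is only whether you discover it immediately or when something unexpectedly breaks.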

State files contain the attribute values of every managed resource. If a resource has a password attribute, that password appears in state in plain text. Always encrypt remote state at rest, restrict access to the state bucket, and never commit terraform.tfstate to version control.

Most teams do not start a Terraform journey with a blank account. You may inherit an EC2 instance, a VPC, or a set of security groups created by hand months earlier. The terraform import command brings those resources under Terraform management without recreating them. It writes the resource’s current state into the state file; from that point on, Terraform tracks and manages the resource. The key limitation is that terraform import does not generate the HCL configuration for you, only the state entry. You still have to write the matching resource block and iterate on terraform plan until it reports no changes. Modern Terraform also supports config-driven imports through an import block, and terraform plan -generate-config-out=... can generate draft HCL during planning. Current Terraform documentation still marks that generated configuration workflow as experimental, so the output should be treated as a starting point that requires careful review and cleanup before it is committed.
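
The config-driven form can be sketched as follows. Import blocks require Terraform 1.5 or newer; the resource name legacy_web and both IDs are placeholders, and the arguments in the resource block have to be adjusted until terraform plan reports no changes.

```hcl
# Tell Terraform which existing object belongs to which resource address.
import {
  to = aws_instance.legacy_web
  id = "i-0123456789abcdef0" # placeholder instance ID
}

# The matching resource block still has to be written (or generated and reviewed).
resource "aws_instance" "legacy_web" {
  ami           = "ami-0123456789abcdef0" # must match the real instance's image
  instance_type = "t3.micro"              # must match the real instance's type
}
```

Because the import is part of the configuration, it shows up in the plan and can go through the same review as any other change.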

Resources differ in how risky they are to modify or destroy. A security group can be recreated with no lasting harm. An RDS database holds data: destroying it means losing that data unless a backup strategy is in place, and recreating it disrupts every service that writes to it. Terraform’s lifecycle block lets you attach safety constraints to individual resources that reflect this reality.

The most important rule is prevent_destroy. A resource marked prevent_destroy = true causes Terraform to error at plan time rather than produce a plan that would destroy it, even when the destruction would logically follow from a configuration change. It is the right safeguard for any resource that other infrastructure depends on: the VPC that every subnet lives in, the RDS instance that holds production data, the shared IAM role that a fleet of services assumes.

resource "aws_db_instance" "primary" {
  # ... configuration ...

  lifecycle {
    prevent_destroy = true
  }
}

The second common rule is ignore_changes. It tells Terraform to ignore differences in specific attributes when it plans updates. A common example is an Auto Scaling Group whose desired capacity is adjusted dynamically by scaling policies or temporarily by an operator during an incident. If you do not want Terraform to force that value back on the next apply, you add that attribute to the ignore_changes list. This is a pragmatic escape hatch for resources that are partially managed outside Terraform, and it avoids abandoning Terraform management of the resource entirely.

resource "aws_autoscaling_group" "web" {
  # ... configuration ...

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

The deeper principle these rules reflect is that Terraform manages infrastructure shape, not the data inside it. It can provision an RDS instance, resize its storage, and modify its engine version. It cannot run a schema migration, orchestrate a blue/green cutover, or verify that a change is safe given the data the instance currently holds. Those concerns belong to separate, specialized tooling: migration frameworks like Flyway or Alembic, deployment orchestration scripts, and deliberate runbooks for high-risk changes. Understanding this boundary prevents a class of incidents that happen when engineers treat a database change as just another terraform apply. When the plan output shows -/+ for a stateful resource, that is a warning requiring investigation and a verified backup strategy, not a routine instruction to type yes.

Resource Dependencies and Infrastructure Topology

Terraform automatically infers dependencies between resources based on attribute references. When one resource’s argument references an attribute of another, Terraform records that relationship and ensures the dependency is satisfied before the dependent resource is created or modified. This is called an implicit dependency, and it is the mechanism by which the shape of your configuration describes the topology of your infrastructure.

Internally, Terraform builds a directed acyclic graph (DAG) of all resources before executing any changes. The DAG captures every dependency relationship, allowing Terraform to determine the correct creation order and to parallelize independent resources simultaneously. A security group that an instance depends on is created first; two security groups with no relationship between them are created at the same time.

The implicit dependency mechanism is also how multi-tier architectures are expressed. A database security group that permits connections only from the web security group does so by referencing the web security group’s ID in its ingress rule. Terraform infers the full creation order from that reference: web security group first, then database security group (which depends on the web security group’s ID), then the database instance. The network topology is encoded directly in the attribute references, not declared separately or maintained manually.
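
The two-tier arrangement described above can be sketched like this. The vpc_id variable, the group names, and the ports (443 for the web tier, 5432 assuming PostgreSQL) are illustrative assumptions.

```hcl
variable "vpc_id" {
  description = "ID of the VPC both security groups live in"
  type        = string
}

resource "aws_security_group" "web" {
  name   = "web"
  vpc_id = var.vpc_id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = var.vpc_id

  ingress {
    description = "Database traffic from the web tier only"
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    # This reference is the implicit dependency: web must exist before db.
    security_groups = [aws_security_group.web.id]
  }
}
```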

Occasionally you need a dependency that is not visible in the attribute references. The most common case is when a resource uses a literal string that happens to name another managed resource. If an EC2 instance specifies iam_instance_profile = "web-profile" as a plain string rather than iam_instance_profile = aws_iam_instance_profile.web_profile.name as an attribute reference, Terraform sees a string with no connection to any other resource in the configuration and cannot infer the dependency. (The iam_instance_profile argument takes the name of an instance profile, which wraps an IAM role, rather than the name of the role itself.) You declare the dependency explicitly with depends_on:

resource "aws_instance" "web" {
  # ... other arguments ...
  iam_instance_profile = "web-profile" # plain string: no implicit dependency inferred
  depends_on           = [aws_iam_instance_profile.web_profile]
}

The depends_on argument tells Terraform to create or update aws_iam_instance_profile.web_profile before touching aws_instance.web, even though no attribute reference links them. Where an attribute reference is possible, prefer it: it gives Terraform the same ordering without an extra argument to maintain.

As configurations grow, repeating the same blocks across projects becomes tedious and error-prone. Terraform modules let you package a set of resources into a reusable unit with a defined interface.

A module is simply a directory containing .tf files. The directory you are working in is the root module. You can create a child module by placing files in a subdirectory and calling it from the root:

project/
  main.tf            # root module
  modules/
    web_server/
      main.tf        # child module
      variables.tf
      outputs.tf

The child module declares its accepted inputs in variables.tf and exposes its results through outputs.tf, establishing the contract that any caller must satisfy:

# modules/web_server/variables.tf
variable "instance_type" {
  description = "EC2 instance size"
  type        = string
}

variable "ssh_cidr" {
  description = "CIDR block allowed to SSH"
  type        = string
}

# modules/web_server/outputs.tf
output "public_ip" {
  description = "Public IP of the provisioned instance"
  value       = aws_instance.web.public_ip
}

The root module calls the child module by passing values for those variables and reading back the outputs:

module "web" {
  source        = "./modules/web_server"
  instance_type = "t3.micro"
  ssh_cidr      = var.ssh_cidr
}

output "web_ip" {
  value = module.web.public_ip
}

Modules enforce a clean interface: the child module declares which variables it accepts and which outputs it exposes. The root module passes values in and reads outputs back. This encapsulation makes it possible to share modules across teams or publish them to the Terraform Registry. A well-designed module is a reusable infrastructure component: you can call it multiple times with different variable values to create multiple independent instances of the same architecture.
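
Calling the same module several times can be sketched with for_each on the module block. The instance names blue and green and their sizes are assumptions; the module path and its variables are the ones defined above.

```hcl
variable "ssh_cidr" {
  description = "CIDR block allowed to SSH"
  type        = string
}

# One module call, two independent instances of the same architecture.
module "web" {
  source = "./modules/web_server"
  for_each = {
    blue  = "t3.micro"
    green = "t3.small"
  }

  instance_type = each.value
  ssh_cidr      = var.ssh_cidr
}

# Collect each instance's output into a map keyed by instance name.
output "web_ips" {
  value = { for name, instance in module.web : name => instance.public_ip }
}
```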

The failure mode to watch for is over-abstraction. A module that takes dozens of variables to control every possible configuration detail, or that nests other modules several layers deep, becomes harder to use than the raw resources it wraps. The plan output becomes opaque (it references module.web.module.networking.module.subnet.aws_subnet.main rather than a resource you recognize), failures are harder to trace, and callers cannot reason about what will actually be created. Prefer thin, composable modules that represent one coherent infrastructure concept. If a module is doing five different things, it is probably two or three modules.

Once a module is shared across teams it should be treated like an API: changes to its variable interface require a version bump, backward compatibility matters, and unannounced breaking changes will break every caller. The Terraform Registry and Git tags both support module versioning. An unversioned shared module is an organizational coupling problem: a change that serves one team can silently break another team’s infrastructure on their next init.
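
Pinning a shared module to a Git tag might look like the following sketch. The repository URL and the tag are hypothetical; modules published to a registry are pinned with a separate version argument instead of a ref in the source string.

```hcl
module "web" {
  # "//" separates the repository from the subdirectory inside it,
  # and ?ref= pins the module to a specific Git tag.
  source = "git::https://example.com/infra/terraform-modules.git//web_server?ref=v1.2.0"

  instance_type = "t3.micro"
  ssh_cidr      = "203.0.113.0/24" # documentation-range CIDR used as a placeholder
}
```

With the tag in place, a caller picks up a new module version only by editing the ref, which turns an upgrade into a reviewable change.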

Terraform is not the only tool in this space, and understanding the alternatives clarifies the tradeoffs in Terraform’s own design choices.

OpenTofu is a community-maintained, open-source fork of Terraform created after HashiCorp changed Terraform’s license to the Business Source License in 2023. OpenTofu was designed as a drop-in replacement that uses the same HCL syntax, and it is the choice for teams that require a fully open-source license. The two projects are diverging slowly, each adding features the other lacks; for most workloads they are interchangeable today.

Pulumi takes a different approach to the language question. Instead of a domain-specific language like HCL, Pulumi lets you write infrastructure definitions in general-purpose programming languages such as TypeScript, Python, Go, and C#. This lowers the barrier for developers already fluent in those languages and makes it possible to apply standard programming constructs (loops, conditionals, and libraries) to infrastructure code. The tradeoff is that the full generality of a programming language can make Pulumi configurations harder to audit statically, while HCL’s limited expressiveness makes configurations more predictable and reviewable.

AWS CloudFormation is AWS’s native IaC service. It uses JSON or YAML templates and integrates tightly with the AWS ecosystem through features such as change sets, StackSets, and drift detection. Being native means it has first-class support within AWS, and some new services are available in CloudFormation before Terraform’s AWS provider adds them, although the reverse also happens. The cost is cloud specificity: a CloudFormation template is useful only on AWS, which matters if you ever need to operate across providers.

Azure Resource Manager (ARM) and Bicep are Microsoft’s equivalents for Azure infrastructure. Bicep is a more readable domain-specific language that compiles down to ARM JSON, playing the same role HCL plays for Terraform.

Ansible is primarily a configuration management tool, but it can also provision infrastructure resources through its cloud modules. The overlap with Terraform is real, but the tools approach the problem differently: Ansible is procedural and agentless; Terraform is declarative and state-driven. As discussed in the Two Phases section, they complement each other more than they compete.

The tools themselves are only part of the story. How a team uses IaC matters as much as which tool they choose. The following principles apply across all declarative IaC tools, not just Terraform.

Every configuration file, every variable file (excluding secrets), every module, and the provider lock file belong in a version-controlled repository. This is not optional: version control is what gives you the audit trail, the reproducibility, and the code review that justify IaC in the first place. An IaC configuration that lives on an engineer’s laptop is not substantially better than clicking through a console.

Environment Parity Through Parameterization

One of IaC’s most powerful benefits is the ability to express your entire infrastructure as a pattern and instantiate it multiple times with different variable values. A single Terraform configuration with separate variable files can create a development environment, a staging environment, and a production environment that are structurally identical but sized and configured differently. This is the IaC answer to “it worked in staging but broke in production”: when both environments come from the same configuration, the surface area for divergence is limited to the variable values that differ intentionally.
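
As a sketch, the per-environment differences can live in two small variable files. The file names and values are assumptions, and each file would be selected at plan time with the -var-file flag (for example, terraform plan -var-file=prod.tfvars).

```hcl
# dev.tfvars
environment   = "dev"
instance_type = "t3.micro"
```

```hcl
# prod.tfvars
environment   = "prod"
instance_type = "m5.large"
```

Everything not listed in these files is identical between the two environments by construction, so a diff of the two files is a complete list of the intentional differences.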

In a CI/CD (Continuous Integration/Continuous Deployment) pipeline, the standard pattern is to generate a saved plan automatically on every pull request, post the plan output as a comment, and require a human to review and approve before running terraform apply. This combines automated consistency checking with human judgment. The saved plan file ensures that the apply step uses the reviewed plan rather than computing a new one, and if the underlying state changes in the meantime Terraform requires a new plan instead of silently applying a stale one.

A single terraform apply can create, modify, or destroy dozens of resources at once. The plan is how you understand the scope of a change before committing to it. A plan that marks ten resources for destruction when you expected one is a reason to stop and investigate. On large configurations, a seemingly small refactor can touch far more resources than intended, and the plan is the right place to discover that, not the aftermath of an apply.

Never hard-code passwords, API keys, or other secrets in .tf files or in .tfvars files that will be committed to version control. Use variables marked sensitive = true, inject values through environment variables (Terraform reads TF_VAR_<name> automatically), or integrate with a secrets manager such as AWS Secrets Manager or HashiCorp Vault. Marking a variable sensitive tells Terraform to redact its value from plan and apply output, which helps keep secrets out of CI logs. It does not keep the value out of state, which is one more reason to encrypt the state backend and restrict access to it.
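
A sensitive variable might be declared and consumed as in the sketch below. The variable name db_password and all of the database settings are hypothetical; the value would be supplied at run time through the TF_VAR_db_password environment variable rather than written in any file.

```hcl
variable "db_password" {
  description = "Master password for the database"
  type        = string
  sensitive   = true # redacted in plan and apply output
}

resource "aws_db_instance" "primary" {
  identifier          = "primary" # hypothetical settings throughout
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = var.db_password
  skip_final_snapshot = true # acceptable for a throwaway example only
}
```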

Apply Least Privilege to Infrastructure Automation

Terraform’s execution role should follow the principle of least privilege (granting only the permissions actually needed, and no more) just like any other system credential. A role with full administrator access on your AWS account is a high-impact credential: if it leaks, or if someone misconfigures a resource, the blast radius is unlimited. Scope the role to the resource types and regions the configuration actually manages. In practice, many teams start with broad permissions and tighten them incrementally as the configuration stabilizes. The important discipline is to treat the Terraform execution role as a sensitive credential that deserves the same review and rotation policy as any other privileged access, rather than as an infrastructure utility that can safely have unlimited permissions.
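
One way to express such a scoped role is with the AWS provider's aws_iam_policy_document data source, as in the sketch below. The action list is illustrative and almost certainly incomplete for a real configuration, and the policy name and region are assumptions.

```hcl
# Permissions for the Terraform execution role, limited to a few EC2
# actions and to a single region.
data "aws_iam_policy_document" "terraform_exec" {
  statement {
    sid    = "ManageWebInstances"
    effect = "Allow"
    actions = [
      "ec2:Describe*",
      "ec2:RunInstances",
      "ec2:TerminateInstances",
      "ec2:CreateTags",
    ]
    resources = ["*"]

    # Deny by omission everywhere except the region this configuration manages.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestedRegion"
      values   = ["us-west-2"]
    }
  }
}

resource "aws_iam_policy" "terraform_exec" {
  name   = "terraform-web-exec" # hypothetical policy name
  policy = data.aws_iam_policy_document.terraform_exec.json
}
```

Starting from a policy like this and adding actions as plans fail with access errors is one practical way to tighten permissions incrementally.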

Human review catches many problems in a plan, but not all of them. At scale, teams use policy as code tools to enforce organizational rules automatically as part of the plan/apply pipeline. HashiCorp Sentinel and the Open Policy Agent (OPA) with tools like Conftest can enforce rules such as “no S3 bucket may have public access enabled,” “all resources must carry cost-allocation tags,” or “production applies may not run on weekends.” These policies evaluate the plan before apply runs and block non-compliant changes automatically. Policy as code is the IaC equivalent of linting: not every constraint can be expressed in the type system of your configuration language, so some must be enforced by tooling applied consistently to every change.

Always specify version constraints for your providers. Without version pinning, a provider update can introduce breaking changes that silently alter infrastructure behavior. The lock file records the exact provider versions installed; committing it to version control ensures every team member and every CI job uses the same versions.
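
A version constraint block might look like the following sketch. Both version numbers are assumptions chosen for illustration rather than recommendations.

```hcl
terraform {
  # Assumed floor: the S3 backend's use_lockfile argument needs Terraform 1.10 or newer.
  required_version = ">= 1.10.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # "~>" allows only the rightmost component to increase:
      # 5.80, 5.81, and so on are accepted, 6.0 is not.
      version = "~> 5.80"
    }
  }
}
```

The constraint defines the range Terraform may choose from during init, while the lock file records the single version actually chosen; the two work together.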

Tools like Infracost can analyze a Terraform plan and estimate the monthly cost of the resources being created. Adding cost estimation to a review pipeline prevents surprise bills, especially when resources like NAT Gateways or RDS instances with provisioned IOPS appear in a plan for the first time.

Infrastructure as Code transforms infrastructure management from a manual, error-prone process into a disciplined engineering practice with the same review, versioning, and reproducibility expectations as application code. The declarative model is the core insight: you describe the desired end state, the tool calculates how to get there, and the properties of idempotency and convergence make the result safe to run repeatedly and able to surface drift over time.

The plan/apply cycle is not just a workflow convenience; it is a risk-management mechanism. A plan is a contract between what you reviewed and what gets applied, and treating it with the same rigor as a code review prevents the class of infrastructure incidents that happen when a change affects more than the author expected. State is what makes incremental changes possible and what makes drift visible; protecting it through remote backends with locking, encryption at rest, and exclusion from version control is as important as writing correct configuration in the first place.

Terraform does not stand alone in a real automation stack. It handles the provisioning layer: creating networks, compute, and storage. Configuration management and CI/CD then operate on the infrastructure Terraform provisioned. Understanding where each tool’s responsibility begins and ends is the key to combining them without duplication or conflict. The Configuration Management with Ansible and CI/CD Pipelines lectures cover those layers in detail.