
Containerization

A dependency version your application needs conflicts with the one another service requires on the same server. A build that passes every test in CI fails on the first deploy because the server is running a different OS version. A traffic spike demands twenty identical application instances in under a minute, but each one takes ten minutes to configure by hand. These are the problems containers solve.

A container packages an application together with everything it needs to run: libraries, runtime, and configuration. The result is a portable artifact that usually behaves the same on a developer’s laptop, a production server, and a cloud instance, as long as the target environment provides a compatible CPU architecture and kernel. Unlike a virtual machine, a container shares the host kernel and often starts in under a second. Unlike a bare-metal deployment, it is isolated from other processes and carries no dependency on the host’s user-space software.

This lecture explains how containers work from the kernel features that make them possible, through Docker’s architecture and image model, to the tools for building, running, and composing containerized applications. It also covers Podman, the OCI standard that governs container interoperability, and where containers fit in a broader production deployment picture.

To understand why containers exist, it helps to trace the evolution of how software is deployed.

Bare metal means installing your application directly on a physical server. The operating system, libraries, and application all share the same kernel and filesystem. This is simple but fragile: installing a dependency for one application can break another, and scaling requires buying more hardware.

Virtual machines improved things dramatically. A hypervisor lets you run multiple isolated operating systems on the same physical host, each with its own kernel. However, each VM carries the overhead of a full OS: hundreds of megabytes of disk, its own memory footprint, and boot times measured in seconds to minutes.

Containers take a different approach. Instead of virtualizing hardware, they use kernel features (namespaces and cgroups on Linux) to isolate processes while sharing the host’s kernel. A container holds only the application and its user-space dependencies. The result is an artifact typically measured in megabytes rather than gigabytes, with startup times that are often under a second.

How Containers Work: Namespaces and cgroups


Containers are isolated not by virtualized hardware but by two Linux kernel features: namespaces and cgroups (control groups). Namespaces give a process a private, isolated view of system resources: its own filesystem tree, network interfaces, process table, and hostname, each independent from the rest of the host. cgroups limit how much CPU, RAM, and I/O a process group can consume, so no single container can starve others on the same host. A container runtime creates a set of namespaces and a cgroup hierarchy, then launches the application process inside them. From inside, the container appears to be a standalone system. From outside, it is just a process tree with extra kernel bookkeeping.

Understanding these mechanisms matters because they define both what containers isolate and what they do not. Containers share the host kernel; they are not separate OS instances. The isolation is process-level, not hardware-level. The history of how these features were built up over time makes this architecture clear.

The ideas behind containers were developed incrementally over decades.

chroot (1979): Unix introduced the chroot system call, which changes the apparent root directory for a process and its children. This creates a “chroot jail” that hides portions of the filesystem. However, it only restricts filesystem access; it does not limit CPU, RAM, or network resources. Root users can escape it, so it is not a security mechanism on its own.

cgroups (2006/2008): “Control Groups” were developed by Google engineers and merged into the Linux kernel in 2008. cgroups allow the OS to limit, account for, and control the resource usage of a collection of processes:

  • Resource limiting: cap RAM, CPU, and I/O usage for a process group.
  • Prioritization: give certain process groups higher priority for disk access.
  • Accounting: track how much CPU and memory a group has consumed.
  • Control: freeze, kill, or checkpoint a process group.

As containers became mainstream, Linux introduced cgroups v2 (merged into the mainline kernel in version 4.5 in 2016, and adopted as the default by major distributions over the following years) as a unified hierarchy that replaces the multiple independent controller trees used by v1. The practical differences are simpler delegation rules, more consistent resource semantics across controllers, and better behavior under nested/containerized workloads. Modern Docker, containerd, and runc can run on either version, but newer Linux distributions increasingly default to v2.
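
To make the limits concrete, here is a minimal Python sketch that reads the memory and CPU limits a process is running under. It assumes cgroups v2 and the conventional mount point /sys/fs/cgroup; the function names are illustrative, and on a v1 host or in a cgroup with no limit files the functions simply return None.

```python
from pathlib import Path


def read_memory_limit(cgroup_dir="/sys/fs/cgroup"):
    """Return the memory limit in bytes, or None if unlimited or unavailable."""
    try:
        raw = (Path(cgroup_dir) / "memory.max").read_text().strip()
    except OSError:
        return None  # file missing: not cgroups v2, or no memory controller here
    return None if raw == "max" else int(raw)


def read_cpu_limit(cgroup_dir="/sys/fs/cgroup"):
    """Return the CPU limit as a number of CPUs (quota / period), or None."""
    try:
        quota, period = (Path(cgroup_dir) / "cpu.max").read_text().split()
    except (OSError, ValueError):
        return None
    return None if quota == "max" else int(quota) / int(period)


print("memory limit (bytes):", read_memory_limit())
print("cpu limit (CPUs):", read_cpu_limit())
```

Inside a container started with a memory limit, read_memory_limit() would be expected to return that limit in bytes. With no limit set, the file contains the word max and the function returns None.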

Namespaces (2002-present): Linux namespaces wrap a global resource so that processes within the namespace see their own isolated instance of it. There are currently eight namespace types in the Linux kernel:

| Namespace | What it isolates |
| --- | --- |
| Mount | Filesystem mount points |
| PID | Process IDs |
| Network | Network interfaces, IP addresses, routing tables, sockets, firewall rules |
| IPC | Inter-process communication (shared memory, message queues) |
| UTS | Hostname and domain name |
| User | User and group IDs |
| Cgroup | The cgroup root seen by processes |
| Time | System clock offsets (added in kernel 5.6, 2020) |

The network namespace deserves special attention: each container gets its own list of network interfaces, its own IP address space, its own routing table, and its own firewall rules. This is what allows multiple containers on the same host to each bind to port 80 without conflicting: they have separate network namespaces, so port 80 inside each container is an independent socket.
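
The conflict that network namespaces avoid is easy to reproduce. The sketch below binds two TCP sockets to the same address and port from one ordinary process, so both live in the same network namespace. The helper name try_bind is illustrative, not part of any library.

```python
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on a TCP port; return the socket, or None if the bind is refused."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        sock.close()
        return None
    sock.listen()
    return sock


first = try_bind(0)                # port 0 asks the kernel for any free port
port = first.getsockname()[1]
second = try_bind(port)            # same namespace, same address, same port
print("second bind succeeded:", second is not None)
first.close()
```

The second bind is refused because the port is already taken within this namespace. Two containers do not hit this error on port 80 because each one has its own independent port space.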

The PID namespace is equally important: the first process started inside a container becomes PID 1 within that namespace, even though the host kernel assigns it a different PID on the host. Other processes inside the same container receive their own namespace-local PIDs. Processes in the container cannot enumerate or signal processes outside their namespace. From inside, the system appears to have only the processes that belong to the container. This matters operationally later when you stop a container: the runtime sends the stop signal to that container’s main process, which is PID 1 inside the container even if it has a different PID on the host.
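
You can see which namespaces a process belongs to by reading /proc. The following sketch assumes a Linux system with /proc mounted; list_namespaces is an illustrative name. Run once on the host and once inside a container, it would typically print different identifiers for namespaces such as pid, net, and mnt.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} read from /proc/<pid>/ns.

    Each entry is a symlink whose target looks like 'pid:[4026531836]'.
    Two processes share a namespace when their identifiers match.
    """
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):      # not Linux, or /proc is not mounted
        return {}
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue                   # entry not readable for this process
    return result


for name, ident in list_namespaces().items():
    print(f"{name:18} {ident}")
print("PID as seen by this process:", os.getpid())
```

If this script is the first process started in a container, the last line would be expected to report PID 1, while the host sees the same process under a different PID.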

LXC (Linux Containers): Combining cgroups and namespaces, LXC appeared around 2008 as a way to run a full “virtual operating system” without a hypervisor. An LXC container might run Ubuntu inside even though the host runs Debian, while sharing the same host kernel. This makes LXC very resource-efficient, but because the kernel is shared, a kernel vulnerability could potentially be exploited across container boundaries.

Docker’s paradigm shift: Earlier containerization approaches like LXC virtualized an entire OS environment. Docker’s insight was to discard the OS portion entirely and virtualize only the single application you want to run. A Docker container is not “a small Ubuntu”; it is your application process, packaged with the user-space libraries it needs. Docker originally used LXC internally, but now uses libcontainer (part of the runc project), a purpose-built library that directly uses cgroups and namespaces.

Docker is the most widely adopted container platform. Understanding its architecture helps you reason about what happens when you type docker run.

The system has four main components:

The Docker daemon (dockerd) is a long-running background process that manages images, containers, networks, and volumes on the host. It listens for API requests over a Unix socket (or TCP) and does the actual work of creating and running containers.

The Docker client (docker) is the command-line tool you interact with. When you run docker build or docker run, the client sends API calls to the daemon. The client and daemon can run on the same machine or on different machines.

Images are read-only templates that contain a filesystem snapshot plus metadata (default command, environment variables, exposed ports). An image is the blueprint; a container is a running instance of that blueprint.

Registries are servers that store and distribute images. Docker Hub is the default public registry, but organizations often use private registries such as GitHub Container Registry (GHCR), Amazon ECR (Elastic Container Registry), or a self-hosted registry. When you run docker pull nginx, the client asks the daemon to download the nginx image from Docker Hub.

The diagram below shows how these components relate. A command that creates or runs a container flows from the CLI through the daemon to containerd, which delegates the actual kernel work to runc. The daemon reaches outward to a registry only when an image needs to be pulled or pushed.

flowchart TB
  CLI["docker\nCLI client"]
  daemon["dockerd\nDocker daemon"]
  containerd["containerd\ncontainer lifecycle"]
  runc["runc\nOCI runtime"]
  kernel["Linux Kernel\nnamespaces · cgroups"]
  registry["Registry\nDocker Hub · GHCR · ECR"]

  CLI -- "API (Unix socket)" --> daemon
  daemon <-- "pull / push images" --> registry
  daemon --> containerd
  containerd --> runc
  runc -- "kernel calls" --> kernel

A Docker image is not a single monolithic file. It is a stack of read-only layers, each representing a set of filesystem changes. This layered design uses a union filesystem to present all the layers as a single coherent directory tree to a running container.

As a simplified learning model, consider an image for a web application built in sequence: start from a base runtime image, copy in the dependency manifest, install the dependencies, then copy in the application source code. In this example, each filesystem-changing step produces a new layer:

  1. Base layer: the runtime and its OS dependencies
  2. Manifest layer: the dependency file added to the filesystem
  3. Install layer: the result of running the package installer
  4. Application layer: the application source code

This does not mean every Docker image has exactly four layers or these exact boundaries. Real images often have more layers, sometimes fewer, and may include metadata-only history entries.

The power of this design is layer caching. If you rebuild the image and only the application code has changed, Docker reuses the cached base, manifest, and install layers and only rebuilds the application layer. This makes rebuilds fast. It also means that if ten images on the same host all use the same base image, that layer is stored only once on disk.
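
The caching rule can be modeled in a few lines of Python. This is a simplified sketch, not Docker's actual cache-key algorithm: it assumes each layer's key is a hash of the parent key, the instruction text, and the copied content, which is enough to show why a change invalidates every layer after it. The version strings and source snippets are placeholders.

```python
import hashlib


def layer_keys(steps):
    """Compute one cache key per build step.

    steps is a list of (instruction, content) pairs, where content stands in
    for the bytes of any files the instruction copies (empty for FROM and RUN).
    """
    keys, parent = [], ""
    for instruction, content in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(digest)
        parent = digest                # the next key depends on this one
    return keys


def reused_layers(old_steps, new_steps):
    """Count the leading layers whose keys match between two builds."""
    count = 0
    for old, new in zip(layer_keys(old_steps), layer_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count


def build(requirements, source):
    return [
        ("FROM python:3.12-slim", ""),
        ("COPY requirements.txt .", requirements),
        ("RUN pip install --no-cache-dir -r requirements.txt", ""),
        ("COPY . .", source),
    ]


first = build("flask==3.0.0", "print('v1')")
code_change = build("flask==3.0.0", "print('v2')")
dependency_change = build("flask==3.1.0", "print('v1')")

print(reused_layers(first, code_change))        # 3: only the application layer is rebuilt
print(reused_layers(first, dependency_change))  # 1: everything after the base is rebuilt
```

The second result shows why the order of instructions matters: a change to an early step forces every later step to run again, even if its own inputs are unchanged.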

The diagram below shows the layer stack for a Python web application. The image layers (read-only) are shared across all containers started from this image. Each container adds its own writable layer on top for any changes it makes at runtime; that layer is discarded when the container is removed.

flowchart TB
  W("Writable layer\ncontainer runtime only, ephemeral")
  L4["COPY . . \napplication source code"]
  L3["RUN pip install\ninstalled packages"]
  L2["COPY requirements.txt ."]
  L1["FROM python:3.12-slim\nbase OS + Python runtime"]

  L1 --> L2 --> L3 --> L4 --> W
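
As an analogy only, Python's ChainMap behaves much like this stack: lookups fall through the layers from top to bottom, and writes land in the topmost mapping. The paths and contents below are made-up placeholders, and a real union filesystem works on files and directories rather than dictionary keys.

```python
from collections import ChainMap

# Read-only image layers (placeholder contents).
base = {"/usr/bin/python3": "python runtime"}
manifest = {"/app/requirements.txt": "flask"}
install = {"/usr/lib/python3/site-packages/flask": "installed package"}
source = {"/app/server.py": "v1"}
image_layers = [source, install, manifest, base]    # top layer first


def start_container(layers):
    """Return a merged view: a fresh writable dict stacked on the shared layers."""
    return ChainMap({}, *layers)


a = start_container(image_layers)
b = start_container(image_layers)

a["/app/server.py"] = "patched at runtime"   # written to a's writable layer only
a["/tmp/cache"] = "scratch data"

print(a["/app/server.py"])       # a sees its own modified copy
print(b["/app/server.py"])       # b still sees the image's version
print(source["/app/server.py"])  # the image layer itself is unchanged
```

Dropping the object a discards its writable dictionary while the shared layers stay intact, which mirrors what happens when a container is removed.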

You can inspect the layers of any image with:

docker history nginx:alpine

This shows each layer’s size, the instruction that created it, and when it was built. In many application images, one of the largest layers is the dependency installation step, though the base image or copied build artifacts can also dominate.

Some rows show a size of 0B. This usually means the history entry changed image metadata but did not add or remove filesystem content. Instructions such as CMD, ENTRYPOINT, ENV, EXPOSE, LABEL, USER, and WORKDIR often appear this way. They still matter operationally because they change container startup behavior and defaults, even when no bytes are added to the layer filesystem.

A Dockerfile is a text file containing instructions that Docker executes in sequence to build an image. Instructions that change the filesystem typically produce layers, while some instructions only change image metadata or history. The final stack of filesystem layers, together with that metadata, becomes the image.

A typical web application Dockerfile follows a predictable structure:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "server.py"]

Each instruction serves a specific purpose:

  • FROM sets the base image. python:3.12-slim is a Debian-based image with Python pre-installed but without extra development tools, keeping it small. Every image descends from a base, ultimately tracing back to scratch (an empty filesystem).
  • WORKDIR sets the working directory inside the container. Subsequent COPY, RUN, and CMD instructions execute relative to this path. Using WORKDIR avoids cd commands scattered through the file.
  • COPY requirements.txt . copies only the dependency manifest first. This is a deliberate caching optimization: the install step only re-runs when the manifest changes.
  • RUN executes a command during the build and commits the result as a new layer. Here it installs packages. The --no-cache-dir flag prevents pip from storing download caches in the image, saving space.
  • COPY . . copies the rest of the application source code after the install step so that code changes do not bust the install cache.
  • EXPOSE documents which port the application listens on. It does not actually publish the port; that happens at runtime with -p.
  • CMD sets the default command that runs when a container starts from this image. It can be overridden at runtime.
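
The Dockerfile names server.py but the file itself is not shown, so the following is a hypothetical stand-in using only the Python standard library. It listens on port 8080 to match the EXPOSE line. The handler and the make_server helper are illustrative names.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from the container\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    # Bind to 0.0.0.0 rather than 127.0.0.1: published ports are forwarded to
    # the container's own IP address, which a loopback-only listener never sees.
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)


# A real server.py would end by calling:
#     make_server().serve_forever()
```

Request logs from BaseHTTPRequestHandler go to stderr, which is where docker logs picks them up.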

For compiled languages, the toolchain required to build an application (compiler, linker, development headers) is much larger than what is needed to run the resulting binary. Multi-stage builds let you use one image for compilation and a different, smaller image for the final artifact.

# Build stage: compile the application
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server
# Runtime stage: copy only the compiled binary
FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]

The final image contains only the binary, produced from an empty scratch base. There is no shell, no package manager, and no Go toolchain in the deployed image. The attack surface is minimal, and the image might be 10 MB where the build stage weighed 1 GB.

This image uses ENTRYPOINT rather than CMD. The difference matters here. CMD sets a default command that is replaced entirely if you pass anything after the image name at runtime: docker run myimage bash would swap out the whole command for bash. ENTRYPOINT sets a fixed executable that always runs; anything passed at runtime becomes arguments to it rather than a replacement. The two are often paired: ENTRYPOINT names the executable and CMD provides its default arguments, which can be overridden without replacing the binary. In a scratch image there is no shell to invoke anyway, so the distinction is mostly academic here. The deeper reason to prefer ENTRYPOINT for a production binary is intent: it signals that the image exists to run that one thing and is not meant to be repurposed as a general-purpose environment.
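
The interaction can be written down as a small function. This sketch models only the exec (JSON array) form of both instructions; the shell form and the --entrypoint flag follow additional rules that are left out here. The function name and example arguments are illustrative.

```python
def effective_command(entrypoint, cmd, run_args):
    """Return the argv a container would start with (exec form only).

    entrypoint: the image's ENTRYPOINT list (empty if none)
    cmd:        the image's CMD list (empty if none)
    run_args:   anything passed after the image name in `docker run`
    """
    # Runtime arguments replace CMD entirely; ENTRYPOINT always stays in front.
    args = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + args


# CMD only: runtime arguments swap out the whole command.
print(effective_command([], ["python", "server.py"], ["bash"]))
# ENTRYPOINT only: runtime arguments are appended to the fixed executable.
print(effective_command(["/app"], [], ["--port", "9000"]))
# Paired: CMD supplies default arguments when nothing is passed at runtime.
print(effective_command(["/app"], ["--port", "8080"], []))
```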

Multi-stage builds are useful for any language that produces distributable artifacts: Go, Rust, Java (producing a JAR or native image), and C/C++. They can also be used to separate test execution from the production image.

With an image available, docker run creates a container from it and starts the application process:

docker run -d -p 8080:80 --name webserver nginx:alpine

This command starts a container from the nginx:alpine image, names it webserver, and runs it in the background. nginx is the web server software, and alpine means the image uses Alpine Linux, a small distribution chosen to keep image size low. The -p 8080:80 mapping publishes container port 80 (where nginx listens inside the container) on host port 8080, so you can reach it at http://localhost:8080.

The most commonly used flags:

  • -d (detach) runs the container in the background and prints its ID. Without -d, the container’s output streams to your terminal and the container stops when you press Ctrl+C.
  • -p host:container maps a port on the host to a port inside the container. Traffic arriving at the host port is forwarded to the container’s internal IP on the container port.
  • --name gives the container a memorable name instead of a randomly generated one. Named containers are easier to reference in subsequent docker commands.
  • -e KEY=VALUE sets an environment variable inside the container. This is the standard way to pass configuration (database URLs, API keys, feature flags) without baking values into the image.
  • -v /host/path:/container/path creates a bind mount, mapping a host directory into the container. Changes are visible in both directions immediately.
  • --rm automatically removes the container when it exits, useful for one-off commands.
  • --restart unless-stopped tells Docker to restart the container if it crashes or if the host reboots, unless you explicitly stop it.

Containers move through a simple lifecycle: created, running, stopped, removed.

Stopping is not just a Docker concept; it is a Unix process-control concept. A signal is a small message the kernel delivers to a process to tell it that something has happened or that it should take an action. SIGTERM is the polite request to shut down: it gives the application a chance to finish in-flight work, close files, flush buffers, and exit cleanly. SIGKILL is the forced stop: the kernel terminates the process immediately, and the process cannot catch it, ignore it, or do cleanup after receiving it. Docker uses both in sequence because a clean shutdown is better for application correctness, but a container that refuses to exit still has to be stopped eventually.

Docker provides commands for each lifecycle state:

# View running containers
docker ps
# View all containers, including stopped ones
docker ps -a
# Stop a running container (sends SIGTERM, then SIGKILL after a timeout, 10 seconds by default)
docker stop webserver
# Start a stopped container
docker start webserver
# Restart a container
docker restart webserver
# View container logs (stdout/stderr)
docker logs webserver
# Follow logs in real time
docker logs -f webserver
# Execute a command inside a running container
docker exec -it webserver sh
# Show resource usage (CPU, memory, network I/O)
docker stats
# Remove a stopped container
docker rm webserver
# Force-remove a running container
docker rm -f webserver

The docker exec -it command is particularly useful for debugging. The -it flags allocate an interactive terminal, giving you a shell inside the running container. docker stats provides a real-time view of CPU and memory consumption without installing extra tooling inside the container.
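
The SIGTERM-then-SIGKILL sequence that docker stop applies can be imitated with ordinary processes. The sketch below is not how Docker implements it; it assumes a POSIX system and uses two throwaway child processes, one that exits on SIGTERM and one that ignores it. All names are illustrative.

```python
import subprocess
import sys


def stop_gracefully(proc, timeout=10.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    proc.terminate()                       # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return "exited after SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                        # SIGKILL: cannot be caught or ignored
        proc.wait()
        return "killed after timeout"


def spawn(child_source):
    """Start a child Python process and wait until it reports that it is set up."""
    proc = subprocess.Popen([sys.executable, "-c", child_source],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()                 # blocks until the child prints 'ready'
    return proc


POLITE = "import time; print('ready', flush=True); time.sleep(60)"
STUBBORN = ("import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); "
            "print('ready', flush=True); time.sleep(60)")

print(stop_gracefully(spawn(POLITE), timeout=5))
print(stop_gracefully(spawn(STUBBORN), timeout=0.5))
```

The stubborn child only goes away once the timeout expires, which is the reason a slow docker stop often points at an application that does not handle SIGTERM.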

Every container gets its own network namespace, but Docker also needs to make containers reachable from the host and from each other. It does this through a set of network drivers, each with different isolation and connectivity characteristics.

The default network mode is bridge. On a typical Linux Docker installation, Docker creates a virtual bridge interface on the host (usually docker0) and assigns each container a private IP address within that bridge’s subnet, often 172.17.0.0/16 unless the daemon or host networking has been customized. On Docker Desktop for macOS or Windows, the bridge exists inside the Linux VM Docker manages, so the exact private subnet may differ. Containers can reach the internet through NAT. On Linux, the host itself can usually reach a container’s private bridge IP directly, but machines outside the host cannot. Port publishing is what makes a container service reachable through a host port such as localhost:8080 or from other systems on the network.

Port publishing (-p host:container) maps a port on the host to a port inside the container. Docker configures a NAT rule forwarding traffic arriving at the host port to the container’s private IP on the container port. The container process only knows it is listening on, say, port 80; the host port is a translation that exists outside the container’s view. This is why two containers can both “listen on port 80” without conflicting: they each have a unique private IP, and only the host-side ports (which must be unique) are exposed externally.

The diagram below shows two containers both listening on their internal port 80. Docker’s NAT maps distinct host ports to each one. From the host, the two services are distinguishable by port; from inside each container, there is no port conflict because each has its own network namespace.

flowchart LR
  subgraph host["Host"]
    h1[":8080"]
    h2[":9090"]
  end
  subgraph bridge["docker0 bridge  172.17.0.0/16"]
    cA["Container A\n172.17.0.2:80\nnginx"]
    cB["Container B\n172.17.0.3:80\nnginx"]
  end

  h1 -- "NAT" --> cA
  h2 -- "NAT" --> cB

The default bridge network has one significant limitation: containers on it cannot reach each other by name, only by IP. When you create a user-defined bridge network, Docker adds an embedded DNS resolver. Containers attached to the same user-defined network can reach each other by container name or service name, without knowing IP addresses.

# Create a named network
docker network create myapp
# Run two containers on it
docker run -d --network myapp --name database postgres:16-alpine
docker run -d --network myapp --name web alpine:3.20 sleep infinity
# 'web' can resolve 'database' by name on the shared network
docker exec web ping -c 1 database

This DNS-based service discovery is how Docker Compose inter-service communication works: Compose creates a user-defined network for each project and attaches all services to it. A web service can reach a database service at the hostname database without any manual network configuration. Name resolution only tells you the other container is reachable on the network; it does not guarantee the application inside that container is ready to serve requests.

Host networking (--network host) skips the separate network namespace: the container shares the host’s network stack, and its processes bind directly to the host’s network interfaces. There is no port mapping; if the container listens on port 8080, that port is open on the host itself. This is useful for network tooling or performance-sensitive applications but sacrifices the isolation that container networking normally provides.

No networking (--network none) gives the container only a loopback interface. It cannot make or receive network connections. This is appropriate for batch processing jobs that read and write files and have no reason to access the network.

Docker provides two mechanisms for data that must outlive a container: volumes and bind mounts.

Bind mounts map a specific path on the host filesystem into the container. They are useful during development when you want the container to see your source code changes in real time without rebuilding the image, or when you need to inject configuration files from the host.

Volumes are managed by Docker and stored in a Docker-controlled area of the host filesystem (/var/lib/docker/volumes/ on Linux). Docker handles their permissions and lifecycle. They are the preferred mechanism for application data that must persist across container replacements.

# Create a named volume
docker volume create db-data
# Attach it to a container
docker run -d -v db-data:/var/lib/postgresql/data postgres:16-alpine
# List volumes
docker volume ls
# Inspect where a volume lives on the host
docker volume inspect db-data
# Remove a volume (only if no container is currently using it)
docker volume rm db-data

The key distinction between the two mechanisms: bind mounts express “share this specific host path with the container,” while volumes express “give this container persistent storage that Docker manages.” For production databases, volumes are preferred because they are not tied to a specific host path, integrate cleanly with Docker tooling, and are generally easier to back up or migrate than bind mounts. The backup and migration still require explicit tooling or operational process; Docker does not do them automatically for you.

The most common surprise with persistent storage is permissions. Many production images run as a non-root user for security, but a mounted path keeps the ownership and mode bits of the host directory or Docker-managed volume that is attached at runtime. If that mounted path is owned by root or by some other UID/GID, the application can fail with Permission denied even though the image itself was built correctly. A chown done during docker build changes the image layer, not the runtime mount that later covers that path.
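
When you hit this, the first step is to compare the owner of the mounted path with the identity of the running process. Here is a small diagnostic sketch, assuming a POSIX system; describe_mount is an illustrative name, and /var/lib/postgresql/data in the usage note is simply the path used in the earlier example.

```python
import os


def describe_mount(path):
    """Report who owns a path and whether the current process can write to it."""
    st = os.stat(path)
    return {
        "path": path,
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "mode": oct(st.st_mode & 0o777),       # permission bits only
        "process_uid": os.getuid(),
        "process_gid": os.getgid(),
        "writable": os.access(path, os.W_OK),  # checked against the real UID/GID
    }
```

Running describe_mount("/var/lib/postgresql/data") inside the container (for example through docker exec) shows at a glance whether owner_uid and process_uid disagree.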

Docker Compose: Multi-Container Applications


Real applications rarely consist of a single container. A web application typically needs a web server, a database, and perhaps a cache or message queue. Docker Compose lets you define and manage multi-container applications using a single YAML file.

Consider a web application that needs a database. The compose.yml file describes both services and their relationship:

services:
  web:
    build: .
    ports:
      - "8080:80"
    environment:
      - DATABASE_URL=${DATABASE_URL}
    depends_on:
      database:
        condition: service_healthy
    restart: unless-stopped
  database:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=${POSTGRES_DB}
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:

With this pattern, Compose reads values from shell environment variables or from a local .env file in the same directory as compose.yml. For example, .env can define DATABASE_URL, POSTGRES_DB, POSTGRES_USER, and POSTGRES_PASSWORD without hardcoding them in the Compose file.

Several things to notice:

Services are the containers you want to run. Each service can either build from a local Dockerfile (as web does here) or pull a pre-built image from a registry (as database does).

Networking is automatic. Compose creates a user-defined bridge network for the application and attaches all services to it. The web service can connect to the database at the hostname database on port 5432 without manual network configuration. A connection string such as postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@database:5432/${POSTGRES_DB} works because database resolves to the database container’s IP address within the shared network.

Environment variables are the standard mechanism for injecting configuration. Values that differ between environments (development, staging, production) such as database URLs, API keys, and feature flags are passed in at runtime, not baked into the image. Keep secrets out of version control by using .env files that are git-ignored, Docker secrets, or an external secrets manager.
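
On the application side, reading that configuration might look like the sketch below. It assumes the DATABASE_URL variable from the example and a postgres:// URL shape; the function name and the returned keys are illustrative. The password is deliberately left out of the returned dictionary so it does not end up in logs by accident.

```python
import os
from urllib.parse import urlsplit


def load_database_config(environ=None):
    """Parse DATABASE_URL from the environment into its connection parts."""
    environ = os.environ if environ is None else environ
    url = environ.get("DATABASE_URL")
    if not url:
        # Fail fast at startup instead of at the first query.
        raise RuntimeError("DATABASE_URL is not set")
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,        # PostgreSQL's default port
        "user": parts.username,
        "database": parts.path.lstrip("/"),
    }


example = {"DATABASE_URL": "postgres://app:secret@database:5432/appdb"}
print(load_database_config(example))
```

Because the function takes the environment as a parameter, the same image can be pointed at a different database per environment without being rebuilt.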

Volumes declared at the top level (the volumes: key) are named volumes that Docker creates and manages. In the example, db-data: declares a named volume. When a service references it with - db-data:/var/lib/postgresql/data, that volume is mounted into the container at that path. Named volumes persist across docker compose down and docker compose up cycles unless you explicitly pass --volumes to the down command. If instead you wanted a bind mount (host directory), you would write - ./local-path:/var/lib/postgresql/data, using a relative or absolute path instead of a volume name.

depends_on controls startup order. In its simplest form, it only means the dependency container is started first, not that the service inside is actually ready to accept connections. In the example above, the long form with condition: service_healthy works together with the database healthcheck so the web service waits for PostgreSQL to report readiness. Even then, production applications should still include retry logic, because distributed systems can fail after startup too.
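
A minimal form of that retry logic is sketched below: it attempts a TCP connection a few times with exponential backoff. The function name and default values are assumptions for illustration. An open port only shows that something is listening, so a production check would go on to run a real query.

```python
import socket
import time


def wait_for_port(host, port, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Return True once a TCP connection succeeds, False after all attempts fail."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...
    return False
```

Inside the web container from the example, a call such as wait_for_port("database", 5432) would be the natural use, since database resolves through the Compose network.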

The workflow is:

# Build images and start all services in the background
docker compose up --build -d
# View running services
docker compose ps
# View logs from all services
docker compose logs
# Follow logs from a specific service
docker compose logs -f web
# Stop and remove containers (preserves volumes)
docker compose down
# Stop and remove containers AND volumes
docker compose down --volumes

An image built locally is only useful if it can be shared and deployed. Registries are servers that store and distribute images. The most common:

  • Docker Hub (hub.docker.com): the default public registry; hosts official images for nginx, postgres, redis, and thousands of community images.
  • GitHub Container Registry (GHCR) (ghcr.io): integrated with GitHub; images are published alongside source code in the same organization.
  • Amazon ECR, Google Artifact Registry, Azure Container Registry: cloud-provider registries for production deployments, typically used with CI/CD pipelines.

Before pushing, an image must be tagged with the destination repository path. The format is registry/repository:tag. If no registry is specified, Docker assumes Docker Hub.

# Tag for Docker Hub
docker tag my-app myusername/my-app:1.0.0
docker tag my-app myusername/my-app:latest
# Tag for GitHub Container Registry
docker tag my-app ghcr.io/myusername/my-app:latest
# Push
docker push myusername/my-app:latest
# Pull on another machine
docker pull myusername/my-app:latest

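Tags are mutable pointers. latest is a convention, not a guarantee of recency; anyone with push access to the repository can point latest at any image version. The same is true of version tags like 1.0.0, which stay fixed only by convention or registry policy. For reproducible deployments, use specific version tags that your release process never reassigns, or image digest references (@sha256:...), which are immutable because they identify the image by its content.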
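
The defaults hidden in a short image name can be spelled out in code. The parser below is a simplified sketch of the normalization rules, not Docker's own implementation, and it skips validation: a first path component counts as a registry if it contains a dot or colon or is localhost, Docker Hub is written as docker.io, and official images live under library/.

```python
def parse_image_reference(ref):
    """Split an image reference into registry, repository, tag, and digest."""
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)

    first, slash, rest = ref.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", ref      # no registry given: Docker Hub

    tag = None
    if ":" in path.rsplit("/", 1)[-1]:         # a colon after the last slash is a tag
        path, tag = path.rsplit(":", 1)

    if registry == "docker.io" and "/" not in path:
        path = "library/" + path               # official images, e.g. nginx
    if tag is None and digest is None:
        tag = "latest"
    return {"registry": registry, "repository": path, "tag": tag, "digest": digest}


for ref in ["nginx", "myusername/my-app:1.0.0", "ghcr.io/myusername/my-app:latest"]:
    print(ref, "->", parse_image_reference(ref))
```

Note that a reference with no tag and no digest silently becomes latest, which is one more reason to write the tag out explicitly.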

The reason tagging and registries matter operationally is that they make version upgrades clean. Because the container is an ephemeral runtime instance and the image is the immutable artifact, upgrading means pulling a new image and recreating the container, not modifying a running process in place. The old container’s filesystem is discarded; any data that matters is in a volume and survives untouched.

With Docker Compose, this workflow is straightforward:

# Pull all updated images declared in compose.yml
docker compose pull
# Recreate containers that are using outdated images
docker compose up -d

Compose detects which services have a newer image than what is currently running and recreates only those containers. Services whose image has not changed are left alone. The named volumes attached to each service are unaffected by the container recreation.

The same principle applies to rollbacks. Because each image version is a separate, immutable artifact, rolling back to an older version means pulling the previous tag and recreating the container. There is no in-place state to unwind because the state never lived in the container:

# Roll back after changing the service's image tag to 1.2.3 in compose.yml
docker compose pull web
docker compose up -d

This clean separation between application code (in the image) and application data (in the volume) is one of the core operational benefits of the container model. It is also why pinning to a specific version tag like 1.2.3 rather than latest matters in production: you always know exactly which image is running and can recreate it precisely or roll back to it by name.

Docker popularized containers, but it is not the only tool in the ecosystem. Understanding the standards that govern container interoperability helps you reason about alternatives.

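The Open Container Initiative (OCI), formed in 2015 by Docker, Red Hat, Google, and others, maintains the specifications that govern container interoperability. Two of them matter most here (a third, the Distribution Specification, standardizes the registry API):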

  • OCI Image Specification: how container images are formatted and stored (layers, manifests, configuration).
  • OCI Runtime Specification: the interface a container runtime must implement to create and manage containers.

Because Docker images conform to the OCI image specification, any OCI-compatible runtime can run them. This is what makes the ecosystem of alternative tools possible.

runc is the OCI reference runtime: a small, low-level tool that asks the Linux kernel to set up namespaces and cgroups, then starts the container process inside them. Both Docker and Podman use runc (or a compatible implementation) under the hood. In Docker’s architecture, containerd sits between the Docker daemon and runc. It is the component that manages container lifecycle tasks such as pulling images, unpacking layers, and supervising running containers. When you run docker run, the chain is: Docker client → Docker daemon → containerd → runc, with runc making the actual kernel calls to create namespaces and apply cgroups.

Podman is an OCI-compatible container engine developed by Red Hat. It is the default container tool on RHEL 8+, Fedora, and AlmaLinux, and is available on all major Linux distributions and macOS.

Podman differs from Docker in two important ways:

Daemonless. There is no long-running background process. Each podman command is a direct process that creates and manages containers itself, then exits. There is no single daemon running as root to which all operations are delegated.

Rootless by default. Containers run as your own user identity on the host rather than requiring root privileges or a root-owned daemon. If an attacker were to escape the container’s namespace isolation, they would land as an unprivileged user, not as root.

Podman’s CLI is deliberately compatible with Docker’s. Many common docker commands work with podman substituted, and some users who have transitioned even run alias docker=podman for day-to-day work. The compatibility is high, but it is not perfect across every flag, workflow, or ecosystem integration. Podman reads OCI images, so it can pull and run images from Docker Hub, GHCR, or any other OCI registry. For Compose-style workflows, podman compose typically invokes an external Compose provider such as docker-compose or podman-compose while pointing it at the local Podman socket.

On macOS, and in many Windows setups that run Linux containers, containers require a Linux VM because container isolation depends on Linux kernel features. Docker Desktop and Podman Desktop manage this VM transparently. OrbStack is a macOS-only alternative to Docker Desktop. It runs a lightweight Linux VM with fast startup time, low memory overhead, and strong filesystem performance, and it is fully compatible with the Docker CLI and Docker socket, so all docker commands work without modification.

Rancher Desktop is a free, open-source Docker Desktop alternative available on macOS, Windows, and Linux. It runs a Linux VM (using Lima on macOS and Linux, WSL2 on Windows) and supports either containerd or dockerd as the container engine, selectable in its settings. It also bundles an optional lightweight Kubernetes cluster (k3s), making it a common choice for developers who need both a local Docker workflow and a local Kubernetes environment without running a separate tool. When dockerd is the selected engine, it is compatible with the Docker CLI and Docker socket, as OrbStack is; with containerd, the bundled nerdctl CLI takes the place of docker.

The practical implication: because the OCI standard defines both image format and runtime interface, the concepts in this lecture (images, layers, Dockerfiles, volumes, Compose) apply regardless of whether you use Docker, Podman, OrbStack, or Rancher Desktop. The image format, Dockerfile syntax, and Compose file format are all standardized or widely compatible across the ecosystem.

Managing a handful of containers manually works for development and small deployments. At scale, you need container orchestration: a system that automatically deploys, scales, heals, and load-balances containers across a cluster of machines.

Kubernetes (abbreviated K8s, where 8 represents the eight letters between “K” and “s”) is the dominant orchestration platform. A Kubernetes cluster consists of a control plane (the components that make scheduling and management decisions) and worker nodes (the machines that actually run containers). You describe desired state in YAML manifests (“run three replicas of this image, expose them on this port, restart if they fail”), and Kubernetes continuously works to make reality match that description. Key capabilities include:

  • Automated deployment and rollback
  • Horizontal scaling (add more replicas under load)
  • Self-healing (replace failed containers automatically)
  • Load balancing across replicas
  • Rolling updates with zero downtime
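The desired-state idea can be illustrated with a toy reconciliation step: compare what should be running with what is running, and emit the actions that close the gap. This is a simplified sketch of the concept, not Kubernetes code or its API; the function name and the web-N replica names are hypothetical.

```python
def reconcile(desired_replicas: int, running: set[str]) -> list[tuple[str, str]]:
    """Return the actions needed to move the observed state to the desired state."""
    # Hypothetical naming scheme: replicas are called web-0, web-1, ...
    wanted = {f"web-{i}" for i in range(desired_replicas)}
    actions = []
    for name in sorted(wanted - running):
        actions.append(("start", name))  # missing replica: self-healing or scale-up
    for name in sorted(running - wanted):
        actions.append(("stop", name))   # surplus replica: scale-down
    return actions


# One replica of three has crashed, so a single "start" action is produced.
print(reconcile(3, {"web-0", "web-2"}))
```

An orchestrator runs a loop like this continuously, which is why self-healing and scaling fall out of the same mechanism: both are just differences between the desired and observed state.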

Kubernetes will be discussed later in more depth.

Docker Swarm is Docker’s own simpler orchestration tool. It is easier to set up than Kubernetes but less capable. For most new projects, Kubernetes (or a managed variant like Amazon EKS, Google GKE, or Azure AKS, where the cloud provider operates the control plane) is the industry standard.

The following practices are not optional refinements to apply later. They address failure modes that appear early in real deployments. Oversized images slow down CI pipelines and increase attack surface. Secrets baked into image layers have caused publicized security incidents. Missing health checks cause orchestrators to route traffic to instances that are not yet ready. Processes running as root inside containers have led to host compromise when container escapes were discovered. Each practice below translates directly into images that are smaller, more secure, and more reliable.

The python:3.12 image is over 900 MB; python:3.12-slim is around 150 MB. Alpine-based images can be even smaller, though they use musl instead of glibc, which can cause compatibility issues with some compiled C extensions. Start with -slim variants and switch to Alpine only if your dependencies support it and image size is a significant concern.

Just as .gitignore prevents unnecessary files from entering version control, .dockerignore prevents unnecessary files from entering the build context. Without it, docker build sends everything in the current directory to the daemon, including .git/, virtual environments, node_modules/, and other large directories. This slows builds and risks including sensitive files in the image.

.git
.venv
__pycache__
*.pyc
.env
node_modules

By default, processes inside a container run as root. If an attacker exploits a vulnerability in your application, they gain root access inside the container, which may translate to host access through misconfigurations. Create a dedicated user in your Dockerfile and switch to it before the runtime command:

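FROM python:3.12-slim
# Create an unprivileged user with its own home directory
RUN useradd --create-home appuser
WORKDIR /home/appuser/app
# Install dependencies first so this layer stays cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code (still root-owned, so appuser can typically read but not modify it)
COPY . .
# Drop privileges: everything from here on, including CMD, runs as appuser
USER appuser
EXPOSE 8080
CMD ["python", "server.py"]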

The USER instruction switches all subsequent commands and the container’s runtime process to the non-privileged appuser.

This hardening step does not solve mounted-storage permissions by itself. If /data, /config, or another writable path is supplied by a bind mount or volume, that mounted path must also be writable by the same UID/GID. Once the mount itself has been prepared correctly, later restarts and reuse of that same volume or host path usually keep working; a fresh mount is what forces you to initialize permissions again.
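A permissions mismatch on a mounted path often shows up only when the application first tries to write, sometimes long after startup. One way to surface it earlier is a fail-fast check at startup. The following is a minimal sketch for a Linux container; the function name is hypothetical, and the paths to check are whatever your application mounts.

```python
import os
import stat


def writable_problems(path: str) -> list[str]:
    """Describe why the current process could not write to path (empty list if it can)."""
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]
    # Creating files in a directory needs both write and search (execute) permission.
    if os.access(path, os.W_OK | os.X_OK):
        return []
    info = os.stat(path)
    return [
        f"{path} is owned by {info.st_uid}:{info.st_gid} "
        f"with mode {stat.filemode(info.st_mode)}, "
        f"but this process runs as {os.getuid()}:{os.getgid()}"
    ]
```

Calling this for each mounted path before the server starts, and exiting with the returned message if the list is not empty, turns a confusing runtime error into a clear one that names the UID/GID that needs access.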

A container can be “running” (the process has not exited) without being “healthy” (the application is actually responding). Docker supports a HEALTHCHECK instruction that periodically runs a command to verify the application’s state:

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD curl -f http://localhost:8080/ || exit 1

Docker marks the container as healthy, unhealthy, or starting based on the result. Docker and Docker Compose can surface this state directly, and orchestrators provide analogous mechanisms such as readiness and liveness probes in Kubernetes. The underlying idea is the same: a process can still be running even when the application is not yet ready or has become unhealthy.
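A curl-based check only works if curl is installed in the image, and slim base images often omit it. Since a Python image already ships an interpreter, the check can be written with the standard library instead. The following is a minimal sketch; the function name and the healthcheck.py file name are hypothetical, and the URL assumes the application listens on port 8080 as in the earlier Dockerfile.

```python
import http.client
import urllib.request

# Ignore any proxy settings: the check should talk to the local app directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def health_exit_code(url: str, timeout: float = 3.0) -> int:
    """Return 0 if url answers with a 2xx status within the timeout, else 1."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, a subclass of OSError, for 4xx and 5xx).
        return 1


# When saved as a script (hypothetical name: healthcheck.py), end the file with:
#     import sys
#     sys.exit(health_exit_code("http://localhost:8080/"))
```

With that final line added, the Dockerfile instruction would become HEALTHCHECK CMD ["python", "healthcheck.py"]. Docker treats exit code 0 as healthy and 1 as unhealthy.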

Never put passwords, API keys, or certificates in a Dockerfile or in files that get copied into the image. Anyone who pulls the image can extract them from any layer. Inject secrets at runtime through environment variables, Docker secrets (in Swarm mode), or external secret managers (HashiCorp Vault, AWS Secrets Manager). Even environment variables are visible via docker inspect, so for highly sensitive values, mount a secrets file from a volume or use a dedicated secrets management tool.
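On the application side, supporting file-based secrets takes only a few lines. The following sketch prefers a mounted file over a plain environment variable. It relies on two conventions: a NAME_FILE variable that points at a file, as several official images accept, and the /run/secrets/<name> path where Docker mounts secrets. The function name and the lookup order are assumptions, not a standard API.

```python
import os
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Look up a secret, preferring a mounted file over a plain environment variable."""
    candidates = []
    # 1. Explicit file path, e.g. DB_PASSWORD_FILE=/run/secrets/db_password
    explicit = os.environ.get(f"{name}_FILE")
    if explicit:
        candidates.append(Path(explicit))
    # 2. Conventional mount location, e.g. /run/secrets/db_password
    candidates.append(Path(secrets_dir) / name.lower())
    for path in candidates:
        if path.is_file():
            # Secret files usually end with a newline that is not part of the value.
            return path.read_text(encoding="utf-8").rstrip("\n")
    # 3. Fall back to the environment variable itself (visible via docker inspect).
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name} was not provided")
    return value
```

An application would call read_secret("DB_PASSWORD") once at startup, so the same image works unchanged whether the secret arrives as a file or as an environment variable.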

Everything described in this lecture (namespaces, cgroups, runc, Docker, Podman) assumes a Linux kernel. That is not incidental. Namespaces and cgroups are Linux kernel features; they do not exist on macOS or Windows kernels. This is why Docker Desktop on macOS runs a lightweight Linux VM under the hood, and why Podman on macOS does the same. When you run docker run on a Mac, your container is not running on macOS directly; it is running on a Linux kernel inside a VM that Docker manages transparently. The tools hide this, but understanding it explains why you cannot run a Linux container on a Windows host without virtualization. A related constraint follows from sharing a kernel rather than emulating hardware: a container executes native machine code, so the image’s architecture (for example amd64 or arm64) must match the CPU it runs on unless an emulation layer such as QEMU is configured.

The Linux-specific design of the mainstream container stack does not mean containerization as a concept is unique to Linux. Different operating systems and environments have developed their own isolation mechanisms, each reflecting the design priorities of their platform.

FreeBSD jails predate Docker by over a decade. Introduced in FreeBSD 4.0 in 2000, a jail creates an isolated environment with its own filesystem subtree, hostname, and IP address, similar in concept to what Linux namespaces provide. Processes inside a jail cannot see or affect processes outside it, and even root inside the jail has limited capabilities on the host. FreeBSD jails were a major influence on Linux container design, and the problem they solved (isolating multiple services on a shared server) is exactly the problem Docker was built to solve. The key difference is platform: jails run on FreeBSD only and use BSD kernel primitives rather than Linux cgroups and namespaces. They are still in active use in production FreeBSD environments.

Microsoft added native container support in Windows Server 2016. Windows containers use Windows-specific kernel features, analogous to Linux namespaces and cgroups, to isolate Windows processes. They can run Windows executables, Windows services, and Windows-native applications. They are not Linux containers; you cannot run a Linux image in a Windows container or vice versa. Windows containers come in two isolation modes: process isolation (sharing the Windows host kernel, similar to Linux containers) and Hyper-V isolation (each container runs inside its own lightweight VM with its own kernel, while keeping the container interface). Hyper-V isolation exists to provide a stronger isolation boundary and broader compatibility between host and container versions.

WebAssembly (WASM) started as a browser technology for running compiled code at near-native speed in a safe sandbox. In recent years, efforts like WASI (WebAssembly System Interface) have extended it to server-side environments, and tools like wasmtime and WasmEdge can run WASM modules as server workloads. The isolation model is fundamentally different from Linux containers: instead of OS-level process isolation using kernel features, WASM provides language-level sandboxing enforced by the runtime. A WASM module cannot access memory or system calls outside what the host explicitly grants, regardless of what operating system or kernel is running. The practical appeal is extreme portability: the same WASM binary can run on Linux, macOS, Windows, and even embedded devices, without recompilation or a VM. The tradeoff is that WASM has a different execution model and limited language support, and the server-side ecosystem is still maturing compared to OCI containers. Some container orchestrators, including Kubernetes, have experimental support for running WASM workloads alongside traditional containers.

The common thread across all of these approaches is the same problem: how do you run multiple isolated workloads on shared infrastructure without them interfering with each other? FreeBSD jails solved it with BSD kernel primitives. Linux solved it with namespaces and cgroups. Windows solved it with its own kernel objects and optionally Hyper-V. WASM solves it at the language runtime level, independent of the OS entirely. Docker’s dominance reflects that Linux became the default server platform, not that the container concept itself is Linux-specific.

Consider what deploying a Python web application with a PostgreSQL database looks like using everything in this lecture. You write a Dockerfile starting from python:3.12-slim, copy requirements.txt first, run pip install, then copy the rest of the source code. That ordering is deliberate: it keeps the install layer cached across code changes, so rebuilds are fast. You add a HEALTHCHECK instruction so the container does not report healthy until the application is actually serving requests, and you switch to a non-root appuser before the CMD so the runtime process has minimal privileges.

In compose.yml you declare two services: web and database. Compose creates a user-defined bridge network for the project and attaches both services to it, so web can connect to database by hostname without any manual network configuration. The database service uses a named volume for its data files so they survive container replacement. The web service declares depends_on with condition: service_healthy, so Compose starts web only after the database service’s health check passes; this works only if the database service defines a healthcheck of its own. You bring everything up with docker compose up --build -d and follow startup with docker compose logs -f.

When it is time to deploy, you tag the image with a specific version like 1.3.0 and push it to GHCR. On the target server, docker compose pull fetches the new image, and docker compose up -d replaces the running container while leaving the named volume intact. The data never lived in the container, so the replacement is clean. Rolling back means setting the image tag back to the previous version and running the same two commands again.

The container model introduces a clean separation of concerns: the image is the immutable artifact, the container is the ephemeral runtime instance, volumes hold the state that outlives any individual instance, and registries distribute images wherever they need to run. That separation is also what makes version upgrades and rollbacks straightforward: you are not modifying a running system; you are swapping one image for another while the data stays put.

What containers do not replace is the need to understand the underlying systems. All of this works because Linux provides namespaces and cgroups. On macOS or Windows, a Linux VM sits between you and those kernel features, even if the tools hide it. Other platforms solve the same isolation problem differently: FreeBSD with jails, Windows with its own kernel primitives, and WebAssembly by stepping outside the OS model entirely. Docker’s dominance reflects Linux’s dominance on servers, not the universality of the specific mechanism.