
Cluster Operations: Health Probes, Rollouts, and Failure Drills (Kubernetes)

The lunch rush crashed the website at all three locations simultaneously. Gerald called it “a digital fire.” He wants “those health check things” and a plan for when things break. You asked who would be on call. Gerald said, “You’re on call.” You asked when. He said, “Always.”

Your WordPress stack (nginx → WordPress → MariaDB) is running, but Kubernetes does not yet know whether your application is actually healthy; it only knows whether the container processes are running. A container can be running but completely broken (returning errors, stuck in a loop, out of memory). In this lab, you will add health probes that let Kubernetes make smarter decisions, set resource limits so one container cannot starve the node, practice rolling updates and rollbacks on the nginx reverse proxy, intentionally break things to learn how to diagnose and recover from failures, and correct a data durability gap from the previous lab by adding persistent storage to MariaDB.

You need:

  • The k3s cluster from Lab 7 with all manifests still applied (nginx, WordPress, MariaDB, and their Services)
  • SSH access to your EC2 instance

If you ended your AWS Academy session, restart it. The EC2 instance and cluster will still be there. If you deleted the resources from Lab 7, re-apply all the manifests from that lab before starting.

Liveness Probe: Tells Kubernetes whether the container is alive. If the liveness probe fails, Kubernetes kills the container and restarts it. Use this to recover from deadlocks or stuck processes.

Readiness Probe: Tells Kubernetes whether the container is ready to receive traffic. If the readiness probe fails, Kubernetes removes the Pod from Service endpoints (stops sending it requests) but does not kill it. Use this during startup or when a container is temporarily overwhelmed.

Resource Requests: The minimum CPU and memory a container needs. The Kubernetes scheduler uses requests to decide which node can fit the Pod.

Resource Limits: The maximum CPU and memory a container can use. If a container exceeds its memory limit, it is killed with an Out of Memory (OOM) error.

PersistentVolumeClaim (PVC): A request for storage from a workload. Kubernetes binds the claim to a PersistentVolume (PV), a piece of storage provisioned on the cluster, and mounts it into the container. Unlike the container’s ephemeral writable layer, data in a PVC survives pod deletion and node restarts. k3s ships with a local-path StorageClass that automatically provisions PVs backed by directories on the node’s local filesystem.

Watch for the answers to these questions as you follow the tutorial.

  1. Run kubectl get pvc. What is the STATUS and STORAGECLASS of mariadb-pvc? After deleting the MariaDB Pod and letting Kubernetes replace it, did your WordPress data survive? What does this demonstrate about PersistentVolumeClaims compared to the ephemeral storage used in Lab 7? (3 points)
  2. How many revisions does kubectl rollout history deployment/nginx show after the rolling update? What image tag was used in the most recent revision, and what CHANGE-CAUSE text did you record for that revision? (3 points)
  3. During the rolling update, what changed after you ran kubectl rollout pause deployment/nginx and then kubectl rollout resume deployment/nginx? Did you observe both old and new Pods running at the same time, and how does that support zero downtime? (5 points)
  4. When you deployed the nonexistent image (nginx:doesnotexist), what error status did the Pod show? Write down the exact error message from the Events section of kubectl describe pod. (4 points)
  5. When you set the memory limit to 1Mi, what failure signal did you observe? Depending on container runtime timing, you may see either Pod status OOMKilled/CrashLoopBackOff with exit code 137, or FailedCreatePodSandBox Events that include container init was OOM-killed. Explain what either result means. (5 points)
  6. After rolling back to a specific revision (kubectl rollout undo --to-revision=<n>), what image tag are the healthy Pods running? How many Pods show 1/1 Ready? (2 points)
  7. Get your TA’s initials showing all Pods in a Running/Ready state after recovery from the OOMKill drill. (3 points)

In Lab 7, MariaDB stored all its data inside the container’s own writable layer. That layer is ephemeral: it is discarded whenever the Pod is deleted, crashes, or the node restarts. Stopping and restarting the EC2 instance causes k3s to terminate and relaunch all Pods, wiping the database each time.

Kubernetes separates storage from compute using two objects:

  • PersistentVolume (PV): A piece of storage provisioned on the cluster: a directory on the node’s filesystem, an EBS volume, an NFS share, and so on. A PV exists independently of any Pod.
  • PersistentVolumeClaim (PVC): A request for storage from a workload. Kubernetes binds a PVC to a matching PV and mounts it into the container. Data in a PVC survives pod deletion and node restarts.

k3s ships with a local-path StorageClass that automatically provisions PVs backed by directories on the node’s local disk. This is not suitable for multi-node clusters (data is tied to one specific node), but it is exactly right for a single-node learning environment.

  1. Write the PVC manifest

    SSH into your EC2 instance and navigate to your lab directory, then write the PVC manifest:

    cd ~/k8s-lab
    vim mariadb-pvc.yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mariadb-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: local-path
      resources:
        requests:
          storage: 1Gi

    ReadWriteOnce means only one node can mount this volume at a time, which is correct for a single-replica database. local-path tells k3s to provision a directory on the node’s disk automatically.

  2. Apply the PVC

    kubectl apply -f mariadb-pvc.yaml
    kubectl get pvc

    The STATUS column will show Pending initially. The local-path StorageClass uses delayed binding (volumeBindingMode: WaitForFirstConsumer): the PV is not created until a Pod that uses the PVC is scheduled. The claim will become Bound once the updated Deployment starts.

  3. Update the MariaDB Deployment to use the PVC

    Edit mariadb-deployment.yaml and add a volumeMounts entry to the container and a volumes entry to the Pod spec:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mariadb
      labels:
        app: mariadb
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mariadb
      template:
        metadata:
          labels:
            app: mariadb
        spec:
          containers:
            - name: mariadb
              image: mariadb:11
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: db-root-password
                - name: MYSQL_DATABASE
                  value: wordpress
                - name: MYSQL_USER
                  value: wordpress
                - name: MYSQL_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: db-password
              ports:
                - containerPort: 3306
              volumeMounts:
                - name: mariadb-data
                  mountPath: /var/lib/mysql
          volumes:
            - name: mariadb-data
              persistentVolumeClaim:
                claimName: mariadb-pvc

    MariaDB stores all its data in /var/lib/mysql. Mounting the PVC there means all writes go to the EC2 instance’s disk rather than the container’s ephemeral layer.

  4. Apply the updated Deployment

    kubectl apply -f mariadb-deployment.yaml
    kubectl rollout status deployment/mariadb

    kubectl rollout status works for any Deployment, including a single-replica one like MariaDB. Here it means “wait until the new MariaDB Pod is available.” This differs from the previous lab, where the two WordPress replicas let old and new Pods overlap to keep traffic flowing during the update.

  5. Verify the PVC is bound

    kubectl get pvc

    STATUS should now show Bound. Record the STATUS and STORAGECLASS values for your lab questions.

  6. Verify data persists across pod restarts

    Log in to the WordPress admin and set a distinctive site title (e.g., “Gerald’s Persistent Restaurant”). Then delete the MariaDB Pod:

    kubectl delete pod -l app=mariadb

    Watch Kubernetes replace it:

    kubectl get pods -w

    Once the new Pod is Running, reload WordPress in your browser. Your site title should still be there. The replacement Pod mounted the same PVC, which holds the data on the EC2 disk.
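
The bound-claim check and the delete-and-replace test can also be scripted. The sketch below is one possible way to do it, assuming the mariadb-pvc claim and the app=mariadb label from the manifests above. The function name pvc_phase is made up for illustration and is not part of the lab files.

```shell
#!/bin/sh
# Hypothetical helper (not part of the lab files): print the phase
# (for example Pending or Bound) of the PersistentVolumeClaim named in $1.
pvc_phase() {
    kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# Usage on the lab cluster (uncomment to run):
# pvc_phase mariadb-pvc                      # phase before the test
# kubectl delete pod -l app=mariadb          # force a replacement Pod
# kubectl rollout status deployment/mariadb  # wait for the replacement
# pvc_phase mariadb-pvc                      # the claim itself is untouched
```

The claim is a separate object from the Pod, so deleting the Pod should leave its phase unchanged.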

At this point your stack is running, but Kubernetes still needs explicit rules for traffic safety and resource fairness. The two probes serve different purposes: liveness answers “is the nginx process itself alive?” and triggers a restart if it is not, while readiness answers “can this nginx Pod successfully serve a real request right now?” and temporarily removes only that Pod from nginx-service endpoints when it cannot. Resource settings solve a different problem: requests reserve minimum CPU/memory so the scheduler can place Pods reliably, and limits cap maximum usage so one container cannot starve the node. Together, probes and resource controls improve availability during failures, updates, and load spikes.

  1. Update the nginx ConfigMap to add a health endpoint

    Edit nginx-configmap.yaml to add a /health location that nginx handles directly (not proxied to WordPress). This lets the liveness probe check whether nginx itself is alive, independently of whether WordPress is reachable:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nginx-config
    data:
      default.conf: |
        server {
            listen 80;
            location /health {
                access_log off;
                return 200 "ok\n";
                add_header Content-Type text/plain;
            }
            location / {
                proxy_pass http://wordpress-service;
                proxy_set_header Host $http_host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
            }
        }

    Apply the updated ConfigMap:

    kubectl apply -f nginx-configmap.yaml

  2. Update the nginx Deployment to add probes and resource constraints

    Edit nginx-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      replicas: 4
      minReadySeconds: 15
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - name: nginx
              image: nginx:1.27
              ports:
                - containerPort: 80
              livenessProbe:
                httpGet:
                  path: /health
                  port: 80
                initialDelaySeconds: 5
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 3
                periodSeconds: 5
              resources:
                requests:
                  memory: "64Mi"
                  cpu: "50m"
                limits:
                  memory: "128Mi"
                  cpu: "100m"
              volumeMounts:
                - name: nginx-config-volume
                  mountPath: /etc/nginx/conf.d
          volumes:
            - name: nginx-config-volume
              configMap:
                name: nginx-config

    Let’s walk through the new sections:

    • replicas: 4: We intentionally run four nginx Pods in this lab so rollouts are easier to observe. With more replicas, old and new Pods overlap longer during updates, so pause/resume behavior is visible instead of finishing too quickly.
    • minReadySeconds: 15: A new Pod must stay Ready for 15 seconds before Kubernetes marks it Available. This slows the rollout enough for you to inspect intermediate states.
    • strategy.rollingUpdate: maxUnavailable: 1 and maxSurge: 1 create a controlled rollout. Kubernetes can bring up one extra Pod while taking at most one old Pod out of service at a time.
    • livenessProbe: Every 10 seconds (after a 5-second initial delay), Kubernetes sends an HTTP GET to /health on port 80. nginx handles this location directly (returning 200 without touching WordPress), so the probe checks whether the nginx process itself is alive, not whether the upstream is reachable. If it fails 3 times in a row (the default failureThreshold), Kubernetes restarts the container.
    • readinessProbe: Every 5 seconds (after a 3-second delay), Kubernetes checks /, a request that nginx proxies to WordPress. If WordPress is unreachable and nginx returns 502, the readiness probe fails and that specific nginx Pod is removed from nginx-service endpoints (stops receiving traffic). It does not remove or restart WordPress Pods. Readiness failure alone does not restart nginx; the Pod is added back automatically when readiness checks succeed again.
    • When does a Pod restart?: Kubernetes restarts a container when its process exits (for example, after an OOM kill) or when the liveness probe fails repeatedly. A Deployment rollout or update replaces Pods with new ones rather than restarting them in place. A readiness failure by itself only changes traffic routing.
    • resources.requests: Each nginx container needs at least 64 mebibytes (64Mi) of memory and 50 millicores of CPU (50m = 5% of one CPU core).
    • resources.limits: Each nginx container cannot use more than 128Mi of memory or 100 millicores of CPU. If it exceeds the memory limit, the kernel’s OOM killer terminates it; if it reaches the CPU limit, it is throttled rather than killed.

  3. Apply the updated Deployment

    kubectl apply -f nginx-deployment.yaml

  4. Verify Pods are Running and Ready

    kubectl get pods

    The READY column should show 1/1 for all four nginx Pods, meaning the readiness probe (which proxies through to WordPress) is passing.
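
To make the probe timing concrete, the sketch below works out roughly how long a hung nginx container can go unnoticed under the probe settings above. It assumes the Kubernetes default failureThreshold of 3, since the manifest does not set one, and it ignores the per-probe timeout. The variable names are illustrative.

```shell
#!/bin/sh
# Liveness settings from nginx-deployment.yaml.
period_seconds=10
# Assumption: failureThreshold is not set in the manifest, so the
# Kubernetes default of 3 applies to both probes.
failure_threshold=3

# If the container hangs just after a successful probe, the restart is
# triggered once failure_threshold consecutive probes have failed.
restart_after=$((period_seconds * failure_threshold))
echo "liveness: restart after about ${restart_after}s of failed probes"

# The readiness probe runs every 5 seconds with the same threshold.
readiness_period=5
unready_after=$((readiness_period * failure_threshold))
echo "readiness: removed from endpoints after about ${unready_after}s"
```

With these settings the script reports about 30 seconds for a liveness restart and about 15 seconds for removal from nginx-service endpoints, which is why readiness reacts to a WordPress outage faster than liveness reacts to a stuck nginx.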

A rolling update replaces Pods gradually; it creates new Pods with the updated image before terminating old ones, ensuring your service never goes fully offline. In this section, you will go beyond a basic image swap and practice real operator controls: pause first, stage a change safely, annotate change-cause, then resume and observe.
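
The rollout strategy configured earlier (replicas: 4, maxSurge: 1, maxUnavailable: 1) puts hard bounds on how many Pods can exist and how many must stay available while the update runs. The arithmetic below is a small sketch of those bounds; the variable names are illustrative, and the memory figure uses the 64Mi request from the nginx manifest.

```shell
#!/bin/sh
# Values from the nginx Deployment's spec.
replicas=4
max_surge=1
max_unavailable=1

# Upper bound on Pods (old + new) that can exist mid-rollout.
max_pods=$((replicas + max_surge))
# Lower bound on Pods that must remain available mid-rollout.
min_available=$((replicas - max_unavailable))

echo "at most ${max_pods} Pods exist during the rollout"
echo "at least ${min_available} Pods stay available"

# The scheduler must be able to fit the surge Pod as well, so the peak
# memory requested by nginx is max_pods times the per-Pod request.
request_mi=64
peak_request_mi=$((max_pods * request_mi))
echo "peak nginx memory requests: ${peak_request_mi}Mi"
```

On a small single-node cluster the peak request matters: if the node cannot fit the surge Pod, the new Pod stays Pending and the rollout stalls.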

  1. Record the current revision

    kubectl rollout history deployment/nginx

    You should see at least one revision.

  2. Pause the Deployment before making changes

    kubectl rollout pause deployment/nginx

    Pausing lets you stage configuration/image changes without immediately starting a rollout.

  3. Update the image while paused

    Update directly from the command line:

    kubectl set image deployment/nginx nginx=nginx:1.27-alpine

    This changes the nginx container to use the Alpine-based variant (a smaller image), but no new Pods are created yet because the Deployment is paused.

  4. Add a change-cause annotation (common operator practice)

    kubectl annotate deployment nginx kubernetes.io/change-cause="Upgrade nginx image to 1.27-alpine" --overwrite

    This makes rollout history easier to interpret later.

  5. Verify the Deployment is paused

    kubectl get deployment nginx -o jsonpath='{.spec.paused}{"\n"}'
    kubectl get pods

    The first command should print true. You should still see only old-image Pods running.

  6. Resume and watch the rollout begin

    kubectl rollout resume deployment/nginx
    kubectl get pods -w

    You will see new Pods being created (with the new image) and old Pods being terminated. Because this lab uses 4 replicas with minReadySeconds, you have enough time to capture a moment where both old and new Pods are present. Press Ctrl+C to stop watching.

  7. Wait for rollout completion

    kubectl rollout status deployment/nginx

    This command blocks until the rollout is complete and reports success or failure.

  8. Verify the new image

    kubectl describe deployment nginx | grep Image

    It should show nginx:1.27-alpine.

  9. Check the revision history

    kubectl rollout history deployment/nginx

    You should now see multiple revisions and your custom change-cause text. Record this output.

  1. Roll back to the previous revision

    kubectl rollout undo deployment/nginx

    Kubernetes creates new Pods with the previous image and terminates the current ones.

  2. Verify the rollback

    kubectl describe deployment nginx | grep Image

    The image should be back to nginx:1.27.

  3. Roll back to a specific revision

    First inspect revision numbers:

    kubectl rollout history deployment/nginx

    Then roll back explicitly to a chosen revision number from the history output:

    kubectl rollout undo deployment/nginx --to-revision=<revision-number>
    kubectl rollout status deployment/nginx
    kubectl describe deployment nginx | grep Image

    Record the image tag and ready Pod count for your lab questions.
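
Before undoing to a specific revision, it can help to confirm which image that revision recorded. One way, sketched below, is the --revision flag of kubectl rollout history, which prints the Pod template stored for a single revision. The helper name revision_image is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the lab files): print the Image line
# stored in the Pod template of revision number $1 of deployment/nginx.
revision_image() {
    kubectl rollout history deployment/nginx --revision="$1" | grep -i 'image:'
}

# Usage on the lab cluster (uncomment and substitute a revision number
# taken from the output of: kubectl rollout history deployment/nginx):
# revision_image <revision-number>
```

Checking first avoids rolling back to a revision that holds a broken image.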

This drill simulates a deployment that references an image that does not exist, a common mistake when a tag is misspelled or a registry is unreachable.

  1. Deploy a nonexistent image

    kubectl set image deployment/nginx nginx=nginx:doesnotexist

  2. Observe the failure

    kubectl get pods

    After a few seconds, you will see new Pods stuck in ErrImagePull or ImagePullBackOff status. Most of the old Pods remain running: with maxUnavailable: 1, the rolling update takes at most one old Pod out of service and will not terminate more until new Pods become ready.

  3. Diagnose the failure

    kubectl describe pod <failing-pod-name>

    Scroll to the Events section at the bottom. You will see the exact error message, something like “manifest for nginx:doesnotexist not found.” Record this for your lab questions.

  4. Recover via rollback

    kubectl rollout undo deployment/nginx

    The failing Pods are replaced with Pods running the known-good image. Verify:

    kubectl get pods

    All Pods should return to Running and Ready state.
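
When several Pods are involved, picking the failing ones out of kubectl get pods by eye gets tedious. The filter below is a minimal sketch that reads that command's output on standard input and prints only the names of Pods stuck on an image pull. It assumes the default column order (NAME, READY, STATUS, ...); the function name failing_image_pulls is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods --no-headers` output on
# stdin and print the names of Pods whose STATUS column (field 3) is
# ErrImagePull or ImagePullBackOff.
failing_image_pulls() {
    awk '$3 == "ErrImagePull" || $3 == "ImagePullBackOff" { print $1 }'
}

# Usage on the lab cluster (uncomment to run):
# kubectl get pods --no-headers | failing_image_pulls
```

Each name it prints can be passed to kubectl describe pod to read the Events section.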

This drill demonstrates what happens when a container exceeds its memory limit. The kernel’s Out of Memory (OOM) killer terminates the process. With an extremely low limit like 1Mi, you may see one of two valid outcomes depending on runtime timing:

  • The container starts, gets killed, and the Pod shows OOMKilled and then CrashLoopBackOff.
  • The container cannot initialize at all, and Events show FailedCreatePodSandBox with container init was OOM-killed while the Pod remains Pending/ContainerCreating.

Both outcomes represent the same root cause: the memory limit is too low.

  1. Set an unreasonably low memory limit

    Edit nginx-deployment.yaml and change the memory request and limit to 1Mi (1 mebibyte, far too little for nginx to start). The request is lowered as well because a request cannot exceed its limit:

    resources:
      requests:
        memory: "1Mi"
        cpu: "50m"
      limits:
        memory: "1Mi"
        cpu: "100m"

  2. Apply the change

    kubectl apply -f nginx-deployment.yaml

  3. Observe the failure pattern

    kubectl get pods -w

    With 1Mi, you may see either OOMKilled/CrashLoopBackOff in Pod status, or the Pod may stay ContainerCreating/Pending while sandbox creation fails. In either case, Kubernetes keeps retrying because the memory limit is too low. Press Ctrl+C after you observe the pattern.

  4. Diagnose the failure (authoritative step)

    kubectl describe pod <oomkilled-pod-name>

    Look for one of these:

    • Last State: Reason OOMKilled, Exit Code 137 (128 + signal 9, SIGKILL from the OOM killer)
    • Events: FailedCreatePodSandBox with container init was OOM-killed (memory limit too low?)

    Record this output for your lab questions.

  5. Check logs when available

    kubectl logs <oomkilled-pod-name> --previous

    The --previous flag shows logs from the last terminated container instance. If the failure happened during container init (sandbox creation), this command may return little or no output because the main process never started.

  6. Fix the limits and recover

    Edit nginx-deployment.yaml and restore the memory request to 64Mi and the memory limit to 128Mi:

    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"

    Apply:

    kubectl apply -f nginx-deployment.yaml

  7. Verify recovery

    kubectl get pods

    All Pods should be Running and Ready again. Record the current image tag from:

    kubectl describe deployment nginx | grep Image
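
The exit code 137 seen in the OOM drill is not specific to Kubernetes: any process ended by SIGKILL reports 128 plus the signal number. The sketch below reproduces that on an ordinary Linux shell, with no cluster involved, by having a child shell send signal 9 to itself.

```shell
#!/bin/sh
# Start a child shell that sends SIGKILL (signal 9) to itself. This is
# the same signal the kernel's OOM killer delivers to a container
# process that exceeds its memory limit.
status=0
sh -c 'kill -9 $$' || status=$?

# The parent shell reports 128 + signal number for a signal death.
echo "exit status: ${status}"
echo "signal number: $((status - 128))"
```

On Linux this prints an exit status of 137 and a signal number of 9, matching the Exit Code shown under Last State in kubectl describe pod.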

This cluster is used in Lab 9 (Observability). Do not delete these resources yet.

To pause your work, end your AWS Academy Learner Lab session. Your EC2 instance, k3s, and all running workloads persist between sessions.

If you have completed both labs and are permanently done, tear everything down:

kubectl delete \
  -f wordpress-ingress.yaml \
  -f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml \
  -f wordpress-deployment.yaml -f wordpress-service.yaml \
  -f mariadb-deployment.yaml -f mariadb-service.yaml -f db-secret.yaml \
  -f mariadb-pvc.yaml

You have now added operational controls to your Kubernetes deployment: persistent storage that preserves MariaDB data across Pod restarts and EC2 stops, liveness probes that detect when nginx itself is stuck, readiness probes that remove nginx from rotation when WordPress is unreachable, and resource limits that prevent runaway containers. You have also practiced the full failure-recovery loop: introduce a fault, diagnose it with describe and logs, and recover with a rollback or configuration fix. In the next lab, you will add observability: monitoring dashboards, alerts, and incident detection across the full WordPress stack.