Cluster Operations: Health Probes, Rollouts, and Failure Drills (Kubernetes)

The lunch rush crashed the website at all three locations simultaneously. Gerald called it “a digital fire.” He wants “those health check things” and a plan for when things break. You asked who would be on call. Gerald said, “You’re on call.” You asked when. He said, “Always.”

Your WordPress stack (nginx → WordPress → MariaDB) is running, but Kubernetes does not yet know whether your application is actually healthy; it only knows whether the container processes are running. A container can be running but completely broken (returning errors, stuck in a loop, out of memory). In this lab, you will add health probes that let Kubernetes make smarter decisions, set resource limits so one container cannot starve the node, practice rolling updates and rollbacks on the nginx reverse proxy, intentionally break things to learn how to diagnose and recover from failures, and correct a data durability gap from the previous lab by adding persistent storage to MariaDB.

Before You Start

You need:

The k3s cluster from Lab 7 with all manifests still applied (nginx, WordPress, MariaDB, and their Services)
SSH access to your EC2 instance

If you ended your AWS Academy session, restart it. The EC2 instance and cluster will still be there. If you deleted the resources from Lab 7, re-apply all the manifests from that lab before starting.

Key Concepts

Liveness Probe: Tells Kubernetes whether the container is alive. If the liveness probe fails, Kubernetes kills the container and restarts it. Use this to recover from deadlocks or stuck processes.

Readiness Probe: Tells Kubernetes whether the container is ready to receive traffic. If the readiness probe fails, Kubernetes removes the Pod from Service endpoints (stops sending it requests) but does not kill it. Use this during startup or when a container is temporarily overwhelmed.

Resource Requests: The minimum CPU and memory a container needs. The Kubernetes scheduler uses requests to decide which node can fit the Pod.

Resource Limits: The maximum CPU and memory a container can use. If a container exceeds its memory limit, it is killed with an Out of Memory (OOM) error.

PersistentVolumeClaim (PVC): A request for storage from a workload. Kubernetes binds the claim to a PersistentVolume (PV), a piece of storage provisioned on the cluster, and mounts it into the container. Unlike the container’s ephemeral writable layer, data in a PVC survives pod deletion and node restarts. k3s ships with a local-path StorageClass that automatically provisions PVs backed by directories on the node’s local filesystem.

Questions

Watch for the answers to these questions as you follow the tutorial.

Run kubectl get pvc. What is the STATUS and STORAGECLASS of mariadb-pvc? After deleting the MariaDB Pod and letting Kubernetes replace it, did your WordPress data survive? What does this demonstrate about PersistentVolumeClaims compared to the ephemeral storage used in Lab 7? (3 points)
How many revisions does kubectl rollout history deployment/nginx show after the rolling update? What image tag was used in the most recent revision, and what CHANGE-CAUSE text did you record for that revision? (3 points)
During the rolling update, what changed after you ran kubectl rollout pause deployment/nginx and then kubectl rollout resume deployment/nginx? Did you observe both old and new Pods running at the same time, and how does that support zero downtime? (5 points)
When you deployed the nonexistent image (nginx:doesnotexist), what error status did the Pod show? Write down the exact error message from the Events section of kubectl describe pod. (4 points)
When you set the memory limit to 1Mi, what failure signal did you observe? Depending on container runtime timing, you may see either Pod status OOMKilled/CrashLoopBackOff with exit code 137, or FailedCreatePodSandBox Events that include container init was OOM-killed. Explain what either result means. (5 points)
After rolling back to a specific revision (kubectl rollout undo --to-revision=<n>), what image tag are the healthy Pods running? How many Pods show 1/1 Ready? (2 points)
Show your TA all Pods in a Running/Ready state after recovery from the OOM kill drill, and get your TA’s initials. (3 points)

Tutorial

Adding Persistent Storage for MariaDB

In Lab 7, MariaDB stored all its data inside the container’s own writable layer. That layer is ephemeral: it is discarded whenever the Pod is deleted, crashes, or the node restarts. Stopping and restarting the EC2 instance causes k3s to terminate and relaunch all Pods, wiping the database each time.

Kubernetes separates storage from compute using two objects:

PersistentVolume (PV): A piece of storage provisioned on the cluster: a directory on the node’s filesystem, an EBS volume, an NFS share, and so on. A PV exists independently of any Pod.
PersistentVolumeClaim (PVC): A request for storage from a workload. Kubernetes binds a PVC to a matching PV and mounts it into the container. Data in a PVC survives pod deletion and node restarts.

k3s ships with a local-path StorageClass that automatically provisions PVs backed by directories on the node’s local disk. This is not suitable for multi-node clusters (data is tied to one specific node), but it is exactly right for a single-node learning environment.

Write the PVC manifest

SSH into your EC2 instance, navigate to your lab directory, and create the PVC manifest:
```
cd ~/k8s-lab
vim mariadb-pvc.yaml
```
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mariadb-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
```
ReadWriteOnce means only one node can mount this volume at a time, which is correct for a single-replica database. local-path tells k3s to provision a directory on the node’s disk automatically.
Apply the PVC
```
kubectl apply -f mariadb-pvc.yaml
kubectl get pvc
```
The STATUS column will show Pending initially. The local-path StorageClass uses late binding (volumeBindingMode: WaitForFirstConsumer): the PV is not created until a Pod that uses the PVC is scheduled. The claim will become Bound once the updated Deployment starts.
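If you want to confirm the late-binding behavior yourself, you can read the StorageClass's volumeBindingMode field. This is a minimal sketch, assuming the StorageClass is named local-path as on a default k3s install; the helper name binding_mode is made up for this example.

```shell
# Hypothetical helper: print the volumeBindingMode of a StorageClass.
# Assumes kubectl is already configured for the k3s cluster.
binding_mode() {
  # $1 = StorageClass name (defaults to local-path)
  kubectl get storageclass "${1:-local-path}" \
    -o jsonpath='{.volumeBindingMode}{"\n"}'
}
```

After pasting the function into your shell, run binding_mode. On a default k3s install it should print WaitForFirstConsumer, which is why the PVC stays Pending until a Pod uses it.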

Update the MariaDB Deployment to use the PVC

Edit mariadb-deployment.yaml and add a volumeMounts entry to the container and a volumes entry to the Pod spec:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mariadb
  labels:
    app: mariadb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mariadb
  template:
    metadata:
      labels:
        app: mariadb
    spec:
      containers:
        - name: mariadb
          image: mariadb:11
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: db-root-password
            - name: MYSQL_DATABASE
              value: wordpress
            - name: MYSQL_USER
              value: wordpress
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: db-password
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: mariadb-data
              mountPath: /var/lib/mysql
      volumes:
        - name: mariadb-data
          persistentVolumeClaim:
            claimName: mariadb-pvc
```

MariaDB stores all its data in /var/lib/mysql. Mounting the PVC there means all writes go to the EC2 instance’s disk rather than the container’s ephemeral layer.
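Once the updated Deployment is running, one way to see that /var/lib/mysql is a separate mount is to ask df inside the container. This is a sketch under two assumptions: the mariadb image ships df (it is Ubuntu-based, so it should), and the helper name mariadb_data_mount is invented for illustration.

```shell
# Hypothetical helper: show which filesystem backs the MariaDB data directory.
mariadb_data_mount() {
  # kubectl exec accepts deploy/<name> and picks one Pod of that Deployment
  kubectl exec deploy/mariadb -- df -h /var/lib/mysql
}
```

Call mariadb_data_mount after the rollout finishes. The "Mounted on" column should list /var/lib/mysql as its own mount point rather than the container's root filesystem.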

Apply the updated Deployment
```
kubectl apply -f mariadb-deployment.yaml
kubectl rollout status deployment/mariadb
```
kubectl rollout status works for any Deployment, including a single-replica one like MariaDB. Here it means “wait until the new MariaDB Pod is available.” This differs from the behavior we observed in the previous lab with two WordPress replicas, where old and new Pods overlapped to keep traffic flowing during the update.

Because the PVC starts empty, MariaDB initializes a fresh database. Any WordPress data from Lab 7 is gone. After this step, open http://<your-ec2-public-ip> and run through the WordPress setup wizard again to create the site and admin account. This is the last time you will need to do this: from here on, your data persists across pod restarts and EC2 instance stops.
Verify the PVC is bound
```
kubectl get pvc
```
STATUS should now show Bound. Record the STATUS and STORAGECLASS values for your lab questions.
Verify data persists across pod restarts

Log in to the WordPress admin and set a distinctive site title (e.g., “Gerald’s Persistent Restaurant”). Then delete the MariaDB Pod:
```
kubectl delete pod -l app=mariadb
```
Watch Kubernetes replace it:
```
kubectl get pods -w
```
Once the new Pod is Running, reload WordPress in your browser. Your site title should still be there. The replacement Pod mounted the same PVC, which holds the data on the EC2 disk.

Adding Probes and Resource Controls

At this point your stack is running, but Kubernetes still needs explicit rules for traffic safety and resource fairness. The two probes serve different purposes: liveness answers “is the nginx process itself alive?” and triggers a restart if it is not, while readiness answers “can this nginx Pod successfully serve a real request right now?” and temporarily removes only that Pod from nginx-service endpoints when it cannot. Resource settings solve a different problem: requests reserve minimum CPU/memory so the scheduler can place Pods reliably, and limits cap maximum usage so one container cannot starve the node. Together, probes and resource controls improve availability during failures, updates, and load spikes.

Update the nginx ConfigMap to add a health endpoint

Edit nginx-configmap.yaml to add a /health location that nginx handles directly (not proxied to WordPress). This lets the liveness probe check whether nginx itself is alive, independently of whether WordPress is reachable:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  default.conf: |
    server {
        listen 80;

        # Answered by nginx itself, never proxied to WordPress
        location /health {
            access_log off;
            # default_type sets the Content-Type of the returned body;
            # add_header would emit a second Content-Type header
            default_type text/plain;
            return 200 "ok\n";
        }

        location / {
            proxy_pass http://wordpress-service;
            proxy_set_header Host $http_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
```

Apply the updated ConfigMap:

```
kubectl apply -f nginx-configmap.yaml
```

Update the nginx Deployment to add probes and resource constraints

Edit nginx-deployment.yaml:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  minReadySeconds: 15
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 5
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
          volumeMounts:
            - name: nginx-config-volume
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nginx-config-volume
          configMap:
            name: nginx-config
```
Let’s walk through the new sections:
- replicas: 4: We intentionally run four nginx Pods in this lab so rollouts are easier to observe. With more replicas, old and new Pods overlap longer during updates, so pause/resume behavior is visible instead of finishing too quickly.
- minReadySeconds: 15: A new Pod must stay Ready for 15 seconds before Kubernetes marks it Available. This slows the rollout enough for you to inspect intermediate states.
- strategy.rollingUpdate: maxUnavailable: 1 and maxSurge: 1 create a controlled rollout. Kubernetes can bring up one extra Pod while taking at most one old Pod out of service at a time.
- livenessProbe: Every 10 seconds (after a 5-second initial delay), Kubernetes sends an HTTP GET to /health on port 80. nginx handles this location directly (returning 200 without touching WordPress), so the probe checks whether the nginx process itself is alive, not whether the upstream is reachable. If it fails 3 times in a row, Kubernetes restarts the container.
- readinessProbe: Every 5 seconds (after a 3-second delay), Kubernetes checks /, a request that nginx proxies to WordPress. If WordPress is unreachable and nginx returns 502, the readiness probe fails and that specific nginx Pod is removed from nginx-service endpoints (stops receiving traffic). It does not remove or restart WordPress Pods. Readiness failure alone does not restart nginx; the Pod is added back automatically when readiness checks succeed again.
- When does a Pod restart?: A container restarts when its process exits (for example, after an OOM kill) or when the liveness probe fails repeatedly. A Deployment rollout does not restart containers in place; it replaces Pods with new ones. A readiness failure by itself only changes traffic routing.
- resources.requests: Each nginx container asks for at least 64 mebibytes (64Mi) of memory and 50 millicores of CPU (50m = 5% of one CPU core).
- resources.limits: Each container may use at most 128Mi of memory and 100 millicores of CPU. If it exceeds the memory limit, the kernel’s OOM killer terminates it; if it hits the CPU limit, it is throttled rather than killed.
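The probe settings translate into rough reaction times. The arithmetic below is a back-of-the-envelope sketch, assuming the Kubernetes default failureThreshold of 3 (the manifest does not set it) and ignoring probe timeouts, so treat the results as approximate upper bounds.

```shell
# Approximate time a container must stay unhealthy before Kubernetes acts.
failure_threshold=3   # Kubernetes default; not set explicitly in the manifest
liveness_period=10    # periodSeconds of the livenessProbe
readiness_period=5    # periodSeconds of the readinessProbe

liveness_window=$((failure_threshold * liveness_period))
readiness_window=$((failure_threshold * readiness_period))

echo "liveness: container restarted after about ${liveness_window}s of failures"
echo "readiness: Pod removed from endpoints after about ${readiness_window}s of failures"
```

With these values, a hung nginx is restarted after roughly 30 seconds, and a Pod that cannot reach WordPress stops receiving traffic after roughly 15 seconds.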
Apply the updated Deployment
```
kubectl apply -f nginx-deployment.yaml
```
Verify Pods are Running and Ready
```
kubectl get pods
```
The READY column should show 1/1 for all four nginx Pods, meaning the readiness probe (which proxies through to WordPress) is passing.
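With four replicas, the per-container requests and limits add up on the single node. The arithmetic below is only an illustration of what the nginx Deployment reserves in total; it uses the values from the manifest and does not query the cluster.

```shell
# Totals reserved by the nginx Deployment (values copied from the manifest).
replicas=4
cpu_request_m=50      # millicores per container
cpu_limit_m=100
mem_request_mi=64     # MiB per container
mem_limit_mi=128

total_cpu_request=$((replicas * cpu_request_m))
total_cpu_limit=$((replicas * cpu_limit_m))
total_mem_request=$((replicas * mem_request_mi))
total_mem_limit=$((replicas * mem_limit_mi))
request_bytes=$((mem_request_mi * 1024 * 1024))   # one Mi is 1024 * 1024 bytes

echo "CPU requested: ${total_cpu_request}m, CPU limit: ${total_cpu_limit}m"
echo "Memory requested: ${total_mem_request}Mi, memory limit: ${total_mem_limit}Mi"
echo "64Mi in bytes: ${request_bytes}"
```

The scheduler only considers the request totals (200m CPU and 256Mi memory here) when deciding whether the Pods fit on the node.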

Performing a Rolling Update

A rolling update replaces Pods gradually: it brings up new Pods with the updated image while retiring old ones a few at a time, so your service never goes fully offline. In this section, you will go beyond a basic image swap and practice real operator controls: pause first, stage a change safely, annotate change-cause, then resume and observe.
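The rollout settings on the nginx Deployment (replicas: 4, maxSurge: 1, maxUnavailable: 1) bound how many Pods can exist and how many must stay available at any moment. A small worked calculation, using only those three values:

```shell
# Bounds enforced during a rolling update of the nginx Deployment.
replicas=4
max_surge=1
max_unavailable=1

max_pods=$((replicas + max_surge))               # most Pods that may exist at once
min_available=$((replicas - max_unavailable))    # fewest Pods that must stay available

echo "at most ${max_pods} Pods exist during the rollout"
echo "at least ${min_available} Pods stay available during the rollout"
```

So while you watch the rollout you should never see more than five nginx Pods in total, and at least three should be serving traffic throughout.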

Record the current revision
```
kubectl rollout history deployment/nginx
```
You should see at least one revision.
Pause the Deployment before making changes
```
kubectl rollout pause deployment/nginx
```
Pausing lets you stage configuration/image changes without immediately starting a rollout.
Update the image while paused

Update directly from the command line:
```
kubectl set image deployment/nginx nginx=nginx:1.27-alpine
```
This changes the nginx container to use the Alpine-based variant (a smaller image), but no new Pods are created yet because the Deployment is paused.
Add a change-cause annotation (common operator practice)
```
kubectl annotate deployment nginx kubernetes.io/change-cause="Upgrade nginx image to 1.27-alpine" --overwrite
```
This makes rollout history easier to interpret later.
Verify the Deployment is paused
```
kubectl get deployment nginx -o jsonpath='{.spec.paused}{"\n"}'
kubectl get pods
```
The first command should print true. You should still see only old-image Pods running.
Resume and watch the rollout begin
```
kubectl rollout resume deployment/nginx
kubectl get pods -w
```
You will see new Pods being created (with the new image) and old Pods being terminated. Because this lab uses 4 replicas with minReadySeconds, you have enough time to capture a moment where both old and new Pods are present. Press Ctrl+C to stop watching.
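The default kubectl get pods output does not show which image each Pod runs, so old and new Pods are hard to tell apart. One possible way to see the mix is a custom-columns view. The helper name pod_images is hypothetical; the label app=nginx comes from the Deployment manifest.

```shell
# Hypothetical helper: list each Pod with the image of its first container.
pod_images() {
  # $1 = value of the app label (defaults to nginx)
  kubectl get pods -l "app=${1:-nginx}" \
    -o 'custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PHASE:.status.phase'
}
```

Run pod_images a few times in a second terminal while the rollout is in progress. You should see some rows with nginx:1.27 and others with nginx:1.27-alpine at the same moment.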
Wait for rollout completion
```
kubectl rollout status deployment/nginx
```
This command blocks until the rollout is complete and reports success or failure.
Verify the new image
```
kubectl describe deployment nginx | grep Image
```
It should show nginx:1.27-alpine.
Check the revision history
```
kubectl rollout history deployment/nginx
```
You should now see multiple revisions and your custom change-cause text. Record this output.
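The history table lists only revision numbers and change-cause text. To see which image a given revision used, kubectl rollout history accepts a --revision flag that prints that revision's Pod template. The wrapper name revision_detail below is an invented convenience, not part of kubectl.

```shell
# Hypothetical helper: print the Pod template recorded for one revision.
revision_detail() {
  # $1 = revision number taken from "kubectl rollout history deployment/nginx"
  kubectl rollout history deployment/nginx --revision="$1"
}
```

For example, revision_detail 2 prints the template for revision 2, including its Image line, assuming a revision with that number appears in your history.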

Rolling Back

Roll back to the previous revision
```
kubectl rollout undo deployment/nginx
```
Kubernetes creates new Pods with the previous image and terminates the current ones.
Verify the rollback
```
kubectl describe deployment nginx | grep Image
```
The image should be back to nginx:1.27.
Roll back to a specific revision

First inspect revision numbers:
```
kubectl rollout history deployment/nginx
```
Then roll back explicitly to a chosen revision number from the history output:
```
kubectl rollout undo deployment/nginx --to-revision=<revision-number>
kubectl rollout status deployment/nginx
kubectl describe deployment nginx | grep Image
```
Record the image tag and ready Pod count for your lab questions.

Failure Drill 1: Bad Image

This drill simulates a deployment that references an image that does not exist, a common mistake when a tag is misspelled or a registry is unreachable.

Deploy a nonexistent image

```
kubectl set image deployment/nginx nginx=nginx:doesnotexist
```

Observe the failure
```
kubectl get pods
```
After a few seconds, you will see new Pods stuck in ErrImagePull or ImagePullBackOff status. Most of the old Pods keep running: with maxUnavailable: 1, the rolling update takes at most one old Pod out of service and will not terminate any more until new Pods become ready.

ImagePullBackOff means Kubernetes tried to pull the image, failed, and is now waiting before retrying with exponentially increasing delays. This prevents hammering the registry with requests for an image that does not exist.
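To get a feel for how quickly the retry delay grows, the loop below prints a doubling schedule capped at 300 seconds (5 minutes), the ceiling described in the Kubernetes documentation. The 10-second starting delay and the exact doubling are assumptions made for illustration; the real kubelet timing may differ.

```shell
# Illustrative backoff schedule: the delay doubles until it reaches a cap.
delay=10     # assumed initial delay in seconds
cap=300      # 5-minute ceiling
schedule=""

for attempt in 1 2 3 4 5 6 7; do
  schedule="${schedule}${delay} "
  delay=$((delay * 2))
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
done

echo "retry delays in seconds: ${schedule}"
```

Under these assumptions the delays are 10, 20, 40, 80, 160, 300, 300 seconds, which is why a Pod in ImagePullBackOff can sit for several minutes between attempts.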
Diagnose the failure
```
kubectl describe pod <failing-pod-name>
```
Scroll to the Events section at the bottom. You will see the exact error message, something like “manifest for nginx:doesnotexist not found.” Record this for your lab questions.
Recover via rollback
```
kubectl rollout undo deployment/nginx
```
The failing Pods are replaced with Pods running the known-good image. Verify:
```
kubectl get pods
```
All Pods should return to Running and Ready state.

Failure Drill 2: OOM Kill

This drill demonstrates what happens when a container exceeds its memory limit. The kernel’s Out of Memory (OOM) killer terminates the process. With an extremely low limit like 1Mi, you may see one of two valid outcomes depending on runtime timing:

The container starts, gets killed, and the Pod shows OOMKilled and then CrashLoopBackOff.
The container cannot initialize at all, and Events show FailedCreatePodSandBox with container init was OOM-killed while the Pod remains Pending/ContainerCreating.

Both outcomes represent the same root cause: the memory limit is too low.

Set an unreasonably low memory limit

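Edit nginx-deployment.yaml and change both the memory request and the memory limit to 1Mi (1 mebibyte, far too little for nginx to start). The request must be lowered too, because a request cannot exceed its limit: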
```
resources:
  requests:
    memory: "1Mi"
    cpu: "50m"
  limits:
    memory: "1Mi"
    cpu: "100m"
```
Apply the change
```
kubectl apply -f nginx-deployment.yaml
```
Observe the failure pattern
```
kubectl get pods -w
```
With 1Mi, you may see either OOMKilled/CrashLoopBackOff in Pod status, or the Pod may stay ContainerCreating/Pending while sandbox creation fails. In either case, Kubernetes keeps retrying because the memory limit is too low. Press Ctrl+C after you observe the pattern.
Diagnose the failure (authoritative step)
```
kubectl describe pod <oomkilled-pod-name>
```
Look for one of these:
- Last State: Reason OOMKilled, Exit Code 137 (128 + signal 9, SIGKILL from the OOM killer)
- Events: FailedCreatePodSandBox with container init was OOM-killed (memory limit too low?)
Record this output for your lab questions.
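The exit code can be decoded without a cluster. Exit codes above 128 conventionally mean "terminated by signal (code minus 128)", and the shell's kill -l can translate the signal number into a name. A short worked example:

```shell
# Decode a container exit code into the signal that terminated the process.
exit_code=137
signal=$((exit_code - 128))        # codes above 128 encode a signal number
signal_name=$(kill -l "$signal")   # translate the number into a signal name

echo "exit code ${exit_code} = 128 + signal ${signal} (SIG${signal_name})"
```

Signal 9 is SIGKILL, which cannot be caught or handled, so nginx gets no chance to log anything before it dies.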
Check logs when available
```
kubectl logs <oomkilled-pod-name> --previous
```
The --previous flag shows logs from the last terminated container instance. If the failure happened during container init (sandbox creation), this command may return little or no output because the main process never started.

Fix the limits and recover

Edit nginx-deployment.yaml and restore the memory request to 64Mi and the memory limit to 128Mi:

```
resources:
  requests:
    memory: "64Mi"
    cpu: "50m"
  limits:
    memory: "128Mi"
    cpu: "100m"
```

Apply:

```
kubectl apply -f nginx-deployment.yaml
```

Verify recovery
```
kubectl get pods
```
All Pods should be Running and Ready again. Record the current image tag from:
```
kubectl describe deployment nginx | grep Image
```
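If you prefer a single number over scanning the Pod list, the Deployment's status reports how many replicas are ready. The sketch below reads it with jsonpath; the helper name ready_count is hypothetical.

```shell
# Hypothetical helper: print "<ready replicas>/<desired replicas>" for a Deployment.
ready_count() {
  # $1 = Deployment name (defaults to nginx)
  kubectl get deployment "${1:-nginx}" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}{"\n"}'
}
```

After recovery, ready_count should print 4/4 for nginx. While Pods are still failing, the first number will be lower or missing, because readyReplicas is omitted from the status when no replicas are ready.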

Clean Up

This cluster is used in Lab 9 (Observability). Do not delete these resources yet.

To pause your work, end your AWS Academy Learner Lab session. Your EC2 instance, k3s, and all running workloads persist between sessions.

If you have completed both labs and are permanently done, tear everything down:

```
kubectl delete \
  -f wordpress-ingress.yaml \
  -f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml \
  -f wordpress-deployment.yaml -f wordpress-service.yaml \
  -f mariadb-deployment.yaml -f mariadb-service.yaml -f db-secret.yaml \
  -f mariadb-pvc.yaml
```

You have now added operational controls to your Kubernetes deployment: persistent storage that preserves MariaDB data across pod restarts and EC2 stops, liveness probes that detect when nginx itself is stuck, readiness probes that remove an nginx Pod from rotation when WordPress is unreachable, and resource limits that contain runaway containers. You have also practiced the full failure-recovery loop: introduce a fault, diagnose it with describe and logs, and recover with a rollback or configuration fix. In the next lab, you will add observability: monitoring dashboards, alerts, and incident detection across the full WordPress stack.