Cluster Operations: Health Probes, Rollouts, and Failure Drills (Kubernetes)
The lunch rush crashed the website at all three locations simultaneously. Gerald called it “a digital fire.” He wants “those health check things” and a plan for when things break. You asked who would be on call. Gerald said, “You’re on call.” You asked when. He said, “Always.”
Your WordPress stack (nginx → WordPress → MariaDB) is running, but Kubernetes does not yet know whether your application is actually healthy; it only knows whether the container processes are running. A container can be running but completely broken (returning errors, stuck in a loop, out of memory). In this lab, you will add health probes that let Kubernetes make smarter decisions, set resource limits so one container cannot starve the node, practice rolling updates and rollbacks on the nginx reverse proxy, intentionally break things to learn how to diagnose and recover from failures, and correct a data durability gap from the previous lab by adding persistent storage to MariaDB.
Before You Start
You need:
- The k3s cluster from Lab 7 with all manifests still applied (nginx, WordPress, MariaDB, and their Services)
- SSH access to your EC2 instance
If you ended your AWS Academy session, restart it. The EC2 instance and cluster will still be there. If you deleted the resources from Lab 7, re-apply all the manifests from that lab before starting.
Key Concepts
Liveness Probe: Tells Kubernetes whether the container is alive. If the liveness probe fails, Kubernetes kills the container and restarts it. Use this to recover from deadlocks or stuck processes.
Readiness Probe: Tells Kubernetes whether the container is ready to receive traffic. If the readiness probe fails, Kubernetes removes the Pod from Service endpoints (stops sending it requests) but does not kill it. Use this during startup or when a container is temporarily overwhelmed.
Resource Requests: The minimum CPU and memory a container needs. The Kubernetes scheduler uses requests to decide which node can fit the Pod.
Resource Limits: The maximum CPU and memory a container can use. A container that exceeds its CPU limit is throttled; a container that exceeds its memory limit is killed by the kernel's Out of Memory (OOM) killer.
PersistentVolumeClaim (PVC): A request for storage from a workload. Kubernetes binds the claim to a PersistentVolume (PV), a piece of storage provisioned on the cluster, and mounts it into the container. Unlike the container’s ephemeral writable layer, data in a PVC survives pod deletion and node restarts. k3s ships with a local-path StorageClass that automatically provisions PVs backed by directories on the node’s local filesystem.
Questions
Watch for the answers to these questions as you follow the tutorial.
- Run kubectl get pvc. What is the STATUS and STORAGECLASS of mariadb-pvc? After deleting the MariaDB Pod and letting Kubernetes replace it, did your WordPress data survive? What does this demonstrate about PersistentVolumeClaims compared to the ephemeral storage used in Lab 7? (3 points)
- How many revisions does kubectl rollout history deployment/nginx show after the rolling update? What image tag was used in the most recent revision, and what CHANGE-CAUSE text did you record for that revision? (3 points)
- During the rolling update, what changed after you ran kubectl rollout pause deployment/nginx and then kubectl rollout resume deployment/nginx? Did you observe both old and new Pods running at the same time, and how does that support zero downtime? (5 points)
- When you deployed the nonexistent image (nginx:doesnotexist), what error status did the Pod show? Write down the exact error message from the Events section of kubectl describe pod. (4 points)
- When you set the memory limit to 1Mi, what failure signal did you observe? Depending on container runtime timing, you may see either Pod status OOMKilled/CrashLoopBackOff with exit code 137, or FailedCreatePodSandBox Events that include "container init was OOM-killed". Explain what either result means. (5 points)
- After rolling back to a specific revision (kubectl rollout undo --to-revision=<n>), what image tag are the healthy Pods running? How many Pods show 1/1 Ready? (2 points)
- Get your TA’s initials showing all Pods in a Running/Ready state after recovery from the OOMKill drill. (3 points)
Tutorial
Adding Persistent Storage for MariaDB
In Lab 7, MariaDB stored all its data inside the container’s own writable layer. That layer is ephemeral: it is discarded whenever the Pod is deleted, crashes, or the node restarts. Stopping and restarting the EC2 instance causes k3s to terminate and relaunch all Pods, wiping the database each time.
Kubernetes separates storage from compute using two objects:
- PersistentVolume (PV): A piece of storage provisioned on the cluster: a directory on the node’s filesystem, an EBS volume, an NFS share, and so on. A PV exists independently of any Pod.
- PersistentVolumeClaim (PVC): A request for storage from a workload. Kubernetes binds a PVC to a matching PV and mounts it into the container. Data in a PVC survives pod deletion and node restarts.
k3s ships with a local-path StorageClass that automatically provisions PVs backed by directories on the node’s local disk. This is not suitable for multi-node clusters (data is tied to one specific node), but it is exactly right for a single-node learning environment.
-
Write the PVC manifest
SSH into your EC2 instance, navigate to your lab directory, and open a new file for the PVC manifest:

    cd ~/k8s-lab
    vim mariadb-pvc.yaml

Add the following contents:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mariadb-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: local-path
      resources:
        requests:
          storage: 1Gi

ReadWriteOnce means only one node can mount this volume at a time, which is correct for a single-replica database. local-path tells k3s to provision a directory on the node’s disk automatically.
-
Apply the PVC
    kubectl apply -f mariadb-pvc.yaml
    kubectl get pvc

The STATUS column will show Pending initially. The local-path provisioner uses late binding: the PV is not created until a Pod actually claims the PVC. It will become Bound once the updated Deployment starts.
-
Update the MariaDB Deployment to use the PVC
Edit mariadb-deployment.yaml and add a volumeMounts entry to the container and a volumes entry to the Pod spec:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mariadb
      labels:
        app: mariadb
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mariadb
      template:
        metadata:
          labels:
            app: mariadb
        spec:
          containers:
            - name: mariadb
              image: mariadb:11
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: db-root-password
                - name: MYSQL_DATABASE
                  value: wordpress
                - name: MYSQL_USER
                  value: wordpress
                - name: MYSQL_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: db-password
              ports:
                - containerPort: 3306
              volumeMounts:
                - name: mariadb-data
                  mountPath: /var/lib/mysql
          volumes:
            - name: mariadb-data
              persistentVolumeClaim:
                claimName: mariadb-pvc

MariaDB stores all its data in /var/lib/mysql. Mounting the PVC there means all writes go to the EC2 instance’s disk rather than the container’s ephemeral layer.
-
Apply the updated Deployment
    kubectl apply -f mariadb-deployment.yaml
    kubectl rollout status deployment/mariadb

kubectl rollout status works for any Deployment, including a single-replica one like MariaDB. Here it means “wait until the new MariaDB Pod is available.” This differs from what we observed in the previous lab with two WordPress replicas, where old and new Pods overlapped to keep traffic flowing during the update.
-
Verify the PVC is bound
    kubectl get pvc

STATUS should now show Bound. Record the STATUS and STORAGECLASS values for your lab questions.
-
Verify data persists across pod restarts
Log in to the WordPress admin and set a distinctive site title (e.g., “Gerald’s Persistent Restaurant”). Then delete the MariaDB Pod:

    kubectl delete pod -l app=mariadb

Watch Kubernetes replace it:

    kubectl get pods -w

Once the new Pod is Running, reload WordPress in your browser. Your site title should still be there. The replacement Pod mounted the same PVC, which holds the data on the EC2 disk.
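The Pending-then-Bound check from the steps above can also be scripted. What follows is a minimal sketch rather than part of the lab: the helper names pvc_phase and wait_for_bound are hypothetical, and it assumes kubectl is already configured for your k3s cluster and that the claim is named mariadb-pvc as in the manifest above. It reads the claim's .status.phase field with a jsonpath query and polls until the phase is Bound.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the lab); they only read cluster state.

# Print the phase of a PVC, for example Pending or Bound.
pvc_phase() {
    kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# Poll until the PVC is Bound.
# Arguments: claim name, max attempts (default 30), delay in seconds (default 2).
wait_for_bound() {
    claim=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ "$(pvc_phase "$claim")" = "Bound" ]; then
            echo "$claim is Bound"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$claim is still not Bound after $attempts attempts" >&2
    return 1
}
```

After sourcing these functions, running wait_for_bound mariadb-pvc once the updated Deployment has been applied should return as soon as the provisioner has created and bound the volume.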
Adding Probes and Resource Controls
At this point your stack is running, but Kubernetes still needs explicit rules for traffic safety and resource fairness. The two probes serve different purposes: liveness answers “is the nginx process itself alive?” and triggers a restart if it is not, while readiness answers “can this nginx Pod successfully serve a real request right now?” and temporarily removes only that Pod from nginx-service endpoints when it cannot. Resource settings solve a different problem: requests reserve minimum CPU/memory so the scheduler can place Pods reliably, and limits cap maximum usage so one container cannot starve the node. Together, probes and resource controls improve availability during failures, updates, and load spikes.
-
Update the nginx ConfigMap to add a health endpoint
Edit nginx-configmap.yaml to add a /health location that nginx handles directly (not proxied to WordPress). This lets the liveness probe check whether nginx itself is alive, independently of whether WordPress is reachable:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nginx-config
    data:
      default.conf: |
        server {
            listen 80;

            location /health {
                access_log off;
                return 200 "ok\n";
                add_header Content-Type text/plain;
            }

            location / {
                proxy_pass http://wordpress-service;
                proxy_set_header Host $http_host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
            }
        }

Apply the updated ConfigMap:

    kubectl apply -f nginx-configmap.yaml
-
Update the nginx Deployment to add probes and resource constraints
Edit nginx-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      replicas: 4
      minReadySeconds: 15
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
          maxSurge: 1
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - name: nginx
              image: nginx:1.27
              ports:
                - containerPort: 80
              livenessProbe:
                httpGet:
                  path: /health
                  port: 80
                initialDelaySeconds: 5
                periodSeconds: 10
              readinessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 3
                periodSeconds: 5
              resources:
                requests:
                  memory: "64Mi"
                  cpu: "50m"
                limits:
                  memory: "128Mi"
                  cpu: "100m"
              volumeMounts:
                - name: nginx-config-volume
                  mountPath: /etc/nginx/conf.d
          volumes:
            - name: nginx-config-volume
              configMap:
                name: nginx-config

Let’s walk through the new sections:
- replicas: 4: We intentionally run four nginx Pods in this lab so rollouts are easier to observe. With more replicas, old and new Pods overlap longer during updates, so pause/resume behavior is visible instead of finishing too quickly.
- minReadySeconds: 15: A new Pod must stay Ready for 15 seconds before Kubernetes marks it Available. This slows the rollout enough for you to inspect intermediate states.
- strategy.rollingUpdate: maxUnavailable: 1 and maxSurge: 1 create a controlled rollout. Kubernetes can bring up one extra Pod while taking at most one old Pod out of service at a time.
- livenessProbe: Every 10 seconds (after a 5-second initial delay), Kubernetes sends an HTTP GET to /health on port 80. nginx handles this location directly (returning 200 without touching WordPress), so the probe checks whether the nginx process itself is alive, not whether the upstream is reachable. If it fails 3 times in a row (the default failure threshold), Kubernetes restarts the container.
- readinessProbe: Every 5 seconds (after a 3-second delay), Kubernetes checks /, a request that nginx proxies to WordPress. If WordPress is unreachable and nginx returns 502, the readiness probe fails and that specific nginx Pod is removed from nginx-service endpoints (stops receiving traffic). It does not remove or restart WordPress Pods. Readiness failure alone does not restart nginx; the Pod is added back automatically when readiness checks succeed again.
- When does a container restart?: A restart happens when the container process exits (for example, after an OOM kill) or when the liveness probe fails repeatedly. A rollout or update does not restart containers in place; it replaces the Pods with new ones. A readiness failure by itself only changes traffic routing.
- resources.requests: The container needs at least 64 mebibytes (64Mi) of memory and 50 millicores of CPU (50m = 5% of one CPU core).
- resources.limits: The container cannot use more than 128Mi of memory or 100 millicores of CPU. If it exceeds the memory limit, the kernel’s OOM killer terminates it.
-
Apply the updated Deployment
    kubectl apply -f nginx-deployment.yaml
-
Verify Pods are Running and Ready
    kubectl get pods

The READY column should show 1/1 for all four nginx Pods, meaning the readiness probe (which proxies through to WordPress) is passing.
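To make the probe and resource numbers in this section concrete, here is a small arithmetic sketch. It is an approximation under stated assumptions, not an exact model of the kubelet: the function names are hypothetical, the failure window is estimated as periodSeconds multiplied by failureThreshold (which defaults to 3 when the manifest does not set it), and probe timeouts are ignored.

```shell
#!/bin/sh
# Hypothetical helpers for reasoning about the probe and resource settings.

# Approximate seconds of continuous failure before Kubernetes acts on a probe:
# periodSeconds * failureThreshold (default threshold 3, timeouts ignored).
probe_failure_window() {
    period=$1
    threshold=${2:-3}
    echo $((period * threshold))
}

# Convert a memory quantity in Mi (mebibytes) to bytes.
mi_to_bytes() {
    echo $(($1 * 1024 * 1024))
}

# Convert millicores to a whole-number percentage of one CPU core.
millicores_to_percent() {
    echo $(($1 / 10))
}

echo "liveness:  about $(probe_failure_window 10)s of failures before a restart"
echo "readiness: about $(probe_failure_window 5)s of failures before removal from endpoints"
echo "memory limit 128Mi = $(mi_to_bytes 128) bytes"
echo "cpu request 50m = $(millicores_to_percent 50)% of one core"
```

With the values from nginx-deployment.yaml, this suggests a stuck nginx would be restarted after roughly half a minute, while an unreachable WordPress would take an nginx Pod out of rotation about twice as fast.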
Performing a Rolling Update
A rolling update replaces Pods gradually; it creates new Pods with the updated image before terminating old ones, ensuring your service never goes fully offline. In this section, you will go beyond a basic image swap and practice real operator controls: pause first, stage a change safely, annotate change-cause, then resume and observe.
-
Record the current revision
    kubectl rollout history deployment/nginx

You should see at least one revision.
-
Pause the Deployment before making changes
    kubectl rollout pause deployment/nginx

Pausing lets you stage configuration/image changes without immediately starting a rollout.
-
Update the image while paused
Update directly from the command line:
    kubectl set image deployment/nginx nginx=nginx:1.27-alpine

This changes the nginx container to use the Alpine-based variant (a smaller image), but no new Pods are created yet because the Deployment is paused.
-
Add a change-cause annotation (common operator practice)
    kubectl annotate deployment nginx kubernetes.io/change-cause="Upgrade nginx image to 1.27-alpine" --overwrite

This makes rollout history easier to interpret later.
-
Verify the Deployment is paused
    kubectl get deployment nginx -o jsonpath='{.spec.paused}{"\n"}'
    kubectl get pods

The first command should print true. You should still see only old-image Pods running.
-
Resume and watch the rollout begin
    kubectl rollout resume deployment/nginx
    kubectl get pods -w

You will see new Pods being created (with the new image) and old Pods being terminated. Because this lab uses 4 replicas with minReadySeconds: 15, you have enough time to capture a moment where both old and new Pods are present. Press Ctrl+C to stop watching.
-
Wait for rollout completion
    kubectl rollout status deployment/nginx

This command blocks until the rollout is complete and reports success or failure.
-
Verify the new image
    kubectl describe deployment nginx | grep Image

It should show nginx:1.27-alpine.
-
Check the revision history
    kubectl rollout history deployment/nginx

You should now see multiple revisions and your custom change-cause text. Record this output.
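The mix of old and new Pods you saw while watching the rollout follows from simple arithmetic on the strategy settings. The sketch below is illustrative only: rollout_bounds is a hypothetical helper, and it assumes maxSurge and maxUnavailable are absolute numbers, as in this lab, rather than percentages.

```shell
#!/bin/sh
# Hypothetical helper: compute the Pod-count bounds a RollingUpdate respects.
# Assumes maxSurge and maxUnavailable are absolute numbers, not percentages.
rollout_bounds() {
    replicas=$1
    max_surge=$2
    max_unavailable=$3
    min_available=$((replicas - max_unavailable))
    max_total=$((replicas + max_surge))
    echo "min_available=$min_available max_total=$max_total"
}

# Values from nginx-deployment.yaml in this lab.
rollout_bounds 4 1 1
```

For this Deployment the result is a floor of three available Pods and a ceiling of five Pods in total, which is why the service keeps answering requests while individual Pods are swapped out.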
Rolling Back
-
Roll back to the previous revision
    kubectl rollout undo deployment/nginx

Kubernetes creates new Pods with the previous image and terminates the current ones.
-
Verify the rollback
    kubectl describe deployment nginx | grep Image

The image should be back to nginx:1.27.
-
Roll back to a specific revision
First inspect revision numbers:
    kubectl rollout history deployment/nginx

Then roll back explicitly to a chosen revision number from the history output:

    kubectl rollout undo deployment/nginx --to-revision=<revision-number>
    kubectl rollout status deployment/nginx
    kubectl describe deployment nginx | grep Image

Record the image tag and ready Pod count for your lab questions.
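Before rolling back to a numbered revision, it can help to confirm which image that revision actually recorded. The following sketch is an optional convenience, not a lab requirement: revision_image is a hypothetical helper that relies on kubectl rollout history with the --revision flag, which prints the Pod template for a single revision, and it assumes that output contains an "Image:" line for the nginx container.

```shell
#!/bin/sh
# Hypothetical helper: print the container image recorded in one revision of a
# Deployment, taken from the "Image:" line of the revision's Pod template.
# Arguments: deployment name, revision number.
revision_image() {
    deployment=$1
    revision=$2
    kubectl rollout history "deployment/$deployment" --revision="$revision" |
        awk '$1 == "Image:" { print $2 }'
}
```

For example, running revision_image nginx 1 before kubectl rollout undo deployment/nginx --to-revision=1 shows the image tag you are about to return to.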
Failure Drill 1: Bad Image
This drill simulates a deployment that references an image that does not exist, a common mistake when a tag is misspelled or a registry is unreachable.
-
Deploy a nonexistent image
    kubectl set image deployment/nginx nginx=nginx:doesnotexist
-
Observe the failure
    kubectl get pods

After a few seconds, you will see new Pods stuck in ErrImagePull or ImagePullBackOff status. Most of the old Pods remain running: with maxUnavailable: 1, the rolling update takes at most one old Pod out of service and will not terminate any more until new Pods become ready.
-
Diagnose the failure
    kubectl describe pod <failing-pod-name>

Scroll to the Events section at the bottom. You will see the exact error message, something like “manifest for nginx:doesnotexist not found.” Record this for your lab questions.
-
Recover via rollback
    kubectl rollout undo deployment/nginx

The failing Pods are replaced with Pods running the known-good image. Verify:

    kubectl get pods

All Pods should return to Running and Ready state.
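When several Pods are in flight, picking out the ones with image-pull problems by eye can be tedious. The sketch below is one possible way to filter them and is not part of the lab: all three function names are hypothetical, it assumes the app=nginx label used in this lab, and it reads each container's waiting reason from the Pod status with a jsonpath query.

```shell
#!/bin/sh
# Hypothetical helpers for spotting image-pull failures.

# List each nginx Pod with the reason its container is waiting, separated by a
# tab. The reason is empty when the container is running.
waiting_reasons() {
    kubectl get pods -l app=nginx -o \
        jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Succeed if the given reason is one of the two image-pull states.
is_image_pull_error() {
    case "$1" in
        ErrImagePull|ImagePullBackOff) return 0 ;;
        *) return 1 ;;
    esac
}

# Print only the names of Pods that are failing to pull their image.
pods_with_pull_errors() {
    waiting_reasons | while IFS="$(printf '\t')" read -r name reason; do
        if is_image_pull_error "$reason"; then
            echo "$name"
        fi
    done
}
```

Running pods_with_pull_errors during the drill should list the new Pods only, and it should print nothing once the rollback has completed.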
Failure Drill 2: OOM Kill
This drill demonstrates what happens when a container exceeds its memory limit. The kernel’s Out of Memory (OOM) killer terminates the process. With an extremely low limit like 1Mi, you may see one of two valid outcomes depending on runtime timing:
- The container starts, gets killed, and the Pod shows OOMKilled and then CrashLoopBackOff.
- The container cannot initialize at all, and Events show FailedCreatePodSandBox with "container init was OOM-killed" while the Pod remains Pending/ContainerCreating.
Both outcomes represent the same root cause: the memory limit is too low.
-
Set an unreasonably low memory limit
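Edit nginx-deployment.yaml and change the memory request and limit to 1Mi (1 mebibyte, far too little for nginx to start). The request is lowered along with the limit because a request cannot exceed its limit:

    resources:
      requests:
        memory: "1Mi"
        cpu: "50m"
      limits:
        memory: "1Mi"
        cpu: "100m"
-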
Apply the change
    kubectl apply -f nginx-deployment.yaml
-
Observe the failure pattern
    kubectl get pods -w

With 1Mi, you may see either OOMKilled/CrashLoopBackOff in Pod status, or the Pod may stay ContainerCreating/Pending while sandbox creation fails. In either case, Kubernetes keeps retrying because the memory limit is too low. Press Ctrl+C after you observe the pattern.
-
Diagnose the failure (authoritative step)
    kubectl describe pod <oomkilled-pod-name>

Look for one of these:
- Last State: Reason OOMKilled, Exit Code 137 (128 + signal 9, SIGKILL from the OOM killer)
- Events: FailedCreatePodSandBox with "container init was OOM-killed (memory limit too low?)"

Record this output for your lab questions.
-
Check logs when available
    kubectl logs <oomkilled-pod-name> --previous

The --previous flag shows logs from the last terminated container instance. If the failure happened during container init (sandbox creation), this command may return little or no output because the main process never started.
-
Fix the limits and recover
Edit nginx-deployment.yaml and restore the memory request to 64Mi and the memory limit to 128Mi:

    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"

Apply:

    kubectl apply -f nginx-deployment.yaml
-
Verify recovery
    kubectl get pods

All Pods should be Running and Ready again. Record the current image tag from:

    kubectl describe deployment nginx | grep Image
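The exit code 137 seen in this drill is worth being able to decode on sight. The sketch below shows the arithmetic under the usual convention that an exit code above 128 means the process was terminated by signal number (code minus 128). Both function names are hypothetical, and last_exit_code assumes the Pod has a single container with a previously terminated instance.

```shell
#!/bin/sh
# Hypothetical helpers for interpreting container exit codes.

# Read the exit code of a Pod's last terminated container instance.
# Argument: Pod name. Prints nothing if no instance has terminated yet.
last_exit_code() {
    kubectl get pod "$1" -o \
        jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Explain an exit code: above 128 means "killed by signal (code - 128)".
explain_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        signal=$((code - 128))
        if [ "$signal" -eq 9 ]; then
            echo "killed by signal 9 (SIGKILL)"
        else
            echo "killed by signal $signal"
        fi
    elif [ "$code" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error code $code"
    fi
}

explain_exit_code 137
```

SIGKILL on its own does not prove an OOM kill, so treat the decoded signal as a hint and confirm it against the OOMKilled reason shown by kubectl describe pod.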
Clean Up
This cluster is used in Lab 9 (Observability). Do not delete these resources yet.
To pause your work, end your AWS Academy Learner Lab session. Your EC2 instance, k3s, and all running workloads persist between sessions.
If you have completed both labs and are permanently done, tear everything down:
    kubectl delete \
      -f wordpress-ingress.yaml \
      -f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml \
      -f wordpress-deployment.yaml -f wordpress-service.yaml \
      -f mariadb-deployment.yaml -f mariadb-service.yaml -f db-secret.yaml \
      -f mariadb-pvc.yaml

You have now added operational controls to your Kubernetes deployment: persistent storage that preserves the MariaDB data across pod restarts and EC2 stops, liveness probes that detect when nginx itself is stuck, readiness probes that remove nginx from rotation when WordPress is unreachable, and resource limits that prevent runaway containers. You have also practiced the full failure-recovery loop: introduce a fault, diagnose it with describe and logs, and recover with a rollback or configuration fix. In the next lab, you will add observability: monitoring dashboards, alerts, and incident detection across the full WordPress stack.