Kubernetes Operations: Probes, Rollouts, and Failure Drills
The lunch rush crashed the website at all three locations simultaneously. Gerald called it “a digital fire.” He wants “those health check things” and a plan for when things break. You asked who would be on call. Gerald said, “You’re on call.” You asked when. He said, “Always.”
Your nginx deployment works, but Kubernetes does not yet know whether your application is actually healthy; it only knows whether the container process is running. A container can be running but completely broken (returning errors, stuck in a loop, out of memory). In this lab, you will add health probes that let Kubernetes make smarter decisions, set resource limits so one container cannot starve the node, practice rolling updates and rollbacks, and intentionally break things to learn how to diagnose and recover from failures.
Before You Start
You need:
- The k3s cluster from Lab 7 with the nginx Deployment and Service still running
- SSH access to your EC2 instance
If you no longer have the Lab 7 setup, re-apply the manifests from that lab before starting.
Key Concepts
Liveness Probe: Tells Kubernetes whether the container is alive. If the liveness probe fails, Kubernetes kills the container and restarts it. Use this to recover from deadlocks or stuck processes.
Readiness Probe: Tells Kubernetes whether the container is ready to receive traffic. If the readiness probe fails, Kubernetes removes the Pod from Service endpoints (stops sending it requests) but does not kill it. Use this during startup or when a container is temporarily overwhelmed.
Resource Requests: The minimum CPU and memory a container needs. The Kubernetes scheduler uses requests to decide which node can fit the Pod.
Resource Limits: The maximum CPU and memory a container can use. If a container exceeds its memory limit, it is killed with an Out of Memory (OOM) error.
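A note on the units (not from the lab itself, but easy to verify yourself): the `Mi` suffix means mebibytes (powers of 1024), while `M` means megabytes (powers of 1000). Kubernetes accepts both for memory quantities, and the difference is small but real:

```shell
# "Mi" is a power-of-1024 unit, "M" a power-of-1000 unit.
echo "64Mi = $((64 * 1024 * 1024)) bytes"   # 67108864
echo "64M  = $((64 * 1000 * 1000)) bytes"   # 64000000
```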
Questions
Watch for the answers to these questions as you follow the tutorial.
- How many revisions does `kubectl rollout history deployment/nginx` show after the rolling update? What image tag was used in the most recent revision? (4 points)
- During the rolling update, did you observe both old and new Pods running at the same time? How does Kubernetes ensure zero downtime during a rollout? (5 points)
- When you deployed the nonexistent image (`nginx:doesnotexist`), what error status did the Pod show? Write down the exact error message from the Events section of `kubectl describe pod`. (5 points)
- When you set the memory limit to 1Mi, what status did the Pod show? What was the exit code, and what does it mean? (Hint: 128 + signal number.) (5 points)
- After your final rollback, what image tag are the healthy Pods running? How many Pods show `1/1` Ready? (3 points)
- Get your TA's initials showing all Pods in a Running/Ready state after recovery from the OOMKill drill. (3 points)
Tutorial
Adding Probes and Resource Controls

1. Update the Deployment manifest

   SSH into your EC2 instance and navigate to your lab directory:

   ```shell
   cd ~/k8s-lab
   ```

   Edit `nginx-deployment.yaml` to add probes and resource constraints:

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
     labels:
       app: nginx
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: nginx
     template:
       metadata:
         labels:
           app: nginx
       spec:
         containers:
           - name: nginx
             image: nginx:1.27
             ports:
               - containerPort: 80
             livenessProbe:
               httpGet:
                 path: /
                 port: 80
               initialDelaySeconds: 5
               periodSeconds: 10
             readinessProbe:
               httpGet:
                 path: /
                 port: 80
               initialDelaySeconds: 3
               periodSeconds: 5
             resources:
               requests:
                 memory: "64Mi"
                 cpu: "50m"
               limits:
                 memory: "128Mi"
                 cpu: "100m"
             volumeMounts:
               - name: html-volume
                 mountPath: /usr/share/nginx/html
         volumes:
           - name: html-volume
             configMap:
               name: nginx-html
   ```

   Let's walk through the new sections:

   - `livenessProbe`: Every 10 seconds (after a 5-second initial delay), Kubernetes sends an HTTP GET to `/` on port 80. If it fails 3 times in a row, Kubernetes restarts the container.
   - `readinessProbe`: Every 5 seconds (after a 3-second delay), Kubernetes checks whether the container is ready for traffic. During rolling updates, new Pods must pass their readiness probe before old Pods are terminated.
   - `resources.requests`: The Pod needs at least 64 mebibytes (`64Mi`) of memory and 50 millicores (`50m` = 5% of one CPU core).
   - `resources.limits`: The Pod cannot use more than `128Mi` of memory or 100 millicores. If it exceeds the memory limit, the kernel's OOM killer terminates it.

2. Apply the updated Deployment

   ```shell
   kubectl apply -f nginx-deployment.yaml
   ```

3. Verify Pods are Running and Ready

   ```shell
   kubectl get pods
   ```

   The READY column should show `1/1` for both Pods, meaning the readiness probe is passing.
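The probes in this lab rely on Kubernetes defaults for several other tunables. As a sketch for reference (the values shown are the documented defaults, not settings this lab's manifest changes), each probe also accepts:

```yaml
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 1      # how long to wait for a response (default 1)
  failureThreshold: 3    # consecutive failures before action is taken (default 3)
  successThreshold: 1    # consecutive successes to be considered healthy (must be 1 for liveness)
```

The "3 times in a row" behavior described above comes from the default `failureThreshold`.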
Performing a Rolling Update
A rolling update replaces Pods gradually; it creates new Pods with the updated image before terminating old ones, ensuring your service never goes fully offline.
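The zero-downtime behavior comes from the Deployment's update strategy. This lab's manifest does not set one, so Kubernetes uses `RollingUpdate` with the defaults sketched below (shown only for illustration; you do not need to add this):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # extra Pods allowed above the desired replica count
      maxUnavailable: 25%    # Pods that may be unavailable during the update
```

With 2 replicas, these percentages round so that Kubernetes brings up a new Pod before taking an old one down, which is exactly what you will observe in the next steps.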
1. Record the current revision

   ```shell
   kubectl rollout history deployment/nginx
   ```

   You should see at least one revision.

2. Update the image to a newer version

   You can update the image directly from the command line:

   ```shell
   kubectl set image deployment/nginx nginx=nginx:1.27-alpine
   ```

   This changes the nginx container to use the Alpine-based variant (a smaller image). Kubernetes immediately starts a rolling update.

3. Watch the rollout

   In one terminal, watch the Pods:

   ```shell
   kubectl get pods -w
   ```

   You will see new Pods being created (with the new image) and old Pods being terminated. At some point, both old and new Pods will be running simultaneously; capture a `kubectl get pods` snapshot at this moment for your lab questions. Press Ctrl+C to stop watching.

4. Check the rollout status

   ```shell
   kubectl rollout status deployment/nginx
   ```

   This command blocks until the rollout is complete and reports success or failure.

5. Verify the new image

   ```shell
   kubectl describe deployment nginx | grep Image
   ```

   It should show `nginx:1.27-alpine`.

6. Check the revision history

   ```shell
   kubectl rollout history deployment/nginx
   ```

   You should now see multiple revisions. Record this output.
Rolling Back
1. Roll back to the previous revision

   ```shell
   kubectl rollout undo deployment/nginx
   ```

   Kubernetes creates new Pods with the previous image and terminates the current ones.

2. Verify the rollback

   ```shell
   kubectl describe deployment nginx | grep Image
   ```

   The image should be back to `nginx:1.27`.
Failure Drill 1: Bad Image
This drill simulates a deployment that references an image that does not exist, a common mistake when a tag is misspelled or a registry is unreachable.
1. Deploy a nonexistent image

   ```shell
   kubectl set image deployment/nginx nginx=nginx:doesnotexist
   ```

2. Observe the failure

   ```shell
   kubectl get pods
   ```

   After a few seconds, you will see new Pods stuck in `ErrImagePull` or `ImagePullBackOff` status. The old Pods remain running because Kubernetes' rolling update strategy will not terminate old Pods until new Pods are ready.

3. Diagnose the failure

   ```shell
   kubectl describe pod <failing-pod-name>
   ```

   Scroll to the Events section at the bottom. You will see the exact error message, something like "manifest for nginx:doesnotexist not found." Record this for your lab questions.

4. Recover via rollback

   ```shell
   kubectl rollout undo deployment/nginx
   ```

   The failing Pods are replaced with Pods running the known-good image. Verify:

   ```shell
   kubectl get pods
   ```

   All Pods should return to a Running and Ready state.
Failure Drill 2: OOM Kill
This drill demonstrates what happens when a container exceeds its memory limit. The kernel's Out of Memory (OOM) killer terminates the process, and Kubernetes reports it as `OOMKilled`.
1. Set an unreasonably low memory limit

   Edit `nginx-deployment.yaml` and change the memory limit to `1Mi` (1 mebibyte, far too little for nginx to start):

   ```yaml
   resources:
     requests:
       memory: "1Mi"
       cpu: "50m"
     limits:
       memory: "1Mi"
       cpu: "100m"
   ```

2. Apply the change

   ```shell
   kubectl apply -f nginx-deployment.yaml
   ```

3. Observe the OOM kills

   ```shell
   kubectl get pods -w
   ```

   You will see Pods entering `OOMKilled` status, then `CrashLoopBackOff` as Kubernetes repeatedly tries to start them and they keep exceeding the memory limit. Press Ctrl+C after you observe the pattern.

4. Diagnose the failure

   ```shell
   kubectl describe pod <oomkilled-pod-name>
   ```

   Look for the Last State section, which shows:

   - Reason: `OOMKilled`
   - Exit Code: `137` (128 + signal 9, which is SIGKILL from the OOM killer)

   Record this output for your lab questions.
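You can reproduce the 137 arithmetic locally, with no cluster involved. This sketch kills a throwaway shell with SIGKILL (signal 9), the same signal the OOM killer sends, and prints the resulting exit code:

```shell
# A process terminated by signal N exits with status 128 + N.
# SIGKILL is signal 9, so an OOM-killed process reports 128 + 9 = 137.
sh -c 'kill -KILL $$' || echo "exit code: $?"   # prints: exit code: 137
```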
5. Check the logs

   ```shell
   kubectl logs <oomkilled-pod-name> --previous
   ```

   The `--previous` flag shows logs from the last terminated container instance. There may be no logs if the process was killed before it could write any.

6. Fix the limits and recover

   Edit `nginx-deployment.yaml` and restore the memory limit to `128Mi`:

   ```yaml
   resources:
     requests:
       memory: "64Mi"
       cpu: "50m"
     limits:
       memory: "128Mi"
       cpu: "100m"
   ```

   Apply:

   ```shell
   kubectl apply -f nginx-deployment.yaml
   ```

7. Verify recovery

   ```shell
   kubectl get pods
   ```

   All Pods should be Running and Ready again. Record the current image tag from:

   ```shell
   kubectl describe deployment nginx | grep Image
   ```
Clean Up
If you plan to continue with Lab 9 (Observability), leave k3s and the nginx deployment running. Otherwise:

```shell
kubectl delete -f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml
```

You have now added operational controls to your Kubernetes deployment: probes that tell Kubernetes whether your application is healthy, and resource limits that prevent runaway containers. You have also practiced the full failure-recovery loop: introduce a fault, diagnose it with `describe` and `logs`, and recover with a rollback or configuration fix. In the next lab, you will add observability: monitoring dashboards, alerts, and incident detection.