Measure Error Budget
This activity puts into practice the concepts from the Reliability Engineering lecture. You will define a simple SLO for a personalized web service, probe it with k6, ramp the load past the steady state to find where the service breaks, and then repeat a one-minute test across four controlled experiments including a graceful-degradation fallback at the edge. By the end, you will have measured the saturation knee of your service, seen which failures spent error budget, observed how a fail-open fallback changes the user-visible result, and verified that the system returned to steady state.
What You Will Need
- Your cluster from the Minikube activity
- `kubectl` already working against that cluster
- `k6` installed before class using the official k6 install guide and verified with `k6 version`
- Four terminal windows or tabs: one for `kubectl port-forward`, one for `k6`, one for Kubernetes commands, and one for the short sampling loop in Experiment 3
Deploy the Test Service
Before you inject any failures, give yourself a small backend service and a tiny front-end NGINX layer. You will port-forward into the front layer, so the local tunnel stays up while the backend Pods fail behind it.

- In every terminal that will run `kubectl` or `k6`, set your ONID, local forward port, and target URL:

      export ONID=your-onid
      export FORWARD_PORT=8080   # or 8081, 8082, ... if 8080 is in use
      export TARGET_URL=http://127.0.0.1:${FORWARD_PORT}/

  Replace `your-onid` with your actual ONID, with no angle brackets. Any later command that uses `$ONID`, `$FORWARD_PORT`, or `$TARGET_URL` assumes you ran this first in that terminal.

- Create a working directory for this activity and an idempotent namespace:

      mkdir -p ~/cs312-reliability
      cd ~/cs312-reliability
      kubectl create namespace reliability-lab --dry-run=client -o yaml | kubectl apply -f -

  This keeps the files together and makes it safe to rerun the setup commands.
- Write the manifest for the backend service, the front-end edge layer, and the two Services:

      cat <<EOF > web.yaml
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: ${ONID}-site
      data:
        index.html: |
          <!doctype html>
          <html>
            <body>
              <h1>${ONID} reliability lab</h1>
              <p>This page is the steady-state target for k6.</p>
            </body>
          </html>
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ${ONID}-web
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: ${ONID}-web
        template:
          metadata:
            labels:
              app: ${ONID}-web
          spec:
            containers:
              - name: web
                image: nginx:alpine
                ports:
                  - containerPort: 80
                resources:
                  requests:
                    cpu: "20m"
                    memory: "32Mi"
                  limits:
                    cpu: "50m"
                    memory: "64Mi"
                readinessProbe:
                  httpGet:
                    path: /
                    port: 80
                  initialDelaySeconds: 2
                  periodSeconds: 2
                volumeMounts:
                  - name: site
                    mountPath: /usr/share/nginx/html/index.html
                    subPath: index.html
            volumes:
              - name: site
                configMap:
                  name: ${ONID}-site
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: ${ONID}-web
      spec:
        selector:
          app: ${ONID}-web
        ports:
          - port: 80
            targetPort: 80
      ---
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: ${ONID}-edge-config
      data:
        nginx.conf: |
          events {}
          http {
            server {
              listen 80;
              location / {
                proxy_connect_timeout 1s;
                proxy_read_timeout 1s;
                proxy_send_timeout 1s;
                proxy_next_upstream off;
                proxy_pass http://${ONID}-web;
                proxy_set_header Host \$host;
                proxy_set_header X-Real-IP \$remote_addr;
              }
            }
          }
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ${ONID}-edge
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: ${ONID}-edge
        template:
          metadata:
            labels:
              app: ${ONID}-edge
          spec:
            containers:
              - name: edge
                image: nginx:alpine
                ports:
                  - containerPort: 80
                readinessProbe:
                  httpGet:
                    path: /
                    port: 80
                  initialDelaySeconds: 2
                  periodSeconds: 2
                volumeMounts:
                  - name: edge-config
                    mountPath: /etc/nginx/nginx.conf
                    subPath: nginx.conf
            volumes:
              - name: edge-config
                configMap:
                  name: ${ONID}-edge-config
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: ${ONID}-edge
      spec:
        selector:
          app: ${ONID}-edge
        ports:
          - port: 80
            targetPort: 80
      EOF

  The backend Service still load balances across the three web Pods. Each web Pod has a 50 millicore CPU limit so the cluster has a known capacity ceiling instead of “as fast as your laptop can go”; this will matter when you stress-test it shortly. The edge Deployment is the stable front door you will port-forward into. Its one-second upstream timeouts turn a missing backend into quick bad requests instead of one-minute hangs.

  The shape of our deployment is:

      flowchart TB
        Client["client"] --> EdgeService
        subgraph Cluster[" "]
          EdgeService["edge Service"]
          Edge1["edge 1"]
          Edge2["edge 2"]
          WebService["web Service"]
          Web1["web 1"]
          Web2["web 2"]
          Web3["web 3"]
          EdgeService --> Edge1
          EdgeService --> Edge2
          Edge1 --> WebService
          Edge2 --> WebService
          WebService --> Web1
          WebService --> Web2
          WebService --> Web3
        end

  This setup is slightly artificial because minikube port-forwarding gives you a local tunnel instead of a real external load balancer. The extra edge layer is here only to keep that tunnel up while the backend Pods fail behind it.
- Apply the manifest and wait for both deployments to become ready:

      kubectl apply -n reliability-lab -f web.yaml
      kubectl rollout status deployment/${ONID}-web -n reliability-lab
      kubectl rollout status deployment/${ONID}-edge -n reliability-lab

  You should see:

      deployment "${ONID}-web" successfully rolled out
      deployment "${ONID}-edge" successfully rolled out

- In a dedicated terminal, forward the edge Service to your laptop:

      kubectl port-forward service/${ONID}-edge ${FORWARD_PORT}:80 -n reliability-lab

  Leave this command running. The activity never changes the edge Deployment, so the local tunnel stays up while the backend deployment changes behind it.

- In another terminal, point `curl` at the local URL and verify the steady state:

      curl "$TARGET_URL"

  You should see an HTML page whose `<h1>` contains your ONID. At this point the service exists, the readiness probe is green, and the request path is working.
Define the SLO and Measure the Baseline
Before you break anything, make the steady state measurable. In this section you will turn that request path into a one-minute k6 test with a concrete pass or fail result.
For this activity, use this SLI: a request is good only if it returns HTTP 200 and finishes in under 300 ms. At 10 requests per second for 60 seconds, one run sends about 600 requests, so an SLO of 99.5% allows about 3 bad requests.
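The arithmetic behind "about 3 bad requests" can be sketched as a small helper. This is an illustration only and not part of the activity's scripts; the function name `errorBudget` and the choice to express the SLO in basis points (9950 for 99.5%) are assumptions made here to keep the math in integers.

```javascript
// Illustrative helper (hypothetical, not part of probe.js).
// sloBasisPoints: the SLO target in basis points, e.g. 9950 for 99.5%.
function errorBudget(ratePerSecond, durationSeconds, sloBasisPoints) {
  // Total requests sent at a constant arrival rate.
  const totalRequests = ratePerSecond * durationSeconds;
  // Whole number of requests that may be bad before the SLO is missed.
  const allowedBad = Math.floor(
    (totalRequests * (10000 - sloBasisPoints)) / 10000
  );
  return { totalRequests, allowedBad };
}

// The baseline run: 10 requests per second for 60 seconds at a 99.5% SLO.
const budget = errorBudget(10, 60, 9950);
console.log(budget); // { totalRequests: 600, allowedBad: 3 }
```

A fourth bad request in the same window would push the good rate below 99.5% and fail the run.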
- Write the k6 script:

      cat <<'EOF' > ~/cs312-reliability/probe.js
      import http from 'k6/http';
      import { check } from 'k6';
      import { Rate } from 'k6/metrics';

      // Track the share of requests that satisfy the full SLI.
      const sliOk = new Rate('sli_ok');

      // Read the port-forward URL from the shell environment.
      const targetUrl = __ENV.TARGET_URL;
      if (!targetUrl) {
        throw new Error('Set TARGET_URL before running k6.');
      }

      export const options = {
        scenarios: {
          steady: {
            // Keep the test at 10 requests per second for one minute.
            executor: 'constant-arrival-rate',
            rate: 10,
            timeUnit: '1s',
            duration: '60s',
            // Start with 10 workers ready and grow to 20 if needed.
            preAllocatedVUs: 10,
            maxVUs: 20,
          },
        },
        thresholds: {
          // The run fails if fewer than 99.5% of requests are good.
          sli_ok: ['rate >= 0.995'],
          // These latency thresholds are a second check on user experience.
          http_req_duration: ['p(95)<300', 'p(99)<500'],
        },
      };

      export default function () {
        // Each VU sends one request and records whether it met the SLI.
        const res = http.get(targetUrl);
        const good = res.status === 200 && res.timings.duration < 300;
        check(res, {
          'status is 200': (r) => r.status === 200,
        });
        sliOk.add(good);
      }
      EOF

  This writes `probe.js` into `~/cs312-reliability` even if this terminal is somewhere else. `probe.js` expects a `TARGET_URL` environment variable when you run `k6`. `sli_ok` only counts a request as good if both the status code and the latency requirement are satisfied.

  VUs are `k6` virtual users: lightweight workers that repeatedly run the `default` function in the script. In this test, `k6` keeps 10 requests per second flowing by starting with 10 ready VUs and using up to 20 if it needs more workers to hold that rate.

- Run the baseline measurement:

      k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

  This sends about 600 requests over one minute. The summary at the end is your steady-state measurement.

- Check the end-of-run summary. Focus on two lines:

  - `sli_ok`: this should be at or above 99.5%
  - `http_req_duration`: p95 should stay under 300 ms
If the baseline already fails, stop here and fix the service first. A chaos experiment without a healthy baseline is not measuring anything useful.
Find the Saturation Knee
The baseline tells you the service is healthy at 10 requests per second. It does not tell you how much headroom exists before it stops being healthy. A stress test answers that second question by stepping the load past the steady state at five discrete rates and asking, at each rate, whether the SLI still holds.
- Write a second k6 script that runs five back-to-back scenarios at increasing rates, with per-stage thresholds:

      cat <<'EOF' > ~/cs312-reliability/stress.js
      import http from 'k6/http';
      import { Rate } from 'k6/metrics';

      const sliOk = new Rate('sli_ok');

      const targetUrl = __ENV.TARGET_URL;
      if (!targetUrl) {
        throw new Error('Set TARGET_URL before running k6.');
      }

      export const options = {
        scenarios: {
          rate_50: {
            executor: 'constant-arrival-rate',
            rate: 50, timeUnit: '1s', duration: '20s',
            preAllocatedVUs: 50, maxVUs: 200,
            startTime: '0s',
          },
          rate_200: {
            executor: 'constant-arrival-rate',
            rate: 200, timeUnit: '1s', duration: '20s',
            preAllocatedVUs: 100, maxVUs: 500,
            startTime: '25s',
          },
          rate_500: {
            executor: 'constant-arrival-rate',
            rate: 500, timeUnit: '1s', duration: '20s',
            preAllocatedVUs: 200, maxVUs: 1000,
            startTime: '50s',
          },
          rate_750: {
            executor: 'constant-arrival-rate',
            rate: 750, timeUnit: '1s', duration: '20s',
            preAllocatedVUs: 300, maxVUs: 1500,
            startTime: '75s',
          },
          rate_1000: {
            executor: 'constant-arrival-rate',
            rate: 1000, timeUnit: '1s', duration: '20s',
            preAllocatedVUs: 500, maxVUs: 2000,
            startTime: '100s',
          },
        },
        thresholds: {
          'sli_ok{scenario:rate_50}': ['rate >= 0.995'],
          'sli_ok{scenario:rate_200}': ['rate >= 0.995'],
          'sli_ok{scenario:rate_500}': ['rate >= 0.995'],
          'sli_ok{scenario:rate_750}': ['rate >= 0.995'],
          'sli_ok{scenario:rate_1000}': ['rate >= 0.995'],
          'http_req_duration{scenario:rate_50}': ['p(95)<300'],
          'http_req_duration{scenario:rate_200}': ['p(95)<300'],
          'http_req_duration{scenario:rate_500}': ['p(95)<300'],
          'http_req_duration{scenario:rate_750}': ['p(95)<300'],
          'http_req_duration{scenario:rate_1000}': ['p(95)<300'],
        },
      };

      export default function () {
        const res = http.get(targetUrl);
        const good = res.status === 200 && res.timings.duration < 300;
        sliOk.add(good);
      }
      EOF

  Each scenario is a `constant-arrival-rate` test that holds one target rate for 20 seconds. The `startTime` values stagger them so they run back to back with a 5-second gap between stages. k6 automatically tags every metric sample with the `scenario` name it came from, which is what the threshold expressions `sli_ok{scenario:rate_500}` and `http_req_duration{scenario:rate_500}` filter on. The result is one pass-or-fail line per stage in the end-of-run summary.

- Run the stress test:

      k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/stress.js

  The scenarios themselves span 2 minutes (the last stage starts at 100 seconds and runs for 20); allow a little longer for k6 to start up and for in-flight requests to finish. k6 shows a per-scenario progress bar in the terminal but does not print pass/fail counts during the run; the per-stage verdict lands in the summary at the end.

- Read the summary’s threshold block. You will see ten lines, two per stage, each marked with a green check or red cross:

      ✓ sli_ok{scenario:rate_50}
      ✓ http_req_duration{scenario:rate_50}........: p(95)=...
      ✗ sli_ok{scenario:rate_200}
      ✓ http_req_duration{scenario:rate_200}.......: p(95)=...
      ✗ sli_ok{scenario:rate_500}
      ✓ http_req_duration{scenario:rate_500}.......: p(95)=...
      ✗ sli_ok{scenario:rate_750}
      ✗ http_req_duration{scenario:rate_750}.......: p(95)=...
      ✗ sli_ok{scenario:rate_1000}
      ✗ http_req_duration{scenario:rate_1000}......: p(95)=...

  The exact stage where the first red mark appears depends on your laptop, but with these limits it should fall somewhere between 200 and 1000 requests per second. Your run may not match this example exactly, and the two thresholds do not have to flip at the same stage.

- Notice which threshold trips first at the knee, or whether they flip together. The two checks are measuring different failure shapes. `sli_ok` allows only 0.5 percent bad requests, while the p95 latency check still passes until more than 5 percent of requests cross 300 ms. That makes `sli_ok` the stricter check: it can fail on a small pocket of slow or non-200 responses, while the latency threshold fails only once slowdown spreads across more than 5 percent of requests, at which point `sli_ok` has failed as well.
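To make the difference between the two checks concrete, here is a minimal sketch that applies both to synthetic samples. Everything in it is hypothetical: the sample data, the helper names, and the nearest-rank percentile, which is a simplification and may not match the exact method k6 uses for `p(95)`.

```javascript
// Share of samples that meet the SLI: HTTP 200 and under 300 ms.
function sliRate(samples) {
  const good = samples.filter(
    (s) => s.status === 200 && s.durationMs < 300
  ).length;
  return good / samples.length;
}

// Nearest-rank percentile of request durations (simplified on purpose).
function percentile(samples, p) {
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Build 100 synthetic HTTP 200 responses: slowCount at 900 ms, the rest at 50 ms.
function makeRun(slowCount) {
  return Array.from({ length: 100 }, (_, i) => ({
    status: 200,
    durationMs: i < slowCount ? 900 : 50,
  }));
}

const smallPocket = makeRun(2); // 2% of requests are slow
const broadSlowdown = makeRun(10); // 10% of requests are slow

console.log(sliRate(smallPocket), percentile(smallPocket, 95)); // 0.98 50
console.log(sliRate(broadSlowdown), percentile(broadSlowdown, 95)); // 0.9 900
```

In the first run the good rate is already below 0.995 while p95 is still 50 ms; in the second run both checks fail.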
Experiment 1: One Pod Fails Under Load
Start with the smallest blast radius that still touches the request path. Deleting one serving Pod tests whether replica redundancy and the Deployment controller can absorb a routine failure without burning much budget.
Write your hypothesis before running: will one deleted Pod spend any meaningful error budget while three replicas exist? If it does, how much?
- In the k6 terminal, rerun the same one-minute test:

      k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

- In the Kubernetes terminal, list the Pods and delete one of the three serving replicas (a `${ONID}-web` Pod, not an edge Pod):

      kubectl get pods -n reliability-lab
      kubectl delete pod <one-pod-name> -n reliability-lab

  The delete request returns immediately. The replacement Pod is created afterward by the Deployment controller.

- Wait until the deployment reports three ready replicas again:

      kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s

  When this finishes, the service has returned to its original replica count.

- When k6 completes, compare the summary to the baseline:

  - Did `sli_ok` stay above 99.5%?
  - Did p95 latency stay comfortably below 300 ms?
  - If there was budget spend, was it tiny or obvious?
Experiment 2: Zero Healthy Replicas
Now make the failure large enough that the service disappears from the request path. Because your port-forward terminates at the edge layer instead of the backend itself, the local tunnel stays up while the backend disappears. Scaling the deployment to zero gives you a clean no-backend window; deleting all three Pods can also cause failures, but the Deployment starts replacements immediately, so the outage duration depends more on timing.
Write your hypothesis before running: once the deployment reaches zero ready replicas, how badly will the one-minute run miss the 99.5% SLO?
- Start the same k6 run again:

      k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

- In the Kubernetes terminal, run the same four-command outage and recovery sequence you will reuse in the next experiment:

      kubectl scale deployment/${ONID}-web --replicas=0 -n reliability-lab
      kubectl wait --for=delete pod -l app=${ONID}-web -n reliability-lab --timeout=60s
      kubectl scale deployment/${ONID}-web --replicas=3 -n reliability-lab
      kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s

  The second line returns only after the old serving Pods are gone. The third line starts recovery immediately, so this experiment and the next one use the same no-backend window.

- When k6 finishes, inspect the summary carefully:

  - `sli_ok` should drop well below the baseline
  - the threshold for `sli_ok` (`rate >= 0.995`) should fail
  - the bad fraction is the part of the budget you spent during this one-minute window
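One way to turn the bad fraction into a statement about the budget is to express it as a multiple of what the SLO allowed. The sketch below is illustrative only: `budgetSpend` is a hypothetical helper, the SLO is passed in basis points (9950 for 99.5%) to keep the arithmetic exact, and the figure of 150 bad requests is a made-up example, not an expected result for your run.

```javascript
// Hypothetical helper: compare observed bad requests against the allowance.
// sloBasisPoints: the SLO target in basis points, e.g. 9950 for 99.5%.
function budgetSpend(totalRequests, badRequests, sloBasisPoints) {
  // Number of bad requests the SLO tolerates in this window.
  const allowedBad = (totalRequests * (10000 - sloBasisPoints)) / 10000;
  return {
    badFraction: badRequests / totalRequests,
    allowedBad,
    // How many times over the window's budget this run went.
    budgetMultiple: badRequests / allowedBad,
  };
}

// Made-up example: 150 of 600 requests were bad during the outage window.
const spend = budgetSpend(600, 150, 9950);
console.log(spend); // { badFraction: 0.25, allowedBad: 3, budgetMultiple: 50 }
```

Substitute the request count and the `sli_ok` failure count from your own summary to see how far past the one-minute allowance the outage went.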
Experiment 3: Graceful Degradation at the Edge
The zero-replicas experiment showed the worst case: when the backend disappears, the user sees nothing. That is a binary failure mode. A reduced response is the alternative: when the backend is gone, serve something limited but still useful. In this experiment you reconfigure the edge to fail open with a static fallback, then rerun the same scale-to-zero failure to compare.
Write your hypothesis before running: if the edge serves a fallback page when the backend is unreachable, how much of the user-visible SLO survives a zero-backend window?
- Write a new edge ConfigMap that intercepts upstream errors and serves a personalized fallback page:

      cat <<EOF > ~/cs312-reliability/edge-degraded.yaml
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: ${ONID}-edge-config
      data:
        nginx.conf: |
          events {}
          http {
            server {
              listen 80;
              location / {
                proxy_connect_timeout 1s;
                proxy_read_timeout 1s;
                proxy_send_timeout 1s;
                proxy_next_upstream off;
                proxy_intercept_errors on;
                proxy_pass http://${ONID}-web;
                proxy_set_header Host \$host;
                proxy_set_header X-Real-IP \$remote_addr;
                error_page 502 503 504 = @fallback;
              }
              location @fallback {
                default_type text/html;
                return 200 '<!doctype html><html><body><h1>${ONID} reliability lab</h1><p>Reduced mode: serving cached content while the backend recovers.</p></body></html>';
              }
            }
          }
      EOF
      kubectl apply -n reliability-lab -f ~/cs312-reliability/edge-degraded.yaml

  - `proxy_intercept_errors on` tells NGINX to route upstream responses with a status of 300 or higher through `error_page` handling instead of passing them straight to the client. Only the codes listed in an `error_page` directive are actually replaced.
  - `error_page 502 503 504 = @fallback` sends those failures to a named location through an internal redirect.
  - `location @fallback` returns a hand-written 200 with the personalized fallback page. The `@` prefix makes the location internal: clients cannot request it directly.

- Restart the edge Deployment so the new ConfigMap is picked up:

      kubectl rollout restart deployment/${ONID}-edge -n reliability-lab
      kubectl rollout status deployment/${ONID}-edge -n reliability-lab

  `kubectl rollout restart` recreates the edge Pods. This is necessary because ConfigMaps mounted with `subPath` are not refreshed automatically in running Pods.

- With the backend still healthy, verify nothing visible changed in steady state:

      curl "$TARGET_URL"

  You should still see the original page with your ONID. The fallback only fires when the backend cannot answer, so a healthy request still goes through the `proxy_pass` path.

- Start the same one-minute k6 run as before:

      k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

- In the fourth terminal, start a short sampling loop before you cut the backend:

      for i in $(seq 1 20); do curl -s "$TARGET_URL" | grep -E 'h1|Reduced'; sleep 1; done

  This loop runs long enough to overlap the outage window and show the page change from steady state to fallback and back.

- In the Kubernetes terminal, repeat the exact same outage and recovery sequence from Experiment 2:

      kubectl scale deployment/${ONID}-web --replicas=0 -n reliability-lab
      kubectl wait --for=delete pod -l app=${ONID}-web -n reliability-lab --timeout=60s
      kubectl scale deployment/${ONID}-web --replicas=3 -n reliability-lab
      kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s

  While the backend has zero ready replicas, the sampling loop should show the `Reduced mode` line instead of the steady-state page. When the backend recovers, the original page returns.

- When k6 finishes, compare the summary to Experiment 2:

  - Did `sli_ok` stay higher than the zero-replicas run from Experiment 2?
  - Did the share of HTTP 200 responses stay close to 100 percent, even though no backend Pods existed for part of the window?
  - Did latency still suffer briefly while the proxy decided that the upstream was unreachable?
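Because the fallback returns HTTP 200, the SLI in `probe.js` counts a fast fallback page as a good request. If you want to see how many "good" requests were actually served in reduced mode, one option is to classify responses by body as well as by status and latency. The sketch below is a hypothetical helper, not part of the activity: the field names `status`, `durationMs`, and `body` are assumptions, and the `Reduced mode` marker is taken from the fallback page defined in this experiment.

```javascript
// Classify one response as 'full', 'degraded', or 'bad'.
function classify(response) {
  // Same SLI as the probe: HTTP 200 and under 300 ms, otherwise bad.
  if (response.status !== 200 || response.durationMs >= 300) {
    return 'bad';
  }
  // A passing response that carries the fallback marker is degraded.
  return response.body.includes('Reduced mode') ? 'degraded' : 'full';
}

// Count how many responses fall into each class.
function tally(responses) {
  const counts = { full: 0, degraded: 0, bad: 0 };
  for (const r of responses) {
    counts[classify(r)] += 1;
  }
  return counts;
}

// Made-up sample: one normal page, one fallback page, one gateway error.
const sample = [
  { status: 200, durationMs: 12, body: '<h1>your-onid reliability lab</h1>' },
  { status: 200, durationMs: 40, body: '<p>Reduced mode: serving cached content</p>' },
  { status: 504, durationMs: 1001, body: '' },
];
console.log(tally(sample)); // { full: 1, degraded: 1, bad: 1 }
```

Tracking the degraded share separately keeps the user-visible SLO honest: the run can pass while still showing how much of it was served without the backend.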
Experiment 4: Failed Rollout, Safe Rollback
A failed change is less dangerous than a failed service when the rollout mechanics keep old replicas serving traffic while you undo the mistake. This section tests whether a broken rollout can stay operationally unhealthy without becoming a user-visible outage.
Write your hypothesis before running: can a rollout be broken and still stay within the SLO because the old Pods continue serving?
- Start the same k6 run again:

      k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

- In the Kubernetes terminal, intentionally set a bad image tag on the deployment:

      kubectl set image deployment/${ONID}-web web=nginx:this-tag-does-not-exist -n reliability-lab

  Kubernetes now tries to create new Pods from an image that cannot be pulled.

- Inspect the Pods while the bad rollout is in progress:

      kubectl get pods -n reliability-lab

  You should see at least one new Pod enter `ErrImagePull` or `ImagePullBackOff` while older Pods remain `Running`. That is the safety property you are testing.

- Undo the rollout and wait for the deployment to settle:

      kubectl rollout undo deployment/${ONID}-web -n reliability-lab
      kubectl rollout status deployment/${ONID}-web -n reliability-lab

- When k6 finishes, compare the summary to the earlier runs:

  - Did `sli_ok` stay above 99.5% despite the failed rollout?
  - Did the old Pods preserve availability while the new image failed?
  - Was the bad change reversible before it became a user-visible outage?
Record the Final Steady State
The last step of a reliability exercise is confirming that the system is healthy again and keeping a concrete record of what happened. The result should show both the user-visible SLI and the cluster state with your ONID in it.
- Run one final clean baseline and save the summary to a file:

      k6 run --quiet -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js | tee final-check.txt

  The `--quiet` flag suppresses the live progress meter, so `final-check.txt` gets the final summary instead of a stream of progress updates. This final run should pass all thresholds again. If it does not, the system is not back to steady state.

- Append one user-visible check and the Kubernetes state:

      curl -s "$TARGET_URL" | grep h1 | tee -a final-check.txt
      kubectl get deployment,replicaset,pods,service -n reliability-lab | tee -a final-check.txt

  The first command writes the page header containing your ONID. The second command writes the edge and backend Deployments, ReplicaSets, Pods, and Services that now make up the recovered system.

- Print the record you just created:

      cat final-check.txt

  You should see a passing `k6` summary, your personalized page header, and healthy Kubernetes objects named with your ONID. This is the concrete end state of the activity.
Going Further
You measured a load test, a stress ramp, and four controlled experiments. The next steps are to vary the failure shape, the recovery mechanism, and the observability that sits underneath all of it.
The most direct hands-on extension of what you just built is a spike test driven by autoscaling. Attach a HorizontalPodAutoscaler to the backend (`kubectl autoscale deployment/${ONID}-web --min=2 --max=8 --cpu-percent=50 -n reliability-lab` is the one-line version), then write a third k6 scenario that jumps from 10 to 100 requests per second instantly. Watch `kubectl get hpa,pods -n reliability-lab -w` in another terminal while the run is in flight, and look for the gap between the moment the spike arrives and the moment new Pods are ready. That gap is the budget the autoscaler cannot save you from.
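A possible shape for that spike scenario, reusing the back-to-back `constant-arrival-rate` pattern from `stress.js`, is sketched below. The scenario names, durations, and VU counts are guesses rather than tuned values. In a real k6 script this object would be exported as `options`, next to the same imports and `default` function as `stress.js`. The sketch also assumes the cluster can serve CPU metrics to the autoscaler (on minikube that usually means the metrics-server addon is enabled).

```javascript
// Hypothetical k6 options for an instant 10 -> 100 requests-per-second spike.
// In a k6 script, use: export const options = spikeOptions;
const spikeOptions = {
  scenarios: {
    // Warm-up at the steady-state rate.
    steady_10: {
      executor: 'constant-arrival-rate',
      rate: 10, timeUnit: '1s', duration: '30s',
      preAllocatedVUs: 10, maxVUs: 50,
      startTime: '0s',
    },
    // Starts the moment the warm-up ends, so the rate jumps with no ramp.
    spike_100: {
      executor: 'constant-arrival-rate',
      rate: 100, timeUnit: '1s', duration: '60s',
      preAllocatedVUs: 50, maxVUs: 300,
      startTime: '30s',
    },
  },
  thresholds: {
    // One verdict per phase, filtered by the scenario tag.
    'sli_ok{scenario:steady_10}': ['rate >= 0.995'],
    'sli_ok{scenario:spike_100}': ['rate >= 0.995'],
  },
};

console.log(Object.keys(spikeOptions.scenarios)); // [ 'steady_10', 'spike_100' ]
```

Keeping the two phases as separate scenarios gives a separate `sli_ok` verdict for the warm-up and the spike, which makes the cost of the scale-up delay easier to read.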
For a subtler controller-healing comparison, delete all three Pods at once with `kubectl delete pods -l app=${ONID}-web -n reliability-lab` without changing the replica count, and compare that run to the scale-to-zero run from Experiment 2. If you have already completed Prometheus and Grafana, rerun the stress ramp while watching the RED panels so you can compare k6’s end-of-run summary to the dashboard’s p95 latency and error-rate curves. And if you want a more realistic failure than Pod deletion, install LitmusChaos or Chaos Mesh in minikube and repeat the same k6 harness during a network-latency experiment.