Measure Error Budget

This activity puts into practice the concepts from the Reliability Engineering lecture. You will define a simple SLO for a personalized web service, probe it with k6, ramp the load past the steady state to find where the service breaks, and then repeat a one-minute test across four controlled experiments including a graceful-degradation fallback at the edge. By the end, you will have measured the saturation knee of your service, seen which failures spent error budget, observed how a fail-open fallback changes the user-visible result, and verified that the system returned to steady state.

  • Your cluster from the Minikube activity
  • kubectl already working against that cluster
  • k6 installed before class using the official k6 install guide and verified with k6 version
  • Four terminal windows or tabs: one for kubectl port-forward, one for k6, one for Kubernetes commands, and one for the short sampling loop in Experiment 3

Before you inject any failures, give yourself a small backend service and a tiny front-end NGINX layer. You will port-forward into the front layer, so the local tunnel stays up while the backend Pods fail behind it.

  1. In every terminal that will run kubectl or k6, set your ONID, local forward port, and target URL:

    Terminal window
    export ONID=your-onid
    export FORWARD_PORT=8080 # or 8081, 8082, ... if 8080 is in use
    export TARGET_URL=http://127.0.0.1:${FORWARD_PORT}/

    Replace your-onid with your actual ONID, with no angle brackets. Any later command that uses $ONID, $FORWARD_PORT, or $TARGET_URL assumes you ran this first in that terminal.

  2. Create a working directory for this activity and an idempotent namespace:

    Terminal window
    mkdir -p ~/cs312-reliability
    cd ~/cs312-reliability
    kubectl create namespace reliability-lab --dry-run=client -o yaml | kubectl apply -f -

    This keeps the files together and makes it safe to rerun the setup commands.

  3. Write the manifest for the backend service, the front-end edge layer, and the two Services:

    Terminal window
    cat <<EOF > web.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ${ONID}-site
    data:
      index.html: |
        <!doctype html>
        <html>
          <body>
            <h1>${ONID} reliability lab</h1>
            <p>This page is the steady-state target for k6.</p>
          </body>
        </html>
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ${ONID}-web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: ${ONID}-web
      template:
        metadata:
          labels:
            app: ${ONID}-web
        spec:
          containers:
            - name: web
              image: nginx:alpine
              ports:
                - containerPort: 80
              resources:
                requests:
                  cpu: "20m"
                  memory: "32Mi"
                limits:
                  cpu: "50m"
                  memory: "64Mi"
              readinessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 2
                periodSeconds: 2
              volumeMounts:
                - name: site
                  mountPath: /usr/share/nginx/html/index.html
                  subPath: index.html
          volumes:
            - name: site
              configMap:
                name: ${ONID}-site
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ${ONID}-web
    spec:
      selector:
        app: ${ONID}-web
      ports:
        - port: 80
          targetPort: 80
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ${ONID}-edge-config
    data:
      nginx.conf: |
        events {}
        http {
          server {
            listen 80;
            location / {
              proxy_connect_timeout 1s;
              proxy_read_timeout 1s;
              proxy_send_timeout 1s;
              proxy_next_upstream off;
              proxy_pass http://${ONID}-web;
              proxy_set_header Host \$host;
              proxy_set_header X-Real-IP \$remote_addr;
            }
          }
        }
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ${ONID}-edge
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: ${ONID}-edge
      template:
        metadata:
          labels:
            app: ${ONID}-edge
        spec:
          containers:
            - name: edge
              image: nginx:alpine
              ports:
                - containerPort: 80
              readinessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 2
                periodSeconds: 2
              volumeMounts:
                - name: edge-config
                  mountPath: /etc/nginx/nginx.conf
                  subPath: nginx.conf
          volumes:
            - name: edge-config
              configMap:
                name: ${ONID}-edge-config
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: ${ONID}-edge
    spec:
      selector:
        app: ${ONID}-edge
      ports:
        - port: 80
          targetPort: 80
    EOF

    The backend Service still load balances across the three web Pods. Each web Pod has a 50 millicore CPU limit so the cluster has a known capacity ceiling instead of “as fast as your laptop can go”; this will matter when you stress-test it shortly. The edge Deployment is the stable front door you will port-forward into. Its one-second upstream timeouts turn a missing backend into quick bad requests instead of one-minute hangs.

    The shape of our deployment is:

    flowchart TB
      Client["client"] --> EdgeService
    
      subgraph Cluster[" "]
        EdgeService["edge Service"]
        Edge1["edge 1"]
        Edge2["edge 2"]
        WebService["web Service"]
        Web1["web 1"]
        Web2["web 2"]
        Web3["web 3"]
    
        EdgeService --> Edge1
        EdgeService --> Edge2
        Edge1 --> WebService
        Edge2 --> WebService
        WebService --> Web1
        WebService --> Web2
        WebService --> Web3
      end

    This setup is slightly artificial because kubectl port-forward gives you a local tunnel instead of a real external load balancer. The extra edge layer is here only to keep that tunnel up while the backend Pods fail behind it.

  4. Apply the manifest and wait for both deployments to become ready:

    Terminal window
    kubectl apply -n reliability-lab -f web.yaml
    kubectl rollout status deployment/${ONID}-web -n reliability-lab
    kubectl rollout status deployment/${ONID}-edge -n reliability-lab

    You should see:

    deployment "${ONID}-web" successfully rolled out
    deployment "${ONID}-edge" successfully rolled out

  5. In a dedicated terminal, forward the edge Service to your laptop:

    Terminal window
    kubectl port-forward service/${ONID}-edge ${FORWARD_PORT}:80 -n reliability-lab

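    Leave this command running. Experiments 1, 2, and 4 change only the backend Deployment, so the local tunnel stays up while the backend changes behind it. Experiment 3 restarts the edge Pods, which ends this port-forward session; rerun the command after that restart.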

  6. In another terminal, point curl at the local URL and verify the steady state:

    Terminal window
    curl "$TARGET_URL"

    You should see an HTML page whose <h1> contains your ONID. At this point the service exists, the readiness probe is green, and the request path is working.


Before you break anything, make the steady state measurable. In this section you will turn that request path into a one-minute k6 test with a concrete pass or fail result.

For this activity, use this SLI: a request is good only if it returns HTTP 200 and finishes in under 300 ms. At 10 requests per second for 60 seconds, one run sends about 600 requests, so an SLO of 99.5% allows about 3 bad requests.
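The budget arithmetic above can be checked with a few lines of plain JavaScript. This is a minimal sketch that runs under Node, not part of the k6 script; the helper name errorBudget is hypothetical.

```javascript
// Error-budget arithmetic for one run (illustrative helper, not part of probe.js).
function errorBudget(ratePerSecond, durationSeconds, sloTarget) {
  const totalRequests = ratePerSecond * durationSeconds;
  // Round before flooring so floating-point noise (for example 2.9999999)
  // cannot silently remove one request from the budget.
  const rawBudget = totalRequests * (1 - sloTarget);
  const allowedBad = Math.floor(Number(rawBudget.toFixed(6)));
  return { totalRequests, allowedBad };
}

const budget = errorBudget(10, 60, 0.995);
console.log(budget); // { totalRequests: 600, allowedBad: 3 }
```

A fourth bad request in the same minute would push the run below 99.5%, which is why even short failures matter in a window this small.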

  1. Write the k6 script:

    Terminal window
    cat <<'EOF' > ~/cs312-reliability/probe.js
    import http from 'k6/http';
    import { check } from 'k6';
    import { Rate } from 'k6/metrics';

    // Track the share of requests that satisfy the full SLI.
    const sliOk = new Rate('sli_ok');

    // Read the port-forward URL from the shell environment.
    const targetUrl = __ENV.TARGET_URL;
    if (!targetUrl) {
      throw new Error('Set TARGET_URL before running k6.');
    }

    export const options = {
      scenarios: {
        steady: {
          // Keep the test at 10 requests per second for one minute.
          executor: 'constant-arrival-rate',
          rate: 10,
          timeUnit: '1s',
          duration: '60s',
          // Start with 10 workers ready and grow to 20 if needed.
          preAllocatedVUs: 10,
          maxVUs: 20,
        },
      },
      thresholds: {
        // The run fails if fewer than 99.5% of requests are good.
        sli_ok: ['rate >= 0.995'],
        // These latency thresholds are a second check on user experience.
        http_req_duration: ['p(95)<300', 'p(99)<500'],
      },
    };

    export default function () {
      // Each VU sends one request and records whether it met the SLI.
      const res = http.get(targetUrl);
      const good = res.status === 200 && res.timings.duration < 300;
      check(res, {
        'status is 200': (r) => r.status === 200,
      });
      sliOk.add(good);
    }
    EOF

    This writes probe.js into ~/cs312-reliability even if this terminal is somewhere else. probe.js expects a TARGET_URL environment variable when you run k6. sli_ok only counts a request as good if both the status code and the latency requirement are satisfied.

    VUs are k6 virtual users: lightweight workers that repeatedly run the default function in the script. In this test, k6 keeps 10 requests per second flowing by starting with 10 ready VUs and using up to 20 if it needs more workers to hold that rate.

  2. Run the baseline measurement:

    Terminal window
    k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

    This sends about 600 requests over one minute. The summary at the end is your steady-state measurement.

  3. Check the end-of-run summary. Focus on two lines:

    • sli_ok: this should be at or above 99.5%
    • http_req_duration: p(95) should stay under 300 ms and p(99) under 500 ms

    If the baseline already fails, stop here and fix the service first. A chaos experiment without a healthy baseline is not measuring anything useful.
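To see why probe.js preallocates 10 VUs and caps at 20, a rough estimate helps. The sketch below applies Little's law (concurrency ≈ arrival rate × time per request) in plain JavaScript. The helper name vusNeeded and the sample latencies are assumptions for illustration, and k6's real scheduling is more involved than this.

```javascript
// Rough estimate of the concurrent VUs needed to hold an arrival rate.
// Assumes each VU is busy for `latencySeconds` per request (Little's law).
function vusNeeded(ratePerSecond, latencySeconds) {
  return Math.max(1, Math.ceil(ratePerSecond * latencySeconds));
}

console.log(vusNeeded(10, 0.05)); // 1  (healthy backend, about 50 ms per request)
console.log(vusNeeded(10, 1));    // 10 (requests held for the edge's 1 s timeout)
console.log(vusNeeded(10, 2));    // 20 (the maxVUs ceiling in probe.js)
```

Under this estimate, the preallocated pool covers requests that take up to about one second, and the cap covers about two. If responses ever take longer than the VU pool can absorb, k6 reports the shortfall in its dropped_iterations metric instead of sending the full 600 requests.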


The baseline tells you the service is healthy at 10 requests per second. It does not tell you how much headroom exists before it stops being healthy. A stress test answers that second question by stepping the load past the steady state at five discrete rates and asking, at each rate, whether the SLI still holds.

  1. Write a second k6 script that runs five back-to-back scenarios at increasing rates, with per-stage thresholds:

    Terminal window
    cat <<'EOF' > ~/cs312-reliability/stress.js
    import http from 'k6/http';
    import { Rate } from 'k6/metrics';

    const sliOk = new Rate('sli_ok');

    const targetUrl = __ENV.TARGET_URL;
    if (!targetUrl) {
      throw new Error('Set TARGET_URL before running k6.');
    }

    export const options = {
      scenarios: {
        rate_50: {
          executor: 'constant-arrival-rate',
          rate: 50, timeUnit: '1s', duration: '20s',
          preAllocatedVUs: 50, maxVUs: 200,
          startTime: '0s',
        },
        rate_200: {
          executor: 'constant-arrival-rate',
          rate: 200, timeUnit: '1s', duration: '20s',
          preAllocatedVUs: 100, maxVUs: 500,
          startTime: '25s',
        },
        rate_500: {
          executor: 'constant-arrival-rate',
          rate: 500, timeUnit: '1s', duration: '20s',
          preAllocatedVUs: 200, maxVUs: 1000,
          startTime: '50s',
        },
        rate_750: {
          executor: 'constant-arrival-rate',
          rate: 750, timeUnit: '1s', duration: '20s',
          preAllocatedVUs: 300, maxVUs: 1500,
          startTime: '75s',
        },
        rate_1000: {
          executor: 'constant-arrival-rate',
          rate: 1000, timeUnit: '1s', duration: '20s',
          preAllocatedVUs: 500, maxVUs: 2000,
          startTime: '100s',
        },
      },
      thresholds: {
        'sli_ok{scenario:rate_50}': ['rate >= 0.995'],
        'sli_ok{scenario:rate_200}': ['rate >= 0.995'],
        'sli_ok{scenario:rate_500}': ['rate >= 0.995'],
        'sli_ok{scenario:rate_750}': ['rate >= 0.995'],
        'sli_ok{scenario:rate_1000}': ['rate >= 0.995'],
        'http_req_duration{scenario:rate_50}': ['p(95)<300'],
        'http_req_duration{scenario:rate_200}': ['p(95)<300'],
        'http_req_duration{scenario:rate_500}': ['p(95)<300'],
        'http_req_duration{scenario:rate_750}': ['p(95)<300'],
        'http_req_duration{scenario:rate_1000}': ['p(95)<300'],
      },
    };

    export default function () {
      const res = http.get(targetUrl);
      const good = res.status === 200 && res.timings.duration < 300;
      sliOk.add(good);
    }
    EOF

    Each scenario is a constant-arrival-rate test that holds one target rate for 20 seconds. The startTime values stagger them so they run back to back with a 5-second gap between stages. k6 automatically tags every metric sample with the scenario name it came from, which is what the threshold expressions sli_ok{scenario:rate_500} and http_req_duration{scenario:rate_500} filter on. The result is one pass-or-fail line per stage in the end-of-run summary.

  2. Run the stress test:

    Terminal window
    k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/stress.js

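    The scheduled load lasts about 2 minutes: the last stage starts at 100 seconds and runs for 20 seconds. Allow a little extra time for in-flight requests to finish. k6 shows a per-scenario progress bar in the terminal but does not print pass/fail counts during the run; the per-stage verdict lands in the summary at the end.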

  3. Read the summary’s threshold block. You will see ten lines, two per stage, each marked with a green check or red cross:

    ✓ sli_ok{scenario:rate_50}
    ✓ http_req_duration{scenario:rate_50}........: p(95)=...
    ✗ sli_ok{scenario:rate_200}
    ✓ http_req_duration{scenario:rate_200}.......: p(95)=...
    ✗ sli_ok{scenario:rate_500}
    ✓ http_req_duration{scenario:rate_500}.......: p(95)=...
    ✗ sli_ok{scenario:rate_750}
    ✗ http_req_duration{scenario:rate_750}.......: p(95)=...
    ✗ sli_ok{scenario:rate_1000}
    ✗ http_req_duration{scenario:rate_1000}......: p(95)=...

    The exact stage where the first red mark appears depends on your laptop, but with these limits it should fall somewhere between 200 and 1000 requests per second. Your run may not match this example exactly, and the two thresholds do not have to flip at the same stage.

  4. Notice which threshold trips first at the knee, or whether they flip together. The two checks are measuring different failure shapes. sli_ok allows only 0.5 percent bad requests, while the p95 latency check still passes until more than 5 percent of requests cross 300 ms. That means sli_ok can fail first if you have a small pocket of slow or non-200 responses, while the latency threshold can fail first if slowdown spreads across a larger share of requests.
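The difference between the two thresholds in step 4 is easier to see with numbers. The sketch below is plain JavaScript over a hypothetical set of samples, not real k6 output. It uses a simple nearest-rank percentile, which may differ slightly from how k6 computes p(95).

```javascript
// Nearest-rank percentile; an approximation of what a load tool reports.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Share of requests that meet the SLI: HTTP 200 and under 300 ms.
function sliRate(samples) {
  const good = samples.filter((s) => s.status === 200 && s.durationMs < 300);
  return good.length / samples.length;
}

// Hypothetical stage: 1000 requests, 2% of them slow (400 ms), the rest fast (50 ms).
const samples = Array.from({ length: 1000 }, (_, i) => ({
  status: 200,
  durationMs: i < 20 ? 400 : 50,
}));

const rate = sliRate(samples);                                // 0.98
const p95 = percentile(samples.map((s) => s.durationMs), 95); // 50

console.log(rate >= 0.995); // false: the sli_ok threshold fails
console.log(p95 < 300);     // true: p(95)<300 still passes
```

With 2% of requests slow, the SLI is already violated while p95 looks perfectly healthy. In this model the latency threshold only flips once more than 5% of requests are slow.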


Experiment 1: Delete One Pod

Start with the smallest blast radius that still touches the request path. Deleting one serving Pod tests whether replica redundancy and the Deployment controller can absorb a routine failure without burning much budget.

Write your hypothesis before running: will one deleted Pod spend any meaningful error budget while three replicas exist? If it does, how much?

  1. In the k6 terminal, rerun the same one-minute test:

    Terminal window
    k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

  2. In the Kubernetes terminal, list the Pods and delete one of the three ${ONID}-web replicas (not an edge Pod):

    Terminal window
    kubectl get pods -n reliability-lab
    kubectl delete pod <one-pod-name> -n reliability-lab

    The delete request returns immediately. The replacement Pod is created afterward by the Deployment controller.

  3. Wait until the deployment reports three ready replicas again:

    Terminal window
    kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s

    When this finishes, the service has returned to its original replica count.

  4. When k6 completes, compare the summary to the baseline:

    • Did sli_ok stay above 99.5%?
    • Did p95 latency stay comfortably below 300 ms?
    • If there was budget spend, was it tiny or obvious?

Experiment 2: Scale to Zero

Now make the failure large enough that the service disappears from the request path. Because your port-forward terminates at the edge layer instead of the backend itself, the local tunnel stays up while the backend disappears. Scaling the deployment to zero gives you a clean no-backend window; deleting all three Pods can also cause failures, but the Deployment starts replacements immediately, so the outage duration depends more on timing.

Write your hypothesis before running: once the deployment reaches zero ready replicas, how badly will the one-minute run miss the 99.5% SLO?

  1. Start the same k6 run again:

    Terminal window
    k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

  2. In the Kubernetes terminal, run this four-command outage and recovery sequence, which you will reuse in Experiment 3:

    Terminal window
    kubectl scale deployment/${ONID}-web --replicas=0 -n reliability-lab
    kubectl wait --for=delete pod -l app=${ONID}-web -n reliability-lab --timeout=60s
    kubectl scale deployment/${ONID}-web --replicas=3 -n reliability-lab
    kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s

    The second line returns only after the old serving Pods are gone. The third line starts recovery immediately, so this experiment and the next one use the same no-backend window.

  3. When k6 finishes, inspect the summary carefully:

    • sli_ok should drop well below the baseline
    • the threshold for sli_ok rate >= 0.995 should fail
    • the bad fraction is the part of the budget you spent during this one-minute window
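To put the spent budget in perspective, you can estimate it from the length of the no-backend window. The sketch below is plain JavaScript under a simplifying assumption: every request during the outage is bad and every other request is good. The 15-second outage is a hypothetical figure, not a measurement, and the helper name outageImpact is made up for this example.

```javascript
// Estimate how a total outage of `outageSeconds` spends one run's error budget.
// Assumes all requests during the outage fail and all others succeed.
function outageImpact(outageSeconds, ratePerSecond = 10, durationSeconds = 60, sloTarget = 0.995) {
  const total = ratePerSecond * durationSeconds;
  const bad = Math.min(total, ratePerSecond * outageSeconds);
  const allowedBad = total * (1 - sloTarget);
  return {
    sliOk: (total - bad) / total,
    // How many times over the run's budget was spent, rounded to a whole number.
    budgetMultiple: Math.round(bad / allowedBad),
  };
}

console.log(outageImpact(15)); // { sliOk: 0.75, budgetMultiple: 50 }
```

Compare your measured sli_ok with this estimate. A large gap suggests the real outage window was shorter or longer than you assumed.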

Experiment 3: Graceful Degradation at the Edge

The zero-replicas experiment showed the worst case: when the backend disappears, every request fails with an error from the edge. That is a binary failure mode. Graceful degradation is the alternative: when the backend is gone, serve something limited but still useful. In this experiment you reconfigure the edge to fail open with a static fallback, then rerun the same scale-to-zero failure to compare.

Write your hypothesis before running: if the edge serves a fallback page when the backend is unreachable, how much of the user-visible SLO survives a zero-backend window?

  1. Write a new edge ConfigMap that intercepts upstream errors and serves a personalized fallback page:

    Terminal window
    cat <<EOF > ~/cs312-reliability/edge-degraded.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ${ONID}-edge-config
    data:
      nginx.conf: |
        events {}
        http {
          server {
            listen 80;
            location / {
              proxy_connect_timeout 1s;
              proxy_read_timeout 1s;
              proxy_send_timeout 1s;
              proxy_next_upstream off;
              proxy_intercept_errors on;
              proxy_pass http://${ONID}-web;
              proxy_set_header Host \$host;
              proxy_set_header X-Real-IP \$remote_addr;
              error_page 502 503 504 = @fallback;
            }
            location @fallback {
              default_type text/html;
              return 200 '<!doctype html><html><body><h1>${ONID} reliability lab</h1><p>Reduced mode: serving cached content while the backend recovers.</p></body></html>';
            }
          }
        }
    EOF
    kubectl apply -n reliability-lab -f ~/cs312-reliability/edge-degraded.yaml

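    proxy_intercept_errors on tells NGINX to hand error responses returned by the backend (status 300 and above) to error_page instead of passing them straight through. Errors that NGINX generates itself when the backend is unreachable go through error_page either way. error_page 502 503 504 = @fallback internally redirects those three statuses to a named location, and the = lets that location set the final status code. location @fallback returns a hand-written 200 with the personalized fallback page. The @ prefix makes it a named location that clients cannot request directly.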

  2. Restart the edge Deployment so the new ConfigMap is picked up:

    Terminal window
    kubectl rollout restart deployment/${ONID}-edge -n reliability-lab
    kubectl rollout status deployment/${ONID}-edge -n reliability-lab

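    kubectl rollout restart recreates the edge Pods. This is necessary because ConfigMaps mounted with subPath are not refreshed automatically in running Pods. Recreating the edge Pods also ends the kubectl port-forward session, which is bound to a single Pod, so rerun the port-forward command in its terminal before continuing.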

  3. With the backend still healthy, verify nothing visible changed in steady state:

    Terminal window
    curl "$TARGET_URL"

    You should still see the original page with your ONID. The fallback only fires when the backend cannot answer, so a healthy request still goes through the proxy_pass path.

  4. Start the same one-minute k6 run as before:

    Terminal window
    k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

  5. In a fourth terminal, start a short sampling loop before you cut the backend:

    Terminal window
    for i in $(seq 1 20); do curl -s "$TARGET_URL" | grep -E 'h1|Reduced'; sleep 1; done

    This loop runs long enough to overlap the outage window and show the page change from steady state to fallback and back.

  6. In the Kubernetes terminal, repeat the exact same outage and recovery sequence from Experiment 2:

    Terminal window
    kubectl scale deployment/${ONID}-web --replicas=0 -n reliability-lab
    kubectl wait --for=delete pod -l app=${ONID}-web -n reliability-lab --timeout=60s
    kubectl scale deployment/${ONID}-web --replicas=3 -n reliability-lab
    kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s

    While the backend has zero ready replicas, the sampling loop should show the Reduced mode line instead of the steady-state page. When the backend recovers, the original page returns.

  7. When k6 finishes, compare the summary to Experiment 2:

    • Did sli_ok stay higher than the zero-replicas run from Experiment 2?
    • Did the "status is 200" check stay close to 100 percent, even though no backend Pods existed for part of the window?
    • Did latency still suffer briefly while the proxy decided that the upstream was unreachable?
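The fallback rule from step 1 can be summarized as a small decision function. This is a toy model in plain JavaScript, not NGINX itself. It assumes an unreachable backend surfaces as a 502, where a timeout would give 504 with the same outcome. The name edgeStatus is hypothetical.

```javascript
// Toy model of the status code the client sees from the edge.
// `upstreamStatus` is the backend's HTTP status, or null when the backend is
// unreachable, in which case NGINX generates the error itself (modelled as 502).
const INTERCEPTED = new Set([502, 503, 504]); // the codes listed in error_page

function edgeStatus(upstreamStatus, fallbackEnabled) {
  const status = upstreamStatus === null ? 502 : upstreamStatus;
  if (fallbackEnabled && INTERCEPTED.has(status)) {
    return 200; // answered by location @fallback
  }
  return status; // passed through to the client unchanged
}

console.log(edgeStatus(null, false)); // 502: Experiment 2, no fallback
console.log(edgeStatus(null, true));  // 200: Experiment 3, fallback page
console.log(edgeStatus(500, true));   // 500: not listed in error_page
```

The last case is worth noticing: only the three listed codes fail open. A backend that answers with a 500 would still spend error budget.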

Experiment 4: Failed Rollout, Safe Rollback

A failed change is less dangerous than a failed service when the rollout mechanics keep old replicas serving traffic while you undo the mistake. This section tests whether a broken rollout can stay operationally unhealthy without becoming a user-visible outage.

Write your hypothesis before running: can a rollout be broken and still stay within the SLO because the old Pods continue serving?

  1. Start the same k6 run again:

    Terminal window
    k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js

  2. In the Kubernetes terminal, intentionally set a bad image tag on the deployment:

    Terminal window
    kubectl set image deployment/${ONID}-web web=nginx:this-tag-does-not-exist -n reliability-lab

    Kubernetes now tries to create new Pods from an image that cannot be pulled.

  3. Inspect the Pods while the bad rollout is in progress:

    Terminal window
    kubectl get pods -n reliability-lab

    You should see at least one new Pod enter ErrImagePull or ImagePullBackOff while older Pods remain Running. That is the safety property you are testing.

  4. Undo the rollout and wait for the deployment to settle:

    Terminal window
    kubectl rollout undo deployment/${ONID}-web -n reliability-lab
    kubectl rollout status deployment/${ONID}-web -n reliability-lab

  5. When k6 finishes, compare the summary to the earlier runs:

    • Did sli_ok stay above 99.5% despite the failed rollout?
    • Did the old Pods preserve availability while the new image failed?
    • Was the bad change reversible before it became a user-visible outage?
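The safety property from step 3 follows from the Deployment's rolling-update bounds. web.yaml sets no strategy, so the Kubernetes defaults apply: RollingUpdate with maxSurge and maxUnavailable both at 25%, where surge rounds up and unavailable rounds down. The sketch below works through that arithmetic in plain JavaScript; the helper name rollingUpdateBounds is hypothetical.

```javascript
// Resolve percentage-based rolling-update bounds for a replica count.
// Kubernetes rounds maxSurge up and maxUnavailable down.
function rollingUpdateBounds(replicas, maxSurgePct = 25, maxUnavailablePct = 25) {
  const maxSurge = Math.ceil((replicas * maxSurgePct) / 100);
  const maxUnavailable = Math.floor((replicas * maxUnavailablePct) / 100);
  return { maxSurge, maxUnavailable, minAvailable: replicas - maxUnavailable };
}

console.log(rollingUpdateBounds(3)); // { maxSurge: 1, maxUnavailable: 0, minAvailable: 3 }
```

With three replicas, the controller may add one extra Pod but may not take any old Pod away until a new one is ready. Because the bad image never becomes ready, all three old Pods keep serving.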

The last step of a reliability exercise is confirming that the system is healthy again and keeping a concrete record of what happened. The result should show both the user-visible SLI and the cluster state with your ONID in it.

  1. Run one final clean baseline and save the summary to a file:

    Terminal window
    k6 run --quiet -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js | tee final-check.txt

    The --quiet flag suppresses the live progress meter, so final-check.txt gets the final summary instead of a stream of progress updates. tee writes the file into the current directory, so run this step and the next two from ~/cs312-reliability. This final run should pass all thresholds again. If it does not, the system is not back to steady state.

  2. Append one user-visible check and the Kubernetes state:

    Terminal window
    curl -s "$TARGET_URL" | grep h1 | tee -a final-check.txt
    kubectl get deployment,replicaset,pods,service -n reliability-lab | tee -a final-check.txt

    The first command writes the page header containing your ONID. The second command writes the edge and backend Deployments, ReplicaSets, Pods, and Services that now make up the recovered system.

  3. Print the record you just created:

    Terminal window
    cat final-check.txt

    You should see a passing k6 summary, your personalized page header, and healthy Kubernetes objects named with your ONID. This is the concrete end state of the activity.

You ran a baseline load test, a stress ramp, and four controlled experiments. The next steps are to vary the failure shape, the recovery mechanism, and the observability that sits underneath all of it.

A natural hands-on extension of what you just built is a spike test with autoscaling. Attach a HorizontalPodAutoscaler to the backend (kubectl autoscale deployment/${ONID}-web --min=2 --max=8 --cpu-percent=50 -n reliability-lab is the one-line version; the HPA needs CPU metrics, which on minikube come from the metrics-server addon), then write a third k6 scenario that jumps from 10 to 100 requests per second instantly. Watch kubectl get hpa,pods -w -n reliability-lab in another terminal while the run is in flight, and look for the gap between the moment the spike arrives and the moment new Pods are ready. That gap is the budget the autoscaler cannot save you from.
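One way to express that jump is with two back-to-back constant-arrival-rate scenarios, the same staggering technique stress.js uses. The sketch below is a plain JavaScript object; the scenario names, durations, and VU counts are illustrative assumptions, not tuned values.

```javascript
// Sketch of a spike profile: 30 s at the baseline rate, then an immediate jump.
// Durations and VU counts are assumptions chosen for illustration.
const spikeScenarios = {
  baseline: {
    executor: 'constant-arrival-rate',
    rate: 10, timeUnit: '1s', duration: '30s',
    preAllocatedVUs: 10, maxVUs: 20,
    startTime: '0s',
  },
  spike: {
    executor: 'constant-arrival-rate',
    rate: 100, timeUnit: '1s', duration: '90s',
    preAllocatedVUs: 50, maxVUs: 200,
    // Starts the moment the baseline ends, so the rate steps from 10 to 100.
    startTime: '30s',
  },
};
```

To use it, copy probe.js to a new file, define spikeScenarios above the options export, and set scenarios: spikeScenarios there. Per-scenario thresholds such as 'sli_ok{scenario:spike}' then let you judge the spike separately from the baseline.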

For a subtler controller-healing comparison, delete all three Pods at once with kubectl delete pods -l app=${ONID}-web -n reliability-lab without changing the replica count, and compare that run to the scale-to-zero run from Experiment 2. If you have already completed Prometheus and Grafana, rerun the stress ramp while watching the RED panels so you can compare k6’s end-of-run summary to the dashboard’s p95 latency and error-rate curves. And if you want a more realistic failure than Pod deletion, install LitmusChaos or Chaos Mesh in minikube and repeat the same k6 harness during a network-latency experiment.