Measure Error Budget

This activity puts into practice the concepts from the Reliability Engineering lecture. You will define a simple SLO for a personalized web service, probe it with k6, ramp the load past the steady state to find where the service breaks, and then repeat a one-minute test across four controlled experiments including a graceful-degradation fallback at the edge. By the end, you will have measured the saturation knee of your service, seen which failures spent error budget, observed how a fail-open fallback changes the user-visible result, and verified that the system returned to steady state.

What You Will Need

- Your cluster from the Minikube activity
- kubectl already working against that cluster
- k6 installed before class using the official k6 install guide and verified with k6 version
- Four terminal windows or tabs: one for kubectl port-forward, one for k6, one for Kubernetes commands, and one for the short sampling loop in Experiment 3

Deploy the Test Service

Before you inject any failures, give yourself a small backend service and a tiny front-end NGINX layer. You will port-forward into the front layer, so the local tunnel stays up while the backend Pods fail behind it.

In every terminal that will run kubectl or k6, set your ONID, local forward port, and target URL:
```
export ONID=your-onid
export FORWARD_PORT=8080 # or 8081, 8082, ... if 8080 is in use
export TARGET_URL=http://127.0.0.1:${FORWARD_PORT}/
```
Replace your-onid with your actual ONID. Any later command that uses $ONID, $FORWARD_PORT, or $TARGET_URL assumes you ran this first in that terminal.
Create a working directory for this activity and an idempotent namespace:
```
mkdir -p ~/cs312-reliability
cd ~/cs312-reliability
kubectl create namespace reliability-lab --dry-run=client -o yaml | kubectl apply -f -
```
This keeps the files together and makes it safe to rerun the setup commands.

Write the manifest for the backend service, the front-end edge layer, and the two Services:

```
cat <<EOF > web.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${ONID}-site
data:
  index.html: |
    <!doctype html>
    <html>
      <body>
        <h1>${ONID} reliability lab</h1>
        <p>This page is the steady-state target for k6.</p>
      </body>
    </html>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${ONID}-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ${ONID}-web
  template:
    metadata:
      labels:
        app: ${ONID}-web
    spec:
      containers:
        - name: web
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "20m"
              memory: "32Mi"
            limits:
              cpu: "50m"
              memory: "64Mi"
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 2
          volumeMounts:
            - name: site
              mountPath: /usr/share/nginx/html/index.html
              subPath: index.html
      volumes:
        - name: site
          configMap:
            name: ${ONID}-site
---
apiVersion: v1
kind: Service
metadata:
  name: ${ONID}-web
spec:
  selector:
    app: ${ONID}-web
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${ONID}-edge-config
data:
  nginx.conf: |
    events {}
    http {
      server {
        listen 80;
        location / {
          proxy_connect_timeout 1s;
          proxy_read_timeout 1s;
          proxy_send_timeout 1s;
          proxy_next_upstream off;
          proxy_pass http://${ONID}-web;
          proxy_set_header Host \$host;
          proxy_set_header X-Real-IP \$remote_addr;
        }
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${ONID}-edge
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ${ONID}-edge
  template:
    metadata:
      labels:
        app: ${ONID}-edge
    spec:
      containers:
        - name: edge
          image: nginx:alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 2
          volumeMounts:
            - name: edge-config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
      volumes:
        - name: edge-config
          configMap:
            name: ${ONID}-edge-config
---
apiVersion: v1
kind: Service
metadata:
  name: ${ONID}-edge
spec:
  selector:
    app: ${ONID}-edge
  ports:
    - port: 80
      targetPort: 80
EOF
```

The backend Service still load balances across the three web Pods. Each web Pod has a 50 millicore CPU limit so the cluster has a known capacity ceiling instead of “as fast as your laptop can go”; this will matter when you stress-test it shortly. The edge Deployment is the stable front door you will port-forward into. Its one-second upstream timeouts turn a missing backend into quick bad requests instead of one-minute hangs.

The shape of our deployment is:

```
flowchart TB
  Client["client"] --> EdgeService

  subgraph Cluster[" "]
    EdgeService["edge Service"]
    Edge1["edge 1"]
    Edge2["edge 2"]
    WebService["web Service"]
    Web1["web 1"]
    Web2["web 2"]
    Web3["web 3"]

    EdgeService --> Edge1
    EdgeService --> Edge2
    Edge1 --> WebService
    Edge2 --> WebService
    WebService --> Web1
    WebService --> Web2
    WebService --> Web3
  end
```

This setup is slightly artificial because kubectl port-forward gives you a local tunnel into the minikube cluster instead of a real external load balancer. The extra edge layer is here only to keep that tunnel up while the backend Pods fail behind it.

Apply the manifest and wait for both deployments to become ready:

```
kubectl apply -n reliability-lab -f web.yaml
kubectl rollout status deployment/${ONID}-web -n reliability-lab
kubectl rollout status deployment/${ONID}-edge -n reliability-lab
```

You should see the following, with your actual ONID in place of ${ONID}:

```
deployment "${ONID}-web" successfully rolled out
deployment "${ONID}-edge" successfully rolled out
```

In a dedicated terminal, forward the edge Service to your laptop:
```
kubectl port-forward service/${ONID}-edge ${FORWARD_PORT}:80 -n reliability-lab
```
Leave this command running. The activity never changes the edge Deployment, so the local tunnel stays up while the backend deployment changes behind it.
In another terminal, point curl at the local URL and verify the steady state:
```
curl "$TARGET_URL"
```
You should see an HTML page whose <h1> contains your ONID. At this point the service exists, the readiness probe is green, and the request path is working.

Define the SLO and Measure the Baseline

Before you break anything, make the steady state measurable. In this section you will turn that request path into a one-minute k6 test with a concrete pass or fail result.

For this activity, use this SLI: a request is good only if it returns HTTP 200 and finishes in under 300 ms. At 10 requests per second for 60 seconds, one run sends about 600 requests, so an SLO of 99.5% allows about 3 bad requests.
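
The budget arithmetic above can be sketched in a few lines of plain JavaScript. This is only an illustrative calculation, not part of the k6 scripts, and the helper name errorBudget is an assumption made for this sketch.

```javascript
// Hypothetical helper: how many bad requests an SLO allows in one run.
function errorBudget(ratePerSecond, durationSeconds, slo) {
  const total = ratePerSecond * durationSeconds;
  // A small epsilon guards against floating-point error in (1 - slo).
  const allowedBad = Math.floor(total * (1 - slo) + 1e-9);
  return { total, allowedBad };
}

const budget = errorBudget(10, 60, 0.995);
console.log(`${budget.total} requests, ${budget.allowedBad} may be bad`);
```

Because the budget is only a handful of requests, a single slow or failed request already spends about a third of it.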

Write the k6 script:

```
cat <<'EOF' > ~/cs312-reliability/probe.js
import http from 'k6/http';
import { check } from 'k6';
import { Rate } from 'k6/metrics';

// Track the share of requests that satisfy the full SLI.
const sliOk = new Rate('sli_ok');

// Read the port-forward URL from the shell environment.
const targetUrl = __ENV.TARGET_URL;

if (!targetUrl) {
  throw new Error('Set TARGET_URL before running k6.');
}

export const options = {
  scenarios: {
    steady: {
      // Keep the test at 10 requests per second for one minute.
      executor: 'constant-arrival-rate',
      rate: 10,
      timeUnit: '1s',
      duration: '60s',
      // Start with 10 workers ready and grow to 20 if needed.
      preAllocatedVUs: 10,
      maxVUs: 20,
    },
  },
  thresholds: {
    // The run fails if fewer than 99.5% of requests are good.
    sli_ok: ['rate >= 0.995'],
    // These latency thresholds are a second check on user experience.
    http_req_duration: ['p(95)<300', 'p(99)<500'],
  },
};

export default function () {
  // Each VU sends one request and records whether it met the SLI.
  const res = http.get(targetUrl);
  const good = res.status === 200 && res.timings.duration < 300;

  check(res, {
    'status is 200': (r) => r.status === 200,
  });

  sliOk.add(good);
}
EOF
```

This writes probe.js into ~/cs312-reliability even if this terminal is somewhere else. probe.js expects a TARGET_URL environment variable when you run k6. sli_ok only counts a request as good if both the status code and the latency requirement are satisfied.

VUs are k6 virtual users: lightweight workers that repeatedly run the default function in the script. In this test, k6 keeps 10 requests per second flowing by starting with 10 ready VUs and using up to 20 if it needs more workers to hold that rate.
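
A rough way to see why 10 to 20 VUs is enough here is Little's law: the number of VUs busy at once is approximately the arrival rate multiplied by the time each iteration takes. The sketch below is a back-of-the-envelope estimate in plain JavaScript, not something k6 runs; the helper name vusNeeded is hypothetical, and it assumes iteration time is dominated by request latency.

```javascript
// Little's law sketch: VUs busy at once ≈ arrival rate × time per iteration.
function vusNeeded(ratePerSecond, iterationMs) {
  return Math.ceil((ratePerSecond * iterationMs) / 1000);
}

console.log(vusNeeded(10, 300));  // iterations right at the 300 ms SLI limit
console.log(vusNeeded(10, 1000)); // iterations that take a full second
console.log(vusNeeded(10, 2000)); // the slowest case that maxVUs: 20 can sustain
```

Under these assumptions the three cases need 3, 10, and 20 VUs. If iterations were slower than that, k6 would run out of VUs and could no longer hold the target rate.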

Run the baseline measurement:
```
k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js
```
This sends about 600 requests over one minute. The summary at the end is your steady-state measurement.
Check the end-of-run summary. Focus on two lines:
- sli_ok: this should be at or above 99.5%
- http_req_duration: p95 should stay under 300 ms
If the baseline already fails, stop here and fix the service first. A chaos experiment without a healthy baseline is not measuring anything useful.

Find the Saturation Knee

The baseline tells you the service is healthy at 10 requests per second. It does not tell you how much headroom exists before it stops being healthy. A stress test answers that second question by stepping the load past the steady state at five discrete rates and asking, at each rate, whether the SLI still holds.

Write a second k6 script that runs five back-to-back scenarios at increasing rates, with per-stage thresholds:

```
cat <<'EOF' > ~/cs312-reliability/stress.js
import http from 'k6/http';
import { Rate } from 'k6/metrics';

const sliOk = new Rate('sli_ok');
const targetUrl = __ENV.TARGET_URL;

if (!targetUrl) {
  throw new Error('Set TARGET_URL before running k6.');
}

export const options = {
  scenarios: {
    rate_50: {
      executor: 'constant-arrival-rate',
      rate: 50, timeUnit: '1s', duration: '20s',
      preAllocatedVUs: 50, maxVUs: 200,
      startTime: '0s',
    },
    rate_200: {
      executor: 'constant-arrival-rate',
      rate: 200, timeUnit: '1s', duration: '20s',
      preAllocatedVUs: 100, maxVUs: 500,
      startTime: '25s',
    },
    rate_500: {
      executor: 'constant-arrival-rate',
      rate: 500, timeUnit: '1s', duration: '20s',
      preAllocatedVUs: 200, maxVUs: 1000,
      startTime: '50s',
    },
    rate_750: {
      executor: 'constant-arrival-rate',
      rate: 750, timeUnit: '1s', duration: '20s',
      preAllocatedVUs: 300, maxVUs: 1500,
      startTime: '75s',
    },
    rate_1000: {
      executor: 'constant-arrival-rate',
      rate: 1000, timeUnit: '1s', duration: '20s',
      preAllocatedVUs: 500, maxVUs: 2000,
      startTime: '100s',
    },
  },
  thresholds: {
    'sli_ok{scenario:rate_50}':              ['rate >= 0.995'],
    'sli_ok{scenario:rate_200}':             ['rate >= 0.995'],
    'sli_ok{scenario:rate_500}':             ['rate >= 0.995'],
    'sli_ok{scenario:rate_750}':             ['rate >= 0.995'],
    'sli_ok{scenario:rate_1000}':            ['rate >= 0.995'],
    'http_req_duration{scenario:rate_50}':   ['p(95)<300'],
    'http_req_duration{scenario:rate_200}':  ['p(95)<300'],
    'http_req_duration{scenario:rate_500}':  ['p(95)<300'],
    'http_req_duration{scenario:rate_750}':  ['p(95)<300'],
    'http_req_duration{scenario:rate_1000}': ['p(95)<300'],
  },
};

export default function () {
  const res = http.get(targetUrl);
  const good = res.status === 200 && res.timings.duration < 300;
  sliOk.add(good);
}
EOF
```

Each scenario is a constant-arrival-rate test that holds one target rate for 20 seconds. The startTime values stagger them so they run back to back with a 5-second gap between stages. k6 automatically tags every metric sample with the scenario name it came from, which is what the threshold expressions sli_ok{scenario:rate_500} and http_req_duration{scenario:rate_500} filter on. The result is one pass-or-fail line per stage in the end-of-run summary.
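
The staggering pattern can be made explicit with a small generator. The following is a minimal sketch in plain JavaScript, assuming a hypothetical helper named buildStages; it reproduces only the rate, duration, and startTime fields, so preAllocatedVUs and maxVUs would still have to be added before the result could serve as a k6 scenarios object.

```javascript
// Hypothetical generator for the staggered stages used in stress.js.
function buildStages(rates, stageSeconds, gapSeconds) {
  const scenarios = {};
  rates.forEach((rate, i) => {
    scenarios[`rate_${rate}`] = {
      executor: 'constant-arrival-rate',
      rate,
      timeUnit: '1s',
      duration: `${stageSeconds}s`,
      // Each stage starts after all earlier stages and gaps have elapsed.
      startTime: `${i * (stageSeconds + gapSeconds)}s`,
    };
  });
  return scenarios;
}

const stages = buildStages([50, 200, 500, 750, 1000], 20, 5);
for (const [name, stage] of Object.entries(stages)) {
  console.log(name, 'starts at', stage.startTime);
}
```

Writing the stages out by hand, as stress.js does, is easier to read in a lab; a generator like this mostly helps if you want to try a different set of rates or a longer gap.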

Run the stress test:
```
k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/stress.js
```
This takes about 2 minutes: five 20-second stages separated by four 5-second gaps. k6 shows a per-scenario progress bar in the terminal but does not print pass/fail counts during the run; the per-stage verdict lands in the summary at the end.

Read the summary’s threshold block. You will see ten lines, two per stage, each marked with a green check or red cross:

```
✓ sli_ok{scenario:rate_50}
✓ http_req_duration{scenario:rate_50}........: p(95)=...
✗ sli_ok{scenario:rate_200}
✓ http_req_duration{scenario:rate_200}.......: p(95)=...
✗ sli_ok{scenario:rate_500}
✓ http_req_duration{scenario:rate_500}.......: p(95)=...
✗ sli_ok{scenario:rate_750}
✗ http_req_duration{scenario:rate_750}.......: p(95)=...
✗ sli_ok{scenario:rate_1000}
✗ http_req_duration{scenario:rate_1000}......: p(95)=...
```

The exact stage where the first red mark appears depends on your laptop, but with these limits it should fall somewhere between 200 and 1000 requests per second. Your run may not match this example exactly, and the two thresholds do not have to flip at the same stage.

Notice which threshold trips first at the knee, or whether they flip together. The two checks are measuring different failure shapes. sli_ok allows only 0.5 percent bad requests, while the p95 latency check still passes until more than 5 percent of requests cross 300 ms. That means sli_ok can fail first if you have a small pocket of slow or non-200 responses, while the latency threshold can fail first if slowdown spreads across a larger share of requests.
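
To make the difference concrete, here is a small sketch in plain JavaScript with invented data: 98 fast requests and 2 slow ones. The helpers sliRate and percentile are hypothetical, and the percentile uses a simple nearest-rank method that may not match k6's own calculation exactly.

```javascript
// Fraction of samples that are HTTP 200 and faster than 300 ms.
function sliRate(samples) {
  const good = samples.filter((s) => s.status === 200 && s.ms < 300).length;
  return good / samples.length;
}

// Nearest-rank percentile; k6's own calculation may interpolate differently.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Invented data: 98 fast requests and 2 slow ones, all HTTP 200.
const samples = [];
for (let i = 0; i < 98; i++) samples.push({ status: 200, ms: 40 });
for (let i = 0; i < 2; i++) samples.push({ status: 200, ms: 450 });

const rate = sliRate(samples);
const p95 = percentile(samples.map((s) => s.ms), 95);
console.log('sli_ok passes:', rate >= 0.995); // false: 2% of requests are bad
console.log('p95 passes:', p95 < 300);        // true: p95 is still 40 ms
```

With this data the SLI threshold fails while the latency threshold passes, which is the "small pocket" case described above.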

Experiment 1: One Pod Fails Under Load

Start with the smallest blast radius that still touches the request path. Deleting one serving Pod tests whether replica redundancy and the Deployment controller can absorb a routine failure without burning much budget.

Write your hypothesis before running: will one deleted Pod spend any meaningful error budget while three replicas exist? If it does, how much?

In the k6 terminal, rerun the same one-minute test:

```
k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js
```

In the Kubernetes terminal, list the Pods and delete one of the three serving replicas:
```
kubectl get pods -n reliability-lab
kubectl delete pod <one-pod-name> -n reliability-lab
```
The delete request returns immediately. The replacement Pod is created afterward by the Deployment controller.
Wait until the deployment reports three ready replicas again:
```
kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s
```
When this finishes, the service has returned to its original replica count.
When k6 completes, compare the summary to the baseline:
- Did sli_ok stay above 99.5%?
- Did p95 latency stay comfortably below 300 ms?
- If there was budget spend, was it tiny or obvious?

Experiment 2: Zero Healthy Replicas

Now make the failure large enough that the service disappears from the request path. Because your port-forward terminates at the edge layer instead of the backend itself, the local tunnel stays up while the backend disappears. Scaling the deployment to zero gives you a clean no-backend window; deleting all three Pods can also cause failures, but the Deployment starts replacements immediately, so the outage duration depends more on timing.

Write your hypothesis before running: once the deployment reaches zero ready replicas, how badly will the one-minute run miss the 99.5% SLO?

Start the same k6 run again:

```
k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js
```

In the Kubernetes terminal, run the same four-command outage and recovery sequence you will reuse in the next experiment:

```
kubectl scale deployment/${ONID}-web --replicas=0 -n reliability-lab
kubectl wait --for=delete pod -l app=${ONID}-web -n reliability-lab --timeout=60s
kubectl scale deployment/${ONID}-web --replicas=3 -n reliability-lab
kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s
```

The second line returns only after the old serving Pods are gone. The third line starts recovery immediately, so this experiment and the next one use the same no-backend window.

When k6 finishes, inspect the summary carefully:
- sli_ok should drop well below the baseline
- the threshold for sli_ok rate >= 0.995 should fail
- the bad fraction is the part of the budget you spent during this one-minute window
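
One way to turn the summary into a budget figure is to compare the number of bad requests to the number the SLO allows. The sketch below is plain JavaScript with an invented sli_ok value of 90%; the helper name budgetSpent is hypothetical, and your own run will produce a different rate.

```javascript
// Hypothetical helper: compare bad requests in one run to the SLO's allowance.
function budgetSpent(totalRequests, sliOkRate, slo) {
  const bad = Math.round(totalRequests * (1 - sliOkRate));
  const allowed = totalRequests * (1 - slo);
  return { bad, timesOverBudget: bad / allowed };
}

// Invented example: 600 requests with sli_ok at 90%.
const spent = budgetSpent(600, 0.9, 0.995);
console.log(`${spent.bad} bad requests, ${spent.timesOverBudget.toFixed(1)}x the budget`);
```

In this invented case the run spends roughly twenty times the one-minute budget, which is why a short total outage dominates the error budget far more than a single slow request does.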

Experiment 3: Graceful Degradation at the Edge

The zero-replicas experiment showed the worst case: when the backend disappears, the user gets nothing but an error. That is a binary failure mode. A reduced response is the alternative: when the backend is gone, serve something limited but still useful. In this experiment you reconfigure the edge to fail open with a static fallback, then rerun the same scale-to-zero failure to compare.

Write your hypothesis before running: if the edge serves a fallback page when the backend is unreachable, how much of the user-visible SLO survives a zero-backend window?

Write a new edge ConfigMap that intercepts upstream errors and serves a personalized fallback page:

```
cat <<EOF > ~/cs312-reliability/edge-degraded.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${ONID}-edge-config
data:
  nginx.conf: |
    events {}
    http {
      server {
        listen 80;
        location / {
          proxy_connect_timeout 1s;
          proxy_read_timeout 1s;
          proxy_send_timeout 1s;
          proxy_next_upstream off;
          proxy_intercept_errors on;
          proxy_pass http://${ONID}-web;
          proxy_set_header Host \$host;
          proxy_set_header X-Real-IP \$remote_addr;
          error_page 502 503 504 = @fallback;
        }
        location @fallback {
          default_type text/html;
          return 200 '<!doctype html><html><body><h1>${ONID} reliability lab</h1><p>Reduced mode: serving cached content while the backend recovers.</p></body></html>';
        }
      }
    }
EOF
kubectl apply -n reliability-lab -f ~/cs312-reliability/edge-degraded.yaml
```

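proxy_intercept_errors on tells NGINX to hand upstream responses with a status code of 300 or higher to error_page instead of passing them straight to the client. error_page 502 503 504 = @fallback sends those three failures, including the 502 and 504 that NGINX generates itself when it cannot reach the backend, to a named location. location @fallback returns a hand-written 200 with the personalized fallback page. The @ prefix marks a named location, which NGINX uses only for internal redirects: clients cannot request it directly.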

Restart the edge Deployment so the new ConfigMap is picked up:
```
kubectl rollout restart deployment/${ONID}-edge -n reliability-lab
kubectl rollout status deployment/${ONID}-edge -n reliability-lab
```
kubectl rollout restart recreates the edge Pods. This is necessary because ConfigMaps mounted with subPath are not refreshed automatically in running Pods.
With the backend still healthy, verify nothing visible changed in steady state:
```
curl "$TARGET_URL"
```
You should still see the original page with your ONID. The fallback only fires when the backend cannot answer, so a healthy request still goes through the proxy_pass path.

Start the same one-minute k6 run as before:

```
k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js
```

In the fourth terminal, start a short sampling loop before you cut the backend:
```
for i in $(seq 1 20); do curl -s "$TARGET_URL" | grep -E 'h1|Reduced'; sleep 1; done
```
This loop runs long enough to overlap the outage window and show the page change from steady state to fallback and back.

In the Kubernetes terminal, repeat the exact same outage and recovery sequence from Experiment 2:

```
kubectl scale deployment/${ONID}-web --replicas=0 -n reliability-lab
kubectl wait --for=delete pod -l app=${ONID}-web -n reliability-lab --timeout=60s
kubectl scale deployment/${ONID}-web --replicas=3 -n reliability-lab
kubectl wait --for=jsonpath='{.status.readyReplicas}'=3 deployment/${ONID}-web -n reliability-lab --timeout=60s
```

While the backend has zero ready replicas, the sampling loop should show the Reduced mode line instead of the steady-state page. When the backend recovers, the original page returns.

When k6 finishes, compare the summary to Experiment 2:
- Did sli_ok stay higher than the zero-replicas run from Experiment 2?
- Did the share of HTTP 200 responses stay close to 100 percent, even though no backend Pods existed for part of the window?
- Did latency still suffer briefly while the proxy decided that the upstream was unreachable?
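
Because the fallback returns a fast HTTP 200, sli_ok counts it as a good request and cannot tell a full page from a reduced one. If you wanted to track the degraded share separately, one option is to classify each response by its body. The sketch below is plain JavaScript with a hypothetical helper named classifyResponse; wiring it into probe.js (for example with a second Rate metric fed from the response body) is left as an assumption, not something the activity requires.

```javascript
// Hypothetical classifier: separate full, degraded, and failed responses.
// The marker text matches the fallback page defined in edge-degraded.yaml.
function classifyResponse(status, body) {
  if (status !== 200) return 'error';
  return body.includes('Reduced mode') ? 'degraded' : 'full';
}

console.log(classifyResponse(200, '<h1>your-onid reliability lab</h1>'));
console.log(classifyResponse(200, '<p>Reduced mode: serving cached content</p>'));
console.log(classifyResponse(504, ''));
```

Whether a degraded response should count as good is a product decision: this activity's SLI treats it as good, but a stricter SLI could count only full responses.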

Experiment 4: Failed Rollout, Safe Rollback

A failed change is less dangerous than a failed service when the rollout mechanics keep old replicas serving traffic while you undo the mistake. This section tests whether a broken rollout can stay operationally unhealthy without becoming a user-visible outage.

Write your hypothesis before running: can a rollout be broken and still stay within the SLO because the old Pods continue serving?

Start the same k6 run again:

```
k6 run -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js
```

In the Kubernetes terminal, intentionally set a bad image tag on the deployment:
```
kubectl set image deployment/${ONID}-web web=nginx:this-tag-does-not-exist -n reliability-lab
```
Kubernetes now tries to create new Pods from an image that cannot be pulled.
Inspect the Pods while the bad rollout is in progress:
```
kubectl get pods -n reliability-lab
```
You should see at least one new Pod enter ErrImagePull or ImagePullBackOff while older Pods remain Running. That is the safety property you are testing.
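
The reason the old Pods keep running comes from the rolling-update bounds. web.yaml does not set a strategy, so the sketch below assumes the Kubernetes defaults of 25% maxSurge (rounded up) and 25% maxUnavailable (rounded down). It is plain JavaScript for illustration only, and the helper name rollingBounds is hypothetical.

```javascript
// Rolling-update bounds, assuming the Kubernetes defaults of 25% / 25%.
function rollingBounds(replicas, maxSurgePct = 25, maxUnavailablePct = 25) {
  const maxSurge = Math.ceil((replicas * maxSurgePct) / 100);              // rounds up
  const maxUnavailable = Math.floor((replicas * maxUnavailablePct) / 100); // rounds down
  return {
    maxPods: replicas + maxSurge,
    minAvailable: replicas - maxUnavailable,
  };
}

console.log(rollingBounds(3));
```

With three replicas this gives at most four Pods and at least three available, so under these assumptions the controller may add one new Pod but may not remove an old one until the new Pod becomes ready, which never happens with an image that cannot be pulled.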

Undo the rollout and wait for the deployment to settle:

```
kubectl rollout undo deployment/${ONID}-web -n reliability-lab
kubectl rollout status deployment/${ONID}-web -n reliability-lab
```

When k6 finishes, compare the summary to the earlier runs:
- Did sli_ok stay above 99.5% despite the failed rollout?
- Did the old Pods preserve availability while the new image failed?
- Was the bad change reversible before it became a user-visible outage?

Record the Final Steady State

The last step of a reliability exercise is confirming that the system is healthy again and keeping a concrete record of what happened. The result should show both the user-visible SLI and the cluster state with your ONID in it.

Run one final clean baseline and save the summary to a file:
```
k6 run --quiet -e TARGET_URL="$TARGET_URL" ~/cs312-reliability/probe.js | tee final-check.txt
```
The --quiet flag suppresses the live progress meter, so final-check.txt gets the final summary instead of a stream of progress updates. This final run should pass all thresholds again. If it does not, the system is not back to steady state.
Append one user-visible check and the Kubernetes state:
```
curl -s "$TARGET_URL" | grep h1 | tee -a final-check.txt
kubectl get deployment,replicaset,pods,service -n reliability-lab | tee -a final-check.txt
```
The first command writes the page header containing your ONID. The second command writes the edge and backend Deployments, ReplicaSets, Pods, and Services that now make up the recovered system.
Print the record you just created:
```
cat final-check.txt
```
You should see a passing k6 summary, your personalized page header, and healthy Kubernetes objects named with your ONID. This is the concrete end state of the activity.

Going Further

You ran a baseline load test, a stress ramp, and four controlled experiments. The next steps are to vary the failure shape, the recovery mechanism, and the observability that sits underneath all of it.

A natural hands-on extension of what you just built is a spike test against an autoscaled backend. Attach a HorizontalPodAutoscaler to the backend (kubectl autoscale deployment/${ONID}-web --min=2 --max=8 --cpu-percent=50 -n reliability-lab is the one-line version; the autoscaler needs CPU metrics, which on minikube usually means running minikube addons enable metrics-server first), then write a third k6 scenario that jumps from 10 to 100 requests per second instantly. Watch kubectl get hpa,pods -n reliability-lab -w in another terminal while the run is in flight, and look for the gap between the moment the spike arrives and the moment new Pods are ready. That gap is the budget the autoscaler cannot save you from.
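
To build intuition for what the autoscaler will ask for, the core scaling rule from the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured minimum and maximum. The sketch below is plain JavaScript with a hypothetical helper named desiredReplicas; it deliberately ignores the tolerance band and stabilization windows that the real controller applies. CPU utilization here is measured relative to each Pod's CPU request, which is 20m in web.yaml.

```javascript
// Core HPA formula from the Kubernetes docs, without tolerance or stabilization.
function desiredReplicas(current, currentUtilPct, targetUtilPct, min, max) {
  const raw = Math.ceil((current * currentUtilPct) / targetUtilPct);
  return Math.min(max, Math.max(min, raw));
}

console.log(desiredReplicas(3, 100, 50, 2, 8)); // 6
console.log(desiredReplicas(3, 400, 50, 2, 8)); // capped at 8
console.log(desiredReplicas(3, 10, 50, 2, 8));  // held at the minimum of 2
```

The formula only says how many Pods the autoscaler wants; it says nothing about how long metrics collection and Pod startup take, which is the gap the spike test is meant to expose.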

For a subtler controller-healing comparison, delete all three Pods at once with kubectl delete pods -l app=${ONID}-web -n reliability-lab without changing the replica count, and compare that run to the scale-to-zero run from Experiment 2. If you have already completed Prometheus and Grafana, rerun the stress ramp while watching the RED panels so you can compare k6’s end-of-run summary to the dashboard’s p95 latency and error-rate curves. And if you want a more realistic failure than Pod deletion, install LitmusChaos or Chaos Mesh in minikube and repeat the same k6 harness during a network-latency experiment.