Observability Workshop: Metrics and Dashboards (Prometheus + Grafana)

Gerald’s investor visited last week and asked, “How do you know your systems are healthy?” Gerald said, “We check.” The investor asked, “How?” There was a long silence. Gerald turned to you. You also said nothing. The investor left. Gerald now needs dashboards “for the next meeting.”

Observability is the practice of instrumenting your systems so you can answer questions like these from the outside, using metrics, logs, and traces. Your Kubernetes cluster runs containers, restarts them when they crash, and rolls out updates, but you have no visibility into what is happening inside. In this lab, you will deploy Prometheus (a metrics collection engine) and Grafana (a visualization platform) on your k3s cluster, build a custom dashboard, define alerts with runbooks, and trigger a real incident to see the entire detection-and-response loop in action.

Before You Start

You need:

The k3s cluster from Labs 7-8 with all workloads still running: nginx (reverse proxy), WordPress, and MariaDB
SSH access to your EC2 instance

If you ended your AWS Academy session, restart it. The EC2 instance and cluster will still be there. If you deleted the resources from Labs 7-8, re-apply all manifests from those labs before starting.

Key Concepts

Monitoring vs. Observability: Monitoring is checking predefined metrics against thresholds (“is CPU above 90%?”). Observability goes further; it means your system emits enough data that you can diagnose unexpected problems without deploying new instrumentation.

Prometheus: An open-source metrics collection system. It uses a pull model: Prometheus periodically scrapes HTTP endpoints, which are exposed either by the application itself or by helper processes called exporters, to collect metrics. These metrics are stored as time-series data (values indexed by timestamp).
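To give a feel for what a scrape returns, here is a minimal sketch with a made-up response in the Prometheus text exposition format. The metric values and the helper names `sample_scrape` and `samples_for` are hypothetical and exist only for illustration; a real exporter serves similar text over HTTP.

```shell
#!/usr/bin/env bash
# Made-up sample of what a scrape target might return on its metrics endpoint:
# "# HELP" and "# TYPE" comment lines, then "name{labels} value" samples.
sample_scrape() {
  cat <<'EOF'
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
EOF
}

# Hypothetical helper: print every sample line for one metric name,
# skipping the comment lines. Reads the scrape text from stdin.
samples_for() {
  grep -E "^$1(\{| )"
}

sample_scrape | samples_for http_requests_total
```

Each label combination is its own time series, so this one metric name yields two series here.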

Grafana: An open-source platform for building dashboards from data sources like Prometheus. It turns raw metrics into visual panels: graphs, gauges, and tables.

PromQL (Prometheus Query Language): The query language for selecting and aggregating Prometheus metrics. For example, rate(http_requests_total[5m]) computes the per-second rate of HTTP requests over the last 5 minutes.
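The arithmetic behind rate() can be sketched in a few lines of shell. This is only the basic idea (counter growth divided by elapsed seconds) on made-up samples; the real PromQL function also handles counter resets and extrapolates to the edges of the window, so `simple_rate` is a hypothetical illustration, not Prometheus's algorithm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average per-second increase between two counter samples.
# Arguments: first sample value, last sample value, seconds between them.
simple_rate() {
  awk -v first="$1" -v last="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (last - first) / secs }'
}

# Made-up samples: the counter read 1200 at the start of a 5-minute (300 s)
# window and 1500 at the end, so it grew by 300 requests.
simple_rate 1200 1500 300
```

With these numbers the result is 1.00, meaning one request per second on average across the window.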

Helm: A package manager for Kubernetes. Helm charts bundle Kubernetes manifests, default configuration, and dependencies into installable packages, similar to how zypper packages software for SUSE Linux or apt packages software for Debian-based systems.

Questions

Watch for the answers to these questions as you follow the tutorial.

What is the current memory usage percentage on the Node Exporter dashboard in Grafana? Write down the value and the panel name. (4 points)
Write down the PromQL (Prometheus Query Language) query you used for the pod restart count panel. What is the current restart count? (5 points)
In your PrometheusRule for the crash loop alert, what PromQL expression triggers the alert, and what is the for duration? (5 points)
After killing nginx repeatedly to trigger the crash loop alert, how many minutes did it take for the alert to fire? What status did the alert show (Inactive, Pending, or Firing)? (5 points)
What does kubectl logs <pod> --previous show, and why is the --previous flag necessary for debugging container crashes? (3 points)
Get your TA’s initials showing your Grafana dashboard with live metrics and at least one alert in Firing state. (3 points)

Tutorial

Resize the EC2 Instance to t3.large

Before installing the monitoring stack, resize your EC2 instance. kube-prometheus-stack is heavy enough that smaller instance types often time out during CRD installation or make the API server unresponsive.

Open the EC2 instance in the AWS Console

In AWS Console, go to EC2 > Instances and select your lab instance.
Stop the instance

Choose Instance state > Stop instance and wait until the instance state is stopped.
Change the instance type

With the instance selected, choose Actions > Instance settings > Change instance type. Select t3.large and save.
Start the instance again

Choose Instance state > Start instance and wait for both status checks to pass.
Reconnect over SSH and verify the cluster

If your instance uses an auto-assigned public IP, it may have changed after stop/start. Reconnect using the current public IP, then verify Kubernetes is responding:
```
kubectl get nodes -o wide
kubectl get pods -A
```
Continue only after the node is Ready and system pods are back up.

Installing Helm

Install Helm on the EC2 instance

Helm runs as a client on your machine and communicates with the Kubernetes API (Application Programming Interface) server:
```
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
Verify the installation
```
helm version
```

Deploying the Monitoring Stack

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Node Exporter, and a set of pre-configured dashboards and alerting rules. It is the standard starting point for Kubernetes monitoring.

Add the Helm repository

```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

Install the monitoring stack
```
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=31000 \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
  --set grafana.sidecar.image.tag=1.28.0
```
This command:
- Installs the chart with the release name monitoring in a new monitoring namespace
- Exposes Grafana on NodePort 31000 so you can access it from your browser
- Configures Prometheus to pick up ServiceMonitors and PrometheusRules from all namespaces (not just the monitoring namespace)
- Pins the Grafana sidecar image to 1.28.0 to avoid a known broken image tag in the latest chart release
The installation takes 2-3 minutes. Helm downloads container images for Prometheus, Grafana, kube-state-metrics, and Node Exporter.
Wait for all Pods to be ready
```
kubectl get pods -n monitoring -w
```
Wait until every Pod shows Running with all of its containers ready in the READY column (for example 1/1, 2/2, or 3/3). Press Ctrl+C when done.
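If you would rather not read the READY column by eye, the check can be scripted. The sketch below is one possible approach: the helper name `pods_not_ready` is hypothetical, and it assumes the default column layout of `kubectl get pods --no-headers` (name first, READY second).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every Pod in a namespace whose
# READY column (ready/total) shows fewer ready containers than total.
# Assumes the default column order of `kubectl get pods --no-headers`.
pods_not_ready() {
  kubectl get pods -n "$1" --no-headers | awk '
    { split($2, ready, "/") }
    ready[1] != ready[2] { print $1 }'
}
```

Running `pods_not_ready monitoring` prints nothing once every Pod is fully ready. Pods that ran to completion (shown as 0/1 Completed) would also be listed, so check the STATUS column before treating a name as a problem.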
Access Grafana

On the EC2 instance, run:
```
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80 --address 127.0.0.1
```
Keep that terminal open. On your laptop, create an SSH tunnel to that forwarded port:
Connect with SSH
```
ssh -i ~/Downloads/cs312-key.pem -N -L 3000:127.0.0.1:3000 ec2-user@<your-public-ip>
```
Then open http://localhost:3000 in your browser.

Optional direct access (requires Security Group inbound rule for TCP 31000):
```
http://<your-ec2-public-ip>:31000
```
If access still fails, verify in order:
- kubectl -n monitoring get pods shows Grafana as Running/Ready
- kubectl -n monitoring get svc monitoring-grafana shows the expected port
- the kubectl port-forward command and the SSH tunnel above are both still running with no errors
The username is admin. Retrieve the generated password with:
Retrieve Grafana Admin Password
```
kubectl get secret -n monitoring monitoring-grafana -o jsonpath="{.data.admin-password}" | base64 -d
```
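The pipe through `base64 -d` is needed because Kubernetes stores Secret values base64-encoded under `.data`. The round trip below illustrates this with a made-up value; `decode_secret_value` and the example password are hypothetical and are not the real Grafana credential.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one base64-encoded Secret value.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Made-up value standing in for .data.admin-password (not a real credential).
encoded=$(printf '%s' 'example-password' | base64)
echo "stored in the Secret: $encoded"
echo "after decoding:       $(decode_secret_value "$encoded")"
```

Base64 is an encoding, not encryption: anyone who can read the Secret can recover the password the same way.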

Exploring Default Dashboards

Browse the pre-built dashboards

In Grafana, click the hamburger menu (three lines, top left), then Dashboards. You will see several pre-installed dashboards. Open Node Exporter / Nodes (the name may vary slightly).
Identify the key panels

This dashboard shows hardware-level metrics from the Node Exporter, which runs on each Kubernetes node and exposes operating system metrics:
- CPU Usage: How much processing capacity is being used
- Memory Usage: Current RAM consumption as a percentage
- Disk I/O: Read and write activity on the node’s storage
- Network Traffic: Bytes sent and received
Find the memory usage panel and record the current percentage value and the panel’s exact name for your lab questions.
Explore other dashboards

Check out the Kubernetes / Compute Resources / Pod dashboard. This shows per-Pod CPU and memory consumption, useful for spotting a container that is using more resources than expected.

Building a Custom Dashboard

Create a new dashboard

In Grafana, click Dashboards > New > New Dashboard. Click Add visualization. Select the Prometheus data source.
Panel 1: Pod restart count

In the query editor, enter this PromQL query:
Pod Restart Count Query
```
kube_pod_container_status_restarts_total{namespace="default"}
```
This metric tracks the cumulative number of restarts for every container in the default namespace (nginx, wordpress, and mariadb). Set the panel title to “Pod Restarts (default namespace).” Choose a visualization type of Time series so you can see restarts over time per container.

Click Apply to save the panel.
Panel 2: Node memory usage percentage

Add another panel with this query:
Node Memory Usage Query
```
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
```
This calculates the percentage of memory in use: it divides available memory by total memory to get the percentage available, then subtracts that from 100. Title it “Node Memory Usage %.” Use a Gauge visualization with a max value of 100.
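To sanity-check the gauge, the same arithmetic can be reproduced on the EC2 instance. This is a minimal sketch: `mem_used_pct` is a hypothetical helper and the first call uses made-up numbers. On Linux, Node Exporter derives these two metrics from /proc/meminfo, so the second call should land close to the panel's value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: same arithmetic as the PromQL query.
# Arguments: MemTotal and MemAvailable (any unit, as long as both match).
mem_used_pct() {
  awk -v total="$1" -v avail="$2" \
    'BEGIN { printf "%.1f\n", 100 - (avail / total * 100) }'
}

# Made-up numbers: 8 GB total with 6 GB available means 25% in use.
mem_used_pct 8000000000 6000000000

# On a Linux host, /proc/meminfo offers the same two fields (in kB).
if [ -r /proc/meminfo ]; then
  mem_used_pct "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" \
               "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
fi
```

Small differences from Grafana are expected, since the panel shows the value from the most recent scrape rather than the current instant.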
Panel 3: Pod ready status

Add a third panel:
Pod Ready Status Query
```
kube_pod_status_ready{namespace="default", condition="true"}
```
This returns 1 for Pods that are ready and 0 for those that are not, across all workloads in the default namespace (nginx, wordpress, mariadb). Title it “Pods Ready (default namespace).” Use a Stat visualization. When all pods are healthy you will see one series per Pod, each showing 1.
Save the dashboard

Click the save icon (floppy disk) in the top-right corner. Name it “CS 312 Lab Dashboard.”

Defining Alerts

Alerts notify you when something goes wrong, but poorly designed alerts create alert fatigue, where operators start ignoring notifications because too many are false positives. Every alert should be actionable: when it fires, the operator should know exactly what to check and what to do first. This is where runbooks come in: short documents linked to each alert that describe the first-response steps.

Create a PrometheusRule manifest

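PrometheusRule is a Kubernetes custom resource, defined by a Custom Resource Definition (CRD) that the monitoring stack installs. The Prometheus Operator watches for PrometheusRule objects and loads the alert definitions they contain into Prometheus.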

```
vim alerting-rules.yaml
```

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cs312-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: cs312.rules
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[5m]) > 3
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            runbook: |
              1. Run: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
              2. Check Events section for error messages
              3. Check logs: kubectl logs {{ $labels.pod }} --previous
              4. If bad image, rollback: kubectl rollout undo deployment/<name>

        - alert: HighNodeCPU
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Node CPU usage above 80% for 2 minutes"
            runbook: |
              1. Run: kubectl top pods --all-namespaces --sort-by=cpu
              2. Identify the Pod consuming the most CPU
              3. Check if it is expected (build, migration) or unexpected (infinite loop)
              4. If unexpected, check resource limits and consider scaling

        - alert: PodNotReady
          expr: kube_pod_status_ready{namespace="default", condition="true"} == 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is not ready"
            runbook: |
              1. Run: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
              2. Check readiness probe configuration and recent events
              3. Check logs: kubectl logs {{ $labels.pod }}
              4. If the Pod is failing to start, check image and config
```

Let’s break down the structure:

- expr: The PromQL expression that triggers the alert. increase(...[5m]) > 3 means “more than 3 restarts in the last 5 minutes.”
- for: How long the condition must be true before the alert fires. This prevents one-time blips from triggering alerts.
- labels.severity: Used for routing (e.g., critical alerts page on-call, warnings go to a Slack channel).
- annotations.runbook: The first-response steps an operator should take.
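The interaction between expr and for can be sketched as a small simulation. This is a simplified model, not Prometheus's implementation: `simulate_alert` is a hypothetical helper, the samples are made up, and the 30-second evaluation interval is an assumption (check your own Prometheus configuration). It prints the alert state after each rule evaluation.

```shell
#!/usr/bin/env bash
# Hypothetical simulation of alert states across successive rule evaluations.
# Arguments: threshold, "for" duration (s), evaluation interval (s, assumed),
# then one integer sample per evaluation.
simulate_alert() {
  local threshold=$1 for_s=$2 interval_s=$3
  shift 3
  local t=0 active_since="" out="" state value
  for value in "$@"; do
    if [ "$value" -gt "$threshold" ]; then
      # Condition is true: remember when it first became true.
      if [ -z "$active_since" ]; then active_since=$t; fi
      if [ $((t - active_since)) -ge "$for_s" ]; then
        state=firing
      else
        state=pending
      fi
    else
      # Condition is false: the alert resets and the timer starts over.
      active_since=""
      state=inactive
    fi
    out="$out${out:+ }$state"
    t=$((t + interval_s))
  done
  printf '%s\n' "$out"
}

# Threshold 3, for 60 s, evaluated every 30 s (assumed), made-up samples.
simulate_alert 3 60 30 0 4 4 4 1
```

With these inputs the states are inactive, pending, pending, firing, inactive: the condition has to hold for the full 60 seconds before the alert fires, and a single sample at or below the threshold resets it.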

Apply the alert rules
```
kubectl apply -f alerting-rules.yaml
```
Verify in Prometheus

You can check that Prometheus loaded the rules by port-forwarding to the Prometheus UI. Open a second SSH session to the EC2 instance and run:
```
kubectl port-forward svc/monitoring-kube-prometheus-prometheus -n monitoring 9091:9090 --address 127.0.0.1
```
Keep that session open (Ctrl+C stops the port-forward). On your laptop, create an SSH tunnel:
Connect with SSH
```
ssh -i ~/Downloads/cs312-key.pem -N -L 9091:127.0.0.1:9091 ec2-user@<your-public-ip>
```
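Then open http://localhost:9091/alerts in your browser. You should see your three alerts (PodCrashLooping, HighNodeCPU, and PodNotReady) listed in “Inactive” state, alongside the default rules that ship with the chart.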
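The same check can be made from a terminal through the Prometheus HTTP API, which lists loaded rules at /api/v1/rules. The sketch below assumes the tunnel on port 9091 is still open; `rule_loaded` and the `PROM_URL` variable are hypothetical names, and the grep match relies on Prometheus returning compact JSON (a JSON-aware tool such as jq would be more robust).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if an alert with the given name appears in the
# JSON returned by Prometheus's /api/v1/rules endpoint.
# PROM_URL is an assumed variable; it defaults to the tunnel used in this lab.
rule_loaded() {
  curl -s "${PROM_URL:-http://localhost:9091}/api/v1/rules" |
    grep -q "\"name\":\"$1\""
}

# Hypothetical helper: report on the three alerts defined in this lab.
check_lab_rules() {
  local alert
  for alert in PodCrashLooping HighNodeCPU PodNotReady; do
    if rule_loaded "$alert"; then
      echo "$alert: loaded"
    else
      echo "$alert: missing"
    fi
  done
}
```

Call `check_lab_rules` after applying the manifest. A rule can take a short while to appear, so a "missing" result right after `kubectl apply` is worth retrying before you start debugging.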

Triggering an Alert

Trigger the crash loop alert

You will force the nginx container to restart repeatedly by stopping its main process, which runs as PID 1 inside the container:
```
for i in $(seq 1 5); do
  kubectl exec $(kubectl get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}') -- nginx -s stop
  sleep 10
done
```
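This loop asks nginx to stop 5 times. Each stop causes the container to exit, and Kubernetes restarts it within the same Pod. Kubernetes adds a growing back-off delay between restarts, so an iteration can fail if the container is not running yet; if the restart count has not grown by more than 3, run the loop again. Once the restart count has increased by more than 3 within 5 minutes and that condition has held for the 1-minute for duration, the PodCrashLooping alert should fire.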
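To confirm the loop had the intended effect, you can read the restart count directly from the Pod status before and after running it. This is a sketch with hypothetical helper names; it assumes the nginx Pod carries the label app=nginx (as in the loop) and that nginx is the first container in the Pod.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the restart count of the first container in the
# first Pod matching a label selector (current namespace).
restart_count() {
  kubectl get pods -l "$1" \
    -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
}

# Hypothetical helper: did the count grow by more than 3, the threshold used
# in the PodCrashLooping expression? Arguments: count before, count after.
exceeds_threshold() {
  [ $(( $2 - $1 )) -gt 3 ]
}
```

For example, save `before=$(restart_count app=nginx)`, run the loop, then test `exceeds_threshold "$before" "$(restart_count app=nginx)"`. The PromQL increase() function extrapolates over its window, so its value may not match this simple difference exactly.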
Watch the alert status

Check the alert in Prometheus (refresh http://localhost:9091/alerts) or in Grafana (Alerting > Alert rules). The alert will transition from Inactive to Pending (condition is true but for duration has not elapsed) to Firing.

Record how many minutes it took for the alert to fire.
Check the dashboard

Go to your “CS 312 Lab Dashboard” in Grafana. The pod restart count panel should show the increased restart count.
Investigate with logs

Find the Pod that was restarted:
```
kubectl get pods
```
View the logs from the previous (terminated) container instance:
View Pod Logs
```
kubectl logs <restarted-pod-name> --previous
```
The --previous flag is essential for debugging crashes; without it, you see logs from the current (running) instance, which does not contain the error that caused the restart. Record the last 5 lines.
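Alongside the previous logs, the Pod status records why the last container instance ended. The sketch below reads that from the lastState field; `last_termination` is a hypothetical helper, and it assumes the container of interest is the first one in the Pod.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<reason> <exitCode>" for the most recently
# terminated instance of the first container in a Pod.
last_termination() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}
```

Run `last_termination <restarted-pod-name>` with the name from kubectl get pods. A clean stop like the one triggered in this lab will likely report exit code 0, whereas a real crash usually shows a non-zero code or a reason such as OOMKilled.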

Clean Up

When you are done with all three labs (Labs 7-9), uninstall the monitoring stack and remove all workloads:

```
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
kubectl delete \
  -f wordpress-ingress.yaml \
  -f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml \
  -f wordpress-deployment.yaml -f wordpress-service.yaml \
  -f mariadb-deployment.yaml -f mariadb-service.yaml -f db-secret.yaml \
  -f mariadb-pvc.yaml
```

You can also terminate the EC2 instance if you no longer need it.

If you just need to pause, end your AWS Academy Learner Lab session instead. Everything persists on the EC2 instance.

You have now deployed a complete monitoring stack, built dashboards that answer “is my WordPress stack healthy?”, defined alerts with runbooks, and experienced the full incident loop: a failure occurs, metrics detect it, an alert fires, and you investigate with logs. These are the same observability practices used in production systems at every scale.