Observability Workshop: Metrics and Dashboards (Prometheus + Grafana)

Gerald’s investor visited last week and asked, “How do you know your systems are healthy?” Gerald said, “We check.” The investor asked, “How?” There was a long silence. Gerald turned to you. You also said nothing. The investor left. Gerald now needs dashboards “for the next meeting.”

Observability is the practice of instrumenting your systems so you can answer questions like these from the outside, using metrics, logs, and traces. Your Kubernetes cluster runs containers, restarts them when they crash, and rolls out updates, but you have no visibility into what is happening inside. In this lab, you will deploy Prometheus (a metrics collection engine) and Grafana (a visualization platform) on your k3s cluster, build a custom dashboard, define alerts with runbooks, and trigger a real incident to see the entire detection-and-response loop in action.

You need:

  • The k3s cluster from Labs 7-8 with all workloads still running: nginx (reverse proxy), WordPress, and MariaDB
  • SSH access to your EC2 instance

If you ended your AWS Academy session, restart it. The EC2 instance and cluster will still be there. If you deleted the resources from Labs 7-8, re-apply all manifests from those labs before starting.

Monitoring vs. Observability: Monitoring is checking predefined metrics against thresholds (“is CPU above 90%?”). Observability goes further; it means your system emits enough data that you can diagnose unexpected problems without deploying new instrumentation.

Prometheus: An open-source metrics collection system. It uses a pull model: Prometheus periodically scrapes HTTP endpoints exposed by its targets (often small helper processes called exporters) to collect metrics. These metrics are stored as time-series data (values indexed by timestamp).

Grafana: An open-source platform for building dashboards from data sources like Prometheus. It turns raw metrics into visual panels: graphs, gauges, and tables.

PromQL (Prometheus Query Language): The query language for selecting and aggregating Prometheus metrics. For example, rate(http_requests_total[5m]) computes the per-second rate of HTTP requests over the last 5 minutes.
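The arithmetic behind rate() can be sketched in a few lines of shell. This is a simplified model, not how Prometheus implements it: it assumes the counter never resets and ignores the extrapolation Prometheus applies at the window edges. The function name rate_per_second and the sample values are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical counter samples on stdin: "<seconds> <counter_value>", oldest first.
# Approximates rate() as (last - first) / elapsed seconds. Real Prometheus also
# handles counter resets and extrapolates to the window edges.
# Needs at least two samples with different timestamps.
rate_per_second() {
  awk 'NR == 1 { t0 = $1; v0 = $2 }
       { t1 = $1; v1 = $2 }
       END { printf "%.2f\n", (v1 - v0) / (t1 - t0) }'
}

# Three made-up samples spanning a 5-minute (300 s) window:
# (1600 - 1000) / 300 = 2.00 requests per second.
printf '%s\n' "0 1000" "150 1300" "300 1600" | rate_per_second
```

The counter itself only ever goes up; the rate is what tells you how fast it is climbing right now.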

Helm: A package manager for Kubernetes. Helm charts bundle Kubernetes manifests, default configuration, and dependencies into installable packages, similar to how zypper packages software for SUSE Linux or apt packages software for Debian-based systems.

Watch for the answers to these questions as you follow the tutorial.

  1. What is the current memory usage percentage on the Node Exporter dashboard in Grafana? Write down the value and the panel name. (4 points)
  2. Write down the PromQL (Prometheus Query Language) query you used for the pod restart count panel. What is the current restart count? (5 points)
  3. In your PrometheusRule for the crash loop alert, what PromQL expression triggers the alert, and what is the for duration? (5 points)
  4. After killing nginx repeatedly to trigger the crash loop alert, how many minutes did it take for the alert to fire? What status did the alert show (Inactive, Pending, or Firing)? (5 points)
  5. What does kubectl logs <pod> --previous show, and why is the --previous flag necessary for debugging container crashes? (3 points)
  6. Show your TA your Grafana dashboard with live metrics and at least one alert in Firing state, and get their initials. (3 points)

Before installing the monitoring stack, resize your EC2 instance. kube-prometheus-stack is heavy enough that smaller instance types often time out during CRD installation or make the API server unresponsive.

  1. Open the EC2 instance in the AWS Console

    In AWS Console, go to EC2 > Instances and select your lab instance.

  2. Stop the instance

    Choose Instance state > Stop instance and wait until the instance state is stopped.

  3. Change the instance type

    With the instance selected, choose Actions > Instance settings > Change instance type. Select t3.large and save.

  4. Start the instance again

    Choose Instance state > Start instance and wait for both status checks to pass.

  5. Reconnect over SSH and verify the cluster

    If your instance uses an auto-assigned public IP, it may have changed after stop/start. Reconnect using the current public IP, then verify Kubernetes is responding:

    Terminal window
    kubectl get nodes -o wide
    kubectl get pods -A

    Continue only after the node is Ready and system pods are back up.

  1. Install Helm on the EC2 instance

    Helm runs as a client on your machine and communicates with the Kubernetes API (Application Programming Interface) server:

    Terminal window
    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  2. Verify the installation

    Terminal window
    helm version

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Node Exporter, and a set of pre-configured dashboards and alerting rules. It is the standard starting point for Kubernetes monitoring.

  1. Add the Helm repository

    Terminal window
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
  2. Install the monitoring stack

    Terminal window
    helm install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --set grafana.service.type=NodePort \
    --set grafana.service.nodePort=31000 \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
    --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false \
    --set grafana.sidecar.image.tag=1.28.0

    This command:

    • Installs the chart with the release name monitoring in a new monitoring namespace
    • Exposes Grafana on NodePort 31000 so you can access it from your browser
    • Configures Prometheus to pick up ServiceMonitors and PrometheusRules from all namespaces (not just the monitoring namespace)
    • Pins the Grafana sidecar image to 1.28.0 to avoid a known broken image tag in the latest chart release
  3. Wait for all Pods to be ready

    Terminal window
    kubectl get pods -n monitoring -w

    Wait until every Pod shows Running and all of its containers are ready (READY reads 1/1, 2/2, or 3/3, depending on the Pod). Press Ctrl+C when done.

  4. Access Grafana

    On the EC2 instance, run:

    Terminal window
    kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80 --address 127.0.0.1

    Keep that terminal open. On your laptop, create an SSH tunnel to that forwarded port:

    Terminal window
    ssh -i ~/Downloads/cs312-key.pem -N -L 3000:127.0.0.1:3000 ec2-user@<your-public-ip>

    Then open http://localhost:3000 in your browser.

    Optional direct access (requires Security Group inbound rule for TCP 31000):

    http://<your-ec2-public-ip>:31000

    If access still fails, verify in order:

    • kubectl -n monitoring get pods shows Grafana as Running/Ready
    • kubectl -n monitoring get svc monitoring-grafana shows the expected port
    • the kubectl port-forward command and the SSH tunnel above are both still running with no errors

    The username is admin. Retrieve the generated password with:

    Terminal window
    kubectl get secret -n monitoring monitoring-grafana -o jsonpath="{.data.admin-password}" | base64 -d
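The base64 -d step is needed because Kubernetes stores Secret data values base64-encoded (an encoding, not encryption). A minimal round-trip sketch, using a made-up value in place of the real admin password:

```shell
#!/usr/bin/env bash
# 'example-password' is a made-up stand-in for the Secret's
# .data.admin-password field; it is not the real Grafana password.
encoded=$(printf '%s' 'example-password' | base64)

# This is the form the value takes inside the Secret object.
echo "encoded: $encoded"

# Decoding recovers the original text, as in the kubectl command above.
printf 'decoded: %s\n' "$(printf '%s' "$encoded" | base64 -d)"
```

Anyone who can read the Secret can decode it this way, so access to Secrets should be restricted like access to the passwords themselves.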
  1. Browse the pre-built dashboards

    In Grafana, click the hamburger menu (three lines, top left), then Dashboards. You will see several pre-installed dashboards. Open Node Exporter / Nodes (the name may vary slightly).

  2. Identify the key panels

    This dashboard shows hardware-level metrics from the Node Exporter, which runs on each Kubernetes node and exposes operating system metrics:

    • CPU Usage: How much processing capacity is being used
    • Memory Usage: Current RAM consumption as a percentage
    • Disk I/O: Read and write activity on the node’s storage
    • Network Traffic: Bytes sent and received

    Find the memory usage panel and record the current percentage value and the panel’s exact name for your lab questions.

  3. Explore other dashboards

    Check out the Kubernetes / Compute Resources / Pod dashboard. This shows per-Pod CPU and memory consumption, useful for spotting a container that is using more resources than expected.

  1. Create a new dashboard

    In Grafana, click Dashboards > New > New Dashboard. Click Add visualization. Select the Prometheus data source.

  2. Panel 1: Pod restart count

    In the query editor, enter this PromQL query:

    kube_pod_container_status_restarts_total{namespace="default"}

    This metric tracks the cumulative number of restarts for every container in the default namespace (nginx, wordpress, and mariadb). Set the panel title to “Pod Restarts (default namespace).” Choose a visualization type of Time series so you can see restarts over time per container.

    Click Apply to save the panel.

  3. Panel 2: Node memory usage percentage

    Add another panel with this query:

    100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)

    This calculates the percentage of memory in use: the ratio of available to total memory, expressed as a percentage, is subtracted from 100. Title it “Node Memory Usage %.” Use a Gauge visualization with a max value of 100.
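To sanity-check the gauge, the same formula can be applied by hand to the MemTotal and MemAvailable fields of a meminfo-style file. This is a sketch under the assumption that the node's /proc/meminfo is the source of the two metrics; the function name mem_used_percent and the sample numbers are made up.

```shell
#!/usr/bin/env bash
# Computes 100 - (MemAvailable / MemTotal * 100) from a meminfo-style file.
# With no argument it reads /proc/meminfo (Linux only).
mem_used_percent() {
  awk '$1 == "MemTotal:"     { total = $2 }
       $1 == "MemAvailable:" { avail = $2 }
       END { printf "%.1f\n", 100 - (avail / total * 100) }' "${1:-/proc/meminfo}"
}

# Made-up sample values: 100 - (2000000 / 8000000 * 100) = 75.0
sample=$(mktemp)
printf '%s\n' 'MemTotal:        8000000 kB' 'MemAvailable:    2000000 kB' > "$sample"
mem_used_percent "$sample"
```

On the EC2 instance, running mem_used_percent with no argument should give a value close to the gauge, allowing for the delay between scrapes.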

  4. Panel 3: Pod ready status

    Add a third panel:

    kube_pod_status_ready{namespace="default", condition="true"}

    This returns 1 for Pods that are ready and 0 for those that are not, across all workloads in the default namespace (nginx, wordpress, mariadb). Title it “Pods Ready (default namespace).” Use a Stat visualization. When all pods are healthy you will see one series per Pod, each showing 1.

  5. Save the dashboard

    Click the save icon (floppy disk) in the top-right corner. Name it “CS 312 Lab Dashboard.”

Alerts notify you when something goes wrong, but poorly designed alerts create alert fatigue, where operators start ignoring notifications because too many are false positives. Every alert should be actionable: when it fires, the operator should know exactly what to check and what to do first. This is where runbooks come in: short documents linked to each alert that describe the first-response steps.

  1. Create a PrometheusRule manifest

    PrometheusRule is a Kubernetes Custom Resource Definition (CRD) that Prometheus watches for alert definitions.

    Terminal window
    vim alerting-rules.yaml

    Add the following manifest (indentation matters in YAML):

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: cs312-alerts
      namespace: monitoring
      labels:
        release: monitoring
    spec:
      groups:
        - name: cs312.rules
          rules:
            - alert: PodCrashLooping
              expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[5m]) > 3
              for: 1m
              labels:
                severity: critical
              annotations:
                summary: "Pod {{ $labels.pod }} is crash looping"
                runbook: |
                  1. Run: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
                  2. Check Events section for error messages
                  3. Check logs: kubectl logs {{ $labels.pod }} --previous
                  4. If bad image, rollback: kubectl rollout undo deployment/<name>
            - alert: HighNodeCPU
              expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
              for: 2m
              labels:
                severity: warning
              annotations:
                summary: "Node CPU usage above 80% for 2 minutes"
                runbook: |
                  1. Run: kubectl top pods --all-namespaces --sort-by=cpu
                  2. Identify the Pod consuming the most CPU
                  3. Check if it is expected (build, migration) or unexpected (infinite loop)
                  4. If unexpected, check resource limits and consider scaling
            - alert: PodNotReady
              expr: kube_pod_status_ready{namespace="default", condition="true"} == 0
              for: 1m
              labels:
                severity: warning
              annotations:
                summary: "Pod {{ $labels.pod }} is not ready"
                runbook: |
                  1. Run: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
                  2. Check readiness probe configuration and recent events
                  3. Check logs: kubectl logs {{ $labels.pod }}
                  4. If the Pod is failing to start, check image and config

    Let’s break down the structure:

    • expr: The PromQL expression that triggers the alert. increase(...[5m]) > 3 means “more than 3 restarts in the last 5 minutes.”
    • for: How long the condition must be true before the alert fires. This prevents one-time blips from triggering alerts.
    • labels.severity: Used for routing (e.g., critical alerts page on-call, warnings go to a Slack channel).
    • annotations.runbook: The first-response steps an operator should take.
  2. Apply the alert rules

    Terminal window
    kubectl apply -f alerting-rules.yaml
  3. Verify in Prometheus

    You can check that Prometheus loaded the rules by port-forwarding to the Prometheus UI. Open a second SSH session to the EC2 instance and run:

    Terminal window
    kubectl port-forward svc/monitoring-kube-prometheus-prometheus -n monitoring 9091:9090 --address 127.0.0.1

    Keep that session open (Ctrl+C stops the tunnel). On your laptop, create an SSH tunnel:

    Terminal window
    ssh -i ~/Downloads/cs312-key.pem -N -L 9091:127.0.0.1:9091 ec2-user@<your-public-ip>

    Then open http://localhost:9091/alerts in your browser. You should see your three alerts listed in “Inactive” state.

  1. Trigger the crash loop alert

    You will force nginx containers to restart repeatedly by killing the main process inside them. The main nginx process runs as PID 1:

    Terminal window
    for i in $(seq 1 5); do
    kubectl exec $(kubectl get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}') -- nginx -s stop
    sleep 10
    done

    This stops the nginx process 5 times. Each stop causes the container to exit, and Kubernetes restarts it within the same Pod. Once the restart count has increased by more than 3 within 5 minutes and that condition has held for the 1-minute for duration, the PodCrashLooping alert should fire.

  2. Watch the alert status

    Check the alert in Prometheus (refresh http://localhost:9091/alerts) or in Grafana (Alerting > Alert rules). The alert will transition from Inactive to Pending (condition is true but for duration has not elapsed) to Firing.

    Record how many minutes it took for the alert to fire.
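The Inactive, Pending, Firing progression can be modeled with a small shell function. This is a simplified sketch, not Prometheus code: it assumes the rule is evaluated at a fixed interval (30 seconds here is an assumption, so check your own configuration) and takes the result of each evaluation as 1 (condition true) or 0 (condition false). The function name alert_states is made up.

```shell
#!/usr/bin/env bash
# Simplified model of alert states.
# Arguments: for-duration in seconds, evaluation interval in seconds,
# then one 1/0 value per rule evaluation (1 = expression is true).
alert_states() {
  local for_s=$1 interval=$2 held=-1
  shift 2
  for cond in "$@"; do
    if [ "$cond" -eq 1 ]; then
      # held = seconds the condition has been continuously true
      if [ "$held" -lt 0 ]; then held=0; else held=$((held + interval)); fi
      if [ "$held" -ge "$for_s" ]; then echo Firing; else echo Pending; fi
    else
      # Any false evaluation resets the timer.
      held=-1
      echo Inactive
    fi
  done
}

# for: 1m with evaluations every 30 s: false, true, true, true, false
alert_states 60 30 0 1 1 1 0
```

In this model the alert needs three consecutive true evaluations before it fires, and a single false evaluation sends it straight back to Inactive. The time you measure in the lab also includes the delay before the restart count itself crosses the threshold.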

  3. Check the dashboard

    Go to your “CS 312 Lab Dashboard” in Grafana. The pod restart count panel should show the increased restart count.

  4. Investigate with logs

    Find the Pod that was restarted:

    Terminal window
    kubectl get pods

    View the logs from the previous (terminated) container instance:

    Terminal window
    kubectl logs <restarted-pod-name> --previous

    The --previous flag is essential for debugging crashes; without it, you see logs from the current (running) instance, which does not contain the error that caused the restart. Record the last 5 lines.
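When many Pods are listed, the restarted ones can be picked out of the kubectl get pods table by filtering on the RESTARTS column. The sketch below assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name restarted_pods and the Pod names in the sample are made up.

```shell
#!/usr/bin/env bash
# Prints the names of Pods whose RESTARTS column is greater than 0.
# Reads `kubectl get pods` table output on stdin and skips the header row.
restarted_pods() {
  awk 'NR > 1 && $4 + 0 > 0 { print $1 }'
}

# Illustrative sample output; these Pod names are made up.
cat <<'EOF' | restarted_pods
NAME                         READY   STATUS    RESTARTS      AGE
nginx-7c5ddbdf54-abcde       1/1     Running   5 (2m ago)    3d
wordpress-6f9c8d7b5-fghij    1/1     Running   0             3d
mariadb-5d8f7c6b9-klmno      1/1     Running   0             3d
EOF
```

On the cluster, kubectl get pods | restarted_pods would print the candidates to pass to kubectl logs <pod> --previous.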

When you are done with all three labs (Labs 7-9), uninstall the monitoring stack and remove all workloads:

Terminal window
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
kubectl delete \
-f wordpress-ingress.yaml \
-f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml \
-f wordpress-deployment.yaml -f wordpress-service.yaml \
-f mariadb-deployment.yaml -f mariadb-service.yaml -f db-secret.yaml \
-f mariadb-pvc.yaml

You can also terminate the EC2 instance if you no longer need it.

If you just need to pause, end your AWS Academy Learner Lab session instead. Everything persists on the EC2 instance.


You have now deployed a complete monitoring stack, built dashboards that answer “is my WordPress stack healthy?”, defined alerts with runbooks, and experienced the full incident loop: a failure occurs, metrics detect it, an alert fires, and you investigate with logs. These are the same observability practices used in production systems at every scale.