# Observability Workshop: Prometheus and Grafana
Gerald’s investor visited last week and asked, “How do you know your systems are healthy?” Gerald said, “We check.” The investor asked, “How?” There was a long silence. Gerald turned to you. You also said nothing. The investor left. Gerald now needs dashboards “for the next meeting.”
Observability is the practice of instrumenting your systems so you can answer questions like these from the outside, using metrics, logs, and traces. Your Kubernetes cluster runs containers, restarts them when they crash, and rolls out updates; but you have no visibility into what is happening inside. In this lab, you will deploy Prometheus (a metrics collection engine) and Grafana (a visualization platform) on your k3s cluster, build a custom dashboard, define alerts with runbooks, and trigger a real incident to see the entire detection-and-response loop in action.
## Before You Start

You need:
- The k3s cluster from Labs 7-8 with the nginx Deployment and Service still running
- SSH access to your EC2 instance
If you no longer have the Lab 7-8 setup, re-apply the manifests from those labs before starting.
## Key Concepts

Monitoring vs. Observability: Monitoring is checking predefined metrics against thresholds ("is CPU above 90%?"). Observability goes further: it means your system emits enough data that you can diagnose unexpected problems without deploying new instrumentation.
Prometheus: An open-source metrics collection system. It uses a pull model: Prometheus periodically scrapes HTTP endpoints (called exporters) to collect metrics. These metrics are stored as time-series data (values indexed by timestamp).
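To make the pull model concrete, here is a minimal sketch of what an exporter looks like, written in Python with a made-up metric name (`app_requests_total`). Prometheus would scrape `GET /metrics` on its configured interval and store each sample against a timestamp:

```python
# Minimal sketch of a Prometheus-style exporter (hypothetical metric name).
# Prometheus periodically scrapes the /metrics endpoint and stores each
# returned value as a time-series sample.
from http.server import BaseHTTPRequestHandler, HTTPServer

request_count = 0  # a "counter" metric: it only ever increases

def render_metrics():
    # Prometheus text exposition format: one "name value" pair per line,
    # with optional HELP/TYPE comment lines.
    return (
        "# HELP app_requests_total Total requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {request_count}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To expose the endpoint for scraping (blocks forever):
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

In this lab you will not write an exporter yourself; the Node Exporter and kube-state-metrics components installed below play this role.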
Grafana: An open-source platform for building dashboards from data sources like Prometheus. It turns raw metrics into visual panels: graphs, gauges, and tables.
PromQL (Prometheus Query Language): The query language for selecting and aggregating Prometheus metrics. For example, rate(http_requests_total[5m]) computes the per-second rate of HTTP requests over the last 5 minutes.
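Roughly, that `rate()` computation is just a difference divided by time, as this Python sketch with hypothetical counter samples shows (real Prometheus also extrapolates to the window boundaries and handles counter resets, which this ignores):

```python
# What rate(http_requests_total[5m]) roughly computes: the per-second
# increase of a counter between samples in a 5-minute window.
def per_second_rate(first_sample, last_sample, window_seconds):
    return (last_sample - first_sample) / window_seconds

# Counter rose from 1200 to 1500 requests over 5 minutes:
print(per_second_rate(1200, 1500, 5 * 60))  # 1.0 requests/second
```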
Helm: A package manager for Kubernetes. Helm charts bundle Kubernetes manifests, default configuration, and dependencies into installable packages, similar to how apt packages software for Ubuntu.
## Questions

Watch for the answers to these questions as you follow the tutorial.
- What is the current memory usage percentage on the Node Exporter dashboard in Grafana? Write down the value and the panel name. (4 points)
- Write down the PromQL (Prometheus Query Language) query you used for the pod restart count panel. What is the current restart count? (5 points)
- In your PrometheusRule for the crash loop alert, what PromQL expression triggers the alert, and what is the `for` duration? (5 points)
- After killing nginx repeatedly to trigger the crash loop alert, how many minutes did it take for the alert to fire? What status did the alert show (Inactive, Pending, or Firing)? (5 points)
- What does `kubectl logs <pod> --previous` show, and why is the `--previous` flag necessary for debugging container crashes? (3 points)
- Get your TA’s initials showing your Grafana dashboard with live metrics and at least one alert in Firing state. (3 points)
## Tutorial

### Installing Helm

1. **Install Helm on the EC2 instance**

   Helm runs as a client on your machine and communicates with the Kubernetes API (Application Programming Interface) server:

   ```bash
   curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
   ```

2. **Verify the installation**

   ```bash
   helm version
   ```
### Deploying the Monitoring Stack

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Node Exporter, and a set of pre-configured dashboards and alerting rules. It is the standard starting point for Kubernetes monitoring.

1. **Add the Helm repository**

   ```bash
   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
   helm repo update
   ```

2. **Install the monitoring stack**

   ```bash
   helm install monitoring prometheus-community/kube-prometheus-stack \
     --namespace monitoring --create-namespace \
     --set grafana.service.type=NodePort \
     --set grafana.service.nodePort=31000 \
     --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
     --set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false
   ```

   This command:

   - Installs the chart with the release name `monitoring` in a new `monitoring` namespace
   - Exposes Grafana on NodePort 31000 so you can access it from your browser
   - Configures Prometheus to pick up ServiceMonitors and PrometheusRules from all namespaces (not just the monitoring namespace)

3. **Wait for all Pods to be ready**

   ```bash
   kubectl get pods -n monitoring -w
   ```

   Wait until all Pods show `Running` and `READY` as `1/1` or `2/2`. Press Ctrl+C when done.

4. **Access Grafana**

   If your Security Group allows port 31000, open Grafana directly in your browser:

   ```
   http://<your-ec2-public-ip>:31000
   ```

   If port 31000 is not open in your Security Group, use SSH port forwarding from your laptop:

   ```bash
   ssh -i ~/Downloads/cs312-key.pem -L 3000:localhost:31000 ubuntu@<your-public-ip>
   ```

   Then open `http://localhost:3000` in your browser. The default login credentials are:

   - Username: `admin`
   - Password: `prom-operator`
### Exploring Default Dashboards

1. **Browse the pre-built dashboards**

   In Grafana, click the hamburger menu (three lines, top left), then **Dashboards**. You will see several pre-installed dashboards. Open **Node Exporter / Nodes** (the name may vary slightly).

2. **Identify the key panels**

   This dashboard shows hardware-level metrics from the Node Exporter, which runs on each Kubernetes node and exposes operating system metrics:

   - CPU Usage: how much processing capacity is being used
   - Memory Usage: current RAM consumption as a percentage
   - Disk I/O: read and write activity on the node’s storage
   - Network Traffic: bytes sent and received

   Find the memory usage panel and record the current percentage value and the panel’s exact name for your lab questions.

3. **Explore other dashboards**

   Check out the **Kubernetes / Compute Resources / Pod** dashboard. It shows per-Pod CPU and memory consumption, which is useful for spotting a container that is using more resources than expected.
### Building a Custom Dashboard

1. **Create a new dashboard**

   In Grafana, click **Dashboards > New > New Dashboard**. Click **Add visualization**. Select the **Prometheus** data source.
2. **Panel 1: Pod restart count**

   In the query editor, enter this PromQL query:

   ```
   kube_pod_container_status_restarts_total{namespace="default", container="nginx"}
   ```

   This metric tracks the cumulative number of restarts for each container. Set the panel title to “Nginx Pod Restarts.” Choose a visualization type of **Stat** or **Time series**. Click **Apply** to save the panel.
3. **Panel 2: Node memory usage percentage**

   Add another panel with this query:

   ```
   100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
   ```

   This calculates the percentage of memory in use by subtracting available memory from total memory. Title it “Node Memory Usage %.” Use a **Gauge** visualization with a max value of 100.
4. **Panel 3: Pod ready status**

   Add a third panel:

   ```
   kube_pod_status_ready{namespace="default", condition="true", pod=~"nginx.*"}
   ```

   This returns 1 for Pods that are ready and 0 for those that are not. Title it “Nginx Pods Ready.” Use a **Stat** visualization.
5. **Save the dashboard**

   Click the save icon (floppy disk) in the top-right corner. Name it “CS 312 Lab Dashboard.”
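Before moving on, it is worth seeing what the Panel 2 memory query actually computes. It is plain arithmetic, sketched here in Python with hypothetical byte counts:

```python
# Worked example of the Panel 2 query:
#   100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
def memory_used_percent(available_bytes, total_bytes):
    return 100 - (available_bytes / total_bytes * 100)

# A node with 8 GiB total and 2 GiB available is 75% used:
print(memory_used_percent(2 * 1024**3, 8 * 1024**3))  # 75.0
```

Compare this against the value your gauge shows; it should match the Node Exporter dashboard's memory panel within rounding.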
### Defining Alerts

Alerts notify you when something goes wrong, but poorly designed alerts create alert fatigue: operators start ignoring notifications because too many are false positives. Every alert should be actionable; when it fires, the operator should know exactly what to check and what to do first. This is where runbooks come in: short documents linked to each alert that describe the first-response steps.
1. **Create a PrometheusRule manifest**

   PrometheusRule is a Kubernetes Custom Resource Definition (CRD) that Prometheus watches for alert definitions.

   ```bash
   vim alerting-rules.yaml
   ```

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: cs312-alerts
     namespace: monitoring
     labels:
       release: monitoring
   spec:
     groups:
       - name: cs312.rules
         rules:
           - alert: PodCrashLooping
             expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[5m]) > 3
             for: 1m
             labels:
               severity: critical
             annotations:
               summary: "Pod {{ $labels.pod }} is crash looping"
               runbook: |
                 1. Run: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
                 2. Check Events section for error messages
                 3. Check logs: kubectl logs {{ $labels.pod }} --previous
                 4. If bad image, rollback: kubectl rollout undo deployment/<name>
           - alert: HighNodeCPU
             expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
             for: 2m
             labels:
               severity: warning
             annotations:
               summary: "Node CPU usage above 80% for 2 minutes"
               runbook: |
                 1. Run: kubectl top pods --all-namespaces --sort-by=cpu
                 2. Identify the Pod consuming the most CPU
                 3. Check if it is expected (build, migration) or unexpected (infinite loop)
                 4. If unexpected, check resource limits and consider scaling
           - alert: PodNotReady
             expr: kube_pod_status_ready{namespace="default", condition="true"} == 0
             for: 1m
             labels:
               severity: warning
             annotations:
               summary: "Pod {{ $labels.pod }} is not ready"
               runbook: |
                 1. Run: kubectl describe pod {{ $labels.pod }} -n {{ $labels.namespace }}
                 2. Check readiness probe configuration and recent events
                 3. Check logs: kubectl logs {{ $labels.pod }}
                 4. If the Pod is failing to start, check image and config
   ```

   Let’s break down the structure:

   - `expr`: The PromQL expression that triggers the alert. `increase(...[5m]) > 3` means “more than 3 restarts in the last 5 minutes.”
   - `for`: How long the condition must be true before the alert fires. This prevents one-time blips from triggering alerts.
   - `labels.severity`: Used for routing (e.g., critical alerts page on-call, warnings go to a Slack channel).
   - `annotations.runbook`: The first-response steps an operator should take.
2. **Apply the alert rules**

   ```bash
   kubectl apply -f alerting-rules.yaml
   ```

3. **Verify in Prometheus**

   Check that Prometheus loaded the rules by port-forwarding to the Prometheus UI:

   ```bash
   kubectl port-forward svc/monitoring-kube-prometheus-prometheus -n monitoring 9090:9090 &
   ```

   Then visit `http://localhost:9090/alerts` (or `http://<public-ip>:9090/alerts` if the port is open). You should see your three alerts listed in the “Inactive” state.
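The Inactive, Pending, and Firing states you will watch in the next section follow directly from the `for` duration. A simplified sketch of the lifecycle (real Prometheus evaluates rules on an interval and tracks state per label set, which this ignores):

```python
# Sketch of Prometheus's alert lifecycle for a rule with "for: 1m".
# The expression must stay continuously true for the whole "for"
# duration before the alert moves from Pending to Firing.
def alert_state(true_since, now, for_seconds=60):
    """true_since: time the expr first became true (None if false now)."""
    if true_since is None:
        return "Inactive"
    if now - true_since < for_seconds:
        return "Pending"
    return "Firing"

print(alert_state(None, 100))  # Inactive: expression is false
print(alert_state(100, 130))   # Pending: true for only 30s
print(alert_state(100, 170))   # Firing: true for 70s >= 60s
```

If the expression goes false at any point during the Pending window, the alert drops back to Inactive and the clock restarts.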
### Triggering an Alert

1. **Trigger the crash loop alert**

   You will force nginx containers to restart repeatedly by killing the main process inside them. The main nginx process runs as PID 1:

   ```bash
   for i in $(seq 1 5); do
     kubectl exec $(kubectl get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}') -- kill 1
     sleep 5
   done
   ```

   This kills the nginx process 5 times. Each kill causes the container to exit, and Kubernetes restarts it. After several restarts within 5 minutes, the `PodCrashLooping` alert should fire.
Watch the alert status
Check the alert in Prometheus (refresh
http://localhost:9090/alerts) or in Grafana (Alerting > Alert rules). The alert will transition from Inactive to Pending (condition is true butforduration has not elapsed) to Firing.Record how many minutes it took for the alert to fire.
3. **Check the dashboard**

   Go to your “CS 312 Lab Dashboard” in Grafana. The pod restart count panel should show the increased restart count.
4. **Investigate with logs**

   Find the Pod that was restarted:

   ```bash
   kubectl get pods
   ```

   View the logs from the previous (terminated) container instance:

   ```bash
   kubectl logs <restarted-pod-name> --previous
   ```

   The `--previous` flag is essential for debugging crashes; without it, you see logs from the current (running) instance, which does not contain the error that caused the restart. Record the last 5 lines.
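Do not be surprised if the alert takes several minutes to reach Firing even after the kills finish: on top of the `for` duration, the kubelet delays each successive restart with exponential backoff (the CrashLoopBackOff status you may see in `kubectl get pods`). A sketch of that schedule, assuming the commonly documented kubelet defaults of a 10-second initial delay doubling up to a 5-minute cap:

```python
# CrashLoopBackOff delay schedule (assumed kubelet defaults: 10s initial
# delay, doubled after each restart, capped at 5 minutes; the backoff
# resets after the container runs cleanly for a while).
def backoff_delays(restarts, initial=10, cap=300):
    delays, delay = [], initial
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

print(backoff_delays(6))  # [10, 20, 40, 80, 160, 300]
```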
## Clean Up

When you are done with the lab, you can uninstall the monitoring stack:

```bash
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
```

If you are also done with the nginx deployment:

```bash
kubectl delete -f nginx-deployment.yaml -f nginx-service.yaml -f nginx-configmap.yaml
```

You have now deployed a complete monitoring stack, built dashboards that answer “is my service healthy?”, defined alerts with runbooks, and experienced the full incident loop: a failure occurs, metrics detect it, an alert fires, and you investigate with logs. These are the same observability practices used in production systems at every scale.