Prometheus and Grafana
This activity puts into practice the concepts from the Monitoring, Alerting, and Observability lecture. You will install a lightweight Prometheus and Grafana stack on minikube using Helm, deploy a small instrumented service, and work through three layers of the lecture: what Prometheus actually collects (the raw exposition format and metric types), how to turn those metrics into RED and USE signals on a dashboard, and how an absence-based alert compares to the full dead man’s switch pattern from the lecture. By the end, you will have a working local monitoring stack, a six-panel dashboard, and a firing alert you triggered yourself.
What You Will Need
- The Minikube activity completed once already, or minikube, kubectl, and Docker installed from the official guides
- helm installed. Use the install guide at helm.sh/docs/intro/install/.
- About 6 GiB of free RAM for Docker plus the local cluster
Start the Cluster and Install the Helm Releases
Before you look at any metrics, bring up a clean local monitoring environment. In this section you will start minikube, write two small values files, and install Prometheus and Grafana as Helm releases.
- Verify the tools you will use throughout the activity:

      docker version
      minikube version
      kubectl version --client
      helm version

  Each command should print version information and return you to the prompt. Fix any missing tool before you continue.
- Make sure minikube is running:

      minikube status

  If the cluster is stopped, start it now:

      minikube start --driver=docker --memory=4096 --cpus=2

  When minikube is ready, kubectl get nodes should show one node in the Ready state.

- Create a working directory for this activity:

      mkdir -p ~/cs312-monitoring
      cd ~/cs312-monitoring

  Keep the Helm values files and manifests from this activity here.
- Write the Prometheus values file:

      cat <<'EOF' > prometheus-values.yaml
      alertmanager:
        enabled: false
      kube-state-metrics:
        enabled: false
      prometheus-pushgateway:
        enabled: false
      server:
        persistentVolume:
          enabled: false
      scrapeConfigs:
        kubernetes-service-endpoints:
          scrape_interval: 15s
      serverFiles:
        alerting_rules.yml:
          groups:
            - name: podinfo.rules
              interval: 15s
              rules:
                - alert: PodinfoMissing
                  expr: absent(up{service="podinfo", namespace="observability"})
                  for: 1m
                  labels:
                    severity: warning
                  annotations:
                    summary: "podinfo metrics target disappeared"
                    description: "Prometheus has not seen the podinfo target in the observability namespace for 1 minute."
      EOF

  This keeps the Prometheus install small, shortens the kubernetes-service-endpoints scrape interval to 15s so the podinfo demo responds quickly, and evaluates the demo alert rule group every 15s without rewriting Prometheus’s entire global config. Alertmanager stays disabled, so in this activity you will inspect the firing rule in Prometheus rather than route a notification anywhere. Node Exporter is left enabled at its default cadence so you can query Linux OS metrics later in this activity.

- Write the Grafana values file:

      cat <<'EOF' > grafana-values.yaml
      adminPassword: cs312grafana
      persistence:
        enabled: false
      datasources:
        datasources.yaml:
          apiVersion: 1
          datasources:
            - name: Prometheus
              type: prometheus
              url: http://metrics-prometheus-server.monitoring.svc.cluster.local
              access: proxy
              isDefault: true
      service:
        type: ClusterIP
      EOF

  This pre-provisions the Prometheus data source so you do not have to create it by hand in the UI.
- Add the Helm repositories and install the two releases:

      helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
      helm repo add grafana-community https://grafana-community.github.io/helm-charts
      helm repo update
      helm install metrics prometheus-community/prometheus \
        --namespace monitoring \
        --create-namespace \
        -f prometheus-values.yaml
      helm install grafana grafana-community/grafana \
        --namespace monitoring \
        -f grafana-values.yaml

  You should now have two Helm releases in the monitoring namespace: one for Prometheus and one for Grafana.

- Verify the releases and wait for the Pods to be ready:

      helm list -n monitoring
      kubectl get pods -n monitoring -w

  Wait until the Prometheus server and Grafana pods show Running. You will also see a node-exporter DaemonSet pod come up; that is Prometheus’s agent for Linux OS metrics. Press Ctrl+C when the watch settles.

- Start two port-forwards so you can use the web interfaces:

  In one terminal:

      kubectl -n monitoring port-forward svc/grafana 3000:80

  In a second terminal:

      kubectl -n monitoring port-forward svc/metrics-prometheus-server 9090:80

  Keep both terminals open. Grafana will be available at http://127.0.0.1:3000 and Prometheus at http://127.0.0.1:9090.
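The data source URL in the Grafana values file, http://metrics-prometheus-server.monitoring.svc.cluster.local, follows the usual in-cluster DNS pattern of service name, then namespace, then svc.cluster.local. A minimal sketch of that pattern, assuming the default cluster domain cluster.local; the helper name service_fqdn is made up for illustration and is not part of kubectl or Helm:

```shell
# Hypothetical helper: build the in-cluster DNS name of a Kubernetes Service.
# Assumes the default cluster domain "cluster.local".
service_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

# Service name and namespace taken from the port-forward command in this section.
prometheus_url="http://$(service_fqdn metrics-prometheus-server monitoring)"
echo "$prometheus_url"
```

Grafana runs inside the cluster, so it reaches Prometheus through this Service name rather than through the 127.0.0.1 port-forward you use from your own machine.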
Deploy a Service and Let Prometheus Discover It
Now you need something that produces application metrics. In this section you will deploy podinfo, a small demo web service that already exposes Prometheus metrics, and give it one port for normal HTTP traffic and one port just for /metrics.
Prometheus is already running in the monitoring namespace from the previous section. The new piece here is the application side: a Service in the observability namespace that tells Prometheus, through annotations, where that metrics endpoint lives so it can start scraping it automatically.
- Write the workload manifest:

      cat <<'EOF' > podinfo.yaml
      apiVersion: v1
      kind: Namespace
      metadata:
        name: observability
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: podinfo
        namespace: observability
        labels:
          app: podinfo
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: podinfo
        template:
          metadata:
            labels:
              app: podinfo
          spec:
            containers:
              - name: podinfo
                image: ghcr.io/stefanprodan/podinfo:6.11.2
                command:
                  - ./podinfo
                  - --level=info
                  - --port=9898
                  - --port-metrics=9797
                ports:
                  - name: http
                    containerPort: 9898
                  - name: metrics
                    containerPort: 9797
                readinessProbe:
                  httpGet:
                    path: /readyz
                    port: http
                livenessProbe:
                  httpGet:
                    path: /healthz
                    port: http
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: podinfo
        namespace: observability
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "9797"
          prometheus.io/path: "/metrics"
      spec:
        selector:
          app: podinfo
        ports:
          - name: http
            port: 9898
            targetPort: http
          - name: metrics
            port: 9797
            targetPort: metrics
      EOF

  The Deployment runs podinfo, which is just a sample application for demos and testing. It serves the app itself on port 9898, and it serves Prometheus metrics separately on port 9797.

  The Service annotations are the important part. They tell Prometheus that this Service should be scraped, which port to use, and which path contains the metrics text.
- Apply the manifest and wait for the Deployment to become ready:

      kubectl apply -f podinfo.yaml
      kubectl rollout status deployment/podinfo -n observability
      kubectl get pods,svc -n observability

  Continue only after the Pod is available and the Service exists.
- Open port-forwards to both the application and its metrics endpoint:

  In a third terminal:

      kubectl -n observability port-forward svc/podinfo 9898:9898 9797:9797

  Keep that terminal open. You will use http://127.0.0.1:9898 for test traffic and http://127.0.0.1:9797/metrics to inspect the raw exposition output. The first URL is what a normal client would hit. The second URL is what Prometheus hits on every scrape.

- Confirm that Prometheus discovered the new target:

  Open http://127.0.0.1:9090/targets in your browser. Look for a target from the observability namespace with the podinfo Service name.

  The state should become UP within a minute. That means Prometheus discovered the Service from Kubernetes metadata, connected to podinfo on port 9797, and successfully fetched /metrics. You created that scrape relationship with three Service annotations, not with a hand-edited prometheus.yml.

- Run a simple query to confirm the target exists:

  In the Prometheus expression browser, run:

      up{service="podinfo", namespace="observability"}

  A value of 1 means the scrape succeeded. A value of 0 means the target exists but is failing to scrape.

- Look at the raw data Prometheus is collecting:

      curl -s http://127.0.0.1:9797/metrics | head -50

  This is the key mental model for the rest of the activity: you are looking directly at the HTTP response that podinfo returns when someone requests /metrics. Prometheus is not inventing these numbers. podinfo is exposing them, and Prometheus is repeatedly fetching this endpoint and storing what it sees.

  Find three things in the output:

  - A # TYPE ... counter line: this metric only ever increases, apart from resetting to zero when the process restarts. The rate() function will convert it to a per-second rate over a time window.
  - A # TYPE ... gauge line: this is a point-in-time value that can go up or down, such as current memory usage or active connection count.
  - Lines containing _bucket{le="..."}: histogram buckets. Each bucket is cumulative: it counts how many observations were less than or equal to that boundary value. histogram_quantile() reads across all buckets at query time to estimate any percentile you ask for.

  This text format is the Prometheus exposition format. podinfo is one instrumented application that speaks it; other exporters use the same overall pattern even though their metric names differ. podinfo also has OpenTelemetry support upstream, but this activity is not using OTLP or an OpenTelemetry Collector; it is scraping the Prometheus-format /metrics endpoint directly. In this setup, podinfo is the app, the Kubernetes Service makes its metrics endpoint reachable, Prometheus is the scraper and time-series database, and Grafana will later read from Prometheus to draw dashboards. The text you just fetched is the exact payload Prometheus parses on each scrape.
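To practice reading the exposition format without a running cluster, here is a small sketch that works on a hard-coded payload. The metric names and values are invented for illustration (they are not copied from a real podinfo response); only the line shapes follow the format described above.

```shell
# Illustrative exposition payload. Names and values are made up for this sketch.
sample='# TYPE demo_requests_total counter
demo_requests_total{status="200"} 20
demo_requests_total{status="500"} 5
# TYPE demo_inflight_requests gauge
demo_inflight_requests 2
# TYPE demo_request_duration_seconds histogram
demo_request_duration_seconds_bucket{le="0.1"} 18
demo_request_duration_seconds_bucket{le="1"} 20
demo_request_duration_seconds_bucket{le="+Inf"} 25
demo_request_duration_seconds_sum 7.5
demo_request_duration_seconds_count 25'

# Print "<metric name> <type>" for every "# TYPE" line.
metric_types=$(printf '%s\n' "$sample" | awk '$1 == "#" && $2 == "TYPE" { print $3, $4 }')
echo "$metric_types"

# Count histogram bucket series (sample lines whose name ends in _bucket).
bucket_count=$(printf '%s\n' "$sample" | grep -c '_bucket{')
echo "bucket series: $bucket_count"
```

To run the same two pipelines against the live endpoint, replace the printf with curl -s http://127.0.0.1:9797/metrics while the port-forward is open. Notice that the bucket values in the sample never decrease as le grows, because each bucket includes everything counted by the smaller ones.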
Query RED and USE Signals
Once the target is being scraped, make it do something interesting. In this section you will generate traffic, run a small set of Prometheus checks to validate labels and target state, and then build all RED and USE panels directly in Grafana.
A normal PromQL query usually has one of these shapes:

    # Instant selector ("what is true right now")
    # this selects time series by metric name and exact label match
    # you can use a regex match with =~ instead of =
    metric_name{label="value"}

    # Range function over time ("how has this changed over 5 minutes")
    # the rate function converts a counter to a per-second rate over the window
    rate(metric_name{label="value"}[5m])

    # Aggregated result ("combine many series into a summary")
    # this aggregates series while keeping only listed labels
    sum by (label1, label2) (rate(metric_name{label="value"}[5m]))

You can find all available query functions in the Prometheus documentation.
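The arithmetic behind rate() is easier to trust once you have done it by hand. The sketch below is a deliberate simplification: it divides the increase between two counter samples by the elapsed seconds. The real rate() function also handles counter resets and extrapolates toward the edges of the window, so its results can differ slightly. The function name simple_rate and the sample numbers are made up for illustration.

```shell
# Simplified per-second rate between two counter samples.
# Arguments: earlier counter value, later counter value, seconds between them.
# Assumption: no counter reset happened between the two samples.
simple_rate() {
  awk -v first="$1" -v last="$2" -v seconds="$3" \
    'BEGIN { printf "%.3f\n", (last - first) / seconds }'
}

# A counter that grew from 100 to 130 across a 5-minute (300 s) window.
simple_rate 100 130 300   # prints 0.100
```

An increase of 30 requests over 300 seconds is 0.1 requests per second, which is why a short burst of traffic shows up as a small number on a [5m] rate graph.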
- Send a burst of mixed traffic to the service:

      for i in {1..20}; do curl -s http://127.0.0.1:9898/ >/dev/null; done
      for i in {1..5}; do curl -s http://127.0.0.1:9898/status/500 >/dev/null; done
      for i in {1..5}; do curl -s http://127.0.0.1:9898/delay/1 >/dev/null; done

  This gives you three things to measure immediately: total request volume, error count, and slow requests.
- Run two quick checks in the Prometheus expression browser:

  Before building dashboard panels, run two small validation queries: one to confirm target identity labels, and one to confirm request labels (path, status) and spot probe traffic.

      sum by (job, instance, service, namespace) (up{service="podinfo", namespace="observability"})

      sum by (path, status) (rate(http_request_duration_seconds_count{service="podinfo", namespace="observability"}[5m]))

  The first query should show one podinfo target under the kubernetes-service-endpoints job. The second query makes an important behavior visible before you build panels: Kubernetes is continuously calling /readyz and /healthz, so those probe requests will dominate simple request-rate graphs unless you filter them out.

- Log into Grafana and create the dashboard shell first:

  Open http://127.0.0.1:3000. Log in with username admin and password cs312grafana.

  Create a new dashboard and set the time range to the last 15 minutes. You will add six panels in two rows:

  - Row 1 (service behavior): Request rate, 5xx rate, p95 latency, Podinfo target up
  - Row 2 (node resources): Node CPU utilization (%), Node memory used (%)
- Add the RED panels in Grafana, one query at a time:

  Panel 1: Request rate

      sum(rate(http_request_duration_seconds_count{service="podinfo", namespace="observability", path=~"root|status|delay"}[5m]))

  This is RED Rate. http_request_duration_seconds_count is a counter from the latency histogram family, so rate(...[5m]) gives requests per second. The path regex intentionally includes only the routes you generated and excludes /readyz and /healthz probe traffic.

  Panel 2: 5xx rate

      sum(rate(http_request_duration_seconds_count{service="podinfo", namespace="observability", path=~"root|status|delay", status=~"5.."}[5m]))

  This is RED Errors. It uses the same base counter and filters to status codes matching 5.. (all 5xx). If this panel is empty, send a few more /status/500 requests, wait about 30 seconds, and refresh.

  Panel 3: p95 latency

      histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="podinfo", namespace="observability", path=~"root|status|delay"}[5m])))

  This is RED Duration at the 95th percentile. http_request_duration_seconds_bucket provides histogram bucket counters labeled by le (less-than-or-equal bucket boundary). rate() computes per-second bucket increases, sum by (le) combines matching buckets, and histogram_quantile(0.95, ...) estimates p95.

  These three panels use the same source metric family on purpose: _count for request volume and errors, _bucket for latency percentile.

- Add a direct target-availability Stat panel:

  Panel 4: Podinfo target up

      max(up{service="podinfo", namespace="observability"}) or on() vector(0)

  up{...} is per-target scrape health. max(...) collapses it to one value. or on() vector(0) provides a fallback zero when the left side has no series at all, so the panel shows 0 instead of No data if the target disappears from discovery.

- Add the USE panels in Grafana:

  Panel 5: Node CPU utilization (%)

      100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

  node_cpu_seconds_total is a counter partitioned by CPU mode. mode="idle" selects idle time only. rate(...[5m]) gives idle seconds per second, which is the fraction of time each core spent idle. avg(...) averages that fraction across cores and instances in this small cluster, and 1 - idle gives the used CPU fraction. Multiplying by 100 gives a percent.

  Panel 6: Node memory used (%)

      100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

  This computes memory usage percent as 1 - available/total. MemAvailable is the kernel’s estimate of how much memory is available for starting new applications without swapping.

  Both values come from Node Exporter. In minikube this is typically one node, so these queries act as cluster-node health checks.

  Your dashboard now answers three different questions: RED panels show user-visible service behavior, Podinfo target up shows scrape visibility of the service, and USE panels show node resource pressure.

- Generate a second burst of traffic while watching the dashboard refresh:

      for i in {1..10}; do curl -s http://127.0.0.1:9898/ >/dev/null; done
      for i in {1..3}; do curl -s http://127.0.0.1:9898/status/500 >/dev/null; done

  Within the next refresh interval, the request-rate and 5xx panels should jump. The CPU panel may also tick up slightly.
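The p95 latency panel relies on histogram_quantile(), which estimates a percentile from cumulative bucket counts rather than from individual request timings. The sketch below shows the core idea under stated assumptions: buckets arrive sorted by upper bound, the last one is +Inf, and values are spread evenly inside each bucket (linear interpolation). The function name estimate_quantile and the bucket numbers are invented for illustration, and Prometheus itself handles more edge cases than this.

```shell
# Estimate a quantile from cumulative histogram buckets.
# $1    = quantile between 0 and 1
# stdin = "upper_bound cumulative_count" lines, ascending, ending with +Inf
estimate_quantile() {
  awk -v q="$1" '
    { bound[NR] = $1; count[NR] = $2 }
    END {
      rank = q * count[NR]                      # position of the target observation
      for (i = 1; i <= NR; i++) if (count[i] >= rank) break
      if (i == NR) { print bound[NR - 1]; exit } # target fell in the +Inf bucket
      lower = (i == 1) ? 0 : bound[i - 1]        # lower edge of the chosen bucket
      below = (i == 1) ? 0 : count[i - 1]        # observations below that edge
      printf "%.3f\n", lower + (bound[i] - lower) * (rank - below) / (count[i] - below)
    }'
}

# 20 observations: 10 at or under 0.1 s, 16 at or under 0.5 s, all 20 at or under 1 s.
printf '0.1 10\n0.5 16\n1 20\n+Inf 20\n' | estimate_quantile 0.95   # prints 0.875
```

The 19th of 20 observations lands three quarters of the way through the 0.5 to 1 second bucket, so the estimate is 0.875 seconds. The answer can never be more precise than the bucket boundaries allow, which is why the p95 panel may sit on round-looking values. In the dashboard query the inputs are per-second bucket rates instead of raw counts, but the proportions and the interpolation work the same way.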
Watch an Alert Fire and Recover
Dashboards are useful only when someone is looking at them. In this stripped-down stack, the alert stops at Prometheus itself because Alertmanager is disabled, so you will observe rule evaluation on the Prometheus alerts page rather than send a notification. The PodinfoMissing alert uses the absent() function to detect when a target goes completely silent. That is one important ingredient in the lecture’s dead man’s switch discussion, but it is not the full pattern because no secondary receiver is checking that alerts continue to arrive.
- Open the Prometheus alerts page:

  Visit http://127.0.0.1:9090/alerts in your browser.

  You should see the PodinfoMissing rule listed in the Inactive state. That is the expected state while Prometheus can still scrape the target. It will move to Pending and then Firing only after the target disappears for the full for: 1m window.

- Confirm where that rule came from:

      helm get values metrics -n monitoring

  You should see the alert definition under serverFiles.alerting_rules.yml. This is configuration-as-code for an alert rule: the same workflow that deploys the application also deploys its monitoring. In this activity, that configuration stops at Prometheus rule evaluation because Alertmanager is turned off in the values file.

- Scale the Deployment to zero replicas:

      kubectl scale deployment/podinfo --replicas=0 -n observability
      kubectl get pods -n observability -w

  Watch until the Pod disappears. Then return to the Prometheus alerts page.
- Wait for the rule to move from Pending to Firing:

  The expression uses absent() and the rule has for: 1m, so the alert first shows as Pending and then transitions to Firing after one continuous minute of absence. The for clause prevents a single missed scrape from moving the rule into Firing; the target must stay absent for the full duration.

  While the alert is in Firing, the dashboard and the Prometheus alerts page give you different kinds of evidence. The RED panels show the recent request, error, and latency history from before podinfo disappeared. The Podinfo target up panel shows the direct availability signal for the scrape target, and the USE panels show whether the node itself is still healthy. The Prometheus alerts page then shows the current state change: the PodinfoMissing rule is firing because the target has been absent for long enough. In a larger stack, Alertmanager would sit after this step and decide whether to notify anyone.

- Restore the service:

      kubectl scale deployment/podinfo --replicas=1 -n observability
      kubectl rollout status deployment/podinfo -n observability

  After the Pod returns and Prometheus scrapes it again, the alert should leave the Firing state and return to Inactive.
- Clean up when you are finished:

  Delete the demo application namespace:

      kubectl delete namespace observability --wait=true

  To remove the monitoring stack as well:

      helm uninstall grafana -n monitoring
      helm uninstall metrics -n monitoring
      kubectl delete namespace monitoring --wait=true

  If you plan to continue to the logging activity with the same cluster, leave the monitoring namespace in place.
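The Pending-to-Firing behavior you watched can be replayed offline. The sketch below is a simplified model, assuming one rule evaluation every 15 seconds and a 60-second for duration, as configured for PodinfoMissing in this activity. It ignores how long Prometheus takes to mark the vanished up series as stale, so real timings will lag a little. The function name simulate_for_clause is made up for illustration.

```shell
# Replay rule evaluations for an alert with "for: 1m" evaluated every 15 s.
# Arguments: one word per evaluation, "absent" (expression true) or "present".
simulate_for_clause() {
  interval=15        # seconds between evaluations
  hold=60            # the "for" duration in seconds
  elapsed=0
  active_since=""    # time of the first evaluation where the expression was true
  for sample in "$@"; do
    if [ "$sample" = "absent" ]; then
      [ -n "$active_since" ] || active_since=$elapsed
      if [ $((elapsed - active_since)) -ge "$hold" ]; then
        state=firing
      else
        state=pending
      fi
    else
      active_since=""   # any successful evaluation resets the timer
      state=inactive
    fi
    echo "t=${elapsed}s $state"
    elapsed=$((elapsed + interval))
  done
}

simulate_for_clause present absent absent absent absent absent present
```

The output stays at pending from t=15s through t=60s, switches to firing at t=75s (60 seconds after the first absent evaluation), and drops back to inactive as soon as the target is present again. Inserting a single present in the middle of the absent run restarts the timer, which is the flap protection the for clause provides.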
Going Further
You have worked through the pull model end to end: raw exposition format, PromQL rate and quantile functions, a RED and USE dashboard, and an absence-based alert rule. Two natural next steps pull in different directions.
For deeper monitoring, compare this activity’s two-release setup to the kube-prometheus-stack chart used in the later lab. Run helm show values prometheus-community/kube-prometheus-stack | less and notice how much more of the Kubernetes ecosystem it packages: the Prometheus Operator, ServiceMonitors and PrometheusRules as Kubernetes objects, kube-state-metrics for Kubernetes-level signals, and pre-built dashboards for the cluster. Pay particular attention to ServiceMonitor, which replaces the annotation-based discovery you used here.
For deeper alerting, read the Alerting on SLOs chapter of the Google SRE Workbook and try writing a burn-rate rule against the error data you collected. A burn rate of 14.4 means a 30-day error budget is exhausted in about two days (30 / 14.4 ≈ 2.1). The PromQL expression in the lecture gives you the starting point; the Workbook explains how to pair a short window with a long window to reduce false positives from transient spikes while still resetting quickly once the incident ends.
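The burn-rate figures are plain arithmetic, and working them once makes the thresholds less mysterious. This sketch assumes a constant burn rate over the whole period; the function names are made up for illustration.

```shell
# Days until the error budget is gone at a constant burn rate.
# Arguments: SLO window in days, burn rate.
days_to_exhaustion() {
  awk -v window_days="$1" -v burn_rate="$2" \
    'BEGIN { printf "%.2f\n", window_days / burn_rate }'
}

# Percent of the budget consumed after a number of hours at that burn rate.
# Arguments: SLO window in days, burn rate, hours elapsed.
budget_consumed_percent() {
  awk -v window_days="$1" -v burn_rate="$2" -v hours="$3" \
    'BEGIN { printf "%.1f\n", 100 * burn_rate * hours / (window_days * 24) }'
}

days_to_exhaustion 30 14.4          # prints 2.08
budget_consumed_percent 30 14.4 1   # prints 2.0
```

A burn rate of 1 spends the budget exactly over the 30-day window, so 14.4 spends it in about 2.08 days. Read the other way, 14.4 is the rate at which one hour of errors consumes 2 percent of a 30-day budget.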