Prometheus and Grafana

This activity puts into practice the concepts from the Monitoring, Alerting, and Observability lecture. You will install a lightweight Prometheus and Grafana stack on minikube using Helm, deploy a small instrumented service, and work through three layers of the lecture: what Prometheus actually collects (the raw exposition format and metric types), how to turn those metrics into RED and USE signals on a dashboard, and how an absence-based alert compares to the full dead man’s switch pattern from the lecture. By the end, you will have a working local monitoring stack, a six-panel dashboard, and a firing alert you triggered yourself.

What You Will Need

The Minikube activity completed once already, or minikube, kubectl, and Docker installed from the official guides
helm installed. Use the install guide at helm.sh/docs/intro/install/.
About 6 GiB of free RAM for Docker plus the local cluster

Start the Cluster and Install the Helm Releases

Before you look at any metrics, bring up a clean local monitoring environment. In this section you will start minikube, write two small values files, and install Prometheus and Grafana as Helm releases.

Verify the tools you will use throughout the activity:
```
docker version
minikube version
kubectl version --client
helm version
```
Each command should print version information and return you to the prompt. Fix any missing tool before you continue.
Make sure minikube is running:
```
minikube status
```
If the cluster is stopped, start it now:
```
minikube start --driver=docker --memory=4096 --cpus=2
```
When minikube is ready, kubectl get nodes should show one node in the Ready state.
Create a working directory for this activity:
```
mkdir -p ~/cs312-monitoring
cd ~/cs312-monitoring
```
Keep the Helm values files and manifests from this activity here.

Write the Prometheus values file:

```
cat <<'EOF' > prometheus-values.yaml
alertmanager:
  enabled: false

kube-state-metrics:
  enabled: false

prometheus-pushgateway:
  enabled: false

server:
  persistentVolume:
    enabled: false

scrapeConfigs:
  kubernetes-service-endpoints:
    scrape_interval: 15s

serverFiles:
  alerting_rules.yml:
    groups:
      - name: podinfo.rules
        interval: 15s
        rules:
          - alert: PodinfoMissing
            expr: absent(up{service="podinfo", namespace="observability"})
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "podinfo metrics target disappeared"
              description: "Prometheus has not seen the podinfo target in the observability namespace for 1 minute."
EOF
```

This keeps the Prometheus install small, speeds up the kubernetes-service-endpoints scrape job to 15s so the podinfo demo responds quickly, and evaluates the demo alert rule group every 15s without rewriting Prometheus’s entire global config. Alertmanager stays disabled, so in this activity you will inspect the firing rule in Prometheus rather than route a notification anywhere. Node Exporter is left enabled at its default cadence so you can query Linux OS metrics later in this activity.

Write the Grafana values file:

```
cat <<'EOF' > grafana-values.yaml
adminPassword: cs312grafana

persistence:
  enabled: false

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://metrics-prometheus-server.monitoring.svc.cluster.local
        access: proxy
        isDefault: true

service:
  type: ClusterIP
EOF
```

This pre-provisions the Prometheus data source so you do not have to create it by hand in the UI.

Add the Helm repositories and install the two releases:

```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana-community https://grafana-community.github.io/helm-charts
helm repo update

helm install metrics prometheus-community/prometheus \
  --namespace monitoring \
  --create-namespace \
  -f prometheus-values.yaml

helm install grafana grafana-community/grafana \
  --namespace monitoring \
  -f grafana-values.yaml
```

You should now have two Helm releases in the monitoring namespace: one for Prometheus and one for Grafana.

Verify the releases and wait for the Pods to be ready:
```
helm list -n monitoring
kubectl get pods -n monitoring -w
```
Wait until the Prometheus server and Grafana pods show Running. You will also see a node-exporter DaemonSet pod come up; that is Prometheus’s agent for Linux OS metrics. Press Ctrl+C when the watch settles.
Start two port-forwards so you can use the web interfaces:

In one terminal:
```
kubectl -n monitoring port-forward svc/grafana 3000:80
```
In a second terminal:
```
kubectl -n monitoring port-forward svc/metrics-prometheus-server 9090:80
```
Keep both terminals open. Grafana will be available at http://127.0.0.1:3000 and Prometheus at http://127.0.0.1:9090.

Deploy a Service and Let Prometheus Discover It

Now you need something that produces application metrics. In this section you will deploy podinfo, a small demo web service that already exposes Prometheus metrics, and give it one port for normal HTTP traffic and one port just for /metrics.

Prometheus is already running in the monitoring namespace from the previous section. The new piece here is the application side: a Service in the observability namespace that tells Prometheus, through annotations, where that metrics endpoint lives so it can start scraping it automatically.

Write the workload manifest:

```
cat <<'EOF' > podinfo.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: observability
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  namespace: observability
  labels:
    app: podinfo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
        - name: podinfo
          image: ghcr.io/stefanprodan/podinfo:6.11.2
          command:
            - ./podinfo
            - --level=info
            - --port=9898
            - --port-metrics=9797
          ports:
            - name: http
              containerPort: 9898
            - name: metrics
              containerPort: 9797
          readinessProbe:
            httpGet:
              path: /readyz
              port: http
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
---
apiVersion: v1
kind: Service
metadata:
  name: podinfo
  namespace: observability
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9797"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: podinfo
  ports:
    - name: http
      port: 9898
      targetPort: http
    - name: metrics
      port: 9797
      targetPort: metrics
EOF
```

The Deployment runs podinfo, which is just a sample application for demos and testing. It serves the app itself on port 9898, and it serves Prometheus metrics separately on port 9797.

The Service annotations are the important part. They tell Prometheus that this Service should be scraped, which port to use, and which path contains the metrics text.
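As a rough illustration of what those three annotations amount to, the sketch below assembles a scrape URL from them. The helper name `scrape_url`, the pod IP, and the fallback defaults are assumptions made for this example. In the real stack this mapping is done by relabeling rules in the chart's kubernetes-service-endpoints job, not by Python code.

```python
from typing import Optional


def scrape_url(annotations: dict, endpoint_ip: str) -> Optional[str]:
    """Return the URL a scraper would fetch, or None if scraping is not requested.

    Hypothetical helper: it only mimics the effect of the prometheus.io/*
    annotations and is not how Prometheus itself is implemented.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Assumed fallbacks for this sketch when an annotation is missing.
    port = annotations.get("prometheus.io/port", "80")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{endpoint_ip}:{port}{path}"


# The annotations from the podinfo Service in podinfo.yaml.
podinfo_annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "9797",
    "prometheus.io/path": "/metrics",
}

# 10.244.0.12 is a made-up pod IP used only for illustration.
print(scrape_url(podinfo_annotations, "10.244.0.12"))
```

A Service without the scrape annotation yields None here, which mirrors why unannotated Services never appear on the targets page for this job.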

Apply the manifest and wait for the Deployment to become ready:
```
kubectl apply -f podinfo.yaml
kubectl rollout status deployment/podinfo -n observability
kubectl get pods,svc -n observability
```
Continue only after the Pod is available and the Service exists.
Open port-forwards to both the application and its metrics endpoint:

In a third terminal:
```
kubectl -n observability port-forward svc/podinfo 9898:9898 9797:9797
```
Keep that terminal open. You will use http://127.0.0.1:9898 for test traffic and http://127.0.0.1:9797/metrics to inspect the raw exposition output. The first URL is what a normal client would hit. The second URL is what Prometheus hits on every scrape.
Confirm that Prometheus discovered the new target:

Open http://127.0.0.1:9090/targets in your browser. Look for a target from the observability namespace with the podinfo Service name.

The state should become UP within a minute. That means Prometheus discovered the Service from Kubernetes metadata, connected to podinfo on port 9797, and successfully fetched /metrics. You created that scrape relationship with three Service annotations, not with a hand-edited prometheus.yml.
Run a simple query to confirm the target exists:

In the Prometheus expression browser, run:
```
up{service="podinfo", namespace="observability"}
```
A value of 1 means the scrape succeeded. A value of 0 means the target exists but is failing to scrape.
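The same check can be scripted against the Prometheus HTTP API instead of the expression browser. The sketch below is a minimal example, assuming the port-forward on 127.0.0.1:9090 from earlier; the helper names `fetch_instant` and `describe_up` and the sample payload are illustrative, not part of the activity.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://127.0.0.1:9090"  # the port-forward from earlier in this activity
QUERY = 'up{service="podinfo", namespace="observability"}'


def fetch_instant(query: str, base_url: str = PROMETHEUS) -> dict:
    """Run an instant query through the /api/v1/query endpoint."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def describe_up(payload: dict) -> str:
    """Turn an instant-vector response for `up` into a one-line summary."""
    series = payload["data"]["result"]
    if not series:
        return "no series: the target is not in service discovery"
    # Each sample is [unix_timestamp, "value-as-string"].
    value = series[0]["value"][1]
    if value == "1":
        return "scrape succeeded"
    return "target exists but the scrape is failing"


# Illustrative payload in the shape the API returns; label values are assumptions.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "service": "podinfo", "namespace": "observability"},
                "value": [1700000000.0, "1"],
            }
        ],
    },
}
print(describe_up(sample))
# Against the live port-forward: print(describe_up(fetch_instant(QUERY)))
```

The empty-result branch matters later in the activity: when the target vanishes from discovery there is no series at all, which is a different condition from a series whose value is 0.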
Look at the raw data Prometheus is collecting:
```
curl -s http://127.0.0.1:9797/metrics | head -50
```
This is the key mental model for the rest of the activity: you are looking directly at the HTTP response that podinfo returns when someone requests /metrics. Prometheus is not inventing these numbers. podinfo is exposing them, and Prometheus is repeatedly fetching this endpoint and storing what it sees.

Find three things in the output:
- A # TYPE ... counter line: this metric only increases while the process runs and starts again from zero after a restart. The rate() function converts it to a per-second rate over a time window and accounts for those resets.
- A # TYPE ... gauge line: this is a point-in-time value that can go up or down, such as current memory usage or active connection count.
- Lines whose metric name ends in _bucket and that carry an le="..." label: histogram buckets. Buckets are cumulative, so each one counts the observations at or below its le boundary. histogram_quantile() reads across all buckets at query time to estimate any percentile you ask for.
This text format is the Prometheus exposition format. podinfo is one instrumented application that speaks it; other exporters use the same overall pattern even though their metric names differ. podinfo also has OpenTelemetry support upstream, but this activity is not using OTLP or an OpenTelemetry Collector; it is scraping the Prometheus-format /metrics endpoint directly. In this setup, podinfo is the app, the Kubernetes Service makes its metrics endpoint reachable, Prometheus is the scraper and time-series database, and Grafana will later read from Prometheus to draw dashboards. The text you just fetched is the exact payload Prometheus parses on each scrape.
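To make the format concrete, here is a small parser sketch for exposition text. The sample payload is made up for illustration (only the http_request_duration_seconds family name comes from this activity), and the parser is deliberately simplified: it assumes no spaces inside label values and no timestamps.

```python
SAMPLE = """\
# HELP http_requests_total Total HTTP requests (illustrative sample, not podinfo output)
# TYPE http_requests_total counter
http_requests_total{status="200"} 20
http_requests_total{status="500"} 5
# TYPE process_open_fds gauge
process_open_fds 11
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.5"} 24
http_request_duration_seconds_bucket{le="+Inf"} 30
http_request_duration_seconds_sum 7.5
http_request_duration_seconds_count 30
"""


def parse_exposition(text: str):
    """Split exposition text into declared metric types and raw samples."""
    types = {}    # metric family name -> "counter", "gauge", "histogram", ...
    samples = []  # (series exactly as written, float value)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            _, _, name, metric_type = line.split(maxsplit=3)
            types[name] = metric_type
        elif line.startswith("#"):
            continue  # HELP lines and other comments
        else:
            series, value = line.rsplit(" ", 1)
            samples.append((series, float(value)))
    return types, samples


types, samples = parse_exposition(SAMPLE)
for name, metric_type in types.items():
    print(f"{name}: {metric_type}")
print(f"{len(samples)} samples")
```

Notice that one histogram family expands into several series (_bucket, _sum, _count), which is why the RED queries later in this activity can draw rate, errors, and latency from a single family.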

Query RED and USE Signals

Once the target is being scraped, make it do something interesting. In this section you will generate traffic, run a small set of Prometheus checks to validate labels and target state, and then build all RED and USE panels directly in Grafana.

A normal PromQL query usually has one of these shapes:

```
# Instant selector ("what is true right now")
# this selects time series by metric name and exact label match
# we can use regex match with =~ instead of =
metric_name{label="value"}

# Range function over time ("how has this changed over 5 minutes")
# the rate function converts a counter to a per-second rate over the window
rate(metric_name{label="value"}[5m])

# Aggregated result ("combine many series into a summary")
# this aggregates series while keeping only listed labels
sum by (label1, label2) (rate(metric_name{label="value"}[5m]))
```

You can find all available query functions in the Prometheus documentation.
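To build intuition for the range-function shape, the sketch below approximates what rate() does with the samples inside a window. It is a simplification under stated assumptions: the function name `simple_rate` and the sample values are made up, and the real rate() also extrapolates toward the window edges, which this version skips.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter from (timestamp, value) samples."""
    if len(samples) < 2:
        return None  # not enough points in the window to compute a rate
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter dropped, so the process restarted and began again from zero.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Made-up scrapes every 15 seconds; the counter resets between t=30 and t=45.
window = [(0, 100), (15, 115), (30, 130), (45, 10), (60, 25)]
print(simple_rate(window))
```

The reset handling is the reason to apply rate() to the raw counter before aggregating: once several counters are summed, an individual reset can no longer be detected.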

Send a burst of mixed traffic to the service:

```
for i in {1..20}; do curl -s http://127.0.0.1:9898/ >/dev/null; done
for i in {1..5}; do curl -s http://127.0.0.1:9898/status/500 >/dev/null; done
for i in {1..5}; do curl -s http://127.0.0.1:9898/delay/1 >/dev/null; done
```

This gives you three things to measure immediately: total request volume, error count, and slow requests.

Run two quick checks in the Prometheus expression browser:

Before building dashboard panels, run two small validation queries: one to confirm target identity labels, and one to confirm request labels (path, status) and spot probe traffic.
```
sum by (job, instance, service, namespace) (up{service="podinfo", namespace="observability"})
```
```
sum by (path, status) (rate(http_request_duration_seconds_count{service="podinfo", namespace="observability"}[5m]))
```
The first query should show one podinfo target under the kubernetes-service-endpoints job. The second query makes an important behavior visible before you build panels: Kubernetes is continuously calling /readyz and /healthz, so those probe requests will dominate simple request-rate graphs unless you filter them out.
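A quick back-of-the-envelope calculation shows why the probes dominate. The sketch assumes the Kubernetes default probe period of 10 seconds (podinfo.yaml does not set periodSeconds), a single replica, and that the whole burst falls inside one 5-minute window.

```python
PROBE_PERIOD_SECONDS = 10  # Kubernetes default periodSeconds; not overridden in podinfo.yaml
PROBES = ["/readyz", "/healthz"]
WINDOW_SECONDS = 5 * 60  # the [5m] range used in the query

# One readiness and one liveness request per period, for one replica.
probe_rate = len(PROBES) / PROBE_PERIOD_SECONDS

# The burst from this section: 20 root, 5 status, and 5 delay requests.
burst_requests = 20 + 5 + 5
burst_rate = burst_requests / WINDOW_SECONDS

print(f"probe traffic: {probe_rate:.2f} req/s")
print(f"burst traffic averaged over 5m: {burst_rate:.2f} req/s")
```

Under these assumptions the steady probe traffic is about twice the averaged burst, so an unfiltered request-rate graph would mostly be showing the kubelet rather than your curl loops.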
Log into Grafana and create the dashboard shell first:

Open http://127.0.0.1:3000. Log in with username admin and password cs312grafana.

Create a new dashboard and set the time range to the last 15 minutes. You will add six panels in two rows:
- Row 1 (service behavior): Request rate, 5xx rate, p95 latency, Podinfo target up
- Row 2 (node resources): Node CPU utilization (%), Node memory used (%)
Add the RED panels in Grafana, one query at a time:

Panel 1: Request rate
```
sum(rate(http_request_duration_seconds_count{service="podinfo", namespace="observability", path=~"root|status|delay"}[5m]))
```
This is RED Rate. http_request_duration_seconds_count is a counter from the latency histogram family, so rate(...[5m]) gives requests per second. The path regex intentionally includes only the routes you generated and excludes /readyz and /healthz probe traffic.

Panel 2: 5xx rate
```
sum(rate(http_request_duration_seconds_count{service="podinfo", namespace="observability", path=~"root|status|delay", status=~"5.."}[5m]))
```
This is RED Errors. It uses the same base counter and filters to status codes matching 5.. (all 5xx). If this panel is empty, send a few more /status/500 requests, wait about 30 seconds, and refresh.

Panel 3: p95 latency
```
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="podinfo", namespace="observability", path=~"root|status|delay"}[5m])))
```
This is RED Duration at the 95th percentile. http_request_duration_seconds_bucket provides histogram bucket counters labeled by le (less-than-or-equal bucket boundary). rate() computes per-second bucket increases, sum by (le) combines matching buckets, and histogram_quantile(0.95, ...) estimates p95.
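The estimate comes from linear interpolation inside the bucket that contains the requested rank. The sketch below is a simplified version of that idea, not the Prometheus implementation; the function name `estimate_quantile` and the bucket values are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs that ends
    with the +Inf bucket, like the per-le values in the panel query.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly across the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Made-up cumulative counts: most requests are fast, five take between 1s and 2.5s.
example = [(0.1, 20), (0.5, 24), (1.0, 25), (2.5, 30), (math.inf, 30)]
print(estimate_quantile(0.5, example))
print(estimate_quantile(0.95, example))
```

Because the result is interpolated between bucket boundaries, the p95 panel can only be as precise as the bucket layout the application chose.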

These three panels use the same source metric family on purpose: _count for request volume and errors, _bucket for latency percentile.
Add a direct target-availability Stat panel:

Panel 4: Podinfo target up
```
max(up{service="podinfo", namespace="observability"}) or on() vector(0)
```
up{...} is per-target scrape health. max(...) collapses to one value. or on() vector(0) provides a fallback zero when the left side has no series at all, so the panel shows 0 instead of No data if the target disappears from discovery.
Add the USE panels in Grafana:

Panel 5: Node CPU utilization (%)
```
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))
```
node_cpu_seconds_total is a counter partitioned by CPU mode. mode="idle" selects idle time only. rate(...[5m]) gives idle seconds per second for each core, which is that core's idle fraction. avg(...) averages across cores and instances in this small cluster, and 1 - idle gives the used CPU fraction. Multiplying by 100 gives a percent.
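The same arithmetic can be worked by hand. The sketch below mirrors the query on made-up counter values for a two-core node; the function name and the numbers are assumptions for illustration only.

```python
def cpu_utilization_percent(idle_before, idle_after, elapsed_seconds):
    """Compute node CPU utilization from per-core idle-seconds counters.

    Each dict maps a core label to its idle counter value, mirroring
    node_cpu_seconds_total{mode="idle"} at two points in time.
    """
    idle_fractions = [
        (idle_after[core] - idle_before[core]) / elapsed_seconds
        for core in idle_before
    ]
    average_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 * (1 - average_idle)


# Made-up counter values sampled 300 seconds apart.
before = {"0": 1000.0, "1": 2000.0}
after = {"0": 1270.0, "1": 2240.0}
print(cpu_utilization_percent(before, after, 300))
```

Core 0 was idle for 270 of 300 seconds and core 1 for 240, so the average idle fraction is 0.85 and the node was roughly 15 percent busy.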

Panel 6: Node memory used (%)
```
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```
This computes memory usage percent as 1 - available/total. MemAvailable is the kernel's estimate of how much memory new workloads could use without swapping, so it counts reclaimable page cache as available.

Both values come from Node Exporter. In minikube this is typically one node, so these queries act as cluster-node health checks.

Your dashboard now answers three different questions: RED panels show user-visible service behavior, Podinfo target up shows scrape visibility of the service, and USE panels show node resource pressure.
Generate a second burst of traffic while watching the dashboard refresh:
```
for i in {1..10}; do curl -s http://127.0.0.1:9898/ >/dev/null; done
for i in {1..3}; do curl -s http://127.0.0.1:9898/status/500 >/dev/null; done
```
Within the next refresh interval, the request-rate and 5xx panels should jump. The CPU panel may also tick up slightly.

Watch an Alert Fire and Recover

Dashboards are useful only when someone is looking at them. In this stripped-down stack, the alert stops at Prometheus itself because Alertmanager is disabled, so you will observe rule evaluation on the Prometheus alerts page rather than send a notification. The PodinfoMissing alert uses the absent() function to detect when a target goes completely silent. That is one important ingredient in the lecture’s dead man’s switch discussion, but it is not the full pattern because no secondary receiver is checking that alerts continue to arrive.

Open the Prometheus alerts page:

Visit http://127.0.0.1:9090/alerts in your browser.

You should see the PodinfoMissing rule listed in the inactive state. That is the expected state while Prometheus can still scrape the target. It will move to Pending and then Firing only after the target disappears for the full for: 1m window.
Confirm where that rule came from:
```
helm get values metrics -n monitoring
```
You should see the alert definition under serverFiles.alerting_rules.yml. This is configuration-as-code for an alert rule: the same workflow that deploys the application also deploys its monitoring. In this activity, that configuration stops at Prometheus rule evaluation because Alertmanager is turned off in the values file.
Scale the Deployment to zero replicas:
```
kubectl scale deployment/podinfo --replicas=0 -n observability
kubectl get pods -n observability -w
```
Watch until the Pod disappears. Then return to the Prometheus alerts page.
Wait for the rule to move from Pending to Firing:

The expression uses absent() and the rule has for: 1m, so the alert first shows as Pending and then transitions to Firing after one continuous minute of absence. The for clause keeps a brief gap, such as a quick Pod restart, from moving the rule into Firing; the target must stay absent for the full duration. A single failed scrape is a different case: the up series still exists with value 0, so absent() returns nothing.
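The Pending-to-Firing behavior can be simulated in a few lines. This is a simplified sketch, assuming a 15-second evaluation interval and a 60-second hold to match the rule group in prometheus-values.yaml; the function name and the timeline are made up.

```python
def alert_states(absent_results, eval_interval=15, for_seconds=60):
    """Simulate rule states for a series of evaluations.

    absent_results[i] is True when the alert expression returned a result
    at evaluation i, meaning the target was absent at that moment.
    """
    states = []
    active_since = None  # time of the first evaluation in the current unbroken run
    for i, is_absent in enumerate(absent_results):
        now = i * eval_interval
        if not is_absent:
            active_since = None  # any recovery resets the hold timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# A brief gap, a recovery, then a sustained absence.
timeline = [False, True, False, True, True, True, True, True, True]
for i, state in enumerate(alert_states(timeline)):
    print(f"t={i * 15:>3}s {state}")
```

The brief gap at t=15s only reaches pending and is cleared by the next evaluation, while the sustained absence that starts at t=45s reaches firing at t=105s, one full minute later.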

While the alert is in Firing, the dashboard and the Prometheus alerts page give you different kinds of evidence. The RED panels show the recent request, error, and latency history from before Podinfo disappeared. The Podinfo target up panel shows the direct availability signal for the scrape target, and the USE panels show whether the node itself is still healthy. The Prometheus alerts page then shows the current state change: the PodinfoMissing rule is firing because the target has been absent for long enough. In a larger stack, Alertmanager would sit after this step and decide whether to notify anyone.
Restore the service:
```
kubectl scale deployment/podinfo --replicas=1 -n observability
kubectl rollout status deployment/podinfo -n observability
```
After the Pod returns and Prometheus scrapes it again, the alert should clear from the Firing state.
Clean up when you are finished:

Delete the demo application namespace:
```
kubectl delete namespace observability --wait=true
```
To remove the monitoring stack as well:
```
helm uninstall grafana -n monitoring
helm uninstall metrics -n monitoring
kubectl delete namespace monitoring --wait=true
```
If you plan to continue to the logging activity with the same cluster, leave the monitoring namespace in place.

Going Further

You have worked through the pull model end to end: raw exposition format, PromQL rate and quantile functions, a RED and USE dashboard, and an absence-based alert rule. Two natural next steps pull in different directions.

For deeper monitoring, compare this activity’s two-release setup to the kube-prometheus-stack chart used in the later lab. Run helm show values prometheus-community/kube-prometheus-stack | less and notice how much more of the Kubernetes ecosystem it packages: the Prometheus Operator, ServiceMonitors and PrometheusRules as Kubernetes objects, kube-state-metrics for Kubernetes-level signals, and pre-built dashboards for the cluster. Pay particular attention to ServiceMonitor, which replaces the annotation-based discovery you used here.

For deeper alerting, read the Alerting on SLOs chapter of Google's Site Reliability Workbook and try writing a burn-rate rule against the error data you collected. A burn rate of 14.4 means a 30-day error budget is exhausted in about two days (30 / 14.4 ≈ 2.1). The PromQL expression in the lecture gives you the starting point; the Workbook explains how to pair a short window with a long window to reduce false positives from transient spikes while still resetting quickly once the incident ends.
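The burn-rate arithmetic is easy to check. The sketch below assumes a 30-day SLO window and a constant burn rate; the function names are illustrative.

```python
SLO_WINDOW_DAYS = 30


def days_to_exhaustion(burn_rate, window_days=SLO_WINDOW_DAYS):
    """Days until the error budget is gone if the burn rate stays constant."""
    return window_days / burn_rate


def budget_spent_fraction(burn_rate, hours, window_days=SLO_WINDOW_DAYS):
    """Fraction of the whole error budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / (window_days * 24)


print(f"{days_to_exhaustion(14.4):.2f} days")
print(f"{budget_spent_fraction(14.4, 1):.1%} of the budget in one hour")
```

A burn rate of 1 spends the budget exactly over the 30-day window, so 14.4 corresponds to spending 2 percent of the budget in a single hour.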