Follow the Logs
This activity puts into practice the concepts from the Log Management and Incident Investigation lecture. You will first investigate a staged incident on a Linux server with journalctl, flat files, and auditd, then switch once to minikube to practice kubectl logs --previous and watch Fluent Bit enrich container logs. By the end, you will have a timestamped incident timeline from the server-side investigation and a direct view of how Kubernetes log collection works before logs reach a real backend.
The flow is split by environment on purpose. The first half stays on the Linux machine so you can work through persistent host and application logs without changing context, then the second half moves to your local minikube cluster for the container-specific part of the lecture, where logs are ephemeral and usually forwarded to a backend such as Loki or Elasticsearch. In this demo, Fluent Bit writes to stdout instead so you can inspect the forwarded records directly with kubectl logs.
What You Will Need
- Access to an Ubuntu or Debian Linux machine where you have sudo privileges
- journalctl, logger, grep, and awk on that machine
- auditd on the Linux machine. Install it if it is not already present: sudo apt-get install -y auditd && sudo systemctl enable auditd --now (this might not work on WSL2, so if you are using WSL, you can skip the auditd section)
- kubectl on your local machine and a running minikube cluster from the Minikube activity. Start minikube before class if it is not already running.
Build the Incident Bundle
Before investigating anything, you need a controlled set of logs to work with. This bundle captures the same staged failure sequence each time: a long-running export job opens too many PostgreSQL sessions, exhausts database connection slots, and eventually pushes the order API into timeouts and HTTP 500 errors. Three of the files in this bundle simulate the flat log files you would usually find under /var/log/ or an application-specific log directory, while db-host-journal.log is a saved excerpt from another machine’s systemd journal. The journal entries you write here are real: logger writes through the syslog socket at /dev/log. Many Linux services reach the journal differently, either through systemd-managed stdout and stderr or through syslog, but the resulting entries are all searchable with journalctl.
Run the following commands on your Linux machine.
-
Create the working directory:
mkdir -p ~/cs312-log-activity/bundle
cd ~/cs312-log-activity

-
Create the nginx access log:
cat > ~/cs312-log-activity/bundle/nginx-access.log << 'EOF'
203.0.113.10 - - [15/Mar/2026:03:02:05 +0000] "GET /api/orders HTTP/1.1" 200 912 "-" "curl/8.5.0"
198.51.100.24 - - [15/Mar/2026:03:02:18 +0000] "GET /healthz HTTP/1.1" 200 31 "-" "kube-probe/1.32"
203.0.113.11 - - [15/Mar/2026:03:02:44 +0000] "GET /api/orders HTTP/1.1" 200 905 "-" "curl/8.5.0"
203.0.113.12 - - [15/Mar/2026:03:03:01 +0000] "GET /api/orders HTTP/1.1" 200 918 "-" "curl/8.5.0"
198.51.100.24 - - [15/Mar/2026:03:03:10 +0000] "GET /healthz HTTP/1.1" 200 31 "-" "kube-probe/1.32"
203.0.113.13 - - [15/Mar/2026:03:03:22 +0000] "GET /api/orders HTTP/1.1" 200 921 "-" "curl/8.5.0"
203.0.113.14 - - [15/Mar/2026:03:03:44 +0000] "GET /api/orders HTTP/1.1" 200 910 "-" "curl/8.5.0"
203.0.113.15 - - [15/Mar/2026:03:04:01 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.16 - - [15/Mar/2026:03:04:07 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
198.51.100.24 - - [15/Mar/2026:03:04:10 +0000] "GET /healthz HTTP/1.1" 503 31 "-" "kube-probe/1.32"
203.0.113.17 - - [15/Mar/2026:03:04:18 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.18 - - [15/Mar/2026:03:04:31 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.19 - - [15/Mar/2026:03:05:02 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.20 - - [15/Mar/2026:03:05:19 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
198.51.100.24 - - [15/Mar/2026:03:05:22 +0000] "GET /healthz HTTP/1.1" 503 31 "-" "kube-probe/1.32"
203.0.113.21 - - [15/Mar/2026:03:05:40 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
EOF

-
Create the application JSON log:
cat > ~/cs312-log-activity/bundle/order-api.jsonl << 'EOF'
{"timestamp":"2026-03-15T03:00:52.011Z","level":"info","service":"order-api","message":"worker pool healthy","request_id":"req-1001","duration_ms":19}
{"timestamp":"2026-03-15T03:01:17.125Z","level":"error","service":"order-api","message":"database connection timeout, served stale cache","upstream_host":"postgres-prod","upstream_port":5432,"request_id":"req-1002","duration_ms":30012,"fallback":"stale-cache","http_status":200}
{"timestamp":"2026-03-15T03:02:06.993Z","level":"error","service":"order-api","message":"database connection timeout, served stale cache","upstream_host":"postgres-prod","upstream_port":5432,"request_id":"req-1003","duration_ms":30009,"fallback":"stale-cache","http_status":200}
{"timestamp":"2026-03-15T03:03:14.411Z","level":"warn","service":"order-api","message":"connection pool exhausted, retry queue growing","request_id":"req-1004","duration_ms":1050,"retry_queue_depth":7}
{"timestamp":"2026-03-15T03:03:58.731Z","level":"error","service":"order-api","message":"database connection timeout, fallback unavailable","upstream_host":"postgres-prod","upstream_port":5432,"request_id":"req-1005","duration_ms":30001,"fallback":"none","http_status":500}
{"timestamp":"2026-03-15T03:04:12.841Z","level":"fatal","service":"order-api","message":"startup dependency check failed after restart","upstream_host":"postgres-prod","upstream_port":5432,"request_id":"req-1006","duration_ms":0}
EOF

-
Create the PostgreSQL log:
cat > ~/cs312-log-activity/bundle/postgres.log << 'EOF'
2026-03-15 03:00:59 UTC [4122] LOG: checkpoint starting: time
2026-03-15 03:01:04 UTC [4122] FATAL: remaining connection slots are reserved for non-replication superuser connections
2026-03-15 03:01:21 UTC [4128] FATAL: remaining connection slots are reserved for non-replication superuser connections
2026-03-15 03:02:07 UTC [4134] FATAL: remaining connection slots are reserved for non-replication superuser connections
2026-03-15 03:03:58 UTC [4197] FATAL: remaining connection slots are reserved for non-replication superuser connections
2026-03-15 03:04:13 UTC [4201] LOG: background worker "logical replication launcher" exited with exit code 1
EOF

-
Create the database host journal excerpt:
cat > ~/cs312-log-activity/bundle/db-host-journal.log << 'EOF'
Mar 15 02:58:23 db-prod-01 systemd[1]: Started nightly-export.service - Finance CSV export.
Mar 15 02:58:24 db-prod-01 nightly-export[5111]: opened 90 database sessions for region sweep
Mar 15 02:58:29 db-prod-01 nightly-export[5111]: exporting order archive partition for 2026-03-14
Mar 15 03:01:04 db-prod-01 postgres[4122]: remaining connection slots are reserved for non-replication superuser connections
Mar 15 03:01:05 db-prod-01 systemd[1]: nightly-export.service still running after 2min 41s
Mar 15 03:04:14 db-prod-01 nightly-export[5111]: export job still holding open sessions waiting on downstream writer
EOF

-
Confirm all four files exist:
ls -1 ~/cs312-log-activity/bundle

You should see db-host-journal.log, nginx-access.log, order-api.jsonl, and postgres.log.

-
Write real journal entries on your server:
logger -t cs312-log-activity "incident bundle ready for $(whoami) on $(hostname)"
logger -p user.warning -t cs312-log-activity "practice warning: the 5xx spike begins at 03:04 UTC"

logger hands each message to the systemd journal through the syslog socket, tagged with the program name cs312-log-activity. These entries are just a controlled test signal so you can practice journal filters in the next two steps. In real systems, similar journal entries often appear naturally from services that log through systemd-managed stdout/stderr or /dev/log.

-
Query the entries you just wrote:
journalctl -t cs312-log-activity --since "5 minutes ago" --no-pager

The -t flag filters by syslog identifier. The journal indexes entries by identifier, priority, and timestamp, so this filter uses metadata rather than scanning every message as plain text.

-
Filter on priority:
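journalctl -t cs312-log-activity -p warning --since "5 minutes ago" --no-pager

Only the warning-level entry appears. The -p warning filter matches entries at warning severity or higher, and the first entry was written at logger's default notice priority. Severity is stored as structured metadata, so this filter never requires scanning message text.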
-
Check the boot history:
journalctl --list-boots | head -5

The journal keeps logs grouped by boot session. journalctl --list-boots shows those sessions, where 0 is the current boot and -1 is the previous one. If an incident crosses a reboot, journalctl -b -1 lets you jump directly to the logs from the prior boot instead of guessing a time window.
Scope the Symptom
An investigation starts at the edge: the log source closest to the customer. First find when users began failing, then move inward and backward until you can explain what set the failure in motion.
flowchart LR
  A[Web Server<br/>nginx-access.log] --> B[Application API<br/>order-api.jsonl]
  B --> C[Database<br/>postgres.log]
  C --> D[DB Host<br/>db-host-journal.log]

Read the diagram left to right during triage: start where users are failing, then move inward to application, database engine, and host-service context. If the first error you find already assumes something else went wrong earlier, widen the time window backward.
-
Count 5xx responses by minute to find the onset:
# Keep only HTTP status >= 500, extract the timestamp field (cutting seconds), then group and count by minute.
awk '$9 >= 500 {print $4}' ~/cs312-log-activity/bundle/nginx-access.log \
  | cut -d: -f1-3 | sort | uniq -c

The output has no rows for any minute before 03:04, then five 5xx responses in 03:04 and four in 03:05 (each count includes one failed /healthz probe). In this dataset, the errors appear all at once rather than climbing gradually.

-
Show the first customer-visible failure:
awk '$9 >= 500 {print; exit}' ~/cs312-log-activity/bundle/nginx-access.log

This prints the first request that actually returned a 500. The timestamp is 03:04:01, which is the first moment the problem becomes visible to a user.

-
Find the earliest application-side error:
# Find the first matching result in the application log
grep '"level":"error"' ~/cs312-log-activity/bundle/order-api.jsonl | head -1

The first application error appears at 03:01:17, almost three minutes before the first HTTP 500. That gap is realistic here because the early database timeouts were absorbed by stale-cache fallback inside order-api, so the edge still returned 200 for a while. Use this earlier application error as the anchor for the deeper investigation.

-
Correlate the degradation window across application, database, and host logs:
awk -F'"' '$4 >= "2026-03-15T03:01:00" && $4 < "2026-03-15T03:05:00"' ~/cs312-log-activity/bundle/order-api.jsonl
awk '$2 >= "03:01:00" && $2 < "03:05:00"' ~/cs312-log-activity/bundle/postgres.log
awk '$3 >= "03:01:00" && $3 < "03:05:00"' ~/cs312-log-activity/bundle/db-host-journal.log

Now that you have the first application error at 03:01:17, hold the time window fixed across the other logs instead of searching by message text. These commands keep only entries from 03:01:00 through 03:04:59 in each file so you can compare the same interval at every layer. The first command splits each JSON line on double quotes, which makes the timestamp value the fourth field because timestamp is the first key on every line of this file. order-api.jsonl shows two early database timeouts that still served stale cache, then a pool-exhaustion warning at 03:03:14, a timeout at 03:03:58 with no fallback left, and a failed startup dependency check at 03:04:12. postgres.log shows repeated connection-slot exhaustion starting at 03:01:04 and a later worker issue at 03:04:13. db-host-journal.log shows the export service still running at 03:01:05 and still holding sessions at 03:04:14.

-
Use the service name you just surfaced to search earlier entries in the same host log:
grep 'nightly-export' ~/cs312-log-activity/bundle/db-host-journal.log

The earlier lines show the export job starting at 02:58:23 and opening 90 database sessions at 02:58:24. If that service name had not appeared, you would have to look backward with a wider time window and more guesswork about what to search for.
At this point, you have surfaced the whole causal chain needed for a minimal incident timeline: precursor at 02:58, database exhaustion at 03:01, the first API timeout at 03:01:17, and the first user-visible HTTP 500 at 03:04:01.
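The correlation steps above compare four files that use four different timestamp layouts. One way to see the cause chain in a single view is to reduce every entry to an `HH:MM:SS source detail` line and sort the results together. The sketch below is self-contained: it builds a scratch directory holding one line copied from each bundle file, and the `normalize_*` function names are invented for this example. It assumes every entry falls on the same UTC day, so sorting on the time of day alone is enough.

```shell
# Sketch: merge differently formatted logs into one time-ordered view.
# Assumption: all entries are from the same UTC day, so HH:MM:SS sorts correctly.
scratch="$(mktemp -d)"   # throwaway copy; the real bundle is not touched

# One sample line per source, copied from the bundle files.
cat > "$scratch/nginx-access.log" << 'EOF'
203.0.113.15 - - [15/Mar/2026:03:04:01 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
EOF
cat > "$scratch/order-api.jsonl" << 'EOF'
{"timestamp":"2026-03-15T03:01:17.125Z","level":"error","service":"order-api","message":"database connection timeout, served stale cache","upstream_host":"postgres-prod","upstream_port":5432,"request_id":"req-1002","duration_ms":30012,"fallback":"stale-cache","http_status":200}
EOF
cat > "$scratch/postgres.log" << 'EOF'
2026-03-15 03:01:04 UTC [4122] FATAL: remaining connection slots are reserved for non-replication superuser connections
EOF
cat > "$scratch/db-host-journal.log" << 'EOF'
Mar 15 02:58:24 db-prod-01 nightly-export[5111]: opened 90 database sessions for region sweep
EOF

# nginx: field 4 looks like [15/Mar/2026:03:04:01, so the time is its last 8 characters.
normalize_nginx() {
  awk '{ print substr($4, length($4) - 7), "nginx", $9, $7 }' "$1"
}

# JSON lines: split on double quotes; field 4 is the timestamp value and
# field 8 the level. This relies on the key order used in this bundle.
normalize_api() {
  awk -F'"' '{ print substr($4, 12, 8), "order-api", $8 }' "$1"
}

# PostgreSQL: field 2 is already HH:MM:SS, field 5 is the severity.
normalize_postgres() {
  awk '{ print $2, "postgres", $5 }' "$1"
}

# Syslog-style journal excerpt: field 3 is HH:MM:SS, field 5 is the unit or tag.
normalize_journal() {
  awk '{ print $3, "db-host", $5 }' "$1"
}

merged="$(
  {
    normalize_nginx    "$scratch/nginx-access.log"
    normalize_api      "$scratch/order-api.jsonl"
    normalize_postgres "$scratch/postgres.log"
    normalize_journal  "$scratch/db-host-journal.log"
  } | sort
)"

printf '%s\n' "$merged"
rm -rf "$scratch"
```

To try the same idea on the full bundle, point each function at the matching file under ~/cs312-log-activity/bundle instead of the scratch copies. The precursor line from the database host should sort first and the nginx 500 last.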
Write the Incident Timeline
You now have enough evidence to write the incident timeline without introducing any rows you have not already seen. A good timeline is selective: include the smallest set of entries that proves when the precursor began, when the system failed internally, and when users felt the impact.
-
Set your name in a shell variable. Replace YOUR NAME with your actual name:

export CS312_NAME="YOUR NAME"

-
Create the timeline file from the evidence you just surfaced:
cat <<EOF > ~/cs312-log-activity/timeline.txt
Investigator: ${CS312_NAME}
Host: $(hostname)
Generated: $(date -u +%FT%TZ)
Incident: order-api HTTP 500 spike (15 Mar 2026 starting 03:04 UTC)
Root cause: nightly-export opened 90 long-lived PostgreSQL sessions,
exhausting PostgreSQL connection slots and leading to order-api timeouts.

TIME (UTC)  SOURCE               EVENT
02:58:24    db-host-journal.log  nightly-export opened 90 database sessions
03:01:04    postgres.log         connection slots exhausted (FATAL)
03:01:17    order-api.jsonl      first database timeout (req-1002, 30012ms)
03:04:01    nginx-access.log     first HTTP 500 on /api/orders
03:04:14    db-host-journal.log  export job still holding open sessions
EOF

Every row in this table came from a command in the previous section. The point is not to copy every line you saw. It is to keep only the entries that make the cause chain defensible.
-
Print the timeline:
cat ~/cs312-log-activity/timeline.txt

Your name, hostname, and cause chain should all appear in one clean block. Notice the gap between internal failure (03:01:04) and first user-visible symptom (03:04:01): for almost three minutes, order-api was still masking some database timeouts with stale-cache fallback before that degraded path ran out. That gap is the MTTD (mean time to detect) window if alerting fires only when the HTTP 5xx rate rises.
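The size of that gap can be checked with shell arithmetic rather than by eye. The sketch below converts two `HH:MM:SS` values from the timeline to seconds and subtracts them. `to_seconds` is a helper name invented for this example, and it assumes both times fall on the same day.

```shell
# Sketch: measure the gap between two same-day HH:MM:SS timestamps.
# to_seconds is a hypothetical helper, not part of the activity's tooling.
to_seconds() {
  h=${1%%:*}        # text before the first colon
  rest=${1#*:}      # text after the first colon
  m=${rest%%:*}
  s=${rest#*:}
  # Drop one leading zero so values such as 08 are not read as invalid octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

internal_failure="03:01:04"   # first FATAL in postgres.log
user_visible="03:04:01"       # first HTTP 500 in nginx-access.log

gap=$(( $(to_seconds "$user_visible") - $(to_seconds "$internal_failure") ))
printf 'detection gap: %d seconds (%dm %ds)\n' "$gap" $(( gap / 60 )) $(( gap % 60 ))
# prints: detection gap: 177 seconds (2m 57s)
```

Swapping in 03:01:17 as the start time gives the narrower gap between the first application error and the first HTTP 500.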
Check the Audit Trail
Before leaving the Linux machine, inspect one more log source: the kernel audit subsystem. Unlike application logs, audit records are generated at the kernel boundary, so they can show a file access even when the program that performed it writes nothing of its own.
-
Confirm auditd is running:
sudo systemctl status auditd --no-pager

The status should show active (running). If it shows inactive, run sudo systemctl start auditd before continuing.

-
See a summary of recent audit activity:
sudo aureport --summary

aureport reads /var/log/audit/audit.log and produces category totals: logins, file accesses, executions, and more. The audit subsystem has been recording events since the daemon started, with no application involvement.

-
Add a temporary watch rule on one of the bundle files:
AUDIT_FILE="$HOME/cs312-log-activity/bundle/order-api.jsonl"
sudo auditctl -w "$AUDIT_FILE" -p r -k cs312-audit

The -w option specifies the file to watch, -p r watches for read access, and -k sets a search key so you can find these events later. Some systems also print Old style watch rules are slower when you add this rule. That message is a performance warning about the older watch-rule syntax, not a sign that the rule failed. For one temporary watch on one file in this activity, you can ignore the message and continue. The rule is active until the next reboot or until you remove it explicitly.

-
Trigger the rule by reading the file:
grep '"level":"fatal"' "$AUDIT_FILE" > /dev/null
sleep 1

Your grep command read the file. The audit subsystem recorded that read at the kernel level regardless of whether grep itself does any logging.

-
Search for the audit event:
sudo ausearch -k cs312-audit --start recent

The audit record shows the file path, the process that accessed it, the real user ID, and the timestamp. This is the kind of evidence that appears in compliance audits and security investigations: what accessed this file, and when.
-
Remove the watch rule:
sudo auditctl -W "$AUDIT_FILE" -p r -k cs312-audit

The -W option (capital W) removes the specific watch. The audit log entry you just created remains in /var/log/audit/audit.log even after the rule is removed.
Trace a Crash in Kubernetes
You have finished the Linux-machine portion of the activity. Switch once to your local machine terminal for this section and the next.
In containerized environments, the current container log is often not the one that contains the crash. This section uses your local minikube cluster to practice the investigation commands before you need them in a real incident.
-
Make sure minikube is running:
minikube status

If the cluster is stopped:

minikube start --driver=docker --memory=4096 --cpus=2

-
Deploy a pod that prints errors and then exits:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: crash-demo
spec:
  restartPolicy: OnFailure
  containers:
  - name: crash-demo
    image: busybox:1.36
    command:
    - sh
    - -c
    - |
      echo "INFO starting crash-demo pid=1"
      echo "INFO listening on :8080"
      echo "INFO readiness probe succeeded"
      sleep 5
      echo "ERROR database connection timeout host=postgres-prod port=5432 request_id=req-9001"
      echo "FATAL panic: unable to start HTTP server because dependency check failed"
      exit 1
EOF

The container prints startup messages that look healthy, waits five seconds, then logs the error and exits with a non-zero code.
-
Watch the pod move through its lifecycle:
kubectl get pod crash-demo -w

You will see it move through Pending, Running, Error, and after several restarts, CrashLoopBackOff. Press Ctrl+C when you see the backoff status appear.

-
Read the current container’s log:
kubectl logs crash-demo

You see the most recently started container’s output. In a real incident, this container might look completely healthy because Kubernetes restarted it fresh after the crash. The evidence you need is in the previous container.
-
Read the previous container’s log:
kubectl logs crash-demo --previous

The --previous flag retrieves the log from the terminated container that ran before the current one. When a container crashes and restarts, the new container writes to a new log stream. Without --previous, the crash evidence disappears behind the fresh startup. If this command fails in local environments, it usually means the runtime no longer has the prior log stream (for example after rotation or cleanup). In that case, continue with kubectl describe pod crash-demo for restart evidence and kubectl logs crash-demo -c crash-demo for current crash output.

-
Inspect the cluster’s view of what happened:
kubectl describe pod crash-demo

Scroll to the Events section at the bottom. You will see restart-related entries such as Started and BackOff. These events are recorded by Kubernetes components such as the kubelet and the scheduler, not by the application. A failed image pull, an OOM kill, or a readiness probe failure would also appear here before any application log exists.

-
Clean up:
kubectl delete pod crash-demo
Watch Fluent Bit Collect Logs
The previous section showed how to pull logs from a single pod on demand. In production, a collection agent runs continuously on every node, reading container log files from the node’s filesystem, enriching each entry with Kubernetes metadata, and forwarding the result to a centralized store such as Loki, Elasticsearch, or a managed cloud logging service. In this demo, Fluent Bit writes to stdout instead so you can watch the forwarded records directly with kubectl logs.
Stay on your local machine terminal.
-
Deploy a pod that writes continuous structured output:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: log-generator
spec:
  containers:
  - name: log-generator
    image: busybox:1.36
    command:
    - sh
    - -c
    - |
      n=1
      while true; do
        printf '{"service":"order-api","level":"info","message":"heartbeat","count":%d}\n' "$n"
        n=$((n+1))
        sleep 3
      done
EOF

This pod writes one JSON line every three seconds to stdout, simulating an application that uses structured logging.
-
Deploy Fluent Bit as a DaemonSet:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: logging
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces", "nodes"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush 2
        Log_Level info
        Daemon Off

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Exclude_Path /var/log/containers/*_logging_fluent-bit-*.log
        Tag kube.*
        Refresh_Interval 5
        Skip_Long_Lines On

    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log Off
        Keep_Log On
        Labels On
        Annotations Off

    [OUTPUT]
        Name stdout
        Match *
        Format json_lines
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:3.2
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: containers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/fluent-bit.conf
          subPath: fluent-bit.conf
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: containers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluent-bit-config
EOF

The manifest creates a logging namespace, a ServiceAccount with cluster-read permissions so Fluent Bit can look up pod metadata, a ConfigMap with the pipeline configuration, and the DaemonSet itself. It mounts both /var/log and /var/lib/docker/containers from the node because, on a Docker-backed minikube node, the files under /var/log/containers/ are symlinks that ultimately resolve into Docker’s container log directory. The tail input also excludes Fluent Bit’s own container log so the demo output does not loop back into itself. Because minikube is a single-node cluster, exactly one Fluent Bit pod will start.

-
Wait for the Fluent Bit pod to be ready:
kubectl rollout status daemonset/fluent-bit -n logging

You should see daemon set "fluent-bit" successfully rolled out. If it takes more than a minute, check the pod status with kubectl get pods -n logging.

-
Stream Fluent Bit’s output:
kubectl logs -n logging -l app=fluent-bit --tail=10 -f

You will see the most recent forwarded entries immediately, then new JSON objects as they arrive. Each object represents one log entry that Fluent Bit read from /var/log/containers/ on the node. Press Ctrl+C after several lines appear.

-
Examine one log-generator entry:
kubectl logs -n logging -l app=fluent-bit --tail=200 \
  | grep '"pod_name":"log-generator"' | tail -1

In the output, look for the kubernetes object. It will contain fields that the log-generator pod never wrote:

"kubernetes":{"pod_name":"log-generator","namespace_name":"default","container_name":"log-generator",...}

The application wrote only its JSON heartbeat. Fluent Bit added the pod identity by reading the log file’s path (/var/log/containers/log-generator_default_log-generator-<id>.log) and querying the Kubernetes API for the matching pod’s metadata. The log field holds what the pod wrote; the kubernetes object holds what the pipeline added.

-
Clean up:
kubectl delete pod log-generator
kubectl delete namespace logging
kubectl delete clusterrole fluent-bit
kubectl delete clusterrolebinding fluent-bit
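The log file name is where the pod identity in the enriched record starts. The sketch below splits a name of the form `<pod>_<namespace>_<container>-<id>.log` using only shell parameter expansion. The 64-character container ID is a made-up placeholder, and the parsing illustrates the naming convention described in this section rather than Fluent Bit's own implementation.

```shell
# Sketch: recover pod, namespace, and container from a container log file name.
# fake_id is a made-up placeholder, not a real container ID.
fake_id="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
path="/var/log/containers/log-generator_default_log-generator-${fake_id}.log"

file=${path##*/}            # drop the directory
file=${file%.log}           # drop the extension

pod=${file%%_*}             # text before the first underscore
rest=${file#*_}
namespace=${rest%%_*}       # text between the first and second underscore
rest=${rest#*_}
container=${rest%-*}        # everything before the last hyphen
container_id=${rest##*-}    # everything after the last hyphen

printf 'pod=%s namespace=%s container=%s\n' "$pod" "$namespace" "$container"
# prints: pod=log-generator namespace=default container=log-generator
```

Splitting on underscores is safe here because Kubernetes pod and namespace names cannot contain them, while hyphens are common, which is why the container ID is taken from the last hyphen rather than the first.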
Going Further
You have the core investigation loop: scope the symptom, narrow the time window, broaden across adjacent sources, and write up the cause chain. The prepared bundle gave you a controlled version of that loop. The natural next step is to run the same process against live data.
If you still have a web service running from an earlier course activity, repeat this workflow on live data. Use journalctl -u <service> -p err --since "1 hour ago", inspect the matching files in /var/log, and compare timestamps to what your service was doing at that time.
To add a real log storage backend, the Loki getting-started guide walks you through deploying a minimal Loki instance. Once Loki is running, reconfigure the Fluent Bit output in this activity from stdout to Loki and write your first LogQL query. Start with {namespace="default"} to pull all logs from the default namespace, then chain |= "heartbeat" to filter by message content. Comparing response time between a wide label selector and a narrow one makes the cost tradeoff in Loki’s label-first index model concrete.
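As a starting point for that change, the fragment below sketches what the replacement [OUTPUT] section in fluent-bit.conf might look like with Fluent Bit's loki output plugin. The host name is an assumption about where your Loki service ends up, so adjust it to match your deployment. The Labels line promotes the namespace from the kubernetes metadata object to a Loki label named namespace, which is what the {namespace="default"} selector relies on.

```
[OUTPUT]
    Name loki
    Match kube.*
    # Assumption: Loki is reachable at this in-cluster address on its default port.
    Host loki.logging.svc
    Port 3100
    # Promote the namespace recorded by the kubernetes filter to a Loki label.
    Labels job=fluent-bit, namespace=$kubernetes['namespace_name']
```

Keeping the label set this small is deliberate: message content such as "heartbeat" stays in the log line and is matched with a line filter, not turned into a label.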
For deeper log parsing, install jq on your server and rewrite the JSON investigation from the “Scope the Symptom” section using jq rather than grep. A filter like jq 'select(.duration_ms > 5000)' works on any JSON log file where that field exists, without the text-matching fragility of a regex against a JSON string. That is the next step up from ad-hoc terminal searching toward a repeatable toolkit.