Log Management and Incident Investigation

Logs exist everywhere. A Kubernetes pod writes to stdout. A bare metal server sends authentication events to syslog. A network switch stores interface errors in a local log buffer. A hardware storage controller records disk failures in a firmware event log that persists across reboots. A database engine writes a transaction log that is separate from anything the operating system sees. The formats differ, the tools differ, and the collection architectures differ, but the underlying fact does not change: every system that processes requests or manages state leaves a record of what it did, and that record is the primary evidence for understanding what went wrong and why.

Log management is the practice of collecting, storing, and querying those records systematically. When an alert fires and metrics tell you that the error rate is elevated, the answer to “why” is almost always waiting in a log. This lecture covers where logs come from across the full range of infrastructure environments, how log collection and forwarding work at both small and large scale, how to search logs effectively under time pressure, how container logging in Kubernetes works differently from traditional Linux logging, and how to use logs to reconstruct the sequence of events during an incident.

Metrics and logs are complementary. You cannot replace one with the other, and understanding why helps you know which to reach for first.

Metrics are collected at regular, predetermined intervals regardless of whether anything interesting is happening. In many production setups, Prometheus scrapes memory usage, CPU utilization, and request counts every 15 seconds, though the upstream default global scrape interval is 1 minute unless overridden. The result is a dense, uniform time series: every interval of the past week has a data point, even the ones where nothing was wrong. This uniformity is what makes metrics cheap to store, fast to query, and well-suited for dashboards and alerting.

Logs are event-driven. A log entry is written when something happens: a request completes, an error is thrown, a configuration file is reloaded, a user authenticates. Between events, nothing is written. A quiet system writes few logs; a system under heavy load or under attack writes many. This variable density is the first thing that distinguishes log analysis from metrics analysis.

The more important difference is cardinality. Each log line may contain values that are unique or nearly unique: a specific user ID, a particular request path, an error message with a line number, a transaction ID that traces through multiple services. This richness is what makes logs irreplaceable for diagnosis. A metric that says “error rate is 5%” tells you something is wrong. A log line that says “connection refused at postgres-prod.internal:5432 from worker-node-3 after 30-second timeout” tells you exactly what is wrong, which system is involved, and where the failure occurred.

In practice: metrics detect problems and quantify their scope. Logs explain what specifically happened and why. In a well-instrumented system, an alert fires from a metric, and the investigation begins in the logs.

Before you can search logs effectively, you need to understand their structure. Regardless of source or format, every useful log entry carries the same core information.

The timestamp is the single most important field. Without it, you cannot reconstruct a sequence of events or determine whether a log entry predates or follows an alert. Timestamps should be precise (millisecond or microsecond resolution in busy systems), include the timezone or use UTC explicitly, and conform to a standard format such as ISO 8601: for example, 2026-03-15T03:04:12.841Z. Systems without accurate clock synchronization via NTP produce timestamps that cannot be reliably correlated across machines. If server A’s clock is 5 seconds ahead of server B’s, the logs from both servers will tell a misleading causal story.
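
A minimal Python sketch of the two points above: emitting timestamps in the UTC ISO 8601 form shown, and how a 5-second clock offset reorders events across hosts. The function names and the sample events are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc_timestamp(moment):
    """Format an aware datetime as ISO 8601 UTC with millisecond precision."""
    text = moment.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def parse_timestamp(text):
    """Parse an ISO 8601 timestamp, reading a trailing Z as UTC."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def order_as_logged(events):
    """Order events by the timestamp each host would record.

    events: (true_time, clock_offset_seconds, label) tuples, where the
    offset models how far that host's clock runs ahead of real time.
    """
    stamped = [(true_time + timedelta(seconds=offset), label)
               for true_time, offset, label in events]
    return [label for _, label in sorted(stamped)]
```

If server A's clock runs 5 seconds ahead and A sends a request that B receives one second later, `order_as_logged` places B's entry first: the effect appears to precede the cause.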

The severity level classifies the importance of the event. The syslog standard defines eight levels numbered 0 through 7, from EMERG (the system is unusable) at the top to DEBUG (fine-grained detail intended for development use) at the bottom. Most application frameworks use a simplified subset: DEBUG, INFO, WARN, ERROR, and FATAL or CRITICAL. In production, most services should log at INFO or WARN and above. Running at DEBUG in production generates enormous volume, can expose sensitive data in log output, and obscures the signals you need when investigating an incident.

| Level | Name | Meaning |
|---|---|---|
| 0 | EMERG | System is unusable; immediate action required |
| 1 | ALERT | Action must be taken immediately |
| 2 | CRIT | Critical condition (hardware failure, etc.) |
| 3 | ERR | Error condition; something failed |
| 4 | WARNING | Unexpected condition but system continues |
| 5 | NOTICE | Normal but significant event |
| 6 | INFO | Routine informational message |
| 7 | DEBUG | Fine-grained detail for development |

The source identifies which process, service, or component wrote the entry: the hostname, the process name, the container name, or a structured service identifier. In an environment where many services write logs to the same centralized store, source identification is essential for filtering.

The message body is the human-readable or machine-parseable description of the event. A good message is specific: “database connection refused at postgres-prod:5432 after 30-second timeout, transaction ID 7e4a2b” is useful. “Connection error” is not. Messages that appear during incidents often become search strings, so specificity matters operationally.

In a system where a single user action triggers requests across multiple services, individual log entries from each service are difficult to connect without a shared identifier. A correlation ID (also called a request ID or trace ID) is a UUID generated at the entry point of a request and propagated through all downstream calls via HTTP headers (typically X-Request-ID or X-Correlation-ID). Every log entry emitted while processing that request includes the correlation ID as a structured field.

This single practice transforms multi-service incident investigation. Instead of correlating log entries from five services by timestamp (approximate and error-prone), you can filter all five services’ logs in a centralized logging system by a single ID. The correlation ID is the practical bridge between logs and traces: it provides trace-like capability (following a request across services) without requiring a full distributed tracing infrastructure. When searching logs in a query language, you might filter by a label like request_id="7e4a2b93-..." or parse JSON fields to find matching values. The specific syntax depends on your logging platform.
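
As a sketch of how propagation can work inside one service, the Python example below stores the correlation ID in a context variable and uses a logging filter to stamp it on every record. The logger name, the log format, and the `handle_request` function are hypothetical; a real service would hook this into its web framework's middleware.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID for whichever request is currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("order-api-demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_request(logger, headers):
    """Reuse the caller's X-Request-ID if present, otherwise mint a UUID."""
    request_id = headers.get("X-Request-ID") or str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("order received")
        logger.info("calling payment service")
        # Headers to attach to the downstream call so the ID keeps flowing.
        return {"X-Request-ID": request_id}
    finally:
        request_id_var.reset(token)


def demo():
    buffer = io.StringIO()
    outgoing = handle_request(build_logger(buffer), {"X-Request-ID": "7e4a2b93"})
    return outgoing, buffer.getvalue().splitlines()
```

Because the filter reads the ID from the context variable, application code does not need to pass it to each logging call.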

Syslog is frequently described as a log format, but it is more accurately a protocol, a daemon ecosystem, and a convention that predates Linux itself. Understanding all three layers helps you reason about how logs flow in traditional environments and why the cloud-native logging stack was built the way it was.

RFC 5424 defines syslog as a network protocol for forwarding log messages from a sender to a receiver. Each message includes a priority value (encoding both facility and severity), a timestamp, a hostname, an application name, a process ID, and a message body. The message is transmitted over the network either as UDP datagrams or over a TCP connection.
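
The priority value packs both numbers into one integer: facility multiplied by 8, plus severity. A minimal Python sketch of decoding it follows; the function name is made up, and the sample message is illustrative rather than captured from a real system.

```python
import re

# Severity names in numeric order, matching the table of syslog levels.
SEVERITIES = ["EMERG", "ALERT", "CRIT", "ERR", "WARNING", "NOTICE", "INFO", "DEBUG"]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def decode_priority(message):
    """Return (facility_number, severity_name) from a leading <PRI> value."""
    match = PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message does not start with a <PRI> field")
    priority = int(match.group(1))
    if priority > 191:  # highest valid value: facility 23, severity 7
        raise ValueError("priority value out of range")
    facility, severity = divmod(priority, 8)
    return facility, SEVERITIES[severity]
```

For example, a message beginning with `<34>` decodes to facility 4 (auth) and severity 2 (CRIT), because 34 = 4 × 8 + 2.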

UDP syslog (port 514) is fire-and-forget: the sender emits a datagram and does not wait for acknowledgment. This is fast and low-overhead, but log messages are silently dropped if the receiver is unavailable or if the network is congested. For non-critical log data where some loss is acceptable, UDP is still widely used. TCP syslog (also port 514 by convention, or 6514 for TLS-encrypted transport) is more reliable because TCP retransmits lost segments and exposes broken connections. That still is not an end-to-end delivery guarantee at the syslog application layer: if the connection breaks, the sender may not know exactly which messages the receiver actually committed. For security and audit logs where loss must be minimized, TCP or TLS-encrypted TCP is the normal choice.

On modern Linux systems, the syslog protocol is implemented by a daemon: most commonly rsyslog (Rocket-fast System for Log processing) or syslog-ng. Both daemons receive log entries from local sources (the kernel, systemd, application processes writing to the syslog socket at /dev/log), apply rules for filtering and routing, and deliver the results to configured destinations.

The routing capabilities of rsyslog and syslog-ng are what distinguish them from simple log files. A ruleset might say: entries with facility auth and severity warning or higher go to /var/log/auth.log and are also forwarded over TCP to the central syslog server; entries from the web server process go to /var/log/nginx/error.log; everything else goes to /var/log/syslog. This granular routing is how traditional Linux environments have handled log aggregation for decades without agents like Fluent Bit.

The standard centralized logging architecture in traditional (pre-container) Linux environments uses syslog forwarding: each host runs rsyslog, which forwards log entries in real time over TCP to a central syslog aggregator. The aggregator writes to disk, feeds into a log management platform, or both.

This architecture works well at moderate scale and requires no third-party agents beyond the rsyslog daemon that is already running on every host. Its limitations become visible at large scale: high-volume hosts can saturate the aggregator, TCP connections must be maintained from every host to the central server, and adding metadata (Kubernetes namespace, service name, environment label) requires rsyslog configuration rather than automated enrichment. These limitations are why purpose-built collection agents like Fluent Bit emerged and why container environments generally use agent-based collection rather than syslog forwarding.

JSON structured logging has become the standard in cloud-native environments, particularly for applications running in containers. Instead of a flat string, each log entry is a JSON object with named fields:

{
"timestamp": "2026-03-15T03:04:12.841Z",
"level": "error",
"service": "order-api",
"request_id": "7e4a2b93-f1c2-4d8e-a391-b2c3d4e5f678",
"message": "upstream connection timeout",
"upstream_host": "10.0.1.23",
"upstream_port": 3000,
"client_ip": "203.0.113.45",
"request_path": "/api/orders",
"duration_ms": 30012
}

Structured logs are trivial to parse programmatically and work naturally with centralized logging systems. A log aggregation system can index every field and support queries like “find all entries where upstream_host is 10.0.1.23 and duration_ms is greater than 5000.” The tradeoff is that JSON entries are harder to read at a glance than a compact syslog line, especially when many fields are present.
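
To make the query above concrete, here is a small Python sketch that applies the same filter to JSON log lines, one object per line. It is not how an aggregation system indexes data internally; it only shows why named fields make the filter trivial. The function name is hypothetical.

```python
import json


def matching_entries(lines, upstream_host, min_duration_ms):
    """Yield entries for one upstream host that took longer than a threshold."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        duration = entry.get("duration_ms")
        if (entry.get("upstream_host") == upstream_host
                and isinstance(duration, (int, float))
                and duration > min_duration_ms):
            yield entry
```

Calling `matching_entries(lines, "10.0.1.23", 5000)` on a file's lines returns only the slow requests to that host, with every other field still available for further filtering.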

In practice, many production systems use structured logging for the data pipeline (so centralized storage can index and filter efficiently) while supporting a human-readable format in development environments. A logging library configured with an environment variable can switch between formats without code changes.
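
One way this can look with Python's standard logging module is sketched below. The `LOG_FORMAT` variable name and its values are assumptions made for this example, not a convention of any particular library.

```python
import json
import logging
import os
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        return json.dumps({
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        })


def make_handler(stream=None, environ=os.environ):
    """Pick the output format from LOG_FORMAT (hypothetical variable name)."""
    handler = logging.StreamHandler(stream)  # None means sys.stderr
    if environ.get("LOG_FORMAT", "json") == "text":
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    else:
        handler.setFormatter(JsonFormatter())
    return handler
```

Setting `LOG_FORMAT=text` on a developer machine yields compact readable lines, while the default keeps JSON for the collection pipeline; the application's logging calls are identical in both cases.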

Log entries do not appear spontaneously. They are produced by logging frameworks embedded in application code, and those frameworks make consequential decisions about what gets written, at what level, in which format, and to which destination.

Log4j (Java) is one of the most widely known logging frameworks in enterprise software. Log4j2 (the current version) supports appenders (destinations: file, stdout, syslog socket, network endpoint), layouts (text, JSON, pattern-based), and filters (include or exclude entries matching conditions). Log4j2 also supports dynamic log level adjustment: an operator can change the log level of a running application from INFO to DEBUG without a restart, capture what they need, and return it to INFO. Depending on the deployment, this is done through JMX (where it is enabled), through a configuration file that Log4j2 is set to re-read periodically, or through an HTTP endpoint when the surrounding framework exposes one (Spring Boot Actuator is a common example). This is valuable for diagnosing issues that are difficult to reproduce in non-production environments.

Python’s logging module, Go’s zap and logrus, Node.js’s Winston, and PHP’s Monolog follow the same conceptual model: a logger hierarchy, severity level filtering, multiple output handlers, and configurable formatters. The configuration layer is what lets operators change log verbosity and destination without redeploying.

Understanding the logging framework in use for a given application matters during incidents. If a service is producing no log output during a known error condition, the first question is whether the log level is set too high (suppressing the relevant entries) or whether the framework’s output handler is misconfigured (sending logs to a file that is being rotated, or to a network destination that is unreachable).

Log Verbosity, Performance, and What Not to Log

Log verbosity has a direct performance cost. Each log entry requires serialization (converting the in-memory object to text or JSON), a write system call, and potentially a network transmission if logs are forwarded in real time. A high-traffic service logging at DEBUG level can write several times as much data as it does at INFO and consume significant CPU for serialization alone. On a server with limited I/O bandwidth, excessive logging can become a performance bottleneck that affects the service it is instrumenting.

Production services should run at INFO or WARN level by default. DEBUG level should be enabled only temporarily, for specific components or request paths, when diagnosing a problem that cannot be reproduced at a lower verbosity. Modern logging frameworks support scoped level adjustment: you can set one package or class to DEBUG while the rest of the application stays at INFO, minimizing the performance impact.
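
A short sketch of scoped level adjustment using Python's logging module, where loggers form a dotted hierarchy and children inherit the nearest configured level. The logger names (`shop`, `shop.payments`) are hypothetical.

```python
import logging


def configure_levels():
    """Keep the application at INFO, but open up one component to DEBUG."""
    logging.getLogger("shop").setLevel(logging.INFO)
    # Only this subtree emits DEBUG; its siblings still inherit INFO.
    logging.getLogger("shop.payments").setLevel(logging.DEBUG)
```

After `configure_levels()`, `logging.getLogger("shop.payments.client")` emits DEBUG entries while `logging.getLogger("shop.orders")` does not, so the extra volume is confined to the component under investigation.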

Some data must never appear in logs, regardless of log level:

Passwords and authentication credentials of any kind, including plaintext passwords during failed authentication attempts. Many naive authentication log implementations inadvertently log the submitted password alongside the username in failure messages.

Session tokens, API keys, and secrets. A session token in a log entry is a session token that can be stolen by anyone with access to the logs. Log storage, centralized logging platforms, and SIEM systems are typically accessible to many engineers; logs are not a secret store.

Full credit card numbers, social security numbers, and other regulated PII (Personally Identifiable Information). PCI-DSS (Payment Card Industry Data Security Standard), HIPAA (Health Insurance Portability and Accountability Act), GDPR (General Data Protection Regulation), and most other compliance frameworks prohibit or tightly restrict storing regulated identifiers in plaintext, and logs count as storage. Logging masked or tokenized versions (the last four digits of a card number, a token or keyed hash standing in for a Social Security Number) is generally acceptable; logging the full value is not. A plain unsalted hash of a short identifier is weak protection, because the small space of possible values can be enumerated.

Full request bodies when those bodies may contain sensitive form fields. Logging request paths and response codes is standard practice; logging the full request body is usually not, because form submissions, API payloads, and file uploads often contain sensitive data you did not anticipate.

The operational principle is to log identifiers that let you find and correlate records (user IDs, request IDs, transaction IDs) without logging the sensitive values those identifiers protect (passwords, tokens, regulated PII). This distinction keeps logs useful for investigation while keeping them safe for broad access.
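
The principle can be applied mechanically before an event reaches the logger. The Python sketch below is illustrative only: the field names in `SENSITIVE_KEYS` and the `card_number` key are assumptions, and a real deployment would derive them from its own data model.

```python
import re

# Hypothetical field names whose values must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "api_key", "session_token"}


def mask_card_number(number):
    """Keep only the last four digits of a card number."""
    digits = re.sub(r"\D", "", number)
    if len(digits) < 8:
        return "[REDACTED]"  # too short to mask safely
    return "*" * (len(digits) - 4) + digits[-4:]


def redact(event):
    """Return a copy of the event that is safe to log."""
    safe = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            safe[key] = "[REDACTED]"
        elif key.lower() == "card_number":
            safe[key] = mask_card_number(str(value))
        else:
            safe[key] = value  # identifiers such as user_id pass through
    return safe
```

Identifiers such as `user_id` and `request_id` pass through unchanged, so the entry stays useful for correlation.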

On a modern Linux system using systemd, most logs flow through two channels: the systemd journal and traditional log files under /var/log.

The systemd journal is a structured, binary log database managed by systemd-journald. It captures output from all systemd-managed services, kernel messages, and anything sent through the syslog socket. You query it with journalctl. The journal’s defining advantage over plain text log files is indexing: entries are indexed by unit, priority, timestamp, and other metadata, so filtering is fast even on a busy system. The journal is stored in /var/log/journal/ when persistent storage is configured (preferred for production, because it survives reboots) or in /run/log/journal/ for volatile storage (which resets on reboot).

The /var/log directory remains the traditional home for plain-text log files. A typical listing includes:

  • /var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL/CentOS): general system messages
  • /var/log/auth.log (Debian) or /var/log/secure (RHEL): authentication events, SSH sessions, sudo usage
  • /var/log/kern.log: kernel messages
  • /var/log/nginx/: Nginx access and error logs
  • /var/log/mysql/ or /var/log/postgresql/: database logs

Plain-text log files are readable with cat, less, grep, and awk without special tooling. That simplicity is a durability advantage: you can process log files from a decade ago with the same tools you use today.

The systemd journal and /var/log capture events that processes choose to emit. The Linux kernel audit subsystem, managed by auditd, captures events at the system call level regardless of whether any application has been configured to log them. This distinction is fundamental: an application can be compromised and its own logging disabled, but kernel-mediated audit records can still capture actions the application would never choose to report. That does not make audit logs magically tamper-proof. A sufficiently privileged attacker can change rules or disable auditing unless the system has been configured to prevent that, which is why sensitive systems ship audit records off-host quickly.

You configure the audit subsystem using auditctl rules. Common rules include watching specific files for access or modification, recording all executions of privileged commands, and logging privilege escalation events:

# Log all writes to the sudoers file
auditctl -w /etc/sudoers -p wa -k sudoers-change
# Log all executions of su
auditctl -w /usr/bin/su -p x -k privilege-escalation
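# Log common file-open syscalls in /etc by regular login users (64-bit systems).
# The auid filters are quoted so the shell does not treat ">" as a redirect;
# 4294967295 is the "unset" login UID used by processes with no login session.
auditctl -a always,exit -F arch=b64 -S open -S openat -F dir=/etc -F 'auid>=1000' -F 'auid!=4294967295' -k etc-access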

The audit log at /var/log/audit/audit.log records structured entries for each matched event, including the timestamp, the user who triggered it, the process name and PID, and the specific system call and arguments. This is the log that compliance auditors typically request: PCI-DSS requires audit trails for access to cardholder data systems, HIPAA requires audit logs for access to protected health information, and FedRAMP (Federal Risk and Authorization Management Program) requires system-level audit logging with specific retention periods.

ausearch and aureport are the command-line tools for querying the audit log. ausearch -k sudoers-change --start today returns all audit events tagged with the sudoers-change key since midnight, making it easy to answer “who changed /etc/sudoers and when?”

Unlike application logs, audit logs are intended for accountability and compliance rather than operational telemetry. They should be shipped to a centralized, append-only store as quickly as possible: a compromised host with root access can modify /var/log/audit/audit.log, but audit entries that have already been shipped to a centralized store are preserved.

Some failures appear in logs that exist below the operating system level. Hardware components maintain their own event logs independently of whether the OS is running, and these logs can contain early warnings that precede OS-level symptoms by hours or days.

The Baseboard Management Controller (BMC) is a small embedded processor on server motherboards that operates independently of the main CPU and OS. It maintains a System Event Log (SEL): a persistent record of hardware events including power supply failures, fan speed anomalies, temperature threshold crossings, CPU errors, and memory errors. Dell's BMC implementation is called iDRAC and HPE's is called iLO; most server BMCs, including these, can be queried over the IPMI protocol.

# Read the System Event Log via IPMI
ipmitool sel list

ECC (Error-Correcting Code) memory corrects single-bit errors automatically, but it records them. A server whose memory is generating increasing numbers of correctable ECC errors is approaching hardware failure; the ECC error rate is visible in the kernel log (journalctl -k | grep "EDAC") and in the BMC SEL long before the memory fails catastrophically and causes an OS crash. Catching this signal early allows planned hardware replacement instead of emergency response.

Disk controllers similarly maintain their own logs and expose health data through S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology). The smartctl -a /dev/sda command reads a drive’s health data including reallocated sector count, uncorrectable error count, and temperature history. Reallocated sectors are a leading indicator of imminent drive failure: the drive is mapping failed sectors to spare areas, which is a mechanism with a finite capacity.

These hardware-layer signals are not visible in application logs or even in the systemd journal until the hardware degradation causes an OS-level symptom. An operator who only reads application logs will miss early warning signs that would allow proactive intervention.

Log data does not only inform human decisions; it can drive automated operational responses. fail2ban is the canonical example: a daemon that parses log files in real time, matches patterns using configurable regular expressions, and executes actions when thresholds are crossed.

The default use case is SSH brute-force protection. fail2ban watches /var/log/auth.log (or the systemd journal unit for sshd), counts failed authentication attempts per source IP address within a rolling time window, and when an IP exceeds the configured threshold (typically 5 to 10 failures), executes a firewall action, often by adding iptables or nftables rules that block traffic from that IP for a configured duration (typically 10 minutes to 1 hour).

[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 5
bantime = 3600
findtime = 600
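
The counting logic behind maxretry and findtime can be sketched as a sliding window per source IP. The Python below illustrates the idea only; it is not fail2ban's implementation, and the class and method names are made up.

```python
from collections import defaultdict, deque


class FailureTracker:
    """Count failures per IP inside a rolling time window."""

    def __init__(self, maxretry=5, findtime=600):
        self.maxretry = maxretry  # failures that trigger a ban
        self.findtime = findtime  # window length in seconds
        self._failures = defaultdict(deque)

    def record_failure(self, ip, timestamp):
        """Record one failure; return True when the IP should be banned."""
        window = self._failures[ip]
        window.append(timestamp)
        # Drop failures that fell out of the rolling window.
        while window and timestamp - window[0] > self.findtime:
            window.popleft()
        return len(window) >= self.maxretry
```

With the settings above, five failures from one address within 600 seconds return True on the fifth call, while the same five failures spread over an hour never do.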

fail2ban ships with filters for dozens of common applications: Nginx, Apache, Postfix, Dovecot, vsftpd, and others. You can also write custom filters using Python regular expressions to match any log pattern specific to your application. A web application that logs failed login attempts in a structured format can have a fail2ban filter that blocks IPs after repeated authentication failures against the application’s own authentication system, not just SSH.

The broader principle that fail2ban illustrates is that logs are a real-time stream that can drive operational behavior, not just a historical record to consult after the fact. This same principle underlies log-based alerting in Loki and complex event processing in SIEM systems.

When you SSH into a Linux server and read /var/log/nginx/access.log, you are reading a file that Nginx wrote directly to disk. In a containerized, Kubernetes-managed environment, this relationship changes fundamentally. The Kubernetes logging architecture separates application output, node-level collection, and centralized aggregation. Understanding that split is essential for effective troubleshooting, because the behavior of kubectl logs is surprising until you understand the mechanism behind it.

Containers are expected to write their log output to standard output (stdout) and standard error (stderr). This is a deliberate design decision: it makes log routing and centralization the responsibility of the infrastructure rather than the application. The application does not need to know where its logs go; it writes to stdout/stderr, and the container runtime handles the rest. This also makes containers portable: the same image behaves consistently whether it runs locally, on a VM, or in a Kubernetes cluster.

The container runtime intercepts stdout/stderr from each container and writes it to log files on the node’s filesystem, typically at /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/0.log. Each container has its own log file. With CRI runtimes such as containerd and CRI-O, the kubelet rotates these files as they grow, according to its containerLogMaxSize and containerLogMaxFiles settings.
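
When you are on a node looking at these files directly, the path itself tells you which pod and container a file belongs to. A minimal Python sketch of splitting the layout shown above; the function name and the sample path used with it are hypothetical.

```python
from pathlib import PurePosixPath


def parse_pod_log_path(path):
    """Split a /var/log/pods/... path into its namespace, pod, and container."""
    parts = PurePosixPath(path).parts
    # Expected: ('/', 'var', 'log', 'pods', '<ns>_<pod>_<uid>', '<container>', '<file>')
    if len(parts) != 7 or parts[:4] != ("/", "var", "log", "pods"):
        raise ValueError("not a pod log path: " + path)
    # Namespaces and pod names cannot contain underscores, so this split is safe.
    namespace, pod, pod_uid = parts[4].split("_")
    return {
        "namespace": namespace,
        "pod": pod,
        "pod_uid": pod_uid,
        "container": parts[5],
        "file": parts[6],
    }
```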

The kubelet manages these log files and exposes them through the Kubernetes API. When you run kubectl logs <pod-name>, you are making an API request to the Kubernetes API server, which forwards the request to the kubelet running on the node where the pod is scheduled. The kubelet reads from the container’s log file and streams the content back. This indirection is what makes kubectl logs work from any machine with API access to the cluster, without needing node-level SSH access.

When a container crashes and Kubernetes restarts it, the restarted container is a new process writing to a new log file. The log from the crashed container is not in the current container’s log stream. The kubelet keeps the log of the most recently terminated instance of the container (until the pod is removed from the node or the container restarts again), and --previous retrieves that saved log:

kubectl logs <pod-name> --previous

Without --previous, you see only the logs from the currently running container, which started fresh after the restart. The error that caused the crash is in the previous container’s log. This behavior surprises engineers used to traditional log files that accumulate indefinitely: in Kubernetes, the current and previous container logs are separate, and you must know which you are asking for.

Additional flags: --since limits output to entries within a time window (kubectl logs <pod-name> --since=1h), -f streams new log entries in real time (equivalent to tail -f), and -c specifies which container to query in a multi-container pod.

Container logs capture what the application wrote to stdout and stderr. Kubernetes events capture what the cluster’s control plane observed: the scheduler assigning a pod to a node, the kubelet pulling a container image, the OOM killer terminating a container, a readiness probe failing repeatedly. These are two separate channels of information.

kubectl describe pod <pod-name> is frequently the fastest path to root cause for pod-level failures. The Events section at the bottom shows the cluster’s record of what happened to the pod: image pull errors, resource limit violations, liveness probe failures, and evictions all appear here. These are infrastructure-level events that the application cannot see and therefore cannot log. An image pull failure appears in the Events section before any application logs exist, because the container never started.

Kubernetes events are short-lived by default. In upstream Kubernetes, the kube-apiserver default for --event-ttl is 1h, though cluster operators and managed platforms can change this. If you are investigating an incident that happened hours ago, the events may no longer be available even if the pod still exists. This is one of the practical reasons to export events or persist them in a centralized observability system.

The fastest path to a specific log entry on a Linux system is a small set of command-line tools. They are worth knowing well even when you use a centralized logging platform in production, because the platform will not always be available when you are debugging a fresh instance, a newly provisioned VM, or a node that has lost connectivity.

The commands in this section rely on regular expressions for pattern matching. The Shell Scripting and Automation Basics lecture covers regex syntax in depth, including the difference between basic and extended regular expressions, character classes, quantifiers, and anchors. This section focuses on the practical application of those tools to log investigation. If you encounter a regex pattern here that is unfamiliar, refer to the shell scripting lecture for the underlying syntax and concepts, or visit regular expressions 101 for interactive testing and explanation.

grep searches for lines matching a pattern. For case-insensitive search:

grep -i "error" /var/log/nginx/error.log

To show context around each match, -B adds lines before and -A adds lines after:

grep -B2 -A2 "Connection refused" /var/log/nginx/error.log

To exclude matching lines, useful for filtering out health check requests:

grep -v "GET /healthz" /var/log/nginx/access.log

journalctl queries the systemd journal with rich filtering. Filtering simultaneously by service unit, severity level, and time window:

journalctl -u nginx.service -p err --since "2026-03-15 02:50:00" --until "2026-03-15 03:10:00"

To follow new entries in real time:

journalctl -u myapp.service -f

For plain-text log files, tail -f streams new lines as they are written. Piping to grep watches for a specific pattern in the real-time stream:

tail -f /var/log/nginx/access.log | grep " 500 "

awk excels at extracting and summarizing fields from structured text. To count how many times each status code appeared in an Nginx access log (the status code is the ninth field in the default log format):

awk '{print $9}' /var/log/nginx/access.log | \
sort | \
uniq -c | \
sort -rn | \
head
# awk '{print $9}': print only the HTTP status code field
# sort: group identical status codes together
# uniq -c: count each grouped status code
# sort -rn: sort counts numerically, highest first
# head: show only the top results

To find the top client IPs generating 500 errors:

awk '$9 == 500 {print $1}' /var/log/nginx/access.log | \
sort | uniq -c | sort -rn | head
# awk '$9 == 500 {print $1}': filter rows with status 500, then print client IP field
# sort | uniq -c | sort -rn | head: count by IP and show highest-frequency sources first

This tells you immediately whether 500 errors are coming from all clients (suggesting a server-side problem) or a small number of sources (suggesting a bad request pattern or a targeted attack).
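
When the same summaries are needed inside a script rather than at a prompt, the logic translates directly. The Python sketch below assumes the default Nginx access log layout described above (client IP in the first whitespace-separated field, status code in the ninth); the function names are made up.

```python
from collections import Counter


def status_counts(lines):
    """Count occurrences of each HTTP status code (ninth field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 9:
            counts[fields[8]] += 1
    return counts


def top_error_clients(lines, status="500", limit=10):
    """Return (client_ip, count) pairs for one status code, busiest first."""
    clients = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 9 and fields[8] == status:
            clients[fields[0]] += 1
    return clients.most_common(limit)
```

In practice, `lines` would be an open file object for the access log, which Python iterates line by line without loading the whole file into memory.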

Regular expressions extend grep’s ability to match patterns. To match any HTTP 5xx response:

grep -E '" [5][0-9]{2} ' /var/log/nginx/access.log

To extract and count usernames from failed SSH authentication attempts:

grep "Failed password for" /var/log/auth.log | \
sed -E 's/.*Failed password for (invalid user )?([^ ]+).*/\2/' | \
sort | uniq -c | sort -rn
# grep "Failed password for": keep only failed password lines
# sed -E ...: extract the username token, handling both valid and "invalid user" cases

The username counts make the pattern visible within seconds. A long list of failures for root, admin, and test usually indicates an automated brute-force scan that is probably not targeted, while repeated failures against actual employee usernames suggest a targeted attempt such as credential stuffing. Checking the source IPs on the same log lines shows whether the attempts come from many different addresses or from a single unfamiliar one.

Effective log investigation is less about memorizing commands and more about having a systematic approach. Three phases structure most successful investigations regardless of whether the incident is on a single server or across a Kubernetes cluster.

Scope the problem first. Before opening a log file, define exactly what you are looking for. What is the symptom: HTTP 500 errors, a pod that will not start, slow database queries? Which service or component is exhibiting it? When did it start? Scoping forces precision and prevents you from aimlessly searching large log files without a hypothesis. The more specifically you can state the symptom, the more precisely you can filter the logs.

Narrow the time window. Log files for busy services can contain millions of entries per day. Use timestamps to focus on the window around when the symptom appeared. The cause precedes the effect; look earlier than the symptom, not later.

Before reading individual lines, start with a count. A per-minute frequency histogram reveals when errors began and whether the pattern is continuous or bursty:

awk '{print $4}' /var/log/nginx/access.log | \
cut -d: -f1-3 | \
sort | uniq -c
# awk '{print $4}': print the timestamp field from each log line
# cut -d: -f1-3: keep date+hour+minute for per-minute buckets
# sort: group identical minute values together
# uniq -c: count entries per minute

Patterns invisible in a wall of raw log text become obvious in a histogram. A sudden spike at one timestamp suggests a discrete event (a deployment, a process crash, a configuration change). A gradual climb over 10 minutes suggests degradation (a memory leak, a slow query worsening over time). The pattern changes your investigation strategy.
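One limitation of the `uniq -c` approach is that minutes with no matching lines simply do not appear, which can hide the exact moment a burst began. A minimal Python sketch that fills in the empty minutes is shown here; it assumes timestamps in the default Nginx combined format, and the helper names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta


def parse_nginx_time(field):
    """Parse the fourth whitespace-separated field, e.g. '[10/Oct/2024:03:04:11'."""
    return datetime.strptime(field.lstrip("["), "%d/%b/%Y:%H:%M:%S")


def per_minute_counts(timestamps, start, end):
    """Count events per minute in [start, end), including minutes with zero events."""
    counts = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    buckets = []
    minute = start.replace(second=0, microsecond=0)
    while minute < end:
        buckets.append((minute, counts.get(minute, 0)))
        minute += timedelta(minutes=1)
    return buckets
```

With explicit zeros, a jump from 0 to several hundred in one bucket is visible as a step rather than inferred from a missing row.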

Broaden across sources. Once you have a hypothesis from the first log you examined, test it against adjacent sources. If the nginx error log says “upstream connection refused,” check the application server’s log for the same timestamp. If the application server shows “database connection timeout,” check the database log and the system journal on the database host. Root causes almost always appear in a log one level removed from the symptom.

Cross-referencing is where the most important discoveries happen. A database timeout in the application log should have a corresponding entry in the database log. If the database log shows nothing at the same timestamp, the problem is probably the network between them, not the database itself.

A concrete walkthrough illustrates how the three phases work together. At 03:07, an alert fires: the order API error rate has exceeded 5% for the past three minutes.

Scope: The alert identifies the service and approximate time. The symptom window started around 03:04.

Narrow: Start with a per-minute count of errors rather than reading raw log output:

awk '$9 >= 500 {print $4}' /var/log/nginx/access.log | cut -d: -f1-3 | sort | uniq -c | tail -20
# awk '$9 >= 500 {print $4}': keep only 5xx rows and print the timestamp field
# cut -d: -f1-3: keep date + hour + minute so entries bucket per minute
# sort: group identical minute buckets together
# uniq -c: count 5xx responses per minute bucket
# tail -20: show the most recent 20 buckets

The output shows no bucket for 03:03 (uniq -c lists only minutes that had at least one 5xx response), then 847 errors at 03:04. The symptom started abruptly, suggesting a discrete event rather than gradual degradation. Look at the four minutes before 03:04:

journalctl -u order-api.service --since "03:00:00" --until "03:04:00" -p err
# -u order-api.service: query only the order-api service unit
# --since / --until: narrow output to the pre-symptom time window
# -p err: show only error-priority entries and above

Three entries between 03:01:17 and 03:03:58: all database connection timeout errors with the same upstream host (postgres-prod:5432). The database connection is timing out. The application error at 03:04 is a consequence, not the cause.

Broaden: Check the PostgreSQL log for the same window:

grep -E ' 03:0[1-3]:' /var/log/postgresql/postgresql.log | grep -iE 'error|fatal|deadlock|lock'
# first grep: keep only lines whose timestamp falls in minutes 03:01 through 03:03
#   (the leading space and trailing colon avoid matching minute:second values such as 12:03:01)
# second grep -iE: case-insensitive filter for failure and locking keywords

A stream of FATAL: remaining connection slots are reserved for non-replication superuser connections entries starting at 03:01:04. PostgreSQL hit its maximum connection limit.

Now check why: look at the system journal on the database host for the 30 minutes before:

journalctl -p warning --since "02:30:00" --until "03:01:00" | grep -i "connection\|memory\|postgres"
# journalctl -p warning: include warning-level and higher system entries
# --since / --until: scan the 30 minutes before the first database failures
# grep -i: keep lines related to connection pressure, memory pressure, or PostgreSQL

The journal shows a scheduled data export job started at 02:58:23. That job opened 90 long-running database connections for a batch export, consuming the remaining connection pool headroom. The order API could not acquire connections for its own requests.

Root cause: a scheduled export job consumed available database connections, leaving none for the application, causing timeouts and a spike in error rate. The entire investigation, from alert to root cause, took approximately 8 minutes and used four log sources: the nginx access log, the application service journal, the PostgreSQL log, and the system journal on the database host.

The investigation succeeded because each source was checked in order of proximity to the symptom, and each source narrowed the hypothesis until only one explanation remained.

When an incident is complex or multi-stage, working from logs toward an explanation is not enough: you need to construct an explicit timeline that integrates evidence from all sources. A timeline reconstruction answers the question “in exactly what order did these things happen?” and reveals causal chains that are invisible when reading any single source in isolation.

Effective timeline reconstruction draws from multiple independent source types, usually at least three, so that you can cross-validate cause and effect. In most environments this includes metric graphs (when did error rate or latency change?), service and system logs with precise timestamps, and deployment or configuration change records (what changed in the 30 minutes before the incident?). In Kubernetes environments, events from kubectl describe are another high-value source; on non-Kubernetes systems, equivalent infrastructure event sources include hypervisor events, cloud control-plane events, load balancer logs, or firewall logs.

A useful format is a simple table ordered by timestamp:

| Time | Source | Event |
| --- | --- | --- |
| 02:58:23 | system journal (db host) | Export job started, opened 90 connections |
| 03:01:04 | PostgreSQL log | Connection limit reached |
| 03:01:17 | order-api journal | First database timeout recorded |
| 03:04:00 | nginx access log | Error rate crosses 5% threshold |
| 03:07:00 | Alertmanager | Alert fires |

This table makes the causal chain explicit and unambiguous. It also reveals the gap between the first failure (03:01:04) and detection (03:07:00): about six minutes before anyone was notified, the last three of which included user-facing errors. That gap is this incident’s time to detect, the per-incident value that MTTD averages, and the timeline makes it visible in a way that invites a concrete question: could alerting have fired earlier?
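Building the table by hand is fine for five rows, but the merge step is mechanical. The following is a minimal sketch, assuming each source has already been reduced to (timestamp, source, event) tuples; `build_timeline` and `detection_gap` are illustrative names.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, source, event) tuples from several sources, oldest first."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])


def detection_gap(timeline, failure_source, alert_source):
    """Time between the first entry from the failing source and the first alert entry."""
    first_failure = next(ts for ts, src, _ in timeline if src == failure_source)
    first_alert = next(ts for ts, src, _ in timeline if src == alert_source)
    return first_alert - first_failure
```

Fed the PostgreSQL and Alertmanager rows from the table, `detection_gap` returns the interval between 03:01:04 and 03:07:00. The sketch assumes all sources share a clock and time zone; in practice, normalize timestamps to UTC before merging.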

Timeline reconstruction is also the foundation of the blameless postmortem, which will be discussed later.

Log Volume, Sampling, Rotation, and Retention


A busy web server can write gigabytes of access logs per day. Without management, log files grow until the disk fills and the system stops functioning. This is one connected problem space: you control volume at write time with sampling and log levels, then manage stored data with rotation and retention policies.

At high request rates, logging every request is often impractical. The storage cost rises quickly, and write I/O can affect application performance.

Head-based sampling makes the decision at the start of a request: log this request with probability P, discard it otherwise. This is simple to implement but has a significant limitation: errors and anomalous requests are sampled at the same rate as normal requests. At a 1% sample rate, you will miss 99% of your errors.

Tail-based sampling defers the decision until after the request completes: always log requests that resulted in errors, were slow, or otherwise met criteria for interest, and sample the remainder at a configurable rate. Tail-based sampling preserves the interesting events while discarding routine ones, which is usually the right operational tradeoff.
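The difference between the two strategies is easiest to see as two decision functions. This is a sketch under stated assumptions: the 5xx cutoff, the `slow_ms` threshold of 1000 ms, and the function names are illustrative choices, not a standard.

```python
import random


def head_sample(rate, rng=random):
    """Decide before the request runs: keep with probability `rate`."""
    return rng.random() < rate


def tail_sample(status, duration_ms, rate, slow_ms=1000, rng=random):
    """Decide after the request completes: always keep errors and slow requests."""
    if status >= 500 or duration_ms >= slow_ms:
        return True  # interesting requests are never dropped
    return rng.random() < rate  # routine requests are sampled
```

`head_sample` knows nothing about the outcome, so an error is as likely to be dropped as a success. `tail_sample` needs the status and duration, which means the request's log data must be buffered until it finishes.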

A pragmatic alternative that does not require dedicated sampling infrastructure is level-based filtering: log all errors at ERROR level (always captured), log slow requests at WARN level (captured in production), and log routine requests at DEBUG level (suppressed in production).
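With Python's standard `logging` module, level-based filtering might look like the sketch below. The `production` flag, the 1000 ms slow threshold, and the helper names are assumptions for illustration.

```python
import logging


def make_request_logger(name, production=True):
    """Return a logger whose threshold is WARNING in production, DEBUG otherwise."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING if production else logging.DEBUG)
    return logger


def log_request(logger, status, duration_ms, slow_ms=1000):
    """Choose the level from the outcome; the logger's threshold does the filtering."""
    if status >= 500:
        level = logging.ERROR
    elif duration_ms >= slow_ms:
        level = logging.WARNING
    else:
        level = logging.DEBUG
    logger.log(level, "status=%d duration_ms=%d", status, duration_ms)
    return level
```

The application code stays the same in every environment; only the logger threshold changes, so routine requests cost almost nothing in production but are available when debugging locally.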

Rotation then prevents unbounded growth by periodically compressing, archiving, and eventually deleting old log files.

On Linux, logrotate is the standard tool. It runs daily via a systemd timer or cron job and reads configuration from /etc/logrotate.conf (global defaults) and /etc/logrotate.d/ (per-application overrides). A typical configuration for Nginx:

/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
    endscript
}

This rotates logs daily, keeps 14 days of history, and compresses old files with a one-day delay (so tools still reading yesterday’s log can finish before compression). The postrotate step sends the USR1 signal to Nginx, causing it to close its current file handle and open a new one. Without this signal, Nginx would continue writing to the old, renamed file.
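The reason the reopen signal matters is that an open file handle follows the file, not the path. The sketch below demonstrates this on a POSIX system with a temporary directory; it stands in for Nginx and logrotate rather than invoking either, and it skips the `create` step that logrotate would perform after the rename.

```python
import os
import tempfile


def read_text(path):
    with open(path) as handle:
        return handle.read()


def rotate_without_and_with_reopen():
    """Return (rotated file contents, live path existed before reopen, live file contents)."""
    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "access.log")
        rotated = os.path.join(tmp, "access.log.1")

        writer = open(live, "a")           # long-running process opens its log
        writer.write("before rotation\n")
        writer.flush()

        os.rename(live, rotated)           # rotation renames the file on disk
        writer.write("after rename, same handle\n")
        writer.flush()                     # lands in access.log.1, not access.log
        live_exists_before_reopen = os.path.exists(live)

        writer.close()                     # reopen step: close the old handle...
        writer = open(live, "a")           # ...and open the configured path again
        writer.write("after reopen\n")
        writer.close()

        return read_text(rotated), live_exists_before_reopen, read_text(live)
```

Until the writer reopens the path, every new line goes to the rotated file, which is why a rotation without the signal quietly defeats the rotation schedule.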

Retention decisions balance operational usefulness against storage cost and compliance requirements. The amount of disk space consumed by logs depends directly on log volume and retention duration. A single high-traffic web server generating 1 GB of access logs per day consumes 365 GB per year if kept locally; a Kubernetes cluster with 100 services, each logging at varying rates, can easily consume multiple terabytes annually. Compressed logs reduce storage by 5-10x on average (text compresses well), but even compressed logs become a significant operational cost at scale.
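The arithmetic in this paragraph is simple enough to keep as a small helper when comparing retention policies. This is a sketch; `retained_storage_gb` is an illustrative name, and the compression ratio is an input you would measure on your own logs rather than assume.

```python
def retained_storage_gb(daily_gb, retention_days, compression_ratio=1.0):
    """Steady-state disk usage once the retention window is full."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return daily_gb * retention_days / compression_ratio
```

For the 1 GB per day server above, `retained_storage_gb(1, 365)` gives 365 GB uncompressed, and ratios of 5 and 10 give 73 GB and 36.5 GB respectively.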

For active incident investigation, logs from the past two weeks are usually sufficient. Most investigations start within hours or days of the incident, so the need for locally searchable logs drops off quickly as the data ages. Two weeks of local, compressed storage is a practical default for busy production servers.

For compliance and security auditing, retention requirements are often mandated by regulation. PCI DSS requires at least 12 months of audit log history, with the most recent three months immediately available for analysis; HIPAA’s six-year documentation retention rule is commonly applied to audit records as well. Federal and government systems may require longer retention or even indefinite archival of certain audit trails. These requirements are often handled separately from operational logs: audit logs are shipped in real time to a centralized append-only store (which prevents tampering and deletion), while operational logs follow a shorter local retention before rotation.

Most organizations use tiered storage to balance cost and access speed:

  • Hot tier (immediate access, fast): Recent logs (past 2-7 days) on fast local SSD or centralized hot storage. Used for active incident investigation.
  • Warm tier (slower access, cheaper): Logs older than the hot window and up to about 30 days old, held in compressed object storage (AWS S3, Google Cloud Storage, Azure Blob Storage). Accessed occasionally for root cause analysis or compliance spot-checks.
  • Cold tier (archive, minimal cost): Logs older than 30 days, compressed and moved to archive storage (S3 Glacier, the Google Cloud Storage Archive class, the Azure Blob Storage archive tier), where retrieval can take hours depending on the service. Used only for compliance audits or forensic investigations of historical incidents.
  • Deletion: Very old logs (often 7+ years for audit records, less for operational logs) are deleted according to policy and legal requirements.
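The tiering above can be expressed as a function of log age, which is roughly what a lifecycle policy encodes. The thresholds here (7, 30, and 365 days) are hypothetical defaults chosen for illustration; the right values come from your own retention policy.

```python
def storage_tier(age_days, hot_days=7, warm_days=30, delete_after_days=365):
    """Map a log's age in days to a storage tier name."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= hot_days:
        return "hot"
    if age_days <= warm_days:
        return "warm"
    if age_days <= delete_after_days:
        return "cold"
    return "delete"
```

Audit logs would typically use a much larger `delete_after_days` than operational logs, which is one reason the two are often managed by separate policies.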

The specific retention window depends on your organization’s incident response velocity and compliance obligations. A startup with no regulatory requirements might keep operational logs for one week and delete thereafter. A healthcare organization processing protected health information might keep audit logs indefinitely while rotating operational logs monthly. The decision should be explicit: “we keep production logs for X days” becomes a statement in your runbooks and disaster recovery plans, guiding both automated rotation and emergency retention decisions when an incident occurs.

Reading logs on individual servers scales to a handful of machines. Once you have dozens of servers, hundreds of containers, or a Kubernetes cluster running many pods, you need a system that collects logs from every source and makes them searchable from a single interface. The architecture is consistent across implementations.

Collection agents run on each node or as sidecar containers, reading log files or tapping into syslog sockets and forwarding entries to a central destination. In a Kubernetes environment, a DaemonSet is the standard deployment pattern: it ensures one copy of the agent runs on every node, with access to the node’s /var/log/pods/ directory where the container runtime writes container logs.

Processing enriches and normalizes log data before storage. This includes parsing syslog or JSON formats, adding Kubernetes metadata (pod name, namespace, label values), filtering noisy entries, and routing different log streams to different destinations.
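As a rough illustration of what a processing stage does to each line, the sketch below parses a JSON log line, drops noisy levels, and attaches metadata. It is not how any particular agent is implemented; the `kubernetes` key, the `parse_error` marker, and the `drop_levels` default are assumptions.

```python
import json


def process_line(raw_line, metadata, drop_levels=("debug",)):
    """Parse one JSON log line, attach metadata, and drop noisy levels.

    Returns the enriched record, or None if the line should be filtered out.
    """
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        record = None
    if not isinstance(record, dict):
        # Keep unparseable lines as plain text instead of losing them.
        record = {"message": raw_line.rstrip("\n"), "parse_error": True}
    if str(record.get("level", "")).lower() in drop_levels:
        return None
    return {**record, "kubernetes": dict(metadata)}
```

Keeping unparseable lines rather than discarding them is a deliberate choice in this sketch: a stack trace or a crash message is often exactly the line that does not match the expected format.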

Storage writes processed logs to a durable, queryable system indexed for the queries you will need to run: full-text search for error messages, label-based filtering for container and namespace, and time-range filtering for incident investigation windows.

Analysis and alerting sits on top of storage. Engineers query stored logs to investigate incidents. Automated rules detect patterns (the sudden appearance of “FATAL” in a service’s log) and trigger notifications.

Fluent Bit is one of the most widely deployed log collection agents in Kubernetes environments. It is a small C-based collector designed for low memory and CPU overhead, which suits per-node deployments such as DaemonSets where the agent shares resources with the workloads it observes. Fluent Bit reads container log files on the node, parses them, enriches them with Kubernetes metadata from the API server, and forwards them to a configured destination. It supports outputs to Loki, Elasticsearch, Kafka, and most managed cloud logging services, and handles parsing of common log formats with built-in parsers.

You will also encounter Fluentd, a related project in the same ecosystem. The practical difference is scope and footprint. Fluent Bit is the lightweight edge collector and forwarder that runs close to the workload (for example, one DaemonSet pod per node). Fluentd is the heavier, Ruby-based aggregator with a larger plugin ecosystem and richer transformation capabilities, often used as a central processing tier. In many modern deployments, Fluent Bit handles node-level collection and forwarding directly to storage, while Fluentd is used only when you need complex routing or transformation logic that justifies the additional resource cost.

Platform Choices: Elastic Stack and Grafana LGTM


Once your pipeline is in place, the main design decision is which query and storage model you want for day-two operations. The two most common open-source choices are Elastic and Grafana LGTM. They solve the same problem with different indexing strategies, operational footprints, and cost/performance tradeoffs.

The Elastic Stack (historically called ELK) is the longest-established open-source log management solution. Elasticsearch stores and indexes log data using an inverted index: the same data structure used in search engines, where every term in every field is indexed separately, enabling fast full-text search across arbitrarily large log collections. Kibana provides the web interface for querying, visualizing, and building dashboards. Logstash handles log transformation pipelines, though it is often replaced by lighter-weight Beats agents (Filebeat for log files) with Logstash reserved for complex aggregation.

Elasticsearch’s full-text index is its defining strength. Any term in any indexed field can be found without scanning the raw data, so searches typically stay fast even across very large volumes. The cost is resources: Elasticsearch is memory-intensive, and large log volumes require careful JVM heap tuning and storage capacity planning.

The LGTM stack (Loki, Grafana, Tempo, Mimir) is a cloud-native alternative built around the same Grafana interface that many Kubernetes environments already use for metrics. Its architecture reflects a deliberate tradeoff against the Elastic Stack.

Loki indexes only metadata labels rather than the full text of log entries. The labels attached at collection time (pod name, namespace, container name, and any Kubernetes labels you choose to forward) are indexed. The log message body is stored compressed and retrieved in its entirety only when you run a query. This label-index approach dramatically reduces storage cost and ingestion overhead compared to Elasticsearch. The tradeoff is query performance: a full-text search in Elasticsearch finds a specific phrase across billions of entries in milliseconds; the same search in Loki requires scanning the compressed log files that match the label filter.
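A toy model makes the two indexing strategies concrete. The classes below are a deliberately simplified sketch, not how either system is implemented: the "inverted index" splits on whitespace with no analyzers, and the "label index" keeps messages in memory rather than in compressed chunks.

```python
from collections import defaultdict


class InvertedIndex:
    """Full-text style: every term points at the entries that contain it."""

    def __init__(self):
        self.entries = []
        self.postings = defaultdict(set)

    def add(self, labels, message):
        entry_id = len(self.entries)
        self.entries.append((labels, message))
        for term in message.lower().split():
            self.postings[term].add(entry_id)  # index cost paid at write time

    def search(self, term):
        ids = sorted(self.postings.get(term.lower(), ()))
        return [self.entries[i][1] for i in ids]


class LabelIndex:
    """Label style: only label sets are indexed; message bodies are scanned."""

    def __init__(self):
        self.streams = defaultdict(list)

    def add(self, labels, message):
        self.streams[frozenset(labels.items())].append(message)

    def search(self, selector, needle):
        wanted = set(selector.items())
        hits = []
        for stream_labels, messages in self.streams.items():
            if wanted <= stream_labels:  # cheap lookup by label
                hits.extend(m for m in messages if needle in m)  # scan at query time
        return hits
```

The inverted index does more work on every write and answers term lookups directly. The label index does almost nothing on write, and the cost of a search grows with how many messages the label selector leaves to scan.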

LogQL is Loki’s query language. A basic query has two parts: a log stream selector (label filters in curly braces) and optional filter expressions:

{namespace="default", container="wordpress"} |= "connection refused"

To filter to errors and extract fields from structured JSON log lines:

{namespace="default", container="order-api"} | json | level="error" | line_format "{{.message}} ({{.duration_ms}}ms)"

The | json parser extracts fields from JSON log entries, and subsequent filter expressions operate on those fields. For teams already using Grafana for metrics, Loki integrates naturally: you can display Prometheus metric panels and Loki log panels on the same dashboard, zooming from a spike in an error-rate graph directly into the log entries from that time window without switching tools.

The choice between ELK and Loki is primarily a tradeoff between query flexibility and resource cost. Elasticsearch’s full-text index gives you maximum query power: any term in any log field is searchable in milliseconds without knowing in advance which service or time window to look at. Loki requires you to narrow by label first, then scan the compressed log content within that set.

For structured, cloud-native environments where you always know which namespace, service, or container you are investigating, Loki’s tradeoff is usually acceptable, and the storage savings are substantial. For security and compliance use cases where you might search for a specific string (a CVE identifier, an IP address, a username) across all services and all time, Elasticsearch’s full-text index is more appropriate.

Both Loki and Elasticsearch support log-based alerting rules: rules that fire when a pattern appears in a log stream, in addition to metric-based alerts. In Loki, alerting rules are written in LogQL and can be evaluated by Loki’s ruler component or managed through Grafana’s alerting system.

A log-based alert that fires when the word “FATAL” appears in a service’s log stream:

count_over_time({namespace="production", container="order-api"} |= "FATAL" [5m]) > 0

Log-based alerting is a complement to metric-based alerting, not a replacement. Some failure modes produce log entries before they produce metric changes: a single fatal error before a crash loop begins, a configuration parsing failure at startup, a one-time data corruption event. Log-based alerts can catch these earlier than metrics-based alerts, which require the problem to accumulate over a scrape interval.
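The logic behind a rule like the `count_over_time` example can be sketched in a few lines. This is an illustration of the idea only: the exact window boundary handling (open on the left, closed on the right) is an assumption here, and the function names merely echo the LogQL function.

```python
from datetime import datetime, timedelta


def count_over_time(entries, needle, now, window=timedelta(minutes=5)):
    """Count (timestamp, line) entries containing `needle` in (now - window, now]."""
    start = now - window
    return sum(1 for ts, line in entries if start < ts <= now and needle in line)


def should_alert(entries, needle, now, window=timedelta(minutes=5)):
    """Fire as soon as at least one matching line falls inside the window."""
    return count_over_time(entries, needle, now, window) > 0
```

Because the threshold is a single matching line, the rule fires on the first evaluation after the line arrives, without waiting for a rate or ratio to build up.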

AWS CloudWatch Logs, Google Cloud Logging, and Azure Monitor Logs offer fully managed log ingestion, storage, and search as cloud services. The appeal is operational simplicity: no clusters to run, no storage to provision, no software versions to upgrade. Fluent Bit can forward logs to all three. The costs are vendor lock-in, per-GB pricing that becomes significant at high log volumes, and limited flexibility compared to self-hosted stacks. For organizations running primarily in one cloud provider and without a reason to maintain a self-hosted stack, managed logging is often the pragmatic choice.

Logs serve both an operational role (diagnosing incidents) and a security role (detecting and investigating breaches). The two roles require different data, different retention policies, and different access controls.

For security investigations, the most valuable logs are authentication events (/var/log/auth.log, the systemd journal for SSH and sudo, and auditd records for privilege escalation), firewall logs (which record connection attempts that were blocked), and application-level access logs. These are the evidence trail for answering questions like: did this user account authenticate to this server at 2 AM? Did a process make network connections to an unexpected external IP? Did a file in /etc/ change at an unusual time?

osquery is an open-source tool that exposes operating system state as SQL tables. You can query for processes running from deleted binaries (a common malware indicator), unusual network connections, and suspicious file modifications. In practice, osquery is most useful when deployed continuously and shipping scheduled query results to a central store, rather than run manually only during incidents.

For authentication logs on a single host, the same grep and awk patterns from the Searching Logs section are usually enough to quickly summarize failed usernames and source IP concentrations. In centralized platforms, equivalent queries become dashboards and detection rules.

SIEM (Security Information and Event Management) systems are a specialized category of centralized logging aimed at security use cases. Platforms like Splunk Enterprise Security, Microsoft Sentinel, and IBM QRadar aggregate logs from firewalls, authentication systems, IDS/IPS, and application servers simultaneously, applying rule-based or machine-learning detection to identify threats across sources. SIEMs are designed for compliance-driven and security-investigation use cases where you need to correlate events across many systems and maintain centrally managed audit trails. They are heavier and more expensive than general-purpose log management but are the standard tooling in enterprise security operations centers.

Log management and incident investigation are two sides of the same discipline. You invest in log infrastructure before incidents happen: configuring collection agents, defining retention policies, establishing centralized search. You collect the dividends when an incident occurs and the data you need already exists and is findable within seconds rather than hours.

The landscape of log sources is broader than application stdout. Hardware and BMC logs capture signals that appear below the OS layer. The kernel audit subsystem (auditd) captures privileged actions at the kernel boundary, which complements application logging and is especially useful for accountability and compliance. The syslog protocol and its daemon ecosystem (rsyslog, syslog-ng) have forwarded logs across traditional Linux environments for decades and still underpin many production systems. Understanding all of these layers means you know where to look for evidence when the obvious sources yield nothing.

The investigation method (scope the symptom, narrow the time window, broaden across log sources) applies whether you are using grep on a single server or querying a Loki cluster with LogQL. The tools change as scale changes; the reasoning process does not. Building intuition for which log to read first, which time window to examine, and which adjacent sources to cross-reference is the skill that turns a long investigation into a short one. The incident timeline table is the artifact that makes the causal chain explicit and sets the foundation for the postmortem process.

Container logging in Kubernetes has a specific shape that differs from traditional log files: containers write to stdout/stderr, the runtime captures to node-level files, the kubelet exposes them via the API, and centralized agents ship them before ephemeral pods disappear. Knowing this mechanism makes kubectl logs predictable rather than mysterious, and knowing the --previous flag makes the difference between finding the error that caused a crash and seeing only an empty log from the container that replaced it.

The Incident Response and Disaster Recovery lecture extends this investigation capability into the formal process of coordinating a response, communicating during an outage, and running the postmortem that prevents recurrence.