
Log Management and Analysis

It is 3:07 AM. Your phone buzzes with an alert: the web application is returning HTTP 500 errors to every request. No one deployed anything. No one changed a configuration file. The monitoring dashboard confirms the spike, but dashboards only tell you that something is wrong, not why. The only witnesses to whatever happened in the minutes before the outage are the log files your servers have been quietly writing since the day they were provisioned. This lecture is about learning to read those witnesses fluently.

Every process on a Linux system can write log messages: the kernel, the init system, web servers, databases, cron jobs, authentication services, and your own application code. Each message is a small, timestamped record of something that happened. Individually, a single log line is mundane. Collectively, logs form a detailed timeline of system behavior, and that timeline is often the only evidence available when you need to explain an outage after the fact.

Metrics and dashboards are invaluable for detecting problems quickly, but they are summaries. A graph that shows “error rate jumped to 40%” does not tell you which endpoint failed, what the error message was, or which upstream dependency timed out. Logs fill that gap. They capture the specific error strings, stack traces, client IPs, request paths, and return codes that let you move from “something broke” to “here is exactly what broke, and here is why.”

Every logging framework assigns each message a severity level (sometimes called priority or log level). Knowing the levels helps you filter noise during an investigation and configure alerting so that only meaningful messages cause notifications.

The syslog standard defines eight levels, ordered from highest to lowest severity:

Level  Name      Meaning
0      EMERG     System is unusable; immediate action required
1      ALERT     Action must be taken immediately
2      CRIT      Critical condition (hardware failure, etc.)
3      ERR       Error condition; something failed
4      WARNING   Something unexpected happened but the system is still running
5      NOTICE    Normal but significant events
6      INFO      Routine informational messages
7      DEBUG     Fine-grained detail, useful only during development

Application frameworks often use a simplified subset: DEBUG, INFO, WARN, ERROR, and FATAL (or CRITICAL). The precise names vary, but the ordering is always the same: DEBUG is the most verbose and lowest severity; FATAL/EMERG is the least verbose and highest severity.

In production, most services should log at INFO or WARNING and above. Running at DEBUG in production generates enormous log volume and can expose sensitive data in log files.
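
The ordering is what makes levels useful for filtering. As a sketch, the following awk one-liner keeps only WARNING-and-above lines from a hypothetical application log whose second field is the level name (the sample file and its layout are assumptions for illustration, not a standard format):

```shell
# Create a short sample log (hypothetical layout: timestamp, level, message)
cat > /tmp/levels.sample.log <<'EOF'
2026-03-15T03:01:40Z DEBUG cache miss for key user:42
2026-03-15T03:01:41Z INFO request completed in 12ms
2026-03-15T03:01:42Z WARNING connection pool 90% utilized
2026-03-15T03:01:43Z ERROR database connection refused
EOF

# Keep WARNING and above: map level names to syslog ranks, then compare.
# A lower rank number means higher severity, matching the table above.
awk 'BEGIN { rank["DEBUG"]=7; rank["INFO"]=6; rank["NOTICE"]=5;
             rank["WARNING"]=4; rank["ERROR"]=3; rank["CRITICAL"]=2 }
     ($2 in rank) && rank[$2] <= 4' /tmp/levels.sample.log
```

The `($2 in rank)` guard matters: without it, an unrecognized level name would compare as rank 0 and always pass the filter.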

Understanding the breadth of log sources prevents blind spots when you are troubleshooting a complex system. Logs originate from many layers:

  • Operating systems: kernel messages, system service start/stop events, hardware events
  • Applications: web servers, databases, background job workers, custom application code
  • Network devices: routers, switches, firewalls, load balancers (often via syslog forwarding)
  • Security systems: Intrusion Detection Systems (IDS), Intrusion Prevention Systems (IPS), and Security Information and Event Management (SIEM) platforms that aggregate and correlate security logs from multiple sources
  • Containers and orchestrators: Docker daemon logs, Kubernetes control-plane events, pod stdout/stderr
  • Hypervisors and guest operating systems: virtualization platforms and their virtual machines
  • Middleware: message brokers, API gateways, service meshes
  • CI/CD pipelines: build logs, test results, deployment events
  • Third-party services: vendor APIs, payment processors, CDN edge logs

A key challenge in any non-trivial environment is that these sources use different formats, different transport mechanisms, and different severity vocabularies. A centralized log collection layer (discussed below) normalizes this diversity into a unified stream.

Not all logs serve the same purpose. Understanding the categories helps you know where to look first.

System logs are produced by the kernel and core system services. On modern Linux distributions using systemd, these messages flow into the systemd journal (managed by systemd-journald). They cover boot messages, hardware events, service start/stop transitions, and kernel warnings. Older systems route these through the syslog protocol to files like /var/log/syslog or /var/log/messages.

Application logs come from the software you run on top of the operating system: your web framework, your database, your background job workers. Their location and format depend entirely on the application. A Rails app might log to log/production.log; PostgreSQL writes to a log directory under its data path; a Node.js service might write JSON to stdout, which systemd then captures in the journal.

Access and request logs record every inbound request to a server. For web servers like Nginx or Apache, these typically live in /var/log/nginx/access.log or /var/log/apache2/access.log. Each line records the client IP, timestamp, HTTP method, path, status code, response size, and often the referrer and user agent. These logs are essential for understanding traffic patterns and identifying which requests are failing.

Security and authentication logs track login attempts, privilege escalation, SSH sessions, and firewall actions. On Debian-based systems, look at /var/log/auth.log; on Red Hat-based systems, /var/log/secure. These logs are critical during security investigations and are often the first place to check if you suspect unauthorized access.

On a systemd-based Linux system, there are two primary places to find logs.

The systemd journal is a structured, binary log managed by systemd-journald. It captures output from all systemd-managed services, kernel messages, and anything written to syslog. You interact with it through the journalctl command. The journal is stored in /var/log/journal/ (persistent) or /run/log/journal/ (volatile, lost on reboot). One of its advantages is that it indexes messages by unit, priority, time, and other fields, making it fast to query even on busy systems.

The /var/log directory is the traditional home for log files. Even on systemd systems, many services still write plain-text log files here. A typical /var/log listing includes:

/var/log/syslog # General system messages (Debian/Ubuntu)
/var/log/messages # General system messages (RHEL/CentOS)
/var/log/auth.log # Authentication events
/var/log/kern.log # Kernel messages
/var/log/nginx/ # Nginx access and error logs
/var/log/mysql/ # MySQL/MariaDB logs
/var/log/apt/ # Package manager history
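
When you first log into an unfamiliar host, it helps to survey which logs exist, which are largest, and which are being actively written. A quick sketch (output will differ per system):

```shell
# Largest files and directories under /var/log: a runaway log often
# announces itself by sheer size
du -sh /var/log/* 2>/dev/null | sort -rh | head

# Most recently modified entries: a fresh timestamp tells you which
# logs are currently being written to
ls -lt /var/log | head
```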

Returning to our 3 AM scenario, we would likely start with two commands: one to check the application’s own logs and one to check the system journal for the relevant time window.

# Check the systemd journal for errors in the last two hours
journalctl -p err --since "2 hours ago"
# Check the Nginx error log for recent entries
tail -100 /var/log/nginx/error.log

Understanding the format of a log file is a prerequisite to searching it effectively. Two dominant formats appear in practice.

Regardless of format, a log entry typically contains the same core components:

  • Timestamp: when the event occurred (preferably ISO 8601 with timezone)
  • Severity: the log level (DEBUG, INFO, WARN, ERROR, etc.)
  • Source: the process, service, or component that generated the message
  • Message: the human-readable description of the event
  • Contextual information: additional fields such as error codes, IP addresses, user IDs, request IDs, or stack traces

Syslog format follows a well-known convention: a timestamp, a hostname, a process name (sometimes with a PID), and then the message body. A typical syslog line looks like this:

Mar 15 03:04:12 web-prod-01 nginx[2847]: upstream timed out (110: Connection timed out) while reading response header from upstream

The timestamp (Mar 15 03:04:12), host (web-prod-01), and process (nginx[2847]) give you immediate context. The message after the colon is free-form text defined by the application.
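
Because the layout is positional, awk can slice it directly. For example, to count messages per process in a syslog-format file (using a short fabricated sample for illustration):

```shell
# Fabricated syslog-format sample in the layout described above
cat > /tmp/syslog.sample <<'EOF'
Mar 15 03:04:12 web-prod-01 nginx[2847]: upstream timed out (110: Connection timed out) while reading response header from upstream
Mar 15 03:04:13 web-prod-01 nginx[2847]: upstream timed out (110: Connection timed out) while reading response header from upstream
Mar 15 03:04:14 web-prod-01 sshd[3011]: Accepted publickey for deploy from 203.0.113.7 port 52114
EOF

# Field 5 is "process[pid]:"; splitting on "[" drops the PID and colon
awk '{ split($5, p, "["); print p[1] }' /tmp/syslog.sample | sort | uniq -c | sort -rn
```

A per-process tally like this quickly shows which service was noisiest during an incident window.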

JSON structured logging has become increasingly common, especially in containerized and cloud-native environments. Instead of a flat string, each log entry is a JSON object with named fields:

{
  "timestamp": "2026-03-15T03:04:12.841Z",
  "level": "error",
  "service": "api",
  "message": "database connection refused",
  "host": "web-prod-01",
  "db_host": "db-prod-01",
  "port": 5432,
  "duration_ms": 30000
}

Structured logs are easier to parse programmatically and work well with centralized logging systems. The tradeoff is readability: a JSON object is harder to skim by eye than a single syslog line.
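
Even without a JSON-aware tool like jq, a rough grep can answer quick questions, such as counting entries per level. This is a naive sketch: it assumes one compact JSON object per line with consistent key formatting, and will break on pretty-printed or escaped JSON.

```shell
# Fabricated sample of compact, one-object-per-line JSON logs
cat > /tmp/app.json.log <<'EOF'
{"timestamp":"2026-03-15T03:04:11.102Z","level":"info","message":"request completed"}
{"timestamp":"2026-03-15T03:04:12.841Z","level":"error","message":"database connection refused"}
{"timestamp":"2026-03-15T03:04:13.003Z","level":"error","message":"database connection refused"}
EOF

# Pull out the "level" value from each line and tally the counts
grep -o '"level":"[a-z]*"' /tmp/app.json.log | sort | uniq -c | sort -rn
```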

Knowing where logs live and what they look like is only useful if you can search them quickly. The core toolkit for interactive log analysis on Linux is small but powerful.

grep searches for lines matching a pattern. It is the workhorse of log analysis. To find all lines containing “error” (case-insensitive) in a log file:

grep -i "error" /var/log/nginx/error.log

To find lines that do not match a pattern (useful for filtering out noise):

grep -v "healthcheck" /var/log/nginx/access.log

To show context around each match (two lines before and after):

grep -B2 -A2 "Connection timed out" /var/log/nginx/error.log

journalctl is the native tool for querying the systemd journal. Its filtering flags make it far more precise than grepping flat files.

# Show logs from a specific service
journalctl -u nginx.service
# Show only errors and above (priorities: emerg, alert, crit, err)
journalctl -p err
# Show logs from a specific time window
journalctl --since "2026-03-15 02:00:00" --until "2026-03-15 04:00:00"
# Combine filters: errors from nginx in the last hour
journalctl -u nginx.service -p err --since "1 hour ago"
# Follow new entries in real time (like tail -f)
journalctl -u myapp.service -f

For plain-text log files, tail -f streams new lines as they are written. This is invaluable when you are actively reproducing a problem:

tail -f /var/log/nginx/error.log

You can combine it with grep to watch for specific patterns in real time:

tail -f /var/log/nginx/access.log | grep " 500 "

awk excels at extracting and summarizing fields from structured text. Nginx access logs use a space-delimited format where the status code is typically the ninth field. To count how many times each status code appeared:

awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

To find the top client IPs generating 500 errors:

awk '$9 == 500 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head

In our 3 AM investigation, this kind of summary quickly tells you whether the 500s are coming from all clients (suggesting a server-side problem) or a single source (suggesting a bad request pattern or an attack).
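
The same technique answers the complementary question: which endpoints are failing? Assuming the common combined log format, where the request path is the seventh field:

```shell
# Fabricated combined-format sample (fields: ip, -, -, [time, tz],
# "METHOD, path, protocol", status, bytes, referrer, user agent)
cat > /tmp/access.sample.log <<'EOF'
203.0.113.7 - - [15/Mar/2026:03:01:44 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.7 - - [15/Mar/2026:03:01:45 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
198.51.100.4 - - [15/Mar/2026:03:01:46 +0000] "GET /healthz HTTP/1.1" 200 2 "-" "kube-probe/1.29"
EOF

# Requests that returned 500, grouped by request path ($7)
awk '$9 == 500 {print $7}' /tmp/access.sample.log | sort | uniq -c | sort -rn | head
```

If one endpoint dominates the count, you can narrow the investigation to that code path immediately.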

Basic string searches with grep will take you far, but regular expressions let you match patterns rather than literal strings. A few practical regex patterns cover the majority of log analysis tasks.

Match any line with an HTTP 5xx status code in an Nginx access log:

grep -E '" 5[0-9]{2} ' /var/log/nginx/access.log

Match timestamps within a specific minute (useful when you know the approximate time of an event):

grep "^Mar 15 03:04" /var/log/syslog

Extract IP addresses from any log file:

grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /var/log/auth.log

Match failed SSH login attempts and extract the username (note that for nonexistent accounts sshd logs "Failed password for invalid user NAME", so the extracted token may be "invalid"):

grep -oP 'Failed password for \K\S+' /var/log/auth.log | sort | uniq -c | sort -rn

Putting the tools together: a 3 AM walkthrough


Let us return to the 3 AM scenario and walk through a realistic investigation.

  1. Confirm the symptom. Check the access log to quantify the 500 errors. How many are there, and when did they start?

    awk '$9 == 500 {print $4}' /var/log/nginx/access.log | tail -20

    This prints the timestamps of recent 500 responses. You notice they all start at 03:01:44.

  2. Check the application error log. Nginx is returning 500s, but Nginx itself is probably proxying to an upstream application. Look at the Nginx error log for that time window.

    grep "03:01" /var/log/nginx/error.log

    You see repeated lines: connect() failed (111: Connection refused) while connecting to upstream. The application server is not accepting connections.

  3. Check the application service. Use journalctl to see what happened to the application process.

    journalctl -u myapp.service --since "2026-03-15 03:00:00" --until "2026-03-15 03:10:00"

    The journal shows the app crashed at 03:01:42 with an out-of-memory error. Two seconds later, the 500s began.

  4. Find out why memory spiked. Check the system journal for memory pressure signals around that time.

    journalctl -p warning --since "2026-03-15 02:50:00" --until "2026-03-15 03:02:00" | grep -i "memory\|oom\|kill"

    You find the OOM killer was invoked. Something consumed enough memory to starve the application.

  5. Identify the culprit. Check what else was running. A cron job kicked off at 03:00:

    grep "03:00" /var/log/syslog | grep -i cron

    A nightly backup script started at 03:00:00, allocating enough memory to push the system over the edge.

The root cause: a nightly backup cron job ran on the same host as the application, consumed excessive memory, and triggered the OOM killer, which terminated the app process. Nginx then had no upstream to proxy to and returned 500 errors to all clients. Total time from alert to root cause, using only logs: about ten minutes.

A busy web server can write gigabytes of access logs per day. Without management, log files grow until the disk is full, at which point the system may stop functioning entirely. Log rotation solves this by periodically compressing, renaming, and eventually deleting old log files.

On Linux, the standard tool is logrotate, which runs daily via cron or a systemd timer. Configuration files live in /etc/logrotate.conf (global defaults) and /etc/logrotate.d/ (per-application overrides). A typical configuration for Nginx looks like this:

/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
    endscript
}

This configuration rotates logs daily, keeps 14 days of history, compresses old files (but delays compression by one cycle so tools tailing the previous file still work), and sends a signal to Nginx so it reopens its log file handles. The postrotate step is critical: without it, Nginx would continue writing to the old (now renamed) file, and the new file would remain empty.

Whether you operate a single server or a thousand, logs move through the same conceptual pipeline:

  1. Collection: agents or sidecar processes gather log data from all sources (files, syslog sockets, container stdout, APIs).
  2. Processing: the collected data is transformed, normalized, enriched with metadata (such as the cluster name or environment), filtered to remove noise, and tagged for routing.
  3. Storage: processed logs are written to a durable, queryable store. The store must be tamper-resistant, efficiently indexed, and capable of handling high write throughput. Different database types suit different data: time-series databases for metrics, object storage for archival log files, and inverted-index stores (like Elasticsearch) for full-text search.
  4. Analysis: engineers and automated systems query the stored logs to gain insights, identify issues, and understand behavior patterns.
  5. Alerting and notifications: real-time rules detect anomalies or threshold crossings and trigger notifications.
  6. Response and remediation: some systems can trigger automated responses (for example, auto-scaling a fleet when error rates spike, or quarantining a host when a security rule fires).

Fluent Bit: a lightweight log and metrics forwarder


Fluent Bit is an open-source, vendor-neutral log processor and forwarder widely used at the collection stage. Its defining characteristic is its size: the binary is under 1 MB, making it practical to deploy as a sidecar in every pod in a Kubernetes cluster without meaningful overhead. It can read from files, syslog, Docker and containerd log drivers, and dozens of other input plugins, then parse, filter, and route the data to outputs including Elasticsearch, Loki, Kafka, and cloud-native logging services.

Fluent Bit also supports metrics collection on Linux (measuring CPU, memory, and disk), giving it a dual role as both a log forwarder and a lightweight metrics agent. The heavier predecessor, Fluentd, remains in use but lacks native metrics collection and consumes more resources; Fluent Bit has largely replaced it in container-native deployments.
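
As an illustration of the collect-and-route pattern, a minimal Fluent Bit configuration might look like the following sketch. The hostname and paths here are placeholder assumptions; consult the Fluent Bit documentation for the input and output plugins relevant to your environment.

```
[SERVICE]
    Flush        5

[INPUT]
    Name         tail
    Path         /var/log/nginx/access.log
    Tag          nginx.access

[OUTPUT]
    Name         es
    Match        nginx.*
    Host         elasticsearch.internal
    Port         9200
```

The Tag on the input and the Match on the output are what route each record: a single agent can tail many files and fan them out to different destinations by pattern.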

So far, everything we have discussed assumes you are logged into a single server, reading its local log files. This works well for one or two machines. It breaks down quickly when you operate ten servers, fifty containers, or a fleet of microservices. In those environments, you need centralized logging: a system that collects logs from every source and makes them searchable from a single interface.

The general architecture is consistent across implementations. Agents run on each server (or sidecar container) and ship log entries to a central store. The store indexes the logs and exposes a query interface for searching and visualization.

Several well-known stacks fill this role:

ELK (Elastic Stack): the classic open-source stack, built around Elasticsearch, Logstash, and Kibana, with the lightweight Beats agents as a fourth component:

  • Elasticsearch stores and indexes the log data using an inverted index, enabling fast full-text search.
  • Logstash aggregates and transforms logs from multiple sources before forwarding them to Elasticsearch. In practice, the lighter-weight Beats agents (Filebeat for log files, Metricbeat for metrics) are deployed on each host and send data directly to Elasticsearch, with Logstash reserved for complex transformation pipelines.
  • Kibana provides the web UI for querying, visualizing, and building dashboards over the indexed data.

ELK is powerful but resource-intensive: Elasticsearch requires significant memory and CPU for large datasets and high query volumes.

Grafana LGTM Stack: a cloud-native alternative composed of four components:

  • Loki handles log aggregation and storage. Unlike Elasticsearch, Loki indexes only metadata labels (such as host, service, and environment), not the full log text. This dramatically reduces storage cost and ingestion overhead.
  • Grafana provides analytics, monitoring, and visualization for all data sources in the stack.
  • Tempo collects and stores distributed traces and integrates with other tracing tools.
  • Mimir stores metrics and is compatible with Prometheus, making it a scalable long-term metrics store.

The LGTM stack is designed with cloud-native scalability in mind: all four components are optimized for low overhead and horizontal scaling, in contrast to Elasticsearch’s heavier resource demands.

SIEM platforms: for security-focused log management, Security Information and Event Management (SIEM) systems aggregate logs from across the environment, correlate events across sources, and apply rule-based or ML-based detection to identify threats. SIEMs consume logs from IDS/IPS systems, firewalls, authentication services, and application logs simultaneously, surfacing patterns that no single source would reveal on its own.

Cloud-managed services: AWS CloudWatch Logs, Google Cloud Logging, and Azure Monitor Logs offer fully managed log ingestion and search. They reduce operational burden at the cost of vendor lock-in and per-GB pricing.

The value of centralized logging is not just convenience; it is correlation. When a request flows through a load balancer, a web server, an application server, and a database, the failure might only be visible by reading logs from multiple systems in chronological order. A centralized system lets you search across all of them with a single query, filtered by a time window or a request ID.

Building a mental model for log-based debugging


Effective log analysis is less about memorizing commands and more about developing a systematic approach. The core technique is timeline reconstruction: you are building a chronological narrative of what happened on the system, working backward from the symptom to the cause.

A practical mental model has three phases. First, scope the problem: identify the symptom precisely (which errors, which endpoints, which users, starting when). Second, narrow the window: use timestamps to focus on the minutes before the symptom appeared; the cause almost always precedes the effect by seconds to minutes, not hours. Third, broaden the sources: once you have a hypothesis, check adjacent logs (system journal, cron logs, auth logs, other services) to confirm or refute it.

A few principles make this process more reliable:

  • Start with the most specific log. If your web app is returning errors, start with the application log, not the system journal. The application log will have the actual error message and stack trace.
  • Work backward in time. The cause precedes the effect. If errors started at 03:01:44, look at what happened between 02:55 and 03:01.
  • Count before you read. Before reading individual lines, count how many errors there are and when they cluster. A command like awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c gives you a per-minute histogram that reveals patterns invisible in raw output.
  • Cross-reference systems. A database timeout in the app log should have a corresponding entry in the database log. If it does not, the problem is likely the network between them, not the database itself.
  • Save your commands. When you find the query that reveals the answer, save it. The next incident will be different, but the technique will be the same, and having a library of useful one-liners accelerates every future investigation.
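
The per-minute histogram mentioned above can be sketched concretely. Assuming the combined log format, where field 4 begins with `[day/month/year:HH:MM:SS`, keeping the first three colon-separated pieces truncates each timestamp to the minute:

```shell
# Fabricated combined-format sample spanning two minutes
cat > /tmp/access.hist.log <<'EOF'
203.0.113.7 - - [15/Mar/2026:03:01:44 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
203.0.113.7 - - [15/Mar/2026:03:01:59 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
198.51.100.4 - - [15/Mar/2026:03:02:10 +0000] "GET /api/orders HTTP/1.1" 500 162 "-" "curl/8.5.0"
EOF

# $4 is "[15/Mar/2026:03:01:44"; cut -d: -f1-3 keeps "[15/Mar/2026:03:01",
# and uniq -c then counts entries per minute
awk '{print $4}' /tmp/access.hist.log | cut -d: -f1-3 | uniq -c
```

A sudden jump in one minute's count pinpoints the onset of the incident far faster than scrolling raw lines.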

Log analysis is a skill that improves with practice. Every outage you investigate leaves you faster and more confident for the next one. The tools are simple; the art is knowing which log to read, which time window to examine, and which patterns to search for. That intuition comes only from experience.

Beyond the major stacks described above, several tools are worth knowing:

  • Graylog: an open-source log management platform built around Elasticsearch and MongoDB. It provides a web interface for log search, dashboards, and alerting, with a commercial version that adds SIEM capabilities.
  • Splunk: a widely used commercial platform for log search, monitoring, and security analytics. Splunk’s query language (SPL) is powerful, and the platform is common in enterprise and government environments where compliance reporting is required.
  • Datadog: a SaaS observability platform that unifies logs, metrics, and traces under a single UI. Datadog’s agent handles collection from most sources out of the box, which reduces setup friction compared to self-hosted stacks.
  • New Relic: another full-stack SaaS observability platform covering application performance monitoring (APM), infrastructure metrics, and log management.
  • osquery: an open-source tool that exposes operating system internals (running processes, network connections, file system state, user activity) as SQL tables. It is particularly useful for security investigations and compliance auditing: SELECT * FROM processes WHERE on_disk = 0; finds processes running from deleted binaries, a common indicator of compromise.