The Logging Playbook I Wish We’d Had Before That 3 a.m. Outage
A battle-tested, step-by-step guide to logs that actually help you debug — with schema, sampling, routing, retention, and correlation that won’t melt your budget.
If your logs can’t answer who/what/when/where/why in five minutes, you don’t have observability — you have souvenirs.
The Problem You’ve Lived
You get paged. Error rates spike. Kibana shows a wall of INFO spam and a few stack traces with no request_id. The hot path service was redeployed an hour ago, so half the pods have different log formats. You grep your way through kubectl logs and stern, but the user’s complaint references a payment ID that never appears in logs. Meanwhile your ES cluster is red, shards are unassigned, and your cloud bill looks like a VC term sheet.
I’ve seen this fail across Node, Go, and Java stacks at Series B startups and Fortune 100s. The pattern’s the same: logs exist, but they don’t answer questions fast enough to reduce MTTR. Here’s what actually works.
Objectives First: What Questions Should Logs Answer in < 5 Minutes?
If you can’t define the questions, you’ll log everything and still miss the needle.
- Primary debugging questions:
  - What request failed? Correlate by `trace_id`/`request_id`.
  - Where did it fail? Service, version, region/zone, node, pod.
  - Why did it fail? Error type, message, stack, cause, inputs (sanitized), retry state.
  - How often and since when? Count window, first_seen, last_seen, deployment SHA.
- Operational objectives:
  - MTTR improvement target: -30% in 60 days.
  - Log search P95: < 1s on hot indexes for the last 1h, < 3s for the last 24h.
  - Ingestion lag: < 10s end-to-end.
  - Logging cost budget: < 3% of infra spend or < 1.5 KB/request average payload.
If your logs can’t answer these in five minutes without paging a subject-matter hero, you don’t have observability — you have souvenirs.
Standardize the Schema: Structured JSON, Correlation, and Errors
Stop arguing frameworks. Pick a schema and hold the line.
- Required fields (stable names across all services):
  - `timestamp` (RFC3339), `severity` (DEBUG|INFO|WARN|ERROR|FATAL)
  - `message` (human-readable), `event` (machine label, e.g., `db.query.error`)
  - `service.name`, `service.version`, `env`, `region`, `zone`, `k8s.pod`, `k8s.node`
  - `trace.id`, `span.id`, `correlation_id` (HTTP `X-Request-Id` fallback)
  - `http.method`, `http.target`, `http.status_code`, `user.id_hash` (salted hash)
  - `error.type`, `error.message`, `error.stack` (on error only)
- Rules:
- JSON only. No multiline plaintext. Stack traces as arrays or escaped strings.
- Keep cardinality low: no unbounded labels (e.g., raw emails) as fields.
- Prefer `event` enums over free-form `message` for aggregation.
- Use W3C `traceparent` propagation so logs link to traces.
Examples that won’t embarrass you later:
// Node + Express + pino with OpenTelemetry correlation
import pino from 'pino';
import { context, trace } from '@opentelemetry/api';
import { v4 as uuidv4 } from 'uuid';
import express from 'express';
const logger = pino({ level: process.env.LOG_LEVEL || 'info' });
const app = express();
// Correlation middleware
app.use((req, res, next) => {
const incoming = req.headers['x-request-id'] as string | undefined;
(req as any).correlationId = incoming || uuidv4();
res.setHeader('X-Request-Id', (req as any).correlationId);
next();
});
function logWithCtx(base: any = {}) {
const span = trace.getSpan(context.active());
const traceId = span?.spanContext().traceId;
const spanId = span?.spanContext().spanId;
return logger.child({
trace: { id: traceId, span_id: spanId },
correlation_id: (base as any).correlation_id,
service: { name: 'payments-api', version: process.env.GIT_SHA },
env: process.env.RUNTIME_ENV,
});
}
app.get('/charge', async (req, res) => {
const log = logWithCtx({ correlation_id: (req as any).correlationId });
log.info({ event: 'charge.request', http: { method: 'GET', target: '/charge' } }, 'Charge requested');
try {
throw new Error('card_declined');
} catch (err: any) {
log.error({
event: 'charge.error',
error: { type: err.name, message: err.message, stack: err.stack },
user: { id_hash: 'u_7a9b...' },
}, 'Charge failed');
return res.status(402).json({ ok: false });
}
});

// Go + zap structured logging with trace context
package obs

import (
	"context"
	"os"

	"go.uber.org/zap"
)

// toStr converts an untyped context value to a string (empty if absent or wrong type).
func toStr(v any) string {
	s, _ := v.(string)
	return s
}

func LogWithCtx(ctx context.Context, base *zap.Logger) *zap.Logger {
	traceID := ctx.Value("trace_id")
	spanID := ctx.Value("span_id")
	return base.With(
		zap.String("trace.id", toStr(traceID)),
		zap.String("span.id", toStr(spanID)),
		zap.String("service.name", "checkout"),
		zap.String("service.version", os.Getenv("GIT_SHA")),
	)
}

Checkpoints:
- Every log line is valid JSON and includes `service.*`, `env`, and `trace.id` OR `correlation_id`.
- Error logs always include `error.type` and `error.message`.
- One-page schema agreement in the repo; PRs that break it get blocked.
Collect, Enrich, Redact, Route: Keep Logic Out of App Code
Use agents to keep your app slim and your policy centralized. My defaults:
- Kubernetes: `Fluent Bit` (lightweight) or `Vector` for tailing container logs and attaching K8s metadata.
- OTel Collector for receiving OTLP logs, batching, and routing to multiple backends.
- Enrichment: K8s labels, cloud region/zone, git SHA, deployment.
- Redaction: PCI/PII via regex processors. Do it at the edge.
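Before wiring redaction into the agent, it helps to pin the patterns down as plain, testable code. Here is a minimal sketch in TypeScript for illustration — the regexes and the `redact` name are assumptions, and in production the equivalent rules live in the collector's Lua script or transform processor, not in app code:

```typescript
// Illustrative redaction patterns. The same regexes would be ported to the
// Fluent Bit Lua filter or OTel Collector transform; names are hypothetical.
const REDACTIONS: Array<{ name: string; pattern: RegExp }> = [
  { name: 'pan', pattern: /\b(?:\d[ -]*?){13,16}\b/g },       // card numbers
  { name: 'ssn', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },         // US SSNs
  { name: 'jwt', pattern: /\beyJ[\w-]+\.[\w-]+\.[\w-]+\b/g }, // JWT-shaped tokens
];

function redact(line: string): string {
  // Apply every pattern in order; each hit becomes the literal "REDACTED".
  return REDACTIONS.reduce(
    (acc, { pattern }) => acc.replace(pattern, 'REDACTED'),
    line,
  );
}
```

Keeping the pattern list in one place makes the integration-test fixtures below trivial to maintain: every entry gets a "leaks" fixture and a "clean" fixture.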
Fluent Bit DaemonSet snippet:
# fluent-bit ConfigMap excerpt
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: observability
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        1
        Parsers_File parsers.conf

    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log
        Parser  docker
        Tag     kube.*

    [FILTER]
        Name      kubernetes
        Match     kube.*
        Merge_Log On
        Keep_Log  Off

    [FILTER]
        Name    rewrite_tag
        # Rule syntax is: $KEY REGEX NEW_TAG KEEP; match on a key that is always present
        Match   kube.*
        Rule    $kubernetes['namespace_name'] .* enriched.$TAG true

    [FILTER]
        Name    lua
        Match   enriched.*
        script  redact.lua
        call    redact

    [OUTPUT]
        Name    stdout
        Match   enriched.*
        Format  json_lines
    # Also send to Loki or Elastic via HTTP output

OpenTelemetry Collector routing logs to Loki (hot) and S3 (cold):
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch:
  attributes:
    actions:
      - key: service.version
        from_attribute: k8s.deployment.labels.git_sha
        action: upsert
  transform:
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "(?i)(card|ssn|pan)[:=]\\s*([0-9-]+)", "REDACTED")
exporters:
  loki:
    endpoint: http://loki-gateway.loki:3100/loki/api/v1/push
  s3:
    endpoint: s3.amazonaws.com
    bucket: org-logs-cold
    file_format: parquet
    s3uploader:
      compression: gzip
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes, transform, batch]
      exporters: [loki, s3]

Checkpoints:
- Ingestion lag from pod stdout to backend < 10s (measure in logs with `ingest_lag_ms`).
- Redaction verified in integration tests (no PAN/SSN leaks in fixture logs).
- Backpressure alerting in collector (dropped spans/logs == 0 during steady state).
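The `ingest_lag_ms` checkpoint needs no extra infrastructure: stamp it at the backend (or in a collector processor) as receive time minus the event's own timestamp. A minimal sketch, assuming the log line's RFC3339 `timestamp` field is trustworthy:

```typescript
// Sketch: derive ingestion lag by comparing the event's own timestamp to the
// time the backend (or collector) received it. Function name is hypothetical.
function ingestLagMs(
  eventTimestamp: string,        // RFC3339 timestamp from the log line
  receivedAt: Date = new Date(), // when the backend saw the line
): number {
  const eventMs = Date.parse(eventTimestamp);
  if (Number.isNaN(eventMs)) return -1;                // unparseable: flag, don't guess
  return Math.max(0, receivedAt.getTime() - eventMs);  // clamp clock skew to zero
}
```

Emit the result as a field on the stored record and alert when its P95 drifts above the 10s target.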
Storage and Retention: Hot/Warm/Cold Without Bankruptcy
One size doesn’t fit all. Use tiered backends:
- Hot (last 24–72h): fast search, frequent queries. `Loki` (cheap indexing by labels + object storage), or `Elasticsearch`/`OpenSearch` for KQL.
- Warm (7–30d): slower but queryable. ES/OpenSearch with fewer replicas on slower storage.
- Cold (30–180d+): compliance, rare queries. S3/GCS in `Parquet` via the OTel Collector; query with `Athena`/`BigQuery`/`Trino`.
If you’re on Elastic/OpenSearch, implement ILM:
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "2d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "cold":   { "min_age": "14d", "actions": { "allocate": { "number_of_replicas": 0 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Cardinality is the silent killer. Set budgets:
- Max unique label values per index: < 10k per day for hot.
- Disallow unbounded fields in labels (e.g., `user.email`); keep them in the body.
- Sample noisy DEBUG/INFO logs: 1–10% in the hot tier; keep full fidelity in cold.
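Sampling should be deterministic per trace, so that all log lines for a given request survive or drop together instead of leaving half a story in the hot tier. A minimal sketch — the function names are assumptions, and the same keyed-hash decision can be implemented in a collector transform:

```typescript
// FNV-1a over the trace id keeps the sampler dependency-free and stable
// across services and languages.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Keep a log line iff its trace hashes into the sampled fraction.
// Every line of the same trace gets the same verdict.
function shouldSample(traceId: string, rate: number): boolean {
  if (rate >= 1) return true;  // keep everything (e.g., ERROR severity)
  if (rate <= 0) return false; // drop everything
  return fnv1a(traceId) / 0xffffffff < rate;
}
```

Apply it only to DEBUG/INFO events headed for the hot tier; errors and the cold-tier copy bypass the sampler.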
Metrics to watch:
- ES/OpenSearch: heap usage, shard count, queue size, indexing latency.
- Loki: ingester chunks in memory, rejected samples, label cardinality.
- Cost per GB ingested, query P95 latency by time range.
Make Correlation Boring: Traces, Logs, Metrics Together
The fastest debugs happen when logs stitch to traces automatically.
- Propagation: use W3C `traceparent`. Make sure Nginx/Envoy/ALB preserves it.
- Log-to-trace linking: include `trace.id` and `span.id` in every log line.
- Trace-to-log linking: expose a query URL template in your APM linking to log search.
Examples:
- Kibana KQL for a 500 error burst by trace:

service.name: "payments-api" and http.status_code: 500 and trace.id: "4f1e2..."

- Loki LogQL for a failing canary in `eu-west-1`:
avg_over_time(
  {service_name="checkout", env="prod", deployment="canary", region="eu-west-1"}
    |= "error" | json | unwrap duration_ms [5m]
) > 200

- Shell for quick-and-dirty triage when your UI is down:
kubectl logs -n prod -l app=checkout --since=10m | jq -r 'select(.severity=="ERROR") | [.timestamp, .trace.id, .message] | @tsv' | head -n 50

Checkpoints:
- Clicking a trace in your APM opens a pre-filtered log view for that `trace.id`.
- 90%+ of error logs in the last 24h have a `trace.id` present.
- SREs can answer “when did this start and after which deploy?” with a saved query.
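The trace-to-log link is usually nothing more than a URL template the APM fills in per trace. A hedged sketch — the Grafana host, datasource, and query-string shape here are hypothetical; adapt the template to whatever your log backend's deep-link format is:

```typescript
// Hypothetical template: a Loki selector plus a line filter on the trace id,
// URL-encoded into an Explore-style query parameter.
const LOG_SEARCH_TEMPLATE =
  'https://grafana.example.com/explore?query={selector}%20%7C%3D%20%60{traceId}%60';

function logSearchUrl(traceId: string, service: string): string {
  const selector = encodeURIComponent(`{service_name="${service}"}`);
  return LOG_SEARCH_TEMPLATE
    .replace('{selector}', selector)
    .replace('{traceId}', encodeURIComponent(traceId));
}
```

Register the template once in the APM's trace-detail view, and "open logs for this trace" becomes a one-click action instead of a copy-paste ritual.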
Guardrails: Governance, Redaction, and Testing Log Quality
This is where most teams fall over after month three.
- PII/Secrets: redact at the collector. Maintain unit tests for redactors with known fixtures.
- Schema stability: lint JSON logs in CI. Break builds if required fields are missing.
- Logging budget: enforce per-service bytes/request; alert when exceeding budget.
- Access controls: limit who can query cold data containing sensitive fields.
- Drills: quarterly chaos drills focused on observability — can an on-call solve a synthetic incident in < 15 minutes using logs?
Example Jest test for schema presence:
import { validate } from 'jsonschema';
import schema from '../logging.schema.json';
test('error logs include required fields', () => {
const log = {
timestamp: new Date().toISOString(),
severity: 'ERROR',
message: 'failed',
event: 'db.query.error',
service: { name: 'orders', version: 'abc123' },
trace: { id: '4f1e...', span_id: '9a0b...' },
error: { type: 'TimeoutError', message: 'timeout' }
};
const res = validate(log, schema);
expect(res.valid).toBe(true);
});

Checkpoints:
- CI fails on schema regressions.
- Redaction tests cover top 10 sensitive patterns (PAN, SSN, API keys, JWTs).
- Logging bytes/request reported on dashboards per service and enforced via SLO.
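The bytes-per-request budget check is two counters and a division — a minimal sketch, assuming you can pull total ingested log bytes and request count per service from your metrics backend (the interface and threshold names are illustrative):

```typescript
// Sketch: per-service logging budget check fed by two counters over the
// same time window. Counter names and types are hypothetical.
interface LoggingStats {
  loggedBytes: number;  // total log bytes ingested for the service
  requestCount: number; // total requests served in the same window
}

const BUDGET_BYTES_PER_REQUEST = 1536; // 1.5 KB target from the playbook

function overBudget(stats: LoggingStats): boolean {
  if (stats.requestCount === 0) return false; // empty window: nothing to judge
  return stats.loggedBytes / stats.requestCount > BUDGET_BYTES_PER_REQUEST;
}
```

Wire the result into a per-service dashboard panel and a burn alert, so chatty deploys are caught in days, not at invoice time.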
Rollout Plan With Metrics (6 Weeks That Actually Works)
- Week 1: Agree the schema, wire correlation IDs, enable structured JSON in 1–2 services.
  - Metric: 80% of error logs have `trace.id` or `correlation_id`.
- Week 2: Deploy collector/agent with enrichment and redaction. Route to a hot backend.
- Metric: ingest lag < 10s; zero dropped logs in steady state.
- Week 3: Tiered storage: hot 72h + cold S3 parquet. Define ILM or retention.
- Metric: 30% cost reduction vs. monolithic ES in staging.
- Week 4: Saved searches and trace-to-log linking in your APM (Datadog, Grafana, Tempo/Jaeger).
- Metric: on-call drill MTTR from 45m → < 25m in staging incident.
- Week 5: Cardinality controls: sampling, label allow-list, drop chatty events.
- Metric: hot index cardinality < 10k unique label values/day per index.
- Week 6: Governance: CI schema checks, redaction tests, dashboards for bytes/request.
- Metric: logging spend < 3% infra; bytes/request < 1.5 KB avg.
If you need a forcing function, do this via GitOps: config for agents, ILM, and dashboards live in the repo and roll via ArgoCD.
What We’ve Seen Work (And Fail)
- Works: JSON logs + OTel correlation + Loki hot + S3 cold. A retail client cut MTTR by 42% and ES costs by 60% migrating hot search to Loki while keeping cold in S3.
- Works: strict schema + CI linting. Prevented a junior dev from shipping a `debug` map with 200 dynamic keys that exploded cardinality.
- Fails: one-big-ES-cluster-for-everything with no ILM. You’ll page the ES team more than the product.
- Fails: trying to redact in app code. You’ll miss the third-party library that logs raw headers.
- Fails: logs without trace correlation. You’ll drown in context switching between tabs.
If you want a partner that’s done this across Kubernetes, ECS, and bare metal, GitPlumbers has the tools and the scars. We’ll help you ship a logging strategy that reduces pager load without torching your budget.
Key takeaways
- Logs should answer who/what/when/where/why in under 5 minutes — define objectives and measure them.
- Standardize structured JSON logs with stable fields and correlation IDs tied to traces.
- Use agents (Fluent Bit/Vector/Otelcol) to enrich, redact, and route — not your app code.
- Design storage as hot/warm/cold with ILM or tiered retention to control cost and search latency.
- Treat log cardinality as a budget; enforce sampling and field constraints at the edge.
- Correlate logs with traces and metrics to reduce MTTR and paged hours.
- Continuously test log quality in CI and via chaos drills; guardrails rot if you don’t measure.
Implementation checklist
- Define a logging objective and MTTR targets linked to SLOs.
- Adopt a common JSON schema with correlation IDs and error fields.
- Instrument structured logging in each service with language-appropriate libraries.
- Deploy an agent/collector for enrichment, redaction, and routing to multiple backends.
- Implement ILM/tiered retention and cost controls (cardinality caps, sampling, drop rules).
- Integrate tracing: propagate W3C Trace Context (`traceparent`) and include trace/span IDs in logs.
- Create saved searches and explore flows for top incidents; codify runbooks.
- Continuously measure ingestion lag, query P95, error log rate, and logging spend per request.
Questions we hear from teams
- Should I pick Loki or Elasticsearch for hot search?
- If most of your queries are recent and label-filterable, Loki is cheaper and simpler to scale. If you rely on deep full-text search across arbitrary fields, Elasticsearch/OpenSearch shines — but control cardinality and use ILM. Many teams run Loki hot + S3 cold and keep ES only for specific teams that need text search.
- Do I need OpenTelemetry for logs?
- You need consistent correlation. OTel makes trace propagation and log linkage boring. Even if you don’t ship OTLP logs, use OTel APIs to fetch trace/span IDs and include them in your logs.
- How do I prevent sensitive data from leaking into logs?
- Redact at the collector/agent with tested regex/transform processors. Add CI tests with known fixtures. Lock down cold storage access. Avoid logging raw headers/request bodies unless sanitized.
- What’s a reasonable logging budget?
- Target <3% of infra spend or <1.5 KB/request average for application logs. Sample debug logs and drop chatty, low-value events from hot tiers; retain full fidelity in cold storage if you must.
- How do I measure if logging improved MTTR?
- Run controlled incident drills before/after rollout. Track mean time from alert to root cause and the number of context switches (tabs/queries). Also measure query P95 and ingestion lag; they correlate strongly with triage speed.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
