The Logging Playbook I Wish We’d Had Before That 3 a.m. Outage
A battle-tested, step-by-step guide to logs that actually help you debug — with schema, sampling, routing, retention, and correlation that won’t melt your budget.
If your logs can’t answer who/what/when/where/why in five minutes, you don’t have observability — you have souvenirs.
The Problem You’ve Lived
You get paged. Error rates spike. Kibana shows a wall of INFO spam and a few stack traces with no request_id. The hot path service was redeployed an hour ago, so half the pods have different log formats. You grep your way through kubectl logs and stern, but the user’s complaint references a payment ID that never appears in logs. Meanwhile your ES cluster is red, shards are unassigned, and your cloud bill looks like a VC term sheet.
I’ve seen this fail across Node, Go, and Java stacks at Series B startups and Fortune 100s. The pattern’s the same: logs exist, but they don’t answer questions fast enough to reduce MTTR. Here’s what actually works.
Objectives First: What Questions Should Logs Answer in < 5 Minutes?
If you can’t define the questions, you’ll log everything and still miss the needle.
- Primary debugging questions:
  - What request failed? Correlate by `trace_id`/`request_id`.
  - Where did it fail? Service, version, region/zone, node, pod.
  - Why did it fail? Error type, message, stack, cause, inputs (sanitized), retry state.
  - How often and since when? Count window, first_seen, last_seen, deployment SHA.
- Operational objectives:
  - MTTR improvement target: -30% in 60 days.
  - Log search P95: < 1s on hot indexes for the last 1h, < 3s for the last 24h.
  - Ingestion lag: < 10s end-to-end.
  - Logging cost budget: < 3% of infra spend or < 1.5 KB/request average payload.
If your logs can’t answer these in five minutes without paging a subject-matter hero, you don’t have observability — you have souvenirs.
Standardize the Schema: Structured JSON, Correlation, and Errors
Stop arguing frameworks. Pick a schema and hold the line.
- Required fields (stable names across all services):
  - `timestamp` (RFC3339), `severity` (DEBUG|INFO|WARN|ERROR|FATAL)
  - `message` (human-readable), `event` (machine label, e.g., `db.query.error`)
  - `service.name`, `service.version`, `env`, `region`, `zone`, `k8s.pod`, `k8s.node`
  - `trace.id`, `span.id`, `correlation_id` (HTTP `X-Request-Id` fallback)
  - `http.method`, `http.target`, `http.status_code`, `user.id_hash` (salted hash)
  - `error.type`, `error.message`, `error.stack` (on error only)
- Rules:
- JSON only. No multiline plaintext. Stack traces as arrays or escaped strings.
- Keep cardinality low: no unbounded labels (e.g., raw emails) as fields.
- Prefer `event` enums over free-form `message` for aggregation.
- Use W3C `traceparent` propagation so logs link to traces.
Examples that won’t embarrass you later:
// Node + Express + pino with OpenTelemetry correlation
import pino from 'pino';
import { context, trace } from '@opentelemetry/api';
import { v4 as uuidv4 } from 'uuid';
import express from 'express';
const logger = pino({ level: process.env.LOG_LEVEL || 'info' });
const app = express();
// Correlation middleware
app.use((req, res, next) => {
const incoming = req.headers['x-request-id'] as string | undefined;
(req as any).correlationId = incoming || uuidv4();
res.setHeader('X-Request-Id', (req as any).correlationId);
next();
});
function logWithCtx(base: any = {}) {
const span = trace.getSpan(context.active());
const traceId = span?.spanContext().traceId;
const spanId = span?.spanContext().spanId;
return logger.child({
trace: { id: traceId, span_id: spanId },
correlation_id: (base as any).correlation_id,
service: { name: 'payments-api', version: process.env.GIT_SHA },
env: process.env.RUNTIME_ENV,
});
}
app.get('/charge', async (req, res) => {
const log = logWithCtx({ correlation_id: (req as any).correlationId });
log.info({ event: 'charge.request', http: { method: 'GET', target: '/charge' } }, 'Charge requested');
try {
throw new Error('card_declined');
} catch (err: any) {
log.error({
event: 'charge.error',
error: { type: err.name, message: err.message, stack: err.stack },
user: { id_hash: 'u_7a9b...' },
}, 'Charge failed');
return res.status(402).json({ ok: false });
}
});

// Go + zap structured logging with trace context
package obs

import (
	"context"
	"os"

	"go.uber.org/zap"
)

// toStr converts an untyped context value to a string (empty if absent or wrong type).
func toStr(v any) string {
	s, _ := v.(string)
	return s
}

func LogWithCtx(ctx context.Context, base *zap.Logger) *zap.Logger {
	traceID := ctx.Value("trace_id")
	spanID := ctx.Value("span_id")
	return base.With(
		zap.String("trace.id", toStr(traceID)),
		zap.String("span.id", toStr(spanID)),
		zap.String("service.name", "checkout"),
		zap.String("service.version", os.Getenv("GIT_SHA")),
	)
}

Checkpoints:
- Every log line is valid JSON and includes `service.*`, `env`, and `trace.id` OR `correlation_id`.
- Error logs always include `error.type` and `error.message`.
- One-page schema agreement in the repo; PRs that break it get blocked.
Collect, Enrich, Redact, Route: Keep Logic Out of App Code
Use agents to keep your app slim and your policy centralized. My defaults:
- Kubernetes: `Fluent Bit` (lightweight) or `Vector` for tailing container logs and attaching K8s metadata.
- OTel Collector for receiving OTLP logs, batching, and routing to multiple backends.
- Enrichment: K8s labels, cloud region/zone, git SHA, deployment.
- Redaction: PCI/PII via regex processors. Do it at the edge.
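Before wiring redaction into the agent, it helps to pin the patterns down as plain, testable code. Here is a minimal sketch in TypeScript for illustration — the regexes and the `redact` name are assumptions, and in production the equivalent rules live in the collector's Lua script or transform processor, not in app code:

```typescript
// Illustrative redaction patterns. The same regexes would be ported to the
// Fluent Bit Lua filter or OTel Collector transform; names are hypothetical.
const REDACTIONS: Array<{ name: string; pattern: RegExp }> = [
  { name: 'pan', pattern: /\b(?:\d[ -]*?){13,16}\b/g },       // card numbers
  { name: 'ssn', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },         // US SSNs
  { name: 'jwt', pattern: /\beyJ[\w-]+\.[\w-]+\.[\w-]+\b/g }, // JWT-shaped tokens
];

function redact(line: string): string {
  // Apply every pattern in order; each hit becomes the literal "REDACTED".
  return REDACTIONS.reduce(
    (acc, { pattern }) => acc.replace(pattern, 'REDACTED'),
    line,
  );
}
```

Keeping the pattern list in one place makes the integration-test fixtures below trivial to maintain: every entry gets a "leaks" fixture and a "clean" fixture.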
Fluent Bit DaemonSet snippet:
# fluent-bit ConfigMap excerpt
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: observability
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        1
        Parsers_File parsers.conf

    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log
        Parser  docker
        Tag     kube.*

    [FILTER]
        Name      kubernetes
        Match     kube.*
        Merge_Log On
        Keep_Log  Off

    [FILTER]
        Name    rewrite_tag
        # Rule syntax is: $KEY REGEX NEW_TAG KEEP; match on a key that is always present
        Match   kube.*
        Rule    $kubernetes['namespace_name'] .* enriched.$TAG true

    [FILTER]
        Name    lua
        Match   enriched.*
        script  redact.lua
        call    redact

    [OUTPUT]
        Name    stdout
        Match   enriched.*
        Format  json_lines
    # Also send to Loki or Elastic via HTTP output

OpenTelemetry Collector routing logs to Loki (hot) and S3 (cold):
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch:
  attributes:
    actions:
      - key: service.version
        from_attribute: k8s.deployment.labels.git_sha
        action: upsert
  transform:
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "(?i)(card|ssn|pan)[:=]\\s*([0-9-]+)", "REDACTED")
exporters:
  loki:
    endpoint: http://loki-gateway.loki:3100/loki/api/v1/push
  s3:
    endpoint: s3.amazonaws.com
    bucket: org-logs-cold
    file_format: parquet
    s3uploader:
      compression: gzip
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes, transform, batch]
      exporters: [loki, s3]

Checkpoints:
- Ingestion lag from pod stdout to backend < 10s (measure in logs with `ingest_lag_ms`).
- Redaction verified in integration tests (no PAN/SSN leaks in fixture logs).
- Backpressure alerting in collector (dropped spans/logs == 0 during steady state).
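The `ingest_lag_ms` checkpoint needs no extra infrastructure: stamp it at the backend (or in a collector processor) as receive time minus the event's own timestamp. A minimal sketch, assuming the log line's RFC3339 `timestamp` field is trustworthy:

```typescript
// Sketch: derive ingestion lag by comparing the event's own timestamp to the
// time the backend (or collector) received it. Function name is hypothetical.
function ingestLagMs(
  eventTimestamp: string,        // RFC3339 timestamp from the log line
  receivedAt: Date = new Date(), // when the backend saw the line
): number {
  const eventMs = Date.parse(eventTimestamp);
  if (Number.isNaN(eventMs)) return -1;                // unparseable: flag, don't guess
  return Math.max(0, receivedAt.getTime() - eventMs);  // clamp clock skew to zero
}
```

Emit the result as a field on the stored record and alert when its P95 drifts above the 10s target.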
Storage and Retention: Hot/Warm/Cold Without Bankruptcy
One size doesn’t fit all. Use tiered backends:
- Hot (last 24–72h): fast search, frequent queries. `Loki` (cheap indexing by labels + object storage), or `Elasticsearch`/`OpenSearch` for KQL.
- Warm (7–30d): slower but queryable. ES/OpenSearch with fewer replicas on slower storage.
- Cold (30–180d+): compliance, rare queries. S3/GCS in `Parquet` via the OTel Collector; query with `Athena`/`BigQuery`/`Trino`.
If you’re on Elastic/OpenSearch, implement ILM:
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
      "warm":   { "min_age": "2d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "cold":   { "min_age": "14d", "actions": { "allocate": { "number_of_replicas": 0 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Cardinality is the silent killer. Set budgets:
- Max unique label values per index: < 10k per day for hot.
- Disallow unbounded fields in labels (e.g., `user.email`); keep them in the body.
- Sample noisy DEBUG/INFO logs: 1–10% in the hot tier; keep full fidelity in cold.
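Sampling should be deterministic per trace, so that all log lines for a given request survive or drop together instead of leaving half a story in the hot tier. A minimal sketch — the function names are assumptions, and the same keyed-hash decision can be implemented in a collector transform:

```typescript
// FNV-1a over the trace id keeps the sampler dependency-free and stable
// across services and languages.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Keep a log line iff its trace hashes into the sampled fraction.
// Every line of the same trace gets the same verdict.
function shouldSample(traceId: string, rate: number): boolean {
  if (rate >= 1) return true;  // keep everything (e.g., ERROR severity)
  if (rate <= 0) return false; // drop everything
  return fnv1a(traceId) / 0xffffffff < rate;
}
```

Apply it only to DEBUG/INFO events headed for the hot tier; errors and the cold-tier copy bypass the sampler.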
Metrics to watch:
- ES/OpenSearch: heap usage, shard count, queue size, indexing latency.
- Loki: ingester chunks in memory, rejected samples, label cardinality.
- Cost per GB ingested, query P95 latency by time range.
Make Correlation Boring: Traces, Logs, Metrics Together
The fastest debugs happen when logs stitch to traces automatically.
- Propagation: use W3C `traceparent`. Make sure Nginx/Envoy/ALB preserves it.
- Log-to-trace linking: include `trace.id` and `span.id` in every log line.
- Trace-to-log linking: expose a query URL template in your APM linking to log search.
Examples:
- Kibana KQL for a 500 error burst by trace:

service.name: "payments-api" and http.status_code: 500 and trace.id: "4f1e2..."

- Loki LogQL for a failing canary in `eu-west-1`:
avg_over_time(
  {service_name="checkout", env="prod", deployment="canary", region="eu-west-1"}
    |= "error" | json | unwrap duration_ms [5m]
) > 200

- Shell for quick-and-dirty triage when your UI is down:
kubectl logs -n prod -l app=checkout --since=10m | jq -r 'select(.severity=="ERROR") | [.timestamp, .trace.id, .message] | @tsv' | head -n 50

Checkpoints:
- Clicking a trace in your APM opens a pre-filtered log view for that `trace.id`.
- 90%+ of error logs in the last 24h have a `trace.id` present.
- SREs can answer “when did this start and after which deploy?” with a saved query.
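The trace-to-log link is usually nothing more than a URL template the APM fills in per trace. A hedged sketch — the Grafana host, datasource, and query-string shape here are hypothetical; adapt the template to whatever your log backend's deep-link format is:

```typescript
// Hypothetical template: a Loki selector plus a line filter on the trace id,
// URL-encoded into an Explore-style query parameter.
const LOG_SEARCH_TEMPLATE =
  'https://grafana.example.com/explore?query={selector}%20%7C%3D%20%60{traceId}%60';

function logSearchUrl(traceId: string, service: string): string {
  const selector = encodeURIComponent(`{service_name="${service}"}`);
  return LOG_SEARCH_TEMPLATE
    .replace('{selector}', selector)
    .replace('{traceId}', encodeURIComponent(traceId));
}
```

Register the template once in the APM's trace-detail view, and "open logs for this trace" becomes a one-click action instead of a copy-paste ritual.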
Guardrails: Governance, Redaction, and Testing Log Quality
This is where most teams fall over after month three.
- PII/Secrets: redact at the collector. Maintain unit tests for redactors with known fixtures.
- Schema stability: lint JSON logs in CI. Break builds if required fields are missing.
- Logging budget: enforce per-service bytes/request; alert when exceeding budget.
- Access controls: limit who can query cold data containing sensitive fields.
- Drills: quarterly chaos drills focused on observability — can an on-call solve a synthetic incident in < 15 minutes using logs?
Example Jest test for schema presence:
import { validate } from 'jsonschema';
import schema from '../logging.schema.json';
test('error logs include required fields', () => {
const log = {
timestamp: new Date().toISOString(),
severity: 'ERROR',
message: 'failed',
event: 'db.query.error',
service: { name: 'orders', version: 'abc123' },
trace: { id: '4f1e...', span_id: '9a0b...' },
error: { type: 'TimeoutError', message: 'timeout' }
};
const res = validate(log, schema);
expect(res.valid).toBe(true);
});

Checkpoints:
- CI fails on schema regressions.
- Redaction tests cover top 10 sensitive patterns (PAN, SSN, API keys, JWTs).
- Logging bytes/request reported on dashboards per service and enforced via SLO.
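The bytes-per-request budget check is two counters and a division — a minimal sketch, assuming you can pull total ingested log bytes and request count per service from your metrics backend (the interface and threshold names are illustrative):

```typescript
// Sketch: per-service logging budget check fed by two counters over the
// same time window. Counter names and types are hypothetical.
interface LoggingStats {
  loggedBytes: number;  // total log bytes ingested for the service
  requestCount: number; // total requests served in the same window
}

const BUDGET_BYTES_PER_REQUEST = 1536; // 1.5 KB target from the playbook

function overBudget(stats: LoggingStats): boolean {
  if (stats.requestCount === 0) return false; // empty window: nothing to judge
  return stats.loggedBytes / stats.requestCount > BUDGET_BYTES_PER_REQUEST;
}
```

Wire the result into a per-service dashboard panel and a burn alert, so chatty deploys are caught in days, not at invoice time.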
Rollout Plan With Metrics (6 Weeks That Actually Works)
- Week 1: Agree the schema, wire correlation IDs, enable structured JSON in 1–2 services.
  - Metric: 80% of error logs have `trace.id` or `correlation_id`.
- Week 2: Deploy collector/agent with enrichment and redaction. Route to a hot backend.
- Metric: ingest lag < 10s; zero dropped logs in steady state.
- Week 3: Tiered storage: hot 72h + cold S3 parquet. Define ILM or retention.
- Metric: 30% cost reduction vs. monolithic ES in staging.
- Week 4: Saved searches and trace-to-log linking in your APM (Datadog, Grafana, Tempo/Jaeger).
- Metric: on-call drill MTTR from 45m → < 25m in staging incident.
- Week 5: Cardinality controls: sampling, label allow-list, drop chatty events.
- Metric: hot index cardinality < 10k unique label values/day per index.
- Week 6: Governance: CI schema checks, redaction tests, dashboards for bytes/request.
- Metric: logging spend < 3% infra; bytes/request < 1.5 KB avg.
If you need a forcing function, do this via GitOps: config for agents, ILM, and dashboards live in the repo and roll via ArgoCD.
What We’ve Seen Work (And Fail)
- Works: JSON logs + OTel correlation + Loki hot + S3 cold. A retail client cut MTTR by 42% and ES costs by 60% migrating hot search to Loki while keeping cold in S3.
- Works: strict schema + CI linting. Prevented a junior dev from shipping a `debug` map with 200 dynamic keys that exploded cardinality.
- Fails: one-big-ES-cluster-for-everything with no ILM. You’ll page the ES team more than the product.
- Fails: trying to redact in app code. You’ll miss the third-party library that logs raw headers.
- Fails: logs without trace correlation. You’ll drown in context switching between tabs.
If you want a partner that’s done this across Kubernetes, ECS, and bare metal, GitPlumbers has the tools and the scars. We’ll help you ship a logging strategy that reduces pager load without torching your budget.
Key takeaways
- Logs should answer who/what/when/where/why in under 5 minutes — define objectives and measure them.
- Standardize structured JSON logs with stable fields and correlation IDs tied to traces.
- Use agents (Fluent Bit/Vector/Otelcol) to enrich, redact, and route — not your app code.
- Design storage as hot/warm/cold with ILM or tiered retention to control cost and search latency.
- Treat log cardinality as a budget; enforce sampling and field constraints at the edge.
- Correlate logs with traces and metrics to reduce MTTR and paged hours.
- Continuously test log quality in CI and via chaos drills; guardrails rot if you don’t measure.
Implementation checklist
- Define a logging objective and MTTR targets linked to SLOs.
- Adopt a common JSON schema with correlation IDs and error fields.
- Instrument structured logging in each service with language-appropriate libraries.
- Deploy an agent/collector for enrichment, redaction, and routing to multiple backends.
- Implement ILM/tiered retention and cost controls (cardinality caps, sampling, drop rules).
- Integrate tracing: propagate W3C Trace Context (`traceparent`) and include trace/span IDs in logs.
- Create saved searches and explore flows for top incidents; codify runbooks.
- Continuously measure ingestion lag, query P95, error log rate, and logging spend per request.
Questions we hear from teams
- Should I pick Loki or Elasticsearch for hot search?
- If most of your queries are recent and label-filterable, Loki is cheaper and simpler to scale. If you rely on deep full-text search across arbitrary fields, Elasticsearch/OpenSearch shines — but control cardinality and use ILM. Many teams run Loki hot + S3 cold and keep ES only for specific teams that need text search.
- Do I need OpenTelemetry for logs?
- You need consistent correlation. OTel makes trace propagation and log linkage boring. Even if you don’t ship OTLP logs, use OTel APIs to fetch trace/span IDs and include them in your logs.
- How do I prevent sensitive data from leaking into logs?
- Redact at the collector/agent with tested regex/transform processors. Add CI tests with known fixtures. Lock down cold storage access. Avoid logging raw headers/request bodies unless sanitized.
- What’s a reasonable logging budget?
- Target <3% of infra spend or <1.5 KB/request average for application logs. Sample debug logs and drop chatty, low-value events from hot tiers; retain full fidelity in cold storage if you must.
- How do I measure if logging improved MTTR?
- Run controlled incident drills before/after rollout. Track mean time from alert to root cause and the number of context switches (tabs/queries). Also measure query P95 and ingestion lag; they correlate strongly with triage speed.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
