Your Logs Are Chatty, Not Helpful: A Field Guide to Debuggable Logging That Cuts MTTR in Half
If your on-call playbook starts with “open Kibana and pray,” this is for you. Here’s the logging strategy we’ve used to turn noisy firehoses into surgical tools — with concrete configs, guardrails, and rollout steps that survive real incidents.
Logs are for humans under pressure. If a stressed engineer can’t find the breadcrumb trail in 3 clicks, your logging isn’t done.
The incident you’ve lived
Friday evening deploy. Error rate spikes. You open Kibana and it’s a wall of info logs, stack traces missing trace_id, and five different JSON shapes for the same error. You grep, pivot, and still can’t connect the 502 in the edge to the null in the payments service. I’ve seen this movie at a FAANG and three unicorns: the log volume is massive, the signal is garbage.
This guide is the playbook we use at GitPlumbers when we retrofit logging in places where “just ship it” turned into “we can’t debug prod.” It’s opinionated, boring-in-the-best-way, and it works under pressure.
Define success the way your CFO and SREs care about
Before you change a single line of code, define what good looks like:
- MTTR: Reduce median MTTR for P1s by 40–60% in 90 days.
- Query latency: P95 log query under 3s for 24h window on hot tier.
- Coverage: 95% of error logs include `trace_id`, `span_id`, `service`, `env`, `route`, `user_id` (hashed), and `version`.
- Cost: Keep ingest under $X/day with sampling and tiered retention.
- Compliance: Zero PII in logs (verified via automated scans).
If you can’t measure it, you can’t tune it. Instrument dashboards for these upfront.
Standardize the schema and instrument once, consistently
Your logs are an API. Treat them like one. Agree on a minimal cross-service schema and enforce it.
- Required fields: `ts`, `level`, `service`, `env`, `region`, `trace_id`, `span_id`, `request_id`, `route`, `version`, `msg`.
- Optional but useful: `user_id_hash`, `http.method`, `http.status_code`, `retry_count`, `feature_flag`, `err.type`, `err.stack`.
- Format: newline-delimited JSON, UTC timestamps, no multiline stacks (encode the stack as a single field).
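Enforcement can start with something as small as a required-fields check run in CI against log fixtures. A minimal sketch, with no dependencies beyond the standard library (the function names are ours; the field list matches the schema above):

```python
import json

# Required fields from the cross-service schema above.
REQUIRED = {"ts", "level", "service", "env", "region", "trace_id",
            "span_id", "request_id", "route", "version", "msg"}

def missing_fields(line: str) -> set:
    """Parse one NDJSON log line and return any missing required fields."""
    record = json.loads(line)
    return REQUIRED - record.keys()

def validate_fixture(path: str) -> dict:
    """Return {line_number: missing_fields} for every bad line in a fixture file."""
    failures = {}
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if line.strip():
                gaps = missing_fields(line)
                if gaps:
                    failures[i] = gaps
    return failures
```

Wire `validate_fixture` into CI so a PR that drops a required field fails before it ships, not during an incident.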
Example: Node.js with pino + OpenTelemetry context.

```typescript
// logger.ts
import pino from 'pino';
import { context, trace } from '@opentelemetry/api';

export const logger = pino({ level: process.env.LOG_LEVEL || 'info' });

export function logWithCtx(level: pino.Level, msg: string, extra: Record<string, unknown> = {}) {
  const span = trace.getSpan(context.active());
  const spanCtx = span?.spanContext();
  const base = {
    service: process.env.SERVICE_NAME,
    env: process.env.NODE_ENV,
    version: process.env.BUILD_SHA,
    trace_id: spanCtx?.traceId,
    span_id: spanCtx?.spanId,
    ...extra,
  };
  logger[level](base, msg);
}

// usage
logWithCtx('error', 'Stripe charge failed', {
  route: '/checkout',
  user_id_hash: 'c8f...',
  err: { type: 'StripeCardError', code: 'card_declined' },
});
```

Python with structlog:
```python
# logging_setup.py
import os

import structlog
from opentelemetry import trace

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.dict_tracebacks,
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()

def log(level, msg, **kw):
    span = trace.get_current_span()
    ctx = span.get_span_context()
    fields = dict(
        service=os.getenv("SERVICE_NAME"),
        env=os.getenv("ENV"),
        version=os.getenv("BUILD_SHA"),
        # OTel exposes ids as ints; render as hex so they match trace UIs.
        trace_id=format(ctx.trace_id, "032x") if ctx.is_valid else None,
        span_id=format(ctx.span_id, "016x") if ctx.is_valid else None,
        **kw,
    )
    getattr(logger, level)(msg, **fields)
```

Java (Spring Boot) with Logback + JSON encoder:
```xml
<!-- logback-spring.xml -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
      <providers>
        <timestamp>
          <timeZone>UTC</timeZone>
        </timestamp>
        <logLevel/>
        <loggerName/>
        <message/>
        <mdc>
          <excludeMdcKeyName>password</excludeMdcKeyName>
        </mdc>
        <arguments/>
        <stackTrace/>
      </providers>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

Populate MDC with OTel context via a filter or interceptor so every log carries `trace_id`/`span_id`.
Don’t let each team invent their own fields. You’ll pay that tax during incidents.
Ship logs through a transformation layer (redaction, routing, sampling)
Raw app logs should not go straight to Elasticsearch/Loki. Insert a router you control — Vector or OpenTelemetry Collector — to enforce policy, redact PII, shape fields, and route by environment.
Vector example (VRL) to hash user IDs and drop credit cards:
```toml
# vector.toml
[sources.app]
type = "stdin" # replace with kubernetes_logs or file

[transforms.shape]
type = "remap"
inputs = ["app"]
drop_on_abort = true # aborted events are dropped, not forwarded
source = '''
.ts = to_string(.ts) ?? format_timestamp!(now(), "%+")
.level = upcase(to_string(.level) ?? "INFO")
if exists(.user_id) {
  # hash before the log leaves your network (sha2 is VRL's SHA-2 family)
  .user_id_hash = sha2(string!(.user_id), variant: "SHA-256")
  del(.user_id)
}
if match(to_string(.msg) ?? "", r'\b\d{13,16}\b') { abort } # naive CC filter
if .env == "prod" && .level == "DEBUG" { abort } # kill debug in prod
'''

[sinks.loki]
type = "loki"
inputs = ["shape"]
endpoint = "http://loki:3100"
labels = { service = "{{ service }}", env = "{{ env }}", level = "{{ level }}" }

[sinks.es]
type = "elasticsearch"
inputs = ["shape"]
endpoints = ["https://es.example.com:9200"]
index = "logs-%Y.%m.%d"
```

OpenTelemetry Collector to inject trace context into logs on the sidecar:
```yaml
receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  batch: {}
  attributes:
    actions:
      - key: trace_id
        from_attribute: trace.trace_id
        action: upsert
exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [loki]
```

Checkpoints:
- Every prod error log includes `trace_id` (sample 1,000 logs; expect 95%+ with a value).
- Redaction works (seed known PII in staging and verify it’s missing downstream).
- Routing sends `dev` to a cheaper tier, `prod` to hot storage.
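The `trace_id` coverage checkpoint is cheap to automate: pull a sample of recent error logs and compute what fraction carry a non-empty `trace_id`. A sketch assuming NDJSON input (the function names are ours; the 95% threshold mirrors the target above):

```python
import json

def trace_id_coverage(lines):
    """Fraction of error-level log records carrying a non-empty trace_id."""
    errors = 0
    with_trace = 0
    for line in lines:
        rec = json.loads(line)
        if rec.get("level", "").lower() in ("error", "fatal"):
            errors += 1
            if rec.get("trace_id"):
                with_trace += 1
    return with_trace / errors if errors else 1.0

def check(lines, threshold=0.95):
    """Fail CI (or alert the owning team) when coverage drops below target."""
    return trace_id_coverage(lines) >= threshold
```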
Store, index, and keep costs sane
Pick a backend you can operate. We see success with:
- Loki for cheap, high-volume logs with labels + LogQL. Great for K8s.
- Elasticsearch/OpenSearch when you need flexible indexing and KQL.
- Cloud-native (CloudWatch, GCP Logging, BigQuery) if you already live there — but watch egress and query costs.
Guidelines:
- Keep hot retention small (24–72h). Warm (7–14d). Cold archive (30–90d) to S3/GCS.
- In Elasticsearch, use ILM to roll over by size/time and shrink to warm nodes.
- In Loki, keep label cardinality low; avoid `user_id` or `request_id` as labels.
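Before promoting a field to a Loki label, estimate its cardinality from a sample; anything that grows with request or user counts belongs in the log body, not a label. A rough sketch (function name ours):

```python
import json

def label_cardinality(lines, candidate_labels):
    """Count distinct values per candidate label across an NDJSON log sample."""
    values = {label: set() for label in candidate_labels}
    for line in lines:
        rec = json.loads(line)
        for label in candidate_labels:
            if label in rec:
                values[label].add(rec[label])
    return {label: len(vals) for label, vals in values.items()}
```

If a candidate’s distinct-value count scales with traffic instead of staying flat, keep it queryable via `| json` rather than indexed as a label.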
Elasticsearch ILM example:
```json
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "2d" } } },
      "warm": { "min_age": "2d", "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "cold": { "min_age": "7d", "actions": { "freeze": {}, "set_priority": { "priority": 0 } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
```

Metrics to watch:
- Ingest rate MB/s and events/s by service.
- P95 query latency for top saved queries.
- Storage growth vs. retention policy (projected 90 days).
Query faster under pressure: saved searches and runbooks
Don’t make on-call craft queries live. Pre-bake searches for your top incident classes and link them in runbooks.
Elasticsearch query examples (Lucene syntax):

```
service:payments AND level:(error OR fatal) AND env:prod AND @timestamp:[now-15m TO now]
```

Find all 5xx correlated to a given trace_id:

```
trace_id:"7b8c..." AND http.status_code:>=500
```

Loki LogQL examples:

```
{service="checkout", env="prod", level="ERROR"} |= "Stripe" | json | http_status >= 500
```

Aggregate error rates by route (Grafana panel):

```
sum by (route) (rate({service="api", level="ERROR"}[5m]))
```

Playbook staples:
- Link from APM trace view to logs by `trace_id` (Datadog, Tempo, X-Ray, or Jaeger).
- Dashboards for golden signals: error rate, latency percentiles, saturation.
- Saved queries for: timeouts, DB pool exhaustion, retries > N, circuit breaker open, auth failures.
Checkpoint: A dry-run incident should go from alert to “root-causing log line found” in under 5 minutes.
Guardrails: linting, CI, and runtime controls
What actually sticks is automation:
- Linting: Block `console.log` and unstructured logs.

```jsonc
// .eslintrc.json
{
  "rules": {
    "no-console": ["error", { "allow": ["warn", "error"] }],
    "@gitplumbers/structured-log": "error"
  }
}
```

- Schema tests: Validate representative logs in CI with JSON Schema.
```shell
# package.json scripts
ajv validate -s schema/log.json -d "fixtures/*.json"
```

- Pipelines as code: Version your Vector/OTel configs. PRs must update both code and router.
- Runtime controls: Allow safe level changes without redeploy.
Spring Boot example:
```shell
curl -X POST localhost:8080/actuator/loggers/com.yourco.payments \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "DEBUG"}'
```

Node example using an in-process admin endpoint:
```typescript
app.post('/admin/log-level', (req, res) => {
  const level = req.body.level; // validate against an allowlist!
  logger.level = level;         // pino supports runtime level changes
  res.sendStatus(204);
});
```

- Canaries: Raise `DEBUG` on a canary pod only. Verify sample rate and cost before widening.
- Automated PII scans: Periodically query logs for known bad patterns and alert.
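The automated PII scan can start life as a cron job that sweeps downstream stores for known-bad patterns. A naive sketch; the regexes are illustrative rather than exhaustive, and real card detection should add a Luhn check:

```python
import re

# Illustrative patterns only -- extend per your compliance requirements.
PII_PATTERNS = {
    "credit_card": re.compile(r"\b\d{13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def scan_line(line: str):
    """Return the names of PII patterns that match this log line."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(line)]

def scan(lines):
    """Return (line_number, pattern_names) for every line with a hit."""
    return [(i, hits) for i, line in enumerate(lines, 1)
            if (hits := scan_line(line))]
```

Any hit in prod should page the owning team and open a redaction ticket, not just log a warning.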
A 30/60/90 rollout that works
We’ve run this playbook at startups and public companies. Here’s the compressed version.
0–30 days:
- Ratify schema and add it to your engineering standards doc.
- Instrument 2 services end-to-end (web + a critical backend path) with `trace_id` and structured logs.
- Deploy Vector/OTel Collector and route to a hot store (Loki or ES). Turn on basic redaction.
- Build 5 saved queries and one “from alert to logs” runbook.
31–60 days:
- Expand to top 10 services. Enforce CI schema checks and lint rules.
- Add tiered retention and budget alarms. Tune sampling for verbose domains.
- Integrate APM-to-logs deep links by `trace_id`.
- Table-top an incident using only the new flows. Track MTTR.
61–90 days:
- Migrate all services. Lock legacy sinks.
- Add runtime log-level controls + canary debug.
- Formalize ownership: who reviews logging for new features? Add to PR template.
- Publish a quarterly “log quality” report: coverage, redaction incidents, query latency.
Success criteria: MTTR down 40%+, paging reduced, on-call sentiment up, logs bill stable or lower despite growth.
If any of this feels familiar and you don’t have cycles to do the retrofit, GitPlumbers has done this surgery in Kubernetes, ECS, and bare metal shops using Vector, Fluent Bit, OTel, Loki, and Elastic. We know where the bodies are buried and how to keep the bills from spiking while you fix it.
Key takeaways
- Logs must be structured, correlated to traces, and consistent across services — or they’re just noise.
- Define success with MTTR, error-budget burn visibility, and query latency — not “log volume.”
- Standardize a schema and enforce it in CI; don’t rely on devs to remember fields in crunch time.
- Use a transformation layer (Vector/OTel Collector) for redaction, routing, and sampling to keep costs sane.
- Pre-bake queries and runbooks; on-call shouldn’t craft KQL at 2 a.m.
- Control verbosity at runtime and treat logging as a product with owners and SLAs.
Implementation checklist
- Agree on a minimal cross-service log schema with `trace_id` and `span_id`.
- Instrument 1 critical request path per service with structured logs and trace correlation.
- Deploy a log router (Vector or OTel Collector) for redaction, routing, and sampling.
- Set tiered retention (hot/warm/cold) and budget alarms on ingest and storage.
- Create saved queries for top 5 incident classes and link them in runbooks.
- Add CI checks that fail on non-JSON logs or schema drift.
- Expose runtime toggles to raise/lower log levels safely per service.
Questions we hear from teams
- How do I add `trace_id` if I’m not using distributed tracing yet?
- Start by generating a `request_id` at your ingress (NGINX/Envoy) and propagate it via headers. Include it in logs across services. When you add OpenTelemetry later, map `request_id` to `trace_id` during transformation so your saved queries keep working.
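In-process, the same idea is a tiny middleware that reuses the ingress-assigned `X-Request-Id` (or mints one at the first hop) so every log in the request’s scope carries it. A framework-agnostic WSGI sketch, names ours:

```python
import uuid

def ensure_request_id(headers: dict) -> str:
    """Reuse the ingress-assigned request id, or mint one at the first hop."""
    return headers.get("X-Request-Id") or uuid.uuid4().hex

class RequestIdMiddleware:
    """WSGI middleware: stash request_id in environ and echo it downstream."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        rid = environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex
        environ["request_id"] = rid  # handlers include this in every log line

        def start_with_rid(status, response_headers, exc_info=None):
            # Echo the id so callers and downstream hops can correlate.
            response_headers.append(("X-Request-Id", rid))
            return start_response(status, response_headers, exc_info)

        return self.app(environ, start_with_rid)
```

The same pattern ports directly to Express or Spring interceptors; the invariant is one id per request, propagated via headers and present in every log line.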
- Loki or Elasticsearch?
- If your primary use is K8s service logs with high volume and you can live with label-based filtering, Loki is cheaper and simpler. If you need ad-hoc field queries across arbitrary JSON and heavy aggregations, Elasticsearch/OpenSearch wins. Many shops run Loki for app logs and Elastic for security/audit.
- How do I keep costs from exploding?
- Drop DEBUG in prod at the router, sample high-frequency info logs (1–10%), enforce retention tiers (hot 48h, warm 7–14d, cold 30–90d), and watch cardinality (no per-user labels). Set budget alerts on ingest and storage, and review them weekly.
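For the info-log sampling, prefer a deterministic hash on `trace_id` over random sampling, so you keep or drop entire traces together across services. A sketch (the 5% default is an example rate):

```python
import hashlib

def keep(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically keep ~rate of traces: same trace_id, same decision,
    so sampled traces stay complete across services."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

def sample(records, rate=0.05):
    """Keep error logs unconditionally; sample info logs by trace."""
    for rec in records:
        if rec.get("level") in ("error", "fatal") or keep(rec.get("trace_id", ""), rate):
            yield rec
```

The same hash-and-threshold logic drops into a Vector or OTel Collector transform, so the decision is made once at the router instead of per service.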
- What about PII and compliance?
- Redact at the edge (Vector/OTel). Maintain a denylist of fields (`password`, `ssn`, `card_number`) and run automated scans on downstream stores. Treat logging configs as code with security review. For regulated data, route to a separate, access-controlled index/tenant.
- Do I need OpenTelemetry to get value?
- No, but it multiplies the value. Even without OTel, structured logs with a propagated `request_id` and consistent schema will cut MTTR. Adding OTel later lets you pivot from trace to logs instantly, which is where the real speedup happens.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
