Tracing That Survives Prod: A Pragmatic Playbook for Microservices (OpenTelemetry, Meshes, and Messy Reality)
If your incident channel still asks “which service is slow?”, you don’t have tracing—you have vibes. Here’s the step-by-step I use to get distributed tracing working across real microservices without wrecking SLOs or your budget.
Tracing isn’t a feature. It’s a contract between every hop in your system. Break the contract at one boundary and you’re back to vibes and guesswork.
The outage you’ve lived through
It’s 12:41 AM. Checkout is timing out. Grafana shows a red wall. Half the team is grepping logs, the other half is clicking random Jaeger traces that end at the gateway. You ship microservices, but your traces die at hop two and async hops are a black hole. I’ve seen this movie at unicorns and at 20-year-old enterprises. Tracing fails not because tools are bad, but because propagation is partial, sampling is naive, and nobody owns the collector.
This is the playbook we use at GitPlumbers to make tracing boring, reliable, and cheap enough to keep. It’s opinionated because the alternatives cost you MTTR and money.
Decide your standards and topology (don’t skip this)
Before you touch code, decide the rules of the road. If you let each team “pick what works for us,” you’ll be back here in six months.
- Propagation: Use W3C `traceparent`/`tracestate` end-to-end. Only translate B3 at legacy edges. Ban custom headers (a Node sketch for pinning the propagator follows below).
- Context semantics: Standardize `service.name`, `service.version`, and `deployment.environment` via OpenTelemetry `Resource` attributes.
- Backend: Pick one primary trace backend: Jaeger, Grafana Tempo, Honeycomb, or Datadog. All can ingest OTLP. Avoid dual-writes from apps; fan out in the OpenTelemetry Collector.
- Topology: Run an `otel-collector` as a DaemonSet (per-node) or per-namespace. Apps export to the local collector. Collectors forward to backends and apply sampling/policy. Mesh/ingress export to the same collector.
Rule: apps and proxies speak OTLP to a local collector. The collector speaks whatever to your vendors.
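Most current SDKs default to W3C `tracecontext` plus `baggage`, but pinning it explicitly removes any doubt at code review. A minimal Node sketch:

// propagation.js – force W3C tracecontext + baggage everywhere
const { propagation } = require('@opentelemetry/api');
const {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} = require('@opentelemetry/core');

propagation.setGlobalPropagator(new CompositePropagator({
  propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
}));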
Checkpoints
- A one-pager in your repo: propagation = W3C, backend = X, collector address = `otel-collector:4317`, required resource attributes listed.
- A “no B3 past the edge” admission policy if you’re on Kubernetes.
- A runbook for changing sampling without redeploying apps.
Instrument the edges first, then services, then async
Tracing dies at boundaries. Instrument those first or your beautiful service spans won’t correlate.
Ingress/gateway
- NGINX Ingress: enable tracing with an OpenTelemetry module or sidecar exporter. For vanilla ingress-nginx, forward headers and let app SDKs create spans; or front with an Envoy gateway.
- Istio (1.18+): use Telemetry API to send spans to the collector.
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-tracing
  namespace: istio-system
spec:
  selector: {}
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
# The "otel" provider itself is declared once in MeshConfig extensionProviders:
#   extensionProviders:
#     - name: otel
#       opentelemetry:
#         service: otel-collector.otel.svc.cluster.local
#         port: 4317
- Checkpoint: Incoming requests show a root span at the gateway with a `trace_id` you can follow into at least one downstream service.
Service auto-instrumentation
Node.js (Express)
// tracing.js
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes: S } = require('@opentelemetry/semantic-conventions');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const provider = new NodeTracerProvider({
  resource: new Resource({
    [S.SERVICE_NAME]: 'checkout',
    [S.SERVICE_VERSION]: process.env.VERSION || '1.2.3',
    'deployment.environment': process.env.ENV || 'prod',
  }),
});

provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({ url: 'http://otel-collector:4317' })
));
provider.register();

registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});
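One gotcha: this file has to load before `http`/`express` are first required, or the auto-instrumentation patches nothing. Assuming `tracing.js` sits at your project root and `server.js` is your entrypoint, Node’s preload flag is the simplest way:

# preload the tracer so http/express are patched before the app loads them
node --require ./tracing.js server.js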
Java (Spring Boot 3+)
// build.gradle
implementation platform('io.opentelemetry:opentelemetry-bom:1.36.0')
runtimeOnly 'io.opentelemetry.javaagent:opentelemetry-javaagent:1.36.0'
# JVM flags
-javaagent:/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=payments \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.propagators=tracecontext,baggage
Go (gRPC client/server): use `go.opentelemetry.io/otel` and add unary interceptors for client and server.
- Checkpoint: For one request, you see a trace spanning gateway → service A → service B, with latency and status in each span.
Async boundaries (Kafka/RabbitMQ/SQS)
Inject/extract context in message headers. With OpenTelemetry JS:
const { context, propagation } = require('@opentelemetry/api');

// producer: inject the active context into the message headers
const headers = {};
propagation.inject(context.active(), headers);
producer.send({ topic: 'orders', messages: [{ key: id, value: body, headers }] });

// consumer: extract the context and run the handler inside it
const ctx = propagation.extract(context.active(), message.headers);
context.with(ctx, () => handleMessage(message));
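If your consumer framework doesn’t start a span on dequeue, wrap the handler in an explicit CONSUMER span so the hop shows up even when `handleMessage` creates no spans of its own. A minimal sketch that builds on the snippet above (same `context`/`propagation` imports):

const { trace, SpanKind } = require('@opentelemetry/api');
const tracer = trace.getTracer('orders-consumer');

function onMessage(message) {
  // continue the trace carried in the message headers
  const parentCtx = propagation.extract(context.active(), message.headers);
  const span = tracer.startSpan('orders process', { kind: SpanKind.CONSUMER }, parentCtx);
  context.with(trace.setSpan(parentCtx, span), () => {
    try {
      handleMessage(message);
    } finally {
      span.end();
    }
  });
}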
For Java, use `io.opentelemetry.instrumentation:kafka-clients` and ensure header propagation is enabled.
- Checkpoint: A trace shows producer → broker (as annotation) → consumer spans, not two disjoint traces.
Downstream calls (DB/HTTP)
- Enable client instrumentation for `pg`, `mysql2`, `redis`, `http`, `aws-sdk`, etc. For Spring, it’s auto with the Java agent.
- Checkpoint: Spans include `db.system`, `db.statement` (sanitized), `http.url`, and `http.status_code` attributes.
Error capture
- Record exceptions and set `status.error` on spans. Ensure your frameworks aren’t swallowing exceptions before the tracer sees them (see the sketch below).
- Checkpoint: Error traces highlight red spans; sampling boosts error traces (see next section).
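If a framework hook isn’t already doing this, here’s a minimal Node sketch; `chargeCard` is a hypothetical downstream call, and the tracer comes from the tracing.js setup above:

const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('checkout');

async function charge(order) {
  return tracer.startActiveSpan('charge', async (span) => {
    try {
      return await chargeCard(order); // hypothetical payment call
    } catch (err) {
      span.recordException(err); // attaches an exception event to the span
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err; // rethrow so the framework still sees the failure
    } finally {
      span.end();
    }
  });
}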
Wire the collector and the mesh (your policy brain)
The OpenTelemetry Collector is where you do smart routing, sampling, and vendor fan-out. Treat it like an SRE-managed system with config in Git.
- Kubernetes deploy: run per-node for resilience.
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: otel
data:
otelcol.yaml: |
receivers:
otlp:
protocols:
grpc:
http:
processors:
batch:
attributes/redact:
actions:
- key: user.email
action: delete
tail_sampling:
decision_wait: 5s
policies:
- type: status_code
status_code:
status_codes: [ERROR]
- type: latency
latency:
threshold_ms: 500
- type: probabilistic
probabilistic:
sampling_percentage: 10
exporters:
otlp/honeycomb:
endpoint: api.honeycomb.io:443
headers: { 'x-honeycomb-team': '${HONEYCOMB_KEY}' }
jaeger:
endpoint: jaeger-collector.jaeger:14250
tls: { insecure: false }
logging:
loglevel: warn
service:
pipelines:
traces:
receivers: [otlp]
processors: [attributes/redact, tail_sampling, batch]
exporters: [jaeger, otlp/honeycomb]
- Istio/Envoy: confirm `traceparent` is preserved and sampling at the mesh aligns with collector policies. If the mesh does 10% head sampling and the collector tail-samples, you’ll only ever see 10%.
- Ingress-NGINX: ensure `proxy_set_header traceparent $http_traceparent;` and disable any header rewrites.
- Checkpoint: A load test shows stable ingest at the collector with backpressure and retries; no app writes directly to vendors.
Make traces useful: correlate logs and metrics, not just pretty graphs
You’re not buying wall art. Traces must connect to logs and metrics.
Log correlation
Ensure every log event includes `trace_id` and `span_id`. Example (JSON logs):
{"level":"error","msg":"charge failed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7","order_id":"123"}
For the Java agent: `-Dotel.instrumentation.log-logging.enabled=true`. For Node, add a pino/winston hook to pull IDs from `@opentelemetry/api`; a sketch follows.
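A minimal pino sketch of that hook (winston can do the same via a custom format): the mixin runs on every log call and stamps the active trace/span IDs so log search can join on `trace_id`.

// logger.js
const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

module.exports = logger;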
Prometheus exemplars
Expose histograms with exemplars so you can click from a p99 latency spike to a representative trace in Grafana/Tempo.
Go example:
// httpRequestDuration is assumed to be a prometheus.HistogramVec; the assertion
// to prometheus.ExemplarObserver is what exposes ObserveWithExemplar.
httpRequestDuration.WithLabelValues("/checkout").(prometheus.ExemplarObserver).ObserveWithExemplar(
    1.2,
    prometheus.Labels{
        "trace_id": trace.SpanFromContext(ctx).SpanContext().TraceID().String(),
    },
)
Dashboards that matter
- RED for each service: Rate, Errors, Duration, with trace-exemplar links.
- A “Hot Paths” dashboard: top endpoints by latency contribution; click-through to traces.
- Checkpoint: From any SLO burn alert, you can pivot to a trace in ≤2 clicks, see the slow hop, and identify the on-call owner.
Sampling, storage, and cost control (the part finance will love you for)
I’ve watched teams nuke budgets by tracing 100% of traffic. Don’t. Be deliberate.
- Start with head sampling at the edge: 5–10% is plenty for healthy traffic. Errors should be 100%.
- Add tail-based sampling in the collector: keep slow or error traces regardless of the head decision. Use policies like status code, latency, or attribute match (e.g., `customer.tier = enterprise`).
- Dynamic sampling: elevate sampling on active incidents or for canary releases.
- Retention: 7–14 days hot, 30–90 days cold (Tempo object storage is cheap). Jaeger + Elasticsearch will bite you; prefer Tempo/Cortex/Thanos-style backends for scale.
- Cardinality discipline: avoid unbounded `span.setAttribute` with IDs. Cap baggage to <1KB total. Drop PII at the collector.
- Overhead targets: tracing adds <5ms p95 per request, <2% CPU overhead. Measure it in load tests.
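On the SDK side, head sampling is one constructor argument. A minimal Node sketch for sampling 10% of new roots; parent decisions are respected, and the collector’s tail_sampling policies above still rescue errors and slow traces:

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    // honor the caller's sampling decision; sample 10% of new root traces
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});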
Checkpoint metrics
- Trace coverage: >90% of requests generate a trace (before sampling), measured by requests vs spans created.
- Propagation success: >99% of traces have at least 2 services linked (not orphaned spans).
- Error capture: >95% of 5xx requests have an error span.
- Cost: storage/ingest per request within budget (track $/K requests).
A 30‑day rollout plan with proof points
Week 1 – Standards and collector
- Lock propagation (W3C), pick a backend, deploy `otel-collector` with config in Git (GitOps via `ArgoCD`).
- Instrument ingress/gateway; verify root spans. Set 10% head sampling.
- Prove: 1 service and gateway appear in one trace under load; collector handles 2x peak QPS.
Week 2 – Core services
- Add auto-instrumentation to top 3 traffic services (Java agent, Node SDKs, Go interceptors).
- Enable DB/HTTP client spans; scrub SQL text at collector if needed.
- Prove: gateway → svc A → svc B traces with attributes and ≥99% propagation success.
Week 3 – Async and correlation
- Instrument Kafka/RabbitMQ producers/consumers for header propagation; backfill libraries if missing.
- Add log correlation and exemplars.
- Prove: a trace travels across a message bus; Grafana panel links to a trace from p99.
Week 4 – Sampling polish and guardrails
- Turn on tail-based sampling for errors and >500ms latency, reduce head sampling to 5%.
- Add chaos tests: drop `traceparent` at random at the gateway; ensure SDKs create new roots but still sample errors.
- Prove: MTTR drills show that, on injected 500s, on-call can isolate the slow hop in <10 minutes.
Exit criteria
- Coverage >90%, propagation >99%, overhead <5ms p95, error traces sampled at ~100%.
- Dashboards exist for RED + hot paths; runbook documented. Budgets approved with measured ingest/storage.
What breaks in the wild (and how we’ve fixed it)
- Mesh vs SDK sampling fights: Istio sampled 1%, the collector tail-sampled for errors, and healthy error traces were lost. Fix: raise mesh sampling or move all sampling decisions to the collector; better, keep the mesh at ≥10% when tail sampling is on.
- Async orphans: Kafka libs not carrying headers across frameworks (seen in older Spring Cloud Stream). Fix: standardize on one OpenTelemetry instrumentation version and add a thin wrapper to inject/extract; add CI tests that assert header presence.
- Hot attribute cardinality: teams added `user_id` as a span attribute and Tempo storage exploded. Fix: hash IDs if needed, otherwise put them in logs, not spans. Enforce via the collector’s attribute processor.
- PII in spans: an ORM auto-captured SQL with literals. Fix: parameterized queries, plus a collector `attributes/delete` list for known fields.
- Vendor lock whiplash: app SDKs exported to Datadog directly; switching to Honeycomb was a month-long migration. Fix: all apps export OTLP to the collector; the collector fans out.
- CI regressions: a refactor dropped context in an async callback chain. Fix: unit tests that assert `trace_id` continuity, and chaos tests that drop headers and verify graceful re-rooting (a sketch follows below).
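Here’s a minimal Jest-style sketch of that guardrail; `buildMessage` is a hypothetical stand-in for whatever producer wrapper your services use, and the assertions are the point: every produced message must carry a valid `traceparent` matching the active trace.

// test/propagation.test.js
const { context, trace, propagation } = require('@opentelemetry/api');
const { W3CTraceContextPropagator } = require('@opentelemetry/core');
const { BasicTracerProvider } = require('@opentelemetry/sdk-trace-base');

propagation.setGlobalPropagator(new W3CTraceContextPropagator());
const tracer = new BasicTracerProvider().getTracer('propagation-test');

// hypothetical producer wrapper: every outgoing message gets the active context
function buildMessage(key, value) {
  const headers = {};
  propagation.inject(context.active(), headers);
  return { key, value, headers };
}

test('produced messages carry a traceparent header', () => {
  const span = tracer.startSpan('produce-order');
  context.with(trace.setSpan(context.active(), span), () => {
    const msg = buildMessage('42', '{}');
    expect(msg.headers.traceparent).toMatch(/^00-[0-9a-f]{32}-[0-9a-f]{16}-0[01]$/);
    expect(msg.headers.traceparent).toContain(span.spanContext().traceId);
  });
  span.end();
});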
If any of this sounds familiar, you’re not alone. Tracing isn’t a feature; it’s a contract. Treat it that way.
Quick reference: what “good” looks like
- Standards: W3C `traceparent`, OTLP to the collector, resource attributes set.
- SDKs: Java agent 1.36+, Node OTel SDK 0.50+, Go OTel 1.22+.
- Backends: Tempo for scale/cost, Jaeger for UI/primitives, Honeycomb for high-cardinality analysis, Datadog for integrated infra.
- Policies: Head 5–10% + tail for errors/slow, redact PII in collector, drop high-cardinality attributes.
- Dashboards: RED + exemplars, Top N slow endpoints, Error heatmap by service.
- SLOs: trace coverage, propagation success, overhead, cost per request.
- Ownership: SRE owns collector and sampling. App teams own SDKs and context at boundaries.
When you hit those, you’ll stop asking “which service is slow?” and start fixing it in minutes.
Key takeaways
- Pick one propagation standard and stick to it across HTTP, gRPC, and async (W3C tracecontext).
- Instrument the edges first (ingress, gateway), then services, then async, then DB/HTTP clients—verify propagation at each hop.
- Use OpenTelemetry Collector as the choke point for routing, sampling, and vendor fan-out.
- Correlate logs and metrics with traces using trace_id/span_id and exemplars—otherwise you’re paying for pretty screenshots.
- Start with conservative head sampling at the edge, add tail sampling for high-value traffic. Control costs with dynamic policies.
- Set explicit success metrics: >90% trace coverage, >99% propagation success, <5ms overhead at p95, <1% span error rate.
- Bake tracing into CI/CD and chaos tests so header propagation never regresses.
Implementation checklist
- Standardize on `traceparent`/`tracestate` and reject B3 at the boundary.
- Install OpenTelemetry SDKs with auto-instrumentation per language and set `service.name`, `service.version`, `deployment.environment`.
- Deploy an `otel-collector` per node/namespace; send to Jaeger/Tempo/Honeycomb/Datadog as needed.
- Configure mesh/ingress (Istio/Envoy/NGINX) to preserve and emit tracing headers.
- Instrument async (Kafka/RabbitMQ) to inject/extract trace context in message headers.
- Correlate logs with `trace_id` and `span_id`; wire exemplars from Prometheus histograms to traces.
- Apply sampling policies: head at edge, tail in collector for errors/high latency.
- Set dashboards and SLOs for trace coverage, propagation, and overhead. Add chaos tests for header loss.
Questions we hear from teams
- Should we use OpenTelemetry or just stick with our vendor’s agent?
- Use OpenTelemetry for the application side and standardization. Export OTLP to a local collector and fan out to vendors from there. This avoids lock-in and lets you mix Jaeger/Tempo/Honeycomb/Datadog as needed. If you already run a vendor agent, keep it at the collector tier, not inside every app.
- Head vs tail sampling—what actually works in production?
- Start with 5–10% head sampling at the edge to cap volume. Add tail sampling in the collector to guarantee retention of errors and slow traces. Head-only is cheap but misses rare events. Tail-only is expensive if the edge drops context. The hybrid policy balances both.
- How do we trace across Kafka without rewriting everything?
- Adopt the OpenTelemetry instrumentation for your Kafka client (Java/Node/Go), enable header propagation, and add thin wrappers if your framework doesn’t do it automatically. Test it: assert headers exist on produced messages and are extracted on consume. Add chaos tests that strip headers to ensure re-rooting works.
- Won’t tracing add too much latency?
- With batch processors and OTLP over gRPC to a local collector, overhead should be <5ms at p95 and <2% CPU. Measure in your load tests. If you exceed that, check sync exporters, oversized attributes, and per-span logging. Adjust batch sizes and disable verbose instrumentation you don’t need.
- What if we’re not on Kubernetes?
- Run the collector as a sidecar or a VM service, point SDKs at `localhost:4317`, and manage config with your standard tooling (Ansible, Terraform). Gateways like Envoy or API Gateway can still forward `traceparent`. The principles—edge first, collector in the middle, sampling at the collector—still apply.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.