The Tracing Rollout That Finally Stuck: OpenTelemetry + Collector + Tail Sampling in K8s
You don’t need another tracing pilot. You need traces that answer pager questions in minutes, not hours—without bankrupting your ingest bill.
“You can’t fix what you can’t see. Tracing turns ‘it’s slow’ into ‘it’s this hop, right here, and here’s why.’”
The outage you couldn’t debug (we’ve all been there)
You know the one: checkout starts timing out, Grafana shows a red P95 on the API, but half the logs are missing and every team swears it’s not their service. I’ve watched senior engineers burn weekends grep‑hunting correlation IDs that never propagated past Kafka. The postmortem reads like a Mad Libs of blind spots.
Distributed tracing is supposed to fix this. Most rollouts fail because they start with “instrument everything” and end with a fat ingest bill, low signal, and no habit change. Here’s the rollout that finally stuck for us—OpenTelemetry + Collector + tail sampling—implemented in Kubernetes, wired through HTTP/gRPC and queues, and measured with concrete checkpoints.
What "good" tracing looks like in production
If you can’t measure it, you won’t fix it. Aim for:
- Trace coverage: ≥80% of requests at the edge carry a `traceparent` header and produce at least one trace.
- Trace completeness: Median spans/trace ≥ 8 for your critical flow; ≥90% of traces include all hops (edge → service → DB/cache → downstream → queue → worker).
- Error capture: ≥95% of 5xx requests yield a sampled trace with error status and exception events.
- Latency explainability: Time is accounted for across spans (server, client, DB) without giant “unknown” gaps.
- Cost guardrails: Successful requests sampled to ≤1–5% by default; errors and slow traces always kept.
If your on-call can jump from a red SLO chart to a single bad trace and say “the DB pool is saturated in `orders-db`,” you’re done.
Step 1: Choose a backend and a sampling plan
Pick one you can run and query without vendor tickets.
- Self-hosted: Grafana Tempo (S3/GCS, cheap at scale), Jaeger (mature, straightforward), Zipkin (lean, older).
- SaaS: Honeycomb (excellent UI for high-cardinality), Datadog (batteries-included), AWS X-Ray (AWS native, limited features), New Relic, Dynatrace.
Decide sampling up front:
- Keep: errors, long-latency traces (tail-based), and a small % of normal.
- Budget: Rough math—1k rps × 10 spans/trace × 1 kB/span = 10 MB/s unsampled. You will sample.
- Strategy: Tail-based sampling in the OpenTelemetry Collector so you can decide with context.
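If you want to sanity-check the budget math in code, here is a minimal sketch using the example figures above (1k rps, 10 spans/trace, ~1 kB/span, and a 5% effective keep rate from tail sampling). The `IngestBudget` class and its numbers are illustrative, not a benchmark.

```java
// IngestBudget.java — back-of-envelope ingest math, assuming the example figures above.
public class IngestBudget {

    // MB/s of span data for a given request rate, spans per trace, and span size.
    static double mbPerSecond(double rps, double spansPerTrace, double bytesPerSpan) {
        return rps * spansPerTrace * bytesPerSpan / 1e6;
    }

    public static void main(String[] args) {
        double unsampled = mbPerSecond(1_000, 10, 1_000); // 1k rps, 10 spans, ~1 kB each

        // Tail sampling keeps ~2% of normal traffic plus 100% of errors (~1%)
        // and slow traces (~2%) — an effective keep rate of ~5%.
        double keptFraction = 0.02 + 0.01 + 0.02;
        double sampled = unsampled * keptFraction;

        System.out.printf("unsampled: %.1f MB/s, kept: %.2f MB/s%n", unsampled, sampled);
        // prints: unsampled: 10.0 MB/s, kept: 0.50 MB/s
    }
}
```

That 0.5 MB/s lands comfortably under the ≤1.5 MB/s checkpoint in the rollout plan below.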
Step 2: Standardize context and attributes
Non-negotiables:
- Trace context: Use W3C Trace Context (`traceparent`, `tracestate`) everywhere. Drop B3 unless legacy requires it; if you must support both, configure propagation to read B3 and write W3C.
- Baggage: For business keys like `tenant_id`, keep it short and low-cardinality.
- Resource attributes (consistent keys): `service.name`, `service.version`, `deployment.environment`, `cloud.region`. Set them once in the Collector so every span gets them.
- Span naming: `HTTP GET /orders/{id}`, `db.sql.query SELECT orders`, `kafka.produce orders.created`. Avoid unbounded label values in names.
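To make the propagation contract concrete, it helps to know exactly what a `traceparent` header looks like. Below is a minimal sketch of a validator for the W3C format (`version-traceid-parentid-flags`, lowercase hex, non-zero IDs) — an illustration of the spec's shape, not the OTel SDK's parser.

```java
// TraceparentCheck.java — validate the W3C Trace Context traceparent layout:
// version(2 hex) - trace-id(32 hex, non-zero) - parent-id(16 hex, non-zero) - flags(2 hex)
import java.util.regex.Pattern;

public class TraceparentCheck {
    private static final Pattern SHAPE =
        Pattern.compile("^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$");

    public static boolean isValid(String header) {
        if (header == null || !SHAPE.matcher(header).matches()) return false;
        String[] parts = header.split("-");
        // An all-zero trace-id or parent-id is invalid per the spec.
        return !parts[1].equals("0".repeat(32)) && !parts[2].equals("0".repeat(16));
    }

    public static void main(String[] args) {
        // The canonical example value from the W3C spec:
        System.out.println(isValid("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")); // true
        System.out.println(isValid("00-" + "0".repeat(32) + "-00f067aa0ba902b7-01"));           // false
    }
}
```

A check like this is handy in the propagation integration tests discussed later.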
Step 3: Instrument the edge and your top 5 flows
Start at ingress, then instrument the services that make or break revenue.
Ingress / service mesh
- Envoy/Istio: Enable tracing and forward to the Collector.
# Istio mesh config (values for istio/istiod Helm)
meshConfig:
  defaultConfig:
    tracing:
      sampling: 10.0 # 10% head sample at proxies (still do tail sampling later)
      max_path_tag_length: 120

# Envoy bootstrap snippet
tracing:
  provider:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: otel-collector
      service_name: edge-gateway

Java (Spring Boot) with the Java agent
Use the OpenTelemetry Java agent to avoid touching code.
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
JAVA_TOOL_OPTIONS="-javaagent:/app/opentelemetry-javaagent.jar" \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 \
OTEL_TRACES_EXPORTER=otlp \
OTEL_SERVICE_NAME=orders-api \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod,service.version=1.2.3 \
java -jar app.jar

Add manual spans where business context matters:
// OrderController.java
Span span = Span.current();
span.setAttribute("tenant_id", tenantId);
span.addEvent("cart_validated");

Node.js (Express)
npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces' }),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// server.ts
import './tracing';
import express from 'express';
const app = express();
app.get('/health', (_req, res) => res.send('ok'));
app.listen(3000);

gRPC and async (Kafka)
gRPC is usually covered by auto-instrumentation. Queues often aren’t—propagate headers yourself if needed.
// Kafka producer propagation (OTel Java)
TextMapSetter<Headers> setter = (carrier, key, value) -> carrier.add(new RecordHeader(key, value.getBytes(UTF_8)));
Context ctx = Context.current();
Headers headers = new RecordHeaders();
GlobalOpenTelemetry.getPropagators().getTextMapPropagator().inject(ctx, headers, setter);
producer.send(new ProducerRecord<>("orders.created", null, key, payload, headers));

On the consumer side, extract before you do work.
TextMapGetter<Headers> getter = new TextMapGetter<>() {
  public Iterable<String> keys(Headers c) {
    return StreamSupport.stream(c.spliterator(), false).map(h -> h.key()).collect(Collectors.toList());
  }
public String get(Headers c, String key) {
Header h = c.lastHeader(key);
return h == null ? null : new String(h.value(), UTF_8);
}
};
Context parent = GlobalOpenTelemetry.getPropagators().getTextMapPropagator().extract(Context.current(), record.headers(), getter);
Span span = tracer.spanBuilder("orders.worker.process").setParent(parent).startSpan();

Step 4: Put an OpenTelemetry Collector in the path
This is where the signal gets good and the cost gets sane.
- Receive OTLP from proxies and SDKs.
- Add resource attributes, scrub PII, and batch.
- Tail-sample to keep errors and slow traces.
- Export to Tempo/Jaeger/SaaS.
# otel-collector.yaml
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
processors:
  batch: {}
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete
  resource:
    attributes:
      - key: deployment.environment
        action: upsert
        value: prod
      - key: cloud.region
        action: upsert
        value: us-east-1
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_requests
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline_traffic
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }
exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  logging: {}
extensions:
  health_check: {}
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub, resource, tail_sampling, batch]
      exporters: [otlphttp/tempo, logging]

Deploy via Helm or GitOps alongside your mesh/gateway. Watch Collector self-metrics (`otelcol_*`) to validate flow. Note that tail_sampling runs before batch in the pipeline: batching must not sit between the receiver and the sampling decision.
kubectl -n observability port-forward svc/otel-collector 8888:8888 &
curl -s localhost:8888/metrics | grep otelcol_receiver_accepted_spans

Step 5: Wire traces into SLO dashboards with exemplars
Traces are only useful if they shorten MTTR.
- Prometheus + Grafana: Enable exemplars so red charts link to a representative trace.
- SLOs: For `http_request_duration_seconds`, show P95 with exemplar trace IDs.
- Alerting: Include a Tempo/Jaeger link template with `{traceID}` in alerts.
Example panel query notes:
- Prometheus supports exemplar storage since v2.26 (behind `--enable-feature=exemplar-storage`); OTel SDKs attach exemplars automatically when configured.
- In Grafana, set Tempo as a data source; “Query with exemplars” emits trace links on your latency graphs.
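Under the hood, an exemplar is just one representative trace ID stored alongside a histogram bucket, which is what lets a dashboard link a red latency chart to a concrete trace. Here is a toy sketch of that mechanism (the `ExemplarSketch` class is hypothetical; real Prometheus/OTel SDKs manage this for you):

```java
// ExemplarSketch.java — keep the latest trace ID seen per latency bucket,
// mimicking what exemplar-aware histograms do under the hood.
public class ExemplarSketch {
    private final double[] bucketsMs = {100, 250, 500, 1000}; // upper bounds; last slot is +Inf
    private final long[] counts = new long[bucketsMs.length + 1];
    private final String[] exemplars = new String[bucketsMs.length + 1];

    public void observe(double latencyMs, String traceId) {
        int i = 0;
        while (i < bucketsMs.length && latencyMs > bucketsMs[i]) i++;
        counts[i]++;
        exemplars[i] = traceId; // remember one representative trace for this bucket
    }

    // The trace ID a dashboard would link from the slowest non-empty bucket.
    public String exemplarForSlowest() {
        for (int i = counts.length - 1; i >= 0; i--)
            if (counts[i] > 0) return exemplars[i];
        return null;
    }

    public static void main(String[] args) {
        ExemplarSketch h = new ExemplarSketch();
        h.observe(42, "trace-aaa");
        h.observe(730, "trace-bbb"); // lands in the ≤1000 ms bucket
        System.out.println(h.exemplarForSlowest()); // prints: trace-bbb
    }
}
```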
Rollout plan with checkpoints and metrics
You can ship this in two sprints without boiling the ocean.
Sprint 1 — Edge + two services
- Gateways/mesh emit traces to Collector.
- Auto-instrument `api-gateway` and `orders-api`.
- Kafka propagation for `orders.created`.
- Tail sampling in place.
- Checkpoints:
  - ≥80% of edge requests include `traceparent` (measure via access logs).
  - ≥90% of 5xx from `orders-api` have a kept trace.
  - Ingest budget ≤ 1.5 MB/s after sampling.
Sprint 2 — Complete the critical path
- Instrument `payments`, `inventory`, and the `orders-worker`.
- Add DB spans and key business attributes (`tenant_id`, `order_id`).
- Wire the Grafana SLO dashboard with exemplars.
- Checkpoints:
  - Median spans/trace for checkout ≥ 8.
  - 95th percentile of time accounted for across spans (no >20% “gap time”).
  - On-call runbook links from alerts to Tempo/Jaeger queries.
Harden and govern
- Attribute dictionary, span naming guide, and banned keys (PII).
- CI check: reject PRs that introduce high-cardinality labels (regex budget).
- Backpressure: rate-limit exporters in the Collector; alert on `otelcol_exporter_send_failed_spans`.
Common traps and how to fix them
- Dropped context at async boundaries: Verify Kafka/SQS propagation with integration tests that assert the presence of `traceparent` headers. I’ve seen this be 80% of “tracing doesn’t work” bugs.
- High-cardinality explosions: Do not put user emails, UUIDs, or raw SQL in span names or attributes. Allowlist attributes in the Collector and scrub the rest.
- Auto-instrumentation only: It’s a start, but without business attributes (`tenant_id`, `order_value`), traces won’t answer pager questions. Add 2–3 manual spans where it matters.
- Sampling in the SDK: Keep sampling decisions centralized in the Collector; SDK head-sampling can discard the very traces you need.
- Polyglot edge cases: Node.js async context and Python asyncio can lose context under load. Use the latest OTel libs and run a load test that specifically checks trace continuity.
- Mesh-only tracing: Service-to-service calls will show up, but DB/queue spans will be missing. You still need SDKs for internals.
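The first trap — dropped context at async boundaries — is cheap to guard against in CI. Here is a hedged sketch of such a test, using a plain `Map` as a stand-in for Kafka record headers so it runs without a broker; the `inject` helper is a fake, and a real test would call the OTel `TextMapPropagator` against your producer's `Headers` carrier as shown in Step 3.

```java
// PropagationTest.java — assert that context injection writes a traceparent
// that the consumer side can read back. The Map stands in for Kafka headers.
import java.util.HashMap;
import java.util.Map;

public class PropagationTest {
    // Fake producer-side write; a real test would call
    // propagator.inject(Context.current(), headers, setter) instead.
    static void inject(Map<String, String> carrier, String traceId, String spanId) {
        carrier.put("traceparent", String.format("00-%s-%s-01", traceId, spanId));
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");

        // The assertions your integration test should make on the consumed record:
        String tp = headers.get("traceparent");
        if (tp == null)
            throw new AssertionError("traceparent header missing after produce");
        if (!tp.matches("^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$"))
            throw new AssertionError("traceparent is malformed: " + tp);
        System.out.println("propagation ok: " + tp);
    }
}
```

Run the same assertions against every queue hop on your critical path; a missing header here is exactly the broken-trace bug that surfaces later as “tracing doesn’t work.”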
Proving it worked (and didn’t melt your bill)
- MTTR for checkout incidents down from 90m to <20m.
- Error trace capture ≥ 95% with automatic tail sampling.
- Ingest: 2% baseline sample on success, 100% on errors/slow → 6–10× lower cost vs. all-in.
- On-call: Single-click from P95 chart to bad trace with root cause (DB pool or downstream 429s).
If you want a sanity check before you roll, we’ve done this at scale across Istio, NGINX, Linkerd, Spring Boot, Node, Python, gRPC, Kafka, SQS, and yes—legacy monoliths that still run the business. GitPlumbers can blueprint your rollout and pair with your team so it sticks this time.
Key takeaways
- Instrument the edge first, then the top 5 critical flows—don’t boil the ocean.
- Use W3C Trace-Context end-to-end and test propagation across HTTP, gRPC, and your queues.
- Put an OpenTelemetry Collector in the path with tail-based sampling and PII scrubbing.
- Standardize attributes and span names early to avoid a cardinality explosion later.
- Wire traces into SLO dashboards with exemplars so on-call can jump from a red chart to a bad trace in one click.
Implementation checklist
- Pick a backend (Tempo, Jaeger, Honeycomb, Datadog, X-Ray) and a sampling budget.
- Enable tracing at the gateway/ingress and mesh sidecars.
- Auto-instrument top 5 services; add manual spans only where business context is needed.
- Propagate context across async (Kafka/SQS) and batch jobs.
- Deploy an OTel Collector with tail_sampling, resource detection, and attribute scrubbing.
- Define a trace schema: required attributes, naming, and error tagging.
- Add an SLO dashboard with exemplar links to traces.
- Track adoption metrics: trace coverage, completeness, error capture rate, ingest volume.
Questions we hear from teams
- Should we start with logs, metrics, or traces?
- If you’re firefighting latency and timeouts, start with traces at the edge plus your top 5 flows. Metrics are your SLO backbone, logs tell stories, but traces explain where time went. Then stitch them together with exemplars and log links.
- Is vendor X better than self-hosted?
- Depends on your team. If you have K8s + S3 and Grafana already, Tempo is cost-effective and scalable. If you need strong UI and event-based analysis, Honeycomb shines. Datadog is convenient if you’re already in the ecosystem. The key is tail-based sampling and consistent propagation—those matter more than the logo.
- Can we avoid code changes entirely?
- You can go far with auto-instrumentation and mesh, but you’ll miss DB calls, queue hops, and business context. Plan for small, targeted code changes (2–3 manual spans and attributes) in critical services.
- How do we keep costs sane?
- Tail-based sampling (keep errors and slow, sample 1–5% of normal), batch processors in the Collector, and attribute scrubbing. Monitor `otelcol_exporter_sent_spans` and ingestion volume; set budgets per env (`deployment.environment`).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
