The Tracing Rollout That Finally Stuck: OpenTelemetry + Collector + Tail Sampling in K8s
You don’t need another tracing pilot. You need traces that answer pager questions in minutes, not hours—without bankrupting your ingest bill.
“You can’t fix what you can’t see. Tracing turns ‘it’s slow’ into ‘it’s this hop, right here, and here’s why.’”
The outage you couldn’t debug (we’ve all been there)
You know the one: checkout starts timing out, Grafana shows a red P95 on the API, but half the logs are missing and every team swears it’s not their service. I’ve watched senior engineers burn weekends grep‑hunting correlation IDs that never propagated past Kafka. The postmortem reads like a Mad Libs of blind spots.
Distributed tracing is supposed to fix this. Most rollouts fail because they start with “instrument everything” and end with a fat ingest bill, low signal, and no habit change. Here’s the rollout that finally stuck for us—OpenTelemetry + Collector + tail sampling—implemented in Kubernetes, wired through HTTP/gRPC and queues, and measured with concrete checkpoints.
What "good" tracing looks like in production
If you can’t measure it, you won’t fix it. Aim for:
- Trace coverage: ≥80% of requests at the edge carry a `traceparent` header and produce at least one trace.
- Trace completeness: Median spans/trace ≥ 8 for your critical flow; ≥90% of traces include all hops (edge → service → DB/cache → downstream → queue → worker).
- Error capture: ≥95% of 5xx requests yield a sampled trace with error status and exception events.
- Latency explainability: Time is accounted for across spans (server, client, DB) without giant “unknown” gaps.
- Cost guardrails: Successful requests sampled to ≤1–5% by default; errors and slow traces always kept.
If your on-call can jump from a red SLO chart to a single bad trace and say “the DB pool is saturated in `orders-db`,” you’re done.
Step 1: Choose a backend and a sampling plan
Pick one you can run and query without vendor tickets.
- Self-hosted: Grafana Tempo (S3/GCS, cheap at scale), Jaeger (mature, straightforward), Zipkin (lean, older).
- SaaS: Honeycomb (excellent UI for high-cardinality), Datadog (batteries-included), AWS X-Ray (AWS native, limited features), New Relic, Dynatrace.
Decide sampling up front:
- Keep: errors, long-latency traces (tail-based), and a small % of normal.
- Budget: Rough math—1k rps × 10 spans/trace × 1 kB/span = 10 MB/s unsampled. You will sample.
- Strategy: Tail-based sampling in the OpenTelemetry Collector so you can decide with context.
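If you want to sanity-check the budget math in code, here is a minimal sketch using the example figures above (1k rps, 10 spans/trace, ~1 kB/span, and a 5% effective keep rate from tail sampling). The `IngestBudget` class and its numbers are illustrative, not a benchmark.

```java
// IngestBudget.java — back-of-envelope ingest math, assuming the example figures above.
public class IngestBudget {

    // MB/s of span data for a given request rate, spans per trace, and span size.
    static double mbPerSecond(double rps, double spansPerTrace, double bytesPerSpan) {
        return rps * spansPerTrace * bytesPerSpan / 1e6;
    }

    public static void main(String[] args) {
        double unsampled = mbPerSecond(1_000, 10, 1_000); // 1k rps, 10 spans, ~1 kB each

        // Tail sampling keeps ~2% of normal traffic plus 100% of errors (~1%)
        // and slow traces (~2%) — an effective keep rate of ~5%.
        double keptFraction = 0.02 + 0.01 + 0.02;
        double sampled = unsampled * keptFraction;

        System.out.printf("unsampled: %.1f MB/s, kept: %.2f MB/s%n", unsampled, sampled);
        // prints: unsampled: 10.0 MB/s, kept: 0.50 MB/s
    }
}
```

That 0.5 MB/s lands comfortably under the ≤1.5 MB/s checkpoint in the rollout plan below.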
Step 2: Standardize context and attributes
Non-negotiables:
- Trace context: Use W3C Trace Context (`traceparent`, `tracestate`) everywhere. Drop B3 unless legacy requires it; if you must support both, configure propagation to read B3 and write W3C.
- Baggage: For business keys like `tenant_id`, keep it short and low-cardinality.
- Resource attributes (consistent keys): `service.name`, `service.version`, `deployment.environment`, `cloud.region`. Set them once in the Collector so every span gets them.
- Span naming: `HTTP GET /orders/{id}`, `db.sql.query SELECT orders`, `kafka.produce orders.created`. Avoid unbounded label values in names.
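To make the propagation contract concrete, it helps to know exactly what a `traceparent` header looks like. Below is a minimal sketch of a validator for the W3C format (`version-traceid-parentid-flags`, lowercase hex, non-zero IDs) — an illustration of the spec's shape, not the OTel SDK's parser.

```java
// TraceparentCheck.java — validate the W3C Trace Context traceparent layout:
// version(2 hex) - trace-id(32 hex, non-zero) - parent-id(16 hex, non-zero) - flags(2 hex)
import java.util.regex.Pattern;

public class TraceparentCheck {
    private static final Pattern SHAPE =
        Pattern.compile("^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$");

    public static boolean isValid(String header) {
        if (header == null || !SHAPE.matcher(header).matches()) return false;
        String[] parts = header.split("-");
        // An all-zero trace-id or parent-id is invalid per the spec.
        return !parts[1].equals("0".repeat(32)) && !parts[2].equals("0".repeat(16));
    }

    public static void main(String[] args) {
        // The canonical example value from the W3C spec:
        System.out.println(isValid("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")); // true
        System.out.println(isValid("00-" + "0".repeat(32) + "-00f067aa0ba902b7-01"));           // false
    }
}
```

A check like this is handy in the propagation integration tests discussed later.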
Step 3: Instrument the edge and your top 5 flows
Start at ingress, then instrument the services that make or break revenue.
Ingress / service mesh
- Envoy/Istio: Enable tracing and forward to the Collector.
# Istio mesh config (values for istio/istiod Helm)
meshConfig:
  defaultConfig:
    tracing:
      sampling: 10.0 # 10% head sample at proxies (still do tail sampling later)
      max_path_tag_length: 120

# Envoy bootstrap snippet
tracing:
  provider:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: otel-collector
      service_name: edge-gateway

Java (Spring Boot) with the Java agent
Use the OpenTelemetry Java agent to avoid touching code.
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
JAVA_TOOL_OPTIONS="-javaagent:/app/opentelemetry-javaagent.jar" \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 \
OTEL_TRACES_EXPORTER=otlp \
OTEL_SERVICE_NAME=orders-api \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod,service.version=1.2.3 \
java -jar app.jar

Add manual spans where business context matters:
// OrderController.java
Span span = Span.current();
span.setAttribute("tenant_id", tenantId);
span.addEvent("cart_validated");

Node.js (Express)
npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces' }),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// server.ts
import './tracing';
import express from 'express';
const app = express();
app.get('/health', (_req, res) => res.send('ok'));
app.listen(3000);

gRPC and async (Kafka)
gRPC is usually covered by auto-instrumentation. Queues often aren’t—propagate headers yourself if needed.
// Kafka producer propagation (OTel Java)
TextMapSetter<Headers> setter = (carrier, key, value) -> carrier.add(new RecordHeader(key, value.getBytes(UTF_8)));
Context ctx = Context.current();
Headers headers = new RecordHeaders();
GlobalOpenTelemetry.getPropagators().getTextMapPropagator().inject(ctx, headers, setter);
producer.send(new ProducerRecord<>("orders.created", null, key, payload, headers));

On the consumer side, extract before you do work.
TextMapGetter<Headers> getter = new TextMapGetter<>() {
  public Iterable<String> keys(Headers c) {
    return StreamSupport.stream(c.spliterator(), false).map(h -> h.key()).collect(Collectors.toList());
  }
public String get(Headers c, String key) {
Header h = c.lastHeader(key);
return h == null ? null : new String(h.value(), UTF_8);
}
};
Context parent = GlobalOpenTelemetry.getPropagators().getTextMapPropagator().extract(Context.current(), record.headers(), getter);
Span span = tracer.spanBuilder("orders.worker.process").setParent(parent).startSpan();

Step 4: Put an OpenTelemetry Collector in the path
This is where the signal gets good and the cost gets sane.
- Receive OTLP from proxies and SDKs.
- Add resource attributes, scrub PII, and batch.
- Tail-sample to keep errors and slow traces.
- Export to Tempo/Jaeger/SaaS.
# otel-collector.yaml
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
processors:
  batch: {}
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete
  resource:
    attributes:
      - key: deployment.environment
        action: upsert
        value: prod
      - key: cloud.region
        action: upsert
        value: us-east-1
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_requests
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline_traffic
        type: probabilistic
        probabilistic: { sampling_percentage: 2 }
exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  logging: {}
extensions:
  health_check: {}
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub, resource, tail_sampling, batch]
      exporters: [otlphttp/tempo, logging]

Deploy via Helm or GitOps alongside your mesh/gateway. Watch Collector self-metrics (`otelcol_*`) to validate flow. Note that tail_sampling runs before batch in the pipeline: batching must not sit between the receiver and the sampling decision.
kubectl -n observability port-forward svc/otel-collector 8888:8888 &
curl -s localhost:8888/metrics | grep otelcol_receiver_accepted_spans

Step 5: Wire traces into SLO dashboards with exemplars
Traces are only useful if they shorten MTTR.
- Prometheus + Grafana: Enable exemplars so red charts link to a representative trace.
- SLOs: For `http_request_duration_seconds`, show P95 with exemplar trace IDs.
- Alerting: Include a Tempo/Jaeger link template with `{traceID}` in alerts.
Example panel query notes:
- Prometheus supports exemplar storage since v2.26 (behind `--enable-feature=exemplar-storage`); OTel SDKs attach exemplars automatically when configured.
- In Grafana, set Tempo as a data source; “Query with exemplars” emits trace links on your latency graphs.
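Under the hood, an exemplar is just one representative trace ID stored alongside a histogram bucket, which is what lets a dashboard link a red latency chart to a concrete trace. Here is a toy sketch of that mechanism (the `ExemplarSketch` class is hypothetical; real Prometheus/OTel SDKs manage this for you):

```java
// ExemplarSketch.java — keep the latest trace ID seen per latency bucket,
// mimicking what exemplar-aware histograms do under the hood.
public class ExemplarSketch {
    private final double[] bucketsMs = {100, 250, 500, 1000}; // upper bounds; last slot is +Inf
    private final long[] counts = new long[bucketsMs.length + 1];
    private final String[] exemplars = new String[bucketsMs.length + 1];

    public void observe(double latencyMs, String traceId) {
        int i = 0;
        while (i < bucketsMs.length && latencyMs > bucketsMs[i]) i++;
        counts[i]++;
        exemplars[i] = traceId; // remember one representative trace for this bucket
    }

    // The trace ID a dashboard would link from the slowest non-empty bucket.
    public String exemplarForSlowest() {
        for (int i = counts.length - 1; i >= 0; i--)
            if (counts[i] > 0) return exemplars[i];
        return null;
    }

    public static void main(String[] args) {
        ExemplarSketch h = new ExemplarSketch();
        h.observe(42, "trace-aaa");
        h.observe(730, "trace-bbb"); // lands in the ≤1000 ms bucket
        System.out.println(h.exemplarForSlowest()); // prints: trace-bbb
    }
}
```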
Rollout plan with checkpoints and metrics
You can ship this in two sprints without boiling the ocean.
Sprint 1 — Edge + two services
- Gateways/mesh emit traces to Collector.
- Auto-instrument `api-gateway` and `orders-api`.
- Kafka propagation for `orders.created`.
- Tail sampling in place.
- Checkpoints:
  - ≥80% of edge requests include `traceparent` (measure via access logs).
  - ≥90% of 5xx from `orders-api` have a kept trace.
  - Ingest budget ≤ 1.5 MB/s after sampling.
Sprint 2 — Complete the critical path
- Instrument `payments`, `inventory`, and the `orders-worker`.
- Add DB spans and key business attributes (`tenant_id`, `order_id`).
- Wire the Grafana SLO dashboard with exemplars.
- Checkpoints:
  - Median spans/trace for checkout ≥ 8.
  - 95th percentile of time accounted for across spans (no >20% “gap time”).
  - On-call runbook links from alerts to Tempo/Jaeger queries.
Harden and govern
- Attribute dictionary, span naming guide, and banned keys (PII).
- CI check: reject PRs that introduce high-cardinality labels (regex budget).
- Backpressure: rate-limit exporters in the Collector; alert on `otelcol_exporter_send_failed_spans`.
Common traps and how to fix them
- Dropped context at async boundaries: Verify Kafka/SQS propagation with integration tests that assert the presence of `traceparent` headers. I’ve seen this be 80% of “tracing doesn’t work” bugs.
- High-cardinality explosions: Do not put user emails, UUIDs, or raw SQL in span names or attributes. Allowlist attributes in the Collector and scrub the rest.
- Auto-instrumentation only: It’s a start, but without business attributes (`tenant_id`, `order_value`), traces won’t answer pager questions. Add 2–3 manual spans where it matters.
- Sampling in the SDK: Keep sampling decisions centralized in the Collector; SDK head-sampling can discard the very traces you need.
- Polyglot edge cases: Node.js async context and Python asyncio can lose context under load. Use the latest OTel libs and run a load test that specifically checks trace continuity.
- Mesh-only tracing: Service-to-service calls will show up, but DB/queue spans will be missing. You still need SDKs for internals.
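The first trap — dropped context at async boundaries — is cheap to guard against in CI. Here is a hedged sketch of such a test, using a plain `Map` as a stand-in for Kafka record headers so it runs without a broker; the `inject` helper is a fake, and a real test would call the OTel `TextMapPropagator` against your producer's `Headers` carrier as shown in Step 3.

```java
// PropagationTest.java — assert that context injection writes a traceparent
// that the consumer side can read back. The Map stands in for Kafka headers.
import java.util.HashMap;
import java.util.Map;

public class PropagationTest {
    // Fake producer-side write; a real test would call
    // propagator.inject(Context.current(), headers, setter) instead.
    static void inject(Map<String, String> carrier, String traceId, String spanId) {
        carrier.put("traceparent", String.format("00-%s-%s-01", traceId, spanId));
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");

        // The assertions your integration test should make on the consumed record:
        String tp = headers.get("traceparent");
        if (tp == null)
            throw new AssertionError("traceparent header missing after produce");
        if (!tp.matches("^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$"))
            throw new AssertionError("traceparent is malformed: " + tp);
        System.out.println("propagation ok: " + tp);
    }
}
```

Run the same assertions against every queue hop on your critical path; a missing header here is exactly the broken-trace bug that surfaces later as “tracing doesn’t work.”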
Proving it worked (and didn’t melt your bill)
- MTTR for checkout incidents down from 90m to <20m.
- Error trace capture ≥ 95% with automatic tail sampling.
- Ingest: 2% baseline sample on success, 100% on errors/slow → 6–10× lower cost vs. all-in.
- On-call: Single-click from P95 chart to bad trace with root cause (DB pool or downstream 429s).
If you want a sanity check before you roll, we’ve done this at scale across Istio, NGINX, Linkerd, Spring Boot, Node, Python, gRPC, Kafka, SQS, and yes—legacy monoliths that still run the business. GitPlumbers can blueprint your rollout and pair with your team so it sticks this time.
Key takeaways
- Instrument the edge first, then the top 5 critical flows—don’t boil the ocean.
- Use W3C Trace-Context end-to-end and test propagation across HTTP, gRPC, and your queues.
- Put an OpenTelemetry Collector in the path with tail-based sampling and PII scrubbing.
- Standardize attributes and span names early to avoid a cardinality explosion later.
- Wire traces into SLO dashboards with exemplars so on-call can jump from a red chart to a bad trace in one click.
Implementation checklist
- Pick a backend (Tempo, Jaeger, Honeycomb, Datadog, X-Ray) and a sampling budget.
- Enable tracing at the gateway/ingress and mesh sidecars.
- Auto-instrument top 5 services; add manual spans only where business context is needed.
- Propagate context across async (Kafka/SQS) and batch jobs.
- Deploy an OTel Collector with tail_sampling, resource detection, and attribute scrubbing.
- Define a trace schema: required attributes, naming, and error tagging.
- Add an SLO dashboard with exemplar links to traces.
- Track adoption metrics: trace coverage, completeness, error capture rate, ingest volume.
Questions we hear from teams
- Should we start with logs, metrics, or traces?
- If you’re firefighting latency and timeouts, start with traces at the edge plus your top 5 flows. Metrics are your SLO backbone, logs tell stories, but traces explain where time went. Then stitch them together with exemplars and log links.
- Is vendor X better than self-hosted?
- Depends on your team. If you have K8s + S3 and Grafana already, Tempo is cost-effective and scalable. If you need strong UI and event-based analysis, Honeycomb shines. Datadog is convenient if you’re already in the ecosystem. The key is tail-based sampling and consistent propagation—those matter more than the logo.
- Can we avoid code changes entirely?
- You can go far with auto-instrumentation and mesh, but you’ll miss DB calls, queue hops, and business context. Plan for small, targeted code changes (2–3 manual spans and attributes) in critical services.
- How do we keep costs sane?
- Tail-based sampling (keep errors and slow, sample 1–5% of normal), batch processors in the Collector, and attribute scrubbing. Monitor `otelcol_exporter_sent_spans` and ingestion volume; set budgets per env (`deployment.environment`).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
