Should we use OpenTelemetry or just stick with our vendor’s agent?

Use OpenTelemetry for the application side and standardization. Export OTLP to a local collector and fan out to vendors from there. This avoids lock-in and lets you mix Jaeger/Tempo/Honeycomb/Datadog as needed. If you already run a vendor agent, keep it at the collector tier, not inside every app.

Head vs tail sampling—what actually works in production?

Start with 5–10% head sampling at the edge to cap volume. Add tail sampling in the collector to guarantee retention of errors and slow traces. Head-only is cheap but misses rare events. Tail-only is expensive if the edge drops context. The hybrid policy balances both.

How do we trace across Kafka without rewriting everything?

Adopt the OpenTelemetry instrumentation for your Kafka client (Java/Node/Go), enable header propagation, and add thin wrappers if your framework doesn’t do it automatically. Test it: assert headers exist on produced messages and are extracted on consume. Add chaos tests that strip headers to ensure re-rooting works.

Won’t tracing add too much latency?

With batch processors and OTLP over gRPC to a local collector, overhead should be <5ms at p95 and <2% CPU. Measure in your load tests. If you exceed that, check sync exporters, oversized attributes, and per-span logging. Adjust batch sizes and disable verbose instrumentation you don’t need.

What if we’re not on Kubernetes?

Run the collector as a sidecar or a VM service, point SDKs at `localhost:4317`, and manage config with your standard tooling (Ansible, Terraform). Gateways like Envoy or API Gateway can still forward `traceparent`. The principles—edge first, collector in the middle, sampling at the collector—still apply.

Guides · Oct 1, 2025 · 10 minute read

Tracing That Survives Prod: A Pragmatic Playbook for Microservices (OpenTelemetry, Meshes, and Messy Reality)

Q: What if we’re not on Kubernetes?

Run the collector as a sidecar or a VM service, point SDKs at `localhost:4317`, and manage config with your standard tooling (Ansible, Terraform). Gateways like Envoy or API Gateway can still forward `traceparent`. The principles—edge first, collector in the middle, sampling at the collector—still apply.

If your incident channel still asks “which service is slow?”, you don’t have tracing—you have vibes. Here’s the step-by-step I use to get distributed tracing working across real microservices without wrecking SLOs or your budget.

Back to all posts

Tracing That Survives Prod: A Pragmatic Playbook for Microservices (OpenTelemetry, Meshes, and Messy Reality)

Key takeaways

Implementation checklist