The Canary That Saved Black Friday: SLO-Driven Observability Stopped a Redis Client Meltdown
We replaced noisy alerts and blind spots with SLOs, OpenTelemetry, and canary analysis—then watched it prevent a seven-figure outage in real time.
We didn’t “monitor more.” We aligned telemetry to SLOs, wired it to rollouts, and let the system say “no” to a bad deploy before customers did.
The setup most of us inherit
I walked into a mid-market e‑commerce shop (let’s call them Cartwheel) three months before Black Friday. They were on EKS 1.27, Istio 1.20, ~200 services (a mix of Java 17 Spring Boot, Node 18, and a couple of Go 1.21 backends), and ElastiCache Redis 6.2. Deploys via ArgoCD 2.11, feature-flagged with LaunchDarkly. Monitoring was a patchwork: some CloudWatch dashboards, one aging Prometheus 2.33 scraping node exporters, and logs tailing in CloudWatch Logs with no correlation. On-call was living on espresso and adrenaline.
- Median MTTR: 2h 12m
- Alert noise: 300+ pages/month, 60% unactionable
- Zero tracing in production, partial metrics, logs without correlation IDs
- Holiday code freeze looming, leadership nervous (for good reason)
I’ve seen this movie. If we didn’t make observability boring and reliable fast, Black Friday would be a coin flip.
What we changed in six weeks (and why it matters)
We didn’t boil the ocean. We picked the revenue paths and instrumented ruthlessly. The rule: if it doesn’t improve a customer SLO, it doesn’t ship now.
SLOs that mattered
- Checkout availability: 99.9% over 30 days
- Checkout p95 latency: < 650ms
- Add-to-cart error rate: < 0.5%
- Error budget policy: fast burn pages, slow burn tickets
Telemetry standardization
- OpenTelemetry everywhere: `opentelemetry-javaagent 1.28.0` for Java, `@opentelemetry/sdk-node 0.44.x` for Node, `otelhttp` for Go
- `traceparent` propagation via W3C headers through Istio/Envoy
- OpenTelemetry Collector 0.96.0 as a DaemonSet + gateway, with tail-based sampling (keep all 5xx and anything slower than p95)
- Metrics to Prometheus 2.48.0 with exemplars; traces to Tempo 2.5; logs to Loki 2.9; dashboards in Grafana 10.4
Alerts that page humans only when users hurt
- Multi-window, multi-burn-rate alerts per SLO (fast 5m/1h and slow 30m/6h)
- Routing via Alertmanager to PagerDuty by service ownership (see the routing sketch below)
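Ownership-based routing only works if the alerts carry an ownership label. Here is a minimal Alertmanager sketch of that routing; the `team` label, receiver names, and integration keys are illustrative assumptions, not Cartwheel’s actual config:

# alertmanager.yml sketch: route burn-rate pages by service ownership.
# The `team` label and receiver names are illustrative.
route:
  receiver: default-ticket-queue        # slow burns and anything unmatched
  group_by: [alertname, slo]
  routes:
    - matchers:
        - severity = page
        - team = checkout
      receiver: pagerduty-checkout
    - matchers:
        - severity = page
        - team = cart
      receiver: pagerduty-cart
receivers:
  - name: default-ticket-queue
    webhook_configs:
      - url: http://ticket-bridge.internal/alerts   # opens a ticket instead of paging
  - name: pagerduty-checkout
    pagerduty_configs:
      - routing_key: REPLACE_WITH_CHECKOUT_INTEGRATION_KEY
  - name: pagerduty-cart
    pagerduty_configs:
      - routing_key: REPLACE_WITH_CART_INTEGRATION_KEY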
Canary analysis on SLO queries
- Argo Rollouts 1.6 gating promotions with Prometheus queries on the SLO error ratio and error budget, not arbitrary CPU graphs
Runbooks and rollback muscle
- GitOps-first rollback steps, one-click ArgoCD health checks, and a bot that posts the Grafana panel + runbook to Slack when a burn alert fires
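The Slack bot itself was custom glue, but a plain Alertmanager Slack receiver that templates alert annotations gets most of the way there. A sketch, assuming the burn rules carry `runbook_url` and `dashboard` annotations alongside `summary`:

# Slack receiver sketch: approximates the "panel + runbook in Slack" bot.
# Assumes alert rules carry runbook_url and dashboard annotations.
receivers:
  - name: slack-checkout-burn
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME/TOO   # placeholder webhook
        channel: "#checkout-incidents"
        send_resolved: true
        title: '{{ .CommonLabels.alertname }} ({{ .CommonLabels.severity }})'
        text: |
          {{ .CommonAnnotations.summary }}
          Runbook: {{ .CommonAnnotations.runbook_url }}
          Dashboard: {{ .CommonAnnotations.dashboard }}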
Here’s the kind of PrometheusRule we shipped for checkout availability:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-burn
  labels:
    slo: checkout-availability
spec:
  groups:
    - name: checkout.slo
      rules:
        - record: slo:checkout_availability:error_ratio
          expr: |
            sum(rate(http_requests_total{job="checkout",status=~"5..|429"}[5m]))
            /
            sum(rate(http_requests_total{job="checkout"}[5m]))
        - alert: SLOErrorBudgetBurnFast
          # Fast burn: 14.4x the 0.1% budget over both the 5m and 1h windows.
          expr: |
            slo:checkout_availability:error_ratio > (0.001 * 14.4)
            and
            avg_over_time(slo:checkout_availability:error_ratio[1h]) > (0.001 * 14.4)
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Checkout fast burn > 14.4x"
        - alert: SLOErrorBudgetBurnSlow
          # Slow burn: 6x the budget over both the 30m and 6h windows.
          expr: |
            avg_over_time(slo:checkout_availability:error_ratio[30m]) > (0.001 * 6)
            and
            avg_over_time(slo:checkout_availability:error_ratio[6h]) > (0.001 * 6)
          for: 30m
          labels:
            severity: ticket
          annotations:
            summary: "Checkout slow burn > 6x"
And we added exemplars to RED metrics so you can hop from a spike to a representative trace in one click.
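That hop needs two things: Prometheus running with `--enable-feature=exemplar-storage`, and Grafana knowing which exemplar label carries the trace ID. A minimal datasource provisioning sketch, assuming a Tempo datasource with UID `tempo` and `trace_id` as the exemplar label:

# grafana/provisioning/datasources/prometheus.yml (sketch)
# Prometheus itself must run with --enable-feature=exemplar-storage.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label that carries the trace ID
          datasourceUid: tempo  # click-through opens the trace in Tempo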
The day it almost blew up
Black Friday week, traffic climbing. A routine change slipped into the canary: `node-redis` bumped from 4.5.x to 4.6.7. Harmless? The release notes buried a default change in connection pooling. Under surge, the new default caused aggressive connection churn against ElastiCache, which translated into intermittent `ECONNRESET` errors and timeouts on writes.
Timeline (UTC):
- 14:07 – Argo Rollouts starts a 10% canary on `cart-service` and `checkout-api`.
- 14:11 – The fast-burn SLO alert flirts with the threshold: error ratio hits 0.35% for 2 minutes, then recovers. Canary stays put.
- 14:13 – Error ratio jumps to 1.1% at 10% traffic. RED dashboard shows p95 latency creeping from 480ms to 690ms.
- 14:14 – The Argo Rollouts analysis template fails the Prometheus query guardrail. Promotion is automatically paused. No human clicks yet.
- 14:15 – PagerDuty pages the on-call with the fast-burn alert, and Slack bot posts the checkout SLO panel + top trace exemplars.
In the old world, this would have hit 100% and we’d be firefighting during peak. Instead, we were staring at a contained problem at 10% traffic, 45 minutes before the traffic apex.
Root cause in minutes, not hours
Here’s what “observability works” looks like in practice:
- The Grafana panel’s exemplar popped a trace where `checkout-api` → `cart-service` → `redis` showed a burst of spans ending in `ECONNRESET`. We didn’t have to grep logs hoping IDs matched—we clicked.
- The correlated Loki logs (thanks to `trace_id` in logfmt; the datasource wiring is sketched below) showed `node-redis` connection state churn: `connect`, `end`, `reconnect` in tight loops under load.
- The Tempo trace waterfall made the contention obvious: downstream spans to Redis ballooned, and upstream time spent in retry logic ate the app’s latency budget.
- USE dashboards on the Redis client nodes showed file descriptor usage flapping near limits.
- Istio metrics looked clean, so we bypassed the “blame the mesh” rabbit hole.
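The “we clicked” part on the logs side is configuration, not magic: a Grafana derived field turns the `trace_id` in each log line into a Tempo link. A sketch, again assuming a Tempo datasource UID of `tempo` and logfmt-style `trace_id=...` fields:

# grafana/provisioning/datasources/loki.yml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'   # pull the ID out of logfmt lines
          url: '$${__value.raw}'           # $$ escapes $ in provisioning files
          datasourceUid: tempo             # open the matched trace in Tempo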
Two possible fixes: tweak the client pool or roll back. We had both prepared.
The on-call followed the runbook:
# 1) Abort canary promotion
kubectl argo rollouts abort checkout-api
# 2) Roll back to previous image via GitOps
git revert <commit> && git push
# 3) ArgoCD sync
argocd app sync checkout-api --prune
# 4) Verify SLO burn subsides (<1x)
# Grafana link posted by bot uses panel share link with variables
Total time from page to rollback complete: 11 minutes. Impact limited to the 10% canary slice for ~6 minutes at elevated error rates. That’s a blip in the revenue graph, not a headline in the postmortem.
For completeness, we later pinned `node-redis` and adjusted the pool settings:
// Node 18 (ESM) + node-redis 4.x, pinned, with explicit pool settings
import { createClient } from "redis";
const client = createClient({
  socket: {
    keepAlive: 30000, // TCP keep-alive every 30s
    reconnectStrategy: (retries) => Math.min(retries * 50, 1000), // capped backoff
  },
  // Explicit pool constraints to avoid connection churn under burst
  isolationPoolOptions: { max: 100, min: 10, acquireTimeoutMillis: 200 },
});
client.on("error", (err) => console.error("redis client error", err));
await client.connect();
What we measured (before vs after)
I don’t care how pretty your graphs are—show me the deltas.
- MTTR p50: from 2h12m → 16m (−87%)
- Pages/month: from 300+ → 114 (−62%), and 90% mapped to a runbook
- First-failure detection: from “customer tweets” → SLO burn alert within 120s
- Deploy frequency: +40% (guardrails made on-call comfortable shipping during peak)
- Tracing coverage on critical paths: 92% with tail-based sampling preserving all 5xx and slow traces
- Infra and SaaS bill: +18% telemetry cost, but avoided a projected $1.2M revenue hit based on historical conversion rates during peak hour
This is the only ROI calculus that matters to leadership: tiny, predictable telemetry spend vs. existential peak-day risk.
Implementation details you can steal
If you’re working in a similar stack, here’s the recipe that has worked across multiple orgs:
Standardize labels early
- Use `service`, `namespace`, `version`, and `env` labels on metrics. Avoid unbounded cardinality (no raw `user_id`); a scrape-time guardrail is sketched below.
- Add `trace_id` to logs. If you must log user context, hash and clamp it.
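The scrape-time guardrail mentioned above can live in Prometheus itself: drop known cardinality bombs at ingest. The job and label names here are illustrative, not Cartwheel’s real ones:

# Prometheus scrape_config snippet (sketch): drop high-cardinality labels at ingest.
scrape_configs:
  - job_name: checkout
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      - action: labeldrop
        regex: "user_id|session_id|cart_id"   # labels that should never reach the TSDB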
OTel Collector config that pays off
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch: {}
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 700
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
    enable_open_metrics: true
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch, tail_sampling], exporters: [otlp/tempo] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
- Argo Rollouts AnalysisTemplate example (gate on SLO query)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-gate
spec:
  metrics:
    - name: error-ratio
      interval: 60s
      # result is a vector; gate on the first sample staying under 0.5%
      successCondition: result[0] < 0.005
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="checkout",status=~"5..|429", rollout="canary"}[2m]))
            /
            sum(rate(http_requests_total{job="checkout", rollout="canary"}[2m]))
Dashboards that matter
- One RED per service, one USE per node/infra domain.
- Panels link with variables to traces/logs. No “wall of 12 CPU charts.”
Pager routing with ownership
- If a page fires and no one knows who owns it, you don’t have observability—you have noise.
What I’d do the same tomorrow (and what I’d skip)
Do again
- Start from SLOs. It keeps engineers and execs speaking the same language.
- Tie canaries to SLO queries. It’s the difference between “we hope” and “we know.”
- Keep tail-based sampling. The “interesting 1%” pays the bills in incident response.
Skip
- Chasing 100% tracing coverage on day one. Get critical paths first.
- Over-indexing on vendor magic. We shipped this on OSS: Prometheus, Grafana, Loki, Tempo, OTel. Buy where it accelerates, not where it replaces thinking.
- Vanity alerts. If it doesn’t map to an SLO or a runbook, it’s not a page.
You don’t buy observability. You build it deliberately around what your users pay you for.
If you’re staring at a peak season with the same uneasy feeling Cartwheel had, we can help you make this boring in a month, not a quarter.
Key takeaways
- SLOs with multi-window burn-rate alerts cut through noise and highlighted business risk, not just red graphs.
- Standardized telemetry via `OpenTelemetry` with trace IDs in logs made cross-layer debugging a two-minute task, not a war room.
- Canary analysis tied to SLO metrics stopped a bad Redis client upgrade before it hit 100% of traffic.
- Tail-based trace sampling kept costs sane while preserving the troublesome 1% of requests you actually need to see.
- Owned runbooks and GitOps rollbacks turned insights into action in minutes.
Implementation checklist
- Define 3–5 business SLOs and wire burn-rate alerts to PagerDuty. No SLO, no alert.
- Propagate `traceparent` everywhere. Add `trace_id` and `span_id` to logs.
- Adopt `OpenTelemetry Collector` with tail-based sampling and exemplars to Prometheus.
- Use Argo Rollouts (or equivalent) to gate canaries on SLO queries, not raw metrics.
- Create RED + USE dashboards per service. Strip vanity graphs.
- Write rollback runbooks. Practice them. Automate where safe.
Questions we hear from teams
- How did you keep observability costs from exploding?
- Two levers: tail-based trace sampling (keep all errors and slow requests, sample the rest) and strict label hygiene to avoid cardinality bombs. We also pushed high-cardinality logs to Loki with retention tiers (hot 7 days, cold 30) and kept metrics retention at 15 days for high-res series, 90 days for downsampled.
- Why OpenTelemetry instead of a single vendor agent?
- Portability and flexibility. OTel let us route the same data to Prometheus/Grafana/Tempo/Loki now and keep an exit ramp to a vendor later. Auto-instrumentation for Java/Node was mature enough (1.28.0/0.44.x), and the Collector gave us control over sampling and routing.
- Do I need Argo Rollouts to gate canaries on SLOs?
- No. Spinnaker, Flagger, and even bespoke CD pipelines can call Prometheus and make promotion decisions. What matters is gating on SLO-aligned queries, not just infrastructure metrics.
- What’s the minimum to start if I have four weeks?
- Pick two critical journeys, define availability and latency SLOs, instrument the edge and the two hottest backends with OTel, add burn-rate alerts, and wire a single canary to those queries. You can harden and expand later.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.