The Canary That Saved Black Friday: SLO-Driven Observability Stopped a Redis Client Meltdown

We replaced noisy alerts and blind spots with SLOs, OpenTelemetry, and canary analysis—then watched it prevent a seven-figure outage in real time.

We didn’t “monitor more.” We aligned telemetry to SLOs, wired it to rollouts, and let the system say “no” to a bad deploy before customers did.

The setup most of us inherit

I walked into a mid-market e‑commerce shop (let’s call them Cartwheel) three months before Black Friday. They were on EKS 1.27, Istio 1.20, ~200 services (mix of Java 17 Spring Boot, Node 18, a couple of Go 1.21 backends), and ElastiCache Redis 6.2. Deploys via ArgoCD 2.11, feature flagged with LaunchDarkly. Monitoring was a patchwork: some CloudWatch dashboards, one aging Prometheus 2.33 scraping node exporters, and logs tailing in CloudWatch Logs with no correlation. On-call was living on espresso and adrenaline.

  • Median MTTR: 2h 12m
  • Alert noise: 300+ pages/month, 60% unactionable
  • Zero tracing in production, partial metrics, logs without correlation IDs
  • Holiday code freeze looming, leadership nervous (for good reason)

I’ve seen this movie. If we didn’t make observability boring and reliable fast, Black Friday would be a coin flip.

What we changed in six weeks (and why it matters)

We didn’t boil the ocean. We picked the revenue paths and instrumented ruthlessly. The rule: if it doesn’t improve a customer SLO, it doesn’t ship now.

  • SLOs that mattered

    • Checkout availability: 99.9% over 30 days
    • Checkout p95 latency: < 650ms
    • Add-to-cart error rate: < 0.5%
    • Error budget policy: fast burn pages, slow burn tickets
  • Telemetry standardization

    • OpenTelemetry everywhere: opentelemetry-javaagent 1.28.0 for Java, @opentelemetry/sdk-node 0.44.x for Node, otelhttp for Go (a minimal Node bootstrap sketch follows this list)
    • traceparent propagation via W3C headers through Istio/Envoy
    • OpenTelemetry Collector 0.96.0 as a DaemonSet + gateway with tail-based sampling (keep all 5xx and slow > p95)
    • Metrics to Prometheus 2.48.0 with exemplars; traces to Tempo 2.5; logs to Loki 2.9; dashboards in Grafana 10.4
  • Alerts that page humans only when users hurt

    • Multi-window, multi-burn rate alerts per SLO (fast 5m/1h and slow 30m/6h)
    • Routing via Alertmanager to PagerDuty by service ownership
  • Canary analysis on SLO queries

    • Argo Rollouts 1.6 gating promotions using Prometheus queries on SLO/error budget, not arbitrary CPU graphs
  • Runbooks and rollback muscle

    • GitOps-first rollback steps, one-click ArgoCD health checks, and a bot that posts the Grafana panel + runbook to Slack when a burn alert fires
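
To make the telemetry bullet concrete, here’s roughly what “OpenTelemetry everywhere” looked like on the Node side. This is a minimal bootstrap sketch, assuming @opentelemetry/sdk-node 0.44.x with the OTLP HTTP trace exporter; the collector URL and service name are illustrative, not prescriptive:

// Node 18: auto-instrumentation + OTLP traces to the local collector (sketch)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "checkout-api",              // keep in sync with the `service` metric label
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION ?? "unknown",
  }),
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4318/v1/traces" }),
  instrumentations: [getNodeAutoInstrumentations()],                        // http, express, redis clients, etc.
});

sdk.start(); // W3C traceparent propagation is the default, so context flows through Istio/Envoy untouched

The Java services get the equivalent from the javaagent; the point is that the apps ship almost nothing observability-specific beyond a service name and version.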

Here’s the kind of PrometheusRule we shipped for checkout availability. For context on the multipliers: a 99.9% SLO leaves a 0.1% error budget over 30 days, so a 14.4x burn exhausts it in about two days and a 6x burn in about five, which is why one pages and the other tickets:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-burn
  labels:
    slo: checkout-availability
spec:
  groups:
  - name: checkout.slo
    rules:
    - record: slo:checkout_availability:error_ratio
      expr: |
        sum(rate(http_requests_total{job="checkout",status=~"5..|429"}[5m]))
        /
        sum(rate(http_requests_total{job="checkout"}[5m]))
    - alert: SLOErrorBudgetBurnFast
      expr: |
        avg_over_time(slo:checkout_availability:error_ratio[5m]) > (14.4 * 0.001)
        and
        avg_over_time(slo:checkout_availability:error_ratio[1h]) > (14.4 * 0.001)
      for: 2m
      labels:
        severity: page
      annotations:
        summary: "Checkout fast burn > 14.4x"
    - alert: SLOErrorBudgetBurnSlow
      expr: |
        avg_over_time(slo:checkout_availability:error_ratio[30m]) > (6 * 0.001)
        and
        avg_over_time(slo:checkout_availability:error_ratio[6h]) > (6 * 0.001)
      for: 30m
      labels:
        severity: ticket
      annotations:
        summary: "Checkout slow burn > 6x"

And we added exemplars to RED metrics so you can hop from a spike to a representative trace in one click.

The day it almost blew up

Black Friday week, traffic climbing. A routine change slipped into the canary: node-redis bumped from 4.5.x to 4.6.7. Harmless? The release notes buried a default change in connection pooling. Under surge, the new default caused aggressive connection churn against ElastiCache, which translated into intermittent ECONNRESET and timeouts on writes.

Timeline (UTC):

  1. 14:07 – Argo Rollouts starts 10% canary on cart-service and checkout-api.
  2. 14:11 – The fast-burn SLO alert flirts with the threshold: error ratio hits 0.35% for 2 minutes, then recovers. Canary stays put.
  3. 14:13 – Error ratio jumps to 1.1% at 10% traffic. RED dashboard shows p95 latency creeping from 480ms to 690ms.
  4. 14:14 – The Argo Rollouts analysis template fails the Prometheus query guardrail. Promotion is automatically paused. No human clicks yet.
  5. 14:15 – PagerDuty pages the on-call with the fast-burn alert, and the Slack bot posts the checkout SLO panel + top trace exemplars.

In the old world, this would have hit 100% and we’d be firefighting during peak. Instead, we were staring at a contained problem at 10% traffic, 45 minutes before the traffic apex.

Root cause in minutes, not hours

Here’s what “observability works” looks like in practice:

  • The Grafana panel’s exemplar popped a trace where checkout-api → cart-service → redis showed a burst of spans ending in ECONNRESET. We didn’t have to grep logs hoping IDs matched—we clicked.
  • The correlated Loki logs (thanks to trace_id in logfmt) showed node-redis connection state churn: connect, end, reconnect in tight loops under load.
  • Tempo trace waterfall made the contention obvious: downstream spans to Redis ballooned; upstream time spent in retry logic ate the app budget.
  • USE dashboards on the Redis client node showed file descriptor usage flapping near limits. Istio metrics looked clean, so we bypassed the “blame the mesh” rabbit hole.

Two possible fixes: tweak the client pool or roll back. We had both prepared.

The on-call followed the runbook:

# 1) Abort canary promotion
kubectl argo rollouts abort checkout-api

# 2) Roll back to previous image via GitOps
git revert <commit> && git push

# 3) ArgoCD sync
argocd app sync checkout-api --prune

# 4) Verify SLO burn subsides (<1x)
# Grafana link posted by bot uses panel share link with variables

Total time from page to rollback complete: 11 minutes. Impact limited to the 10% canary slice for ~6 minutes at elevated error rates. That’s a blip in the revenue graph, not a headline in the postmortem.

For completeness, we later pinned node-redis and adjusted pool settings:

// Node 18 + node-redis 4.x
import { createClient } from "redis";

const client = createClient({
  socket: { keepAlive: 30000, reconnectStrategy: (retries) => Math.min(retries * 50, 1000) },
  // Explicit pool constraints to avoid churn under burst
  isolationPoolOptions: { max: 100, min: 10, acquireTimeoutMillis: 200 }
});

client.on("error", (err) => console.error("redis client error", err)); // an unhandled "error" event would crash the process
await client.connect();

What we measured (before vs after)

I don’t care how pretty your graphs are—show me the deltas.

  • MTTR p50: from 2h 12m → 16m (−87%)
  • Pages/month: from 300+ → 114 (−62%), and 90% mapped to a runbook
  • First-failure detection: from “customer tweets” → SLO burn alert within 120s
  • Deploy frequency: +40% (guardrails made on-call comfortable shipping during peak)
  • Tracing coverage on critical paths: 92% with tail-based sampling preserving all 5xx and slow traces
  • Infra and SaaS bill: +18% telemetry cost, but avoided a projected $1.2M revenue hit based on historical conversion rates during peak hour

This is the only ROI calculus that matters to leadership: tiny, predictable telemetry spend vs. existential peak-day risk.

Implementation details you can steal

If you’re working in a similar stack, here’s the recipe that has worked across multiple orgs:

  • Standardize labels early

    • Use service, namespace, version, env labels on metrics. Avoid unbounded cardinality (no raw user_id).
    • Add trace_id to logs. If you must log user context, hash and clamp.
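
A sketch of the log side of that, assuming pino and the standard @opentelemetry/api; the field names are just what our Loki-to-Tempo links expect:

// Stamp trace_id/span_id onto every log line so Loki can link back to Tempo (sketch)
import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();   // undefined outside an active span/request context
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

logger.info("add-to-cart accepted");      // emits JSON with trace_id/span_id when called inside a span
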
  • OTel Collector config that pays off

receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch: {}
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 700
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
    enable_open_metrics: true
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch, tail_sampling], exporters: [otlp/tempo] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
  • Argo Rollouts AnalysisTemplate example (gate on SLO query)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-gate
spec:
  metrics:
  - name: error-ratio
    interval: 60s
    successCondition: result[0] < 0.005
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{job="checkout",status=~"5..|429", rollout="canary"}[2m]))
          /
          sum(rate(http_requests_total{job="checkout", rollout="canary"}[2m]))
  • Dashboards that matter

    • One RED per service, one USE per node/infra domain.
    • Panels link with variables to traces/logs. No “wall of 12 CPU charts.”
  • Pager routing with ownership

    • If a page fires and no one knows who owns it, you don’t have observability—you have noise.

What I’d do the same tomorrow (and what I’d skip)

  • Do again

    • Start from SLOs. It keeps engineers and execs speaking the same language.
    • Tie canaries to SLO queries. It’s the difference between “we hope” and “we know.”
    • Keep tail-based sampling. The “interesting 1%” pays the bills in incident response.
  • Skip

    • Chasing 100% tracing coverage on day one. Get critical paths first.
    • Over-indexing on vendor magic. We shipped this on OSS: Prometheus, Grafana, Loki, Tempo, OTel. Buy where it accelerates, not replaces thinking.
    • Vanity alerts. If it doesn’t map to an SLO or a runbook, it’s not a page.

You don’t buy observability. You build it deliberately around what your users pay you for.

If you’re staring at a peak season with the same uneasy feeling Cartwheel had, we can help you make this boring in a month, not a quarter.


Key takeaways

  • SLOs with multi-window burn-rate alerts cut through noise and highlighted business risk, not just red graphs.
  • Standardized telemetry via `OpenTelemetry` with trace IDs in logs made cross-layer debugging a two-minute task, not a war room.
  • Canary analysis tied to SLO metrics stopped a bad Redis client upgrade before it hit 100% of traffic.
  • Tail-based trace sampling kept costs sane while preserving the troublesome 1% of requests you actually need to see.
  • Owned runbooks and GitOps rollbacks turned insights into action in minutes.

Implementation checklist

  • Define 3–5 business SLOs and wire burn-rate alerts to PagerDuty. No SLO, no alert.
  • Propagate `traceparent` everywhere. Add `trace_id` and `span_id` to logs.
  • Adopt `OpenTelemetry Collector` with tail-based sampling and exemplars to Prometheus.
  • Use Argo Rollouts (or equivalent) to gate canaries on SLO queries, not raw metrics.
  • Create RED + USE dashboards per service. Strip vanity graphs.
  • Write rollback runbooks. Practice them. Automate where safe.

Questions we hear from teams

How did you keep observability costs from exploding?
Two levers: tail-based trace sampling (keep all errors and slow requests, sample the rest) and strict label hygiene to avoid cardinality bombs. We also pushed high-cardinality logs to Loki with retention tiers (hot 7 days, cold 30) and kept metrics retention at 15 days for high-res series, 90 days for downsampled.
Why OpenTelemetry instead of a single vendor agent?
Portability and flexibility. OTel let us route the same data to Prometheus/Grafana/Tempo/Loki now and keep an exit ramp to a vendor later. Auto-instrumentation for Java/Node was mature enough (1.28.0/0.44.x), and the Collector gave us control over sampling and routing.
Do I need Argo Rollouts to gate canaries on SLOs?
No. Spinnaker, Flagger, and even bespoke CD pipelines can call Prometheus and make promotion decisions. What matters is gating on SLO-aligned queries, not just infrastructure metrics.
What’s the minimum to start if I have four weeks?
Pick two critical journeys, define availability and latency SLOs, instrument the edge and the two hottest backends with OTel, add burn-rate alerts, and wire a single canary to those queries. You can harden and expand later.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get an SLO playbook tailored to your stack, or see how we gate canaries on SLOs.
