The Canary That Saved Black Friday: SLO-Driven Observability Stopped a Redis Client Meltdown
We replaced noisy alerts and blind spots with SLOs, OpenTelemetry, and canary analysis—then watched it prevent a seven-figure outage in real time.
We didn’t “monitor more.” We aligned telemetry to SLOs, wired it to rollouts, and let the system say “no” to a bad deploy before customers did.
The setup most of us inherit
I walked into a mid-market e‑commerce shop (let’s call them Cartwheel) three months before Black Friday. They were on EKS 1.27, Istio 1.20, ~200 services (a mix of Java 17 Spring Boot, Node 18, and a couple of Go 1.21 backends), and ElastiCache Redis 6.2. Deploys via ArgoCD 2.11, feature-flagged with LaunchDarkly. Monitoring was a patchwork: some CloudWatch dashboards, one aging Prometheus 2.33 scraping node exporters, and logs tailing in CloudWatch Logs with no correlation. On-call was living on espresso and adrenaline.
- Median MTTR: 2h 12m
- Alert noise: 300+ pages/month, 60% unactionable
- Zero tracing in production, partial metrics, logs without correlation IDs
- Holiday code freeze looming, leadership nervous (for good reason)
I’ve seen this movie. If we didn’t make observability boring and reliable fast, Black Friday would be a coin flip.
What we changed in six weeks (and why it matters)
We didn’t boil the ocean. We picked the revenue paths and instrumented ruthlessly. The rule: if it doesn’t improve a customer SLO, it doesn’t ship now.
SLOs that mattered
- Checkout availability: 99.9% over 30 days
- Checkout p95 latency: < 650ms
- Add-to-cart error rate: < 0.5%
- Error budget policy: fast burn pages, slow burn tickets
Telemetry standardization
- OpenTelemetry everywhere: `opentelemetry-javaagent 1.28.0` for Java, `@opentelemetry/sdk-node 0.44.x` for Node, `otelhttp` for Go
- `traceparent` propagation via W3C headers through Istio/Envoy
- OpenTelemetry Collector 0.96.0 as a DaemonSet + gateway, with tail-based sampling (keep all 5xx and anything slower than p95)
- Metrics to Prometheus 2.48.0 with exemplars; traces to Tempo 2.5; logs to Loki 2.9; dashboards in Grafana 10.4
Alerts that page humans only when users hurt
- Multi-window, multi-burn-rate alerts per SLO (fast 5m/1h and slow 30m/6h)
- Routing via Alertmanager to PagerDuty by service ownership (see the routing sketch below)
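Ownership-based routing only works if the alerts carry an ownership label. Here is a minimal Alertmanager sketch of that routing; the `team` label, receiver names, and integration keys are illustrative assumptions, not Cartwheel’s actual config:

# alertmanager.yml sketch: route burn-rate pages by service ownership.
# The `team` label and receiver names are illustrative.
route:
  receiver: default-ticket-queue        # slow burns and anything unmatched
  group_by: [alertname, slo]
  routes:
    - matchers:
        - severity = page
        - team = checkout
      receiver: pagerduty-checkout
    - matchers:
        - severity = page
        - team = cart
      receiver: pagerduty-cart
receivers:
  - name: default-ticket-queue
    webhook_configs:
      - url: http://ticket-bridge.internal/alerts   # opens a ticket instead of paging
  - name: pagerduty-checkout
    pagerduty_configs:
      - routing_key: REPLACE_WITH_CHECKOUT_INTEGRATION_KEY
  - name: pagerduty-cart
    pagerduty_configs:
      - routing_key: REPLACE_WITH_CART_INTEGRATION_KEY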
Canary analysis on SLO queries
- Argo Rollouts 1.6 gating promotions with Prometheus queries on the SLO error ratio and error budget, not arbitrary CPU graphs
Runbooks and rollback muscle
- GitOps-first rollback steps, one-click ArgoCD health checks, and a bot that posts the Grafana panel + runbook to Slack when a burn alert fires
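The Slack bot itself was custom glue, but a plain Alertmanager Slack receiver that templates alert annotations gets most of the way there. A sketch, assuming the burn rules carry `runbook_url` and `dashboard` annotations alongside `summary`:

# Slack receiver sketch: approximates the "panel + runbook in Slack" bot.
# Assumes alert rules carry runbook_url and dashboard annotations.
receivers:
  - name: slack-checkout-burn
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME/TOO   # placeholder webhook
        channel: "#checkout-incidents"
        send_resolved: true
        title: '{{ .CommonLabels.alertname }} ({{ .CommonLabels.severity }})'
        text: |
          {{ .CommonAnnotations.summary }}
          Runbook: {{ .CommonAnnotations.runbook_url }}
          Dashboard: {{ .CommonAnnotations.dashboard }}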
Here’s the kind of PrometheusRule we shipped for checkout availability:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-burn
  labels:
    slo: checkout-availability
spec:
  groups:
    - name: checkout.slo
      rules:
        - record: slo:checkout_availability:error_ratio
          expr: |
            sum(rate(http_requests_total{job="checkout",status=~"5..|429"}[5m]))
            /
            sum(rate(http_requests_total{job="checkout"}[5m]))
        - alert: SLOErrorBudgetBurnFast
          # Fast burn: 14.4x the 0.1% budget over both the 5m and 1h windows.
          expr: |
            slo:checkout_availability:error_ratio > (0.001 * 14.4)
            and
            avg_over_time(slo:checkout_availability:error_ratio[1h]) > (0.001 * 14.4)
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Checkout fast burn > 14.4x"
        - alert: SLOErrorBudgetBurnSlow
          # Slow burn: 6x the budget over both the 30m and 6h windows.
          expr: |
            avg_over_time(slo:checkout_availability:error_ratio[30m]) > (0.001 * 6)
            and
            avg_over_time(slo:checkout_availability:error_ratio[6h]) > (0.001 * 6)
          for: 30m
          labels:
            severity: ticket
          annotations:
            summary: "Checkout slow burn > 6x"
And we added exemplars to RED metrics so you can hop from a spike to a representative trace in one click.
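That hop needs two things: Prometheus running with `--enable-feature=exemplar-storage`, and Grafana knowing which exemplar label carries the trace ID. A minimal datasource provisioning sketch, assuming a Tempo datasource with UID `tempo` and `trace_id` as the exemplar label:

# grafana/provisioning/datasources/prometheus.yml (sketch)
# Prometheus itself must run with --enable-feature=exemplar-storage.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label that carries the trace ID
          datasourceUid: tempo  # click-through opens the trace in Tempo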
The day it almost blew up
Black Friday week, traffic climbing. A routine change slipped into the canary: `node-redis` bumped from 4.5.x to 4.6.7. Harmless? The release notes buried a default change in connection pooling. Under surge, the new default caused aggressive connection churn against ElastiCache, which translated into intermittent `ECONNRESET` errors and timeouts on writes.
Timeline (UTC):
- 14:07 – Argo Rollouts starts a 10% canary on `cart-service` and `checkout-api`.
- 14:11 – The fast-burn SLO alert flirts with the threshold: error ratio hits 0.35% for 2 minutes, then recovers. Canary stays put.
- 14:13 – Error ratio jumps to 1.1% at 10% traffic. RED dashboard shows p95 latency creeping from 480ms to 690ms.
- 14:14 – The Argo Rollouts analysis template fails the Prometheus query guardrail. Promotion is automatically paused. No human clicks yet.
- 14:15 – PagerDuty pages the on-call with the fast-burn alert, and Slack bot posts the checkout SLO panel + top trace exemplars.
In the old world, this would have hit 100% and we’d be firefighting during peak. Instead, we were staring at a contained problem at 10% traffic, 45 minutes before the traffic apex.
Root cause in minutes, not hours
Here’s what “observability works” looks like in practice:
- The Grafana panel’s exemplar popped a trace where `checkout-api` → `cart-service` → `redis` showed a burst of spans ending in `ECONNRESET`. We didn’t have to grep logs hoping IDs matched—we clicked.
- The correlated Loki logs (thanks to `trace_id` in logfmt; the datasource wiring is sketched below) showed `node-redis` connection state churn: `connect`, `end`, `reconnect` in tight loops under load.
- The Tempo trace waterfall made the contention obvious: downstream spans to Redis ballooned, and upstream time spent in retry logic ate the app’s latency budget.
- USE dashboards on the Redis client nodes showed file descriptor usage flapping near limits.
- Istio metrics looked clean, so we bypassed the “blame the mesh” rabbit hole.
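The “we clicked” part on the logs side is configuration, not magic: a Grafana derived field turns the `trace_id` in each log line into a Tempo link. A sketch, again assuming a Tempo datasource UID of `tempo` and logfmt-style `trace_id=...` fields:

# grafana/provisioning/datasources/loki.yml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'   # pull the ID out of logfmt lines
          url: '$${__value.raw}'           # $$ escapes $ in provisioning files
          datasourceUid: tempo             # open the matched trace in Tempo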
Two possible fixes: tweak the client pool or roll back. We had both prepared.
The on-call followed the runbook:
# 1) Abort canary promotion
kubectl argo rollouts abort checkout-api
# 2) Roll back to previous image via GitOps
git revert <commit> && git push
# 3) ArgoCD sync
argocd app sync checkout-api --prune
# 4) Verify SLO burn subsides (<1x)
# Grafana link posted by bot uses panel share link with variables
Total time from page to rollback complete: 11 minutes. Impact limited to the 10% canary slice for ~6 minutes at elevated error rates. That’s a blip in the revenue graph, not a headline in the postmortem.
For completeness, we later pinned `node-redis` and adjusted the pool settings:
// Node 18 (ESM) + node-redis 4.x, pinned, with explicit pool settings
import { createClient } from "redis";
const client = createClient({
  socket: {
    keepAlive: 30000, // TCP keep-alive every 30s
    reconnectStrategy: (retries) => Math.min(retries * 50, 1000), // capped backoff
  },
  // Explicit pool constraints to avoid connection churn under burst
  isolationPoolOptions: { max: 100, min: 10, acquireTimeoutMillis: 200 },
});
client.on("error", (err) => console.error("redis client error", err));
await client.connect();
What we measured (before vs after)
I don’t care how pretty your graphs are—show me the deltas.
- MTTR p50: from 2h12m → 16m (−87%)
- Pages/month: from 300+ → 114 (−62%), and 90% mapped to a runbook
- First-failure detection: from “customer tweets” → SLO burn alert within 120s
- Deploy frequency: +40% (guardrails made on-call comfortable shipping during peak)
- Tracing coverage on critical paths: 92% with tail-based sampling preserving all 5xx and slow traces
- Infra and SaaS bill: +18% telemetry cost, but avoided a projected $1.2M revenue hit based on historical conversion rates during peak hour
This is the only ROI calculus that matters to leadership: tiny, predictable telemetry spend vs. existential peak-day risk.
Implementation details you can steal
If you’re working in a similar stack, here’s the recipe that has worked across multiple orgs:
Standardize labels early
- Use `service`, `namespace`, `version`, and `env` labels on metrics. Avoid unbounded cardinality (no raw `user_id`); a scrape-time guardrail is sketched below.
- Add `trace_id` to logs. If you must log user context, hash and clamp it.
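The scrape-time guardrail mentioned above can live in Prometheus itself: drop known cardinality bombs at ingest. The job and label names here are illustrative, not Cartwheel’s real ones:

# Prometheus scrape_config snippet (sketch): drop high-cardinality labels at ingest.
scrape_configs:
  - job_name: checkout
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      - action: labeldrop
        regex: "user_id|session_id|cart_id"   # labels that should never reach the TSDB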
OTel Collector config that pays off
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch: {}
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 700
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
    enable_open_metrics: true
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch, tail_sampling], exporters: [otlp/tempo] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
- Argo Rollouts AnalysisTemplate example (gate on SLO query)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-gate
spec:
  metrics:
    - name: error-ratio
      interval: 60s
      # result is a vector; gate on the first sample staying under 0.5%
      successCondition: result[0] < 0.005
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="checkout",status=~"5..|429", rollout="canary"}[2m]))
            /
            sum(rate(http_requests_total{job="checkout", rollout="canary"}[2m]))
Dashboards that matter
- One RED per service, one USE per node/infra domain.
- Panels link with variables to traces/logs. No “wall of 12 CPU charts.”
Pager routing with ownership
- If a page fires and no one knows who owns it, you don’t have observability—you have noise.
What I’d do the same tomorrow (and what I’d skip)
Do again
- Start from SLOs. It keeps engineers and execs speaking the same language.
- Tie canaries to SLO queries. It’s the difference between “we hope” and “we know.”
- Keep tail-based sampling. The “interesting 1%” pays the bills in incident response.
Skip
- Chasing 100% tracing coverage on day one. Get critical paths first.
- Over-indexing on vendor magic. We shipped this on OSS: Prometheus, Grafana, Loki, Tempo, OTel. Buy where it accelerates, not where it replaces thinking.
- Vanity alerts. If it doesn’t map to an SLO or a runbook, it’s not a page.
You don’t buy observability. You build it deliberately around what your users pay you for.
If you’re staring at a peak season with the same uneasy feeling Cartwheel had, we can help you make this boring in a month, not a quarter.
Key takeaways
- SLOs with multi-window burn-rate alerts cut through noise and highlighted business risk, not just red graphs.
- Standardized telemetry via `OpenTelemetry` with trace IDs in logs made cross-layer debugging a two-minute task, not a war room.
- Canary analysis tied to SLO metrics stopped a bad Redis client upgrade before it hit 100% of traffic.
- Tail-based trace sampling kept costs sane while preserving the troublesome 1% of requests you actually need to see.
- Owned runbooks and GitOps rollbacks turned insights into action in minutes.
Implementation checklist
- Define 3–5 business SLOs and wire burn-rate alerts to PagerDuty. No SLO, no alert.
- Propagate `traceparent` everywhere. Add `trace_id` and `span_id` to logs.
- Adopt `OpenTelemetry Collector` with tail-based sampling and exemplars to Prometheus.
- Use Argo Rollouts (or equivalent) to gate canaries on SLO queries, not raw metrics.
- Create RED + USE dashboards per service. Strip vanity graphs.
- Write rollback runbooks. Practice them. Automate where safe.
Questions we hear from teams
- How did you keep observability costs from exploding?
- Two levers: tail-based trace sampling (keep all errors and slow requests, sample the rest) and strict label hygiene to avoid cardinality bombs. We also pushed high-cardinality logs to Loki with retention tiers (hot 7 days, cold 30) and kept metrics retention at 15 days for high-res series, 90 days for downsampled.
- Why OpenTelemetry instead of a single vendor agent?
- Portability and flexibility. OTel let us route the same data to Prometheus/Grafana/Tempo/Loki now and keep an exit ramp to a vendor later. Auto-instrumentation for Java/Node was mature enough (1.28.0/0.44.x), and the Collector gave us control over sampling and routing.
- Do I need Argo Rollouts to gate canaries on SLOs?
- No. Spinnaker, Flagger, and even bespoke CD pipelines can call Prometheus and make promotion decisions. What matters is gating on SLO-aligned queries, not just infrastructure metrics.
- What’s the minimum to start if I have four weeks?
- Pick two critical journeys, define availability and latency SLOs, instrument the edge and the two hottest backends with OTel, add burn-rate alerts, and wire a single canary to those queries. You can harden and expand later.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.