The Night an SLO Burn Alert Saved Black Friday: An Observability Rehab That Paid for Itself
A fintech on GKE was flying blind into peak traffic. Four weeks of pragmatic observability work turned a looming incident into a non-event—and cut MTTR by 60%.
We didn’t page on CPU. We paged on customer risk—and bought ourselves the eight minutes that changed the outcome.
The setup you’ve seen before
They’re a mid-market fintech on GKE, processing card authorizations with a 200ms p95 latency SLO and strict uptime targets. Traffic is spiky around payroll and promos. They had Datadog for infra, Cloud Logging for app logs, and an old Pingdom check. No proper SLOs, lots of noisy alerts. Classic.
I’ve seen this movie: teams lean on expensive logs and pretty dashboards, then hit a peak and get blindsided. The CTO asked GitPlumbers to “make on-call humane before Black Friday.” We had four weeks and tight constraints:
- Compliance: PCI boundaries, restricted egress, no PII leaving VPC.
- Budget: rein in logging spend; avoid vendor lock-in that doubles next renewal.
- Stack: Go + Node services, Redis, Postgres, Envoy, GKE regional.
- People: small SRE crew, product teams owning services.
We didn’t promise magic. We promised signal over noise.
Where it hurt (and why it mattered)
The failure modes were textbook:
- No SLOs: Pages were tied to CPU and pod restarts, not user experience. On-call ignored half of them.
- Partial tracing: Datadog APM sampled 1/10th of traces, no exemplars, so metrics couldn’t point to traces.
- Dashboards by vibes: No RED/USE discipline. You couldn’t answer “Is it the network, the app, or the DB?” in one screen.
- Redis blind spot: Pool metrics weren’t consistent; saturation hid inside application logs.
- Rollbacks = tribal knowledge: Changes rolled out via `kubectl apply` and best intentions.
When this goes wrong, it goes really wrong: you burn your budget during peak, page people late, and play whack-a-mole across five dashboards while customers rage-tweet. Been there.
What we changed in 4 weeks (no silver bullets)
We went pragmatic, OSS-first, GitOps-driven. Keep the blast radius small, wire the basics right, and page on SLOs.
OpenTelemetry everywhere
- Added `opentelemetry-go` and `opentelemetry-js` SDKs across gateway, risk engine, and checkout.
- Standardized semantic attributes (`http.route`, `db.system`, `peer.service`).
- Enabled exemplars so metrics link to traces.

```go
// go.opentelemetry.io/otel v1.x
hist, err := meter.Float64Histogram("http.server.duration", metric.WithUnit("s"))
if err != nil {
	// handle instrument creation failure
}
start := time.Now()
// ... handle request
hist.Record(req.Context(), time.Since(start).Seconds()) // exemplar captured from active span
```
Collector + Tempo + Loki
- Deployed OTel Collector (contrib) with tail-sampling: keep slow/error traces, downsample boring ones.
```yaml
# otelcol-config.yaml
receivers:
  otlp:
    protocols: {grpc: {}, http: {}}
processors:
  tail_sampling:
    decision_wait: 3s
    policies:
      - name: slow
        type: latency
        latency: {threshold_ms: 200}
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
exporters:
  otlp/tempo: {endpoint: tempo-distributor:4317, tls: {insecure: true}}
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
```

Prometheus + Grafana with exemplars
- Prometheus scraped app and Envoy metrics; enabled exemplars and sane retention.
```yaml
# prometheus.yaml (flags)
--enable-feature=exemplar-storage
--storage.tsdb.retention.time=15d
--query.max-concurrency=40
```

```yaml
# scrape snippet for k8s pods
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs: [{role: pod}]
    relabel_configs:
      - action: keep
        source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __meta_kubernetes_pod_container_port_number
    metric_relabel_configs:
      - action: drop
        regex: '.*(password|token).*'
        source_labels: [__name__]
```

In Grafana, RED and USE dashboards showed service health at a glance, with trace exemplars clickable on latency histograms.
SLOs + multi-window burn alerts
- Two SLOs to start: `checkout p95 < 200ms`, `gateway availability 99.9%`.
- Used Sloth to generate recording/alerting rules and wired to PagerDuty.

```yaml
# slo-checkout.yaml (Sloth)
service: checkout
slo: "p95_latency_under_200ms"
objective: 99
budgeting: timeslices
timeWindow: 30d
labels: {team: payments}
alerting:
  pageAlert: {disabled: false}
indicator:
  ratio:
    errors:
      metric: "histogram_quantile(0.95, sum(rate(http_server_duration_bucket{service='checkout'}[5m])) by (le)) > 0.2"
    total:
      metric: "1"
# latency SLI modeled as bad if p95 > 200ms (0.2s)
```

For teams not ready for Sloth, we also ship the canonical PromQL burn alerts:

```yaml
# alerting-rules.yaml
groups:
  - name: slo-burn
    rules:
      - alert: SLOBurnFast
        expr: (rate(slo_errors[5m]) / rate(slo_total[5m])) > (14 * (1 - 0.99))
        for: 2m
        labels: {severity: page, slo: checkout}
        annotations: {summary: "Fast burn on checkout latency SLO"}
      - alert: SLOBurnSlow
        expr: (rate(slo_errors[1h]) / rate(slo_total[1h])) > (6 * (1 - 0.99))
        for: 30m
        labels: {severity: page, slo: checkout}
        annotations: {summary: "Slow burn on checkout latency SLO"}
```
Circuit breakers + safe rollbacks
- Istio `DestinationRule` tripped early under 5xx/latency flaps; better to shed load than cascade.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: {name: checkout}
spec:
  host: checkout
  trafficPolicy:
    connectionPool:
      http: {http1MaxPendingRequests: 1000, maxRequestsPerConnection: 100}
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

- All config via ArgoCD; rollbacks were pull requests, not heroics.

```bash
# one-liner when fire breaks out (79 = deploy ID from `argocd app history payments`)
argocd app rollback payments 79
```
The near-miss: caught eight minutes before customers felt it
Black Friday traffic ramped 12% higher than forecast. Five minutes into the spike, the fast SLO burn alert paged: 14x burn on the checkout latency SLO. Grafana showed p95 climbing from 140ms to 210ms; p99 flirting with 450ms. Error rate was flat. That’s exactly why we alert on burn, not errors.
Because we wired exemplars, a click on the latency histogram jumped straight into Tempo traces. The slow spans clustered in api-gateway -> checkout. Traces showed 30–40ms waits on Redis calls with a smoking gun: connection pool saturation.
The app logs (now structured and queryable in Loki) confirmed frequent pool timeout warnings. One look at the deployment diff—thanks to GitOps history—and we saw last night’s change: reduced Redis PoolSize to cut idle connections during quiet hours. Perfectly reasonable… until a promo hits.
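For reference, the Loki query for surfacing those warnings looks roughly like this (label and message strings are illustrative; yours depend on how the collector tags streams):

```logql
{namespace="payments", app="checkout"}
  | json
  | level="warn"
  |= "pool timeout"
```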
Two moves in parallel:
- Roll back the checkout deployment to the previous version via ArgoCD. Completed in 4 minutes.
- Ride the wave with circuit breakers protecting downstream services while pods rolled.
As pods came back, Redis contention dropped. p95 settled to 160ms, p99 to 300ms. The slow burn alert cleared 20 minutes later. Customers never felt it; Twitter stayed quiet.
“We used to page on CPU. This time we paged on risk. We bought ourselves eight minutes to act.”
For the curious, here’s the Redis pool configuration we restored in Go:
```go
// go-redis/v9 client config
rdb := redis.NewClient(&redis.Options{
	Addr:         "redis:6379",
	PoolSize:     400, // was cut to 150 in the bad deploy
	MinIdleConns: 50,
	PoolTimeout:  500 * time.Millisecond,
	ReadTimeout:  100 * time.Millisecond,
	WriteTimeout: 100 * time.Millisecond,
})
```

Results that matter (not vanity metrics)
- Major incident avoided during peak hour; projected impact (based on burn rate and conversion curve) was ~18 minutes of degraded checkout. Actual customer-visible errors: near zero.
- MTTR down 60% quarter-over-quarter for P1/P2 incidents (from 42m to 17m median).
- Alert noise down 45%; page volume matched SLO risk instead of infra trivia.
- Time-to-first-signal improved by ~8 minutes on latency regressions due to multi-window burn alerts.
- Logging costs cut 30% by sampling/retention changes and moving hot-path debugging to traces.
- Engineering confidence up: teams fixed two perf regressions proactively by watching SLO error budgets.
If you’ve been burned by “buy this tool and be observable,” you’ll recognize what worked: small, disciplined changes, tied to user outcomes.
What actually worked (and what I’d do differently)
What worked:
- SLO-first mindset: The SLOs forced hard conversations about what to page on and what to accept.
- Exemplars + tail-sampling: This combo made metrics the front door and traces the deep dive—without breaking the budget.
- GitOps for observability: One place to review, roll back, and audit alert rules and dashboards. No mystery JSON.
- Pre-built runbooks: The rollback command and circuit breaker policy weren’t invented during the incident.
- Cardinality discipline: Dropped junk labels early; avoided a Prometheus meltdown at T+30m.
What I’d change next time:
- SLO coverage earlier: We started with checkout and gateway; we should have added risk engine SLOs in sprint one.
- Synthetic checks per user journey: We had basic Pingdom; I prefer canaries hitting the full path (auth -> risk -> capture) with trace context.
- Error budget policies: Product didn’t yet tie feature flags to budgets. That’s next.
Your two-sprint playbook (steal this)
Sprint 1:
- Pick 1–2 critical journeys; define SLOs and budgets; agree on paging policy.
- Ship OpenTelemetry SDKs; enable exemplars; deploy Collector with tail-sampling.
- Stand up Prometheus, Grafana, Tempo, Loki; build one RED and one USE dashboard.
Sprint 2:
- Add multi-window, multi-burn alerts; wire to PagerDuty.
- Add circuit breakers and a one-line rollback runbook; test both in staging with chaos.
- Move observability config to Git, enforce PR reviews, and simulate an incident drill.
A month from now, your on-call will thank you—and your CFO won’t yell about the logging bill.
Key takeaways
- Tie alerts to SLOs and error budgets; kill the noise, page on real risk.
- Instrument with OpenTelemetry end-to-end and wire exemplars so metrics link to traces.
- Adopt the RED/USE dashboards for fast triage; pre-bake runbooks and rollbacks.
- Do multi-window, multi-burn alerts; they buy you minutes before users notice.
- Protect yourself with circuit breakers and safe rollbacks via GitOps.
- Keep cardinality in check and budgets honest—observability that’s unused is shelfware.
Implementation checklist
- Define 1–2 SLOs per critical journey with budgets and owners.
- Deploy OpenTelemetry SDKs and Collector; enable exemplars and tail-sampling.
- Install Prometheus/Grafana/Alertmanager (+ Tempo/Loki) with sane defaults.
- Build a RED dashboard per service and USE dashboard per cluster node.
- Add multi-window, multi-burn SLO alerts and hook to PagerDuty.
- Enable circuit breakers (Istio/Envoy) and write rollback runbooks.
- Manage everything via GitOps (ArgoCD) and test alerts with synthetic load.
Questions we hear from teams
- Why not just buy a full SaaS observability suite?
- You can. Many teams do. The problem we see is cost blowouts and little discipline: lots of data, little signal. We prefer OSS-first with clear SLOs and sampling, then add SaaS where it adds leverage (e.g., RUM, long-term logs for compliance).
- What’s the fastest way to get exemplars working?
- Use OpenTelemetry SDKs, propagate trace context end-to-end, and enable exemplar storage in Prometheus. Most histograms will automatically attach the current span ID as an exemplar you can click through in Grafana.
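The last mile is telling Grafana where exemplar trace IDs resolve. A provisioning sketch, assuming a Tempo datasource (names and UIDs are placeholders):

```yaml
# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label carrying the trace ID
          datasourceUid: tempo  # UID of your Tempo datasource
```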
- How do you keep Prometheus from melting under cardinality?
- Budget labels. Drop high-cardinality dimensions (user IDs, SQL text). Enforce metric naming reviews in PRs, and set federation/recording rules for SLO queries so live alerts don’t scan raw series.
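One concrete guard: precompute the SLI with recording rules so burn alerts evaluate a handful of series instead of raw histograms. A sketch using the latency-bucket pattern (metric and rule names are illustrative):

```yaml
groups:
  - name: slo-records
    interval: 30s
    rules:
      # requests slower than 200ms = all requests minus the le="0.2" bucket
      - record: checkout:slo_errors:rate5m
        expr: >
          sum(rate(http_server_duration_count{service="checkout"}[5m]))
          - sum(rate(http_server_duration_bucket{service="checkout", le="0.2"}[5m]))
      - record: checkout:slo_total:rate5m
        expr: sum(rate(http_server_duration_count{service="checkout"}[5m]))
```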
- Is multi-window burn alerting overkill for small teams?
- It’s a few lines of PromQL for a huge payoff. Fast window catches sharp regressions; slow window avoids flapping. It’s one of the highest ROI alerting patterns we roll out.
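The full pattern pairs a long and a short window per severity, so the alert only fires while the burn is actually ongoing and resets quickly once it stops. A sketch, with illustrative series names:

```promql
# page: 14.4x burn over 1h, confirmed by the last 5m
(
  sum(rate(slo_errors[1h])) / sum(rate(slo_total[1h])) > (14.4 * (1 - 0.99))
and
  sum(rate(slo_errors[5m])) / sum(rate(slo_total[5m])) > (14.4 * (1 - 0.99))
)
```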
- What if we’re on EKS or ECS, not GKE?
- Same playbook. The k8s primitives, OTel, Prometheus, Grafana, and burn alerts are portable. Swap cloud-specific plumbing; keep the SLOs and discipline.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
