The Night an SLO Burn Alert Saved Black Friday: An Observability Rehab That Paid for Itself

A fintech on GKE was flying blind into peak traffic. Four weeks of pragmatic observability work turned a looming incident into a non-event—and cut MTTR by 60%.

We didn’t page on CPU. We paged on customer risk—and bought ourselves the eight minutes that changed the outcome.

The setup you’ve seen before

They’re a mid-market fintech on GKE, processing card authorizations with a 200ms p95 latency SLO and strict uptime targets. Traffic is spiky around payroll and promos. They had Datadog for infra, Cloud Logging for app logs, and an old Pingdom check. No proper SLOs, lots of noisy alerts. Classic.

I’ve seen this movie: teams lean on expensive logs and pretty dashboards, then hit a peak and get blindsided. The CTO asked GitPlumbers to “make on-call humane before Black Friday.” We had four weeks and tight constraints:

  • Compliance: PCI boundaries, restricted egress, no PII leaving VPC.
  • Budget: rein in logging spend; avoid vendor lock-in that doubles next renewal.
  • Stack: Go + Node services, Redis, Postgres, Envoy, GKE regional.
  • People: small SRE crew, product teams owning services.

We didn’t promise magic. We promised signal over noise.

Where it hurt (and why it mattered)

The failure modes were textbook:

  • No SLOs: Pages were tied to CPU and pod restarts, not user experience. On-call ignored half of them.
  • Partial tracing: Datadog APM sampled 1/10th of traces, no exemplars, so metrics couldn’t point to traces.
  • Dashboards by vibes: No RED/USE discipline. You couldn’t answer “Is it the network, the app, or the DB?” in one screen.
  • Redis blind spot: Pool metrics weren’t consistent; saturation hid inside application logs.
  • Rollbacks = tribal knowledge: Changes rolled out via kubectl apply and best intentions.

When this goes wrong, it goes really wrong: you burn your budget during peak, page people late, and play whack-a-mole across five dashboards while customers rage-tweet. Been there.

What we changed in 4 weeks (no silver bullets)

We went pragmatic, OSS-first, GitOps-driven. Keep the blast radius small, wire the basics right, and page on SLOs.

  1. OpenTelemetry everywhere

    • Added opentelemetry-go and opentelemetry-js SDKs across gateway, risk engine, and checkout.
    • Standardized semantic attributes (http.route, db.system, peer.service).
    • Enabled exemplars so metrics link to traces.
    // go.opentelemetry.io/otel v1.x
    hist, err := meter.Float64Histogram("http.server.duration", metric.WithUnit("s"))
    if err != nil {
        log.Fatal(err)
    }
    start := time.Now()
    // ... handle request
    hist.Record(req.Context(), time.Since(start).Seconds()) // exemplar captured from the active span
  2. Collector + Tempo + Loki

    • Deployed OTel Collector (contrib) with tail-sampling: keep slow/error traces, downsample boring ones.
    # otelcol-config.yaml
    receivers:
      otlp:
        protocols: {grpc: {}, http: {}}
    processors:
      tail_sampling:
        decision_wait: 3s
        policies:
          - name: slow-requests
            type: latency
            latency: {threshold_ms: 200}
          - name: errors
            type: status_code
            status_code: {status_codes: [ERROR]}
    exporters:
      otlp/tempo: { endpoint: tempo-distributor:4317, tls: {insecure: true} }
      loki:
        endpoint: http://loki:3100/loki/api/v1/push
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          exporters: [loki]
  3. Prometheus + Grafana with exemplars

    • Prometheus scraped app and Envoy metrics; enabled exemplars and sane retention.
    # prometheus.yaml (flags)
    --enable-feature=exemplar-storage
    --storage.tsdb.retention.time=15d
    --query.max-concurrency=40
    # scrape snippet for k8s pods
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs: [{role: pod}]
        relabel_configs:
          - action: keep
            source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            regex: true
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
        metric_relabel_configs:
          - action: drop
            regex: '.*(password|token).*'
            source_labels: [__name__]

    In Grafana, RED and USE dashboards showed service health at a glance, with trace exemplars clickable on latency histograms.
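    The latency panel of a RED dashboard is a single PromQL query; with exemplar storage on, Grafana overlays clickable exemplar dots on it. A representative query (the metric name assumes the OTel histogram above, exported through Prometheus naming):

    ```promql
    # p95 checkout latency; toggle "Exemplars" on the panel to get trace links
    histogram_quantile(0.95,
      sum by (le) (rate(http_server_duration_bucket{service="checkout"}[5m]))
    )
    ```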

  4. SLOs + multi-window burn alerts

    • Two SLOs to start: checkout p95 < 200ms, gateway availability 99.9%.
    • Used Sloth to generate recording/alerting rules and wired to PagerDuty.
    # slo-checkout.yaml (Sloth, prometheus/v1 spec)
    version: "prometheus/v1"
    service: checkout
    labels: {team: payments}
    slos:
      - name: p95-latency-under-200ms
        objective: 99
        description: 99% of checkout requests complete in under 200ms (30d window).
        sli:
          events:
            # bad events: requests slower than the 0.2s bucket boundary
            error_query: sum(rate(http_server_duration_count{service="checkout"}[{{.window}}])) - sum(rate(http_server_duration_bucket{service="checkout", le="0.2"}[{{.window}}]))
            total_query: sum(rate(http_server_duration_count{service="checkout"}[{{.window}}]))
        alerting:
          name: CheckoutLatencySLO
          page_alert:
            disabled: false

    For teams not ready for Sloth, we also ship the canonical PromQL burn alerts:

    # alerting-rules.yaml
    groups:
      - name: slo-burn
        rules:
          - alert: SLOBurnFast
            # 14.4x burn over paired long/short windows (SRE Workbook pattern)
            expr: |
              (rate(slo_errors[1h]) / rate(slo_total[1h]) > (14.4 * (1 - 0.99)))
              and
              (rate(slo_errors[5m]) / rate(slo_total[5m]) > (14.4 * (1 - 0.99)))
            for: 2m
            labels: {severity: page, slo: checkout}
            annotations: {summary: "Fast burn on checkout latency SLO"}
          - alert: SLOBurnSlow
            # 6x burn over 6h/30m windows; pages before a slow leak drains the budget
            expr: |
              (rate(slo_errors[6h]) / rate(slo_total[6h]) > (6 * (1 - 0.99)))
              and
              (rate(slo_errors[30m]) / rate(slo_total[30m]) > (6 * (1 - 0.99)))
            labels: {severity: page, slo: checkout}
            annotations: {summary: "Slow burn on checkout latency SLO"}
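    If the multipliers look arbitrary, they are just budget arithmetic: at a burn rate of B, the whole error budget for a 30-day window is gone in 30d/B. A quick sanity check in plain Go (no dependencies):

    ```go
    package main

    import "fmt"

    func main() {
    	const windowHours = 30.0 * 24 // 30-day SLO window = 720h
    	for _, burn := range []float64{1, 6, 14.4} {
    		// at burn rate B, the error budget is exhausted in window/B hours
    		fmt.Printf("burn %5.1fx -> budget exhausted in %.1fh\n", burn, windowHours/burn)
    	}
    }
    ```

    A 14.4x burn drains a month of budget in about two days, which is why it pages immediately; 1x is "exactly on budget" and should never page.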
  5. Circuit breakers + safe rollbacks

    • Istio DestinationRule tripped early under 5xx/latency flaps; better to shed load than cascade.
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata: {name: checkout}
    spec:
      host: checkout
      trafficPolicy:
        connectionPool:
          http: {http1MaxPendingRequests: 1000, maxRequestsPerConnection: 100}
        outlierDetection:
          consecutive5xxErrors: 5
          interval: 5s
          baseEjectionTime: 30s
          maxEjectionPercent: 50
    • All config via ArgoCD; rollbacks were pull requests, not heroics.
    # one-liner when fire breaks out
    argocd app rollback payments 79

The near-miss: caught eight minutes before customers felt it

Black Friday traffic ramped 12% higher than forecast. Five minutes into the spike, the fast SLO burn alert paged: 14x burn on the checkout latency SLO. Grafana showed p95 climbing from 140ms to 210ms; p99 flirting with 450ms. Error rate was flat. That’s exactly why we alert on burn, not errors.

Because we wired exemplars, a click on the latency histogram jumped straight into Tempo traces. The slow spans clustered in api-gateway -> checkout. Traces showed 30–40ms waits on Redis calls with a smoking gun: connection pool saturation.

The app logs (now structured and queryable in Loki) confirmed frequent pool timeout warnings. One look at the deployment diff—thanks to GitOps history—and we saw last night’s change: reduced Redis PoolSize to cut idle connections during quiet hours. Perfectly reasonable… until a promo hits.

Two moves in parallel:

  1. Rollback the checkout pod to the previous version via ArgoCD. Completed in 4 minutes.
  2. Ride the wave with circuit breakers protecting downstream services while pods rolled.

As pods came back, Redis contention dropped. p95 settled to 160ms, p99 to 300ms. The slow burn alert cleared 20 minutes later. Customers never felt it; Twitter stayed quiet.

“We used to page on CPU. This time we paged on risk. We bought ourselves eight minutes to act.”

For the curious, here’s the Redis pool configuration we restored in Go:

// go-redis/v9 client config
rdb := redis.NewClient(&redis.Options{
  Addr: "redis:6379",
  PoolSize: 400,        // was cut to 150 in the bad deploy
  MinIdleConns: 50,
  PoolTimeout: 500 * time.Millisecond,
  ReadTimeout:  100 * time.Millisecond,
  WriteTimeout: 100 * time.Millisecond,
})
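To keep this class of regression visible, we now graph pool saturation, not just timeouts. The arithmetic is trivial; here is a stdlib-only sketch, where the struct mirrors the fields go-redis exposes via rdb.PoolStats() and the numbers are hypothetical:

```go
package main

import "fmt"

// mirrors the fields exposed by go-redis's rdb.PoolStats()
type poolStats struct {
	TotalConns uint32 // connections currently open
	IdleConns  uint32 // open but not in use
	Timeouts   uint32 // waits that hit PoolTimeout
}

// saturation: fraction of the configured pool currently in use
func saturation(s poolStats, poolSize uint32) float64 {
	return float64(s.TotalConns-s.IdleConns) / float64(poolSize)
}

func main() {
	// roughly what the bad deploy looked like: PoolSize 150, nearly all busy
	s := poolStats{TotalConns: 148, IdleConns: 3, Timeouts: 27}
	fmt.Printf("pool saturation: %.2f\n", saturation(s, 150))
}
```

Exporting that ratio as a gauge means the next PoolSize cut shows up on a dashboard before it shows up as p95 latency.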

Results that matter (not vanity metrics)

  • Major incident avoided during peak hour; projected impact (based on burn rate and conversion curve) was ~18 minutes of degraded checkout. Actual customer-visible errors: near zero.
  • MTTR down 60% quarter-over-quarter for P1/P2 incidents (from 42m to 17m median).
  • Alert noise down 45%; page volume matched SLO risk instead of infra trivia.
  • Time-to-first-signal improved by ~8 minutes on latency regressions due to multi-window burn alerts.
  • Logging costs cut 30% by sampling/retention changes and moving hot-path debugging to traces.
  • Engineering confidence up: teams fixed two perf regressions proactively by watching SLO error budgets.

If you’ve been burned by “buy this tool and be observable,” you’ll recognize what worked: small, disciplined changes, tied to user outcomes.

What actually worked (and what I’d do differently)

What worked:

  • SLO-first mindset: The SLOs forced hard conversations about what to page on and what to accept.
  • Exemplars + tail-sampling: This combo made metrics the front door and traces the deep dive—without breaking the budget.
  • GitOps for observability: One place to review, roll back, and audit alert rules and dashboards. No mystery JSON.
  • Pre-built runbooks: The rollback command and circuit breaker policy weren’t invented during the incident.
  • Cardinality discipline: Dropped junk labels early; avoided a Prometheus meltdown at T+30m.

What I’d change next time:

  • SLO coverage earlier: We started with checkout and gateway; we should have added risk engine SLOs in sprint one.
  • Synthetic checks per user journey: We had basic Pingdom; I prefer canaries hitting the full path (auth -> risk -> capture) with trace context.
  • Error budget policies: Product didn’t yet tie feature flags to budgets. That’s next.

Your two-sprint playbook (steal this)

Sprint 1:

  1. Pick 1–2 critical journeys; define SLOs and budgets; agree on paging policy.
  2. Ship OpenTelemetry SDKs; enable exemplars; deploy Collector with tail-sampling.
  3. Stand up Prometheus, Grafana, Tempo, Loki; build one RED and one USE dashboard.

Sprint 2:

  1. Add multi-window, multi-burn alerts; wire to PagerDuty.
  2. Add circuit breakers and a one-line rollback runbook; test both in staging with chaos.
  3. Move observability config to Git, enforce PR reviews, and simulate an incident drill.

A month from now, your on-call will thank you—and your CFO won’t yell about the logging bill.


Key takeaways

  • Tie alerts to SLOs and error budgets; kill the noise, page on real risk.
  • Instrument with OpenTelemetry end-to-end and wire exemplars so metrics link to traces.
  • Adopt the RED/USE dashboards for fast triage; pre-bake runbooks and rollbacks.
  • Do multi-window, multi-burn alerts; they buy you minutes before users notice.
  • Protect yourself with circuit breakers and safe rollbacks via GitOps.
  • Keep cardinality in check and budgets honest—observability that’s unused is shelfware.

Implementation checklist

  • Define 1–2 SLOs per critical journey with budgets and owners.
  • Deploy OpenTelemetry SDKs and Collector; enable exemplars and tail-sampling.
  • Install Prometheus/Grafana/Alertmanager (+ Tempo/Loki) with sane defaults.
  • Build a RED dashboard per service and a USE dashboard per node/cluster.
  • Add multi-window, multi-burn SLO alerts and hook to PagerDuty.
  • Enable circuit breakers (Istio/Envoy) and write rollback runbooks.
  • Manage everything via GitOps (ArgoCD) and test alerts with synthetic load.

Questions we hear from teams

Why not just buy a full SaaS observability suite?
You can. Many teams do. The problem we see is cost blowouts and little discipline: lots of data, little signal. We prefer OSS-first with clear SLOs and sampling, then add SaaS where it adds leverage (e.g., RUM, long-term logs for compliance).
What’s the fastest way to get exemplars working?
Use OpenTelemetry SDKs, propagate trace context end-to-end, and enable exemplar storage in Prometheus. Most histograms will automatically attach the current span ID as an exemplar you can click through in Grafana.
How do you keep Prometheus from melting under cardinality?
Budget labels. Drop high-cardinality dimensions (user IDs, SQL text). Enforce metric naming reviews in PRs, and set federation/recording rules for SLO queries so live alerts don’t scan raw series.
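The recording rules are a few lines each; the rule names below are illustrative, and the queries assume the OTel histogram from earlier:

```yaml
groups:
  - name: slo-recordings
    rules:
      # pre-aggregate so burn alerts query two cheap series, not raw histograms
      - record: slo:checkout_errors:rate5m
        expr: sum(rate(http_server_duration_count{service="checkout"}[5m])) - sum(rate(http_server_duration_bucket{service="checkout", le="0.2"}[5m]))
      - record: slo:checkout_total:rate5m
        expr: sum(rate(http_server_duration_count{service="checkout"}[5m]))
```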
Is multi-window burn alerting overkill for small teams?
It’s a few lines of PromQL for a huge payoff. Fast window catches sharp regressions; slow window avoids flapping. It’s one of the highest ROI alerting patterns we roll out.
What if we’re on EKS or ECS, not GKE?
Same playbook. The k8s primitives, OTel, Prometheus, Grafana, and burn alerts are portable. Swap cloud-specific plumbing; keep the SLOs and discipline.
