The Week SLOs Stopped the Pager Storm: How One Team Cut MTTR by 62%

A fintech platform lived on dashboards and died by alerts—until we put SLOs at the center of incident response. Here’s the playbook, the scars, and the measurable win.

"We stopped paging on pod CPU and started paging when users were hurting. MTTR fell off a cliff."

The pager storm that changed everything

Two summers ago, a B2C fintech with a high-traffic payments API (Node.js + Go services on EKS, Istio mesh, Kafka, Postgres) called us mid-incident. Refunds were intermittently failing, authorization latencies were spiking, and on-call was drowning in alerts from Prometheus and Datadog. Dashboards were red; users were angry; leadership wanted ETAs.

We looked at the alert stream. Everything paged: CPU, memory, p99 latency across every route, Kafka consumer lag, even an outdated cert reminder. Classic “monitor-all-the-things” setup. I’ve seen this movie: responders chasing graphs, changing three things at once, and accidentally making the blast radius bigger.

What flipped the trajectory wasn’t more dashboards or tighter thresholds. It was getting ruthless about SLOs—defining what reliability meant to users, wiring alerts to error budgets, and letting everything else take a back seat.

Why SLOs—not more dashboards—were the unlock

Without SLOs, you’re instrumenting components. With SLOs, you’re protecting user journeys. Very different outcomes.

  • The team cared about POST /authorize and POST /refund succeeding fast enough. Users didn’t care about pod CPU.
  • We framed two SLIs: availability (good requests / total requests) and latency (share of requests served in under 300ms) for those routes.
  • We set 28-day SLOs: 99.9% availability on authorize, 99% under 300ms latency. That implied error budgets we could actually spend.

The incident response pivot: only page on budget burn that threatens the SLO. Ticket or log the rest. It sounds simple; it’s hard to do without discipline and the right plumbing.
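The error-budget arithmetic behind those targets is simple enough to sanity-check in a few lines. The traffic figure below is hypothetical; substitute your own request volume:

```python
# Error-budget math for a 99.9% availability SLO over a 28-day window.
slo = 0.999
window_days = 28
requests_per_day = 5_000_000  # hypothetical traffic figure

budget_fraction = 1 - slo  # 0.1% of requests may fail
budget_requests = budget_fraction * requests_per_day * window_days
budget_minutes = budget_fraction * window_days * 24 * 60  # full-outage equivalent

print(f"~{budget_requests:,.0f} failed requests allowed over {window_days} days")
print(f"~{budget_minutes:.1f} minutes of total-downtime equivalent")
```

At that volume you can "spend" roughly 140k failed requests, or about 40 minutes of full outage, per window. That number is what makes "page only when the budget is threatened" a concrete policy instead of a slogan.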

The constraints we walked into

  • ~60 microservices across two EKS regions, GitOps via ArgoCD, infrastructure via Terraform (with some drift… of course).
  • Prometheus + Alertmanager for cluster metrics, Datadog for app logs/APM, Grafana everywhere.
  • Istio 1.18 with VirtualService routing and mTLS; traffic split for canaries.
  • No shared definition of “reliability.” Alerts were copy-paste PromQL with custom labels per team—cardinality explosion and alert fatigue.
  • SLA commitments existed in sales decks, but engineers had no SLOs to aim at.

I’ve watched orgs try to fix this by adding a dedicated “monitoring team” or a new vendor. That just centralizes the noise. We went the other direction: smaller, sharper signals tied to user experience.

What we actually changed (and how)

  1. Map the journeys and define SLIs

    • We whiteboarded the critical paths: AuthorizePayment and Refund.
    • For each: defined “good” events (2xx/3xx, excluding client 4xx) and a latency threshold users actually feel.
  2. Pick SLOs and budgets

    • Authorize availability: 99.9% over 28 days. Budget = 0.1% of requests can fail.
    • Authorize latency: 99% under 300ms over 28 days.
  3. Instrument and normalize metrics

    • We standardized HTTP metrics via Istio/envoy telemetry and app-level http_requests_total with consistent labels.
    • We fixed a label explosion (user_id in labels—yep) that was killing Prometheus.
  4. Generate SLO rules with Sloth

    • We used Sloth to avoid hand-rolled PromQL and keep SLOs declarative in Git.
# slos/payments-authorize.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-authorize-availability
spec:
  service: payments-api
  labels:
    team: checkout
  slos:
    - name: authorize-availability
      objective: 99.9
      description: Authorize requests (POST /authorize) return non-5xx
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{job="payments-api",route="POST /authorize",status=~"5.."}[5m]))
          totalQuery: |
            sum(rate(http_requests_total{job="payments-api",route="POST /authorize"}[5m]))
      timeWindow: 28d
      alerting:
        pageAlert:
          alert: PageOnHighErrorBudgetBurn
          labels: { severity: page }
        ticketAlert:
          alert: TicketOnLowErrorBudgetBurn
          labels: { severity: ticket }
  5. Multi-window burn-rate alerts (the gold standard)
# Generated/templated by Sloth, shown here for clarity.
# Each alert window needs its own recording rules: the alerts below use
# the 5m/1h pair (fast burn) and the 30m/6h pair (slow burn).
- record: slo:authorize_availability_error_ratio:5m
  expr: |
    sum(rate(http_requests_total{job="payments-api",route="POST /authorize",status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{job="payments-api",route="POST /authorize"}[5m]))

# Same ratio over the other alert windows.
- record: slo:authorize_availability_error_ratio:30m
  expr: sum(rate(http_requests_total{job="payments-api",route="POST /authorize",status=~"5.."}[30m])) / sum(rate(http_requests_total{job="payments-api",route="POST /authorize"}[30m]))
- record: slo:authorize_availability_error_ratio:1h
  expr: sum(rate(http_requests_total{job="payments-api",route="POST /authorize",status=~"5.."}[1h])) / sum(rate(http_requests_total{job="payments-api",route="POST /authorize"}[1h]))
- record: slo:authorize_availability_error_ratio:6h
  expr: sum(rate(http_requests_total{job="payments-api",route="POST /authorize",status=~"5.."}[6h])) / sum(rate(http_requests_total{job="payments-api",route="POST /authorize"}[6h]))

# Burn rate = observed error ratio / allowed error ratio (1 - SLO).
- record: slo:authorize_availability_burn_rate:5m
  expr: slo:authorize_availability_error_ratio:5m / (1 - 0.999)
- record: slo:authorize_availability_burn_rate:30m
  expr: slo:authorize_availability_error_ratio:30m / (1 - 0.999)
- record: slo:authorize_availability_burn_rate:1h
  expr: slo:authorize_availability_error_ratio:1h / (1 - 0.999)
- record: slo:authorize_availability_burn_rate:6h
  expr: slo:authorize_availability_error_ratio:6h / (1 - 0.999)

- alert: PageOnHighErrorBudgetBurn
  expr: (slo:authorize_availability_burn_rate:5m > 14.4 and slo:authorize_availability_burn_rate:1h > 14.4)
  for: 5m
  labels:
    severity: page
    service: payments-api
  annotations:
    summary: High burn on authorize availability
    runbook: https://runbooks.internal/authorize

- alert: TicketOnLowErrorBudgetBurn
  expr: (slo:authorize_availability_burn_rate:30m > 3 and slo:authorize_availability_burn_rate:6h > 3)
  for: 30m
  labels:
    severity: ticket
    service: payments-api
  annotations:
    summary: Slow burn on authorize availability
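The thresholds 14.4 and 3 aren't magic. A burn rate of 1 means you'd spend the budget exactly over the SLO window; a threshold is chosen so that sustaining it for the alert window consumes a fixed slice of the budget. A quick sketch of that math (14.4 is the SRE Workbook's 30-day figure; a 28-day window would give 13.44):

```python
# Burn rate that, sustained for `window_hours`, spends `budget_fraction_spent`
# of the entire error budget for an SLO measured over `slo_window_days`.
def burn_rate_threshold(budget_fraction_spent: float,
                        window_hours: float,
                        slo_window_days: float) -> float:
    return budget_fraction_spent * slo_window_days * 24 / window_hours

fast_page = burn_rate_threshold(0.02, 1, 30)     # spend 2% of budget in 1h -> ~14.4
slow_ticket = burn_rate_threshold(0.025, 6, 30)  # spend 2.5% of budget in 6h -> ~3
```

Pairing a short and a long window at the same threshold is what keeps this quiet: the long window proves the burn is sustained, the short window clears the alert quickly once the bleeding stops.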
  6. Route only budget-burn pages to on-call
# Alertmanager (excerpt)
route:
  receiver: default
  group_by: ["alertname", "service"]
  routes:
    - matchers:
        - alertname = "PageOnHighErrorBudgetBurn"
      receiver: pagerduty
      continue: false
    - matchers:
        - alertname = "TicketOnLowErrorBudgetBurn"
      receiver: jira

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PD_ROUTING_KEY}
  - name: jira
    webhook_configs:
      - url: https://jira.internal/webhook
  7. Gate deploys when burning too fast
    • We added a simple pre-sync check in CI for ArgoCD to block if the 1h burn rate was above threshold. Deploy later, not during a fire.
#!/usr/bin/env bash
set -euo pipefail
PROM_URL=${PROM_URL:-http://prometheus.monitoring:9090}
QUERY='slo:authorize_availability_burn_rate:1h'
MAX=2.0
# -f makes curl fail on HTTP errors; `// empty` yields nothing if Prometheus has no data.
BR=$(curl -fsS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // empty')
if [[ -z "$BR" ]]; then
  echo "No burn-rate data for $QUERY — failing closed" >&2
  exit 1
fi
awk -v br="$BR" -v max="$MAX" 'BEGIN{ if (br > max) { print "Error budget burn too high (" br ") — blocking deploy"; exit 1 } print "Burn OK (" br ")" }'
  8. Runbooks and rehearsal
    • One-page runbooks per SLO with mitigations (flip Istio canary to 0%, enable circuit breaker, rollback via ArgoCD) and owners.
    • We ran a 90-minute game day to practice an SLO breach response.

What changed in incident response (measurable and fast)

Within two weeks:

  • MTTR dropped from 94 minutes to 36 minutes (62% reduction). We stopped thrashing and started executing runbooks.
  • On-call pages fell 47% overall. Business-hours spam fell 72% because only burn pages were allowed to interrupt.
  • P1/P2 incidents dropped from ~11/month to 4/month by month three, largely due to deploy gates during active burn.
  • TTA (time to acknowledge) improved to <3 minutes because there were fewer, clearer pages.
  • 2 of the next 3 production regressions were caught at canary by the latency SLO, and ArgoCD auto-rolled back within 6 minutes.
  • Support tickets about “payment slow” dropped 30% (we correlated with SLO latency breaches) and NPS recovered by ~6 points.

The cultural change was bigger: engineers finally had a language to say “we’re out of budget—no new risky deploys today.” Product used the budget charts in planning. Nobody argued with the math.

Hard lessons (so you don’t hit the same potholes)

  • Don’t set SLOs on pods or services—set them on journeys. We killed three service-centric SLOs that made no user sense.
  • Keep labels clean. A single user_id label multiplied Prometheus series and masked actual burn. Drop high-cardinality junk.
  • Exclude client errors (4xx) from availability SLIs unless you explicitly want to track abuse/misuse as a reliability risk.
  • Right-size targets. Their initial 99.99% availability target paged constantly; 99.9% was honest for their stack and traffic.
  • Align SLAs vs SLOs with the business. We used SLOs to drive engineering, and SLAs for contracts. Different audiences.
  • Wire mitigations. Istio circuit breakers and outlier detection saved a cascade once. If you alert on burn but can’t throttle, you’ll still page helplessly.
# Istio DestinationRule excerpt for circuit breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-api
spec:
  host: payments-api.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100

Try this next sprint (no committees required)

  1. Pick one journey. Define availability and latency SLIs (good/total and under-threshold duration).
  2. Set a 28-day SLO and compute the budget. Be conservative: 99.9% is plenty for most consumer flows.
  3. Install Sloth and commit one SLO YAML to Git. Deploy via ArgoCD to generate rules.
helm repo add sloth https://slok.github.io/sloth
helm upgrade --install sloth sloth/sloth -n slo --create-namespace
kubectl apply -f slos/payments-authorize.yaml
  4. Create two alert paths in Alertmanager: page for high burn, ticket for low burn. Update your on-call policy accordingly.
  5. Add a simple CI check to block deploys when the 1h burn rate is above threshold.
  6. Write a one-page runbook and schedule a 60-minute game day.

If you get stuck on PromQL, noisy labels, or routing policies, bring in a grown-up for a day. Two focused working sessions beat a three-month “observability initiative” every time.

Where GitPlumbers fits (when to call us)

We’ve done SLO rescues at fintechs, marketplaces, and B2B SaaS—from greenfield to “Grafana museum” environments. If you need:

  • SLO/SLI design for real user journeys
  • Sloth/Prometheus rules that won’t melt your TSDB
  • Alert paths that page only on user-impacting burn
  • ArgoCD/Terraform gates that stop risky deploys
  • Game-day rehearsal and runbook hardening

Ping us. We’ll leave you with SLOs in Git, alerts that matter, and an on-call rotation that can finally sleep. No silver bullets—just plumbing that works under pressure.


Key takeaways

  • SLOs change on-call behavior by making pages map to user impact, not pod health.
  • Multi-window burn-rate alerts cut noise while catching real degradation early.
  • Tie deploy gates to error budgets so you stop making production worse during burn.
  • Instrument SLIs on user journeys, not services. Dashboards follow the SLOs—not the other way around.
  • Start small: one critical path, one availability SLO, one latency SLO, 28-day window. Iterate.

Implementation checklist

  • Pick 1–2 user journeys and define SLIs (good/total, latency under threshold).
  • Set a 28-day SLO and calculate the error budget (1 - SLO).
  • Deploy Sloth or SLO tooling to generate Prometheus rules automatically.
  • Create multi-window burn alerts (5m+1h page; 30m+6h ticket).
  • Route paging only from error-budget burn to reduce alert fatigue.
  • Add ArgoCD/Terraform gates to block deploys when burn exceeds threshold.
  • Write a one-page runbook per SLO; rehearse an incident in a game day.

Questions we hear from teams

How is an SLO different from an SLA?
SLOs are internal engineering targets tied to user experience (e.g., 99.9% auth availability over 28 days). SLAs are contractual promises with penalties. Use SLOs to guide engineering decisions; use SLAs to align with customers and Legal.
Do I need a new vendor to do SLOs?
No. Prometheus + Alertmanager + Grafana + Sloth gets you far. If you’re already on Datadog, New Relic, or Nobl9, you can implement SLOs there too. The hard part is defining good SLIs and wiring alert policy to burn rate.
How long does this take to see results?
We typically see noise reduction within a week of routing pages only from burn alerts, and MTTR improvements in 2–4 weeks as runbooks harden. Full cultural adoption (product using budgets, deploy gates respected) takes 1–2 quarters.
What if we can’t hit 99.9% right now?
Then don’t. Pick an achievable target that reflects reality, measure, and improve. An honest 99.0% you hit is better than a 99.99% fantasy that pages everyone into burnout.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an SLO-first incident response expert
Download the SLO Playbook (GitOps-ready)
