From Pager Hell to Predictable On-Call: How SLOs Cut Pages 65% in 90 Days
We stopped chasing CPU graphs and started defending error budgets. The result: fewer pages, faster MTTR, calmer engineers, safer releases.
The incident treadmill we walked into
I walked into a Monday standup where the on-call had slept on the couch in the office. Again. The company (let’s call them MarketForge) runs a high-traffic marketplace on AWS EKS with Istio, Prometheus, and Alertmanager. They'd bolted on Datadog for host metrics, Sentry for errors, and PagerDuty for paging.
The problem: incidents were defined by whatever metric tripped first. CPU spikes? Page. Pod restarts? Page. A minor 5xx blip on a non-critical endpoint at 3 a.m.? Page. They were doing drive-by dashboarding and vibe debugging. AI-assisted PRs were shipping faster than the monitoring could keep up, and the team was drowning in noise.
- Average pages per week: 38 (with bursts >60 during releases)
- MTTR: ~6 hours
- MTTD: ~20 minutes (read: humans noticed before dashboards)
- Change failure rate: ~32%
- Compliance constraint: PCI scope on checkout; zero tolerance for silent failures there
I’ve seen this movie: without SLOs, you’re optimizing for graphs, not users. We flipped the script.
Why SLOs changed the game
Dashboards are for humans; SLOs are contracts with your users. When we centered incident response on SLOs, three things happened immediately:
- We stopped paging on infra noise and started paging on user pain.
- We could quantify risk with error budgets instead of arguing about severity.
- We created a common language across engineering, product, and compliance.
We anchored on two simple rules:
- Define SLIs around critical user journeys: `login`, `search`, `checkout`.
- Alert only when SLO error budgets burned at meaningful rates (fast and slow).
We didn’t fix incidents; we fixed what we alert on.
What we implemented (the boring, critical details)
We kept it boring on purpose. Fancy observability with no governance is just expensive noise.
- SLIs for critical paths
- Availability: ratio of `2xx|3xx` responses to all responses per service
- Latency: `p95` under a threshold (e.g., `300ms` for search, `500ms` for checkout)
- Source: `Istio` metrics exported to `Prometheus`
PromQL examples:
# Checkout availability SLI, expressed as an error ratio (5xx + throttles count as errors)
sum(rate(istio_requests_total{destination_workload="checkout", response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))

# Checkout latency SLI (p95 under 500ms; the histogram is in milliseconds)
histogram_quantile(0.95,
  sum by (le) (rate(istio_request_duration_milliseconds_bucket{destination_workload="checkout"}[5m]))
) < 500

- SLOs with error budgets
- Checkout availability SLO: 99.9% over 28 days
- Search latency SLO: p95 < 300ms for 99% of requests over 28 days
- Login availability SLO: 99.95% over 28 days
We codified these with Sloth so SLOs, alerts, and dashboards are generated from one YAML source of truth and deployed with ArgoCD.
# sloth.yaml
version: "prometheus/v1"
service: checkout
slos:
  - name: http-availability
    objective: 99.9
    description: "HTTP 2xx/3xx ratio over 28d"
    labels:
      team: core-commerce
      tier: critical
    sli:
      events:
        error_query: |
          sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[{{.window}}]))
        total_query: |
          sum(rate(istio_requests_total{destination_workload="checkout"}[{{.window}}]))
    alerting:
      name: checkout-slo
      labels:
        severity: page
      annotations:
        runbook: https://runbooks.marketforge.internal/checkout/slo
      page_alert: { disable: false }
      ticket_alert: { disable: false }

- Multi-window, multi-burn rate alerts
We used the Google SRE pattern to catch both fast burns (explosions) and slow burns (leaks):
- Fast burn: 2h window, burn rate > 14x (page now)
- Slow burn: 6h and 24h windows, burn rate > 6x and > 1x (ticket or page depending on tier)
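One nuance the simplified rules below gloss over: "multi-window" means a long and a short window both have to be burning before anyone gets paged, so a spike that has already recovered doesn't wake someone up. A sketch of that fast-burn condition in raw PromQL, with illustrative 1h/5m windows and the same 14x threshold (tune the windows to your own rules):

# Fast-burn page condition: both the long and the short window exceed 14x the allowed error rate.
(
  sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[1h]))
  /
  sum(rate(istio_requests_total{destination_workload="checkout"}[1h]))
) > (14 * (1 - 0.999))
and
(
  sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
  /
  sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
) > (14 * (1 - 0.999))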
# prometheus-rules.yaml (generated by Sloth, simplified)
groups:
  - name: slo-burn
    rules:
      - alert: SLOErrorBudgetBurnFast
        expr: |
          (
            sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
            /
            sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
          ) > (14 * (1 - 0.999))
        for: 10m
        labels:
          severity: page
          service: checkout
        annotations:
          summary: "Checkout SLO fast burn"
          runbook: "https://runbooks.marketforge.internal/checkout/slo"
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          (
            avg_over_time(
              (
                sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
                /
                sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
              )[6h:]
            ) > (6 * (1 - 0.999))
          )
        for: 30m
        labels:
          severity: ticket
          service: checkout
        annotations:
          summary: "Checkout SLO slow burn"
          runbook: "https://runbooks.marketforge.internal/checkout/slo"

- Pager routing that respects sleep
Alertmanager routed only `severity: page` alerts from SLO burns to PagerDuty. Everything else opened a ticket in Jira or Slack.
# alertmanager.yaml (routing snippet)
route:
  receiver: default
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty
      group_by: [service]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
    - matchers:
        - severity="ticket"
      receiver: jira
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
  - name: jira
    webhook_configs:
      - url: https://jira.marketforge.internal/hooks/alerts

- GitOps everything
- SLO YAML lived under `sre/slos/` in the mono-repo. `ArgoCD` synced `PrometheusRule` CRDs, `Alertmanager` config, and `Grafana` dashboards.
- Changes required a PR, code review, and a canary rollout. No click-ops.
- Make it visible
We carved out a Grafana folder: one dashboard per service, top-left panel is remaining error budget over 28 days, plus burn rate sparkline. The first graph every on-call saw was user impact, not pod counts.
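The budget panel is just math over the SLI. A sketch of the kind of query behind it, written against the raw Istio metrics (Sloth also generates SLI recording rules you could point the panel at instead), for the 99.9%/28-day checkout SLO, whose full budget works out to roughly 40 minutes of total downtime:

# Remaining 28-day error budget for checkout, as a fraction (1.0 = untouched, 0 = exhausted).
# Budget = 1 - SLO = 0.1% of requests over 28d (about 40 minutes of full outage at steady traffic).
1 - (
  (
    sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[28d]))
    /
    sum(increase(istio_requests_total{destination_workload="checkout"}[28d]))
  )
  /
  (1 - 0.999)
)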
How incident response actually changed
We didn’t just add alerts; we rewrote the on-call contract.
- Paging policy: Only SLO burn pages. Node/pod/K8s noise became tickets with rational priorities.
- Triage flow started at the SLI panel, not the pod list. The second thing on-call checked was the `ArgoCD` deploy history. 80% of SLO burns correlated with a deploy in the last 30 minutes. Shocking, I know.
- Rollback and flag strategy: `Argo Rollouts` for canary, `LaunchDarkly` for feature flags. If a canary burned 2% of the error budget in 15 minutes, `rollout abort` was the default, not a debate.
- Runbooks: Each SLO had a runbook with `kubectl` one-liners, `istioctl proxy-status`, SLI queries, and feature flag kill switches.
# Quick triage snippets from the runbook
kubectl -n checkout get pods -o wide --sort-by=.status.containerStatuses[0].restartCount
# Check last deploy
argocd app history checkout | head -n 5
# Compare SLI before/after deploy
promtool query instant http://prometheus:9090 \
'sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))'

The social change was the hardest part. We had to deprogram the “CPU>85% == page” reflex. But once engineers saw reduced noise and clearer priorities, they leaned in.
Results after 30/60/90 days
We measured hard outcomes, not vibes.
- Pages/week: 38 → 14 (−63%) by day 60; stabilized at 13–16 by day 90
- MTTR: 6h → 1h 50m at day 30 → 48m at day 90 (−87%)
- MTTD: 20m → 5m (multi-window alerts caught slow burns early)
- Change failure rate: 32% → 12% (canaries + flag kills on burn)
- On-call satisfaction (internal survey): 2.1/5 → 4.0/5
- Unplanned downtime on checkout (28d): 220 minutes → 24 minutes
- Compliance posture: PCI evidence packs included SLO dashboards and error budget policy; zero nonconformities in the audit
Business impact wasn’t subtle: conversion recovered 1.8 points after we stopped burying checkout in noisy restart pages and started defending the SLO. Product stopped arguing with SRE about “is this critical?” We had numbers.
Lessons learned (and what I’d do differently)
- Don’t start with 20 SLOs. We started with three services and two SLIs each. That was enough to flip the culture.
- Pick SLO targets you can actually meet. Shipping a 99.99% SLO on day one just means permanent pages and no credibility.
- Keep SLIs boring. We resisted “weighted blended” weirdness. Ratios and histograms won.
- Codify or it didn’t happen. YAML + `Sloth` + `ArgoCD` meant changes were reviewable and auditable.
- Tie to release policy. Error budget exhaustion paused non-critical launches for a sprint (a gate query sketch follows this list). Product grumbled, then loved the predictability.
- Watch for AI-induced regressions. The fastest route to a burned budget was an “optimizing” AI patch that subtly changed retry semantics. SLOs turned those from Friday-night mysteries into Tuesday-morning blips.
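On the release-policy point, the gate can be as simple as a query your CI runs before promoting a non-critical launch, for example via the `promtool query instant` pattern from the runbook above. A sketch reusing the 28-day budget math, returning 1 while checkout still has budget and 0 once it is exhausted:

# Release gate: 1 while the checkout 28d error budget has headroom, 0 once it's spent.
(
  1 - (
    (
      sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[28d]))
      /
      sum(increase(istio_requests_total{destination_workload="checkout"}[28d]))
    )
    / (1 - 0.999)
  )
) > bool 0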
Do this next week: a 7-step playbook
- List your top three user journeys and pick an availability and latency SLI for each.
- Set SLOs with a 28d window and error budget policies (what pauses when you breach?).
- Generate Prometheus rules via `Sloth` (or hand-roll if you must) and deploy with `ArgoCD`.
- Implement two alert rules per SLO: fast (2h burn) pages, slow (6h/24h) tickets.
- Route non-SLO alerts away from `PagerDuty`. Sleep is a feature.
- Write runbooks that start with SLI graphs and recent deploys. Practice once a month.
- Add canary/flags to shorten MTTR: `Argo Rollouts` + `LaunchDarkly` is a strong combo.
If you want a sanity check on your first SLOs or help wrangling the PromQL, this is exactly what we do at GitPlumbers. We’ve cleaned up enough AI-generated “observability” YAML to know where the footguns are.
Key takeaways
- Page on user pain, not node metrics: alert on SLO error budget burn, not CPU or pod restarts.
- Define SLIs from top user journeys; keep them simple and objective.
- Use multi-window, multi-burn-rate alerts to catch both fast and slow burns without noisy flapping.
- Codify SLOs, alerts, and routing via GitOps so changes get reviewed, tested, and rolled out predictably.
- Tie incident response to error budgets: on breach, slow changes, add guardrails, and fix causes, not symptoms.
Implementation checklist
- Map 3-5 critical user journeys and define one availability and one latency SLI for each.
- Pick realistic SLO targets (e.g., 99.9% over 28d) with a clear error budget policy.
- Implement multi-window burn rate alerts (e.g., 2h/1h fast, 6h/24h slow) with `severity: page`.
- Route non-SLO alerts to tickets; reserve pages for error budget burns.
- Codify SLOs with a generator like Sloth, deploy via `ArgoCD`, and visualize error budgets in `Grafana`.
- Create runbooks that start with SLI graphs and recent deploys; practice drills monthly.
- Use feature flags/canaries (`LaunchDarkly`, `Argo Rollouts`) to reduce blast radius when SLOs burn.
Questions we hear from teams
- How do I pick my first SLO targets?
- Start with what you can meet based on historical data. If your checkout availability has been 99.7–99.85% over the last quarter, set 99.8% or 99.85%, not 99.99%. Use a 28-day window and revisit quarterly. The point is to create a useful error budget signal, not win an uptime beauty contest.
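To put a number on that, a query along these lines (using the same Istio metrics as the rest of this post; swap in your own request counter) reports observed checkout availability over the last 30 days:

# Observed 30-day availability for checkout: 1 - (errored requests / total requests).
1 - (
  sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[30d]))
  /
  sum(increase(istio_requests_total{destination_workload="checkout"}[30d]))
)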
- Do I need Istio to implement SLIs/SLOs?
- No, but a mesh makes it easier. You can use NGINX Ingress, Envoy, or app-level metrics (OpenTelemetry) to expose request counts and latency histograms. What matters is consistent metrics for total vs error counts and latency buckets.
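For example, with ingress-nginx the same availability SLI falls out of the controller's request counter. A sketch, assuming the stock `nginx_ingress_controller_requests` metric and an Ingress named `checkout` (verify the label names your controller version actually exports):

# Error ratio at the ingress, no mesh required (ingress-nginx controller metrics).
sum(rate(nginx_ingress_controller_requests{ingress="checkout", status=~"5..|429"}[5m]))
/
sum(rate(nginx_ingress_controller_requests{ingress="checkout"}[5m]))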
- What about composite services and partial failures?
- Model user journeys at the edge if possible (e.g., via API gateway metrics) and add service-level SLOs where necessary. If the page is triggered from the edge SLO, use service SLOs to triage. Avoid clever weighted composites until you’ve mastered basics.
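On Istio, measuring at the edge can mean scoping the same SLI to traffic reported by the ingress gateway. A sketch that assumes the default gateway workload name and that the checkout journey is identifiable by destination service (adjust the matchers to your topology):

# Edge-level error ratio for the checkout journey, as reported by the ingress gateway's proxy.
sum(rate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", destination_service_name=~"checkout.*", response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", destination_service_name=~"checkout.*"}[5m]))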
- Won’t slow-burn alerts cause alert fatigue?
- Use multi-window thresholds and route slow burns to tickets unless they continue to burn. We page on fast burns; we ticket on slow burns. Most teams see fewer, earlier, and more actionable signals with this pattern.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
