The SLIs That Actually Change On‑Call: Predict Failures, Gate Rollouts, Ship Calmly
Stop paging on CPU and start shipping based on error budgets that predict pain before customers feel it.
“If your alert doesn’t change someone’s behavior, it’s not an alert. It’s a dashboard.”
You don’t need another dashboard. You need better predictors.
I’ve sat through too many 2 a.m. pages where the only signal was CPU 92% on some node pool. Nobody made a better decision because of it. The incidents that bit us were always telegraphed by earlier, quieter signals: tail latency creeping, retry storms, queue backlogs, connection pool saturation. When we flipped our SLOs to track those, on‑call went from whack‑a‑mole to calm, and incident volume dropped.
If your SLOs don’t change how you roll out or triage, they’re just KPIs in a nicer font.
This is the playbook we implement at GitPlumbers when we’re called into burnt‑out teams who’ve tried “observability” and got only more noise.
Define SLIs that predict pain (not vanity)
Your SLIs should be the earliest reliable proxies for user harm on critical journeys. A few that consistently work:
- Request success rate for customer‑facing endpoints, not service‑to‑service. Example: `POST /checkout` success, not “cluster 2xx”. Tie it to `user_journey=checkout`.
- Tail latency (p95/p99) on those same journeys. p50 tells you the happy path; p95 tells you who’s about to file a ticket.
- Saturation headroom where contention hurts first: DB connection pool (e.g., `pgbouncer_stats_free_slots`), thread pools, Kafka consumer lag, Kubernetes workqueue depth.
- Retry and throttle signals: 429/503 ratios, client retry counts, circuit‑breaker open rate. Retry storms predict meltdowns.
- Backlog growth: SQS/Kafka consumer lag, `workqueue_depth` in controllers. Backlog climbing + steady input == future SLO breach.
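The “backlog climbing + steady input” test is just arithmetic on in/out rates, which makes it easy to page on the prediction instead of the breach. A minimal sketch, where the `minutes_until_breach` helper and all the numbers are hypothetical:

```python
# Hypothetical sketch: estimate when a growing consumer backlog will cross
# the lag threshold that breaks the latency SLO.
def minutes_until_breach(current_lag: int, produce_rate: float,
                         consume_rate: float, max_lag: int) -> float:
    """Rates are messages/minute. Returns minutes of runway until lag
    exceeds max_lag, or infinity if consumers are keeping up."""
    growth = produce_rate - consume_rate
    if growth <= 0:
        return float("inf")
    return (max_lag - current_lag) / growth

# 120k messages behind, producing 10k/min, consuming 8k/min,
# alerting when lag would hit 200k:
eta = minutes_until_breach(120_000, 10_000, 8_000, 200_000)  # 40.0 minutes
```

Forty minutes of runway is a calm page; the lag alert after the fact is a Sev‑1.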
What to drop from paging:
- Raw CPU/memory/disk unless they directly cause user harm and have no better proxy.
- Averages of anything. Averages are where incidents go to hide.
- Global availability across 10 services—scope SLIs to the user journey and owning team.
Instrument consistently:
- Use OpenTelemetry to attach `service.name`, `deployment.environment`, `team`, `tier`, and `user_journey` to traces/metrics/logs.
- Normalize HTTP metrics: status code family, method, route template. Avoid cardinality bombs (no full URLs).
Put SLOs where the pager changes hands (and behavior)
An SLO isn’t “99.9% because marketing.” It’s a contract: when the error budget burns at a certain rate, you take a different action.
- Example: for the Checkout API, the objective is `99.9%` monthly success. If burn rate > 14× over 5m, page primary immediately; if > 2× over 6h, page within business hours.
- Tie actions to budget state:
- Burn < 25%: ship freely, allow canary to auto‑promote.
- Burn 25–75%: require canary analysis to pass stricter gates.
- Burn > 75%: freeze risky changes; only hotfixes with rollback prepped.
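Those tiers are easy to encode next to your deploy tooling so the policy is code, not tribal knowledge. A hypothetical sketch (the `rollout_policy` helper and tier names mirror this post, not any standard API):

```python
# Illustrative mapping from error-budget burn to rollout policy.
def rollout_policy(budget_burned: float) -> str:
    """budget_burned = fraction of this month's error budget already spent."""
    if budget_burned < 0.25:
        return "auto-promote"   # ship freely; canary auto-promotes
    if budget_burned <= 0.75:
        return "strict-canary"  # require canary analysis with tighter gates
    return "freeze"             # hotfixes only, rollback prepped
```

Wire the return value into your pipeline gate and nobody argues about freezes in Slack.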
This is where I’ve seen teams finally reduce incidents: the SLO is not just plotted in Grafana; it gates rollouts and dictates triage.
Wire telemetry to triage and rollout automation
Make the data drive decisions automatically. The boring plumbing pays off fast.
- Alerting: Use multi‑window, multi‑burn rate alerts (from the Google SRE book) so you only wake people when users feel it.
- Triage routing: PagerDuty Event Orchestration uses labels to route/suppress. If the SLO isn’t burning, suppress node CPU alerts or reclassify as low.
- Runbooks in code: Every page fires an action—link a runbook, attach last failed deploy, and top suspect services from traces.
- Rollouts gated by SLO: Argo Rollouts or Flagger query Prometheus. If success rate dips or p95 spikes, the canary halts or rolls back. Feature flags follow the same rule.
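For intuition on why those burn-rate multipliers are safe to page on: a burn rate times its window tells you how much of the period’s budget it consumes. A quick sketch with this post’s 99.9%/30‑day numbers (the `budget_consumed` helper is illustrative):

```python
# Fraction of a period's error budget consumed by a sustained burn:
# budget_fraction = burn_rate * window / period.
def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """How much of the period's error budget is spent if burn_rate
    holds for window_hours."""
    return burn_rate * window_hours / period_hours

fast = budget_consumed(14, 1)  # 14x for 1h ~ 1.9% of the monthly budget
slow = budget_consumed(2, 6)   # 2x for 6h ~ 1.7%
# At a sustained 14x burn the entire budget is gone in 720/14 ~ 51 hours,
# which is why the fast-burn alert pages immediately.
```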
I’ve watched a fintech cut Sev‑1s by 40% in a quarter by doing just this: SLO burn gates on Argo + PD orchestration to de‑noise infra alerts.
Concrete configs you can copy‑paste
Here are minimal, production‑proven snippets you can adapt.
Prometheus: recording rules and multi‑window burn alerts
# prometheus-rules.yaml
groups:
  - name: checkout-slo
    rules:
      - record: job:http_request_total:rate5m
        expr: sum by (job, user_journey) (rate(http_requests_total{user_journey="checkout"}[5m]))
      - record: job:http_request_errors:rate5m
        expr: sum by (job, user_journey) (rate(http_requests_total{user_journey="checkout", status=~"5..|429|499"}[5m]))
      - record: slo:checkout:error_ratio5m
        expr: job:http_request_errors:rate5m / job:http_request_total:rate5m
      - record: slo:checkout:error_ratio1h
        expr: avg_over_time(slo:checkout:error_ratio5m[1h])
      - record: slo:checkout:error_ratio6h
        expr: avg_over_time(slo:checkout:error_ratio5m[6h])
      # Multi-window burn alerts for 99.9% SLO (0.1% budget)
      # Fast burn (page now): error ratio over 5m > 14x budget
      - alert: CheckoutFastBurn
        expr: slo:checkout:error_ratio5m > (0.001 * 14)
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout SLO fast burn"
          description: "Error ratio > 14x budget over 5m. User impact likely."
      # Slow burn (page during hours): over 6h > 2x budget
      - alert: CheckoutSlowBurn
        expr: slo:checkout:error_ratio6h > (0.001 * 2)
        for: 10m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Checkout SLO slow burn"
          description: "Error ratio > 2x budget over 6h. Investigate within business hours."
Sloth: codify the SLO (so it’s reviewable and versioned)
# slo-checkout.yaml (Sloth)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-availability
spec:
  service: checkout
  labels:
    team: payments
    user_journey: checkout
  slos:
    - name: availability
      objective: 99.9
      description: Checkout success rate monthly
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{user_journey="checkout",status=~"5..|429|499"}[5m]))
          totalQuery: sum(rate(http_requests_total{user_journey="checkout"}[5m]))
      alerting:
        name: Checkout
        labels:
          severity: page
        annotations:
          runbook: https://runbooks.example.com/checkout
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket
Argo Rollouts: gate promotion on SLO queries
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-analysis
spec:
  args:
    - name: version
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            1 - (sum(rate(http_requests_total{user_journey="checkout",status=~"5..|429|499",version="{{args.version}}"}[1m])) /
            sum(rate(http_requests_total{user_journey="checkout",version="{{args.version}}"}[1m])))
      successCondition: result[0] >= 0.999
    - name: p95-latency
      interval: 1m
      count: 5
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{user_journey="checkout",version="{{args.version}}"}[1m])))
      successCondition: result[0] < 0.3
# rollout.yaml (snippet)
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 50
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
Flagger + Istio: automatic rollback with success rate and latency
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 80
    gateways:
      - public-gateway
    hosts:
      - checkout.example.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99.9
        interval: 30s
      - name: request-duration
        thresholdRange:
          max: 300
        interval: 30s
Istio: stop retry storms and eject bad pods early
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout-dr
spec:
  host: checkout
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
---
# Retry caps live on the route (VirtualService), not the DestinationRule:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-vs
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
      retries:
        attempts: 2
        perTryTimeout: 300ms
PagerDuty Event Orchestration: suppress vanity alerts when SLO is healthy
{
  "rules": [
    {
      "conditions": [
        {"operator": "and", "subconditions": [
          {"field": "details.alert_type", "operator": "equals", "value": "cpu_high"},
          {"field": "details.slo_burning", "operator": "equals", "value": false}
        ]}
      ],
      "actions": {"suppress": true, "annotations": {"note": "CPU high suppressed unless SLO burning"}}
    }
  ]
}
Make leading indicators part of triage
Now that you’ve got the wiring, teach the pager to speak the language of prediction.
- Page titles: include `burn_rate`, top impacted `user_journey`, and last deploy SHA. Example: “Checkout SLO fast burn (14x) – deploy 3f2a9bc 8m ago”.
- Enrich alerts with suspects: top N spans by error rate from traces (e.g., `SpanMetricsProcessor`), DB saturation headroom, and retry ratios.
- First actions in runbook:
  - Check canary status; if failing, roll back (`kubectl argo rollouts undo ...`).
  - If retries > threshold, tighten Istio outlier detection by one notch.
  - If backlog rising, scale consumers before chasing code paths.
On‑call should move from “hunt for dashboards” to “confirm and act in <5 minutes.”
Run the loop weekly: tune, delete, and dare to freeze
The teams that win treat SLOs as a feedback loop, not a one‑time ceremony.
- Review error‑budget burn every week. If burn was near zero, your SLO is either too loose or you’re under‑shipping—loosen rollout gates. If you ran hot, freeze risky changes next sprint.
- Prune alerts that didn’t change behavior. If an alert never changed the on‑call action in 90 days, it’s a dashboard, not a page.
- Tighten thresholds on leading indicators (retry rate, backlog depth) as you gain headroom.
- Keep SLOs in Git next to service code via Sloth or Nobl9. PRs change SLOs; releases reference their budget state.
I’ve seen this fail when leadership treats SLOs as vanity goals. The fix is simple: wire budget state to the deploy pipeline and hold teams accountable to their own gates.
What we stopped measuring (and what it bought us)
At one marketplace client, we killed 60% of alerts in two weeks:
- Deleted: node CPU/mem pages, generic 5xx across the whole mesh, cluster “NotReady” spam. None changed the on‑call action.
- Added: `checkout` p99 latency, DB pool free slots for the payment write path, consumer lag on the `orders` topic, 429/503 ratio from the edge.
- Gated rollouts: Argo Rollouts promoted only when success rate ≥ 99.9% and p95 < 300ms for 5 consecutive minutes.
Results in 90 days:
- 42% fewer Sev‑1/Sev‑2s.
- MTTR down from 52m to 19m.
- Page volume down 58%, and engineers started sleeping again.
No silver bullets. Just SLIs that predict pain and plumbing that acts on them. That’s the boring, durable win.
Key takeaways
- Pick SLIs that are leading indicators of customer pain: tail latency, saturation, retry storms, and backlog growth.
- Express SLOs where a human would take a different action at the pager—otherwise it’s not an SLO, it’s a metric.
- Use multi-window burn-rate alerts to page only when it matters and to classify urgency.
- Gate rollouts and feature flags with error budgets; automate rollback when the burn spikes.
- Tie telemetry to triage: route, suppress, or enrich alerts based on SLO state and ownership.
Implementation checklist
- Define 3–5 user-journey SLIs (p95 latency, success rate, backlog depth, saturation headroom).
- Codify SLOs with error budgets and multi-window burn-rate alerts.
- Tag telemetry with `service`, `team`, `tier`, and `user_journey` via OpenTelemetry.
- Integrate SLO state with rollout controllers (Argo Rollouts or Flagger) to auto‑halt/rollback.
- Adopt weekly error‑budget reviews and tighten/relax gates based on data.
- Delete vanity alerts (CPU, disk) that don’t change on‑call behavior.
Questions we hear from teams
- How many SLOs per service is reasonable?
- Start with 1–2 SLOs per critical user journey, usually 3–5 per product. More than that and you’ll spend your life in meetings. Keep the long tail of metrics on dashboards, not pagers.
- What if we don’t have Prometheus/Argo/Istio?
- Great. Use what you have: Datadog monitors for burn rate, LaunchDarkly or OpenFeature for flag gates, AWS ALB/NLB metrics for success/latency, Spinnaker for automated rollouts. The pattern matters more than the tools.
- How do we pick the right objective (99.9 vs 99.95)?
- Back into it from user tolerance and incident review. If a 10‑minute checkout outage a month is acceptable, 99.98% is your ceiling. Then stress‑test with historical data to ensure you can operate within budget most months.
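That back‑of‑envelope is worth writing down once. A sketch, assuming a 30‑day month (the `objective_from_downtime` helper is illustrative):

```python
# Convert tolerated full downtime into an availability objective.
def objective_from_downtime(downtime_minutes: float,
                            period_days: int = 30) -> float:
    """Availability (%) implied by tolerating this much downtime per period."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)

obj = objective_from_downtime(10)  # ~99.977, so ~99.98% is the ceiling
```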
- Won’t gating rollouts slow us down?
- Only if your quality is already poor. Healthy teams pass gates quickly. And when things do go wrong, automatic rollback is much faster than a human hunt.
- What about internal platforms and batch jobs?
- Use SLIs that match the job: queue latency, backlog depth, and SLA completion window success. Page on predicted misses (backlog growing + steady input), not on CPU spikes.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
