Error Budgets By Tier: Stop Letting One Noisy Service Burn Your Whole Quarter

Design error budget allocation that predicts incidents, gates rollouts, and makes triage boring — not political.

If your error budget can’t stop a bad deploy in five minutes without a meeting, it’s not an SLO — it’s a vanity chart.

The release that torched your quarter

You’ve lived this: Friday 4:45pm, a “safe” change ships to a Tier 0 service. P99 jumps, retries storm the DB, the error budget is gone in 30 minutes, and suddenly product wants a freeze for the rest of the month. Meanwhile that Tier 2 batch job that failed twice last week quietly spent half the remaining budget because your SLOs were flat and shared.

I’ve seen this movie at SaaS unicorns and banks. The fix isn’t a new dashboard; it’s designing tiered error budgets with leading indicators and wiring them into rollout and triage automation. If your budgets can’t stop a bad deploy in 5 minutes without a Zoom call, they’re just vanity metrics.

Tier the service, not the politics

Service tiering is only useful if it changes math and automation. A simple, pragmatic model I’ve seen work:

  • Tier 0 (Critical, customer-facing): checkout, auth, trading. SLOs: 99.9%-99.99% availability, strict latency (P99 under X ms), sometimes a freshness SLO (search index <= 60s). Error budget: tiny. Automation: aggressive.
  • Tier 1 (Revenue-adjacent/internal critical): catalog, pricing, notifications. SLOs: 99.5%-99.9%, latency looser. Error budget: moderate.
  • Tier 2 (Best-effort/internal): reporting, batch, ML scoring queues. SLOs: 99%-99.5% or freshness-based. Error budget: larger; prioritize throughput over tail latency.

Don’t set SLOs only on availability. For each tier, pick SLO types that match user pain:

  • Availability (5xx or gRPC status != OK).
  • Latency (P95/P99 under threshold for the top N endpoints).
  • Freshness/lag (Kafka consumer lag, ETL recency, cache warmness).

Then decide where the budget can be spent:

  • Change budget (deploys, flags, config pushes). Freeze/rollback when exceeded.
  • Steady-state budget (dependency flakiness, noisy neighbors). Escalate ownership.
  • Load/peak budget (expected seasonal spikes). Pre-approve higher burn during known events.

A Tier 0 service might allocate 60% to change risk, 30% steady-state, 10% peak windows. Tier 2 flips that ratio.
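
Here’s roughly what those defaults look like as config. A sketch with illustrative field names, not any particular tool’s schema:

# tier-defaults.yaml (sketch; field names are illustrative)
tiers:
  "0":
    slo:
      availability_pct: 99.9 # 30-day window
      latency_p99_ms: 300
    allocation: { change: 0.60, steady_state: 0.30, peak: 0.10 }
  "1":
    slo:
      availability_pct: 99.5
      latency_p99_ms: 500
    allocation: { change: 0.40, steady_state: 0.40, peak: 0.20 }
  "2":
    slo:
      availability_pct: 99.0
      freshness_minutes: 15
    allocation: { change: 0.20, steady_state: 0.60, peak: 0.20 }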

Leading indicators that predict incidents

If your alerts fire after customers scream, you’re measuring the wrong thing. The indicators that consistently predict pain:

  • Burn rate (multi-window): ratio of current error rate to allowed budget, measured over 5m/1h and 1h/6h pairs.
  • Saturation: queue depth, thread pool utilization, connection pool saturation, Envoy upstream_rq_pending_overflow.
  • Tail latency slope: P99 growth rate, not absolute value. Sudden slope changes correlate with emergent queuing.
  • Retry storms: spike in client retries or 429/503, especially from a single caller.
  • GC/CPU pressure: JVM gc_pause_seconds_sum, cpu_steal_time, container throttling (container_cpu_cfs_throttled_seconds_total).
  • Data pipeline lag: Kafka consumer lag growth rate, not static lag; DB lock wait count/s; replica apply delay.
  • Cache effectiveness: cache miss ratio delta (derivative) and eviction storms.

Concrete Prometheus recording rules get you there. Example availability & latency SLOs with burn metrics:

# slo-rules.yaml
groups:
- name: slo-recording
  rules:
  # Availability SLI: total requests vs. errors
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
  - record: job:http_errors:rate5m
    expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
  - record: slo:availability:error_ratio5m
    expr: job:http_errors:rate5m / job:http_requests:rate5m

  # Latency SLI: proportion of requests under threshold (here: 300ms)
  - record: slo:latency:good_ratio5m
    expr: |
      sum by (job) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
      / sum by (job) (rate(http_request_duration_seconds_count[5m]))

  # Burn rate: compare the short-window error ratio to the tier's budget
  # Example: Tier 0 budget 0.1% => 0.001 error budget fraction
  - record: slo:availability:burn5m
    expr: slo:availability:error_ratio5m / 0.001

  # Slow window to avoid flapping
  - record: slo:availability:error_ratio1h
    expr: |
      sum by (job) (increase(http_requests_total{code=~"5.."}[1h]))
      / sum by (job) (increase(http_requests_total[1h]))
  - record: slo:availability:burn1h
    expr: slo:availability:error_ratio1h / 0.001

Pair fast/slow windows to catch both brownouts and slow leaks. Alert when both are high for Tier 0; be looser for Tier 2.
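
The slope- and lag-based indicators above translate into recording rules the same way. A sketch using deriv(): kafka_consumergroup_lag is the standard kafka-exporter name, and the cache_* metrics are stand-ins for whatever your cache actually exports.

# leading-indicator-rules.yaml (sketch; cache_* metric names are placeholders)
groups:
- name: leading-indicators
  rules:
  # P99 growth rate (seconds per second); emergent queuing shows up here before absolute P99 does
  - record: slo:latency:p99_slope10m
    expr: |
      deriv(
        histogram_quantile(0.99,
          sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
        )[10m:1m]
      )
  # Consumer lag growth rate (messages per second), not static lag
  - record: pipeline:kafka_lag:slope10m
    expr: deriv((sum by (consumergroup) (kafka_consumergroup_lag))[10m:1m])
  # Cache miss ratio delta vs. 10 minutes ago
  - record: cache:miss_ratio:delta10m
    expr: |
      (sum by (job) (rate(cache_misses_total[5m])) / sum by (job) (rate(cache_requests_total[5m])))
      - (sum by (job) (rate(cache_misses_total[5m] offset 10m)) / sum by (job) (rate(cache_requests_total[5m] offset 10m)))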

Example Prometheus alerting rules with multi-window burn and tier labels for routing:

# alerting-rules.yaml
# Assumes a `tier` label is attached to the SLO series (via the recording rules
# or relabeling) and that latency burn records follow the same pattern as the
# availability ones above.
groups:
- name: slo-burn-alerts
  rules:
  - alert: Tier0AvailabilityBurn
    expr: (slo:availability:burn5m{tier="0"} > 14) and (slo:availability:burn1h{tier="0"} > 7)
    for: 5m
    labels:
      severity: critical
      tier: "0"
    annotations:
      summary: "Tier 0 availability burn >14x/7x"
      runbook: "https://runbooks.internal/tier0-burn"

  - alert: Tier2LatencyBurn
    expr: (slo:latency:burn5m{tier="2"} > 8) and (slo:latency:burn6h{tier="2"} > 2)
    for: 15m
    labels:
      severity: warning
      tier: "2"
    annotations:
      summary: "Tier 2 latency burn high"

Gate rollouts and flags on the budget

This is where most teams stop — a page goes to on-call and the rollout keeps marching. Wire your budgets into the control plane so bad changes die fast without a meeting.

Argo Rollouts AnalysisTemplate querying Prometheus for burn rate and tail latency slope, auto-pausing/rolling back:

# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: tier0-slo-gate
spec:
  metrics:
  - name: availability-burn
    interval: 1m
    successCondition: result[0] < 5
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus:9090
        query: slo:availability:burn5m{job="checkout",tier="0"}
  - name: p99-slope
    interval: 1m
    successCondition: result[0] < 0.2 # p99 growth rate (s/s over the last 10m) stays under 0.2
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          deriv(
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))
            )[10m:1m]
          )
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - analysis:
          templates:
          - templateName: tier0-slo-gate
      - setWeight: 25
      - pause: {duration: 120}
      - analysis:
          templates:
          - templateName: tier0-slo-gate
      - setWeight: 50
      - analysis:
          templates:
          - templateName: tier0-slo-gate

If you’re using Flagger with Istio/Envoy, the pattern is the same — check burn and error rate before each bump:

# flagger-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: pricing
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pricing
  service:
    port: 80
    gateways:
    - public-gw
  analysis:
    interval: 1m
    threshold: 1
    metrics:
    - name: availability
      interval: 30s
      thresholdRange:
        min: 99.9 # percent of successful requests
      query: 100 - (slo:availability:error_ratio5m{job="pricing"} * 100)
    - name: p99
      interval: 30s
      thresholdRange:
        max: 300 # ms
      query: |
        histogram_quantile(0.99,
          sum(rate(istio_request_duration_milliseconds_bucket{destination_workload="pricing"}[5m])) by (le)
        )

For feature flags (LaunchDarkly/Unleash), kill switches should be tied to the same Prometheus queries. Don’t wait for humans to flip the switch. A simple controller or webhook can auto-disable flags when burn exceeds N for M minutes.

Example pseudo-automation via an Alertmanager webhook to your flag service:

curl -X POST https://flags.internal/api/disable \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"flag": "new-checkout-flow", "reason": "burn-rate>7x/5m"}'

Make triage boring (in a good way)

When burn crosses a threshold, you need two things right now: the owner and the next action. Don’t page a human to go hunt for a dashboard.

Wire telemetry to ownership and runbooks:

  • Service catalog (Backstage, OpsLevel): map service -> team -> on-call -> runbook. Alert annotations link directly.
  • Deploy annotations: include the ArgoCD app, commit SHA, author, and feature flags in trace and log attributes.
  • OpenTelemetry resource attrs: propagate release data so traces tell you which cohort is breaking.
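
On the catalog side, the mapping is just annotations. A Backstage-flavored sketch where the tier and runbook keys are our own convention, not built-ins:

# catalog-info.yaml (sketch; tier/runbook annotation keys are a custom convention)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout
  annotations:
    pagerduty.com/integration-key: "<integration-key>"
    internal.example/tier: "0"
    internal.example/runbook: https://runbooks.internal/tier0-burn
spec:
  type: service
  lifecycle: production
  owner: team-payments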

Minimal otel-collector config to inject release/version and forward to Prometheus and your APM:

receivers:
  otlp:
    protocols:
      http:
exporters:
  prometheus:
    endpoint: ":9464"
  otlphttp/apm:
    endpoint: https://apm.example.com # base URL; the exporter appends /v1/traces
processors:
  batch: {}
  resource:
    attributes:
    - action: upsert
      key: deployment.sha
      value: ${GIT_SHA}
    - action: upsert
      key: feature.flags
      value: ${FLAGS}
extensions:
  health_check: {}
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlphttp/apm]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Route smartly with PagerDuty schedules:

  • Tier 0 critical -> primary + incident commander, 24/7.
  • Tier 1 -> service owner business-hours critical, after-hours warning.
  • Tier 2 -> ticket with auto-triage suggestions; page only on sustained burn.

Bonus: a Slack bot that posts the live budget, top error signatures, and the last 3 deploys to #incident-<id>. I’ve built this twice; it pays for itself in the first quarter.

Allocate the budget by tier (and by risk)

Here’s the allocation model I use when GitPlumbers walks into a mess and needs sanity in a week:

  1. Define SLOs per tier. Keep it to availability + one more SLO type (latency for online paths, freshness for pipelines).
  2. Set numeric budgets per 30-day window.
    • Tier 0: 0.1% availability, P99 under 300ms for top 5 endpoints, freshness <= 60s.
    • Tier 1: 0.5% availability, P99 under 500ms.
    • Tier 2: 1%-2% availability or freshness <= 15m.
  3. Split each budget across risk buckets for policy:
    • Tier 0: 60% change, 30% steady-state, 10% peak.
    • Tier 1: 40% change, 40% steady-state, 20% peak.
    • Tier 2: 20% change, 60% steady-state, 20% peak.
  4. Write the policy in Git as code (YAML), so automation can read it and product can PR it.
  5. Gate rollouts/flags on burn thresholds tied to those allocations.
  6. Review spend weekly with product, and re-allocate if a dependency is chronically spending your budget.

Example policy file consumed by your CD controller:

# error-budget-policy.yaml
service: checkout
tier: 0
window: 30d
budgets:
  availability: 0.001
  latency_p99_ms: 300
allocation:
  change: 0.6
  steady_state: 0.3
  peak: 0.1
thresholds:
  fast_burn: 10   # 5m window
  slow_burn: 5    # 1h window
actions:
  - condition: fast_burn_exceeded
    do: pause_rollout
  - condition: slow_burn_exceeded
    do: rollback
  - condition: steady_state_burn_over_50%
    do: escalate_dependency_owner
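
If your CD runs outside Kubernetes, the same policy file can gate promotion from CI. A GitHub Actions sketch, where the Prometheus URL and the job label are assumptions:

# .github/workflows/budget-gate.yaml (sketch; PROM_URL and the job label are assumptions)
name: error-budget-gate
on: workflow_dispatch
jobs:
  check-burn:
    runs-on: ubuntu-latest
    env:
      PROM_URL: http://prometheus.internal:9090
    steps:
      - uses: actions/checkout@v4
      - name: Fail the promotion if fast burn exceeds policy
        run: |
          LIMIT=$(yq '.thresholds.fast_burn' error-budget-policy.yaml)
          BURN=$(curl -s "${PROM_URL}/api/v1/query" \
            --data-urlencode 'query=slo:availability:burn5m{job="checkout"}' \
            | jq -r '.data.result[0].value[1] // "0"')
          echo "burn=${BURN} limit=${LIMIT}"
          awk -v b="$BURN" -v l="$LIMIT" 'BEGIN { exit (b < l) ? 0 : 1 }'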

And yes, account for AI-generated changes. I’ve seen AI “vibe code” patches pass unit tests and blow up under backpressure. Treat AI-driven changes as higher risk: apply stricter Tier 0 change budget gates until they’ve proved themselves. GitPlumbers does a lot of this vibe code cleanup; the pattern above is what prevents repeat offenses.

Example: Checkout (Tier 0) vs. Email Renderer (Tier 2)

At a fintech client, we split error budgets by tier in a week and wired them to rollouts:

  • Checkout (Tier 0): 99.95% availability, P99 < 250ms; fast/slow burn thresholds 14x/7x. Canary gated by Argo Rollouts + Prometheus. Feature flag kill switch auto-triggers above 7x for 3m.
  • Email Renderer (Tier 2): 99% availability, freshness SLO (new templates visible <= 10m). Burn thresholds 8x/2x. Flagger progressive delivery with a longer pause and no auto-rollback (alert-only).

Results in 60 days:

  • Checkout MTTR dropped from ~45m to ~12m because rollouts paused themselves and triage posted runbooks + last deploys in Slack.
  • Feature-related incidents fell by ~40% because canaries stopped at 10% when burn spiked.
  • Email renderer stopped spending Tier 0 goodwill; outages still happened, but they only burned its own Tier 2 budget, and the system stopped paging people at 2am for it.

No magic, just budgets tied to automation and signals that predict pain.

What I’d do again (and what I’d skip)

Do again:

  • Start with 2-3 SLOs per service. Too many and no one believes any of them.
  • Use multi-window burn and slope-based triggers; they’re boringly effective.
  • Put the policy in Git next to the service, reviewed by product and SRE together.
  • Annotate traces/logs with deploys and flags; it shortens root-cause time by half.

Skip:

  • Big-bang SLO rollouts. Start with Tier 0/1 only, then expand.
  • Vanity dashboards that show weekly uptime but hide saturation.
  • Paging humans for recoverable canary failures; let automation roll back.

If you need a sanity check or want a second set of hands to wire Prometheus to your Argo gates and clean up the vibe code that’s making your burn graphs look like a ski jump, GitPlumbers does this all the time. Happy to jump in, pair with your SRE, and leave you with guardrails that keep working after we’re gone.


Key takeaways

  • Use tier-specific SLOs and budgets: different classes of service get different error budgets and leading indicators.
  • Monitor burn using multi-window alerts and slope-based signals, not just weekly uptime charts.
  • Gate rollouts and feature flags on burn rate and saturation, automatically pause/rollback without human drama.
  • Triage flows from telemetry: route by service ownership, surface runbooks, and annotate with recent deploys and flags.
  • Allocate budget across change risk vs. steady-state risk; don’t let a Tier 2 batch job spend Tier 0 budget.
  • Make the budget policy a config artifact in Git so product, SRE, and eng are literally on the same file.

Implementation checklist

  • Define service tiers and SLO types (availability, latency, freshness) with numeric budgets.
  • Instrument leading indicators: burn rate, saturation queues, GC/CPU pressure, dependency errors, and lag growth rates.
  • Create multi-window burn alerts (fast and slow) per tier; route to on-call and automation.
  • Wire burn thresholds to rollout gates in Argo Rollouts/Flagger and to feature flag kill switches.
  • Automate triage: service owner paging, runbook links, recent deploy/flag annotations, and golden queries.
  • Review budget spend weekly with product; reallocate budget across change vs. steady state as needed.

Questions we hear from teams

How do I pick fast/slow burn thresholds?
Start with Google SRE’s guidance: for 99.9% availability, alert when 5m burn >14x AND 1h burn >7x. Adjust per tier. Tier 2 can tolerate lower thresholds (e.g., 8x/2x). Watch for paging volume for two weeks and tune.
What if dependencies blow my budget?
Create dependency SLOs and carve out steady-state budget. Escalate with evidence: show the burn attributed to the dependency. Add circuit breakers/timeouts and consider caching or degradation paths so their pain doesn’t become your burn.
We’re a small team. Is this overkill?
You can do a lightweight version: a single availability SLO + a p99 threshold, one multi-window alert, and a canary gate in Argo Rollouts. It’s a day or two of work and it stops most self-inflicted incidents.
Where do feature flags fit?
Treat flags like deploys. Tag traces/logs with active flags, and wire kill switches to the same burn metrics. Auto-disable high-risk flags on Tier 0 when burn spikes.
Can I do this without Kubernetes?
Yes. Use Prometheus/Alertmanager or CloudWatch alarms for burn, route to PagerDuty, and integrate with your deploy system (Spinnaker, Jenkins, GitHub Actions). The pattern is tooling-agnostic.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Set up tiered error budgets with an engineer who’s done it before
  • Assess your burn rate and rollout gates (free 30‑min review)
