Stop Burning Budgets Blind: Designing Error Budget Allocation by Service Tier (and Wiring It to Rollouts)

Vanity SLOs won’t save you. Here’s how to tier your services, watch the right leading indicators, and plug error budgets into triage and deployment automation.


The tiering mistake I keep seeing

You’ve probably lived this: a “Tier-2” internal API gets the same 99.99% target as your checkout service, and you wonder why the burn alerts never stop. At a previous gig, our payments service (real Tier-0) shared an error budget with a cron that reconciled logs (Tier-3). Guess which one triggered freezes? The cron. We were optimizing dashboards while revenue bled.

Here’s what actually works: define service tiers by business blast radius, choose error budgets that reflect your customer promise, and monitor leading indicators that predict impact. Then wire those budgets directly into alerting and rollout automation so the system slows itself down before you take an outage.

If your SLOs don’t change deployment behavior, they’re just compliance theater.

Map service tiers to error budgets that reflect risk

Don’t start with 99.99 because it “sounds right.” Start with the promise and the money.

  • Tier-0: direct revenue events or irreversible side effects (checkout, auth, ledger). Target 99.95–99.99 monthly. Budget: 21.6 minutes (at 99.95) down to 4.3 minutes (at 99.99).
  • Tier-1: core user flows with retries/fallbacks (feed API, search). Target 99.9. Budget: 43.2 minutes.
  • Tier-2: internal APIs/back-office, async consumer UX. Target 99.5. Budget: 3.6 hours.
  • Tier-3: batch/analytics. Target 97–99. Budget: 7.2–21.6 hours.
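
The arithmetic behind those numbers is worth making concrete. A quick sketch (assuming a 30-day month) that reproduces the budgets above:

```python
# Error budget in minutes for a monthly SLO target, assuming a 30-day month.
def monthly_budget_minutes(slo_percent: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo_percent / 100)

for target in (99.99, 99.95, 99.9, 99.5, 97.0):
    print(f"{target}% -> {monthly_budget_minutes(target):.1f} min/month")
```

Run it once and pin the results next to your tier definitions so nobody argues about the denominator later.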

Translate that into change windows and burn policies:

  • If a Tier-0 service burns >20% of budget in 24h, auto-freeze risky rollouts and enable circuit breakers.
  • Tier-1 can burn 40% in 24h before a freeze; Tier-2 gets 60%.
  • Reserve budget for maintenance: e.g., Tier-0 reserves 30% monthly for infra upgrades.

I like to be explicit. For a Tier-1 API with a 99.9% SLO, you’ve got 43.2 minutes per month. Allocate it:

  • 15 minutes: planned changes (deploys, schema migrations)
  • 10 minutes: dependencies (DB/Kafka/identity)
  • 10 minutes: unknowns (incidents)
  • 8.2 minutes: risk buffer

When engineering asks “can we ship feature X this week?”, the answer is “check the budget.”
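
A minimal sketch of that gate, using a hypothetical can_ship helper (the function name is mine; the 8.2-minute buffer comes from the allocation above):

```python
# Hypothetical pre-ship gate: compare remaining budget against the cost of a
# planned change, and never dip into the risk buffer for planned work.
def can_ship(remaining_minutes: float, planned_change_cost_minutes: float,
             risk_buffer_minutes: float = 8.2) -> bool:
    return remaining_minutes - planned_change_cost_minutes > risk_buffer_minutes

print(can_ship(remaining_minutes=20.0, planned_change_cost_minutes=5.0))  # True
print(can_ship(remaining_minutes=10.0, planned_change_cost_minutes=5.0))  # False
```

The point isn't the code; it's that "can we ship?" becomes a deterministic answer instead of a negotiation.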

Pick leading indicators that predict incidents (not vanity graphs)

Stop paging on CPU averages and 200-counts. Page on precursors that correlate with user pain.

Leading indicators per tier usually look like:

  • Latency tail shifts: p99 or p99.9 over SLO threshold for consecutive windows
  • Saturation: thread-pool/executor saturation, connection pool exhaustion, HPA churn, throttling
  • Queue health: Kafka/SQS/Azure SB lag, dead-letter rates
  • Dependency SLI drift: upstream error rates/latency, DNS/TLS handshake failures
  • Runtime health: GC pause times, stop-the-world frequency, heap pressure
  • Concurrency/backpressure: dropped connections, 429/503 ratios, circuit breaker open rate

Concrete PromQL examples that actually predict trouble:

# API error ratio (5xx) as SLI
sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
# Latency tail (p99) over SLO threshold
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))) > 0.3
# Kafka consumer lag leading indicator
max(kafka_consumergroup_lag{consumergroup="orders-cg"}) > 10000
# JVM GC pause time rolling window
rate(jvm_gc_pause_seconds_sum{service="payments"}[5m]) > 0.5

Tie these to exemplars via OpenTelemetry so you can drill from a p99 spike to the exact trace and tenant in seconds. If you can’t jump from metric to trace to log within 30 seconds, your triage cost is eating the budget.

Turn telemetry into triage: burn alerts with actions per tier

Alert on burn, not just current error rate. Multi-window, multi-burn catches both fast meltdowns and slow bleeds.

A Sloth-style alert pair for a 99.9% availability SLO:

# prometheus_rules.yaml
groups:
- name: slo-burn-api
  rules:
  - alert: APIHighBurnShort
    expr: |
      (
        sum(rate(sli_availability_bad_total{job="api"}[5m]))
        /
        sum(rate(sli_availability_total{job="api"}[5m]))
      ) > (14.4 * (1 - 0.999))
    for: 5m
    labels:
      severity: page
      tier: T1
    annotations:
      summary: "API SLO high burn (short)"
      runbook: "https://runbooks.company.local/api-slo"
  - alert: APIHighBurnLong
    expr: |
      (
        sum(rate(sli_availability_bad_total{job="api"}[30m]))
        /
        sum(rate(sli_availability_total{job="api"}[30m]))
      ) > (6 * (1 - 0.999))
    for: 30m
    labels:
      severity: page
      tier: T1
    annotations:
      summary: "API SLO high burn (long)"
      runbook: "https://runbooks.company.local/api-slo"
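
If you want to sanity-check a burn-rate multiplier, the math is one line: burn rate is the observed error rate divided by the budgeted error rate (1 − SLO), and a constant burn rate exhausts the monthly budget in a predictable time. A sketch:

```python
# Burn rate = observed error rate / budgeted error rate (1 - SLO).
# Hours until a 30-day budget is fully spent at a constant burn rate:
def hours_to_exhaustion(burn_rate: float, days: int = 30) -> float:
    return (days * 24) / burn_rate

# Common thresholds from the Google SRE workbook's multiwindow guidance:
print(hours_to_exhaustion(14.4))  # fast burn: budget gone in ~2 days
print(hours_to_exhaustion(6.0))   # slow burn: budget gone in 5 days
print(hours_to_exhaustion(1.0))   # burning exactly on budget for the month
```

Pick multipliers so the fast-burn alert pages and the slow-burn alert tickets; anything that exhausts budget slower than the month itself is a dashboard, not a page.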

Then route by tier with explicit actions:

# alertmanager.yaml
route:
  receiver: noc
  routes:
  - matchers:
    - tier = T0
    receiver: oncall-sev1
    continue: true
  - matchers:
    - tier = T1
    receiver: oncall-sev2

receivers:
- name: oncall-sev1
  slack_configs:
  - channel: "#prod-sev1"
    send_resolved: true
  opsgenie_configs:
  - responders:
    - type: team
      name: "SRE"
    priority: P1

- name: oncall-sev2
  slack_configs:
  - channel: "#prod-sev2"

What matters: every alert includes the budget status and the default action. For Tier-0: “If burn > 20% today, pause rollouts (label freeze=true), enable circuit breakers, and shift traffic 10% to the older replica set.” Put the buttons in the alert or automate them.
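
Automating the default action can start as small as a lookup keyed on the alert's tier label. A hypothetical sketch (the action strings and the wiring to your deploy tooling are yours to define):

```python
# Map an Alertmanager-style alert payload to the default action for its tier.
# Action strings mirror the tier policies above; wiring them to kubectl/Argo
# is deliberately out of scope here.
DEFAULT_ACTIONS = {
    "T0": "freeze rollouts, enable circuit breakers, shift 10% traffic to stable",
    "T1": "require canary + analysis for further deploys",
    "T2": "file a prioritized backlog item",
}

def triage_action(alert: dict) -> str:
    tier = alert.get("labels", {}).get("tier", "T2")
    return DEFAULT_ACTIONS.get(tier, DEFAULT_ACTIONS["T2"])

alert = {"labels": {"tier": "T0", "alertname": "APIHighBurnShort"}}
print(triage_action(alert))
```

Even this much removes the "who decides?" delay from the first five minutes of an incident.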

Wire error budgets to rollout automation (canary + kill-switches)

If budgets don’t control rollout speed, you’re doing theater. Argo Rollouts, Flagger, or Spinnaker can evaluate SLIs mid-deploy.

An AnalysisTemplate that fails a canary when error rate or p99 drifts:

# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-slo-check
spec:
  metrics:
  - name: error-rate
    interval: 1m
    count: 5
    successCondition: result[0] < 0.01
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1m]))
          /
          sum(rate(http_requests_total{job="api"}[1m]))
  - name: p99-latency
    interval: 1m
    count: 5
    successCondition: result[0] < 0.3
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[1m])))

Attach it to the rollout steps:

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 60}
      - analysis:
          templates:
          - templateName: api-slo-check
      - setWeight: 25
      - pause: {duration: 120}
      - analysis:
          templates:
          - templateName: api-slo-check
      - setWeight: 50
      - pause: {duration: 180}
      - analysis:
          templates:
          - templateName: api-slo-check

And give yourself fast containment with mesh-level circuit breakers. For Istio:

# destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Tie ejection thresholds to tier: Tier-0 eject fast and shallow; Tier-2 eject slower and tolerate more noise.
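
One way to keep that consistent across services is to generate the per-tier outlier-detection settings from a single table. The numbers below are illustrative assumptions, not recommendations; tune them against your own traffic:

```python
# Illustrative per-tier Istio outlierDetection values: Tier-0 ejects fast and
# shallow (few consecutive errors, low ejection ceiling); Tier-2 tolerates more
# noise before ejecting.
def outlier_settings(tier: str) -> dict:
    table = {
        "T0": {"consecutive5xxErrors": 3,  "interval": "5s",  "maxEjectionPercent": 25},
        "T1": {"consecutive5xxErrors": 5,  "interval": "5s",  "maxEjectionPercent": 50},
        "T2": {"consecutive5xxErrors": 10, "interval": "30s", "maxEjectionPercent": 50},
    }
    return table.get(tier, table["T2"])

print(outlier_settings("T0"))
```

Render this table into your DestinationRules at deploy time and the tier policy stops drifting per service.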

Allocate budgets across dependencies and shared platforms

Most incidents aren’t your code; they’re the DB, Kafka, or auth. Stop pretending those are separate worlds.

  • Composite SLOs: include weighted dependency SLIs in your service SLO. If auth has 99.9% and your API has 99.95%, the composite isn’t 99.95%. Model it and surface it on dashboards.
  • Budget split: explicitly allocate a percentage of your monthly budget to dependencies (e.g., 25% to DB, 15% to Kafka). When dependency burn exceeds allocation, escalate to the owning team with the business impact.
  • Shared platform SLOs: platform teams own SLOs that map to tenants’ tiers. Example: “Postgres primary write latency p99 < 50ms for Tier-0 tenants.”
  • Async backpressure: for Tier-2/3, allow lag to absorb spikes. Measure max(consumer lag) and set autoscaling to keep lag burn within budget.
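
The composite math is simple enough to keep honest in code: assuming independent failures, availabilities in series multiply. Using the auth example above:

```python
# Serial composition: your effective availability can't exceed the product of
# the availabilities you depend on (assuming independent failures).
def composite_availability(*slos: float) -> float:
    p = 1.0
    for s in slos:
        p *= s
    return p

# API at 99.95% behind auth at 99.9%:
print(f"{composite_availability(0.9995, 0.999):.5f}")  # ~0.99850, not 99.95%
```

That 0.1-point gap is exactly the budget you should be allocating to the auth dependency up front.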

If you want to force the conversation, present the math in review: “We lost 12 of 43 minutes due to DB lock contention; platform owes us a mitigation plan.” This is where GitPlumbers gets brought in to untangle cross-team SLOs without turning it into a blame-fest.

Governance with teeth: reviews, freezes, and debt paydown

Monthly SLO reviews shouldn’t be a retro bingo card.

  • Review inputs: burn by cause (change, capacity, dependency), top leading indicators, time-to-detection, MTTR, change failure rate (DORA).
  • Consequences by tier:
    • Tier-0: >50% budget burned triggers a partial freeze on non-critical rollouts until next review.
    • Tier-1: >75% burned triggers extra checks (require canary + analysis) and senior approver.
    • Tier-2/3: convert burn into prioritized backlog items with dates.
  • Budget resets and accruals: no carryover across months for Tier-0/1; Tier-2/3 can bank 20% to fund risky migrations.
  • Dashboards: a single page per service with SLO, remaining budget (minutes and %), leading indicators, and deployment status. If a VP can’t see “can we ship today?” in one screen, fix it.
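
Those consequence rules are mechanical enough to encode, which keeps reviews from relitigating them each month. A sketch (the Tier-2/3 trigger point is my assumption; the Tier-0/1 thresholds come from the policy above):

```python
# Consequence table from the review policy, as a pure function of tier and
# percent of monthly budget burned.
def review_consequence(tier: str, pct_burned: float) -> str:
    if tier == "T0" and pct_burned > 50:
        return "partial freeze on non-critical rollouts"
    if tier == "T1" and pct_burned > 75:
        return "require canary + analysis and senior approver"
    if tier in ("T2", "T3") and pct_burned > 100:  # assumed: only on overspend
        return "convert burn into prioritized backlog items with dates"
    return "no action"

print(review_consequence("T0", 60))  # partial freeze on non-critical rollouts
print(review_consequence("T1", 60))  # no action
```

When the function, not a meeting, decides the consequence, the review can spend its time on causes instead.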

Track that this governance actually improves outcomes: MTTR down, change failure rate down, and fewer rollbacks despite a steady cadence.

What I’d do on Monday

You don’t need a six-month program. Stand up the guardrails, then iterate.

  1. Classify services into Tier-0/1/2/3 by revenue/user impact.
  2. Set concrete monthly SLO targets and compute budgets in minutes.
  3. Pick two leading indicators per service (latency tail + one saturation signal) and add exemplars.
  4. Deploy multi-window burn alerts, routed by tier, with explicit default actions.
  5. Gate canaries with error-rate and p99-latency checks via Argo Rollouts or Flagger.
  6. Add circuit breaker configs for Tier-0 dependencies.
  7. Establish a monthly review with freeze rules and debt paydown agreements.
  8. After two weeks, prune noisy alerts, tune thresholds, and socialize the one-page dashboard.

I’ve seen this fail when teams try to boil the ocean. Get Tier-0 solid, then move outward. And if you’re stuck in cross-team SLO politics, call in a neutral party to broker the math and the automation. That’s literally our Tuesday at GitPlumbers.

Key takeaways

  • Tiering is about business risk, not org charts. Tie SLOs and budgets to revenue impact and user promise per tier.
  • Use leading indicators (latency tail, saturation, queue lag, GC pauses, dependency SLI) to predict incidents before users feel them.
  • Implement multi-window, multi-burn alerts and route by tier with clear, automated runbooks.
  • Gate rollouts with real SLI checks (error rate and p99 latency) and wire kill-switches/circuit breakers for fast containment.
  • Allocate budgets across dependencies and shared platforms; composite SLOs force real conversations and capacity investment.
  • Hold monthly/quarterly budget reviews with consequences: freeze risky changes, prioritize debt, and measure change failure rate/MTTR.

Implementation checklist

  • Define Tier-0/1/2/3 by revenue/user impact and pick SLO targets that match the promise.
  • Translate SLO targets into monthly/weekly error budgets (minutes or percentage) per tier.
  • Select leading indicators for each service: latency tail, saturation, queue lag, GC pauses, dependency health, HPA churn.
  • Implement multi-window burn alerts in Prometheus and route by tier in Alertmanager with explicit actions.
  • Add Argo Rollouts/Flagger checks that query Prometheus for SLI health before and during rollout.
  • Configure Istio/Linkerd outlier detection and circuit breaking tied to budget burn.
  • Create composite SLOs that include critical dependencies (DB/Kafka/identity/feature flags).
  • Run monthly budget reviews with freeze policies and debt paydown when budgets are spent.

Questions we hear from teams

How do I pick SLO targets without a ton of historical data?
Start with your customer promise and incident history. Pick conservative targets per tier (e.g., Tier-1 at 99.9%), then monitor the actual error/latency distributions for 2–4 weeks. Calibrate based on real burn; it’s normal to revise early.
What’s the fastest way to implement burn alerts if we don’t use Sloth?
Create ‘good’ and ‘bad’ SLI counters in Prometheus and use the multi-window thresholds shown here. You can wire them manually into Alertmanager or use Sloth to generate the rules from a simple YAML SLO spec.
How do error budgets apply to batch jobs?
Define SLOs in terms of timeliness (e.g., 99% of runs finish within 60 minutes of schedule) and freshness (data no more than 24 hours stale). Burn when deadlines slip, and gate rollouts on those SLIs the same way you would for APIs.
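
For example, a timeliness SLI is just the on-time fraction of runs against the deadline:

```python
# Timeliness SLI for a batch job: fraction of runs finishing within the
# deadline (60 minutes here, per the example target above).
def timeliness_sli(run_durations_min: list[float], deadline_min: float = 60) -> float:
    on_time = sum(1 for d in run_durations_min if d <= deadline_min)
    return on_time / len(run_durations_min)

runs = [42, 55, 61, 48, 58, 72, 50, 49, 53, 47]
print(timeliness_sli(runs))  # 0.8 -- burning budget against a 99% target
```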
What about third-party SaaS dependencies?
Model them as dependencies with their own SLIs and an allocated portion of your budget. If a vendor burns your budget, you have leverage: mitigation plans, credits, or architecture changes (caching, bulkheads).
How do I prevent alert fatigue with more indicators?
Use indicators as inputs to SLO burn, not separate paging alerts. Page on burn, annotate with leading indicators, and send non-paging heads-up alerts to a low-noise channel for trend watching.
