The Legacy Service That Finally Stopped Paging Us: Progressive Observability + SLOs That Stick

You don’t harden a legacy service with a Big Bang overhaul. You level it up in weeks, not quarters—instrument, baseline, define SLOs, then enforce error budgets. Here’s the playbook that’s actually worked for us in production.

If you can’t see it, you can’t SLO it. If you don’t SLO it, you’ll ship blind and pay interest every deploy.

The fire drill you’ve lived through

You’ve got a Java 8 Spring Boot service that’s been quietly rotting since 2017. No one touches it unless it’s paging. Last week a deploy “worked on my machine,” then you spent your Friday night at 3x latency because a dependency changed TLS ciphers. I’ve seen this film at marketplaces, fintechs, and even data platforms running on pets-not-cattle VMs. The fix was never a heroic rewrite. It was progressive observability plus SLOs—just enough instrumentation to see, then guardrails to keep it honest.

This is the hardened path we run at GitPlumbers when teams need results in weeks, not quarters.

Step 1: Baseline without boiling the ocean (Week 0–2)

Goal: get a 2-week read on reality with minimal changes. No platform migrations yet.

  1. Add service-level metrics and health endpoints.
    • If it’s Spring Boot: expose /actuator/prometheus plus /livez and /readyz health endpoints.
    • If it’s Node or Python: use prom-client or prometheus_client.
  2. Ship logs centrally with correlation IDs.
    • Add an X-Request-ID header; propagate through downstream calls.
  3. Start distributed traces at 1–5% sampling.
    • Use the OpenTelemetry Java Agent to avoid code churn.

Example: attach OpenTelemetry to a legacy Spring Boot service with zero code changes:

java \
  -javaagent:/opt/otel/opentelemetry-javaagent.jar \
  -Dotel.resource.attributes=service.name=payments-legacy,service.version=1.52.3 \
  -Dotel.exporter.otlp.endpoint=https://otel-collector.internal:4317 \
  -Dotel.traces.sampler=traceidratio \
  -Dotel.traces.sampler.arg=0.05 \
  -jar app.jar
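For step 2’s correlation IDs, the generate-or-propagate logic is small enough to sketch. Names here are hypothetical; wire it into whatever servlet filter or client interceptor your stack uses:

```java
import java.util.UUID;

// Hypothetical helper: reuse the caller's X-Request-ID so logs and traces
// join up across services; mint a new one only at the edge.
public class RequestId {
    public static final String HEADER = "X-Request-ID";

    public static String resolve(String incoming) {
        // Java 8 compatible: no String.isBlank()
        return (incoming == null || incoming.trim().isEmpty())
                ? UUID.randomUUID().toString()
                : incoming;
    }
}
```

The important property is that an ID is present on every hop, whether or not the caller sent one, so a single grep or trace query follows the request end to end.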

Prometheus scrape for metrics:

# prometheus.yml
scrape_configs:
  - job_name: payments-legacy
    scrape_interval: 15s
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['payments-legacy.svc.cluster.local:8080']

Centralized logs via Loki/Promtail (or Fluent Bit + Elasticsearch):

# promtail-config.yaml
scrape_configs:
  - job_name: payments-legacy
    static_configs:
      - targets: [localhost]
        labels:
          job: payments-legacy
          __path__: /var/log/payments/*.log

Checkpoints by end of Week 2:

  • RED metrics: request_rate, error_rate, p95_latency visible in Grafana.
  • USE metrics for key resources (CPU, memory, thread pools, DB connections).
  • 1–5% traces in Jaeger/Tempo with X-Request-ID correlation.
  • No new alerts yet—this is still a listening phase.

What tends to bite here: teams turn on tracing at 100% and melt storage. Start small; you can ratchet sampling later.

Step 2: Define SLIs that reflect user pain, then bind SLOs (Week 2–3)

You don’t set SLOs to what’s aspirational—you set them to what the business can tolerate. We usually start with two SLIs:

  • Availability: ratio of good requests over total.
  • Latency: p95 or p99 under a threshold for key endpoints.

Example SLIs (Prometheus):

# Good events: HTTP 2xx/3xx
good = sum(rate(http_server_requests_seconds_count{status=~"2..|3..",service="payments-legacy"}[5m]))
# Total events:
total = sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
availability_sli = good / total

# Latency SLI: proportion of requests under 300ms (requires a 0.3s histogram bucket)
latency_sli = sum(rate(http_server_requests_seconds_bucket{le="0.3",service="payments-legacy"}[5m])) / sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
# For dashboards: p95 latency over 5m windows
p95_latency = histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket{service="payments-legacy"}[5m])))

We formalize this with Sloth so SLOs live in Git and generate rules/alerts.

# slo-payments-legacy.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-legacy
spec:
  service: payments-legacy
  labels:
    team: core-payments
  slos:
    - name: availability
      objective: 99.9
      description: 3-nines availability on charge API
      sli:
        events:
          errorQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy",status=~"5.."}[5m]))
          totalQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
      alerting:
        name: payments-legacy-availability
        labels:
          severity: page
        annotations:
          runbook: https://runbooks.internal/payments-legacy#availability
        pageAlert:
          disable: false
        ticketAlert:
          disable: false
    - name: latency
      objective: 99
      description: 99% of requests under 300ms
      sli:
        events:
          # "Bad" events: requests slower than 300ms (total minus the le="0.3" bucket)
          errorQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m])) - sum(rate(http_server_requests_seconds_bucket{service="payments-legacy",le="0.3"}[5m]))
          totalQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
      alerting:
        name: payments-legacy-latency
        labels:
          severity: page

After you run `sloth generate` and apply the output, you get Prometheus recording rules and burn-rate alerts. The error budget for 99.9% is about 43 minutes per 30-day month. Spend it wisely.
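The budget math is worth internalizing, because every burn-rate multiplier later derives from it. A quick sketch (illustrative helper, not a library API):

```java
// Error budget math for an availability SLO over a rolling window.
public class ErrorBudget {
    // Minutes of full downtime the objective tolerates over the window.
    public static double budgetMinutes(double objectivePct, double windowDays) {
        double allowedFailureRatio = 1.0 - objectivePct / 100.0;
        return windowDays * 24 * 60 * allowedFailureRatio;
    }

    public static void main(String[] args) {
        // 99.9% over 30 days: 30 * 24 * 60 * 0.001 = 43.2 minutes
        System.out.printf("%.1f%n", budgetMinutes(99.9, 30)); // prints 43.2
        // A sustained 14.4x burn rate exhausts the budget in 30/14.4 days,
        // roughly 2 days, which is why 14.4x pages immediately.
        System.out.printf("%.1f%n", 30.0 / 14.4);
    }
}
```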

Checkpoints by end of Week 3:

  • SLOs merged via Git, generated rules applied, Grafana panels show error budget.
  • Leadership signs off that these map to real user pain.
  • Alert severities and runbooks agreed (no on-call roulette).

Step 3: Make alerts actionable with multi-window burn rates (Week 3–4)

I’ve seen more teams quit SLOs because of alert fatigue than anything else. Use burn-rate alerts with fast and slow windows so you only page on real budget burn.

Prometheus rules (generated or hand-rolled):

# alerting-rules.yaml
groups:
- name: payments-legacy-slo
  # assumes availability_sli exists as a recording rule built from the Step 2 queries
  rules:
  - alert: SLOAvailabilityBudgetBurnFast
    expr: (1 - availability_sli) > (14.4 * (1 - 0.999))
    for: 2m
    labels: {severity: page}
    annotations:
      summary: Fast burn on availability SLO
      runbook: https://runbooks.internal/payments-legacy#availability
  - alert: SLOAvailabilityBudgetBurnSlow
    expr: (1 - avg_over_time(availability_sli[1h])) > (2 * (1 - 0.999))
    for: 1h
    labels: {severity: ticket}
    annotations:
      summary: Slow burn on availability SLO
      runbook: https://runbooks.internal/payments-legacy#availability
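If you want the stricter SRE-workbook shape, a hedged sketch of a true multi-window rule follows; it assumes availability_sli exists as a recording rule:

```yaml
# Page only when BOTH the short and long windows burn fast; brief blips
# clear the 5m window but never move the 1h one, so they don't page.
- alert: SLOAvailabilityBurnRateCritical
  expr: |
    (1 - avg_over_time(availability_sli[5m])) > (14.4 * (1 - 0.999))
    and
    (1 - avg_over_time(availability_sli[1h])) > (14.4 * (1 - 0.999))
  labels: {severity: page}
  annotations:
    summary: Multi-window fast burn on availability SLO
```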

Alertmanager routes:

# alertmanager.yml
route:
  receiver: default
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty
    - matchers:
        - severity="ticket"
      receiver: jira
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PD_KEY}
  - name: jira
    webhook_configs:
      - url: https://jira.internal/hooks/alert

Checkpoints by end of Week 4:

  • Paging only on fast burns and severe latency regressions.
  • Tickets created for slow burns. On-call load stable (<1 page/shift for this service).
  • Every alert has a runbook link and a clear primary owner.

What fails here: pushing every warning to PagerDuty. Guardrails, not sirens.

Step 4: Harden the runtime: timeouts, retries, circuit breakers (Week 4–6)

Now that you can see and measure, fix the brittle bits that caused the pages.

  1. Enforce timeouts and bounded retries.
    • Java WebClient/RestTemplate with readTimeout=300ms, maxRetries=2, jittered backoff.
  2. Add circuit breakers for flaky deps.
    • resilience4j or service mesh features (Istio/Envoy/Linkerd) if you already have them.
  3. Queues over sync calls for critical paths.
    • If you can’t, at least decouple with a bulkhead thread pool.
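For item 1, a minimal sketch of hard client-side deadlines using the JDK's built-in java.net.http client (Java 11+; the endpoint and values are illustrative, and on Java 8 you'd set the same limits on RestTemplate's request factory instead):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class BoundedClient {
    // Fail fast on connect; a hung dependency should not eat a worker thread.
    public static HttpClient client() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))
                .build();
    }

    // Hard per-request deadline aligned with the 300ms latency SLO.
    public static HttpRequest chargeRequest() {
        return HttpRequest.newBuilder(URI.create("https://card-gateway.internal/charge"))
                .timeout(Duration.ofMillis(300))
                .POST(HttpRequest.BodyPublishers.ofString("{\"amount\":1234}"))
                .build();
    }
}
```

The deadline matching the SLO threshold is deliberate: a request that can't finish inside the SLO is already a bad event, so waiting longer only amplifies the damage.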

Resilience4j example:

// Circuit breaker + jittered retry (Resilience4j)
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // open at 50% failures
    .waitDurationInOpenState(Duration.ofSeconds(30)) // probe again after 30s
    .slidingWindowSize(100)                          // judged over last 100 calls
    .build();
RetryConfig rConfig = RetryConfig.custom()
    .maxAttempts(3)
    // exponential backoff from 100ms with 50% jitter, so retries don't stampede
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(100, 2.0, 0.5))
    .build();

CircuitBreaker cb = CircuitBreaker.of("card-gateway", cbConfig);
Retry retry = Retry.of("card-gateway", rConfig);
Supplier<Response> supplier = () -> client.callCardGateway(req);
// Retry wraps the breaker, so retried failures still count against its window
Response resp = Decorators.ofSupplier(supplier)
    .withCircuitBreaker(cb)
    .withRetry(retry)
    .get();

Istio destination rule circuit breaker:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: card-gateway
spec:
  host: card-gateway.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Checkpoints by end of Week 6:

  • p95 latency is stable under SLO during dependency hiccups (breaker opens instead of cascading).
  • Error budget burn during third-party incidents reduced by >60%.
  • Runbooks updated with breaker override procedures for incident comms.

Step 5: Prove it under load and failure (Week 6–8)

Nothing hardens a legacy service like rehearsals.

  • Load test the happy path and worst offender endpoints.
    • Use k6 or vegeta with production-like headers and data sizes.
  • Introduce controlled chaos to validate SLOs and breaker behavior.
    • chaos-mesh or pumba to add latency/packet loss to downstreams.

k6 example tied to your SLO threshold (300ms):

// k6 script: payments.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
  vus: 50,
  duration: '10m',
  thresholds: {
    http_req_duration: ['p(95)<300'],
    http_req_failed: ['rate<0.001'],
  },
};
export default function () {
  const res = http.post('https://api.internal/charge', JSON.stringify({amount: 1234}), {
    headers: { 'Content-Type': 'application/json', 'X-Request-ID': `${__VU}-${__ITER}` },
  });
  check(res, { 'status is 200/201': (r) => r.status === 200 || r.status === 201 });
  sleep(0.2);
}

Chaos Mesh network latency to card-gateway:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: inject-latency-card-gateway
spec:
  action: delay
  mode: one
  selector:
    namespaces: ["payments"]
    labelSelectors:
      app: card-gateway
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "10m"

Checkpoints by end of Week 8:

  • SLO dashboards and burn-rate alerts respond as expected during tests.
  • No paging for planned chaos; ticket created for slow burns.
  • MTTR trending down (target <20m) because runbooks + traces point directly to the choke point.

Step 6: GitOps the observability so it doesn’t rot (Week 8+)

I’ve watched beautiful dashboards die with their creator. Bake it into the repo.

  • SLO specs (Sloth) and alert rules as code.
  • Grafana dashboards versioned via jsonnet/grizzly or Terraform provider.
  • Collector configs (OTel Collector, Promtail) templated with Helm/Kustomize.
  • ArgoCD/Flux enforces drift-free deployments.

OTel Collector pipeline example:

receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    endpoint: tempo:4317
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
processors:
  batch: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Checkpoints ongoing:

  • PRs change SLOs and dashboards—no click-ops.
  • `promtool check rules` runs in CI to validate everything before rollout.
  • On-call feedback loop: alert descriptions and runbooks improve every incident retro.

Results you can expect if you actually do this

From a recent GitPlumbers rescue of a payments service at a mid-market retailer:

  • Time to first meaningful dashboard: 10 days.
  • SLO adoption: 3 weeks to first burn alerts with runbooks.
  • Paging volume: down 72% after breaker + timeout rollout.
  • MTTR: from 65m to 18m in 6 weeks.
  • Velocity: 25% more deploys/month because engineers trust guardrails.

Would we rewrite? Eventually. But hardening via progressive observability bought them a year to plan a sane migration instead of vibe-coding a rewrite that would’ve doubled their incident rate.

If you can’t see it, you can’t SLO it. If you don’t SLO it, you’ll ship blind and pay interest every deploy.

Where teams stumble:

  • Over-instrumenting day one, then drowning in cardinality.
  • Aspirational SLOs with zero product buy-in.
  • Alerts without owners or runbooks.
  • Tracing on in dev only—then surprised when prod is a black box.

What I’d do differently next time

  • Start SLOs on one critical endpoint, not the whole service. Prove value fast.
  • Make product attend the first two error-budget reviews. Tie budget burn to roadmap tradeoffs.
  • Add a “boring checklist” to every PR touching this service: timeouts, idempotency, metrics, trace attributes.
  • If AI-generated code snuck into the service during crunch time, schedule a vibe code cleanup pass: remove inline sleeps, unbounded retries, and mystery global state. Tie it to your SLOs so it’s not a philosophical debate.

If you want an outside pair who has shipped this playbook under fire, that’s what we do at GitPlumbers. We don’t sell dashboards—we sell fewer 2 a.m. pages.


Key takeaways

  • Start with a 2-week baseline and a thin slice of instrumentation—don’t boil the ocean.
  • Define SLIs that reflect user pain (availability and latency) and wire them into SLOs with error budgets.
  • Use multi-window burn-rate alerts to keep noise down and actionability up.
  • Automate dashboards/alerts via GitOps so this doesn’t rot after the hero leaves.
  • Harden under load with timeouts, retries, circuit breakers, and real chaos tests.
  • Track business impact with fewer pages, better MTTR, and stable velocity—not just pretty dashboards.

Implementation checklist

  • Stakeholder-aligned SLIs and SLOs documented in repo
  • Prometheus scraping core service + RED/USE coverage
  • Distributed tracing sampling configured (1–5%)
  • Centralized logs with correlation IDs
  • Sloth SLO specs + generated Prometheus recording/alerting rules
  • Grafana dashboards with p50/p95/p99 latency, RPS, saturation, error budget
  • Runbooks linked from alerts
  • Canary + circuit breaker policies in place
  • Monthly error budget review with product

Questions we hear from teams

What if our legacy service isn’t on Kubernetes?
No problem. The same approach works on VMs. Use node exporters, run Prometheus as a service, ship logs via Fluent Bit to Loki/Elastic, and expose an HTTP metrics endpoint. You can still attach the OpenTelemetry Java agent and point it at a collector over OTLP/gRPC. GitOps can be achieved with Ansible or Terraform plus a CI runner.
How do we pick the right latency threshold for SLOs?
Start with user experience and current performance. Look at p95 from your 2-week baseline and talk to product about what’s acceptable. If p95 is 420ms today and users notice slowness beyond ~500ms, set an initial SLO of 99% under 500ms, then tighten once breakers/timeouts are in.
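Assuming your Prometheus retention covers the baseline window, one query gives you that anchor number:

```
# p95 over the whole 14-day baseline: one number to take into the SLO conversation
histogram_quantile(0.95,
  sum by (le) (rate(http_server_requests_seconds_bucket{service="payments-legacy"}[14d])))
```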
Won’t tracing be too expensive?
Not if you sample intelligently. Start with 1–5%, use tail-based sampling on anomalies if your collector supports it, and drop high-cardinality attributes. Retain spans for shorter periods (24–72h) and metrics longer (30–90d).
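If you run the OpenTelemetry Collector's contrib distribution, tail-based sampling looks roughly like this (policy names are illustrative):

```yaml
# tail_sampling ships with the collector-contrib distribution, not core
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 300}   # matches the latency SLO threshold
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```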
What’s the fastest way to get SLOs into Git?
Use Sloth. Define availability/latency SLOs in YAML, `sloth generate` to produce Prometheus rules and alerts, and let ArgoCD/Flux roll them out. Add dashboards as code so panels don’t drift.
How do we align SLOs with business impact?
Run a monthly error budget review with product. If you’re burning budget, pause feature work to harden. If you’re consistently under budget, you can accelerate. Tie these decisions to real incidents and user-facing metrics (conversion, drop-offs).

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about an Observability Jumpstart · Read the Retail SLO Overhaul Case Study
