The Legacy Service That Finally Stopped Paging Us: Progressive Observability + SLOs That Stick
You don’t harden a legacy service with a Big Bang overhaul. You level it up in weeks, not quarters—instrument, baseline, define SLOs, then enforce error budgets. Here’s the playbook that’s actually worked for us in production.
If you can’t see it, you can’t SLO it. If you don’t SLO it, you’ll ship blind and pay interest every deploy.
The fire drill you’ve lived through
You’ve got a Java 8 Spring Boot service that’s been quietly rotting since 2017. No one touches it unless it’s paging. Last week a deploy “worked on my machine,” then spent your Friday night at 3x latency because a dependency changed TLS ciphers. I’ve seen this film at marketplaces, fintechs, and even data platforms running on pets-not-cattle VMs. The fix was never a heroic rewrite. It was progressive observability plus SLOs—just enough instrumentation to see, then guardrails to keep it honest.
This is the hardened path we run at GitPlumbers when teams need results in weeks, not quarters.
Step 1: Baseline without boiling the ocean (Week 0–2)
Goal: get a 2-week read on reality with minimal changes. No platform migrations yet.
- Add service-level metrics and health endpoints.
  - If it’s Spring Boot: expose `/actuator/prometheus` plus `/livez` and `/readyz` probes.
  - If it’s Node or Python: use `prom-client` or `prometheus_client`.
- Ship logs centrally with correlation IDs.
  - Add an `X-Request-ID` header; propagate it through downstream calls.
- Start distributed traces at 1–5% sampling.
  - Use the OpenTelemetry Java Agent to avoid code churn.
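The correlation-ID bullet above can be sketched without any framework dependency. This is a minimal illustration (the `RequestId` class and the plain header maps are hypothetical; in Spring you’d put this logic in a servlet filter):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of X-Request-ID handling: reuse the inbound ID if the
// caller sent one, otherwise mint a UUID, and copy it onto outbound
// headers so downstream calls (and their logs/traces) correlate.
public class RequestId {
    static final String HEADER = "X-Request-ID";

    static String resolve(Map<String, String> inboundHeaders) {
        String id = inboundHeaders.get(HEADER);
        return (id == null || id.isEmpty()) ? UUID.randomUUID().toString() : id;
    }

    static Map<String, String> outboundHeaders(Map<String, String> inboundHeaders) {
        Map<String, String> out = new HashMap<>();
        out.put(HEADER, resolve(inboundHeaders));
        return out;
    }

    public static void main(String[] args) {
        // Caller supplied an ID: propagate it unchanged.
        System.out.println(outboundHeaders(Map.of(HEADER, "abc-123")).get(HEADER));
        // No inbound ID: one gets minted so the log line is never orphaned.
        System.out.println(outboundHeaders(Map.of()).get(HEADER));
    }
}
```

The same ID should land in your log format and as a trace attribute, so you can pivot from a log line to its trace in one hop.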
Example: attach OpenTelemetry to a legacy Spring Boot service with zero code changes:
java \
-javaagent:/opt/otel/opentelemetry-javaagent.jar \
-Dotel.resource.attributes=service.name=payments-legacy,service.version=1.52.3 \
-Dotel.exporter.otlp.endpoint=https://otel-collector.internal:4317 \
-Dotel.traces.sampler=traceidratio \
-Dotel.traces.sampler.arg=0.05 \
  -jar app.jar
Prometheus scrape for metrics:
# prometheus.yml
scrape_configs:
- job_name: payments-legacy
scrape_interval: 15s
metrics_path: /actuator/prometheus
static_configs:
      - targets: ['payments-legacy.svc.cluster.local:8080']
Centralized logs via Loki/Promtail (or Fluent Bit + Elasticsearch):
# promtail-config.yaml
scrape_configs:
- job_name: payments-legacy
static_configs:
- targets: [localhost]
labels:
job: payments-legacy
      __path__: /var/log/payments/*.log
Checkpoints by end of Week 2:
- RED metrics: `request_rate`, `error_rate`, `p95_latency` visible in Grafana.
- USE metrics for key resources (CPU, memory, thread pools, DB connections).
- 1–5% traces in Jaeger/Tempo with `X-Request-ID` correlation.
- No new alerts yet. This is still a listening phase.
What tends to bite here: teams turn on tracing at 100% and melt storage. Start small; you can ratchet sampling later.
Step 2: Define SLIs that reflect user pain, then bind SLOs (Week 2–3)
You don’t set SLOs to what’s aspirational—you set them to what the business can tolerate. We usually start with two SLIs:
- Availability: ratio of good requests over total.
- Latency: p95 or p99 under a threshold for key endpoints.
Example SLIs (Prometheus):
# Good events: HTTP 2xx/3xx
good = sum(rate(http_server_requests_seconds_count{status=~"2..|3..",service="payments-legacy"}[5m]))
# Total events:
total = sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
availability_sli = good / total
# Latency SLI: p95 over 5m windows (SLO target: under 300ms)
p95_latency = histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket{service="payments-legacy"}[5m])))
We formalize this with Sloth so SLOs live in Git and generate rules/alerts.
# slo-payments-legacy.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: payments-legacy
spec:
service: payments-legacy
labels:
team: core-payments
slos:
- name: availability
objective: 99.9
description: 3-nines availability on charge API
sli:
events:
errorQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy",status=~"5.."}[5m]))
totalQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
alerting:
name: payments-legacy-availability
labels:
severity: page
annotations:
runbook: https://runbooks.internal/payments-legacy#availability
pageAlert:
disable: false
ticketAlert:
disable: false
- name: latency
objective: 99
description: 99% of requests under 300ms
      sli:
        events:
          # Bad events: requests slower than 300ms (total minus the le="0.3" bucket)
          errorQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m])) - sum(rate(http_server_requests_seconds_bucket{service="payments-legacy",le="0.3"}[5m]))
          totalQuery: sum(rate(http_server_requests_seconds_count{service="payments-legacy"}[5m]))
alerting:
labels:
          severity: page
After you run `sloth generate` and apply the output, you get Prometheus recording rules and burn-rate alerts. Error budget for 99.9% ≈ 43m/month. Spend it wisely.
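That 43-minute figure is just arithmetic on the objective. A quick sketch of the budget math, assuming a 30-day window:

```java
// Error-budget arithmetic for an availability SLO over an N-day window.
public class ErrorBudget {
    static double budgetMinutes(double objective, int days) {
        double periodMinutes = days * 24 * 60;
        // The budget is the fraction of time you are allowed to fail.
        return periodMinutes * (1.0 - objective);
    }

    public static void main(String[] args) {
        // 99.9% over 30 days -> ~43.2 minutes of budget per month.
        System.out.println(budgetMinutes(0.999, 30));
        // Tightening to 99.99% leaves ~4.3 minutes. Know what you're signing up for.
        System.out.println(budgetMinutes(0.9999, 30));
    }
}
```

Run the second case past leadership before anyone asks for four nines.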
Checkpoints by end of Week 3:
- SLOs merged via Git, generated rules applied, Grafana panels show error budget.
- Leadership signs off that these map to real user pain.
- Alert severities and runbooks agreed (no on-call roulette).
Step 3: Make alerts actionable with multi-window burn rates (Week 3–4)
I’ve seen more teams quit SLOs because of alert fatigue than anything else. Use burn-rate alerts with fast and slow windows so you only page on real budget burn.
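Where does a factor like 14.4 come from? It’s the burn rate that consumes a meaningful chunk of a 30-day budget fast enough to justify waking someone. A sketch of the math (the 2%/1h and 10%/3d pairings are the common SRE-workbook choices, not something specific to this service):

```java
// Burn-rate factors for multi-window alerts. A burn rate of B means the
// service is eating error budget B times faster than the SLO allows,
// i.e., it consumes B * (window / period) of the total budget per window.
public class BurnRate {
    // Burn rate that consumes `budgetShare` of the budget within `windowHours`.
    static double factor(double budgetShare, double windowHours, double periodHours) {
        return budgetShare * periodHours / windowHours;
    }

    public static void main(String[] args) {
        double period = 30 * 24.0; // 720h SLO window
        // Page: 2% of the monthly budget gone in 1 hour -> ~14.4x burn.
        System.out.println(factor(0.02, 1, period));
        // Ticket: 10% gone over 3 days (72h) -> ~1x burn; worth a look, not a page.
        System.out.println(factor(0.10, 72, period));
    }
}
```
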
Prometheus rules (generated or hand-rolled):
# alerting-rules.yaml
groups:
- name: payments-legacy-slo
rules:
- alert: SLOAvailabilityBudgetBurnFast
expr: (1 - availability_sli) > (14.4 * (1 - 0.999))
for: 2m
labels: {severity: page}
annotations:
summary: Fast burn on availability SLO
runbook: https://runbooks.internal/payments-legacy#availability
- alert: SLOAvailabilityBudgetBurnSlow
        expr: (1 - avg_over_time(availability_sli[1h])) > (2 * (1 - 0.999))
for: 1h
labels: {severity: ticket}
annotations:
summary: Slow burn on availability SLO
        runbook: https://runbooks.internal/payments-legacy#availability
Alertmanager routes:
# alertmanager.yml
route:
receiver: default
routes:
- matchers:
- severity="page"
receiver: pagerduty
- matchers:
- severity="ticket"
receiver: jira
receivers:
- name: pagerduty
pagerduty_configs:
- routing_key: ${PD_KEY}
- name: jira
webhook_configs:
      - url: https://jira.internal/hooks/alert
Checkpoints by end of Week 4:
- Paging only on fast burns and severe latency regressions.
- Tickets created for slow burns. On-call load stable (<1 page/shift for this service).
- Every alert has a runbook link and a clear primary owner.
What fails here: pushing every warning to PagerDuty. Guardrails, not sirens.
Step 4: Harden the runtime: timeouts, retries, circuit breakers (Week 4–6)
Now that you can see and measure, fix the brittle bits that caused the pages.
- Enforce timeouts and bounded retries.
  - Java: `WebClient`/`RestTemplate` with `readTimeout=300ms`, `maxRetries=2`, jittered backoff.
- Add circuit breakers for flaky deps.
  - Use `resilience4j` or service mesh features (Istio/Envoy/Linkerd) if you already have them.
- Queues over sync calls for critical paths.
  - If you can’t, at least decouple with a bulkhead thread pool.
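“Jittered backoff” deserves a concrete shape, because unjittered retries synchronize clients into thundering herds. A minimal sketch of exponential backoff with full jitter (the constants are illustrative, not pulled from this service):

```java
import java.util.Random;

// Exponential backoff with full jitter: the cap bounds worst-case waits,
// and the random draw spreads retries out so clients don't stampede a
// dependency that is already struggling.
public class Backoff {
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random rnd) {
        long ceiling = Math.min(capMillis, baseMillis * (1L << attempt)); // base * 2^attempt
        return (long) (rnd.nextDouble() * ceiling); // uniform in [0, ceiling)
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int attempt = 0; attempt <= 2; attempt++) { // maxRetries=2 from above
            long d = delayMillis(attempt, 100, 2_000, rnd);
            System.out.println("attempt " + attempt + ": sleep " + d + "ms");
        }
    }
}
```

Resilience4j’s `IntervalFunction.ofExponentialRandomBackoff` gives you the same shape without hand-rolling it.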
Resilience4j example:
// Circuit breaker + retry (Resilience4j)
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
.failureRateThreshold(50)
.waitDurationInOpenState(Duration.ofSeconds(30))
.slidingWindowSize(100)
.build();
RetryConfig rConfig = RetryConfig.custom()
.maxAttempts(3)
.waitDuration(Duration.ofMillis(100))
.build();
CircuitBreaker cb = CircuitBreaker.of("card-gateway", cbConfig);
Retry retry = Retry.of("card-gateway", rConfig);
Supplier<Response> supplier = () -> client.callCardGateway(req);
Response resp = Decorators.ofSupplier(supplier)
.withCircuitBreaker(cb)
.withRetry(retry)
    .get();
Istio destination rule circuit breaker:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: card-gateway
spec:
host: card-gateway.svc.cluster.local
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
maxRequestsPerConnection: 100
outlierDetection:
consecutive5xxErrors: 5
interval: 5s
baseEjectionTime: 30s
      maxEjectionPercent: 50
Checkpoints by end of Week 6:
- p95 latency is stable under SLO during dependency hiccups (breaker opens instead of cascading).
- Error budget burn during third-party incidents reduced by >60%.
- Runbooks updated with breaker override procedures for incident comms.
Step 5: Prove it under load and failure (Week 6–8)
Nothing hardens a legacy service like rehearsals.
- Load test the happy path and the worst-offender endpoints.
  - Use `k6` or `vegeta` with production-like headers and data sizes.
- Introduce controlled chaos to validate SLOs and breaker behavior.
  - Use `chaos-mesh` or `pumba` to add latency/packet loss to downstreams.
k6 example tied to your SLO threshold (300ms):
// k6 script: payments.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
vus: 50,
duration: '10m',
thresholds: {
http_req_duration: ['p(95)<300'],
http_req_failed: ['rate<0.001'],
},
};
export default function () {
const res = http.post('https://api.internal/charge', JSON.stringify({amount: 1234}), {
headers: { 'Content-Type': 'application/json', 'X-Request-ID': `${__ITER}` },
});
check(res, { 'status is 200/201': (r) => r.status === 200 || r.status === 201 });
sleep(0.2);
}
Chaos Mesh network latency to card-gateway:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: inject-latency-card-gateway
spec:
action: delay
mode: one
selector:
namespaces: ["payments"]
labelSelectors:
app: card-gateway
delay:
latency: "200ms"
jitter: "50ms"
  duration: "10m"
Checkpoints by end of Week 8:
- SLO dashboards and burn-rate alerts respond as expected during tests.
- No paging for planned chaos; ticket created for slow burns.
- MTTR trending down (target <20m) because runbooks + traces point directly to the choke point.
Step 6: GitOps the observability so it doesn’t rot (Week 8+)
I’ve watched beautiful dashboards die with their creator. Bake it into the repo.
- SLO specs (Sloth) and alert rules as code.
- Grafana dashboards versioned via `jsonnet`/`grizzly` or the Terraform provider.
- Collector configs (OTel Collector, Promtail) templated with Helm/Kustomize.
- ArgoCD/Flux enforces drift-free deployments.
OTel Collector pipeline example:
receivers:
otlp:
protocols:
grpc:
exporters:
otlp:
endpoint: tempo:4317
loki:
endpoint: http://loki:3100/loki/api/v1/push
processors:
batch: {}
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [otlp]
logs:
receivers: [otlp]
processors: [batch]
      exporters: [loki]
Checkpoints ongoing:
- PRs change SLOs and dashboards; no click-ops.
- `promtool check rules` and CI validate everything before rollout.
- On-call feedback loop: alert descriptions and runbooks improve with every incident retro.
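A CI gate for that validation step might look like this (GitHub Actions syntax as an example; the workflow name, paths, and file layout are assumptions, and the runner needs `sloth` and `promtool` installed):

```yaml
# .github/workflows/slo-validate.yml (sketch)
name: validate-slo-rules
on:
  pull_request:
    paths: ["slo/**", "rules/**"]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate rules from Sloth specs
        run: sloth generate -i slo/slo-payments-legacy.yaml -o rules/payments-legacy.yaml
      - name: Validate rules with promtool
        run: promtool check rules rules/payments-legacy.yaml
```

Fail the PR on bad rules and ArgoCD never gets the chance to ship them.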
Results you can expect if you actually do this
From a recent GitPlumbers rescue of a payments service at a mid-market retailer:
- Time to first meaningful dashboard: 10 days.
- SLO adoption: 3 weeks to first burn alerts with runbooks.
- Paging volume: down 72% after breaker + timeout rollout.
- MTTR: from 65m to 18m in 6 weeks.
- Velocity: 25% more deploys/month because engineers trust guardrails.
Would we rewrite? Eventually. But hardening via progressive observability bought them a year to plan a sane migration instead of vibe-coding a rewrite that would’ve doubled their incident rate.
If you can’t see it, you can’t SLO it. If you don’t SLO it, you’ll ship blind and pay interest every deploy.
Where teams stumble:
- Over-instrumenting day one, then drowning in cardinality.
- Aspirational SLOs with zero product buy-in.
- Alerts without owners or runbooks.
- Tracing on in dev only—then surprised when prod is a black box.
What I’d do differently next time
- Start SLOs on one critical endpoint, not the whole service. Prove value fast.
- Make product attend the first two error-budget reviews. Tie budget burn to roadmap tradeoffs.
- Add a “boring checklist” to every PR touching this service: timeouts, idempotency, metrics, trace attributes.
- If AI-generated code snuck into the service during crunch time, schedule a vibe code cleanup pass: remove inline sleeps, unbounded retries, and mystery global state. Tie it to your SLOs so it’s not a philosophical debate.
If you want an outside pair who has shipped this playbook under fire, that’s what we do at GitPlumbers. We don’t sell dashboards—we sell fewer 2 a.m. pages.
Key takeaways
- Start with a 2-week baseline and a thin slice of instrumentation—don’t boil the ocean.
- Define SLIs that reflect user pain (availability and latency) and wire them into SLOs with error budgets.
- Use multi-window burn-rate alerts to keep noise down and actionability up.
- Automate dashboards/alerts via GitOps so this doesn’t rot after the hero leaves.
- Harden under load with timeouts, retries, circuit breakers, and real chaos tests.
- Track business impact with fewer pages, better MTTR, and stable velocity—not just pretty dashboards.
Implementation checklist
- Stakeholder-aligned SLIs and SLOs documented in repo
- Prometheus scraping core service + RED/USE coverage
- Distributed tracing sampling configured (1-10%)
- Centralized logs with correlation IDs
- Sloth SLO specs + generated Prometheus recording/alerting rules
- Grafana dashboards with p50/p95/p99 latency, RPS, saturation, error budget
- Runbooks linked from alerts
- Canary + circuit breaker policies in place
- Monthly error budget review with product
Questions we hear from teams
- What if our legacy service isn’t on Kubernetes?
- No problem. The same approach works on VMs. Use node exporters, run Prometheus as a service, ship logs via Fluent Bit to Loki/Elastic, and expose an HTTP metrics endpoint. You can still attach the OpenTelemetry Java agent and point to a collector over TCP. GitOps can be achieved with Ansible or Terraform plus a CI runner.
- How do we pick the right latency threshold for SLOs?
- Start with user experience and current performance. Look at p95 from your 2-week baseline and talk to product about what’s acceptable. If p95 is 420ms today and users notice slowness beyond ~500ms, set an initial SLO of 99% under 500ms, then tighten once breakers/timeouts are in.
- Won’t tracing be too expensive?
- Not if you sample intelligently. Start with 1–5%, use tail-based sampling on anomalies if your collector supports it, and drop high-cardinality attributes. Retain spans for shorter periods (24–72h) and metrics longer (30–90d).
- What’s the fastest way to get SLOs into Git?
- Use Sloth. Define availability/latency SLOs in YAML, `sloth generate` to produce Prometheus rules and alerts, and let ArgoCD/Flux roll them out. Add dashboards as code so panels don’t drift.
- How do we align SLOs with business impact?
- Run a monthly error budget review with product. If you’re burning budget, pause feature work to harden. If you’re consistently under budget, you can accelerate. Tie these decisions to real incidents and user-facing metrics (conversion, drop-offs).
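On the tail-based sampling point above: if you run the OpenTelemetry Collector (contrib distribution), the `tail_sampling` processor can keep every error trace while sampling the boring ones. A collector snippet as a sketch (policy names and percentages are illustrative):

```yaml
# otel-collector config fragment (sketch; requires the contrib build)
processors:
  tail_sampling:
    decision_wait: 10s   # buffer spans before deciding per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Wire it into the traces pipeline between the receiver and exporter, and your error traces survive even at aggressive baseline sampling.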
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
