Harden That Legacy Service: A 6‑Week, Progressive Observability + SLO Playbook
You don’t need a rewrite to stop 3AM pages. Ship a pragmatic, staged observability stack, define SLOs that matter, and use error budgets to harden what you already have.
Boring, progressive observability beats heroic rewrites—because SLOs make the boring work pay for itself.
The 3AM Incident You’ve Already Lived
You’ve got a legacy service—10+ years of “don’t touch that code” wisdom—running under a load balancer, talking to a database that predates the interns. At 3:07AM, latency spikes, error rate creeps, PagerDuty lights up, and the only dashboards you have are CPU and a disk-full alert that’s been flapping since 2021. I’ve watched teams try to fix this with a rewrite, a service mesh, or an AI observability platform (pick your buzzword). Most of it fails because it’s not progressive, not tied to SLOs, and not wired into deploy decisions.
Here’s the pragmatic playbook we use at GitPlumbers to harden legacy systems without boiling the ocean. Six weeks, staged adoption, measurable outcomes.
Step 1: Baseline Without Touching Code
Goal: get signal now. No code changes. No tickets to legacy owners. Just enough visibility to stop guessing.
- Black-box checks with blackbox_exporter against your top 3 endpoints
- System metrics with node_exporter (or windows_exporter)
- Ingress metrics from nginx_exporter/haproxy_exporter/ALB metrics
- Access logs shipped via fluent-bit or vector to Loki or Datadog
- Optional: eBPF (Pixie, Parca) if agents are easier than changing code
Quick start Prometheus scrape (Kubernetes):
# prometheus-additional-scrape.yaml
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://legacy.example.com/healthz
          - https://legacy.example.com/api/v1/orders
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
Shipping NGINX access logs with fluent-bit:
# fluent-bit.conf
[INPUT]
    Name    tail
    Path    /var/log/nginx/access.log
    Parser  nginx
[FILTER]
    Name    grep
    Match   *
    Regex   log .*\/(api|healthz).*
[OUTPUT]
    Name    loki
    Match   *
    host    loki.loki.svc
    labels  {job="nginx", env="prod"}
Checkpoint by end of Week 1:
- Synthetic uptime visible, P50/P95 latency from ingress metrics
- Top endpoints and 5xx rate trending
- A Grafana dashboard with RED metrics: rate(http_requests_total{code=~"5.."}[5m]) and histogram_quantile(0.95, ...)
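If you want the full dashboard queries rather than the abbreviated forms above, a minimal RED panel set looks roughly like this, using the nginx ingress metrics this post leans on later (adjust metric and label names to whatever your exporter actually emits):
# Rate: requests per second through the ingress
sum(rate(nginx_http_requests_total[5m]))
# Errors: fraction of responses that are 5xx
sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m]))
# Duration: P95 latency from the ingress histogram
histogram_quantile(0.95, sum by (le) (rate(nginx_request_duration_seconds_bucket[5m])))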
Step 2: Define SLOs That Matter (Before You Add More Telemetry)
Don’t instrument blindly. Pick one user journey. Define availability and latency targets that reflect pain your business actually feels.
- Example service: GET /api/v1/orders powering checkout
- Availability SLO: 99.5% over 30 days
- Latency SLO: 95% of requests under 300ms over 30 days
- Budget: 0.5% unavailability ≈ 3h 36m/month
Prometheus SLI candidates (assuming ingress metrics):
# Requests considered successful (non-5xx)
sum(rate(nginx_http_requests_total{route="/api/v1/orders",status!~"5.."}[5m]))
/
sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[5m]))
Latency SLI from histograms:
histogram_quantile(0.95, sum by (le) (rate(nginx_request_duration_seconds_bucket{route="/api/v1/orders"}[5m]))) < 0.3
Pro tip: choose SLO windows you can actually enforce. If you page on a 30-day objective, you’ll never act. Use burn-rate alerts over short windows to detect real incidents.
Checkpoint:
- Two SLIs defined, queryable
- SLO targets agreed with product/ops
- Error-budget sheet with simple math everyone understands
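The “simple math” can live right next to the SLIs. A sketch of the budget arithmetic plus a budget-remaining query, assuming the same ingress metrics as above:
# Budget: (1 - 0.995) * 30 days = 0.005 * 43,200 min = 216 min ≈ 3h 36m of allowed unavailability
# Error budget remaining over the 30-day window (1 = untouched, 0 = exhausted)
1 - (
  (1 - (
    sum(rate(nginx_http_requests_total{route="/api/v1/orders",status!~"5.."}[30d]))
    /
    sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[30d]))
  ))
  /
  (1 - 0.995)
)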
Step 3: Add the Pipe — OTel Collector To Fan-In/Fan-Out
Now standardize ingestion. The opentelemetry-collector lets you bring metrics/logs/traces into one place and ship to whatever backend you can pay for or self-host.
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: legacy-nginx
          static_configs:
            - targets: ['nginx.ingress.svc:9113']
  loki:
    protocols:
      http:
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus-us1.grafana.net/api/prom/push
    headers: { Authorization: 'Bearer ${GRAFANA_CLOUD_RW}' }
  otlphttp/tempo:
    endpoint: https://tempo-us1.grafana.net/otlp
    headers: { Authorization: 'Bearer ${GRAFANA_CLOUD_OTLP}' }
  loki:
    endpoint: https://logs-prod3.grafana.net
    headers: { Authorization: 'Bearer ${GRAFANA_CLOUD_LOKI}' }
processors:
  batch:
  attributes:
    actions:
      - key: env
        value: prod
        action: upsert
service:
  pipelines:
    metrics: { receivers: [prometheus, otlp], processors: [batch], exporters: [prometheusremotewrite] }
    logs: { receivers: [loki], processors: [batch], exporters: [loki] }
    traces: { receivers: [otlp], processors: [batch, attributes], exporters: [otlphttp/tempo] }
If you’re all-in on Datadog or New Relic, use their OTLP endpoints. Keep vendor lock-in out of your code; keep it in routing.
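For instance, landing on Datadog is an exporter swap, not a code change. Datadog ingest from the collector typically goes through the contrib distribution’s datadog exporter (or the Datadog Agent’s OTLP receiver); a minimal sketch, with the API key and site as placeholders:
# otel-collector.yaml fragment: swap the exporter, keep the receivers and processors
exporters:
  datadog:
    api:
      site: datadoghq.com
      key: ${DD_API_KEY}
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch, attributes], exporters: [datadog] }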
Checkpoint:
- One config path to add/remove backends
- All telemetry tagged with service, env, and version
Step 4: Minimal Tracing Without Rewriting The World
We’re still avoiding invasive code changes. Use auto-instrumentation or ingress sampling to get distributed traces around the hot path.
- JVM: opentelemetry-javaagent.jar
- Python: opentelemetry-instrument
- .NET: OpenTelemetry.AutoInstrumentation package
Example: Java service with zero code changes:
java -javaagent:/opt/otel/opentelemetry-javaagent.jar \
-Dotel.service.name=legacy-orders \
-Dotel.traces.exporter=otlp \
-Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
-Dotel.resource.attributes=env=prod,team=payments \
-jar legacy-orders.jar
Python (Flask) quick win:
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
OTEL_SERVICE_NAME=legacy-orders \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument gunicorn app:app
Checkpoint:
- P50/P95 spans through ingress → service → DB visible in Jaeger/Tempo
- Top 3 slow spans identified (DB calls, external API)
Step 5: Codify SLOs And Alerts (Sloth + Burn Rates)
Stop hand-editing alert rules at 2AM. Use SLO-as-Code. Sloth generates Prometheus recording/alerting rules from a YAML spec. Version it, review it, ship it via GitOps.
# slo-orders.yaml (Sloth)
version: "prometheus/v1"
service: legacy-orders
labels: {env: prod, team: payments}
slos:
  - name: availability
    objective: 99.5
    description: Successful requests to /api/v1/orders
    sli:
      events:
        error_query: sum(rate(nginx_http_requests_total{route="/api/v1/orders",status=~"5.."}[{{.window}}]))
        total_query: sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[{{.window}}]))
    alerting:
      name: orders-availability
      page_alert:
        labels: {severity: page}
        annotations: {runbook: https://runbooks/legacy-orders}
      ticket_alert:
        labels: {severity: ticket}
        annotations: {owner: payments}
Generate rules:
sloth generate -i slo-orders.yaml -o prometheus-rules-orders.yaml
kubectl apply -f prometheus-rules-orders.yaml
Multi-window burn-rate alert (if you roll your own):
# Page if we’ll exhaust budget fast (1h/5m)
(
sum(rate(errors[5m])) / sum(rate(total[5m]))
) > (14.4 * (1 - 0.995))
Use two alerts: fast-burn (5m/1h) to catch active incidents; slow-burn (30m/6h) to catch degradation.
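Rolled out as Prometheus rules, the pair looks roughly like this. The recording-rule names (orders:error_ratio:rate5m and friends) are illustrative; they assume you precompute the error ratio over each window from the SLI queries in Step 2:
# burn-rate-alerts.yaml (sketch; Sloth generates the equivalent for you)
groups:
  - name: legacy-orders-slo-burn
    rules:
      - alert: OrdersErrorBudgetFastBurn
        expr: |
          orders:error_ratio:rate5m > (14.4 * (1 - 0.995))
          and
          orders:error_ratio:rate1h > (14.4 * (1 - 0.995))
        labels:
          severity: page
      - alert: OrdersErrorBudgetSlowBurn
        expr: |
          orders:error_ratio:rate30m > (6 * (1 - 0.995))
          and
          orders:error_ratio:rate6h > (6 * (1 - 0.995))
        labels:
          severity: ticket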
Checkpoint:
- SLO dashboards visible to product/ops
- Burn-rate alerts paging humans only when the budget is at risk
Step 6: Harden Where The SLOs Hurt — Timeouts, Retries, Circuit Breakers
Now you know where it hurts. Don’t guess—apply controls at the boundaries.
- Timeouts: set defaults at ingress or mesh; override per route
- Retries: limited, with jitter/backoff; never retry non-idempotent requests
- Circuit breakers: trip on failure rates or concurrency saturation
- Queuing and shed load: protect upstreams with 429 + exponential backoff in clients
Envoy example (works via Istio/EnvoyFilter or standalone Envoy):
# envoy-route.yaml
route_config:
  name: local_route
  virtual_hosts:
    - name: legacy
      domains: ["legacy.example.com"]
      routes:
        - match: { prefix: "/api/v1/orders" }
          route:
            cluster: legacy-orders
            timeout: 0.5s
            retry_policy:
              retry_on: 5xx,reset,connect-failure
              num_retries: 2
              per_try_timeout: 0.2s
            max_stream_duration: { max_stream_duration: 1s }
clusters:
  - name: legacy-orders
    type: STATIC
    load_assignment: { ... }
    circuit_breakers:
      thresholds:
        - max_connections: 2000
          max_pending_requests: 500
          max_requests: 3000
JVM-side, without touching all call sites: resilience4j with Spring Boot Actuator metrics wired to Prometheus.
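A minimal sketch of that setup, assuming resilience4j-spring-boot and micrometer-registry-prometheus are on the classpath. The instance name ordersDb is an example, and you still annotate the handful of outbound boundaries (e.g. @CircuitBreaker(name = "ordersDb")) rather than every call site:
# application.yml (sketch)
resilience4j:
  circuitbreaker:
    instances:
      ordersDb:
        slidingWindowSize: 50
        failureRateThreshold: 50          # open the breaker at 50% failures
        waitDurationInOpenState: 30s
  retry:
    instances:
      ordersDb:
        maxAttempts: 2
        waitDuration: 100ms
        enableExponentialBackoff: true
  timelimiter:
    instances:
      ordersDb:
        timeoutDuration: 500ms
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus        # expose /actuator/prometheus for scraping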
Tie deploys to error budgets with Argo Rollouts:
# argo-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: legacy-orders
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: slo-burn-ok
            args:
              - name: service
                value: legacy-orders
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: slo-burn-ok
        - setWeight: 100
The AnalysisTemplate queries Prometheus for burn-rate; rollback if failing.
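The slo-burn-ok template isn’t shown above; a minimal sketch that reuses the fast-burn threshold from Step 5 (the Prometheus address, metric names, and threshold are assumptions about your setup):
# analysis-template.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-burn-ok
spec:
  args:
    - name: service          # passed by the Rollout; use it in the query if your metrics are labeled by service
  metrics:
    - name: availability-fast-burn
      interval: 1m
      failureLimit: 1
      # Fail the canary if the error ratio exceeds the fast-burn threshold: 14.4 * (1 - 0.995) = 0.072
      successCondition: result[0] < 0.072
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(nginx_http_requests_total{route="/api/v1/orders",status=~"5.."}[5m]))
            /
            sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[5m]))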
Checkpoint:
- Automatic rollback of bad canaries
- Latency/availability back under SLO without heroics
Step 7: Show The Before/After And Keep Paying Down Risk
What we typically see within 6 weeks (real numbers from a payments client):
- MTTR down from 72m → 18m (−75%) after burn-rate alerts + tracing
- Pages per week down 60% after removing noisy CPU/disk alerts and gating deploys
- P95 latency for /api/v1/orders down 35% after adding upstream timeouts and a small cache
- Deployment lead time unchanged; failure impact reduced via canary + rollback
Ongoing habits that stick:
- Review error-budget status in weekly ops; decide on feature flags or freeze scopes
- Keep SLOs in Git next to infra (Terraform + ArgoCD) and review like code
- Run quarterly game days (Chaos Mesh/Litmus) targeting the actual SLOs
- Use SLO drift to prioritize tech debt over vibe code cleanup and AI-generated “fixes” that don’t move the needle
If I had to do it over again: start SLOs sooner, push OTel auto-instrumentation earlier, and never ship a canary without a burn-rate guardrail. This is boring engineering—and it’s why it works.
Key takeaways
- Start with black-box and system-level metrics; don’t block on code changes.
- Define 1–2 business-facing SLOs first; compute SLIs from traffic you already have.
- Use Sloth or Grafana SLO to codify SLOs and generate burn-rate alerts.
- Adopt OpenTelemetry via sidecars/auto-instrumentation before touching legacy code.
- Tie deploys to error-budget spend using Argo Rollouts and Prometheus checks.
- Harden with timeouts/retries/circuit breakers where SLOs show pain, not everywhere.
- Measure success by reduced MTTR, fewer pages, and error-budget health.
Implementation checklist
- Pick one critical user journey and define availability and latency SLOs.
- Stand up Prometheus + Grafana (or Grafana Cloud/Datadog) and Blackbox + Node exporters.
- Ship access logs to Loki/Datadog with structured fields and sampling.
- Add OpenTelemetry Collector to fan-in metrics/logs/traces; export to your chosen backend.
- Codify SLOs with Sloth and enable multi-window burn-rate alerting.
- Introduce timeouts/retries/circuit breakers at ingress or service mesh level.
- Gate canaries on SLO burn-rate and rollback automatically if budgets burn.
Questions we hear from teams
- We’re on VMs, not Kubernetes. Does this still work?
- Yes. Run Prometheus on a small VM, deploy blackbox and node exporters as services, ship logs with Vector/Fluent Bit, and run the OTel Collector as a systemd service. The configs are nearly identical—only discovery changes.
- Do we need traces before SLOs?
- No. Define SLOs using ingress metrics first. Traces help reduce MTTR and pinpoint hot spans, but you can page on burn-rate using just metrics.
- What if the legacy app can’t load an agent?
- Use ingress metrics + blackbox probes for SLIs and add sidecar proxies (Envoy/NGINX) to capture latency/error data. For traces, use mesh-generated spans or eBPF tools like Pixie to infer edges.
- Should we centralize on one vendor or self-host?
- Route via OTel and keep the choice open. Grafana Cloud is cost-effective for many; Datadog/New Relic are fine if procurement is easy. The key is decoupling ingestion from code.
- How do we prevent alert fatigue?
- Remove host-level pages. Page only on SLO burn-rate. Send everything else to tickets. Keep two burn windows (fast and slow) and tune until on-call stops hating you.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
