The Legacy Service That Stopped Paging at 2 a.m.: Progressive Observability and SLOs That Stick
You don’t need a platform rewrite to sleep through the night. You need a measured path to metrics, logs, traces, and SLOs that drive decisions.
You can’t manage what you can’t see; you can’t prioritize what you can’t measure.
The situation you’ve probably lived
You’ve got a legacy service that makes money and enemies. It runs on a creaky Tomcat 8.5 box behind an NGINX that nobody wants to touch. On Kubernetes, it’s a Deployment with mysterious initContainers and a cron that occasionally DOSes the DB. Alerts? A Nagios relic plus a Slack bot that screams for every 5xx spike. The team is numb, incidents drag, and product thinks SRE is just “the people who say no.”
I’ve seen this movie at a fintech, an ad-tech unicorn, and a Fortune 100. The fix wasn’t a platform rewrite. It was a progressive observability rollout with SLOs that senior leadership could read without a decoder ring—and burn-rate alerts that page only when the budget is really burning.
The plan: progressive adoption in 6 passes
We’ll layer capabilities intentionally:
- Baseline service metrics (golden signals) and black-box checks. Fast win, 1 week.
- Logs with shape and retention that don’t bankrupt you.
- Traces around the critical path only, sampled hard.
- SLOs that reflect user pain, not vanity metrics.
- Burn-rate alerting and runbooks that cut MTTR.
- SLO-driven delivery: canaries, flags, and chaos where it counts.
We’ll measure success with: alert volume ↓ 50–80%, MTTR ↓ 30–50%, change failure rate ↓, and a weekly SLO review the business actually attends.
Pass 1: metrics baseline you can ship in a week
Start with what’s cheap and decisive: metrics. Use the RED method (rate, errors, duration) for services and USE (utilization, saturation, errors) for infrastructure.
- Tools: Prometheus, Alertmanager, Grafana, blackbox_exporter.
- If you’re on K8s: add a ServiceMonitor and pod/service labels. If on VMs: install node_exporter and scrape via static configs (see the sketch below).
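For the VM path, a minimal sketch of a static scrape config, assuming node_exporter on the hosts and hypothetical hostnames/ports for the app’s /metrics endpoint:

scrape_configs:
  # Hypothetical legacy hosts exposing app metrics on /metrics
  - job_name: legacy-service
    static_configs:
      - targets: ['legacy-vm-01:8080', 'legacy-vm-02:8080']
  # node_exporter for USE-style host metrics (default port 9100)
  - job_name: node
    static_configs:
      - targets: ['legacy-vm-01:9100', 'legacy-vm-02:9100']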
Quick K8s ServiceMonitor for your legacy app:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-service
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: legacy-service
  namespaceSelector:
    matchNames: ["prod"]
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Add a black-box probe for the user-facing endpoint (don’t trust only internal metrics):
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 201, 204]

Prometheus scrape config for the blackbox exporter:
- job_name: blackbox
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://api.example.com/v1/payments
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115

Checkpoints by end of week:
- RED dashboard shows: request rate, p95 latency, and error ratio per endpoint.
- Black-box probe latency tracked and alerting on outright failures (no paging yet).
- One pager: “How to view the legacy service health in Grafana.”
Pass 2: logs without the tax
Logs help when metrics say "it’s bad" but not "why." Keep them cheap and focused.
- Tools: Loki + promtail (or Fluent Bit → your vendor). Avoid unbounded DEBUG logs and user PII.
- Shape logs to enable queries: include request_id, user_id (hashed), endpoint, status, latency_ms (see the sample event below).
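A minimal sketch of the target log shape; the field values are hypothetical, and trace_id comes into play in Pass 3:

{
  "ts": "2025-03-02T02:14:07Z",
  "level": "ERROR",
  "request_id": "c0ffee12-9b1d-4d3e-8f2a-1a2b3c4d5e6f",
  "user_id": "sha256:9f2b7c…",
  "endpoint": "/v1/payments",
  "status": 502,
  "latency_ms": 1240,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "msg": "upstream partner call timed out"
}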
K8s promtail snippet with label dropping to avoid cardinality explosions:
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - match:
          selector: '{app="legacy-service"}'
          stages:
            - json:
                expressions:
                  endpoint: endpoint
                  status: status
            - labels:
                endpoint:
                status:
      - labeldrop:
          - pod
          - container

Operational guardrails:
- Retention: 7–14 days hot; archive to S3 if compliance wants more.
- Sampling: keep WARN/ERROR always, sample INFO 1:10 or 1:100.
Checkpoints:
- You can pivot from a p95 latency spike to the top offenders by endpoint and status in < 2 minutes.
- Logging bill stable; cardinality growth < 10% week over week.
Pass 3: traces where latency hides
Traces are expensive if you go YOLO. Only instrument the critical path.
- Tools: OpenTelemetry SDK, OpenTelemetry Collector, Tempo or Jaeger.
- Trace what you can’t infer from metrics: DB calls, cache misses, external APIs, queue publish/consume.
Collector with head sampling and tail triggers for slow spans:
receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  probabilistic_sampler:
    sampling_percentage: 5
  tail_sampling:
    policies:
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 500
      - name: error-status
        type: status_code
        status_code:
          status_codes: [ERROR]

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, tail_sampling, batch]
      exporters: [otlphttp/tempo]

In the app, ensure you propagate trace_id to logs to correlate quickly.
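To make that correlation one click, a sketch of Grafana datasource provisioning that links a trace_id field in Loki log lines to Tempo (datasource names and UIDs here are assumptions, not part of the original setup):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Pull trace_id out of JSON log lines and deep-link to the Tempo datasource
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'   # $$ escapes Grafana's env-var interpolation in provisioning files
          datasourceUid: tempo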
Checkpoint:
- 3–5 spans per request show where the time goes (e.g., legacy.db.query, legacy.http.call.partnerX).
- Top N slow external calls visible; you can answer “is it us or partner?” in minutes.
Pass 4: SLOs and error budgets you can defend to finance
Now that you can see, draw the line. SLIs must reflect user pain, not CPU.
- Choose 2 SLIs to start:
- Availability: proportion of successful requests for your top money endpoint.
- Latency: proportion of requests under a threshold users feel (e.g., 300ms p95 for reads, 800ms p95 for writes).
- Targets: pick 99.9% for the external-facing core path, 99.5% for admin/async. Window: 28d. (For context, 99.9% over 28 days is an error budget of roughly 40 minutes of full downtime.)
Use Sloth to codify SLOs as code and auto-generate Prometheus rules.
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: legacy-slos
spec:
  service: legacy-service
  slos:
    - name: api-availability
      objective: 99.9
      timeWindow: 28d
      labels:
        tier: critical
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{job="legacy-service"}[{{.window}}]))
      alerting:
        name: legacy-availability
        annotations:
          summary: "Legacy API availability SLO burn"
        # Sloth generates the multi-window, multi-burn-rate rules itself
        # (fast burn pages, slow burn tickets); you only attach routing labels.
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket
    - name: api-latency
      objective: 99.0
      timeWindow: 28d
      sli:
        raw:
          errorRatioQuery: |
            1 - (
              sum(rate(http_request_duration_seconds_bucket{job="legacy-service",le="0.3"}[{{.window}}]))
              /
              sum(rate(http_request_duration_seconds_count{job="legacy-service"}[{{.window}}]))
            )

Checkpoints:
- SLO dashboard shows budgets remaining and burn trend; execs can read it.
- SLOs live in Git (GitOps) and are reviewed weekly with product.
Pass 5: burn-rate alerting and runbooks that cut MTTR
Page only when you’re burning the budget fast enough to matter. Everything else is a ticket.
Prometheus recording and alerting rules for two-window multi-burn (if you’re not using Sloth’s generated rules). A burn rate of 14.4x means that, sustained, you’d exhaust the entire 28-day budget in about two days; over one hour it consumes roughly 2% of the budget.

groups:
  - name: legacy-slo-burn
    rules:
      # Error ratio over two windows
      - record: job:sli_error_ratio:rate5m
        expr: sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="legacy-service"}[5m]))
      - record: job:sli_error_ratio:rate1h
        expr: sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[1h])) / sum(rate(http_requests_total{job="legacy-service"}[1h]))
      # Page for the 99.9% SLO when both windows burn > 14.4x the 0.1% budget
      - alert: LegacyServiceErrorBudgetBurn
        expr: (job:sli_error_ratio:rate5m > (0.001 * 14.4)) and (job:sli_error_ratio:rate1h > (0.001 * 14.4))
        labels:
          severity: page

Alertmanager route hygiene:
route:
  group_by: ['alertname', 'service']
  receiver: oncall
  routes:
    - matchers: ['severity="ticket"']
      receiver: backlog

receivers:
  - name: oncall
    slack_configs:
      - channel: '#prod-pager'
        send_resolved: true
  - name: backlog
    slack_configs:
      - channel: '#slo-tickets'
        send_resolved: true

Runbook must include:
- Where to look: Grafana dashboard link, LogQL queries, Tempo trace search by endpoint.
- “Is it us or partner?” decision tree with playbooks (e.g., failover to partnerY).
- Rollback and feature-flag toggles.
Checkpoints:
- Page count reduced 50–80% within two weeks; false positives near zero.
- MTTR down 30–50% as oncall bypasses noise and hits the critical path.
Pass 6: SLO-driven delivery (canaries, flags, chaos)
Now close the loop so SLOs influence change, not just pager noise.
- Canary deploys with Argo Rollouts or Flagger + mesh (Istio/Linkerd). Gate promotions on SLO burn.
- Feature flags (LaunchDarkly, Unleash) automatically disable a feature if short-window burn > threshold (see the webhook sketch after this list).
- Chaos experiments (Litmus, Chaos Mesh) targeted at the critical path, only during business-approved windows.
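One lightweight way to wire the flag kill-switch, sketched under assumptions: route the fast-burn page alert to an Alertmanager webhook receiver that calls a hypothetical internal service, which in turn disables the flag via the LaunchDarkly/Unleash API (the flag-operator URL and flag name are placeholders):

# Alertmanager: send fast-burn alerts for the new feature to a flag kill-switch webhook
route:
  routes:
    - matchers: ['alertname="LegacyServiceErrorBudgetBurn"', 'severity="page"']
      receiver: flag-killswitch
      continue: true   # still deliver the page to oncall as well
receivers:
  - name: flag-killswitch
    webhook_configs:
      # Hypothetical internal service that flips the feature flag off via the vendor API
      - url: http://flag-operator.internal/api/disable?flag=new-checkout-flow
        send_resolved: false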
Example Rollouts analysis tied to SLO burn metric:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-burn
spec:
  metrics:
    - name: error-burn
      interval: 1m
      successCondition: result[0] < 0.003   # 3x burn of the 0.1% budget
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="legacy-service"}[5m]))

Checkpoint:
- Change failure rate drops; bad canaries auto-halt before harming the 28d budget.
What “good” looks like in 60–90 days
- Alert volume: -60% (pages), -40% (tickets). Pager noise correlates with actual budget burn.
- MTTR: down from 90m → 40m. You find the slow hop in minutes via traces.
- Business alignment: weekly 20-min SLO review with product; roadmap debates anchored on budget spend instead of anecdotes.
- Cost: log storage stable; tracing kept to < 10% of total telemetry spend through sampling.
- Culture: engineers open PRs for SLO changes like any code; PMs ask “what’s the budget impact?” before adding features.
Common traps and how to dodge them
- Boiling the ocean: don’t instrument everything. One endpoint, one queue. Expand later.
- Vanity SLOs: “CPU < 80%” isn’t user pain. Stick to availability and latency users feel.
- Trace everything: don’t. Sample and target hot spots. Propagate trace_id into logs.
- Cardinality spikes: unbounded labels (user_id) in metrics/logs will implode Prometheus and your wallet. Hash or drop.
- Pager-as-monitoring: alerts are outputs of SLO math, not a brainstorming board.
If you want a guide-by-your-side, GitPlumbers runs lightweight SLO hardening sprints that leave you with dashboards, rules, and a team that knows how to run them.
Key takeaways
- Harden legacy systems incrementally: start with metrics and golden signals before logs and selective traces.
- Define SLIs that map to user pain, then set SLOs you can defend to the business.
- Use multi-window burn-rate alerts to reduce noise and page only for real budget burn.
- Instrument the critical path first; use sampling and log shaping to control cost and cardinality.
- Close the loop: tie deploy gates and feature flags to SLO health, not vibes.
Implementation checklist
- Pick one high-value endpoint and one high-value job queue as your first slice.
- Ship service-level metrics in one week; defer logs/traces until SLIs are live.
- Write SLOs using Sloth (28d, 99.9/99.5 targets), publish dashboards and Alertmanager routes.
- Enable two-window burn-rate page and ticket alerts; suppress everything else initially.
- Instrument selective spans around DB calls, external APIs, and cache; sample aggressively.
- Wire deploy canaries to error-budget burn; block if burn > threshold over short window.
- Review SLOs weekly; adjust targets based on real budgets and product priorities.
Questions we hear from teams
- Do I need Kubernetes for this?
- No. The same approach works on VMs. Use node_exporter for system metrics, blackbox_exporter for HTTP checks, Prometheus static scrape configs, and the OpenTelemetry Collector as a systemd service. The SLO math doesn’t care where it runs.
- What if my legacy app can’t be instrumented easily?
- Start outside-in: black-box checks and NGINX or Envoy access logs give you RED metrics. Add a sidecar for metrics if possible. Trace only around the edges (DB proxy, HTTP client libraries) before touching app code.
- How do I pick SLO targets without historical data?
- Start conservative: 99.5% availability and a latency threshold users feel (e.g., 500ms p95) over 28d. After 2–4 weeks, adjust based on observed budgets and incident impact. Document the rationale so finance and product buy in.
- Will tracing blow up my costs?
- Not if you sample. Head sample 1–5% and tail-sample slow/error traces. Keep spans minimal and short-lived. Store only a few days of traces hot; metrics should drive most decisions.
- Can I keep Datadog/New Relic and still do this?
- Yes. You can still codify SLOs with Sloth and query Datadog metrics, or use their SLO objects. The progressive approach and burn-rate alerting pattern are tool-agnostic.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
