The Legacy Service That Stopped Paging at 2 a.m.: Progressive Observability and SLOs That Stick

You don’t need a platform rewrite to sleep through the night. You need a measured path to metrics, logs, traces, and SLOs that drive decisions.

You can’t manage what you can’t see; you can’t prioritize what you can’t measure.

The situation you’ve probably lived

You’ve got a legacy service that makes money and enemies. It runs on a creaky Tomcat 8.5 box behind an NGINX that nobody wants to touch. On Kubernetes, it’s a Deployment with mysterious initContainers and a cron that occasionally DoSes the DB. Alerts? A Nagios relic plus a Slack bot that screams for every 5xx spike. The team is numb, incidents drag, and product thinks SRE is just “the people who say no.”

I’ve seen this movie at a fintech, an ad-tech unicorn, and a Fortune 100. The fix wasn’t a platform rewrite. It was a progressive observability rollout with SLOs that senior leadership could read without a decoder ring—and burn-rate alerts that page only when the budget is really burning.

The plan: progressive adoption in 6 passes

We’ll layer capabilities intentionally:

  1. Baseline service metrics (golden signals) and black-box checks. Fast win, 1 week.
  2. Logs with shape and retention that don’t bankrupt you.
  3. Traces around the critical path only, sampled hard.
  4. SLOs that reflect user pain, not vanity metrics.
  5. Burn-rate alerting and runbooks that cut MTTR.
  6. SLO-driven delivery: canaries, flags, and chaos where it counts.

We’ll measure success with: alert volume ↓ 50–80%, MTTR ↓ 30–50%, change failure rate ↓, and a weekly SLO review the business actually attends.

Pass 1: metrics baseline you can ship in a week

Start with what’s cheap and decisive: metrics. Use the RED method for services and USE for infrastructure.

  • Tools: Prometheus, Alertmanager, Grafana, blackbox_exporter.
  • If you’re on K8s: add a ServiceMonitor and pod/service labels. If on VMs: install node_exporter and scrape via static configs.
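For the VM path, a minimal static scrape sketch (hostnames and ports are placeholders, not real infrastructure):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['legacy-vm-1:9100', 'legacy-vm-2:9100']  # node_exporter, hosts assumed
  - job_name: legacy-service
    metrics_path: /metrics
    static_configs:
      - targets: ['legacy-vm-1:8080']  # app metrics port assumed
```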

Quick K8s ServiceMonitor for your legacy app:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-service
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: legacy-service
  namespaceSelector:
    matchNames: ["prod"]
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Add a black-box probe for the user-facing endpoint (don’t trust only internal metrics):

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 201, 204]

Prometheus scrape for black-box exporter:

- job_name: blackbox
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://api.example.com/v1/payments
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - target_label: instance
      source_labels: [__param_target]
    - target_label: __address__
      replacement: blackbox-exporter:9115

Checkpoints by end of week:

  • RED dashboard shows: request rate, p95 latency, error ratio per endpoint.
  • Black-box probe latency tracked and alerting on outright failures (no paging yet).
  • One pager: “How to view the legacy service health in Grafana.”
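The RED panels reduce to three queries. Metric names here assume a standard Prometheus client exposing a request counter and a duration histogram, as used elsewhere in this post:

```promql
# Rate: requests per second by endpoint
sum by (endpoint) (rate(http_requests_total{job="legacy-service"}[5m]))

# Errors: ratio of 5xx responses
sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="legacy-service"}[5m]))

# Duration: p95 latency from histogram buckets
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="legacy-service"}[5m])))
```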

Pass 2: logs without the tax

Logs help when metrics say "it’s bad" but not "why." Keep them cheap and focused.

  • Tools: Loki + promtail (or Fluent Bit → your vendor). Avoid unbounded DEBUG logs and user PII.
  • Shape logs to enable queries: include request_id, user_id (hashed), endpoint, status, latency_ms.

K8s promtail snippet with label dropping to avoid cardinality explosions:

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  pipeline_stages:
    - match:
        selector: '{app="legacy-service"}'
        stages:
          - json:
              expressions:
                endpoint: endpoint
                status: status
          - labels:
              endpoint:
              status:
          - labeldrop:
              - pod
              - container

Operational guardrails:

  • Retention: 7–14 days hot; archive to S3 if compliance wants more.
  • Sampling: keep WARN/ERROR always, sample INFO 1:10 or 1:100.

Checkpoints:

  • You can pivot from a p95 latency spike to top offenders by endpoint and status in < 2 minutes.
  • Logging bill stable; cardinality growth < 10% week over week.
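That two-minute pivot is a single LogQL query, assuming the log shape above (JSON lines carrying endpoint, status, latency_ms):

```logql
# top offenders by endpoint/status among requests slower than 300ms, last 15m
topk(10,
  sum by (endpoint, status) (
    count_over_time({app="legacy-service"} | json | latency_ms > 300 [15m])
  )
)
```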

Pass 3: traces where latency hides

Traces are expensive if you go YOLO. Only instrument the critical path.

  • Tools: OpenTelemetry SDK, OpenTelemetry Collector, Tempo or Jaeger.
  • Trace what you can’t infer from metrics: DB calls, cache misses, external APIs, queue publish/consume.

Collector with tail sampling that always keeps slow and error traces plus a 5% baseline sample. (Putting a head sampler in front of tail sampling would silently drop most slow and error traces before the tail policies ever see them.)

receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  tail_sampling:
    policies:
      - name: baseline-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 500
      - name: error-status
        type: status_code
        status_code:
          status_codes: [ERROR]

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo]

In the app, propagate trace_id into every log line so you can pivot from a slow trace to its logs quickly.

Checkpoint:

  • 3–5 spans per request show where the time goes (e.g., legacy.db.query, legacy.http.call.partnerX).
  • Top N slow external calls visible; you can answer “is it us or partner?” in minutes.

Pass 4: SLOs and error budgets you can defend to finance

Now that you can see, draw the line. SLIs must reflect user pain, not CPU.

  • Choose 2 SLIs to start:
    • Availability: proportion of successful requests for your top money endpoint.
    • Latency: proportion of requests under a threshold users feel (e.g., 300ms p95 for reads, 800ms p95 for writes).
  • Targets: pick 99.9% for external-facing core path, 99.5% for admin/async. Window: 28d.
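To make the budget concrete: 99.9% over 28 days leaves a 0.1% error budget, roughly 40 minutes of total downtime. A sketch of a budget-consumed panel query (metric and job names assume a standard Prometheus HTTP counter, as used elsewhere in this post):

```promql
# fraction of the 28d error budget consumed so far (1.0 = budget exhausted)
(
  sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[28d]))
  /
  sum(rate(http_requests_total{job="legacy-service"}[28d]))
)
/ (1 - 0.999)
```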

Use Sloth to codify SLOs as code and auto-generate the Prometheus recording and alerting rules. Note that the 28-day window is set when generating rules (e.g. --default-slo-period=28d), not per SLO in the spec.

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: legacy-slos
spec:
  service: legacy-service
  slos:
    - name: api-availability
      objective: 99.9
      labels:
        tier: critical
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{job="legacy-service"}[{{.window}}]))
      alerting:
        name: legacy-availability
        annotations:
          summary: "Legacy API availability SLO burn"
        # Sloth generates the multi-window, multi-burn-rate rules itself;
        # pageAlert/ticketAlert take only labels/annotations (or disable)
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket
    - name: api-latency
      objective: 99.0
      sli:
        raw:
          errorRatioQuery: |
            1 - (
              sum(rate(http_request_duration_seconds_bucket{job="legacy-service",le="0.3"}[{{.window}}]))
              /
              sum(rate(http_request_duration_seconds_count{job="legacy-service"}[{{.window}}]))
            )

Checkpoints:

  • SLO dashboard shows budgets remaining and burn trend; execs can read it.
  • SLOs live in Git (GitOps) and are reviewed weekly with product.

Pass 5: burn-rate alerting and runbooks that cut MTTR

Page only when you’re burning the budget fast enough to matter. Everything else is a ticket.

Prometheus recording and alerting rules for two-window multi-burn (if not using Sloth’s generated rules):

groups:
  - name: legacy-slo-burn
    rules:
      # Error ratio over two windows
      - record: job:sli_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="legacy-service"}[5m]))
      - record: job:sli_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[1h]))
          / sum(rate(http_requests_total{job="legacy-service"}[1h]))
      # Page for the 99.9% SLO when both windows burn > 14.4x the budget
      # (at 14.4x, one hour of burn consumes ~2% of a 30d budget)
      - alert: LegacyAvailabilityBurn
        expr: |
          job:sli_error_ratio:rate5m > (14.4 * 0.001)
          and job:sli_error_ratio:rate1h > (14.4 * 0.001)
        labels:
          severity: page

Alertmanager route hygiene:

route:
  group_by: ['alertname','service']
  receiver: oncall
  routes:
    - matchers: ['severity="ticket"']
      receiver: backlog
receivers:
  - name: oncall
    slack_configs:
      - channel: '#prod-pager'
        send_resolved: true
  - name: backlog
    slack_configs:
      - channel: '#slo-tickets'
        send_resolved: true

Runbook must include:

  • Where to look: Grafana dashboard link, LogQL queries, Tempo trace search by endpoint.
  • “Is it us or partner?” decision tree with playbooks (e.g., failover to partnerY).
  • Rollback and feature-flag toggles.

Checkpoints:

  • Page count reduced 50–80% within two weeks; false positives near zero.
  • MTTR down 30–50% as oncall bypasses noise and hits the critical path.

Pass 6: SLO-driven delivery (canaries, flags, chaos)

Now close the loop so SLOs influence change, not just pager noise.

  • Canary deploys with Argo Rollouts or Flagger + mesh (Istio/Linkerd). Gate promotions on SLO burn.
  • Feature flags (LaunchDarkly, Unleash) automatically disable a feature if short-window burn > threshold.
  • Chaos experiments (litmus, chaos-mesh) targeted at the critical path, only during business-approved windows.

Example Argo Rollouts AnalysisTemplate tied to the SLO burn metric:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-burn
spec:
  metrics:
    - name: error-burn
      interval: 1m
      # fail the canary above 3x burn of the 0.1% budget;
      # the Prometheus provider returns a vector, hence result[0]
      successCondition: result[0] < 0.003
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{job="legacy-service",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="legacy-service"}[5m]))
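A minimal sketch of the Rollout side, gating a canary step on the slo-burn analysis (weights and pause durations are illustrative):

```yaml
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
            - templateName: slo-burn
      - setWeight: 50
      - pause: {duration: 10m}
```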

Checkpoint:

  • Change failure rate drops; bad canaries auto-halt before harming the 28d budget.

What “good” looks like in 60–90 days

  • Alert volume: -60% (pages), -40% (tickets). Pager noise correlates with actual budget burn.
  • MTTR: down from 90m → 40m. You find the slow hop in minutes via traces.
  • Business alignment: weekly 20-min SLO review with product; roadmap debates anchored on budget spend instead of anecdotes.
  • Cost: log storage stable; tracing kept to < 10% of total telemetry spend through sampling.
  • Culture: engineers open PRs for SLO changes like any code; PMs ask “what’s the budget impact?” before adding features.

Common traps and how to dodge them

  • Boiling the ocean: don’t instrument everything. One endpoint, one queue. Expand later.
  • Vanity SLOs: “CPU < 80%” isn’t user pain. Stick to availability and latency users feel.
  • Trace everything: don’t. Sample and target hot spots. Propagate trace_id into logs.
  • Cardinality spikes: unbounded labels (user_id) in metrics/logs will implode Prometheus and your wallet. Hash or drop.
  • Pager-as-monitoring: alerts are outputs of SLO math, not a brainstorming board.

If you want a guide-by-your-side, GitPlumbers runs lightweight SLO hardening sprints that leave you with dashboards, rules, and a team that knows how to run them.


Key takeaways

  • Harden legacy systems incrementally: start with metrics and golden signals before logs and selective traces.
  • Define SLIs that map to user pain, then set SLOs you can defend to the business.
  • Use multi-window burn-rate alerts to reduce noise and page only for real budget burn.
  • Instrument the critical path first; use sampling and log shaping to control cost and cardinality.
  • Close the loop: tie deploy gates and feature flags to SLO health, not vibes.

Implementation checklist

  • Pick one high-value endpoint and one high-value job queue as your first slice.
  • Ship service-level metrics in one week; defer logs/traces until SLIs are live.
  • Write SLOs using Sloth (28d, 99.9/99.5 targets), publish dashboards and Alertmanager routes.
  • Enable two-window burn-rate page and ticket alerts; suppress everything else initially.
  • Instrument selective spans around DB calls, external APIs, and cache; sample aggressively.
  • Wire deploy canaries to error-budget burn; block if burn > threshold over short window.
  • Review SLOs weekly; adjust targets based on real budgets and product priorities.

Questions we hear from teams

Do I need Kubernetes for this?
No. The same approach works on VMs. Use node_exporter for system metrics, blackbox_exporter for HTTP checks, Prometheus static scrape configs, and the OpenTelemetry Collector as a systemd service. The SLO math doesn’t care where it runs.
What if my legacy app can’t be instrumented easily?
Start outside-in: black-box checks and NGINX or Envoy access logs give you RED metrics. Add a sidecar for metrics if possible. Trace only around the edges (DB proxy, HTTP client libraries) before touching app code.
How do I pick SLO targets without historical data?
Start conservative: 99.5% availability and a latency threshold users feel (e.g., 500ms p95) over 28d. After 2–4 weeks, adjust based on observed budgets and incident impact. Document the rationale so finance and product buy in.
Will tracing blow up my costs?
Not if you sample. Keep a 1–5% baseline sample and tail-sample so slow and error traces are always retained. Keep spans minimal and short-lived. Store only a few days of traces hot; metrics should drive most decisions.
Can I keep Datadog/New Relic and still do this?
Yes. You can still codify SLOs with Sloth and query Datadog metrics, or use their SLO objects. The progressive approach and burn-rate alerting pattern are tool-agnostic.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Run an SLO Hardening Sprint with GitPlumbers
  • Download our Prometheus SLO Rule Pack
