Harden That Legacy Service: A 6‑Week, Progressive Observability + SLO Playbook
You don’t need a rewrite to stop 3AM pages. Ship a pragmatic, staged observability stack, define SLOs that matter, and use error budgets to harden what you already have.
Boring, progressive observability beats heroic rewrites—because SLOs make the boring work pay for itself.
The 3AM Incident You’ve Already Lived
You’ve got a legacy service—10+ years of “don’t touch that code” wisdom—running under a load balancer, talking to a database that predates the interns. At 3:07AM, latency spikes, error rate creeps, PagerDuty lights up, and the only dashboards you have are CPU and a disk-full alert that’s been flapping since 2021. I’ve watched teams try to fix this with a rewrite, a service mesh, or an AI observability platform (pick your buzzword). Most of it fails because it’s not progressive, not tied to SLOs, and not wired into deploy decisions.
Here’s the pragmatic playbook we use at GitPlumbers to harden legacy systems without boiling the ocean. Six weeks, staged adoption, measurable outcomes.
Step 1: Baseline Without Touching Code
Goal: get signal now. No code changes. No tickets to legacy owners. Just enough visibility to stop guessing.
- Black-box checks with blackbox_exporter against your top 3 endpoints
- System metrics with node_exporter (or windows_exporter)
- Ingress metrics from nginx_exporter/haproxy_exporter/ALB metrics
- Access logs shipped via fluent-bit or vector to Loki or Datadog
- Optional: eBPF (Pixie, Parca) if agents are easier than changing code
Quick start Prometheus scrape (Kubernetes):
# prometheus-additional-scrape.yaml
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://legacy.example.com/healthz
          - https://legacy.example.com/api/v1/orders
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
Shipping NGINX access logs with fluent-bit:
# fluent-bit.conf
[INPUT]
    Name    tail
    Path    /var/log/nginx/access.log
    Parser  nginx
[FILTER]
    Name    grep
    Match   *
    Regex   log .*\/(api|healthz).*
[OUTPUT]
    Name    loki
    Match   *
    host    loki.loki.svc
    labels  {job="nginx", env="prod"}
Checkpoint by end of Week 1:
- Synthetic uptime visible, P50/P95 latency from ingress metrics
- Top endpoints and 5xx rate trending
- A Grafana dashboard with RED metrics: rate(http_requests_total{code=~"5.."}[5m]) and histogram_quantile(0.95, ...)
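If you want the full dashboard queries rather than the abbreviated forms above, a minimal RED panel set looks roughly like this, using the nginx ingress metrics this post leans on later (adjust metric and label names to whatever your exporter actually emits):
# Rate: requests per second through the ingress
sum(rate(nginx_http_requests_total[5m]))
# Errors: fraction of responses that are 5xx
sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m]))
# Duration: P95 latency from the ingress histogram
histogram_quantile(0.95, sum by (le) (rate(nginx_request_duration_seconds_bucket[5m])))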
Step 2: Define SLOs That Matter (Before You Add More Telemetry)
Don’t instrument blindly. Pick one user journey. Define availability and latency targets that reflect pain your business actually feels.
- Example service: GET /api/v1/orders powering checkout
- Availability SLO: 99.5% over 30 days
- Latency SLO: 95% of requests under 300ms over 30 days
- Budget: 0.5% unavailability ≈ 3h 36m/month
Prometheus SLI candidates (assuming ingress metrics):
# Requests considered successful (non-5xx)
sum(rate(nginx_http_requests_total{route="/api/v1/orders",status!~"5.."}[5m]))
/
sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[5m]))
Latency SLI from histograms:
histogram_quantile(0.95, sum by (le) (rate(nginx_request_duration_seconds_bucket{route="/api/v1/orders"}[5m]))) < 0.3
Pro tip: choose SLO windows you can actually enforce. If you page on a 30-day objective, you’ll never act. Use burn-rate alerts over short windows to detect real incidents.
Checkpoint:
- Two SLIs defined, queryable
- SLO targets agreed with product/ops
- Error-budget sheet with simple math everyone understands
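The “simple math” can live right next to the SLIs. A sketch of the budget arithmetic plus a budget-remaining query, assuming the same ingress metrics as above:
# Budget: (1 - 0.995) * 30 days = 0.005 * 43,200 min = 216 min ≈ 3h 36m of allowed unavailability
# Error budget remaining over the 30-day window (1 = untouched, 0 = exhausted)
1 - (
  (1 - (
    sum(rate(nginx_http_requests_total{route="/api/v1/orders",status!~"5.."}[30d]))
    /
    sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[30d]))
  ))
  /
  (1 - 0.995)
)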
Step 3: Add the Pipe — OTel Collector To Fan-In/Fan-Out
Now standardize ingestion. The opentelemetry-collector lets you bring metrics/logs/traces into one place and ship to whatever backend you can pay for or self-host.
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: legacy-nginx
          static_configs:
            - targets: ['nginx.ingress.svc:9113']
  loki:
    protocols:
      http:
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus-us1.grafana.net/api/prom/push
    headers: { Authorization: 'Bearer ${GRAFANA_CLOUD_RW}' }
  otlphttp/tempo:
    endpoint: https://tempo-us1.grafana.net/otlp
    headers: { Authorization: 'Bearer ${GRAFANA_CLOUD_OTLP}' }
  loki:
    endpoint: https://logs-prod3.grafana.net
    headers: { Authorization: 'Bearer ${GRAFANA_CLOUD_LOKI}' }
processors:
  batch:
  attributes:
    actions:
      - key: env
        value: prod
        action: upsert
service:
  pipelines:
    metrics: { receivers: [prometheus, otlp], processors: [batch], exporters: [prometheusremotewrite] }
    logs: { receivers: [loki], processors: [batch], exporters: [loki] }
    traces: { receivers: [otlp], processors: [batch, attributes], exporters: [otlphttp/tempo] }
If you’re all-in on Datadog or New Relic, use their OTLP endpoints. Keep vendor lock-in out of your code; keep it in routing.
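For instance, landing on Datadog is an exporter swap, not a code change. Datadog ingest from the collector typically goes through the contrib distribution’s datadog exporter (or the Datadog Agent’s OTLP receiver); a minimal sketch, with the API key and site as placeholders:
# otel-collector.yaml fragment: swap the exporter, keep the receivers and processors
exporters:
  datadog:
    api:
      site: datadoghq.com
      key: ${DD_API_KEY}
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch, attributes], exporters: [datadog] }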
Checkpoint:
- One config path to add/remove backends
- All telemetry tagged with service, env, and version
Step 4: Minimal Tracing Without Rewriting The World
We’re still avoiding invasive code changes. Use auto-instrumentation or ingress sampling to get distributed traces around the hot path.
- JVM: opentelemetry-javaagent.jar
- Python: opentelemetry-instrument
- .NET: OpenTelemetry.AutoInstrumentation package
Example: Java service with zero code changes:
java -javaagent:/opt/otel/opentelemetry-javaagent.jar \
-Dotel.service.name=legacy-orders \
-Dotel.traces.exporter=otlp \
-Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
-Dotel.resource.attributes=env=prod,team=payments \
-jar legacy-orders.jar
Python (Flask) quick win:
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
OTEL_SERVICE_NAME=legacy-orders \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument gunicorn app:app
Checkpoint:
- P50/P95 spans through ingress → service → DB visible in Jaeger/Tempo
- Top 3 slow spans identified (DB calls, external API)
Step 5: Codify SLOs And Alerts (Sloth + Burn Rates)
Stop hand-editing alert rules at 2AM. Use SLO-as-Code. Sloth generates Prometheus recording/alerting rules from a YAML spec. Version it, review it, ship it via GitOps.
# slo-orders.yaml (Sloth)
version: "prometheus/v1"
service: legacy-orders
labels: {env: prod, team: payments}
slos:
  - name: availability
    objective: 99.5
    description: Successful requests to /api/v1/orders
    sli:
      events:
        error_query: sum(rate(nginx_http_requests_total{route="/api/v1/orders",status=~"5.."}[{{.window}}]))
        total_query: sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[{{.window}}]))
    alerting:
      name: orders-availability
      page_alert:
        labels: {severity: page}
        annotations: {runbook: https://runbooks/legacy-orders}
      ticket_alert:
        labels: {severity: ticket}
        annotations: {owner: payments}
Generate rules:
sloth generate -i slo-orders.yaml -o prometheus-rules-orders.yaml
kubectl apply -f prometheus-rules-orders.yaml
Multi-window burn-rate alert (if you roll your own):
# Page if we’ll exhaust budget fast (1h/5m)
(
sum(rate(errors[5m])) / sum(rate(total[5m]))
) > (14.4 * (1 - 0.995))
Use two alerts: fast-burn (5m/1h) to catch active incidents; slow-burn (30m/6h) to catch degradation.
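Rolled out as Prometheus rules, the pair looks roughly like this. The recording-rule names (orders:error_ratio:rate5m and friends) are illustrative; they assume you precompute the error ratio over each window from the SLI queries in Step 2:
# burn-rate-alerts.yaml (sketch; Sloth generates the equivalent for you)
groups:
  - name: legacy-orders-slo-burn
    rules:
      - alert: OrdersErrorBudgetFastBurn
        expr: |
          orders:error_ratio:rate5m > (14.4 * (1 - 0.995))
          and
          orders:error_ratio:rate1h > (14.4 * (1 - 0.995))
        labels:
          severity: page
      - alert: OrdersErrorBudgetSlowBurn
        expr: |
          orders:error_ratio:rate30m > (6 * (1 - 0.995))
          and
          orders:error_ratio:rate6h > (6 * (1 - 0.995))
        labels:
          severity: ticket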
Checkpoint:
- SLO dashboards visible to product/ops
- Burn-rate alerts paging humans only when the budget is at risk
Step 6: Harden Where The SLOs Hurt — Timeouts, Retries, Circuit Breakers
Now you know where it hurts. Don’t guess—apply controls at the boundaries.
- Timeouts: set defaults at ingress or mesh; override per route
- Retries: limited, with jitter/backoff; never retry non-idempotent requests
- Circuit breakers: trip on failure rates or concurrency saturation
- Queuing and shed load: protect upstreams with 429 + exponential backoff in clients
Envoy example (works via Istio/EnvoyFilter or standalone Envoy):
# envoy-route.yaml
route_config:
  name: local_route
  virtual_hosts:
    - name: legacy
      domains: ["legacy.example.com"]
      routes:
        - match: { prefix: "/api/v1/orders" }
          route:
            cluster: legacy-orders
            timeout: 0.5s
            retry_policy:
              retry_on: 5xx,reset,connect-failure
              num_retries: 2
              per_try_timeout: 0.2s
            max_stream_duration: { max_stream_duration: 1s }
clusters:
  - name: legacy-orders
    type: STATIC
    load_assignment: { ... }
    circuit_breakers:
      thresholds:
        - max_connections: 2000
          max_pending_requests: 500
          max_requests: 3000
JVM-side, without touching all call sites: resilience4j with Spring Boot Actuator metrics wired to Prometheus.
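A minimal sketch of that setup, assuming resilience4j-spring-boot and micrometer-registry-prometheus are on the classpath. The instance name ordersDb is an example, and you still annotate the handful of outbound boundaries (e.g. @CircuitBreaker(name = "ordersDb")) rather than every call site:
# application.yml (sketch)
resilience4j:
  circuitbreaker:
    instances:
      ordersDb:
        slidingWindowSize: 50
        failureRateThreshold: 50          # open the breaker at 50% failures
        waitDurationInOpenState: 30s
  retry:
    instances:
      ordersDb:
        maxAttempts: 2
        waitDuration: 100ms
        enableExponentialBackoff: true
  timelimiter:
    instances:
      ordersDb:
        timeoutDuration: 500ms
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus        # expose /actuator/prometheus for scraping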
Tie deploys to error budgets with Argo Rollouts:
# argo-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: legacy-orders
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: slo-burn-ok
            args:
              - name: service
                value: legacy-orders
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: slo-burn-ok
        - setWeight: 100
The AnalysisTemplate queries Prometheus for burn-rate; rollback if failing.
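The slo-burn-ok template isn’t shown above; a minimal sketch that reuses the fast-burn threshold from Step 5 (the Prometheus address, metric names, and threshold are assumptions about your setup):
# analysis-template.yaml (sketch)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-burn-ok
spec:
  args:
    - name: service          # passed by the Rollout; use it in the query if your metrics are labeled by service
  metrics:
    - name: availability-fast-burn
      interval: 1m
      failureLimit: 1
      # Fail the canary if the error ratio exceeds the fast-burn threshold: 14.4 * (1 - 0.995) = 0.072
      successCondition: result[0] < 0.072
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(nginx_http_requests_total{route="/api/v1/orders",status=~"5.."}[5m]))
            /
            sum(rate(nginx_http_requests_total{route="/api/v1/orders"}[5m]))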
Checkpoint:
- Automatic rollback of bad canaries
- Latency/availability back under SLO without heroics
Step 7: Show The Before/After And Keep Paying Down Risk
What we typically see within 6 weeks (real numbers from a payments client):
- MTTR down from 72m → 18m (−75%) after burn-rate alerts + tracing
- Pages per week down 60% after removing noisy CPU/disk alerts and gating deploys
- P95 latency for /api/v1/orders down 35% after adding upstream timeouts and a small cache
- Deployment lead time unchanged; failure impact reduced via canary + rollback
Ongoing habits that stick:
- Review error-budget status in weekly ops; decide on feature flags or freeze scopes
- Keep SLOs in Git next to infra (Terraform + ArgoCD) and review like code
- Run quarterly game days (Chaos Mesh/Litmus) targeting the actual SLOs
- Use SLO drift to prioritize tech debt over vibe code cleanup and AI-generated “fixes” that don’t move the needle
If I had to do it over again: start SLOs sooner, push OTel auto-instrumentation earlier, and never ship a canary without a burn-rate guardrail. This is boring engineering—and it’s why it works.
Key takeaways
- Start with black-box and system-level metrics; don’t block on code changes.
- Define 1–2 business-facing SLOs first; compute SLIs from traffic you already have.
- Use Sloth or Grafana SLO to codify SLOs and generate burn-rate alerts.
- Adopt OpenTelemetry via sidecars/auto-instrumentation before touching legacy code.
- Tie deploys to error-budget spend using Argo Rollouts and Prometheus checks.
- Harden with timeouts/retries/circuit breakers where SLOs show pain, not everywhere.
- Measure success by reduced MTTR, fewer pages, and error-budget health.
Implementation checklist
- Pick one critical user journey and define availability and latency SLOs.
- Stand up Prometheus + Grafana (or Grafana Cloud/Datadog) and Blackbox + Node exporters.
- Ship access logs to Loki/Datadog with structured fields and sampling.
- Add OpenTelemetry Collector to fan-in metrics/logs/traces; export to your chosen backend.
- Codify SLOs with Sloth and enable multi-window burn-rate alerting.
- Introduce timeouts/retries/circuit breakers at ingress or service mesh level.
- Gate canaries on SLO burn-rate and rollback automatically if budgets burn.
Questions we hear from teams
- We’re on VMs, not Kubernetes. Does this still work?
- Yes. Run Prometheus on a small VM, deploy blackbox and node exporters as services, ship logs with Vector/Fluent Bit, and run the OTel Collector as a systemd service. The configs are nearly identical—only discovery changes.
- Do we need traces before SLOs?
- No. Define SLOs using ingress metrics first. Traces help reduce MTTR and pinpoint hot spans, but you can page on burn-rate using just metrics.
- What if the legacy app can’t load an agent?
- Use ingress metrics + blackbox probes for SLIs and add sidecar proxies (Envoy/NGINX) to capture latency/error data. For traces, use mesh-generated spans or eBPF tools like Pixie to infer edges.
- Should we centralize on one vendor or self-host?
- Route via OTel and keep the choice open. Grafana Cloud is cost-effective for many; Datadog/New Relic are fine if procurement is easy. The key is decoupling ingestion from code.
- How do we prevent alert fatigue?
- Remove host-level pages. Page only on SLO burn-rate. Send everything else to tickets. Keep two burn windows (fast and slow) and tune until on-call stops hating you.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
