The 30‑Day Hardening Plan for a Legacy Service: Progressive Observability and SLOs That Stick

No rewrites. No Big Bang platform migrations. Just a month of disciplined, incremental moves that stop the bleeding and keep you shipping.

If it isn’t measured, it’s folklore. If it doesn’t gate a release, it’s a wish.

The legacy firefight you already know

You’ve got a 10-year-old service that prints money when it’s up and ruins weekends when it’s not. It’s a monolith with a few repo fossils, a sticky session dependency, and mystery timeouts to an ancient SOAP endpoint somebody swore was “going away next quarter.” The dashboards are pretty but useless, the alerts ping the wrong people, and when the CEO asks, “Are we healthy?” your team says, “Depends which graph you look at.”

I’ve seen this movie enough times to know the ending doesn’t require a rewrite. It requires progressive observability and SLOs you can actually defend. Here’s the 30-day plan we run at GitPlumbers when we’re called in to stop the bleeding fast.

Week 0: Baseline and stop the bleeding

Goal: ship a minimal layer of signals without changing application logic. You need just enough to see and triage.

  • Add a synthetic probe with blackbox_exporter hitting the real user path (TLS, DNS, auth).

  • Scrape basic service metrics with Prometheus. If you’re in k8s, use a ServiceMonitor (sketch after this list). If VMs, add scrape jobs.

  • Route logs to a place you can search in under 5 seconds (Loki, Elasticsearch, Splunk—pick one).

  • Stand up a one-page Grafana dashboard with four graphs: traffic, p95 latency, error rate, CPU/memory saturation.
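
If you’re in k8s, a minimal ServiceMonitor sketch (label names and the release selector are assumptions; match them to your Prometheus Operator install):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: legacy-service
  labels:
    release: prometheus          # must match your Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: legacy-service        # labels on the Service, not the Pods
  endpoints:
    - port: metrics              # named port on the Service
      interval: 15s
      path: /metrics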

Prometheus scrape example (VMs):

scrape_configs:
  - job_name: legacy-service
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['legacy-a.example:9100','legacy-b.example:9100']
  - job_name: synthetic-probe
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://app.example.com/login']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
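
The probe job above assumes an http_2xx module on the blackbox exporter side; a minimal sketch of that module (timeout and TLS checks are assumptions, tune them to your login path):

modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      method: GET
      preferred_ip_protocol: ip4
      fail_if_not_ssl: true      # the login path should always be TLS
      follow_redirects: true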

Checkpoints by end of Week 0:

  • Mean time to first useful signal (MTTFS) during an incident < 5 minutes.

  • You can answer “What’s broken: user path or backend?” using synthetic probe vs. service metrics.

  • One on-call runbook page exists with dashboard and log links.

Week 1: Minimal viable observability in the app

Now we add low-risk app instrumentation. Don’t refactor; wrap and tag.

  • Emit metrics: request count, duration histogram, and error labels by route and status.

  • Add distributed tracing with OpenTelemetry (sample 1–5% to start).

  • Use structured logs with request_id, route, duration_ms, user_tier.

Java (Spring) example:

// build.gradle
implementation 'io.micrometer:micrometer-registry-prometheus'
implementation 'io.opentelemetry:opentelemetry-exporter-otlp'

// Metrics config (Micrometer)
@Bean
MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
  return r -> r.config().commonTags("service", "legacy-service");
}

// Timing an endpoint; record the real status so error labels stay honest
@GetMapping("/checkout")
public ResponseEntity<?> checkout() {
  long start = System.nanoTime();
  int status = HttpStatus.OK.value();
  try {
    // business logic
    return ResponseEntity.ok().build();
  } catch (RuntimeException e) {
    status = HttpStatus.INTERNAL_SERVER_ERROR.value();
    throw e;
  } finally {
    Timer.builder("http_server_duration")
      .tag("route", "/checkout")
      .tag("status", String.valueOf(status))
      .register(Metrics.globalRegistry)
      .record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
  }
}

// OpenTelemetry SDK: 5% sampling, OTLP to the collector (collector address is an example)
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
  .setSampler(Sampler.traceIdRatioBased(0.05))
  .addSpanProcessor(BatchSpanProcessor.builder(
      OtlpGrpcSpanExporter.builder().setEndpoint("http://otel-collector:4317").build()).build())
  .build();

OpenTelemetry Collector minimal config (traces and metrics; a logs pipeline sketch follows):

receivers:
  otlp:
    protocols:
      http:
      grpc:
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:9464
processors:
  batch: {}
extensions:
  health_check: {}
service:
  extensions: [health_check]
  pipelines:
    traces: { receivers: [otlp], processors: [batch], exporters: [otlp] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
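
Logs can ride the same collector. A sketch of a logs pipeline using the contrib Loki exporter, assuming your app or an agent ships logs to the collector over OTLP and that you run the otelcol-contrib build (the Loki address is an assumption):

# merge into the collector config above (requires the otelcol-contrib distribution)
exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }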

Checkpoints by end of Week 1:

  • p95 latency, error rate, and traffic by route visible in Grafana.

  • End-to-end trace from ingress to DB exists for at least 1% of traffic.

  • Logs correlate via trace_id/span_id or request_id.

Week 2: Define SLOs that match reality

SLOs die on the hill of vagueness. Start with 1–2 user-centric SLOs. Use a 28–30 day window, and adopt an error budget policy.

  • Choose SLIs the business recognizes. Example journeys:

    • Auth: successful login under 500 ms.

    • Checkout: HTTP 2xx within 800 ms.

  • Avoid aggregate CPU/disk as SLOs; use them for capacity, not reliability promises.

  • Calculate error budget: if availability SLO = 99.9% over 30 days, budget ≈ 43 min of “bad.”

SLO definition with Sloth (SLO generator for Prometheus):

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-latency
spec:
  service: legacy-service
  slos:
    - name: checkout-fast
      objective: 99.0
      description: Checkout requests under 800ms
      sli:
        events:
          errorQuery: |
            sum(rate(http_server_duration_count{route="/checkout"}[5m]))
             - sum(rate(http_server_duration_bucket{route="/checkout",le="0.8"}[5m]))
          totalQuery: |
            sum(rate(http_server_duration_count{route="/checkout"}[5m]))
      alerting:
        name: checkout-burn
        labels: {severity: page}
        annotations: {runbook: https://runbooks.example/checkout}
        pageAlert:
          enable: true
          threshold: 14.4
          window: 5m
        ticketAlert:
          enable: true
          threshold: 6
          window: 1h

PromQL SLI without Sloth (availability example):

# Good events: 2xx & 3xx
sum(rate(http_requests_total{route="/checkout",status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total{route="/checkout"}[5m]))
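
The latency counterpart, assuming the duration histogram from Week 1 with bucket boundaries in seconds (metric names follow the Sloth example above; Micrometer’s Prometheus registry may append _seconds, so check your actual series):

# Good events: checkout requests completing in under 800 ms
sum(rate(http_server_duration_bucket{route="/checkout",le="0.8"}[5m]))
/
sum(rate(http_server_duration_count{route="/checkout"}[5m]))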

Checkpoints by end of Week 2:

  • You have SLOs for 1–2 critical journeys with clear objectives and whom they protect.

  • Error budgets are visible on a Grafana panel with a timeline burn view.

  • Product agrees that budget spend will influence roadmap when exhausted.

Week 3: Alert on burn, not noise

Turn off the legacy “CPU > 80% page me” nonsense. Page when the budget is burning too fast; ticket when it’s smoldering.

  • Use multi-window, multi-burn-rate alerts (fast + slow windows).

  • Page when users are currently impacted; open tickets for slow burns.

  • Link every alert to a runbook and a dashboard; no orphans.

Prometheus burn-rate rules (availability SLO 99.9%):

groups:
- name: slo-burn
  rules:
  - alert: SLOBudgetBurnFast
    expr: (
      (1 - slo:availability:ratio_rate5m) / (1 - 0.999)
    ) > 14.4
    for: 5m
    labels:
      severity: page
    annotations:
      summary: Fast burn on availability SLO
      runbook: https://runbooks.example/availability
  - alert: SLOBudgetBurnSlow
    expr: (
      (1 - slo:availability:ratio_rate1h) / (1 - 0.999)
    ) > 6
    for: 2h
    labels:
      severity: ticket
    annotations:
      summary: Slow burn on availability SLO
      runbook: https://runbooks.example/availability

Here slo:availability:ratio_rate5m is your 5m rolling good/total ratio (from Sloth or your own recording rule; sketch below). A burn rate of 14.4 spends roughly 2% of a 30-day budget in a single hour; a rate of 6 is slower but still exhausts the budget in about five days if it persists.
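
If you’re not generating these with Sloth, a sketch of the recording rules built from the availability SLI shown earlier (metric and route names follow the earlier examples; adjust to yours):

groups:
- name: slo-recordings
  rules:
  - record: slo:availability:ratio_rate5m
    expr: |
      sum(rate(http_requests_total{route="/checkout",status=~"2..|3.."}[5m]))
      /
      sum(rate(http_requests_total{route="/checkout"}[5m]))
  - record: slo:availability:ratio_rate1h
    expr: |
      sum(rate(http_requests_total{route="/checkout",status=~"2..|3.."}[1h]))
      /
      sum(rate(http_requests_total{route="/checkout"}[1h]))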

Alertmanager routing suggestion:

route:
  receiver: pager
  routes:
  - matchers: [severity="page"]
    receiver: pager
  - matchers: [severity="ticket"]
    receiver: backlog
receivers:
- name: pager
  pagerduty_configs:
  - routing_key: ${PAGERDUTY_KEY}
- name: backlog
  webhook_configs:
  - url: https://jira.example.com/webhook

Checkpoints by end of Week 3:

  • 50–80% fewer alerts than last month; pages are action-worthy.

  • MTTA (time to acknowledge) < 5 minutes during work hours; MTTR trending down.

  • On-call can answer “Are we burning budget now?” from one panel.

Week 4: Harden with progressive delivery and chaos

Now we weaponize SLOs. Releases only proceed if they keep the promise.

  • Gate canaries with SLO checks; auto-abort on burn or regression.

  • Inject failure with chaos-mesh or toxiproxy; verify timeouts and circuit breakers.

  • Add backoff and retry budgets on callers (Linkerd/Istio) to avoid retry storms; see the ServiceProfile sketch below.

Argo Rollouts canary with Prometheus analysis:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: legacy-service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: slo-check
      - setWeight: 50
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: slo-check
      - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-check
spec:
  metrics:
  - name: error-budget-burn
    interval: 1m
    successCondition: result < 1
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          ((1 - slo:checkout:ratio_rate5m) / (1 - 0.99))

Chaos example (toxiproxy) to add 300ms latency to DB:

# assuming toxiproxy already fronts legacy-db with a proxy named "db"
toxiproxy-cli toxic add -t latency -a latency=300 -a jitter=50 db
# run synthetic and canary checks; verify timeouts and fallbacks
# remove the toxic after the test (default toxic name is <type>_<stream>)
toxiproxy-cli toxic remove -n latency_downstream db
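
For the retry-budget bullet above, a hedged Linkerd ServiceProfile sketch (service name, route, and numbers are assumptions; Istio’s VirtualService retries can play the same role):

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: legacy-backend.default.svc.cluster.local   # must be the backend's FQDN; this one is made up
  namespace: default
spec:
  routes:
  - name: GET /inventory
    condition:
      method: GET
      pathRegex: /inventory
    isRetryable: true            # only mark idempotent routes retryable
  retryBudget:
    retryRatio: 0.2              # retries may add at most 20% extra load
    minRetriesPerSecond: 10
    ttl: 10s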

Checkpoints by end of Week 4:

  • Releases auto-rollback on SLO regressions during canary.

  • Circuit breakers and timeouts trip under dependency impairment without cascading failures.

  • Change failure rate down; deployment frequency unaffected or up.

Operationalizing: dashboards, reviews, and results

Hardened isn’t a one-off. Bake the habits.

  • Dashboards: one per journey with SLI, budget, top contributors, and recent deployments overlay.

  • Runbooks: link from every alert; include hypothesis lists and known-good queries.

  • Weekly 30-minute error-budget review with engineering and product: agree on whether to spend budget or fix reliability.

  • Post-incident: tag causes (infra, code, third-party) and update SLOs if they incentivized the wrong behavior.

What “good” looks like after 30 days (real outcomes we’ve seen):

  • Page volume down 60–80%; on-call stress cut in half.

  • p95 checkout latency from 1.4s → 700ms after three small fixes surfaced by traces.

  • MTTR from 2h → 25m due to faster isolation via span graphs and logs with request_id.

  • CEO question “Are we healthy?” answered with a single error-budget panel.

Common traps and how to dodge them

  • Boiling the ocean: a dozen SLOs day one. Start with two that map to revenue.

  • Vanity metrics: CPU and GC time as SLOs. Keep those for capacity planning.

  • Zero tolerance: 100% objectives create permanent burn and pager fatigue.

  • Too much tracing: 100% sampling at peak traffic. Start small and turn up only where needed.

  • Silent canaries: canaries without SLO gates are just a slower blast radius.

The tooling stack that works (pick your flavor)

  • Metrics: Prometheus + Grafana

  • Traces: Tempo or Jaeger + OpenTelemetry

  • Logs: Loki or your existing log store

  • Alerts: Alertmanager + PagerDuty

  • SLOs: Sloth (Prometheus) or OpenSLO with a provider like Nobl9

  • Delivery: Argo Rollouts (k8s) or a progressive-delivery operator like Flagger

  • Service mesh (optional): Linkerd or Istio for retries/circuit breakers

  • Chaos: chaos-mesh, toxiproxy, or gremlin

If you’re on VMs, you can still do all of this—replace ServiceMonitors with file-based scrape configs and run the OTel Collector as a systemd service.
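
A sketch of that file-based setup (paths are assumptions); Prometheus reloads the target files on change, so your deploy tooling can just drop YAML into the directory:

scrape_configs:
  - job_name: legacy-service
    file_sd_configs:
      - files: ['/etc/prometheus/targets/legacy-*.yaml']
        refresh_interval: 1m

# /etc/prometheus/targets/legacy-prod.yaml (target file your tooling writes)
- targets: ['legacy-a.example:9100', 'legacy-b.example:9100']
  labels:
    env: prod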

What I’d do differently if I had one more week

  • Add histograms with explicit buckets that match your SLO thresholds (e.g., le=0.2,0.5,0.8,1.2...).

  • Tag users by tier/plan in metrics to see who you’re failing first.

  • Adopt GitOps for observability configs (PrometheusRules, SLO YAML) so changes are reviewable and auditable; sketch after this list.

  • Stand up a synthetic load job (k6) tied to the canary to test peak paths during rollout.
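
For the GitOps bullet, the Week 3 fast-burn alert wrapped in a PrometheusRule resource so it lives in a reviewed repo (the release label is an assumption; match your Prometheus Operator’s ruleSelector):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: legacy-service-slo
  labels:
    release: prometheus          # match your Prometheus Operator's ruleSelector
spec:
  groups:
  - name: slo-burn
    rules:
    - alert: SLOBudgetBurnFast
      expr: ((1 - slo:availability:ratio_rate5m) / (1 - 0.999)) > 14.4
      for: 5m
      labels:
        severity: page
      annotations:
        runbook: https://runbooks.example/availability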


Key takeaways

  • Treat observability as a progressive rollout; ship value in days, not quarters.
  • Define SLOs from real user journeys, not system vanity metrics.
  • Alert on error-budget burn, not raw thresholds, to cut pager noise.
  • Use canary + SLO gates to make hardening enforceable, not aspirational.
  • Keep change small: tiny PRs, one probe at a time, and measure improvement.

Implementation checklist

  • Expose `/metrics` and a health probe; scrape with Prometheus.
  • Add request IDs and structured logs with level, route, and latency.
  • Ship traces with OpenTelemetry; sample 1–5% to start.
  • Define 1–2 SLOs per customer-critical journey with 28–30 day windows.
  • Create multi-window burn-rate alerts that page only on fast burn.
  • Gate canaries with SLO checks; auto-abort bad releases.
  • Run chaos against dependencies; verify circuit breakers and timeouts.
  • Review error budget weekly; adjust priorities based on spend.

Questions we hear from teams

We’re not on Kubernetes. Does this still work?
Yes. Run Prometheus and the OTel Collector on VMs, scrape via static or file-based configs, and use your existing deploy tooling. Swap Argo Rollouts for a canary proxy or feature-flag system (e.g., Flagger, LaunchDarkly) and keep the same SLO checks.
What if we can’t change the app code quickly?
Start with sidecar/proxy metrics (Envoy/Nginx ingress), synthetics, and blackbox probes. Add tracing via auto-instrumentation where possible. You’ll get 60–70% of the value before touching core code.
How many SLOs should we have?
One to two per critical user journey. More than five per service becomes dashboard cosplay. Expand only when you have budget reviews running smoothly.
How do we pick good thresholds?
Use historical percentiles and business feedback. If 95% of checkouts complete under 700ms and users complain above 1s, pick 800ms for a 99% objective. Revisit quarterly.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Stabilize your legacy service with GitPlumbers, or grab our SLO starter pack (dashboards + rules).
