Stop Chasing Graphs: Build Correlation That Predicts Incidents (and Auto-Rolls Back)

Leading indicators, not vanity dashboards. Tie metrics, logs, and traces to triage and rollout automation so you fix the real issue—before PagerDuty wakes you.


The 2 a.m. spike you’ve lived through

I’ve sat in enough war rooms to know the pattern: Grafana lights up with a p95 latency spike, someone yells “it’s the database,” we stare at five dashboards, then discover a quiet feature flag rollout hammered a downstream cache. We burned 45 minutes chasing symptoms.

If your correlation story is “those two graphs looked similar,” you’re going to keep paging the team with noise. The fix is boring and powerful: make telemetry carry context, promote a few leading indicators, and wire them to triage and progressive delivery so the blast radius stays tiny.

If your metrics can’t point to a trace, and your alert can’t point to a change, you’re paying for pictures, not reliability.

What actually predicts incidents (and what doesn’t)

Vanity dashboards are full of averages and totals. Incidents are driven by tails and saturation. The leading indicators that consistently buy you minutes:

  • Tail latency drift: z-score of p95/p99 vs a rolling baseline.
  • Queue age/length slope: rising work-in-progress beats raw RPS.
  • Retry storm: a rise in 5xx followed by a surge in retry-tagged requests.
  • Consumer lag growth: Kafka/Kinesis lag trending up is a pre-incident smell.
  • GC pause rate / heap pressure: JVM/Go runtimes telling you they’re unhappy.
  • Cache miss drift: miss ratio rising is a cascade starter.
  • DB lock wait time: contention often precedes errors.

What doesn’t help much:

  • Average latency (“we’re fine on average”).
  • Host-level CPU without service context.
  • Arbitrary synthetic “health scores.”

Instrument like you mean correlation

Correlation only works if you can follow the breadcrumb trail from symptom to cause. That means IDs, dimensions, and guardrails on cardinality.

  1. Propagate context
    • Use OpenTelemetry SDKs to propagate trace_id/span_id across services. Inject into logs.
    • Enable Prometheus exemplars so metric points carry trace_id links.
// Node.js Express + OTel example (pino shown for structured logs; any logger with child() works)
import express from 'express';
import pino from 'pino';
import { context, trace } from '@opentelemetry/api';

const app = express();
const logger = pino();

// Attach a request-scoped child logger that carries the active trace_id
app.use((req, res, next) => {
  const span = trace.getSpan(context.active());
  const traceId = span?.spanContext().traceId;
  req.logger = logger.child({ trace_id: traceId });
  next();
});

app.get('/checkout', async (req, res) => {
  req.logger.info({ event: 'start_checkout' }, 'start');
  // ...
});
  2. Choose sane dimensions

    • Label metrics with: service, endpoint, version, region, shard, customer_tier.
    • Resist per-user labels. Use exemplars or sampling instead (a scrape-time guardrail is sketched below).
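A scrape-time guardrail sketch for that (label names, target, and limits below are illustrative): cap series per scrape and drop banned labels so one bad instrumentation PR can't explode cardinality.
# prometheus.yml snippet: per-scrape budget plus a label deny-list
scrape_configs:
- job_name: orders
  static_configs:
  - targets: ['orders:9090']
  sample_limit: 10000            # scrape fails loudly if a deploy explodes series count
  metric_relabel_configs:
  - action: labeldrop            # strip banned labels; fix the instrumentation upstream too
    regex: user_id|request_id|session_id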
  3. Create change events

    • Emit deploys, feature flag changes, schema migrations as events into the same store.
    • Attach service, version, git_sha, flag_key, owner.
# Example: emit a change event during CI
curl -X POST https://events.internal/change \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "deploy",
    "service": "orders",
    "version": "2025.12.11-abc123",
    "git_sha": "abc123",
    "owner": "team-orders"
  }'

Build leading indicators with Prometheus (recording rules > expensive panels)

Put the smarts in recording rules so queries are cheap and alerts/gates can reuse them.

# prometheus-rule.yaml
groups:
- name: predictive-signals
  rules:
  - record: service:latency_p95_seconds
    expr: |
      histogram_quantile(
        0.95,
        sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service, endpoint, version)
      )
  - record: service:latency_p95_zscore
    expr: |
      # clamp_min avoids division by zero when the 1h baseline is flat
      (service:latency_p95_seconds - avg_over_time(service:latency_p95_seconds[1h]))
      /
      clamp_min(stddev_over_time(service:latency_p95_seconds[1h]), 0.001)
  - record: service:queue_age_seconds_slope
    expr: deriv(service_queue_oldest_message_seconds[5m])
  - record: service:consumer_lag_slope
    expr: deriv(kafka_consumer_lag{group="orders"}[5m])
  - record: service:retry_rate
    expr: rate(http_client_requests_retries_total[5m])
  - alert: PredictiveDegradation
    expr: |
      (service:latency_p95_zscore > 3)
      and on(service) (service:queue_age_seconds_slope > 0)
      and on(service) (service:retry_rate > 0.1)
    for: 10m
    labels:
      severity: page
      owner: team-orders
    annotations:
      summary: "{{ $labels.service }} degradation likely (p95 z>3, queue age rising, retries up)"
      runbook: "https://runbooks.internal/orders-latency"

This is not magic ML—just stats that consistently predict pain. We’ve used this to catch incidents 10–20 minutes before hard SLO breaches.
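
The rules above cover latency, queue age, lag, and retries; cache miss drift and GC pause rate from the earlier list follow the same pattern. A sketch, assuming a `cache_requests_total` counter with a `result` label (hypothetical name) and Micrometer-style `jvm_gc_pause_seconds_*` metrics:
# additional predictive signals (sketch; metric names are assumptions)
groups:
- name: cache-and-gc-signals
  rules:
  - record: service:cache_miss_ratio
    expr: |
      sum(rate(cache_requests_total{result="miss"}[5m])) by (service)
      /
      sum(rate(cache_requests_total[5m])) by (service)
  - record: service:cache_miss_ratio_drift
    expr: |
      service:cache_miss_ratio - avg_over_time(service:cache_miss_ratio[1h])
  - record: service:gc_pause_seconds_per_second
    expr: sum(rate(jvm_gc_pause_seconds_sum[5m])) by (service)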

From symptom to cause in one click (metrics ↔ traces ↔ logs)

Once the alert fires, make the first 10 minutes automatic.

  • Exemplars to Tempo/Jaeger: Click the dot on your Grafana graph to open the trace that contributed to that metric point.
  • Top-K culprits: Precompute “top slow spans” and “top error classes” for the affected service/version.
  • Change overlay: Show deploy/flag events on the same timeline.
-- ClickHouse example: correlate slow traces with change events by service and time window
-- (non-equality JOIN conditions need a recent ClickHouse; an ASOF JOIN on ts is the older-version alternative)
SELECT
  t.service,
  t.root_span AS slow_span,
  count(*) AS hits,
  anyLast(c.type) AS change,
  anyLast(c.version) AS version
FROM traces t
LEFT JOIN changes c
  ON c.service = t.service
  AND c.ts BETWEEN t.start_time - INTERVAL 5 MINUTE AND t.end_time + INTERVAL 5 MINUTE
WHERE t.duration_ms > 1000  -- "slow trace" cutoff is illustrative; key it off your p95
GROUP BY t.service, slow_span
ORDER BY hits DESC
LIMIT 5;

Pragmatically: if you’re on the Grafana stack, use Prometheus + Tempo + Loki with exemplars; on Datadog or Honeycomb, use their built-in correlation. The pattern is the same.
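
On the Grafana stack, the metrics-to-traces hop is a datasource setting. A provisioning sketch, assuming your exemplars carry a `trace_id` label and your Tempo datasource has the UID `tempo`:
# grafana datasource provisioning (sketch)
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  url: http://prometheus.monitoring:9090
  jsonData:
    exemplarTraceIdDestinations:
    - name: trace_id        # exemplar label that holds the trace ID
      datasourceUid: tempo  # UID of your Tempo datasource (assumption)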

Wire correlation into rollout automation (so bad changes never get big)

If correlation can predict trouble, your rollout system should react. Progressive delivery gives you the handle.

  • Argo Rollouts / Flagger: Gate each step on your predictive signals. Auto-abort and roll back.
  • Feature flags: If a metric points to a flag-enabled path, auto-disable the flag for the impacted cohort.
  • Istio/Linkerd: Use weight shifting and circuit breaking to contain blast radius.
# argo-rollouts AnalysisTemplate using Prometheus signals
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: orders-predictive-checks
spec:
  args:
  - name: version
  metrics:
  - name: p95-zscore
    interval: 1m
    count: 10
    successCondition: result[0] < 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          max(service:latency_p95_zscore{service="orders",version="{{args.version}}"})
  - name: retry-rate
    interval: 1m
    count: 10
    successCondition: result[0] < 0.1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          max(service:retry_rate{service="orders",version="{{args.version}}"})
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - analysis:
          templates:
          - templateName: orders-predictive-checks
            args:
            - name: version
              value: 2025.12.11-abc123
      - setWeight: 50
      - analysis:
          templates:
          - templateName: orders-predictive-checks
            args:
            - name: version
              value: 2025.12.11-abc123
      - setWeight: 100

For flags, wire an Alertmanager webhook to your flag provider’s API (LaunchDarkly shown here; the same pattern works with any OpenFeature-compatible provider) to auto-toggle on breach.

# pseudo-code: Alertmanager webhook -> disable flag for the premium tier
# (illustrative payload; the real LaunchDarkly API expects a JSON Patch / semantic patch body)
curl -X PATCH https://app.launchdarkly.com/api/v2/flags/orders/new-checkout \
  -H "Authorization: $LD_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "environments": {"prod": {"targets": [{"values": ["premium"], "variation": false}]}}
  }'
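
The glue on the Alertmanager side is just a webhook receiver. A sketch, assuming a small internal service (hypothetical `flag-toggler`) that translates the alert payload into the flag API call above:
# alertmanager receiver that forwards breaches to the flag toggler (sketch)
receivers:
- name: flag-guard
  webhook_configs:
  - url: http://flag-toggler.internal/hooks/disable-new-checkout  # hypothetical endpoint
    send_resolved: true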

Example: End-to-end triage flow that doesn’t waste your pager time

  1. Alert: PredictiveDegradation fires on orders (p95 z-score > 3, queue age rising, retries up).
  2. Context pack (Alertmanager): attaches the rollout version, recent flag changes, top errors, and an exemplar trace link.
# Alertmanager route and receiver. "recent_changes" is a custom notification
# template you ship via the templates: setting (path below is an example).
templates:
- /etc/alertmanager/templates/*.tmpl
route:
  receiver: oncall
  group_by: ['service', 'version']
  routes:
  - matchers:
    - severity="page"
    receiver: oncall
    continue: false
receivers:
- name: oncall
  pagerduty_configs:
  - routing_key: ${PD_KEY}  # substituted at deploy time; Alertmanager does not expand env vars itself
    severity: critical
    details:
      service: '{{ .CommonLabels.service }}'
      version: '{{ .CommonLabels.version }}'
      trace: '{{ index (index .Alerts 0).Annotations "__exemplar_trace_url" }}'
      changes: '{{ template "recent_changes" . }}'
  3. One-click: the on-call opens the exemplar trace and sees a slow GetCart span talking to Redis with a high miss ratio.
  4. Change overlay: shows a flag enabling “cart recommendations” for premium users 15 minutes earlier.
  5. Containment: Argo Rollouts aborts at 10% and rolls back automatically; the flag is disabled for the premium cohort.
  6. Aftercare: a GitHub issue is auto-opened with the graphs, trace IDs, and owner attached. MTTR < 20 minutes; the SLO never breached.

What I’d implement this week

  • Day 1–2
    • Turn on OTel propagation and add trace_id to logs.
    • Add Prometheus recording rules for p95 z-score, queue age slope, retry rate, consumer lag slope.
    • Emit change events from CI/CD and flag system.
  • Day 3–4
    • Build a triage dashboard: p95, queue age, retries with exemplar links and change overlays.
    • Gate a single service’s rollout with Argo Rollouts + your predictive metrics.
  • Day 5
    • Backtest last 3 incidents: could these signals have predicted them? Tune thresholds and windows.
    • Add guardrails: label budgets and anti-cardinality linting in CI (a CI sketch follows this list).
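
A CI sketch for those guardrails, in GitHub Actions syntax (the deny-list and file paths are examples; promtool is assumed to be on the runner's PATH):
# .github/workflows/prom-lint.yml (sketch)
name: prom-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Validate recording and alerting rules
      run: promtool check rules prometheus-rule.yaml
    - name: Block unbounded labels (example deny-list)
      run: |
        if grep -RInE 'user_id|request_id|session_id' prometheus-rule.yaml; then
          echo "High-cardinality label found; carry it on traces/exemplars instead"
          exit 1
        fi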

If your stack is a legacy monolith with AI-generated “vibe code” sprinkled in, start smaller: instrument the hot path, add trace IDs to logs, and create two signals (tail latency z-score, retry surge). GitPlumbers has done this in old Spring/Oracle stacks and in shiny Istio meshes—same playbook, different wrench sizes.

Key takeaways

  • Correlation that matters starts with trace-context everywhere: logs, metrics exemplars, and events share a `trace_id` and `span_id`.
  • Leading indicators are behavior deltas: tail latency z-scores, queue age slope, retry surge, GC pause rate, cache miss drift, consumer lag growth.
  • Recording rules turn expensive queries into cheap signals; wire those into rollout gates (Argo Rollouts/Flagger) and Alertmanager routing.
  • Automate the first 10 minutes of triage: “group by trace root, top error class, top slow span, top dependency” and annotate the change that likely caused it (deploy/flag/db change).
  • Don’t drown in cardinality: pick 5-10 dimensions that predict pain (service, endpoint, version, shard, region, customer_tier).
  • Close the loop: failed gates auto-roll back; persistent leading indicators open an issue with context and owners, not just an alert.

Implementation checklist

  • Propagate `trace_id` and `span_id` via OpenTelemetry across services and include them in logs.
  • Define 6–10 leading indicators as recording rules (p95 z-score, queue age slope, consumer lag growth, retry rate, 5xx error ratio, GC pause rate).
  • Create change event streams for deploys, feature flags, and infra changes and attach them to traces/metrics.
  • Build a triage dashboard that starts from the symptom and jumps to the likely root with one click (exemplars to traces).
  • Gate rollouts with Argo Rollouts/Flagger using those leading indicators; auto-abort and roll back on breach.
  • Route alerts using correlation labels (service, version, owner); attach the triggering trace, top N logs, and change events.
  • Budget cardinality: enforce label hygiene, cardinality budgets per team, and aggregation via exemplars rather than exploding labels.
  • Backtest alerts against last 3–6 incidents; track precision, recall, lead time, and MTTR improvements.

Questions we hear from teams

How do we avoid blowing up Prometheus with cardinality when we add context?
Decide upfront on 5–10 allowed labels (service, endpoint, version, region, shard, customer_tier). Block PRs that add unbounded labels (user_id, request_id). Use exemplars to carry high-cardinality context into traces without exploding metric series. Enforce budgets with promtool and lint in CI.
Do we need ML for predictive detection?
Not to start. Z-scores over rolling baselines, slope/derivative for queue age and lag, and correlation with change events catch 80% of pre-incident patterns. Once stable, you can try Kayenta-style canary analysis or simple anomaly detectors—but keep interpretability for on-call.
What about cost? Logs and traces aren’t cheap.
Sample spans (tail-based sampling for errors/long latency), keep logs structured and short, and push high-cardinality data to traces, not metrics. Use recording rules to precompute signals; query raw data only on demand. We often cut costs 20–40% while improving MTTR.
We have a legacy monolith. Will this still work?
Yes. Add OTel auto-instrumentation where possible, inject `trace_id` into logs, and ship a minimal set of metrics (requests, p95, errors, queue age). You can still gate rollouts (blue/green or canary) on those signals. GitPlumbers has done this for JBoss/WebLogic and older Spring stacks.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer about predictive signals
See how we gate rollouts with metrics
