The Correlation Engine That Saved Our Canary (And Your Weekend)

Stop paging on vanity graphs. Design correlation that predicts incidents, shortens triage, and drives automated rollbacks.


The 2 a.m. incident that finally forced us to stop chasing vanity graphs

We’d just pushed a “harmless” refactor of the checkout service. Dashboards looked green—CPU, memory, 200s per second. Ten minutes later, pages lit up: p99 latency spiked, carts abandoned. The usual Grafana tour didn’t help. What did? A correlation job flagged three leading indicators before the blast radius expanded:

  • Cache hit ratio dropped from 0.96 to 0.78 for checkout -> pricing calls.
  • Retry storms on the payment-client jumped 5x, with circuit_breaker_open_total flapping.
  • CPU throttling on checkout pods (container_cpu_cfs_throttled_seconds_total) rose 3x.

The engine tied all three to a specific change event: checkout:v2025.10.09 rollout at 21:17, with a feature flag flip for promo_v3. Argo Rollouts auto-paused the canary, we rolled back, and MTTR was 12 minutes. No root-cause debate, no dashboard archaeology. The point: you need correlation that connects symptoms to causes and then drives action.

What actually predicts incidents: leading indicators you can measure

If your alerts are CPU > 80% and 5xx > 1%, you’re babysitting vanity metrics. The signals that predict pain usually show up 5–20 minutes before the error budget explodes:

  • Queue depth and lag: rabbitmq_queue_messages_ready, kafka_consumergroup_lag predict request timeouts before they hit user-facing SLOs.
  • Retry rate and backoff collapses: http_client_request_retries_total, grpc_client_retry_total—watch for retry storms that amplify latency.
  • Saturation, not just utilization: container_cpu_cfs_throttled_seconds_total, thread-pool saturation (executor_active_threads / executor_pool_size).
  • Garbage collection and stop-the-world: jvm_gc_collection_seconds_sum rate and go_sched_latency_seconds tails.
  • Cache health: cache_hit_ratio, redis_cmd_duration_seconds_bucket p99; cache misses are a latency tax.
  • Circuit breaker state: circuit_breaker_open_total correlated with upstream timeouts.
  • Error-budget burn rate: fast and slow burn queries predict imminent SLO violation.
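That last bullet is just arithmetic: burn rate is the error rate divided by the error budget (1 − SLO), and multiwindow alerts page only when both a fast and a slow window are burning hot. A minimal sketch (the 14.4x threshold follows the Google SRE workbook’s fast-burn example; function names are ours):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # 1.0 means you spend the error budget exactly as fast as the SLO allows
    return error_rate / (1.0 - slo)

def should_page(err_5m: float, err_1h: float, slo: float = 0.999) -> bool:
    # multiwindow fast burn: both windows must be hot, which filters transient blips
    threshold = 14.4
    return burn_rate(err_5m, slo) > threshold and burn_rate(err_1h, slo) > threshold
```

At a 99.9% SLO, a sustained 2% error rate burns budget at 20x and pages; a five-minute blip that has cooled off by the one-hour window does not.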

PromQL snippets you’ll actually use:

# p99 latency per service/version
histogram_quantile(0.99, sum by (le, service, version) (
  rate(http_server_request_duration_seconds_bucket[5m])
))

# Error ratio (5m); divide by your error budget (1 - SLO) to get burn rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Kafka consumer lag (max across partitions)
max(kafka_consumergroup_lag{consumergroup="payments"})

# CPU throttling (per pod)
rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])

If the only leading indicator you track is CPU, you’re flying IFR with a broken altimeter.
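The correlation job pulls these series over the Prometheus HTTP API’s query_range endpoint. A stdlib-only sketch (the base URL and step are assumptions; add auth as your setup requires):

```python
import json
import urllib.parse
import urllib.request

def build_range_url(base_url: str, query: str, start: int, end: int, step: str = "30s") -> str:
    # query_range returns one {metric, values: [[ts, "val"], ...]} entry per series
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"

def prom_range(base_url: str, query: str, start: int, end: int, step: str = "30s"):
    with urllib.request.urlopen(build_range_url(base_url, query, start, end, step)) as resp:
        return json.load(resp)["data"]["result"]
```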

Build a correlation engine, not a dashboard: design that sticks

Dashboards are for humans. Correlation is for machines that don’t get tired. Here’s the architecture we deploy at GitPlumbers when we need predictive signals and explainable suspects:

  • OpenTelemetry everywhere: trace_id and span_id let you attach logs and metrics to actual requests. Emit exemplars from Prometheus so traces hang off graphs.
  • Service graph: explicit map of service -> dependency with metadata (protocol, criticality). Store it in Git next to the code.
  • Change events as first-class telemetry: deploys, feature flag flips, config changes, schema migrations. Label them with service, version, git_sha, change_type.
  • Correlation methods: start simple—cross-correlation with small lags (±10m) across leading indicators per edge in the graph. Weight by historical predictive power.
  • Cardinality control: constrain label sets to service, version, env, endpoint (if needed). Anything more, and your TSDB invoice becomes an outage.

OpenTelemetry Collector config that tags environment and ships to Prometheus remote write, Tempo/Jaeger, and Loki:

receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch: {}
  attributes:
    actions:
      - key: deployment.environment
        from_attribute: k8s.namespace.name
        action: upsert
      - key: service.version
        from_attribute: k8s.pod.label.app_kubernetes_io_version
        action: upsert
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [loki]

Keep a checked-in service graph (YAML) that the engine reads:

services:
  checkout:
    owns: ["/checkout", "/cart"]
    depends:
      - name: pricing
        protocol: http
        criticality: high
      - name: payments
        protocol: grpc
        criticality: high
      - name: redis
        protocol: tcp
        criticality: medium
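When a symptom fires on checkout, the engine walks this graph to decide which edges to correlate, critical dependencies first. A sketch of that walk, with the YAML above mirrored as a dict and criticality normalized to levels (parsing left out to stay stdlib-only):

```python
SERVICE_GRAPH = {
    "checkout": {
        "owns": ["/checkout", "/cart"],
        "depends": [
            {"name": "pricing", "protocol": "http", "criticality": "high"},
            {"name": "payments", "protocol": "grpc", "criticality": "high"},
            {"name": "redis", "protocol": "tcp", "criticality": "medium"},
        ],
    },
}

def edges_to_check(service: str) -> list:
    # critical dependencies first, so the correlation job looks there before the long tail
    order = {"high": 0, "medium": 1, "low": 2}
    deps = SERVICE_GRAPH.get(service, {}).get("depends", [])
    return [d["name"] for d in sorted(deps, key=lambda d: order[d["criticality"]])]
```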

Wire telemetry to triage: suspects, runbooks, and Slack, automatically

When SLOs burn, humans need a ranked suspect list and one-click runbooks—every time. Here’s the loop that works:

  1. Detect symptom: p99 breach or burn-rate alert fires.
  2. Correlate with leading indicators on upstream edges in the service graph (±10m lag).
  3. Attach change events around the same window (deploys, feature flips, DB migrations).
  4. Rank suspects by correlation strength and prior predictive accuracy.
  5. Post to Slack with deep links to Grafana, Jaeger, and runbooks.

A tiny Python sketch that computes cross-correlation with lags and ranks suspects:

# pip install pandas
import pandas as pd

def xcorr(a: pd.Series, b: pd.Series, max_lag: int = 10):
    """Best (lag, corr) between symptom a and candidate b.
    Positive lag means b led a by `lag` samples."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        corr = a.corr(b.shift(lag))
        if pd.notna(corr) and abs(corr) > abs(best[1]):
            best = (lag, corr)
    return best

# rank candidate indicators against the symptom series
symptom = df["p99_latency"]
candidates = ["kafka_lag", "cache_hit", "retry_rate", "cpu_throttle"]
ranked = sorted(
    ((name, *xcorr(symptom, df[name])) for name in candidates),
    key=lambda t: abs(t[2]),
    reverse=True,
)
print(ranked[:3])  # top suspects with lag and correlation strength
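Step 4 of the loop weights raw correlation by how often an indicator actually predicted past incidents, so that one lucky spike doesn’t outrank a proven signal. A hypothetical weighting on top of the xcorr output:

```python
def rank_with_priors(xcorr_results, priors, default_prior=0.5):
    """xcorr_results: [(name, lag, corr)]; priors: name -> historical hit rate in [0, 1]."""
    scored = [
        (name, lag, corr, abs(corr) * priors.get(name, default_prior))
        for name, lag, corr in xcorr_results
    ]
    return sorted(scored, key=lambda t: t[3], reverse=True)
```

A noisy indicator with |corr| of 0.9 but a 10% hit rate ranks below a boring one at 0.6 with an 80% record, which is exactly the point.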

Alertmanager annotates incidents with runbooks and trace links:

receivers:
- name: sre-slack
  slack_configs:
  - channel: '#sre-incidents'
    title: "{{ .CommonLabels.alertname }}: {{ .CommonLabels.service }}"
    text: |
      SLO burn on {{ .CommonLabels.service }}
      Suspects: {{ .CommonAnnotations.suspects }}
      Runbook: {{ .CommonAnnotations.runbook_url }}
      Trace: {{ .CommonAnnotations.exemplar_trace_url }}

Pro tip: enable Prometheus exemplars in Grafana so clicking a spike jumps straight to a representative trace.

Let rollouts prove themselves: canaries gated by real signals

If you’re not tying correlation to automation, you’re just producing nicer postmortems. Gate rollouts with real indicators using Argo Rollouts or Flagger. Example AnalysisTemplate that blocks promotion on error rate and p99:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-analysis
spec:
  args:
    - name: canaryVersion
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      failureCondition: result[0] > 0.02
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="checkout",version="{{args.canaryVersion}}",status=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="checkout",version="{{args.canaryVersion}}"}[1m]))
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.300
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum by (le) (
              rate(http_server_request_duration_seconds_bucket{service="checkout",version="{{args.canaryVersion}}"}[1m])
            ))

Rollout spec that uses it:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 120}
      - analysis:
          templates:
          - templateName: checkout-canary-analysis
          args:
          - name: canaryVersion
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 50
      - pause: {duration: 180}
      - analysis:
          templates:
          - templateName: checkout-canary-analysis
          args:
          - name: canaryVersion
            valueFrom:
              podTemplateHashValue: Latest

Prefer real dependencies as guardrails too. If your checkout depends on Kafka, add a metric for kafka_consumergroup_lag to the analysis. It’s amazing how many “okay” canaries roll forward into a broken queue.

If you like Flagger with Istio/Linkerd, same idea—define metrics that reflect user outcomes and dependency health. Kayenta works too if you’re already in the Spinnaker ecosystem.

Anti-patterns, sharp edges, and how to not melt your TSDB

I’ve watched teams burn months building Rube Goldberg observability that nobody trusts. Avoid these traps:

  • Cardinality explosions: never index by user_id, session_id, or full path. Use exemplars or tracing for high-cardinality per-request context.
  • Symptoms as causes: alerting on 5xx and calling it root cause is how you get pager fatigue.
  • Ignoring change events: if your correlation doesn’t ingest GitOps rollouts, feature flags (LaunchDarkly/Unleash), and DB migrations, you’ll miss the smoking gun.
  • No backtesting: keep a table of incidents and evaluate which indicators would have predicted them. Delete rules that don’t move MTTR or false positive rate.
  • One-size-fits-all thresholds: p99 for batch jobs is noise; for checkout it’s religion. Calibrate per service SLOs.
  • Unbounded scraping: scrape intervals <10s with 100k time series? Enjoy your Mimir/VictoriaMetrics bill. Start with 15s; aggregate with recording rules.
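The cardinality trap is multiplication: active series scale with the product of per-label cardinalities. Back-of-envelope, with made-up counts:

```python
from math import prod

def series_estimate(label_cardinalities: dict) -> int:
    # worst case: one time series per combination of label values
    return prod(label_cardinalities.values())

bounded = series_estimate({"service": 40, "version": 3, "env": 4, "endpoint": 25})
exploded = series_estimate({"service": 40, "version": 3, "user_id": 100_000})
print(bounded, exploded)  # 12,000 vs 12,000,000: one label turns the bill into an outage
```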

Recording rules that save queries and your wallet:

groups:
- name: service-slo
  interval: 30s
  rules:
  - record: service:http_req_p99_seconds
    expr: |
      histogram_quantile(0.99, sum by (le, service) (
        rate(http_server_request_duration_seconds_bucket[5m])
      ))
  - record: service:error_rate
    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m]))

What good looks like: outcomes you can defend to the CFO

When we’ve rolled this out at scale (retail, fintech, B2B SaaS), the pattern is consistent within 6–8 weeks:

  • MTTR down 40–70% because triage starts with a ranked suspect list and relevant traces.
  • False positives down 50% from killing vanity alerts and gating on burn rate + leading indicators.
  • Safer deploys: automated canary rollback catches 80% of regressions before they hit 5% of users.
  • Cheaper observability: 20–30% cost reduction by moving to recording rules and pruning label cardinality.

If your incident reviews still end with “we need more dashboards,” you’re solving the wrong problem. Build correlation that predicts, explains, and acts.


If you want help blueprinting this and landing it in weeks, not quarters, GitPlumbers has done it in messier environments than yours—legacy JVMs, chatty Node, the works. We’ll plug in OpenTelemetry, wire your service graph, and make Argo Rollouts your safety net without grinding the team to a halt.


Key takeaways

  • Dashboards don’t diagnose—correlation across traces, metrics, logs, and change events does.
  • Leading indicators beat vanity metrics: queue depth, retry rate, GC time, CPU throttling, cache hit ratio, and error-budget burn are predictive.
  • Trace context and a service graph make correlation explainable and automatable.
  • Tie correlation to action: generate a suspect list for triage and gate canaries with AnalysisTemplates.
  • Control cardinality and compute cost or you’ll melt your TSDB and your on-call brain.

Implementation checklist

  • Instrument with OpenTelemetry and propagate context across services and message buses.
  • Model a service graph and attach change events (deploys, feature flags) as first-class signals.
  • Define leading indicators per dependency: queue depth, consumer lag, throttling, GC, retry rate, circuit breaker opens.
  • Create Prometheus recording rules for p99, error rate, and burn rate; publish exemplars to link traces.
  • Automate triage: Slack a ranked suspect list with links to runbooks and traces.
  • Gate rollouts with Argo Rollouts or Flagger using PromQL-based AnalysisTemplates.
  • Continuously backtest alerts against incident history; delete rules that don’t reduce MTTR or false positives.
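That last checklist item is a loop over history, not a product feature. A sketch, under the assumption that a rule counts as predictive when it fired 5–20 minutes before incident onset (timestamps in epoch minutes; names are ours):

```python
def backtest(incident_onsets, indicator_firings, lead_min=5, lead_max=20):
    """Return (recall over incidents, count of firings that matched no incident)."""
    firings = sorted(indicator_firings)
    predicted = sum(
        any(t - lead_max <= f <= t - lead_min for f in firings)
        for t in incident_onsets
    )
    false_alarms = sum(
        not any(f + lead_min <= t <= f + lead_max for t in incident_onsets)
        for f in firings
    )
    recall = predicted / len(incident_onsets) if incident_onsets else 0.0
    return recall, false_alarms
```

Rules whose recall never beats the noise floor get deleted, per the anti-pattern list above.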

Questions we hear from teams

How do we start without boiling the ocean?
Pick one critical service with a clear SLO, instrument it end-to-end with OpenTelemetry, define three leading indicators (e.g., retry rate, queue lag, CPU throttling), and gate its canary with an AnalysisTemplate. Backtest the alerts against the last five incidents. Expand from there.
Can we do this with Datadog/New Relic instead of Prometheus/OTel?
Yes. The principles are the same: propagate trace context, model a service graph, capture change events, define leading indicators, and gate rollouts. Datadog Monitors with APM traces and Deployment Tracking, or New Relic NerdGraph + AIOps, can implement the same loop.
Isn’t correlation just fancy alert spam?
It becomes spam if you don’t rank suspects, include change events, or tie it to action. Correlation should reduce pages by finding predictive signals and either auto-rolling back or handing you the top two suspects with links to runbooks and traces.
What about machine learning for anomaly detection?
Great once your basics are rock solid. We’ve seen simple cross-correlation + service graph + burn rate beat “AI” anomaly boxes until the data hygiene, sampling, and labeling mature. Then add ML for seasonality and multi-variate drift.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a Telemetry Architecture Assessment
Read the Canary Correlation Case Study
