The Correlation Engine That Saved Our Canary (And Your Weekend)
Stop paging on vanity graphs. Design correlation that predicts incidents, shortens triage, and drives automated rollbacks.
Dashboards are for humans. Correlation is for machines that don’t get tired.
The 2 a.m. incident that finally forced us to stop chasing vanity graphs
We’d just pushed a “harmless” refactor of the checkout service. Dashboards looked green—CPU, memory, 200s per second. Ten minutes later, pages lit up: p99 latency spiked, carts abandoned. The usual Grafana tour didn’t help. What did? A correlation job flagged three leading indicators before the blast radius expanded:
- Cache hit ratio dropped from 0.96 to 0.78 for `checkout -> pricing` calls.
- Retries on the `payment-client` jumped 5x, with `circuit_breaker_open_total` flapping.
- CPU throttling on `checkout` pods (`container_cpu_cfs_throttled_seconds_total`) rose 3x.
The engine tied all three to a specific change event: the `checkout:v2025.10.09` rollout at 21:17, with a feature flag flip for `promo_v3`. Argo Rollouts auto-paused the canary, we rolled back, and MTTR was 12 minutes. No root-cause debate, no dashboard archaeology. The point: you need correlation that connects symptoms to causes and then drives action.
What actually predicts incidents: leading indicators you can measure
If your alerts are CPU > 80% and 5xx > 1%, you’re babysitting vanity metrics. The signals that predict pain usually show up 5–20 minutes before the error budget explodes:
- Queue depth and lag: `rabbitmq_queue_messages_ready` and `kafka_consumergroup_lag` predict request timeouts before they hit user-facing SLOs.
- Retry rate and backoff collapse: `http_client_request_retries_total`, `grpc_client_retry_total`; watch for retry storms that amplify latency.
- Saturation, not just utilization: `container_cpu_cfs_throttled_seconds_total`, thread-pool saturation (`executor_active_threads / executor_pool_size`).
- Garbage collection and stop-the-world pauses: `jvm_gc_collection_seconds_sum` rate and `go_sched_latency_seconds` tails.
- Cache health: `cache_hit_ratio`, `redis_cmd_duration_seconds_bucket` p99; cache misses are a latency tax.
- Circuit breaker state: `circuit_breaker_open_total` correlated with upstream timeouts.
- Error-budget burn rate: fast- and slow-burn queries predict imminent SLO violation.
PromQL snippets you’ll actually use:
# p99 latency per service/version
histogram_quantile(0.99, sum by (le, service, version) (
  rate(http_server_request_duration_seconds_bucket[5m])
))
# Error ratio over 5m; divide by the error budget (1 - SLO) for burn rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Kafka consumer lag (max across partitions)
max(kafka_consumergroup_lag{consumergroup="payments"})
# CPU throttling (per pod)
rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])

If the only leading indicator you track is CPU, you’re flying IFR with a broken altimeter.
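Burn rate turns that error ratio into something you can page on: divide by the SLO’s error budget and alert when the short window burns far faster than sustainable. A minimal Python sketch — the 14.4x fast-burn threshold is the conventional multi-window value for a 30-day budget; tune it for your own SLOs:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning."""
    return error_ratio / (1.0 - slo)

def should_page(error_ratio: float, slo: float = 0.999,
                fast_burn: float = 14.4) -> bool:
    """Page only when the short-window burn rate crosses the fast threshold."""
    return burn_rate(error_ratio, slo) >= fast_burn

# A 2% error ratio against a 99.9% SLO is a ~20x burn: page now.
print(should_page(0.02))    # True
print(should_page(0.0005))  # False: 0.5x burn, within budget
```

Pair a fast window (5m) with a slow one (1h) so one bad scrape doesn’t page anyone.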
Build a correlation engine, not a dashboard: design that sticks
Dashboards are for humans. Correlation is for machines that don’t get tired. Here’s the architecture we deploy at GitPlumbers when we need predictive signals and explainable suspects:
- OpenTelemetry everywhere: `trace_id` and `span_id` let you attach logs and metrics to actual requests. Emit exemplars from Prometheus so traces hang off graphs.
- Service graph: an explicit map of `service -> dependency` with metadata (protocol, criticality). Store it in Git next to the code.
- Change events as first-class telemetry: deploys, feature flag flips, config changes, schema migrations. Label them with `service`, `version`, `git_sha`, `change_type`.
- Correlation methods: start simple: cross-correlation with small lags (±10m) across leading indicators per edge in the graph. Weight by historical predictive power.
- Cardinality control: constrain label sets to `service`, `version`, `env`, `endpoint` (if needed). Anything more, and your TSDB invoice becomes an outage.
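To make change events usable by the engine, emit them with exactly those labels. A minimal sketch — the schema and `emit_change_event` helper are illustrative, not a standard API, and shipping to your event store (Loki stream, Grafana annotation, a plain table) is left out:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ChangeEvent:
    service: str
    version: str
    git_sha: str
    change_type: str   # e.g. "deploy", "flag_flip", "migration", "config"
    timestamp: float   # epoch seconds, so it aligns with metric windows

def emit_change_event(ev: ChangeEvent) -> str:
    """Serialize the event for your event store; transport is out of scope."""
    return json.dumps(asdict(ev))

# Hypothetical deploy event matching the incident in the intro.
payload = emit_change_event(
    ChangeEvent("checkout", "v2025.10.09", "abc1234", "deploy", time.time())
)
```

Fire these from CI, your flag SDK’s webhook, and migration tooling — anywhere state changes.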
OpenTelemetry Collector config that tags environment and ships to Prometheus remote write, Tempo/Jaeger, and Loki:
receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch: {}
  attributes:
    actions:
      - key: deployment.environment
        from_attribute: k8s.namespace.name
        action: upsert
      - key: service.version
        from_attribute: k8s.pod.label.app_kubernetes_io_version
        action: upsert

exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.com/api/v1/push
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [loki]

Keep a checked-in service graph (YAML) that the engine reads:
services:
  checkout:
    owns: ["/checkout", "/cart"]
    depends:
      - name: pricing
        protocol: http
        critical: true
      - name: payments
        protocol: grpc
        critical: true
      - name: redis
        protocol: tcp
        critical: medium

Wire telemetry to triage: suspects, runbooks, and Slack, automatically
When SLOs burn, humans need a ranked suspect list and one-click runbooks—every time. Here’s the loop that works:
- Detect symptom: p99 breach or burn-rate alert fires.
- Correlate with leading indicators on upstream edges in the service graph (±10m lag).
- Attach change events around the same window (deploys, feature flips, DB migrations).
- Rank suspects by correlation strength and prior predictive accuracy.
- Post to Slack with deep links to Grafana, Jaeger, and runbooks.
A tiny Python sketch that computes cross-correlation with lags and ranks suspects:
# pip install pandas numpy
import numpy as np
import pandas as pd

def xcorr(a, b, max_lag=10):
    # a: symptom series, b: candidate leading indicator
    lags = range(-max_lag, max_lag + 1)
    scores = []
    for l in lags:
        shifted = b.shift(l)
        corr = a.corr(shifted)
        if not np.isnan(corr):  # skip lags with too few overlapping points
            scores.append((l, corr))
    return max(scores, key=lambda x: abs(x[1]))  # (lag, score)

# rank candidates against the symptom; df holds time-aligned series
symptom = df["p99_latency"]
candidates = [(name, df[name]) for name in ["kafka_lag", "cache_hit", "retry_rate", "cpu_throttle"]]
ranked = sorted(((n,) + xcorr(symptom, s) for n, s in candidates), key=lambda t: abs(t[2]), reverse=True)
print(ranked[:3])  # top suspects with lag and correlation strength

Alertmanager annotates incidents with runbooks and trace links:
receivers:
  - name: sre-slack
    slack_configs:
      - channel: sre-incidents
        title: "{{ .CommonLabels.alertname }}: {{ .CommonLabels.service }}"
        text: |
          SLO burn on {{ .CommonLabels.service }}
          Suspects: {{ .CommonAnnotations.suspects }}
          Runbook: {{ .CommonAnnotations.runbook_url }}
          Trace: {{ .CommonAnnotations.exemplar_trace_url }}

Pro tip: enable Prometheus exemplars in Grafana so clicking a spike jumps straight to a representative trace.
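The `suspects` annotation has to come from somewhere. A small formatter over the ranked `(name, lag, correlation)` tuples a cross-correlation step produces keeps the Slack message scannable; the names and data below are illustrative:

```python
def format_suspects(ranked, top_n=2):
    """ranked: (metric_name, lag_minutes, correlation) tuples, strongest first."""
    parts = [
        f"{name} (corr={corr:+.2f}, lag={lag}m)"
        for name, lag, corr in ranked[:top_n]
    ]
    return " | ".join(parts)

# Shape matches the output of the xcorr ranking sketch.
ranked = [("cache_hit", -4, -0.91), ("retry_rate", -2, 0.87), ("cpu_throttle", 0, 0.55)]
suspects = format_suspects(ranked)
print(suspects)
```

Cap it at two suspects: a ten-item list at 2 a.m. is just a second dashboard.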
Let rollouts prove themselves: canaries gated by real signals
If you’re not tying correlation to automation, you’re just producing nicer postmortems. Gate rollouts with real indicators using Argo Rollouts or Flagger. Example AnalysisTemplate that blocks promotion on error rate and p99:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-analysis
spec:
  args:
    - name: canaryVersion
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      failureCondition: result[0] > 0.02
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="checkout",version="{{args.canaryVersion}}",status=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="checkout",version="{{args.canaryVersion}}"}[1m]))
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.300
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum by (le) (
              rate(http_server_request_duration_seconds_bucket{service="checkout",version="{{args.canaryVersion}}"}[1m])
            ))

Rollout spec that uses it:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: canaryVersion
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 50
        - pause: {duration: 180}
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: canaryVersion
                valueFrom:
                  podTemplateHashValue: Latest

Prefer real dependencies as guardrails too. If your checkout depends on Kafka, add a metric for `kafka_consumergroup_lag` to the analysis. It’s amazing how many “okay” canaries roll forward into a broken queue.
If you like Flagger with Istio/Linkerd, same idea—define metrics that reflect user outcomes and dependency health. Kayenta works too if you’re already in the Spinnaker ecosystem.
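Whichever tool you pick, the gate semantics are the same: repeated measurements, a success condition, a failure condition, and a verdict. A local sketch of that decision logic — thresholds mirror the error-rate metric above; this is not the Argo implementation, just the contract it enforces:

```python
def evaluate_canary(samples, success_below=0.01, failure_above=0.02):
    """samples: error-rate measurements taken at each analysis interval.

    Any sample over the failure threshold aborts; all samples under the
    success threshold promotes; anything in between stays paused.
    """
    if any(s > failure_above for s in samples):
        return "rollback"
    if all(s < success_below for s in samples):
        return "promote"
    return "inconclusive"   # between thresholds: pause and keep sampling

print(evaluate_canary([0.002, 0.004, 0.003, 0.001, 0.005]))  # promote
print(evaluate_canary([0.002, 0.035, 0.003, 0.001, 0.002]))  # rollback
```

The gap between success and failure thresholds is deliberate: it stops a borderline canary from flapping between promote and abort.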
Anti-patterns, sharp edges, and how to not melt your TSDB
I’ve watched teams burn months building Rube Goldberg observability that nobody trusts. Avoid these traps:
- Cardinality explosions: never index by `user_id`, `session_id`, or full `path`. Use exemplars or tracing for high-cardinality per-request context.
- Symptoms as causes: alerting on `5xx` and calling it root cause is how you get pager fatigue.
- Ignoring change events: if your correlation doesn’t ingest GitOps rollouts, feature flags (LaunchDarkly/Unleash), and DB migrations, you’ll miss the smoking gun.
- No backtesting: keep a table of incidents and evaluate which indicators would have predicted them. Delete rules that don’t move MTTR or false-positive rate.
- One-size-fits-all thresholds: p99 for batch jobs is noise; for checkout it’s religion. Calibrate per-service SLOs.
- Unbounded scraping: scrape intervals under 10s with 100k time series? Enjoy your Mimir/VictoriaMetrics bill. Start with 15s and aggregate with recording rules.
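Backtesting doesn’t need a data platform; with timestamps for past alerts and incidents you can score each candidate rule’s precision and recall directly. A sketch, assuming epoch-second timestamps and a 20-minute lead window:

```python
def backtest(alert_times, incident_times, lead_window=20 * 60):
    """Score one alert rule against incident history.

    An alert is a true positive if an incident starts within lead_window
    seconds after it; an incident is 'caught' if any alert preceded it
    within the window. Returns (precision, recall).
    """
    predicted = sum(
        any(0 <= inc - a <= lead_window for inc in incident_times)
        for a in alert_times
    )
    caught = sum(
        any(0 <= inc - a <= lead_window for a in alert_times)
        for inc in incident_times
    )
    precision = predicted / len(alert_times) if alert_times else 0.0
    recall = caught / len(incident_times) if incident_times else 0.0
    return precision, recall

# Two alerts: the first preceded an incident, the second was noise.
# Two incidents: only the first had a preceding alert.
precision, recall = backtest([100, 5000], [800, 9000])
```

Run this over your last quarter of incidents per rule; anything below coin-flip precision goes in the bin.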
Recording rules that save queries and your wallet:
groups:
  - name: service-slo
    interval: 30s
    rules:
      - record: service:http_req_p99_seconds
        expr: |
          histogram_quantile(0.99, sum by (le, service) (
            rate(http_server_request_duration_seconds_bucket[5m])
          ))
      - record: service:error_rate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m]))

What good looks like: outcomes you can defend to the CFO
When we’ve rolled this out at scale (retail, fintech, B2B SaaS), the pattern is consistent within 6–8 weeks:
- MTTR down 40–70% because triage starts with a ranked suspect list and relevant traces.
- False positives down 50% from killing vanity alerts and gating on burn rate + leading indicators.
- Safer deploys: automated canary rollback catches 80% of regressions before they hit 5% of users.
- Cheaper observability: 20–30% cost reduction by moving to recording rules and pruning label cardinality.
If your incident reviews still end with “we need more dashboards,” you’re solving the wrong problem. Build correlation that predicts, explains, and acts.
If you want help blueprinting this and landing it in weeks, not quarters, GitPlumbers has done it in messier environments than yours—legacy JVMs, chatty Node, the works. We’ll plug in OpenTelemetry, wire your service graph, and make Argo Rollouts your safety net without grinding the team to a halt.
Key takeaways
- Dashboards don’t diagnose—correlation across traces, metrics, logs, and change events does.
- Leading indicators beat vanity metrics: queue depth, retry rate, GC time, CPU throttling, cache hit ratio, and error-budget burn are predictive.
- Trace context and a service graph make correlation explainable and automatable.
- Tie correlation to action: generate a suspect list for triage and gate canaries with AnalysisTemplates.
- Control cardinality and compute cost or you’ll melt your TSDB and your on-call brain.
Implementation checklist
- Instrument with OpenTelemetry and propagate context across services and message buses.
- Model a service graph and attach change events (deploys, feature flags) as first-class signals.
- Define leading indicators per dependency: queue depth, consumer lag, throttling, GC, retry rate, circuit breaker opens.
- Create Prometheus recording rules for p99, error rate, and burn rate; publish exemplars to link traces.
- Automate triage: Slack a ranked suspect list with links to runbooks and traces.
- Gate rollouts with Argo Rollouts or Flagger using PromQL-based AnalysisTemplates.
- Continuously backtest alerts against incident history; delete rules that don’t reduce MTTR or false positives.
Questions we hear from teams
- How do we start without boiling the ocean?
- Pick one critical service with a clear SLO, instrument it end-to-end with OpenTelemetry, define three leading indicators (e.g., retry rate, queue lag, CPU throttling), and gate its canary with an AnalysisTemplate. Backtest the alerts against the last five incidents. Expand from there.
- Can we do this with Datadog/New Relic instead of Prometheus/OTel?
- Yes. The principles are the same: propagate trace context, model a service graph, capture change events, define leading indicators, and gate rollouts. Datadog Monitors with APM traces and Deployment Tracking, or New Relic NerdGraph + AIOps, can implement the same loop.
- Isn’t correlation just fancy alert spam?
- It becomes spam if you don’t rank suspects, include change events, or tie it to action. Correlation should reduce pages by finding predictive signals and either auto-rolling back or handing you the top two suspects with links to runbooks and traces.
- What about machine learning for anomaly detection?
- Great once your basics are rock solid. We’ve seen simple cross-correlation + service graph + burn rate beat “AI” anomaly boxes until the data hygiene, sampling, and labeling mature. Then add ML for seasonality and multi-variate drift.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
