Correlation That Saves Your On-Call: Turning Symptoms into Root Cause (and Automated Rollbacks)
Forget dashboards that look pretty but tell you nothing. Here’s how to design correlation that predicts incidents, ties telemetry to triage, and drives automated rollouts and rollbacks.
Correlation isn’t a dashboard feature. It’s the contract your telemetry makes with your rollback button.
The night your “green” dashboards lied
We had a retailer whose checkout “looked fine” in Grafana while customers rage-refreshed. CPU green, RPS stable, p95 only slightly up. But five minutes before the incident, grpc_client_retry_total to the payment gateway spiked and Kafka consumer lag started creeping. That was the real signal. The correlation wasn’t visible because retries, lag, and version labels weren’t on the same playing field. We wired them up, built predictive alerts, and taught the rollout system to back off on its own. MTTR went from hours to minutes, and change failure rate halved.
If your telemetry can’t connect symptoms to the changes that caused them, it’s theater, not observability.
Measure what predicts incidents, not what flatters slides
Vanity metrics: raw CPU, total request count, “up” status, average latency. None of these predict pain.
Leading indicators that actually move the needle:
- Saturation: workqueue_depth, thread-pool/concurrency utilization, nginx_active_connections, DB connection pool usage.
- Backpressure: retry rates (grpc_client_retry_total, http_client_retry_total), 503/429 rates, circuit breaker open counts.
- Tail latency: p99 and p99.9, especially on dependencies; histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])).
- Queues and lag: Kafka consumer lag, SQS queue depth, Redis stream lag.
- Cache health: miss ratio, eviction rate.
- GC/runtime headroom: JVM GC pause time, Go sched_latency_seconds > 0.5s bursts.
- Network/DNS: TCP retransmits, TLS handshake time, DNS resolution latency.
- Infra pressure: pending pods, node_pressure signals, disk I/O wait.
Tie these to business SLOs (checkout availability, ingestion freshness). Alert on burn rate and predictive saturation, not raw counts.
# p99 frontend latency (5m) for service/version
histogram_quantile(
  0.99,
  sum by (le, service, version) (
    rate(http_request_duration_seconds_bucket{service="web"}[5m])
  )
)

Design your data so correlation is trivial
I’ve seen teams spend fortunes on “observability platforms” and still fail because dimensions don’t line up. You can’t correlate what you don’t consistently label.
Do this first:
- Standardize resource attributes across metrics, logs, and traces: service, version, env, region, commit, feature_flag, canary (true/false), k8s_namespace, k8s_pod.
- Propagate trace_id everywhere. Add it to log lines and emit exemplars on metrics.
- Emit change events as metrics. Expose “deploy/config/flag flip” as time series you can join by label.
- Cardinality budgets. Only add labels you’d actually page or route on.
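A cardinality budget only sticks if it is enforced, not just documented. A minimal pre-merge check in Python, sketched under assumptions — the allowlist contents and sample series are illustrative, not a standard:

```python
# Hypothetical per-team label allowlist: only dimensions you page or route on.
ALLOWED_LABELS = {
    "service", "version", "env", "region", "commit",
    "feature_flag", "canary", "k8s_namespace", "k8s_pod",
}

def label_violations(series_labels: dict) -> list:
    """Return label names that bust the budget (anything off the allowlist)."""
    return sorted(k for k in series_labels if k not in ALLOWED_LABELS)

# user_id belongs in traces/logs, never in metric labels
violations = label_violations({"service": "web", "version": "1.42.0", "user_id": "u-9"})
```

Run it in CI against your instrumentation manifests so a rogue label is a failed build, not a melted Prometheus.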
OpenTelemetry resource config example:
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http: {}
      grpc: {}
processors:
  batch: {}
  attributes:
    actions:
      - key: service.version
        action: upsert
        value: ${DEPLOY_VERSION}
      - key: deployment.environment
        action: upsert
        value: ${ENV}
      - key: git.commit.sha
        action: upsert
        value: ${GIT_SHA}
      - key: feature.flags
        action: upsert
        value: ${FEATURE_FLAGS}
exporters:
  otlphttp:
    endpoint: https://otlp.example.com
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]

Change events as metrics (exporter or Pushgateway):
cat <<'EOF' | curl --data-binary @- http://pushgateway:9091/metrics/job/change_events
# TYPE deploy_event gauge
# HELP deploy_event A deploy happened (labeled) at this timestamp
deploy_event{service="payments",version="1.42.0",env="prod",commit="abc123",canary="true"} 1
EOF

With this, a single dashboard can slice by service/version/feature_flag and link metrics ↔ traces ↔ logs.
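Once deploy_event is scraped into the same Prometheus as your RED metrics, you can overlay changes directly in a query. A sketch, assuming deploy_event and the latency histogram share the service/version labels:

```promql
# p99 per version, restricted to versions with a recorded deploy event
histogram_quantile(
  0.99,
  sum by (le, service, version) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
and on (service, version)
max by (service, version) (deploy_event)
```

The `and on (...)` vector match is what consistent labels buy you: symptom and change join for free, no log spelunking.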
Predict and page before customers feel it
Use PromQL to compute leading indicators and burn rates. Two staples I’d ship on day one:
- Predict saturation with predict_linear to catch imminent queue explosions:

# Predict when queue depth will hit 1000 in the next 10m
max by (service) (
  predict_linear(workqueue_depth{service="ingestor"}[15m], 10 * 60)
) > 1000

- Fast+slow burn alerts to detect SLO violations early without flapping (Google SRE model):
# 2h fast burn (14x budget) + 24h slow burn (6x budget)
groups:
  - name: api-slo-burn
    rules:
      - alert: ApiErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="api"}[5m]))
          ) > (1 / (28 * 24)) # 14x over budget
        for: 10m
        labels:
          severity: page
          team: api
        annotations:
          summary: "API error budget burning fast"
      - alert: ApiErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{service="api"}[30m]))
          ) > (1 / (14 * 24)) # 6x over budget
        for: 2h
        labels:
          severity: page
          team: api
        annotations:
          summary: "API error budget burning slow"

Now tie symptoms to changes in the alert itself. Include the top correlated version/flag and a trace link.
# alertmanager.yaml
route:
  receiver: pagerduty
  group_by: [service]
  routes:
    - matchers:
        - severity="page"
      receiver: triage-bot
receivers:
  - name: triage-bot
    webhook_configs:
      - url: http://triage-bot.default.svc.cluster.local/alert
        send_resolved: true

Your triage-bot should enrich the alert with:
- last deploys/config changes/flag flips for the same service/env/region
- canary status (from Argo Rollouts/Flagger)
- top traces for the failing endpoint (via Tempo/Jaeger) using exemplars
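The enrichment step itself is small. A Python sketch of the core join — the alert and change-event shapes here are assumptions for illustration, not a fixed Alertmanager schema:

```python
from datetime import datetime, timedelta

def enrich_alert(alert, change_events, lookback_minutes=30):
    """Attach recent changes for the alert's service/env as suspects."""
    labels = alert["labels"]
    fired_at = datetime.fromisoformat(alert["startsAt"])
    cutoff = fired_at - timedelta(minutes=lookback_minutes)
    suspects = [
        ev for ev in change_events
        if ev["service"] == labels["service"]
        and ev["env"] == labels["env"]
        and datetime.fromisoformat(ev["at"]) >= cutoff
    ]
    # Newest change first: the most likely culprit for a fresh page.
    suspects.sort(key=lambda ev: ev["at"], reverse=True)
    return {**alert, "suspected_changes": suspects}
```

Bolt this onto the webhook handler Alertmanager posts to, then append canary status and exemplar trace links the same way.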
Stop guessing: compute correlation, don’t eyeball it
Here’s a pattern that works at scale:
- Window your symptom metric (e.g., p99_latency) and candidate causes (deploy events, retries, consumer lag) into the same 1m buckets.
- Compute cross-correlation and rank suspects by coefficient and time lead.
- Output a short “blame list” with confidence scores to the alert/runbook.
Example using ClickHouse to correlate p99 latency with changes and retry rate:
-- Symptom: p99 latency for service=checkout, bucketed per minute
CREATE TEMPORARY TABLE p99_by_min AS
SELECT toStartOfMinute(ts) AS t,
       quantileExact(0.99)(latency_ms) AS p99
FROM metrics_http_latency
WHERE service = 'checkout' AND env = 'prod' AND ts > now() - INTERVAL 2 HOUR
GROUP BY t
ORDER BY t;

-- Candidates: retry rate and deploy events aligned per minute
CREATE TEMPORARY TABLE causes_by_min AS
SELECT toStartOfMinute(ts) AS t,
       toFloat64(sum(retries)) AS retry_rate,
       toFloat64(maxIf(1, event = 'deploy')) AS deploy
FROM metrics_events
WHERE service = 'checkout' AND env = 'prod' AND ts > now() - INTERVAL 2 HOUR
GROUP BY t
ORDER BY t;

-- Correlate p99 with lagged causes (cause leading by 0-10 minutes)
SELECT cause,
       lag_minutes,
       corr(p99, value) AS corr_coeff
FROM (
    SELECT p.t AS t, p.p99 AS p99, c.retry_rate AS value,
           'retry_rate' AS cause, 0 AS lag_minutes
    FROM p99_by_min AS p
    INNER JOIN causes_by_min AS c USING (t)
    UNION ALL
    SELECT p.t, p.p99, c.value, 'deploy', c.lag_minutes
    FROM p99_by_min AS p
    INNER JOIN (
        -- shift each deploy forward 1-10 minutes so it lines up with
        -- the symptom it may have caused
        SELECT addMinutes(t, lag_minutes) AS t, deploy AS value, lag_minutes
        FROM causes_by_min
        CROSS JOIN (SELECT arrayJoin(range(1, 11)) AS lag_minutes)
    ) AS c USING (t)
)
GROUP BY cause, lag_minutes
ORDER BY corr_coeff DESC
LIMIT 5;

Return that list to the triage-bot so your page includes: “Strong correlation with deploy version 1.42.0 at t-4m; retry_rate spike at t-2m.” Now you’re not guessing.
For traces/logs, rely on trace_id propagation. In Grafana, exemplars on your latency histograms jump straight into Tempo traces. Loki example to pull error logs with trace IDs:
{app="checkout", level="error"} |= "payment" | json | line_format "{{.ts}} {{.msg}} trace={{.traceid}}"

Close the loop: tie correlation to rollout automation
Once you trust your signals, let them drive rollouts. Argo Rollouts and Flagger make this straightforward.
Argo Rollouts AnalysisTemplate with PromQL checks:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-analysis
spec:
  args:
    - name: version
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.45  # 450ms, histogram is in seconds
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout",version=~"{{args.version}}"}[2m])) by (le))
    - name: retry-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.02
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(grpc_client_retry_total{service="checkout",version=~"{{args.version}}"}[2m]))
            /
            sum(rate(http_requests_total{service="checkout",version=~"{{args.version}}"}[2m]))

Rollout referencing the analysis, with auto-abort:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
        - pause: {duration: 3m}
        - setWeight: 50
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest

If p99 or retry rate fails, Argo aborts and rolls back automatically. Same story with Flagger if you prefer operator style.
Kayenta (if you’re in Spinnaker/Managed Delivery land) can run multi-metric canary analysis; feed it the same leading indicators.
What “good” looks like (and what’ll bite you)
Results we’ve seen after teams wire this up with GitPlumbers:
- MTTR down 30–60% because alerts include suspects and traces.
- Change failure rate cut by 40–60% with canary auto-abort.
- 70% fewer false pages by swapping vanity metrics for leading indicators.
Pitfalls:
- Cardinality explosions from
user_id/request_idlabels. Keep high-cardinality data in traces/logs; use exemplars for jump links. - Overfitting thresholds to past incidents. Validate with holdout windows and chaos experiments.
- Noisy dependencies. If your payment provider is flaky, isolate with circuit breakers and budget per-dependency.
- Half-instrumented services. One missing
versionlabel ruins correlation across the whole path.
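Per-dependency isolation doesn’t require a framework. A minimal count-based breaker sketch in Python — thresholds and names are illustrative, not a specific library’s API:

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit one probe once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Keep one breaker per dependency (payments vs. search) so a single flaky provider can’t eat the whole error budget — and export its open/closed state as a metric, since breaker flips are themselves a leading indicator.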
Concrete next steps:
- Add version, commit, feature_flag, canary to all telemetry.
- Emit deploy/flag-flip events as metrics.
- Replace CPU and averages with p99, retries, lag, queue depth.
- Ship predict_linear and burn-rate alerts.
- Add Argo/Flagger analysis to your rollout pipeline.
- Stand up a trivial triage-bot to enrich alerts with correlation and trace links.
Key takeaways
- Instrument for leading indicators (saturation, queue depth, retries) and treat vanity metrics as noise.
- Use consistent dimensions across metrics, logs, traces, and change events: service, version, region, feature_flag, commit.
- Correlate symptoms to changes using time-aligned windows and labels; don’t eyeball it—compute it.
- Wire correlation into triage: alerts should include top suspected changes, relevant traces, and rollout status.
- Automate canary analysis and rollback using PromQL-backed AnalysisTemplates in Argo Rollouts or Flagger.
- Keep cardinality under control—only dimensions you triage by deserve to exist.
- Measure results in MTTR, change failure rate, and error-budget burn, not “dashboard completeness.”
Implementation checklist
- Define and standardize telemetry dimensions: service, version, region, env, feature_flag, commit, canary.
- Emit change events as metrics: deployment, config, and feature-flag flips with timestamps and labels.
- Create leading-indicator alerts: queue depth, retries, tail p99 latency, consumer lag, throttling.
- Use `predict_linear` and burn-rate alerts to catch incidents before customers do.
- Add triage automation: Alertmanager webhook that computes top-correlated changes and links to traces.
- Integrate rollout automation: Argo Rollouts AnalysisTemplates or Flagger with PromQL checks.
- Enforce cardinality budgets and sampling for traces/logs; propagate `trace_id` everywhere.
- Continuously review false positives/negatives and tighten queries and windows.
Questions we hear from teams
- How do I avoid blowing up Prometheus with labels like commit and version?
- Use stable labels (`service`, `version`, `env`, `region`, `canary`) and keep churn bounded. Avoid per-request/user IDs in metrics. Drop high-cardinality labels at the scrape or via relabeling, and store them in traces/logs instead. Enforce a label budget per team.
- Can I do this without Prometheus/Grafana?
- Yes. The same principles apply in Datadog, New Relic, Honeycomb, or Elastic. The key is consistent dimensions across telemetry types and the ability to query them together. Honeycomb’s traces-first model makes correlation fast; Datadog has deployment events and monitors you can hook to rollbacks.
- What about AI systems with non-deterministic behavior?
- Use leading indicators tailored to LLM workloads: token-per-second throughput, cache hit rate (prompt/embedding cache), upstream latency to model endpoints, and `429`/throttle rates. Canary on top-K user journeys and watch hallucination proxies (e.g., high answer entropy + low retrieval overlap) before full rollout.
- How do I justify the investment?
- Track MTTR, change failure rate, and error-budget burn before/after. Most teams recoup the effort in reduced outages and faster rollouts within a quarter. Tie it to business KPIs (conversion rate during deploys, ingestion freshness) to make it non-optional.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
