Correlation That Saves Your On-Call: Turning Symptoms into Root Cause (and Automated Rollbacks)
Forget dashboards that look pretty but tell you nothing. Here’s how to design correlation that predicts incidents, ties telemetry to triage, and drives automated rollouts and rollbacks.
Correlation isn’t a dashboard feature. It’s the contract your telemetry makes with your rollback button.
The night your “green” dashboards lied
We had a retailer whose checkout “looked fine” in Grafana while customers rage-refreshed. CPU green, RPS stable, p95 only slightly up. But five minutes before the incident, grpc_client_retry_total to the payment gateway spiked and Kafka consumer lag started creeping. That was the real signal. The correlation wasn’t visible because retries, lag, and version labels weren’t on the same playing field. We wired them up, built predictive alerts, and taught the rollout system to back off on its own. MTTR went from hours to minutes, and change failure rate halved.
If your telemetry can’t connect symptoms to the changes that caused them, it’s theater, not observability.
Measure what predicts incidents, not what flatters slides
Vanity metrics: raw CPU, total request count, “up” status, average latency. None of these predict pain.
Leading indicators that actually move the needle:
- Saturation: workqueue_depth, thread-pool/concurrency utilization, nginx_active_connections, DB connection pool usage.
- Backpressure: retry rates (grpc_client_retry_total, http_client_retry_total), 503/429 rates, circuit breaker open counts.
- Tail latency: p99 and p99.9, especially on dependencies; histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])).
- Queues and lag: Kafka consumer lag, SQS queue depth, Redis stream lag.
- Cache health: miss ratio, eviction rate.
- GC/runtime headroom: JVM GC pause time, Go sched_latency_seconds > 0.5s bursts.
- Network/DNS: TCP retransmits, TLS handshake time, DNS resolution latency.
- Infra pressure: pending pods, node_pressure signals, disk I/O wait.
Tie these to business SLOs (checkout availability, ingestion freshness). Alert on burn rate and predictive saturation, not raw counts.
# p99 frontend latency (5m) for service/version
histogram_quantile(
  0.99,
  sum by (le, service, version) (
    rate(http_request_duration_seconds_bucket{service="web"}[5m])
  )
)

Design your data so correlation is trivial
I’ve seen teams spend fortunes on “observability platforms” and still fail because dimensions don’t line up. You can’t correlate what you don’t consistently label.
Do this first:
- Standardize resource attributes across metrics, logs, and traces: service, version, env, region, commit, feature_flag, canary (true/false), k8s_namespace, k8s_pod.
- Propagate trace_id everywhere. Add it to log lines and emit exemplars on metrics.
- Emit change events as metrics. Expose “deploy/config/flag flip” as time series you can join by label.
- Cardinality budgets. Only add labels you’d actually page or route on.
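A cardinality budget only sticks if it is enforced, not just documented. A minimal pre-merge check in Python, sketched under assumptions — the allowlist contents and sample series are illustrative, not a standard:

```python
# Hypothetical per-team label allowlist: only dimensions you page or route on.
ALLOWED_LABELS = {
    "service", "version", "env", "region", "commit",
    "feature_flag", "canary", "k8s_namespace", "k8s_pod",
}

def label_violations(series_labels: dict) -> list:
    """Return label names that bust the budget (anything off the allowlist)."""
    return sorted(k for k in series_labels if k not in ALLOWED_LABELS)

# user_id belongs in traces/logs, never in metric labels
violations = label_violations({"service": "web", "version": "1.42.0", "user_id": "u-9"})
```

Run it in CI against your instrumentation manifests so a rogue label is a failed build, not a melted Prometheus.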
OpenTelemetry resource config example:
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http: {}
      grpc: {}
processors:
  batch: {}
  attributes:
    actions:
      - key: service.version
        action: upsert
        value: ${DEPLOY_VERSION}
      - key: deployment.environment
        action: upsert
        value: ${ENV}
      - key: git.commit.sha
        action: upsert
        value: ${GIT_SHA}
      - key: feature.flags
        action: upsert
        value: ${FEATURE_FLAGS}
exporters:
  otlphttp:
    endpoint: https://otlp.example.com
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]

Change events as metrics (exporter or Pushgateway):
cat <<'EOF' | curl --data-binary @- http://pushgateway:9091/metrics/job/change_events
# TYPE deploy_event gauge
# HELP deploy_event A deploy happened (labeled) at this timestamp
deploy_event{service="payments",version="1.42.0",env="prod",commit="abc123",canary="true"} 1
EOF

With this, a single dashboard can slice by service/version/feature_flag and link metrics ↔ traces ↔ logs.
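Once deploy_event is scraped into the same Prometheus as your RED metrics, you can overlay changes directly in a query. A sketch, assuming deploy_event and the latency histogram share the service/version labels:

```promql
# p99 per version, restricted to versions with a recorded deploy event
histogram_quantile(
  0.99,
  sum by (le, service, version) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
and on (service, version)
max by (service, version) (deploy_event)
```

The `and on (...)` vector match is what consistent labels buy you: symptom and change join for free, no log spelunking.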
Predict and page before customers feel it
Use PromQL to compute leading indicators and burn rates. Two staples I’d ship on day one:
- Predict saturation with predict_linear to catch imminent queue explosions:

# Predict when queue depth will hit 1000 in the next 10m
max by (service) (
  predict_linear(workqueue_depth{service="ingestor"}[15m], 10 * 60)
) > 1000

- Fast+slow burn alerts to detect SLO violations early without flapping (Google SRE model):
# 2h fast burn (14x budget) + 24h slow burn (6x budget)
groups:
  - name: api-slo-burn
    rules:
      - alert: ApiErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="api"}[5m]))
          ) > (1 / (28 * 24)) # 14x over budget
        for: 10m
        labels:
          severity: page
          team: api
        annotations:
          summary: "API error budget burning fast"
      - alert: ApiErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{service="api",code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{service="api"}[30m]))
          ) > (1 / (14 * 24)) # 6x over budget
        for: 2h
        labels:
          severity: page
          team: api
        annotations:
          summary: "API error budget burning slow"

Now tie symptoms to changes in the alert itself. Include the top correlated version/flag and a trace link.
# alertmanager.yaml
route:
  receiver: pagerduty
  group_by: [service]
  routes:
    - matchers:
        - severity="page"
      receiver: triage-bot
receivers:
  - name: triage-bot
    webhook_configs:
      - url: http://triage-bot.default.svc.cluster.local/alert
        send_resolved: true

Your triage-bot should enrich the alert with:
- last deploys/config changes/flag flips for the same service/env/region
- canary status (from Argo Rollouts/Flagger)
- top traces for the failing endpoint (via Tempo/Jaeger) using exemplars
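The enrichment step itself is small. A Python sketch of the core join — the alert and change-event shapes here are assumptions for illustration, not a fixed Alertmanager schema:

```python
from datetime import datetime, timedelta

def enrich_alert(alert, change_events, lookback_minutes=30):
    """Attach recent changes for the alert's service/env as suspects."""
    labels = alert["labels"]
    fired_at = datetime.fromisoformat(alert["startsAt"])
    cutoff = fired_at - timedelta(minutes=lookback_minutes)
    suspects = [
        ev for ev in change_events
        if ev["service"] == labels["service"]
        and ev["env"] == labels["env"]
        and datetime.fromisoformat(ev["at"]) >= cutoff
    ]
    # Newest change first: the most likely culprit for a fresh page.
    suspects.sort(key=lambda ev: ev["at"], reverse=True)
    return {**alert, "suspected_changes": suspects}
```

Bolt this onto the webhook handler Alertmanager posts to, then append canary status and exemplar trace links the same way.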
Stop guessing: compute correlation, don’t eyeball it
Here’s a pattern that works at scale:
- Window your symptom metric (e.g., p99_latency) and candidate causes (deploy events, retries, consumer lag) into the same 1m buckets.
- Compute cross-correlation and rank suspects by coefficient and time lead.
- Output a short “blame list” with confidence scores to the alert/runbook.
Example using ClickHouse to correlate p99 latency with changes and retry rate:
-- Symptom: p99 latency for service=checkout, bucketed per minute
CREATE TEMPORARY TABLE p99_by_min AS
SELECT toStartOfMinute(ts) AS t,
       quantileExact(0.99)(latency_ms) AS p99
FROM metrics_http_latency
WHERE service = 'checkout' AND env = 'prod' AND ts > now() - INTERVAL 2 HOUR
GROUP BY t
ORDER BY t;

-- Candidates: retry rate and deploy events aligned per minute
CREATE TEMPORARY TABLE causes_by_min AS
SELECT toStartOfMinute(ts) AS t,
       toFloat64(sum(retries)) AS retry_rate,
       toFloat64(maxIf(1, event = 'deploy')) AS deploy
FROM metrics_events
WHERE service = 'checkout' AND env = 'prod' AND ts > now() - INTERVAL 2 HOUR
GROUP BY t
ORDER BY t;

-- Correlate p99 with lagged causes (cause leading by 0-10 minutes)
SELECT cause,
       lag_minutes,
       corr(p99, value) AS corr_coeff
FROM (
    SELECT p.t AS t, p.p99 AS p99, c.retry_rate AS value,
           'retry_rate' AS cause, 0 AS lag_minutes
    FROM p99_by_min AS p
    INNER JOIN causes_by_min AS c USING (t)
    UNION ALL
    SELECT p.t, p.p99, c.value, 'deploy', c.lag_minutes
    FROM p99_by_min AS p
    INNER JOIN (
        -- shift each deploy forward 1-10 minutes so it lines up with
        -- the symptom it may have caused
        SELECT addMinutes(t, lag_minutes) AS t, deploy AS value, lag_minutes
        FROM causes_by_min
        CROSS JOIN (SELECT arrayJoin(range(1, 11)) AS lag_minutes)
    ) AS c USING (t)
)
GROUP BY cause, lag_minutes
ORDER BY corr_coeff DESC
LIMIT 5;

Return that list to the triage-bot so your page includes: “Strong correlation with deploy version 1.42.0 at t-4m; retry_rate spike at t-2m.” Now you’re not guessing.
For traces/logs, rely on trace_id propagation. In Grafana, exemplars on your latency histograms jump straight into Tempo traces. Loki example to pull error logs with trace IDs:
{app="checkout", level="error"} |= "payment" | json | line_format "{{.ts}} {{.msg}} trace={{.traceid}}"

Close the loop: tie correlation to rollout automation
Once you trust your signals, let them drive rollouts. Argo Rollouts and Flagger make this straightforward.
Argo Rollouts AnalysisTemplate with PromQL checks:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-analysis
spec:
  args:
    - name: version
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.45  # 450ms, histogram is in seconds
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout",version=~"{{args.version}}"}[2m])) by (le))
    - name: retry-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.02
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(grpc_client_retry_total{service="checkout",version=~"{{args.version}}"}[2m]))
            /
            sum(rate(http_requests_total{service="checkout",version=~"{{args.version}}"}[2m]))

Rollout referencing the analysis, with auto-abort:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
        - pause: {duration: 3m}
        - setWeight: 50
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest

If p99 or retry rate fails, Argo aborts and rolls back automatically. Same story with Flagger if you prefer operator style.
Kayenta (if you’re in Spinnaker/Managed Delivery land) can run multi-metric canary analysis; feed it the same leading indicators.
What “good” looks like (and what’ll bite you)
Results we’ve seen after teams wire this up with GitPlumbers:
- MTTR down 30–60% because alerts include suspects and traces.
- Change failure rate cut by 40–60% with canary auto-abort.
- 70% fewer false pages by swapping vanity metrics for leading indicators.
Pitfalls:
- Cardinality explosions from
user_id/request_idlabels. Keep high-cardinality data in traces/logs; use exemplars for jump links. - Overfitting thresholds to past incidents. Validate with holdout windows and chaos experiments.
- Noisy dependencies. If your payment provider is flaky, isolate with circuit breakers and budget per-dependency.
- Half-instrumented services. One missing
versionlabel ruins correlation across the whole path.
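Per-dependency isolation doesn’t require a framework. A minimal count-based breaker sketch in Python — thresholds and names are illustrative, not a specific library’s API:

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a probe after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit one probe once the cool-down has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Keep one breaker per dependency (payments vs. search) so a single flaky provider can’t eat the whole error budget — and export its open/closed state as a metric, since breaker flips are themselves a leading indicator.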
Concrete next steps:
- Add version, commit, feature_flag, canary to all telemetry.
- Emit deploy/flag-flip events as metrics.
- Replace CPU and averages with p99, retries, lag, queue depth.
- Ship predict_linear and burn-rate alerts.
- Add Argo/Flagger analysis to your rollout pipeline.
- Stand up a trivial triage-bot to enrich alerts with correlation and trace links.
Key takeaways
- Instrument for leading indicators (saturation, queue depth, retries) and treat vanity metrics as noise.
- Use consistent dimensions across metrics, logs, traces, and change events: service, version, region, feature_flag, commit.
- Correlate symptoms to changes using time-aligned windows and labels; don’t eyeball it—compute it.
- Wire correlation into triage: alerts should include top suspected changes, relevant traces, and rollout status.
- Automate canary analysis and rollback using PromQL-backed AnalysisTemplates in Argo Rollouts or Flagger.
- Keep cardinality under control—only dimensions you triage by deserve to exist.
- Measure results in MTTR, change failure rate, and error-budget burn, not “dashboard completeness.”
Implementation checklist
- Define and standardize telemetry dimensions: service, version, region, env, feature_flag, commit, canary.
- Emit change events as metrics: deployment, config, and feature-flag flips with timestamps and labels.
- Create leading-indicator alerts: queue depth, retries, tail p99 latency, consumer lag, throttling.
- Use `predict_linear` and burn-rate alerts to catch incidents before customers do.
- Add triage automation: Alertmanager webhook that computes top-correlated changes and links to traces.
- Integrate rollout automation: Argo Rollouts AnalysisTemplates or Flagger with PromQL checks.
- Enforce cardinality budgets and sampling for traces/logs; propagate `trace_id` everywhere.
- Continuously review false positives/negatives and tighten queries and windows.
Questions we hear from teams
- How do I avoid blowing up Prometheus with labels like commit and version?
- Use stable labels (`service`, `version`, `env`, `region`, `canary`) and keep churn bounded. Avoid per-request/user IDs in metrics. Drop high-cardinality labels at the scrape or via relabeling, and store them in traces/logs instead. Enforce a label budget per team.
- Can I do this without Prometheus/Grafana?
- Yes. The same principles apply in Datadog, New Relic, Honeycomb, or Elastic. The key is consistent dimensions across telemetry types and the ability to query them together. Honeycomb’s traces-first model makes correlation fast; Datadog has deployment events and monitors you can hook to rollbacks.
- What about AI systems with non-deterministic behavior?
- Use leading indicators tailored to LLM workloads: token-per-second throughput, cache hit rate (prompt/embedding cache), upstream latency to model endpoints, and `429`/throttle rates. Canary on top-K user journeys and watch hallucination proxies (e.g., high answer entropy + low retrieval overlap) before full rollout.
- How do I justify the investment?
- Track MTTR, change failure rate, and error-budget burn before/after. Most teams recoup the effort in reduced outages and faster rollouts within a quarter. Tie it to business KPIs (conversion rate during deploys, ingestion freshness) to make it non-optional.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
