Kill the Chart Zoo: Dashboards That Make Decisions in 60 Seconds

Stop shipping slide decks to on-call. Build instruments that predict incidents and gate rollouts before customers feel pain.

Dashboards aren’t reports; they’re instruments. If a panel can’t trigger a decision, it doesn’t belong.

The dashboard that paged but didn’t decide

A few years back, I inherited a Grafana folder with 62 panels for a single payments service. Beautiful. Useless. When we got paged, the on-call would doomscroll through CPU, memory, fancy flame graphs, and—my favorite—“requests per color” from a prior A/B test. Meanwhile, the queue backed up, latencies spiked, and we debated whether to roll back. We didn’t lack telemetry; we lacked a dashboard that told us what to do.

Here’s what actually works: fewer charts, clearer thresholds, and a straight path from metric to action. If a panel can’t trigger a decision—triage route, rollout gate, or customer comms—it doesn’t belong.

Lead with leading indicators (predict pain, don’t describe it)

Vanity metrics (CPU, request count) are fine for capacity planning but late for incident prediction. Put predictive, causal signals at the top. The patterns below have saved my bacon more times than I care to admit:

  • Queues/backlog and flow control
    • kafka_consumergroup_lag rising faster than consumption → impending outage
    • work_queue_depth or Redis llen growth with flat throughput → saturation
  • Concurrency and saturation
    • Thread/connection pool saturation: pool_in_use / pool_capacity > 0.9
    • Go: go_goroutines + process_open_fds trending near limits
    • JVM: GC pause p95 (from jvm_gc_pause_seconds) above 200ms → tail latencies next
  • Network and retries
    • TCP retransmits and 5xx retry storms: sum(rate(http_requests_total{code=~"5.."}[5m])) by (route) with retry_total
    • mTLS handshake errors (Istio/Linkerd) bump before full-blown failures
  • Storage pressure
    • DB connection pool saturation and slow queries p95 rising with QPS flat → impending lock contention
    • Disk I/O queue depth (EBS VolumeQueueLength) climbing past its steady-state baseline → write tail imminent
  • Autoscaling early warnings
    • HPA pending pods > 0, container_cpu_cfs_throttled_seconds_total rising → scale-up too slow
  • Circuit breakers and timeouts
    • Open circuit ratio > 2% for a dependency → you’re about to cascade
  • AI/LLM-specific (yes, it matters now)
    • Model confidence drops while token/sec flat → early drift or prompt regression
    • Cache miss rate up with same traffic → embeddings index staleness, cost spike next

PromQL examples that actually predict pain:

# Backlog trending up over the last 10 minutes (early warning)
(max_over_time(work_queue_depth[10m]) - min_over_time(work_queue_depth[10m])) > 0
  and deriv(work_queue_depth[10m]) > 0  # deriv, not rate: work_queue_depth is a gauge

# Thread pool saturation (JVM)
max(thread_pool_active_threads) / max(thread_pool_max_threads) > 0.9

# Burn-rate (fast) for an availability SLO
(
  sum(rate(http_requests_total{code=~"5.."}[5m])) /
  sum(rate(http_requests_total[5m]))
) > (14.4 * (1 - 0.999))  # 99.9% SLO; 14.4x burn exhausts a 30-day budget in ~2 days
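
The same pattern covers the Kafka and JVM bullets above. The metric names here assume kafka_exporter and Micrometer conventions (and that GC pauses are exported with histogram buckets); swap in whatever your exporters actually emit.

# Consumer lag rising while consumption stalls (assumes kafka_exporter metrics)
sum(delta(kafka_consumergroup_lag[10m])) by (consumergroup) > 0
  and sum(deriv(kafka_consumergroup_current_offset[10m])) by (consumergroup) <= 0

# GC pause p95 past 200ms (assumes jvm_gc_pause_seconds histogram buckets)
histogram_quantile(0.95, sum(rate(jvm_gc_pause_seconds_bucket[5m])) by (le)) > 0.2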

Make thresholds obvious (SLOs, not vibes)

If a human has to “eyeball” a chart to decide if it’s bad, you’ve already lost minutes. Use explicit, color-coded thresholds tied to SLOs and hard limits.

  • Tie panels to SLO burn rate and saturation thresholds; label them like incidents, not sensors.
  • Use red/yellow/green; show the actual numeric threshold on the panel (not just a line).
  • Show both fast and slow burn rates to catch spikes and simmering fires.

Prometheus rules that trigger human action:

# alerting-rules.yaml
groups:
- name: slo-burn
  rules:
  - alert: APIAvailabilityFastBurn
    expr: |
      (
        sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{job="api"}[5m]))
      ) > (0.001 * 14.4)  # 99.9% SLO error budget fast burn
    for: 2m
    labels:
      severity: page
      team: payments
    annotations:
      summary: "Fast burn on API availability"
      runbook: "https://runbooks.company.com/api-fast-burn"

  - alert: ThreadPoolSaturation
    expr: |
      max_over_time(thread_pool_active_threads[5m])
        / max_over_time(thread_pool_max_threads[5m]) > 0.9
    for: 5m
    labels:
      severity: warn
    annotations:
      summary: "Thread pool saturation >90%"
      runbook: "https://runbooks.company.com/thread-pool"

Grafana panel thresholds that are unmistakable at 3am:

{
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.5 },
          { "color": "red", "value": 0.9 }
        ]
      },
      "unit": "percent"
    }
  },
  "options": { "reduceOptions": { "calcs": ["max"] } }
}

If you can’t defend the threshold in terms of SLOs, limits, or cost, it’s not a threshold—it’s décor.

From page to decision in 60 seconds

The path must be obvious:

  1. Alert fires with a clear summary and a runbook link.
  2. On-call opens the triage dashboard—with 8–12 panels ordered by predictiveness.
  3. One click into logs and traces; one click to rollback or flip a flag.

Practical wiring that works:

  • Alertmanager to PagerDuty/Incident.io with direct links to the dashboard and runbook.
  • Grafana data links to Loki/Tempo with traceID/labels pre-wired.
  • Buttons or slash commands for rollbacks or feature flags.

Alertmanager route with pre-linked dashboard:

route:
  receiver: pagerduty
  group_by: [team]
  routes:
  - matchers: [team="payments", severity="page"]
    receiver: pagerduty
    continue: false
receivers:
- name: pagerduty
  pagerduty_configs:
  - routing_key: ${PD_ROUTING_KEY}
    description: |-
      {{ .CommonAnnotations.summary }}
      Dashboard: https://grafana.company.com/d/triage-payments
      Runbook: {{ .CommonAnnotations.runbook }}
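
For the “one click into logs and traces” step, put data links on the triage panels themselves. The snippet below uses Grafana’s field-config links; the Explore URLs are placeholders (the exact encoding varies by Grafana version), and the route/trace_id labels are assumptions about your own label scheme.

{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "Logs for this route (Loki)",
          "url": "https://grafana.company.com/explore?<loki-query-scoped-to-${__field.labels.route}>"
        },
        {
          "title": "Trace sample (Tempo)",
          "url": "https://grafana.company.com/explore?<tempo-query-for-${__field.labels.trace_id}>"
        }
      ]
    }
  }
}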

Pro tip: precompute triage context. For example, join deployment and version labels from Kubernetes so every panel shows which version is implicated. On-call shouldn’t have to play “guess the SHA.”
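
One way to do that join, assuming kube-state-metrics exposes kube_pod_labels with a label_version label (newer versions need the pod-label allowlist enabled) and that your request metrics carry a pod label:

# Error rate attributed to the deployed version, not just the pod
sum by (label_version) (
  rate(http_requests_total{job="api",code=~"5.."}[5m])
  * on (pod) group_left (label_version)
  kube_pod_labels{namespace="payments"}
)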

Let metrics babysit your rollouts (Argo/Flagger/LD)

I’ve seen too many teams “canary” by watching Slack. Automate the judgment using the same leading indicators from your dashboard.

Argo Rollouts with AnalysisTemplate that blocks on p99 and 5xx:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-canary
spec:
  args:
  - name: newVersion
  metrics:
  - name: p99-latency
    interval: 1m
    successCondition: result[0] < 300
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api",version="{{args.newVersion}}"}[1m])) by (le)) * 1000
  - name: error-rate
    interval: 1m
    successCondition: result[0] < 0.005
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{job="api",version="{{args.newVersion}}",code=~"5.."}[1m]))
          /
          sum(rate(http_requests_total{job="api",version="{{args.newVersion}}"}[1m]))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - analysis:
          templates:
          - templateName: api-canary
          args:
          - name: newVersion
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['rollouts-pod-template-hash']
      - setWeight: 25
      - pause: {duration: 120}
      - setWeight: 50
      - analysis:
          templates:
          - templateName: api-canary
          args:
          - name: newVersion
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['rollouts-pod-template-hash']

Prefer Flagger? Same idea. And for feature flags (LaunchDarkly), gate the ramp with a webhook that checks the same Prometheus queries. The point: your rollout shouldn’t depend on who’s watching; it should depend on metrics that predict customer pain.
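
If Flagger is your tool, the same gate is a MetricTemplate referenced from the Canary’s analysis. A minimal sketch reusing the error-rate query (names and namespaces are illustrative); reference it from the Canary under analysis.metrics with a thresholdRange of max: 0.005 and a 1m interval.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: api-error-rate
  namespace: monitoring
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    sum(rate(http_requests_total{job="api",code=~"5.."}[1m]))
    /
    sum(rate(http_requests_total{job="api"}[1m]))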

A reference stack that’s boring and works

You don’t need shiny. You need glue that’s stable.

  • Ingest: OpenTelemetry Collector → Prometheus remote-write for metrics; Loki for logs; Tempo for traces.
  • Store/Query: Prometheus 2.49, Loki 2.9, Tempo 2.5. Keep retention sane (7–14 days hot; 30–90 days cold).
  • Visualize: Grafana 10 with folders per service and a “Triage” tag for on-call.
  • Alert: Alertmanager with routes per team and runbook annotations mandatory.
  • Automate: Argo Rollouts or Flagger; LaunchDarkly for flags; GitOps via ArgoCD.

OpenTelemetry Collector snippet that standardizes labels so dashboards and alerts align:

receivers:
  otlp:
    protocols: {http: {}, grpc: {}}
processors:
  k8sattributes:
    extract:
      labels:
      - key: app
      - key: version
  resource:
    attributes:
    - action: upsert
      key: service.namespace
      value: payments
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus.monitoring:9090/api/v1/write
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, resource]
      exporters: [prometheusremotewrite]
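
That pipeline only covers metrics; logs and traces follow the same shape. A sketch to merge into the config above, assuming the contrib Collector build (for the loki exporter) and Tempo’s OTLP gRPC endpoint:

exporters:
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo.monitoring:4317
    tls: {insecure: true}
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [k8sattributes, resource]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [k8sattributes, resource]
      exporters: [otlp/tempo]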

Provision triage dashboards via Git rather than clicking around in prod. Terraform + Grafana provider works fine. Require PRs to state the decision each panel supports.

resource "grafana_dashboard" "payments_triage" {
  config_json = file("dashboards/payments-triage.json")
  folder      = grafana_folder.payments.id
}

Results we’ve seen at GitPlumbers after doing just this hygiene:

  • 30–50% fewer pages within 30 days (retry storms stopped earlier)
  • MTTR down 40% because decisions got obvious
  • 70% of rollbacks happen automatically during canaries, not after customers file tickets

Stop the chart sprawl (governance that isn’t theater)

Dashboards rot without guardrails. Treat them like code.

  • Hard cap: 8–12 panels on the triage dashboard; everything else goes to “Deep Dive.”
  • Every panel must map to one: predict incident, route triage, or authorize rollout. No mapping? Delete.
  • Quarterly review: remove panels not viewed in 90 days unless tied to a runbook.
  • Dashboard linting in CI: fail PRs without thresholds or units, or with PromQL that lacks label filters.
  • Home page shows SLOs, error budgets, and burn rates. No vanity.

A simple CI guard (yes, crude; yes, helpful):

#!/usr/bin/env bash
set -euo pipefail
for f in dashboards/*.json; do
  jq -e '[.panels[] | select(.fieldConfig.defaults.thresholds == null)] | length == 0' "$f" >/dev/null \
    || { echo "Panels missing thresholds in $f"; exit 1; }
done
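
The other lint from the list above (PromQL without label filters) can start just as crude. This pass assumes queries live under panels[].targets[].expr and only flags expressions with no {...} selector at all:

#!/usr/bin/env bash
set -euo pipefail
for f in dashboards/*.json; do
  jq -e '[.panels[].targets[]?.expr // empty | select(contains("{") | not)] | length == 0' "$f" >/dev/null \
    || { echo "Unscoped PromQL (no label filters) in $f"; exit 1; }
done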

You’ll be shocked how quickly clarity returns when you treat dashboards as instruments, not art.

Key takeaways

  • Dashboards are for decisions, not decoration; cap panels and anchor them to SLOs.
  • Track leading indicators (saturation, backlog growth, retries) that predict incidents hours before customers notice.
  • Use explicit red/yellow/green thresholds tied to error budgets, not gut feels.
  • Make the page-to-decision path 60 seconds: from alert to dashboard to runbook to action.
  • Automate rollouts with metric gates (Argo Rollouts/Flagger) so bad canaries auto-rollback.
  • Manage dashboards as code and kill chart sprawl with review standards.

Implementation checklist

  • Link every panel to an SLO or triage decision; delete orphan panels.
  • Pin 8–12 panels per service; order by predictiveness, not prettiness.
  • Use burn-rate alerts and saturation thresholds with explicit colors and values.
  • One-click from panel to logs and traces (Grafana + Loki/Tempo).
  • Wire Prometheus queries into rollout analysis (Argo/Flagger) to auto-rollback.
  • Review dashboards quarterly; remove panels not viewed or not tied to incidents.
  • Provision dashboards via GitOps; PRs must state decision aided by each panel.

Questions we hear from teams

How many dashboards per service is sane?
Two. A triage dashboard (8–12 panels max, ordered by leading indicators and SLOs) and a deep-dive dashboard for specialists. Everything else is a report, not a dashboard.
What if we don’t have SLOs yet?
Start with rough thresholds pegged to hard limits (pool size, queue depth, GC pause). Then set SLOs based on historical performance and business impact. Wire burn-rate alerts once you know your error budget.
We’re not on Kubernetes—does this still apply?
Yes. The indicators don’t care: backlog growth, retries, connection saturation, and burn-rate translate to VMs and functions just fine. Swap HPA for ASG pending capacity and ELB 5xx for ingress 5xx.
How do we handle AI/LLM services?
Track model confidence, token/sec, cache hit rate, and cost per request. Alert on confidence dips with stable traffic and on sudden cost spikes. Gate prompt/weight rollouts with the same Argo/Flagger metrics you use for APIs.
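
A hedged sketch of the “confidence dips with stable traffic” alert, with purely illustrative metric names:

# Confidence trending down while request rate stays roughly flat
deriv(avg(model_confidence_score)[30m:1m]) < 0
  and abs(deriv(sum(rate(llm_requests_total[5m]))[30m:1m])) < 0.01
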
What about traces—aren’t they leading indicators?
Traces are fantastic for attribution and pathologies, but they’re heavy as leading indicators. Use metrics to predict and traces to explain. Link directly from the panel to the representative trace sample.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a Triage Dashboard Workout (2 weeks) See how we wire rollouts to metrics
