The Dashboard Diet: Fewer Charts, Clear Thresholds, Faster Decisions

Turn your dashboards from museum walls into decision systems. Cut the noise, surface leading indicators, and wire telemetry to triage and rollouts.

Dashboards should make decisions obvious. If on-call still has to guess, it’s not a dashboard—it’s décor.

Your NOC Wall Is Not a Decision System

I’ve stood in front of the 12-screen Grafana wall during a real incident. The lights were pretty. None of it helped me decide what to do next. We still dug through kubectl logs while the status page burned.

The problem isn’t tooling—Prometheus, Grafana, OpenTelemetry are fine. The problem is intent. Most teams build museum dashboards: lots of charts, no decisions. On-call needs the opposite: fewer charts, clearer thresholds, faster decisions.

If a panel doesn’t change what you’ll do in the next 5 minutes, it doesn’t belong on the on-call dashboard.

At GitPlumbers, the dashboards that actually reduce MTTR have three traits: they highlight leading indicators, they bake in explicit thresholds, and they connect telemetry to triage and rollouts.

The Minimum Viable Dashboard: 8 Charts That Predict Pain

For each critical service, keep it to 6–8 panels that predict incidents. Use RED (rate, errors, duration) for request-driven services and USE (utilization, saturation, errors) for infrastructure. No averages; use percentiles and saturation.

  • SLO burn rate: Are we burning the error budget now? Fast predictor of paging and customer pain.

  • Tail latency (p95/p99): Latency spikes precede errors; track both overall and by dependency if possible.

  • Error rate (5xx or non-2xx): Break out by route or RPC method.

  • Saturation: Queue depth, thread pool queue length, container_cpu_cfs_throttled_seconds_total rate, node_filesystem_avail_bytes headroom, or DB connection pool utilization.

  • Retries/timeouts: http_client_request_retries_total, grpc_client_handled_total{grpc_code!="OK"}; retries hide failures until they don’t.

  • Dependency health: Downstream SLI (e.g., DB p99, Kafka consumer lag, third-party API error rate).

  • Release annotations: Every deploy, config change, or feature-flag flip visible on the timeline.

  • Traffic shape: Request rate and payload size distribution to catch sudden load changes.

In practice, this looks like a single Grafana dashboard per service labeled “On-Call.” Everything else (infra deep dives, JVM tuning, cache hit ratios) lives on a separate “Explore” dashboard.
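
Behind those panels, the queries stay simple. A minimal PromQL sketch, assuming a standard Prometheus HTTP histogram (http_request_duration_seconds) and a route label; adjust metric and label names to your instrumentation:

# Tail latency: p99 per route
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le, route))

# Error rate per route
sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) by (route)
  / sum(rate(http_requests_total{job="api"}[5m])) by (route)

# Traffic shape: request rate per route
sum(rate(http_requests_total{job="api"}[5m])) by (route)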

Thresholds That Drive Action: SLOs, Burn Rates, and Saturation

“Green is good” is not a threshold. Define SLOs first, then derive alerts and panel thresholds from them.

  • Publish SLOs at the top: “Availability SLO: 99.9%; Latency SLO: p99 < 400ms.”

  • Use multi-window, multi-burn-rate alerts (from the Google SRE Workbook) to capture both fast and slow burns.

  • Mark actionable lines on graphs: “Above this line, page.”

Example Prometheus recording rules for burn rate (99.9% availability SLO, i.e. a 0.1% error budget) over 5m, 1h, and 6h windows:

groups:
- name: api-slo-burn
  rules:
  # Error ratio per window
  - record: api:error_ratio:5m
    expr: sum(rate(http_requests_total{job="api",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m]))
  - record: api:error_ratio:1h
    expr: sum(rate(http_requests_total{job="api",status=~"5.."}[1h])) / sum(rate(http_requests_total{job="api"}[1h]))
  - record: api:error_ratio:6h
    expr: sum(rate(http_requests_total{job="api",status=~"5.."}[6h])) / sum(rate(http_requests_total{job="api"}[6h]))
  # Burn rate = error ratio / error budget (0.001 for a 99.9% SLO)
  - record: api:burn_rate:5m
    expr: api:error_ratio:5m / 0.001
  - record: api:burn_rate:1h
    expr: api:error_ratio:1h / 0.001
  - record: api:burn_rate:6h
    expr: api:error_ratio:6h / 0.001

Alerting rules that page only when it matters:

- alert: APIErrorBudgetBurnFast
  expr: api:burn_rate:5m > 14.4 and api:burn_rate:1h > 14.4
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Fast burn: API error budget burning >14.4x"

- alert: APIErrorBudgetBurnSlow
  expr: api:burn_rate:1h > 6 and api:burn_rate:6h > 6
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: "Slow burn: investigate during business hours"

Saturation predicts failure earlier than CPU averages. A few PromQL examples we harden for clients:

# CPU throttling ratio (share of CFS periods throttled, per pod)
sum(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (pod)
  / sum(rate(container_cpu_cfs_periods_total{container!=""}[5m])) by (pod)

# JVM GC pause p99 (if exported)
histogram_quantile(0.99, sum(rate(jvm_gc_pause_seconds_bucket[5m])) by (le))

# Kafka consumer lag (per group/topic)
max(kafka_consumergroup_lag{consumergroup=~"payments.*"}) by (topic)

These lines go on the graphs. When the line is crossed, the runbook says exactly what to do next.

Tie Dashboards to Triage: From Panel to Pager in Two Clicks

A good on-call dashboard is a triage flow, not a postcard. Every panel should link to the next question.

  • Drill-downs: From service p99 -> trace exemplars in Tempo/Jaeger. From error rate -> Loki logs filtered by route and trace_id.

  • Runbook links: Panel description links to a specific section anchor, not a wiki home page.

  • Context overlays: Grafana annotations for deploys, config changes, feature flags, and incidents.

  • Ownership: Each panel lists an owner Slack channel and escalation policy (PagerDuty service).

Here’s how we wire a latency panel to traces using exemplars and OpenTelemetry context propagation. In Grafana, that means telling the Prometheus data source where exemplar trace IDs should link:

# Grafana data source provisioning: exemplar trace IDs link to Tempo
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus.monitoring:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label carrying the trace ID
          datasourceUid: tempo  # UID of the Tempo data source
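
For the error-rate drill-down, similar provisioning turns a trace_id in Loki log lines into a one-click jump to Tempo. A sketch; the label name, regex, and Loki address are assumptions to match your log format and cluster:

# Grafana data source provisioning: Loki log lines link to Tempo traces
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki.monitoring:3100      # assumed Loki address
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'  # assumes logs contain trace_id=<id>
          url: '$${__value.raw}'          # query passed to the traces data source
          datasourceUid: tempo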

The point: in two clicks, on-call should move from “it’s red” to “this downstream call is timing out; roll back the last change.”

Wire Telemetry to Rollouts: Canaries, Flags, and Auto-Rollback

Dashboards that don’t change deploy behavior are theater. Tie metrics to rollout automation so the system protects itself.

  • Canary gating with Argo Rollouts or Flagger and Prometheus queries.

  • Progressive delivery with Istio/Linkerd traffic splits and circuit breakers.

  • Feature flags (LaunchDarkly, Unleash) with kill switches tied to SLOs.

Example Argo Rollouts analysis template that blocks a canary when p99 or error rate violates SLO:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-slo-check
spec:
  metrics:
  - name: p99-latency
    interval: 1m
    count: 5
    successCondition: result[0] < 0.4  # < 400ms
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api",route="/checkout"}[1m])) by (le))
  - name: error-rate
    interval: 1m
    count: 5
    successCondition: result[0] < 0.001  # < 0.1%
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1m]))
            / sum(rate(http_requests_total{job="api"}[1m]))

Hook this template into a Rollout with a 10% -> 25% -> 50% progression, and you’ll stop bad code within minutes—often before users notice.
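
A minimal Rollout sketch for that progression (name, image, and pause durations are placeholders); the analysis runs in the background once the canary takes traffic and aborts the rollout when either metric fails:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: registry.internal/api:1.42.0   # placeholder image
  strategy:
    canary:
      analysis:
        templates:
        - templateName: api-slo-check   # the AnalysisTemplate above
        startingStep: 1                 # begin checks after the first traffic shift
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 25
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 5m}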

For flags, we’ve shipped webhooks that drop a high-risk flag to OFF when the burn rate crosses the “page” threshold. No humans in the loop when the budget is on fire.
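
One low-effort way to wire that is an Alertmanager route: fast-burn pages also fan out to a small webhook that flips the high-risk flag. A sketch, assuming the fast-burn alert above and a hypothetical internal flag-killer endpoint:

# alertmanager.yml (fragment)
route:
  receiver: pagerduty
  routes:
    - matchers:
        - alertname = "APIErrorBudgetBurnFast"
      receiver: flag-kill-switch
      continue: true                    # still page a human
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <pagerduty-key>    # placeholder
  - name: flag-kill-switch
    webhook_configs:
      - url: http://flag-killer.internal/hooks/burn-rate   # hypothetical service that sets the flag to OFF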

Kill Vanity Metrics: What to Delete Today

If it doesn’t predict or explain incidents, it’s vanity. Common culprits:

  • CPU/Memory averages without throttling or RSS headroom. They hide saturation until it’s too late.

  • Total request counts with no error/latency context. Traffic is not health.

  • Composite “health scores” that mix unrelated signals into a single meaningless number.

  • Disk usage percent without growth rate or time-to-full (see the query below).

  • Ping success to the load balancer. Your app can be “up” and still unusable.

Move these to the “Explore” dashboard. On-call gets leading indicators only.
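
For the disk example, one PromQL pattern turns a static percentage into a leading indicator (node_exporter metrics; the mountpoint is a placeholder):

# Fire if the filesystem will be full within 4 hours at the current 6h growth rate
predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[6h], 4 * 3600) < 0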

A 30-Day Plan to Make Dashboards Actionable

You don’t need a re-platform. You need intent and a few weeks of work.

  1. Week 1 – Inventory and SLOs
    • List top 5 user journeys or critical APIs; draft SLIs (availability, latency, correctness).
    • Set SLOs with product: e.g., availability 99.9%, p99 < 400ms for checkout.
    • Add release annotations in Grafana (CI posts to /api/annotations).
  2. Week 2 – Build the MVP dashboard
    • Create a single “On-Call” dashboard per service with 6–8 panels from the list above.
    • Add explicit thresholds and runbook links; owners on each panel.
    • Add trace exemplars; verify trace_id in logs and metrics via OpenTelemetry.
  3. Week 3 – Alerts and triage flow
    • Implement multi-window burn alerts; page only for fast burns.
    • Wire Grafana links to Loki/Tempo/Jaeger with pre-filtered queries.
    • Run a game day; time MTTA/MTTR; fix the slowest step.
  4. Week 4 – Rollout gating
    • Add Argo Rollouts/Flagger analysis templates for SLO checks.
    • Gate canaries on p99 and error rate; auto-rollback on failure.
    • Add a feature-flag kill switch tied to burn-rate alerts.

Expect to cut noisy alerts by ~50% and MTTR by 20–40% in the first month if you enforce the diet.

What We Learned and What We’d Do Again

  • Averages lie. Tail latency and saturation predict pain; averages tell bedtime stories.

  • Annotate everything. Releases, migrations, config flips. Saves you 10 minutes per incident just correlating events.

  • Make it a contract. Every panel must state: owner, SLO, threshold, runbook. If you can’t name these, delete it.

  • Automate rollback. Humans are too slow at 2 a.m. Let the pipeline trip the breaker when SLOs fail.

  • Review monthly. Dashboards rot. Delete a panel every time you add one.

At a payments client, this approach turned a 40-panel Grafana monster into eight decisive charts. We cut false pages by 60%, caught two regressions at 10% canary, and dropped median MTTR from 48 to 28 minutes in five weeks. No new shiny platform—just ruthless focus and wiring what we already had.

Key takeaways

  • Dashboards should drive decisions, not decorate walls. Keep 6–8 leading-indicator charts per service.
  • Measure what predicts incidents: saturation, burn rate, tail latency, retries, and queue/lag.
  • Make thresholds explicit with SLOs and multi-window burn alerts instead of vague colors.
  • Wire dashboards into triage: link panels to logs, traces, and runbooks in two clicks.
  • Gate rollouts with metrics: Argo Rollouts/Flagger + Prometheus to auto-pause or rollback.
  • Delete vanity metrics: CPU averages, request counts, and composite “health scores” that hide risk.
  • Implement in 30 days: inventory, define SLIs/SLOs, build MVP dashboard, alerts, rollout gates, refine.

Implementation checklist

  • Define SLIs/SLOs with owners and budgets; publish them in the dashboard header.
  • Choose 6–8 leading indicators per service using RED/USE and error budget burn.
  • Set explicit thresholds with PromQL recording rules; alert on multi-window burn.
  • Add drill-down links from each panel to logs (e.g., `Loki`), traces (`Tempo`/`Jaeger`), and runbooks.
  • Annotate releases in Grafana; require post-release panels for triage.
  • Integrate Argo Rollouts/Flagger analysis templates with Prometheus queries for canary gating.
  • Remove vanity charts; move them to a separate “explore” dashboard, not on-call.
  • Review alerts monthly; simulate incidents (game days) and tune thresholds.

Questions we hear from teams

How many panels should my on-call dashboard have?
Aim for 6–8 per service. If you can’t decide which to drop, you haven’t defined SLOs tightly enough. Everything else belongs on an “Explore” dashboard, not the on-call view.
What if we don’t have clean metrics yet?
Start by instrumenting SLIs via `OpenTelemetry` or your SDKs. You can still set up burn-rate alerts on existing request/error counters and tail latency histograms while you refine semantics.
Won’t we miss context if we cut charts?
You won’t if you add drill-down links to logs, traces, and dependency dashboards. Keep deep-dive dashboards separate; the on-call view is about decisions, not exploration.
How do we pick SLO targets?
Use historical performance and business tolerance. Start with conservative targets (e.g., 99.9% availability, p99 < 400ms for critical flows), then iterate quarterly with product on user impact and cost.
Can we do this without Kubernetes?
Yes. The approach is platform-agnostic. You can drive the same outcomes with VMs and a deployment tool that supports health checks plus Prometheus scraping and feature flags.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
