Stop Shipping Blind: Dashboards That Catch AI Model Rot Before Users Rage
If you’re not measuring drift, hallucination rate, and p95 latency per model version, you’re running an incident farm. Here’s the gritty playbook we use to spot AI degradation before customers do.
If your dashboards can’t spot drift, hallucinations, and latency spikes before users do, you’re not observant—you’re lucky.
The outage you don’t see coming
I’ve watched great teams get blindsided by AI model rot. At a fintech last quarter, a “minor” retriever tweak shipped behind a 10% flag. Within an hour, p95 latency jumped 2.3x and the hallucination complaints started trickling in. The culprit: a reranker version bump changed the candidate set, which ballooned prompt length, which pushed us into provider throttling. We didn’t lose users, because the dashboard screamed first: a dip in `ai_eval_pass_ratio`, a spike in `ai_guardrail_triggers_total`, and a sudden rise in `ai_prompt_tokens` per request—only in the canary cohort. We froze the rollout at 10%, flipped back to the previous reranker, and the graphs settled.
If your dashboards can’t spot those early tells—drift, hallucination, latency spikes—before your users feel them, you’re flying VFR in a thunderstorm.
What to measure (signals that move before churn)
You need SLIs that reflect the whole AI-enabled flow, not just the LLM call. At minimum:
- Latency
  - `ai_inference_latency_seconds` (histogram): p50/p95/p99 by `route`, `model`, `model_version`, `provider`.
  - Stage timing: `retrieval_ms`, `rerank_ms`, `prompt_build_ms`, `queue_ms`, `provider_latency_ms`.
- Errors & refusals
  - `ai_inference_errors_total` by `type` (timeout, rate_limit, 5xx, provider_refusal, guardrail_block).
  - `ai_fallback_total` when you switch models/providers.
- Quality proxies
  - `ai_eval_pass_total` / `ai_eval_total` -> `ai_eval_pass_ratio` (RAGAS faithfulness/answer relevance, task-specific checks, keyword coverage).
  - `ai_hallucination_flags_total` from heuristics or reviewer labels.
- Retrieval health (for RAG)
  - `ai_retrieval_hit_rate` (% of queries where the ground-truth doc appears in top-k).
  - `ai_mrr` (mean reciprocal rank) on labeled sets.
  - `ai_topk_overlap` between versions during canaries.
- Drift
  - `ai_input_psi` (Population Stability Index) for feature/term distributions (see the sketch below).
  - Embedding centroid shift and cosine similarity vs. baseline.
- Safety
  - `ai_guardrail_triggers_total` by category: PII, jailbreak, toxicity.
  - `ai_pii_redaction_success_total` / `_attempt_total`.
- Cost & tokens
  - `ai_prompt_tokens_total`, `ai_completion_tokens_total`, `ai_cost_usd_total` per request.
  - Cache `hit_ratio` if you use PromptCache/Redis.
Tag everything with: model, model_version, provider, route, prompt_template_hash, feature_flag, and a privacy-safe tenant/segment.
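For the drift signals above, here is a minimal sketch of how you might compute PSI over term-frequency buckets and embedding centroid shift against a rolling baseline. The gauge name `ai_embedding_centroid_shift`, the bucket handling, and the scheduling are illustrative assumptions, not a prescribed implementation.

```python
# drift_metrics.py
# Sketch only: assumes numpy, prometheus_client, and that your pipeline supplies
# pre-bucketed term counts plus embedding matrices for the baseline and current windows.
import numpy as np
from prometheus_client import Gauge

input_psi = Gauge('ai_input_psi', 'PSI of input term distribution vs baseline', ['route'])
centroid_shift = Gauge('ai_embedding_centroid_shift', 'Cosine distance of embedding centroid vs baseline', ['route'])


def psi(baseline_counts: np.ndarray, current_counts: np.ndarray, eps: float = 1e-6) -> float:
    """Population Stability Index over pre-bucketed counts (same bucket order in both arrays)."""
    b = baseline_counts / max(baseline_counts.sum(), eps)
    c = current_counts / max(current_counts.sum(), eps)
    b, c = np.clip(b, eps, None), np.clip(c, eps, None)
    return float(np.sum((c - b) * np.log(c / b)))


def centroid_cosine_distance(baseline_embs: np.ndarray, current_embs: np.ndarray) -> float:
    """Cosine distance between mean embedding vectors of the baseline and current windows."""
    b, c = baseline_embs.mean(axis=0), current_embs.mean(axis=0)
    cos = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c) + 1e-12)
    return float(1.0 - cos)


def publish_drift(route: str, baseline_counts, current_counts, baseline_embs, current_embs) -> None:
    input_psi.labels(route).set(psi(np.asarray(baseline_counts), np.asarray(current_counts)))
    centroid_shift.labels(route).set(
        centroid_cosine_distance(np.asarray(baseline_embs), np.asarray(current_embs))
    )
```

Run something like this on a schedule (say every 15 minutes) against a rolling baseline window; a PSI above roughly 0.2–0.25 is a common rule-of-thumb signal of meaningful shift, but calibrate against your own history and alert on sustained deviation.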
Instrumentation that survives real traffic
Use OpenTelemetry for end-to-end traces and Prometheus/Datadog for metrics. Structured logs go to ELK/ClickHouse/BigQuery for deep dives. The trick is consistent IDs and tags across stages.
- Trace the whole flow: request -> retrieve -> rerank -> prompt -> LLM -> postprocess -> guardrail.
- Propagate `trace_id`, `span_id`, and a stable `request_id` through async hops and queues.
- Log the prompt template hash and redacted features; never log raw PII.
```python
# instrumentation.py
# Python example: OpenTelemetry + Prometheus for an LLM/RAG service
from time import perf_counter

from prometheus_client import Counter, Histogram, Gauge, start_http_server
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

llm_latency = Histogram(
    'ai_inference_latency_seconds',
    'LLM end-to-end latency',
    ['route', 'model', 'model_version', 'provider'],
)
llm_errors = Counter(
    'ai_inference_errors_total',
    'LLM errors by type',
    ['route', 'model', 'model_version', 'provider', 'type'],
)
llm_tokens = Counter(
    'ai_tokens_total', 'Prompt + completion tokens', ['route', 'model', 'model_version', 'kind']
)
llm_eval_pass = Counter('ai_eval_pass_total', 'Eval pass count', ['route', 'model_version'])
llm_eval_total = Counter('ai_eval_total', 'Eval total count', ['route', 'model_version'])
retrieval_hit = Gauge('ai_retrieval_hit_rate', 'RAG hit rate (windowed via recording rule)', ['route'])

start_http_server(9464)  # Prometheus scrape endpoint


def generate_llm(route, provider_client, prompt, model, model_version):
    with tracer.start_as_current_span('llm.generate') as span:
        span.set_attribute('route', route)
        span.set_attribute('model', model)
        span.set_attribute('model_version', model_version)
        start = perf_counter()
        try:
            resp = provider_client.chat.completions.create(
                model=model,
                messages=prompt,
                temperature=0.2,
                timeout=10,
            )
            dur = perf_counter() - start
            llm_latency.labels(route, model, model_version, provider_client.name).observe(dur)
            # Token usage may be a typed object or missing entirely; read it defensively.
            usage = getattr(resp, 'usage', None)
            llm_tokens.labels(route, model, model_version, 'prompt').inc(getattr(usage, 'prompt_tokens', 0) or 0)
            llm_tokens.labels(route, model, model_version, 'completion').inc(getattr(usage, 'completion_tokens', 0) or 0)
            span.set_status(Status(StatusCode.OK))
            return resp
        except Exception as e:
            llm_errors.labels(route, model, model_version, provider_client.name, type(e).__name__).inc()
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR))
            raise
```

Structured logs matter for forensics and labeling:
{"ts":"2025-03-21T12:05:14Z","trace_id":"2d6c...","route":"answer_question","model":"gpt-4o","model_version":"2025-05-13","prompt_template_hash":"b7c1f3","feature_flag":"reranker_v2_canary","retrieval_docs":8,"pii_redacted":true,"eval":{"suite":"ragas_v0.9","faithfulness":0.81,"answer_relevance":0.77},"guardrail":{"triggers":["pii"],"blocked":false},"latency_ms":{"retrieval":42,"rerank":31,"llm":820}}Dashboards that tell the truth (Grafana/Datadog layout)
Build dashboards that answer “what broke, where, and why” in one screen. The layout we ship with GitPlumbers:
- Top row: SLOs and burn rate
  - p95 `ai_inference_latency_seconds` by `route` and `model_version`.
  - Error/refusal rate stacked by type.
  - `ai_eval_pass_ratio` as a sparkline with a 7d baseline.
- Middle: stage breakdown
  - Retrieval/rerank/prompt/LLM timings and queue depth.
  - Provider saturation: rate limits, `429`/`503` by provider.
- Quality & drift
  - RAG hit rate, MRR, top-k overlap vs. control.
  - PSI per key feature/term; embedding centroid shift.
- Safety & cost
  - Guardrail trigger categories and block rate.
  - Tokens per request and `ai_cost_usd_total` per tenant.
PromQL snippets you’ll reuse:
```promql
# p95 latency per model version, 5m rate
histogram_quantile(
  0.95,
  sum(rate(ai_inference_latency_seconds_bucket{route="answer"}[5m])) by (le, model, model_version)
)

# Eval pass ratio (15m)
(increase(ai_eval_pass_total[15m]) / increase(ai_eval_total[15m]))

# Error/refusal rate (5m)
sum(rate(ai_inference_errors_total{type!=""}[5m])) by (type, provider)
```

In Datadog, tag your metrics the same way and create widgets per `model_version`. Lock a “canary vs control” comparison view so on-calls can spot divergence instantly.
Guardrails and evals: your early warning system
You won’t catch hallucinations with 5xx graphs. You need automated evals and safety guardrails that produce metrics.
- Evals
  - Use `RAGAS` for RAG (faithfulness, answer relevance, context precision/recall) and `Evidently` for drift/stability reports.
  - Run small eval suites on canary traffic and nightly on sampled production data with redaction.
  - Publish scores as time-series, not just CSVs.
- Guardrails
  - Apply jailbreak/PII/toxicity checks using `Guardrails.ai`, `LlamaGuard`, or provider moderation APIs.
  - Log triggers with categories; block or safe-complete depending on policy (see the sketch after this list).
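Here is a minimal sketch of that log-then-decide step. The `run_guardrail_checks` callable, the policy sets, the safe-completion placeholder, and the extra `route`/`action` labels on `ai_guardrail_triggers_total` are illustrative assumptions standing in for whichever checker and policy you actually run.

```python
# guardrail_policy.py
# Sketch only: `run_guardrail_checks` stands in for Guardrails.ai, LlamaGuard,
# or a provider moderation API and returns a list of triggered categories.
from prometheus_client import Counter

guardrail_triggers = Counter(
    'ai_guardrail_triggers_total', 'Guardrail triggers by category', ['route', 'category', 'action']
)

BLOCK_CATEGORIES = {'jailbreak', 'toxicity'}   # policy assumption: hard-block these
SAFE_COMPLETE_CATEGORIES = {'pii'}             # policy assumption: redact/safe-complete these


def apply_guardrails(route: str, output_text: str, run_guardrail_checks) -> str:
    """Run checks, record a trigger per category, then block or safe-complete per policy."""
    triggered = run_guardrail_checks(output_text)  # e.g. ['pii'], ['jailbreak'], or []
    for category in triggered:
        action = 'block' if category in BLOCK_CATEGORIES else 'safe_complete'
        guardrail_triggers.labels(route, category, action).inc()
    if BLOCK_CATEGORIES.intersection(triggered):
        return "I can't help with that request."  # blocked response
    if SAFE_COMPLETE_CATEGORIES.intersection(triggered):
        return '[REDACTED] ' + output_text        # placeholder for real redaction
    return output_text
```

The point is that every decision emits a metric with a category, so the dashboard sees a guardrail spike in the same view as latency and eval regressions.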
Example: run evals in CI and surface a metric gate.
```bash
# ci-evals.sh
set -euo pipefail
python -m evals.run --suite ragas_v0_9 --input data/canary.jsonl --out out/metrics.json
python -m evidently --profile drift --ref data/baseline.parquet --cur data/canary.parquet --out out/drift.json
jq '.ragas.faithfulness' out/metrics.json | awk '{ if ($1 < 0.78) exit 1 }'
```

Wire those results back as metrics:
```python
# publish_evals.py
# Pushes eval results to a Prometheus Pushgateway so they land as time-series.
# (The gateway address and job name are assumptions; point them at your own setup.)
import json

from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
pass_c = Counter('ai_eval_pass_total', 'Eval pass', ['suite', 'model_version'], registry=registry)
all_c = Counter('ai_eval_total', 'Eval total', ['suite', 'model_version'], registry=registry)

m = json.load(open('out/metrics.json'))
model_version = m.get('meta', {}).get('model_version', 'unknown')

for case in m['cases']:
    all_c.labels('ragas_v0_9', model_version).inc()
    if case['faithfulness'] >= 0.78 and case['answer_relevance'] >= 0.8:
        pass_c.labels('ragas_v0_9', model_version).inc()

push_to_gateway('pushgateway:9091', job='ci-evals', registry=registry)
```

Alerts that wake the right person (not the whole company)
Use SLO burn rates and composite conditions. Single-threshold alerts on p95 are noisy and late.
- Multi-window burn rate for latency and error budgets (SRE-style 5m/1h windows).
- Composite: alert when `eval_pass_ratio` drops AND `retrieval_hit_rate` drops (correlated failure), or when `fallback_total` spikes AND cost per request climbs.
- Route to owners: retrieval issues to the search team, provider 429s to platform.
Prometheus example:
```yaml
# alerts.yaml
groups:
  - name: ai-slo
    rules:
      - alert: AIHighLatencyBurnRate
        expr: |
          histogram_quantile(0.95, sum(rate(ai_inference_latency_seconds_bucket[5m])) by (le, route))
            > on(route) group_left() (slo_latency_seconds{route!=""})
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "p95 latency SLO breach on {{ $labels.route }}"
      - alert: EvalAndRetrievalDegradation
        expr: |
          (1 - (increase(ai_eval_pass_total[15m]) / increase(ai_eval_total[15m]))) > 0.25
          and on(route) (avg_over_time(ai_retrieval_hit_rate[15m]) < 0.7)
        for: 15m
        labels: {severity: page}
        annotations:
          summary: "Eval pass rate + retrieval hit drop"
      - alert: HallucinationSpike
        expr: increase(ai_hallucination_flags_total[30m]) > 100
        labels: {severity: warn}
        annotations:
          summary: "Hallucination flags spiking"
```

Datadog equivalent: create monitors with formulas combining metrics and add change alerts (delta vs 7d baseline) for `ai_tokens_total` and `ai_cost_usd_total`.
Rollbacks, canaries, and budget caps (when—not if—things degrade)
Don’t wait for humans to click buttons during an incident. Wire safeties into the system.
- Feature flags (LaunchDarkly/Unleash): select `model_version` and reranker behind flags; ramp gradually; log the flag in every trace.
- Canaries (Argo Rollouts): 10% -> 25% -> 50% with automated pauses on SLO/eval failures.
- Circuit breakers (Istio/Envoy): trip on 5xx/latency to cut over to a stable provider or a smaller model.
- Budget caps: stop-the-bleed when token usage or cost spikes (see the sketch after the examples below).
Examples:
```typescript
// route.ts – choose model by flag
// (LaunchDarkly's Node server SDK is created with init(); LDClient is the client type.
//  In production you'd also await ld.waitForInitialization() before serving traffic.)
import { init, LDClient } from 'launchdarkly-node-server-sdk'

const ld: LDClient = init('sdk-key')

export async function modelFor(userKey: string) {
  const ctx = { key: userKey }
  const useNew = await ld.variation('model_v2_canary', ctx, false)
  return useNew
    ? { model: 'gpt-4o', version: '2025-05-13' }
    : { model: 'gpt-4o', version: '2025-03-01' }
}
```

```yaml
# istio-destinationrule.yaml – outlier detection
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-provider
spec:
  host: provider.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
```

```yaml
# argo-rollouts.yaml – canary with metric gates
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rag-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 10m}
        - setWeight: 25
        - pause: {duration: 20m}
      analysis:
        templates:
          - templateName: eval-pass-ratio
      trafficRouting:
        istio:
          virtualService:
            name: rag-vs
```

If you can’t auto-pause a canary when evals regress, you don’t have a canary—you have a countdown.
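For the budget-cap bullet above, here is a minimal sketch of a request-level and per-tenant cap. The limits, the Redis-backed daily counter, and the `BudgetExceeded` exception are illustrative assumptions you would tune to your own traffic and pricing.

```python
# budget_caps.py
# Sketch only: stop-the-bleed caps on tokens and spend. Assumes redis-py; all limits are placeholders.
import redis

r = redis.Redis(host='localhost', port=6379)

MAX_TOKENS_PER_REQUEST = 8_000         # hard per-request ceiling (assumption)
MAX_DAILY_COST_USD_PER_TENANT = 50.0   # per-tenant daily spend cap (assumption)


class BudgetExceeded(Exception):
    pass


def check_request_budget(prompt_tokens: int) -> None:
    """Reject a single oversized request before it ever hits the provider."""
    if prompt_tokens > MAX_TOKENS_PER_REQUEST:
        raise BudgetExceeded(f'request exceeds {MAX_TOKENS_PER_REQUEST} tokens')


def charge_tenant(tenant_id: str, cost_usd: float) -> None:
    """Accumulate per-tenant spend in Redis and trip the cap once it's exceeded."""
    key = f'ai:cost:{tenant_id}'
    new_total = r.incrbyfloat(key, cost_usd)
    r.expire(key, 86_400)  # approximate daily window: key expires 24h after last charge
    if float(new_total) > MAX_DAILY_COST_USD_PER_TENANT:
        raise BudgetExceeded(f'tenant {tenant_id} exceeded daily cost cap')
```

Call `check_request_budget` before the provider call and `charge_tenant` after it; when `BudgetExceeded` fires, route to a cheaper fallback model or a cached answer instead of silently burning budget during an incident.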
What we’d set up in your first week
In a week, we’ve taken teams from vibes to visibility:
- Add OpenTelemetry traces across retrieval -> LLM -> postprocess with `model`, `model_version`, and `prompt_template_hash`.
- Expose Prometheus metrics for latency, errors/refusals, tokens/cost, eval pass ratio, retrieval hit/MRR, drift PSI, and guardrail triggers.
- Stand up Grafana/Datadog dashboards: top-row SLOs, mid-row stage breakdown, bottom-row safety/cost.
- Wire RAGAS/Evidently evals into CI + nightly batch; publish metrics.
- Add LaunchDarkly/Unleash flags and Argo canaries; configure Istio circuit breakers.
- Create alert rules with burn rates and composite conditions; route to the right on-call.
- Write runbooks: rollback, provider failover, budget caps, and how to read the dashboard during an incident.
Results we typically see: 30–50% reduction in MTTR for AI incidents, 15–25% lower token spend from catching prompt bloat early, and—most important—regressions caught in canary rather than on Twitter.
Key takeaways
- Instrument the entire AI request path with traces and per-stage metrics; tag by model, version, prompt template hash, and feature flag.
- Track quality proxies (eval pass rate, hallucination flags), retrieval health (hit rate, MRR), and drift (PSI/JS divergence) alongside latency and errors.
- Build Grafana/Datadog dashboards by route and model version; place SLOs and burn-rate panels at the top, stage breakdowns in the middle, cost/safety at the bottom.
- Use automated evals (RAGAS/Evidently) on canaries and nightly batches; publish scores as time-series metrics and alert on deltas.
- Set composite alerts (e.g., eval fail rate + retriever hit drop) and automated safeguards (circuit breakers, fallback models, cost caps).
- Roll out with feature flags and Argo Rollouts canaries; auto-rollback on SLO breach or eval regressions.
Implementation checklist
- Emit traces for every AI request with `trace_id`, `model`, `model_version`, `prompt_template_hash`, and `feature_flag`.
- Expose Prometheus metrics: latency histogram, error counts, eval pass/fail, retrieval hit rate, drift PSI, token/cost, guardrail triggers.
- Create dashboards with p50/p95/p99 latency by route and version; error and refusal rates; eval pass rate; retrieval hit/MRR; drift PSI; token and cost per request.
- Add Prometheus/Datadog alerts using multi-window burn rates and composite conditions.
- Integrate evals (RAGAS/Evidently) into CI and nightly jobs; publish scores as metrics and artifact links.
- Implement guardrails (PII redaction, toxicity, jailbreak) and log triggers per request.
- Use LaunchDarkly/Unleash for model version flags; enable Argo Rollouts canaries; configure Istio circuit breakers and outlier detection.
- Define runbooks: rollback steps, provider failover, cache warmups, budget caps, and comms channels.
Questions we hear from teams
- How do we quantify hallucinations without a massive labeling team?
- Start with automated evals like RAGAS faithfulness and answer relevance on a small, curated set of golden questions. Augment with lightweight human review (50–100 samples/week). Publish pass rate as a metric (`ai_eval_pass_ratio`). Use heuristics (unsupported claims, out-of-context citations) as additional flags, but gate rollouts on the golden-set metrics.
- Which tools should we pick: Prometheus/Grafana or Datadog?
- If you already run Datadog, use it—tags and notebooks are handy. If you’re Kubernetes-native with SRE muscle, Prometheus/Grafana gives you control and lower cost. The important part is consistent labels and having both metrics and traces (OpenTelemetry) wired through the same IDs.
- How do we detect drift in unstructured text?
- Compute PSI on token/term buckets and monitor embedding centroid shift (cosine distance) relative to a rolling baseline. Evidently AI can produce drift reports you can convert into metrics. Alert on sustained deviation, not single spikes.
- What about privacy? Can we log prompts and outputs?
- Redact PII at the edge (phone, email, SSN, customer IDs). Store only hashed identifiers. Keep a sampled, heavily redacted corpus for evals with strict access controls and TTL. Never log raw credentials or secrets; verify via automated checks.
- How do we prevent runaway costs during incidents?
- Track tokens and cost per request. Put a request-level budget cap and a per-tenant daily cap. Alert on change vs 7d baseline. Add circuit breakers that disable expensive fallbacks if SLOs aren’t improving.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
