Stop Shipping Blind: Dashboards That Catch AI Model Rot Before Users Rage

If you’re not measuring drift, hallucination rate, and p95 latency per model version, you’re running an incident farm. Here’s the gritty playbook we use to spot AI degradation before customers do.

If your dashboards can’t spot drift, hallucinations, and latency spikes before users do, you’re not observant—you’re lucky.

The outage you don’t see coming

I’ve watched great teams get blindsided by AI model rot. At a fintech last quarter, a “minor” retriever tweak shipped behind a 10% flag. Within an hour, p95 latency jumped 2.3x and the hallucination complaints started trickling in. The culprit: a reranker version bump changed the candidate set, which ballooned prompt length, which pushed us into provider throttling. We didn’t lose users, because the dashboard screamed first: a dip in ai_eval_pass_ratio, a spike in ai_guardrail_triggers_total, and a sudden rise in ai_prompt_tokens per request, all of it confined to the canary cohort. We froze the rollout at 10%, flipped to the previous reranker, and the graphs settled.

If your dashboards can’t spot those early tells—drift, hallucination, latency spikes—before your users feel them, you’re flying VFR in a thunderstorm.

What to measure (signals that move before churn)

You need SLIs that reflect the whole AI-enabled flow, not just the LLM call. At minimum:

  • Latency
    • ai_inference_latency_seconds (histogram): p50/p95/p99 by route, model, model_version, provider.
    • Stage timing: retrieval_ms, rerank_ms, prompt_build_ms, queue_ms, provider_latency_ms.
  • Errors & refusals
    • ai_inference_errors_total by type (timeout, rate_limit, 5xx, provider_refusal, guardrail_block).
    • ai_fallback_total when you switch models/providers.
  • Quality proxies
    • ai_eval_pass_total / ai_eval_total -> ai_eval_pass_ratio (RAGAS faithfulness/answer relevance, task-specific checks, keyword coverage).
    • ai_hallucination_flags_total from heuristics or reviewer labels.
  • Retrieval health (for RAG)
    • ai_retrieval_hit_rate (% queries where ground truth doc appears in top-k).
    • ai_mrr (mean reciprocal rank) on labeled sets.
    • ai_topk_overlap between versions during canaries.
  • Drift
    • ai_input_psi (Population Stability Index) for feature/term distributions.
    • Embedding centroid shift and cosine similarity vs baseline.
  • Safety
    • ai_guardrail_triggers_total by category: PII, jailbreak, toxicity.
    • ai_pii_redaction_success_total / _attempt_total.
  • Cost & tokens
    • ai_prompt_tokens_total, ai_completion_tokens_total, ai_cost_usd_total per request.
    • Cache hit_ratio if you use PromptCache/Redis.

Tag everything with: model, model_version, provider, route, prompt_template_hash, feature_flag, and a privacy-safe tenant/segment.
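
A small helper keeps that label set identical across metrics, spans, and logs. This is a convention sketch: `build_labels` and the short SHA-256 `prompt_template_hash` are names we’re assuming here, not anything a library provides.

# labels.py – one place to build the label set shared by metrics, spans, and logs
import hashlib


def prompt_template_hash(template: str) -> str:
    """Short, stable hash of the template text (not the filled prompt)."""
    return hashlib.sha256(template.encode('utf-8')).hexdigest()[:8]


def build_labels(*, model, model_version, provider, route,
                 template, feature_flag='none', segment='unknown'):
    """Label dict attached to every metric, span, and structured log line."""
    return {
        'model': model,
        'model_version': model_version,
        'provider': provider,
        'route': route,
        'prompt_template_hash': prompt_template_hash(template),
        'feature_flag': feature_flag,
        'segment': segment,  # privacy-safe tenant/segment, never a raw tenant ID
    }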

Instrumentation that survives real traffic

Use OpenTelemetry for end-to-end traces and Prometheus/Datadog for metrics. Structured logs go to ELK/ClickHouse/BigQuery for deep dives. The trick is consistent IDs and tags across stages.

  • Trace the whole flow: request -> retrieve -> rerank -> prompt -> LLM -> postprocess -> guardrail.
  • Propagate trace_id, span_id, and a stable request_id through async hops and queues.
  • Log the prompt template hash and redacted features; never log raw PII.
# instrumentation.py
# Python example: OpenTelemetry + Prometheus for an LLM/RAG service
from time import perf_counter
from prometheus_client import Counter, Histogram, Gauge, start_http_server
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

llm_latency = Histogram(
    'ai_inference_latency_seconds',
    'LLM end-to-end latency',
    ['route', 'model', 'model_version', 'provider']
)
llm_errors = Counter(
    'ai_inference_errors_total',
    'LLM errors by type',
    ['route', 'model', 'model_version', 'provider', 'type']
)
llm_tokens = Counter(
    'ai_tokens_total', 'Prompt + completion tokens', ['route', 'model', 'model_version', 'kind']
)
llm_eval_pass = Counter('ai_eval_pass_total', 'Eval pass count', ['route', 'model_version'])
llm_eval_total = Counter('ai_eval_total', 'Eval total count', ['route', 'model_version'])
retrieval_hit = Gauge('ai_retrieval_hit_rate', 'RAG hit rate (windowed via recording rule)', ['route'])

start_http_server(9464)  # Prometheus scrape endpoint


def generate_llm(route, provider_client, prompt, model, model_version):
    with tracer.start_as_current_span('llm.generate') as span:
        span.set_attribute('route', route)
        span.set_attribute('model', model)
        span.set_attribute('model_version', model_version)
        start = perf_counter()
        try:
            resp = provider_client.chat.completions.create(
                model=model,
                messages=prompt,
                temperature=0.2,
                timeout=10,
            )
            dur = perf_counter() - start
            # provider_client.name is assumed to exist on your client wrapper
            llm_latency.labels(route, model, model_version, provider_client.name).observe(dur)
            # Token accounting: handle both dict-style and OpenAI-style usage objects
            usage = getattr(resp, 'usage', None) or {}
            if not isinstance(usage, dict):
                usage = {'prompt_tokens': getattr(usage, 'prompt_tokens', 0) or 0,
                         'completion_tokens': getattr(usage, 'completion_tokens', 0) or 0}
            llm_tokens.labels(route, model, model_version, 'prompt').inc(usage.get('prompt_tokens', 0))
            llm_tokens.labels(route, model, model_version, 'completion').inc(usage.get('completion_tokens', 0))
            span.set_status(Status(StatusCode.OK))
            return resp
        except Exception as e:
            llm_errors.labels(route, model, model_version, provider_client.name, type(e).__name__).inc()
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR))
            raise
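
The block above only times the LLM call itself. The per-stage timings from the metrics list (retrieval_ms, rerank_ms, prompt_build_ms) fall out of the same pattern; a minimal sketch, where `ai_stage_latency_seconds` and `timed_stage` are names we’re choosing for illustration:

# stages.py – per-stage spans and timings (retrieval, rerank, prompt build, queue)
from contextlib import contextmanager
from time import perf_counter

from opentelemetry import trace
from prometheus_client import Histogram

tracer = trace.get_tracer(__name__)

stage_latency = Histogram(
    'ai_stage_latency_seconds',
    'Per-stage latency in the AI request path',
    ['route', 'stage']
)


@contextmanager
def timed_stage(route: str, stage: str):
    """Open a child span and record the stage duration in the histogram."""
    with tracer.start_as_current_span(f'ai.{stage}') as span:
        span.set_attribute('route', route)
        start = perf_counter()
        try:
            yield span
        finally:
            stage_latency.labels(route, stage).observe(perf_counter() - start)


# Usage in the request handler (retriever/reranker are whatever you already run):
# with timed_stage('answer_question', 'retrieval'):
#     docs = retriever.search(query, k=8)
# with timed_stage('answer_question', 'rerank'):
#     docs = reranker.rerank(query, docs)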

Structured logs matter for forensics and labeling:

{"ts":"2025-03-21T12:05:14Z","trace_id":"2d6c...","route":"answer_question","model":"gpt-4o","model_version":"2025-05-13","prompt_template_hash":"b7c1f3","feature_flag":"reranker_v2_canary","retrieval_docs":8,"pii_redacted":true,"eval":{"suite":"ragas_v0.9","faithfulness":0.81,"answer_relevance":0.77},"guardrail":{"triggers":["pii"],"blocked":false},"latency_ms":{"retrieval":42,"rerank":31,"llm":820}}

Dashboards that tell the truth (Grafana/Datadog layout)

Build dashboards that answer “what broke, where, and why” in one screen. The layout we ship with GitPlumbers:

  1. Top row: SLOs and burn rate
    • p95 ai_inference_latency_seconds by route and model_version.
    • Error/refusal rate stacked by type.
    • ai_eval_pass_ratio as a sparkline with 7d baseline.
  2. Middle: Stage breakdown
    • Retrieval/rerank/prompt/LLM timings and queue depth.
    • Provider saturation: rate limits, 429/503 by provider.
  3. Quality & drift
    • RAG hit rate, MRR, top-k overlap vs control.
    • PSI per key feature/term; embedding centroid shift.
  4. Safety & cost
    • Guardrail trigger categories and block rate.
    • Tokens/request and ai_cost_usd_total per tenant.

PromQL snippets you’ll reuse:

# p95 latency per model version, 5m rate
histogram_quantile(
  0.95,
  sum(rate(ai_inference_latency_seconds_bucket{route="answer"}[5m])) by (le, model, model_version)
)

# Eval pass ratio (15m)
(increase(ai_eval_pass_total[15m]) / increase(ai_eval_total[15m]))

# Error/refusal rate (5m)
sum(rate(ai_inference_errors_total{type!=""}[5m])) by (type, provider)

In Datadog, tag your metrics the same way and create widgets per model_version. Lock a “canary vs control” comparison view so on-calls can spot divergence instantly.

Guardrails and evals: your early warning system

You won’t catch hallucinations with 5xx graphs. You need automated evals and safety guardrails that produce metrics.

  • Evals
    • Use RAGAS for RAG (faithfulness, answer relevance, context precision/recall) and Evidently for drift/stability reports.
    • Run small eval suites on canary traffic and nightly on sampled production data with redaction.
    • Publish scores as time-series, not just CSVs.
  • Guardrails
    • Apply jailbreak/PII/toxicity checks using Guardrails.ai, LlamaGuard, or provider moderation APIs.
    • Log triggers with categories; block or safe-complete depending on policy (see the sketch after this list).
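
On the guardrail side, the wrapper only needs to check, count, and decide. A minimal sketch, assuming `checkers` is a dict of category-to-callable adapters over whichever guardrail library you pick; the metric name matches the earlier list:

# guardrails.py – count triggers per category and decide: block or safe-complete
from prometheus_client import Counter

guardrail_triggers = Counter(
    'ai_guardrail_triggers_total',
    'Guardrail triggers by category',
    ['route', 'category', 'action']
)

BLOCKING_CATEGORIES = {'jailbreak'}  # everything else is safe-completed per policy


def apply_guardrails(route: str, text: str, checkers: dict):
    """checkers maps category -> callable returning True when triggered,
    e.g. {'pii': pii_check, 'jailbreak': jailbreak_check, 'toxicity': tox_check}."""
    triggered = [cat for cat, check in checkers.items() if check(text)]
    for cat in triggered:
        action = 'block' if cat in BLOCKING_CATEGORIES else 'safe_complete'
        guardrail_triggers.labels(route, cat, action).inc()
    blocked = any(cat in BLOCKING_CATEGORIES for cat in triggered)
    return {'triggers': triggered, 'blocked': blocked}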

Example: run evals in CI and surface a metric gate.

#!/usr/bin/env bash
# ci-evals.sh – run evals on canary data and gate the build on scores
set -euo pipefail

# evals.run is your internal eval runner; anything that writes metrics.json works
python -m evals.run --suite ragas_v0_9 --input data/canary.jsonl --out out/metrics.json

# Drift report via a thin wrapper around Evidently's Report API (adjust to your setup)
python scripts/run_drift_report.py --ref data/baseline.parquet --cur data/canary.parquet --out out/drift.json

# Gate: fail the pipeline if aggregate faithfulness drops below the floor
jq '.ragas.faithfulness' out/metrics.json | awk '{ if ($1 < 0.78) exit 1 }'

Wire those results back as metrics:

# publish_evals.py – push batch eval results so they exist as time-series
import json
import os

from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
pass_c = Counter('ai_eval_pass_total', 'Eval pass', ['suite', 'model_version'], registry=registry)
all_c  = Counter('ai_eval_total', 'Eval total', ['suite', 'model_version'], registry=registry)

with open('out/metrics.json') as f:
    m = json.load(f)

model_version = m.get('meta', {}).get('model_version', 'unknown')
for case in m['cases']:  # assumes the eval runner emits one record per case
    all_c.labels('ragas_v0_9', model_version).inc()
    if case['faithfulness'] >= 0.78 and case['answer_relevance'] >= 0.8:
        pass_c.labels('ragas_v0_9', model_version).inc()

# Batch jobs exit immediately, so push to a Pushgateway instead of exposing /metrics
push_to_gateway(os.environ.get('PUSHGATEWAY_ADDR', 'pushgateway:9091'),
                job='ai-evals', registry=registry)

Alerts that wake the right person (not the whole company)

Use SLO burn rates and composite conditions. Single-threshold alerts on p95 are noisy and late.

  • Multi-window burn rate for latency and error budgets (SRE-style 5m/1h windows).
  • Composite: alert when eval_pass_ratio drops AND retrieval_hit_rate drops (correlated failure), or when fallback_total spikes AND cost per request climbs.
  • Route to owners: retrieval issues to search team, provider 429s to platform.

Prometheus example:

# alerts.yaml
groups:
- name: ai-slo
  rules:
  - alert: AILatencySLOBreach
    expr: |
      # slo_latency_seconds is a per-route p95 target you publish yourself
      histogram_quantile(0.95, sum(rate(ai_inference_latency_seconds_bucket[5m])) by (le, route))
        > on(route) group_left() (slo_latency_seconds{route!=""})
    for: 10m
    labels: {severity: page}
    annotations:
      summary: "p95 latency SLO breach on {{ $labels.route }}"

  - alert: EvalAndRetrievalDegradation
    expr: |
      (1 - (increase(ai_eval_pass_total[15m]) / increase(ai_eval_total[15m]))) > 0.25
      and on(route) (avg_over_time(ai_retrieval_hit_rate[15m]) < 0.7)
    for: 15m
    labels: {severity: page}
    annotations:
      summary: "Eval pass rate + retrieval hit drop"

  - alert: HallucinationSpike
    expr: increase(ai_hallucination_flags_total[30m]) > 100
    labels: {severity: warn}
    annotations:
      summary: "Hallucination flags spiking"

Datadog equivalent: create monitors with formulas combining metrics and add change alerts (delta vs 7d baseline) for ai_tokens_total and ai_cost_usd_total.

Rollbacks, canaries, and budget caps (when—not if—things degrade)

Don’t wait for humans to click buttons during an incident. Wire safeties into the system.

  • Feature flags (LaunchDarkly/Unleash): select model_version and reranker behind flags; ramp gradually; log the flag in every trace.
  • Canaries (Argo Rollouts): 10% -> 25% -> 50% with automated pauses on SLO/eval failures.
  • Circuit breakers (Istio/Envoy): trip on 5xx/latency to cut over to a stable provider or a smaller model.
  • Budget caps: stop-the-bleed when token usage or cost spikes (see the sketch after the examples below).

Examples:

// route.ts – choose model by flag (LaunchDarkly server-side SDK)
import * as LaunchDarkly from 'launchdarkly-node-server-sdk'

const ld = LaunchDarkly.init('sdk-key')

export async function modelFor(userKey: string) {
  await ld.waitForInitialization()   // resolves once flag data is loaded
  const ctx = { key: userKey }
  const useNew = await ld.variation('model_v2_canary', ctx, false)
  return useNew
    ? { model: 'gpt-4o', version: '2025-05-13' }
    : { model: 'gpt-4o', version: '2025-03-01' }
}
# istio-destinationrule.yaml – outlier detection
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-provider
spec:
  host: provider.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
# argo-rollouts.yaml – canary with metric gates
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rag-service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 10m}
      - setWeight: 25
      - pause: {duration: 20m}
      analysis:
        templates:
        - templateName: eval-pass-ratio
      trafficRouting:
        istio:
          virtualService:
            name: rag-vs   # canary weights are applied to this VirtualService
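
The budget caps from that list can stay boring: a per-request token ceiling plus a per-tenant daily spend cap. A minimal sketch, assuming a shared Redis and that you compute `cost_usd` from your own price table:

# budget_caps.py – stop-the-bleed caps on prompt size and per-tenant spend
import os
from datetime import date

import redis  # assumes a shared Redis is already in the stack

r = redis.Redis(host=os.environ.get('REDIS_HOST', 'redis'), port=6379)

MAX_PROMPT_TOKENS_PER_REQUEST = 8_000
DAILY_TENANT_BUDGET_USD = 50.0


class BudgetExceeded(Exception):
    pass


def check_request_budget(prompt_tokens: int):
    """Refuse oversized prompts before they ever reach the provider."""
    if prompt_tokens > MAX_PROMPT_TOKENS_PER_REQUEST:
        raise BudgetExceeded(f'prompt_tokens={prompt_tokens} over per-request cap')


def charge_tenant(tenant_hash: str, cost_usd: float):
    """Accumulate spend per hashed tenant per day; trip once the daily cap is hit."""
    key = f'ai:spend:{tenant_hash}:{date.today().isoformat()}'
    spend = r.incrbyfloat(key, cost_usd)
    r.expire(key, 2 * 86_400)  # keep yesterday around briefly for dashboards
    if spend > DAILY_TENANT_BUDGET_USD:
        raise BudgetExceeded(f'tenant {tenant_hash} over daily budget')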

If you can’t auto-pause a canary when evals regress, you don’t have a canary—you have a countdown.

What we’d set up in your first week

In a week, we’ve taken teams from vibes to visibility:

  1. Add OpenTelemetry traces across retrieval -> LLM -> postprocess with model, model_version, prompt_template_hash.
  2. Expose Prometheus metrics for latency, error/refusals, tokens/cost, eval pass ratio, retrieval hit/MRR, drift PSI, guardrail triggers.
  3. Stand up Grafana/Datadog dashboards: top-row SLOs, mid-row stage breakdown, bottom-row safety/cost.
  4. Wire RAGAS/Evidently evals into CI + nightly batch; publish metrics.
  5. Add LaunchDarkly/Unleash flags and Argo canaries; configure Istio circuit breakers.
  6. Create alert rules with burn rates and composite conditions; route to the right on-call.
  7. Write runbooks: rollback, provider failover, budget caps, and how to read the dashboard during an incident.

Results we typically see: 30–50% reduction in MTTR for AI incidents, 15–25% lower token spend from catching prompt bloat early, and—most important—regressions caught in canary rather than on Twitter.

Key takeaways

  • Instrument the entire AI request path with traces and per-stage metrics; tag by model, version, prompt template hash, and feature flag.
  • Track quality proxies (eval pass rate, hallucination flags), retrieval health (hit rate, MRR), and drift (PSI/JS divergence) alongside latency and errors.
  • Build Grafana/Datadog dashboards by route and model version; place SLOs and burn-rate panels at the top, stage breakdowns in the middle, cost/safety at the bottom.
  • Use automated evals (RAGAS/Evidently) on canaries and nightly batches; publish scores as time-series metrics and alert on deltas.
  • Set composite alerts (e.g., eval fail rate + retriever hit drop) and automated safeguards (circuit breakers, fallback models, cost caps).
  • Roll out with feature flags and Argo Rollouts canaries; auto-rollback on SLO breach or eval regressions.

Implementation checklist

  • Emit traces for every AI request with `trace_id`, `model`, `model_version`, `prompt_template_hash`, and `feature_flag`.
  • Expose Prometheus metrics: latency histogram, error counts, eval pass/fail, retrieval hit rate, drift PSI, token/cost, guardrail triggers.
  • Create dashboards with p50/p95/p99 latency by route and version; error and refusal rates; eval pass rate; retrieval hit/MRR; drift PSI; token and cost per request.
  • Add Prometheus/Datadog alerts using multi-window burn rates and composite conditions.
  • Integrate evals (RAGAS/Evidently) into CI and nightly jobs; publish scores as metrics and artifact links.
  • Implement guardrails (PII redaction, toxicity, jailbreak) and log triggers per request.
  • Use LaunchDarkly/Unleash for model version flags; enable Argo Rollouts canaries; configure Istio circuit breakers and outlier detection.
  • Define runbooks: rollback steps, provider failover, cache warmups, budget caps, and comms channels.

Questions we hear from teams

How do we quantify hallucinations without a massive labeling team?
Start with automated evals like RAGAS faithfulness and answer relevance on a small, curated set of golden questions. Augment with lightweight human review (50–100 samples/week). Publish pass rate as a metric (`ai_eval_pass_ratio`). Use heuristics (unsupported claims, out-of-context citations) as additional flags, but gate rollouts on the golden-set metrics.
Which tools should we pick: Prometheus/Grafana or Datadog?
If you already run Datadog, use it—tags and notebooks are handy. If you’re Kubernetes-native with SRE muscle, Prometheus/Grafana gives you control and lower cost. The important part is consistent labels and having both metrics and traces (OpenTelemetry) wired through the same IDs.
How do we detect drift in unstructured text?
Compute PSI on token/term buckets and monitor embedding centroid shift (cosine distance) relative to a rolling baseline. Evidently AI can produce drift reports you can convert into metrics. Alert on sustained deviation, not single spikes.
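Both checks fit in a few lines of numpy; a minimal sketch, assuming you keep a frozen baseline sample to compare against (the thresholds in the comment are rules of thumb, not gospel):

# drift.py – PSI over term buckets and embedding centroid shift
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    """Population Stability Index between two bucketed frequency distributions."""
    e = expected / expected.sum() + eps
    a = actual / actual.sum() + eps
    return float(np.sum((a - e) * np.log(a / e)))


def centroid_shift(baseline: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between mean embeddings of the baseline and current windows."""
    b, c = baseline.mean(axis=0), current.mean(axis=0)
    cos = float(np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c)))
    return 1.0 - cos

# Common rules of thumb: PSI above ~0.1 is worth a look, above ~0.25 is a real shift;
# alert on sustained deviation vs a rolling baseline, not single spikes.
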
What about privacy? Can we log prompts and outputs?
Redact PII at the edge (phone, email, SSN, customer IDs). Store only hashed identifiers. Keep a sampled, heavily redacted corpus for evals with strict access controls and TTL. Never log raw credentials or secrets; verify via automated checks.
How do we prevent runaway costs during incidents?
Track tokens and cost per request. Put a request-level budget cap and a per-tenant daily cap. Alert on change vs 7d baseline. Add circuit breakers that disable expensive fallbacks if SLOs aren’t improving.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get an AI observability health check
See how we wire SLOs for AI systems
