Dashboards That Catch AI Model Degradation Before Users Do

If your first signal of a failing model is a Support ticket, you don’t have observability—you have vibes. Here’s how to instrument, guard, and visualize AI systems so drift, hallucinations, and latency spikes trip alarms hours before customers notice.


The incident you’ve already lived

You ship a slick RAG endpoint Friday night. By Monday, churn is up and CS is swamped. No 5xx spikes, infra green. What actually happened: your vector index silently drifted after a batch ingest, the retriever started returning stale chunks, and your model hallucinated with high confidence. Nobody saw it because the only dashboards were CPU and “requests per minute.”

I’ve watched this movie at three companies (fintech, B2B SaaS, and a healthtech). The pattern is the same: great infra dashboards, zero AI-specific signals. Fixable—if you treat AI like a distributed system with quality and safety SLOs, not a black box.

Instrument the AI path like a distributed system

Start with traces. If you can’t see prompt → retrieval → model → post-processing, you’re debugging blind.

  • Use OpenTelemetry end-to-end. It now ships GenAI semantic conventions; use them instead of inventing your own attribute names.
  • Create spans for retrieval, rerank, generate, moderation, and postprocess. Add attributes you’ll actually filter on in Grafana/Loki.
  • Record token counts, latency buckets, cache hits, hit/miss for retrieval, and finish reasons.

Example TypeScript snippet using @opentelemetry/api and prom-client:

import { trace, SpanKind } from '@opentelemetry/api';
import { Counter, Histogram, register } from 'prom-client';

const tracer = trace.getTracer('ai-gateway');

const latency = new Histogram({
  name: 'genai_inference_latency_seconds',
  help: 'LLM end-to-end latency',
  buckets: [0.2, 0.5, 1, 2, 4, 8],
  labelNames: ['route', 'model']
});

const outputTotal = new Counter({
  name: 'genai_output_total',
  help: 'Total responses from the model',
  labelNames: ['route', 'model']
});

const outputBad = new Counter({
  name: 'genai_output_bad_total',
  help: 'Responses failing evals/guardrails',
  labelNames: ['route', 'model', 'reason']
});

export async function generate(route: string, prompt: string) {
  const span = tracer.startSpan('genai.generate', { kind: SpanKind.CLIENT });
  const start = Date.now();
  try {
    span.setAttribute('gen_ai.system', 'openai');
    span.setAttribute('gen_ai.request.model', 'gpt-4o-2024-08-06');
    span.setAttribute('gen_ai.request.temperature', 0.2);
    // llmCall is a placeholder for your provider SDK wrapper
    const { text, usage, finish_reason } = await llmCall(prompt);

    span.setAttribute('gen_ai.usage.prompt_tokens', usage.prompt_tokens);
    span.setAttribute('gen_ai.usage.completion_tokens', usage.completion_tokens);
    span.setAttribute('gen_ai.response.finish_reason', finish_reason);

    outputTotal.labels({ route, model: 'gpt-4o-2024-08-06' }).inc();
    return text;
  } catch (e) {
    span.recordException(e as Error);
    throw e;
  } finally {
    const seconds = (Date.now() - start) / 1000;
    latency.labels({ route, model: 'gpt-4o-2024-08-06' }).observe(seconds);
    span.end();
  }
}

Pro tip: add a request_id to the trace and shove it into response headers. When CS pings you with a bad answer, you can pivot from ticket → trace → raw prompt/response (redacted) in seconds.
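
That pivot is a few lines of middleware; a minimal sketch, assuming an Express-style response object. The `x-request-id` header name and the UUID fallback are conventions, not requirements:

```typescript
import { randomUUID } from 'node:crypto';

// Attach the trace id (e.g. span.spanContext().traceId) to the response so a
// support ticket can be joined back to its trace. Falls back to a fresh UUID
// when no span is active.
export function attachRequestId(
  res: { setHeader(name: string, value: string): void },
  traceId?: string
): string {
  const requestId = traceId ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  return requestId;
}
```

Log the same id on the CS/ticketing side and the ticket → trace hop becomes a copy-paste.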

Track the right signals (hallucination, drift, latency, cost)

If you only graph success codes and avg latency, you’ll miss the real failure modes. Your top panels should be:

  • Quality/Evals
    • eval_pass_rate from synthetic golden prompts.
    • shadow_accept_rate comparing challenger vs champion.
    • refusal_rate and irrelevant_answer_rate from heuristic or LLM-as-judge.
  • Safety
    • Toxicity/PII/moderation flags from Azure Content Safety, OpenAI moderation, or AWS Bedrock Guardrails.
    • Prompt-injection detection counts (e.g., Rebuff, llama_guard).
  • Drift
    • Embedding distribution shift (cosine distance to baseline centroid).
    • PSI for prompt feature buckets (length, language, entity mix).
  • Latency
    • p50/p95/p99 for retrieval, rerank, model, post-process spans.
    • Tail timeouts and streaming first-token latency.
  • Cost & Throughput
    • Tokens per request, cache hit rate, RPS by route, provider errors by code.

PromQL you’ll actually use:

# Eval failure rate by model (5m)
sum by(model) (rate(genai_output_bad_total[5m]))
/
sum by(model) (rate(genai_output_total[5m]))

# p95 end-to-end latency by route
histogram_quantile(0.95, sum by(le, route) (rate(genai_inference_latency_seconds_bucket[5m])))

# Safety trip rate
sum(rate(genai_safety_block_total[5m])) by (rule, route)

If your dashboard can’t answer “Is quality down? Is it safety or drift? Which stage regressed?” in 60 seconds, it’s not operational.

Build guardrails and circuit breakers that fail safe

I’ve seen teams detect bad behavior and still torch user trust because they had no safe fallback. Don’t just alert—degrade gracefully.

  • Put a moderation step before generation for user input, and after generation for model output.
  • If risk is high, fallback to a safer path: lower temperature, retrieval-only answer, or a deterministic template.
  • Use a circuit breaker that opens when risk scores or eval failures cross a threshold; close after a cooldown.

Example using opossum (Node) and OpenAI moderation:

import CircuitBreaker from 'opossum';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function moderatedGenerate(input: string) {
  const risk = await openai.moderations.create({ model: 'omni-moderation-latest', input });
  if (risk.results[0].flagged) {
    // safe fallback
    return "Sorry, I can’t help with that. Here are safe resources...";
  }
  return await generate('chat', input);
}

const breaker = new CircuitBreaker(moderatedGenerate, { timeout: 8000, errorThresholdPercentage: 25, resetTimeout: 30000 });

// Callers go through breaker.fire(input), not moderatedGenerate directly.
breaker.on('open', () => outputBad.labels({ route: 'chat', model: 'gpt-4o-2024-08-06', reason: 'circuit_open' }).inc());

For enterprise, NeMo Guardrails or Bedrock Guardrails give you policy configuration. Keep policies in Git and roll them out via GitOps like any code.

Detect drift before customers do

Data shifts long before incidents show up.

  • Track embedding centroid shift for your corpus and recent queries.
  • Compute PSI (Population Stability Index) on prompt features (length, entities, domain).
  • Run a synthetic eval job every 5 minutes against a stable “golden set” and feed metrics to Prometheus.

Tiny Python job for PSI and embedding drift (pushes to Prometheus Pushgateway):

# requirements: numpy, prometheus_client, sentence-transformers
import numpy as np
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from sentence_transformers import SentenceTransformer

BASE_CENTROID = np.load('baseline_centroid.npy')
BASE_HIST = np.load('baseline_prompt_len_hist.npy')  # 10-bin proportion histogram (sums to 1)
model = SentenceTransformer('all-MiniLM-L6-v2')

recent_prompts = load_last_1h_prompts()  # your own loader: prompts from the last hour
emb = model.encode(recent_prompts, normalize_embeddings=True)
centroid = emb.mean(axis=0)

centroid_shift = 1 - np.dot(BASE_CENTROID, centroid)  # cosine distance

lens = np.array([len(p) for p in recent_prompts])
counts, _ = np.histogram(lens, bins=10, range=(0, 2000))
hist = counts / counts.sum()  # proportions, not densities, so PSI is bin-width independent
psi = np.sum((hist - BASE_HIST) * np.log((hist + 1e-9) / (BASE_HIST + 1e-9)))

reg = CollectorRegistry()
Gauge('genai_embedding_centroid_shift', 'Cosine distance to baseline', registry=reg).set(float(centroid_shift))
Gauge('genai_prompt_len_psi', 'Population Stability Index for prompt length', registry=reg).set(float(psi))

push_to_gateway('http://pushgateway:9091', job='ai-drift', registry=reg)

Set alerts on sustained drift, not blips. Pair this with retriever metrics like retrieval_hit_rate and rerank_mrr so you know if RAG quality degraded.
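
If you don't already compute rerank_mrr, it's a few lines; a sketch, assuming you log the 1-based rank of the first relevant chunk per query (0 when nothing relevant came back):

```typescript
// Mean reciprocal rank over a batch of queries. Feed the result to a gauge
// such as rerank_mrr; a sustained drop usually means the reranker or the
// index has degraded.
export function meanReciprocalRank(firstRelevantRanks: number[]): number {
  if (firstRelevantRanks.length === 0) return 0;
  const sum = firstRelevantRanks.reduce(
    (acc, rank) => acc + (rank > 0 ? 1 / rank : 0),
    0
  );
  return sum / firstRelevantRanks.length;
}
```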

Dashboards ops will actually use

Build one Grafana page per AI route with four rows: Quality, Safety, Latency, Cost. Each panel answers a decision question.

  • Quality
    • Eval pass rate over time (with deployment annotations).
    • Winner graph: challenger vs champion answer accept rate.
  • Safety
    • Moderation block rate by rule.
    • Prompt-injection triggers by source.
  • Latency
    • p50/p95/p99 per stage (retrieval → generate → post).
    • First-token latency and timeout count.
  • Cost/Throughput
    • Tokens/request and spend/hour by provider.
    • Cache hit rate (e.g., redis_clip_cache_hit_ratio).

Grafana annotation for deploys (so you can correlate regressions):

curl -X POST https://grafana/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "tags": ["deploy", "ai-gateway"],
    "text": "Rolled out model=2024-10-01",
    "time": '"$(date +%s%3N)"'
  }'

Pro tip: pin a table of the top 20 worst traces by gen_ai.response.finish_reason and route so on-calls can click straight into high pain.

Canary, shadow, and auto-revert like you mean it

Stop promoting models by “vibe check.” Bake analysis into rollout.

  • Use shadow traffic to feed the challenger for a week. Compare evals and guardrail trigger rates.
  • Promote with Argo Rollouts using an analysis template that queries Prometheus. Auto-rollback on quality or latency regressions.

Minimal Rollouts analysis template gating on eval pass rate and latency:

# argo-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: genai-quality-gate
spec:
  metrics:
  - name: eval-pass-rate
    interval: 1m
    successCondition: result[0] >= 0.92
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          1 - (sum(rate(genai_output_bad_total{route="chat"}[5m])) / sum(rate(genai_output_total{route="chat"}[5m])))
  - name: p95-latency
    interval: 1m
    successCondition: result[0] < 1.5
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.95, sum by(le) (rate(genai_inference_latency_seconds_bucket{route="chat"}[5m])))

Wire that into a canary step. If quality drops or latency jumps, Rollouts will halt and revert without a 2 a.m. Slack war room.

What “good” looks like in production

When we roll this out for clients, the curve looks the same: day 1, chaos; day 7, visibility; day 21, boring.

  • SLOs: 95% eval pass, p95 < 1.5s, safety block rate < 0.5%, drift metrics steady.
  • Alerting: page on 30-min sustained quality drop > 3% or p99 > 3s; Slack for early warnings.
  • Outcomes: MTTR down 60–80%, bad-answer tickets down 40%+, and model rollouts become Tuesday routines, not Friday night nightmares.
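
Those paging thresholds translate directly into Prometheus alert rules; a sketch with illustrative names and numbers, to be tuned to your SLOs:

```yaml
groups:
- name: genai-slo
  rules:
  - alert: GenAIQualityDrop
    # Page on a sustained eval failure rate above 3%, not a one-off spike.
    expr: |
      sum(rate(genai_output_bad_total[30m])) / sum(rate(genai_output_total[30m])) > 0.03
    for: 30m
    labels:
      severity: page
  - alert: GenAITailLatency
    expr: |
      histogram_quantile(0.99, sum by(le) (rate(genai_inference_latency_seconds_bucket[5m]))) > 3
    for: 10m
    labels:
      severity: page
```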

If your graphs surface changes before users feel them—and your systems degrade safely when they do—you’ve turned AI from a liability into an SRE-friendly component.

Key takeaways

  • Instrument the full AI request path with OpenTelemetry, including prompt, model, retrieval, and post-processing spans.
  • Track quality, safety, and drift as first-class metrics—not just 200s and average latency.
  • Use evals and shadow traffic to create objective signals of hallucination/accuracy before GA.
  • Add circuit breakers and safe fallbacks that trigger on risk scores and drift thresholds.
  • Design Grafana dashboards around decisions: promote, rollback, degrade, or page.
  • Automate analysis with Argo Rollouts or Flagger and gate promotions on SLOs, not hunches.

Implementation checklist

  • Adopt OpenTelemetry for AI-specific spans and attributes (model, temp, token counts, retrieval hit rate).
  • Export Prometheus metrics for eval pass rate, safety events, drift, latency histograms, and cost.
  • Stand up a synthetic eval job that runs golden prompts every 5 minutes and reports metrics.
  • Implement safety guardrails (toxicity, PII, prompt-injection) with a circuit-breaker fallback.
  • Build Grafana dashboards with quality, safety, latency, and cost on a single page; add red/green annotations for deploys.
  • Use Argo Rollouts analysis templates to canary new models/routes with auto-rollback.
  • Alert on error budget burn and trend, not one-off spikes; page on p99 latency and sustained quality drops.

Questions we hear from teams

What’s the minimum viable AI observability stack?
OpenTelemetry traces for each stage (retrieval, generate, moderation), Prometheus metrics for quality/safety/latency/cost, Grafana dashboards, and a daily drift job that pushes centroid and PSI metrics. Add a 5-minute synthetic eval to catch regressions between traffic spikes.
How do I measure hallucinations objectively?
Start with a golden test set and LLM-as-judge scoring (exact-match, factuality via retrieval overlap, refusal correctness). Track eval pass rate and a separate irrelevant/unsupported claim rate. Layer in shadow traffic comparisons for challengers.
Won’t guardrails nuke latency?
Inline moderation adds tens of milliseconds for most providers. Make it concurrent with retrieval when possible, and only run heavy checks (e.g., injection classifiers) on risky intents. Measure p95 before and after; if it pushes you over SLO, switch to sampled checks plus a circuit breaker.
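
Running the input check concurrently with retrieval is one Promise.all; a sketch where moderate() and retrieve() stand in for your own calls:

```typescript
// Run input moderation and retrieval in parallel so the safety check adds
// roughly zero wall-clock latency on the happy path; discard the retrieval
// result if the input is flagged.
export async function moderatedRetrieve(
  query: string,
  moderate: (q: string) => Promise<boolean>, // resolves true when flagged
  retrieve: (q: string) => Promise<string[]>
): Promise<string[]> {
  const [flagged, chunks] = await Promise.all([moderate(query), retrieve(query)]);
  if (flagged) throw new Error('blocked_by_moderation');
  return chunks;
}
```
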
How do I avoid leaking PII in traces/logs?
Redact at ingestion. Use a sanitizer middleware that replaces emails/phones/SSNs with hashes before exporting spans/logs. Store raw prompts encrypted and reference by ID in traces. Verify your OpenTelemetry exporter isn’t shipping raw bodies.
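
A minimal sketch of that sanitizer, using fixed placeholder tokens rather than hashes for brevity; the regexes are illustrative, and real PII detection needs more than two patterns:

```typescript
// Replace obvious PII with placeholder tokens before spans/logs are exported.
// Covers emails and US SSNs only; extend for phones, cards, addresses, etc.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const SSN = /\b\d{3}-\d{2}-\d{4}\b/g;

export function redact(text: string): string {
  return text.replace(EMAIL, '[email]').replace(SSN, '[ssn]');
}
```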
