Ship GenAI Without Regret: The Evaluation Harness That Keeps Features Accountable

If your genAI feature can’t prove it’s grounded, fast, and safe at every stage, it has no business in production. Here’s the evaluation harness we deploy before, during, and after release.


The on-call page-of-shame you can avoid

I’ve watched a genAI “assistant” at a travel marketplace start inventing refund policies at 2 a.m. after a routine index rebuild. Support tickets spiked, legal woke up, and the root cause wasn’t the model—it was missing citations and a retriever that silently degraded. I’ve also seen a fintech copilot blow past p95 latency because someone “just bumped” the context window to 64k. Same story, different logo.

If you don’t design an evaluation harness from day one, you will end up firefighting hallucinations, drift, and latency spikes in prod. This is the harness we ship with clients at GitPlumbers so features stay accountable before, during, and after release.

Instrument the entire AI flow like it’s payments

If your telemetry starts and ends at the API gateway, you’re already blind. You need traces and metrics across every hop of the AI path:

  • Request: user intent, channel, experiment/flag, trace ID
  • Retrieval: query vector, top-k, source IDs, recall@k, staleness timestamp
  • Model: provider, model name/version, temperature, tokens (prompt/comp), p50/p95/p99 latency
  • Safety: moderation score, PII flags, refusal reason
  • Tools: which tools ran, duration, error types
  • Answer: groundedness score, citation coverage, final status (success/fallback/refusal)
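One way to pin that schema down is a single structured record per AI request. This is a sketch, not a standard — every field name here is illustrative, and your own schema should match whatever your tracing and log pipeline expect:

```python
from dataclasses import dataclass, field

@dataclass
class AICallRecord:
    """One structured record per AI request; field names are illustrative."""
    trace_id: str
    route: str
    intent: str
    model: str
    temperature: float
    tokens_prompt: int
    tokens_completion: int
    latency_ms: float
    retrieved_doc_ids: list = field(default_factory=list)
    recall_at_k: float = 0.0
    grounded: bool = True
    citation_coverage: float = 0.0
    moderation_flagged: bool = False
    final_status: str = "success"  # success | fallback | refusal

# Example record for one assistant call (values invented for illustration)
rec = AICallRecord(
    trace_id="abc123", route="/assist", intent="refund_policy",
    model="gpt-4o-mini", temperature=0.2,
    tokens_prompt=812, tokens_completion=143, latency_ms=1430.0,
    retrieved_doc_ids=["doc-17", "doc-42"], recall_at_k=0.9,
    grounded=True, citation_coverage=1.0,
)
```

Emit one of these per request and every downstream dashboard, eval, and alert has the same join keys.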

Use OpenTelemetry to stitch it. Here’s a minimal Python example that wraps a RAG call with spans and emits Prometheus counters you can alert on:

# instrumentation.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from prometheus_client import Counter, Histogram

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)

ai_latency = Histogram("ai_end_to_end_seconds", "AI request latency", ["model", "route"])
hallucinations = Counter("ai_hallucination_total", "Model produced ungrounded answer", ["model", "route"])

def record_ai_call(route, model, fn):
    with tracer.start_as_current_span("ai.request") as span:
        span.set_attribute("ai.model", model)
        with ai_latency.labels(model, route).time():
            result = fn()
        if not result.get("grounded", True):
            hallucinations.labels(model, route).inc()
            span.set_attribute("ai.hallucination", True)
        span.set_attribute("ai.tokens.prompt", result.get("tokens_prompt", 0))
        span.set_attribute("ai.tokens.completion", result.get("tokens_completion", 0))
        return result

Ship dashboards with breakdowns by route, model, and experiment. If you can’t answer “which prompt change increased ungrounded responses last week?” in < 60 seconds, you don’t have observability—just logs.
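For that "which change hurt us last week" question, a week-over-week PromQL breakdown is enough to start. This sketch assumes the metrics from the instrumentation above; the histogram's `_count` series doubles as a per-route request counter:

```promql
# Week-over-week ratio of ungrounded-answer rate, by model and route.
# > 1 means this week is worse than last week.
(
  sum by (model, route) (rate(ai_hallucination_total[1d]))
  / sum by (model, route) (rate(ai_end_to_end_seconds_count[1d]))
)
/
(
  sum by (model, route) (rate(ai_hallucination_total[1d] offset 1w))
  / sum by (model, route) (rate(ai_end_to_end_seconds_count[1d] offset 1w))
)
```

If you also stamp an experiment or prompt-version label on the counters, the same query pinpoints the exact change.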

Pre-release: make the model pass tests, not vibes

Golden datasets and automated scoring are non-negotiable. Your CI should fail builds if quality regresses, same as unit tests.

What to include in your golden sets:

  • Inputs: representative user prompts by segment and channel
  • Context: doc IDs/snippets you expect the retriever to pull
  • Expected: reference answer and acceptable variants
  • Citations: required sources your answer must use
  • Safety: adversarial prompts for jailbreaks, PII leaks, brand-unsafe content
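Concretely, a golden case is just a small record with those fields. The values here are invented for illustration; the field names line up with the pytest that follows by convention only:

```json
[
  {
    "input": "Can I get a refund if my flight is cancelled?",
    "expected": "Yes. Cancelled flights are refundable to the original payment method within 7 days.",
    "citations": ["policy/refunds#cancelled-flights", "help/payments"],
    "segment": "traveler",
    "channel": "web-chat"
  }
]
```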

A lean pytest that checks semantic similarity and citation coverage:

# tests/test_rag_eval.py
import json
import pytest
from sentence_transformers import SentenceTransformer, util

from app.rag import run_rag  # your orchestrator entrypoint (module path is illustrative)

model = SentenceTransformer("all-MiniLM-L6-v2")

with open("tests/golden.json") as f:
    GOLDEN = json.load(f)

SIM_THRESHOLD = 0.72
CITATION_MIN = 0.8  # fraction of expected citations present

@pytest.mark.parametrize("case", GOLDEN)
def test_rag(case):
    out = run_rag(case["input"])  # your orchestrator
    sim = float(util.cos_sim(model.encode(out["answer"]), model.encode(case["expected"])))
    used = set(out.get("citations", []))
    required = set(case.get("citations", []))
    coverage = len(used & required) / max(1, len(required))
    assert sim >= SIM_THRESHOLD, f"semantic sim too low: {sim}"
    assert coverage >= CITATION_MIN, f"missing citations: {used} vs {required}"

Wire this into CI. Store golden sets in Git, version them with your prompts and retriever params. For bigger teams, tools like LangSmith, TruLens, or Ragas help manage eval runs and dashboards. The point is the same: if quality slides, the pipeline blocks the merge.
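The CI wiring itself is a few lines. This sketch assumes GitHub Actions and a `requirements.txt`; swap in your own runner and dependency file:

```yaml
# .github/workflows/ai-evals.yml — illustrative; adapt to your CI system
name: ai-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.11"}
      - run: pip install -r requirements.txt
      - run: pytest tests/test_rag_eval.py --maxfail=1  # red build on quality regression
```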

During release: canary, shadow, and hard guardrails

You don’t flip a genAI feature 0→100. You ramp with guardrails that auto-stop when quality or latency goes sideways.

  • Feature flags: gate by cohort with LaunchDarkly/Unleash.
  • Shadow traffic: run the new pipeline in parallel and log its decisions only, until the metrics say “safe.”
  • Canary: progressive traffic with auto-rollback on Prometheus alerts via Argo Rollouts.
  • Budgets: p95 latency, error rate, and hallucination rate with explicit SLOs.

Example Argo Rollouts canary that stops if hallucinations exceed 3%:

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: genai-assistant
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 600}
        - setWeight: 25
        - pause: {duration: 1200}
        - setWeight: 50
      analysis:
        templates:
          - templateName: ai-quality
        startingStep: 0
  selector:
    matchLabels: {app: genai-assistant}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ai-quality
spec:
  metrics:
    - name: hallucination-rate
      interval: 1m
      count: 5
      failureLimit: 1
      successCondition: result[0] < 0.03
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(ai_hallucination_total[5m]))
            /
            sum(rate(http_requests_total{route="/assist"}[5m]))

Pair this with Istio timeouts and circuit breakers so the rest of your stack doesn’t drown when the LLM gets slow:

# istio-virtualservice.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: llm-gateway
spec:
  hosts: ["llm.internal"]
  http:
    - timeout: 3s
      retries:
        attempts: 2
        perTryTimeout: 1s
        retryOn: connect-failure,refused-stream,5xx,reset
      route:
        - destination: {host: llm-provider}

If p95 blows its budget or hallucinations tick up, rollout halts, traffic rolls back, PagerDuty rings. No meetings required.

After release: drift, feedback loops, and regression traps

Production is where models go to rot. Treat drift as a certainty, not a surprise.

  • Embedding/retrieval drift: monitor distribution shifts (cosine distance to reference centroids) and recall@k on synthetic queries. EvidentlyAI or WhyLabs are fine here.
  • Content drift: when your corpus changes (new price policies, SKU churn), detect stale docs and trigger index rebuilds with change budgets.
  • User feedback: thumbs up/down, refusal reasons, and freeform corrections. Close the loop: add hard negatives and re-run evals nightly.
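A back-of-the-napkin version of the centroid check looks like this. Real setups use EvidentlyAI/WhyLabs and proper statistics; the 0.15 threshold here is an assumption you should tune against your own embedding space:

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def embedding_drift(reference_embs, live_embs, threshold=0.15):
    """Flag drift when the live-traffic centroid moves too far from the reference centroid."""
    dist = cosine_distance(centroid(reference_embs), centroid(live_embs))
    return {"distance": dist, "drifted": dist > threshold}
```

Run it on a sliding window of production query embeddings against a frozen reference sample, and alert when `drifted` flips.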

Alert on quality deltas, not just infra. A PrometheusRule that pages on latency spikes and rising hallucinations:

# alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: genai-alerts
spec:
  groups:
    - name: ai-slo
      rules:
        - alert: AILatencyBudgetBreached
          expr: histogram_quantile(0.95, sum(rate(ai_end_to_end_seconds_bucket[5m])) by (le)) > 2.5
          for: 10m
          labels: {severity: page}
          annotations:
            summary: "AI p95 latency > 2.5s"
        - alert: AIHallucinationRateUp
          expr: (sum(rate(ai_hallucination_total[15m])) / sum(rate(http_requests_total{route="/assist"}[15m]))) > 0.03
          for: 10m
          labels: {severity: page}
          annotations:
            summary: "Hallucination rate above 3%"

Back this with a weekly regression job that runs your golden set plus fresh production samples through the harness and posts a diff to Slack. If yesterday’s prompt tweak shaved 200ms but tanked citation coverage, you should know before your CFO does.

Safety guardrails that fail closed

I’ve seen teams bolt on moderation after launch and wonder why brand risk went up. Bake guardrails into the flow and default to “safe fallback.”

  • Moderation: OpenAI Moderation or Perspective API at the edge; block or route to human.
  • PII: run Presidio detectors pre- and post-generation. Drop sensitive spans from logs.
  • Prompt injection/tool safety: validate tool arguments with pydantic and sandbox with tight timeouts and whitelists.
  • Fallbacks: when guardrails trip, degrade to deterministic flows (FAQ lookup, templates) and explain it to the user.

A minimal tool call with strict validation:

from pydantic import BaseModel, Field, ValidationError

class WeatherArgs(BaseModel):
    city: str = Field(min_length=2, max_length=80)
    units: str = Field(pattern=r"^(metric|imperial)$")  # pydantic v2; use regex= on v1

def call_weather_tool(args: dict):
    try:
        valid = WeatherArgs(**args)
    except ValidationError:
        return {"error": "invalid_args", "fallback": True}
    return fetch_weather_api(valid.city, valid.units, timeout=800)  # ms

And a “fail closed” response pattern:

// frontend.ts
if (resp.guardrailTripped) {
  showNotice("We switched to a verified answer because your question touched a sensitive topic.");
  render(fallbackAnswer);
} else {
  render(resp.answer);
}

SLOs, dashboards, and runbooks: treat AI like prod, because it is

Set SLOs that reflect user and business value—not just infra comfort.

  • Quality: hallucination rate < 3%, grounding score ≥ 0.8, citation coverage ≥ 0.8
  • Latency: p95 < 2.5s end-to-end, tool p95 < 800ms
  • Reliability: timeout/error < 1%, refusal rate within policy bands
  • Cost: tokens per answer within budget by route/segment

Dashboards to ship day one:

  • Quality panel: sim score distribution, grounding/citation trends, top failing intents
  • Latency panel: p50/p95 by stage (retrieval/model/tools), queueing time, retries
  • Safety panel: moderation blocks, PII flags, injection attempts
  • Cost panel: tokens/req, provider spend by route, cache hit rate

Runbooks to print and laminate:

  1. Rollback prompt/model
  2. Rebuild and validate the index
  3. Switch provider or model variant
  4. Disable tools and force deterministic fallback
  5. Incident comms template for support/legal

We’ve used this harness to stop a hallucination spike (12%→1.8% in 3 weeks) at a fintech by enforcing citation coverage in CI, adding an Istio timeout/circuit breaker, and canarying a retriever re-rank change behind LaunchDarkly.

What I’d do tomorrow if you gave me one sprint

  • Define the telemetry schema and wire OpenTelemetry → Prometheus/Grafana.
  • Stand up CI evals with a small golden set (50–200 cases) and block merges on regression.
  • Put the feature behind a flag, add Argo Rollouts canary + Istio timeouts.
  • Set SLOs and alerts for latency and hallucination rate; build one Grafana board.
  • Add PII detection and a safe fallback. Ship. Iterate weekly with regression jobs.

If it isn’t measured pre-release, guarded during rollout, and monitored for drift after, it’s not a product feature—it’s a demo.


Key takeaways

  • Instrument the entire AI flow (retrieval, model, tools) with OpenTelemetry and emit business-grade metrics: groundedness, refusal rate, token cost, and end-to-end latency.
  • Lock in pre-release accountability with golden datasets, deterministic test fixtures, and automated scoring (similarity + grounding + safety) in CI.
  • Release behind flags with canary and shadow traffic. Enforce SLO-aligned guardrails: circuit breakers, timeouts, and auto-rollback on quality regression.
  • Continuously detect drift in embeddings, retrieval recall, and policy safety signals. Treat hallucination rate and latency budgets as first-class SLOs.
  • Fail closed: when guardrails trigger, degrade gracefully to deterministic paths and explain the fallback to the user.

Implementation checklist

  • Define a telemetry schema: trace IDs, request IDs, user intent, retrieval docs, model params, tool calls, final answer.
  • Stand up OpenTelemetry across services; export to Prometheus/Grafana and your log lake (e.g., Loki/ELK).
  • Curate golden datasets with expected answers and citations; version them in Git and store in object storage.
  • Automate evals in CI with `pytest` + scoring utilities (semantic similarity, citation coverage, safety).
  • Gate releases behind `LaunchDarkly`/`Unleash`; use `Argo Rollouts` canaries and `Istio` circuit breakers.
  • Define AI SLOs: p95 latency, hallucination rate, grounding score, failure/timeout rate. Wire alerts to on-call.
  • Monitor drift with `EvidentlyAI`/`WhyLabs` and embedding distribution checks; retrain/re-index with change budgets.
  • Write runbooks for rollback, index rebuilds, prompt updates, and incident comms.

Questions we hear from teams

How do I measure hallucination objectively?
Use a mix: semantic similarity to a reference answer, citation coverage to required sources, and a grounding score (e.g., whether each claim is supported by retrieved docs). Start with thresholds (sim ≥ 0.72, citations ≥ 0.8) and tighten as you learn.
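A crude claim-support check you can start from looks like this. Token overlap stands in for a real similarity or entailment model, the 0.6 support threshold is an assumption, and all names are illustrative:

```python
def _tokens(text):
    return set(text.lower().split())

def overlap_sim(claim, doc):
    """Crude support score: fraction of the claim's tokens found in the doc."""
    ct = _tokens(claim)
    return len(ct & _tokens(doc)) / max(1, len(ct))

def grounding_score(answer_sentences, retrieved_docs, threshold=0.6):
    """Fraction of answer sentences supported by at least one retrieved doc."""
    if not answer_sentences:
        return 1.0
    supported = sum(
        1 for s in answer_sentences
        if max((overlap_sim(s, d) for d in retrieved_docs), default=0.0) >= threshold
    )
    return supported / len(answer_sentences)
```

Swap `overlap_sim` for an embedding or NLI-based scorer once the plumbing works; the per-sentence structure stays the same.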
What’s a reasonable SLO set for a first release?
Start with p95 latency ≤ 2.5s, hallucination rate ≤ 3%, error/timeout ≤ 1%, and grounding score ≥ 0.8. Adjust by route and user segment as you get data.
Do I need LangSmith/TruLens/Ragas, or can I homebrew?
You can start with `pytest` + `sentence-transformers` + Prometheus. Larger teams benefit from hosted eval dashboards, experiment tracking, and dataset management. The harness matters more than the brand.
How do I prevent cost blowouts?
Instrument tokens per route, set budgets, and use caching (semantic and response). Enforce context window caps and prefer smaller, faster models on low-risk paths; escalate only on confidence or user intent.
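A semantic cache can be sketched in a few lines. This is a toy in-memory version — the embedding function, threshold, and linear scan are all placeholders for a real vector store and a tuned similarity cutoff:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors; 0.0 for degenerate inputs."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

class SemanticCache:
    """Toy semantic cache: return a stored answer when a query embeds close enough."""
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold
        self.entries = []           # list of (vector, answer)

    def get(self, query):
        qv = self.embed(query)
        best, best_sim = None, 0.0
        for vec, answer in self.entries:   # linear scan; use a vector index in prod
            sim = cos_sim(qv, vec)
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Check the cache before calling the model, and only populate it with answers that passed your grounding checks — a cache full of hallucinations is worse than no cache.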
What about provider outages or model regressions?
Treat providers like any dependency. Use Istio/Envoy timeouts, retries, and circuit breakers. Keep at least one fallback model and a cached fast path. Canary provider/model changes behind flags with automatic rollback.
