The Eval Harness That Keeps Your Gen Features Honest—Before, During, and After Release

Stop praying your AI features behave in prod. Instrument them like any other critical system and make them prove it every day.


The meltdown I still think about

A consumer fintech rolled out an “Explain my fee” feature powered by an LLM. The demo looked great. In prod, the model hallucinated policy details when the backend API timed out, then doubled down with confident prose. Support tickets spiked 8x, legal got nervous, and we had to kill the feature in 72 hours. The root cause wasn’t the model—it was the lack of an evaluation harness and runtime guardrails. No golden tasks, no shadowing, no canary metrics, no output validation. We were flying VFR into clouds.

If you’re shipping generative features without first-class eval + observability, you’re not doing engineering—you’re doing theater.

What an evaluation harness actually is

An evaluation harness is the glue between your model and your release process. It measures quality and safety against concrete expectations, from PR to canary to prod. The harness isn’t a single tool; it’s a system:

  • Offline evals: golden datasets, red-teaming, reproducible scores in CI.
  • Runtime observability: traces and logs for prompts, retrieval, tool calls, and outputs.
  • Guardrails: schema validation, policy checks, timeouts, content filters.
  • Release controls: feature flags, canaries, shadow traffic, automatic rollbacks.
  • Continuous checks: drift detection and reference tasks in prod.

Think of it like SRE for AI flows. Same discipline, different failure modes: hallucination, drift, latency spikes, cost blowups, and provider regressions.

Before release: make it prove itself in CI

You need more than “the examples look good.” Bake evals into CI with fail gates tied to thresholds that actually matter.

  1. Curate a golden set tied to business outcomes

    • For support assistants: real anonymized tickets + expected resolutions.
    • For RAG: questions with ground-truth citations.
    • For data extraction: schemas with strict types and edge cases.
  2. Automate scoring with multiple signals

    • Factuality/grounding: Ragas, TruLens, or a lightweight promptfoo suite.
    • Schema correctness: JSON schema validation or guardrails with Pydantic.
    • Toxicity/PII: Azure AI Content Safety or AWS Comprehend flags.
  3. Fail the build if thresholds aren’t met

    • Don’t ship if hallucination rate > X%, schema errors > Y%, or cost/req > Z.

Example: promptfoo + CI gate for a retrieval feature:

# .promptfoo/promptfooconfig.yaml
prompts:
  - |-
    Answer using only the provided context and cite sources in the form [source: <doc-id>].
    Context: {{context}}
    Question: {{question}}
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0.1
      seed: 42
tests:
  - vars:
      question: 'What is the refund window?' 
      context: 'Policy: Refunds allowed within 30 days, exceptions...' 
    assert:
      - type: similar
        threshold: 0.8
        value: 'Refunds allowed within 30 days'
      - type: contains
        value: '[source:'
  - vars:
      question: 'When do refunds apply for clearance items?'
      context: 'Policy: Clearance items non-refundable unless defective.'
    assert:
      - type: llm-rubric
        threshold: 0.7
        value: 'Must explicitly state clearance items are non-refundable unless defective and include citation'

Add a simple JSONL golden set for extraction with exact-match checks:

{"input": "Invoice 123 total $42.50 due 2024-10-01", "expected": {"invoice_id":"123","total":42.5,"due":"2024-10-01"}}

Run this in CI and fail if pass rate < 95%.
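
A minimal CI gate over that golden set can be a short script that exits nonzero below the threshold. A sketch, where extract() stands in for your extraction chain and golden_extraction.jsonl is a placeholder path:

# ci_gate.py, a sketch: fail the build if the extraction golden set passes < 95%
import json
import sys

PASS_RATE_THRESHOLD = 0.95

def extract(text: str) -> dict:
    """Hypothetical: call your extraction prompt/chain and return parsed JSON."""
    raise NotImplementedError

def main(path: str = "golden_extraction.jsonl") -> None:
    with open(path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    passed = sum(1 for c in cases if extract(c["input"]) == c["expected"])
    rate = passed / len(cases)
    print(f"golden-set pass rate: {rate:.1%} ({passed}/{len(cases)})")
    if rate < PASS_RATE_THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job

if __name__ == "__main__":
    main()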

During rollout: instrument, canary, shadow, and auto-rollback

Treat LLM calls like RPCs that can and will fail. Wire traces, introduce flags, and make canaries earn their keep.

  • Feature flag the surface: LaunchDarkly/Unleash gate the UI and the inference path separately, so you can disable generation while keeping the rest of the flow hot.
  • Shadow traffic: Feed read-only copies of real requests to the new model/prompt to gather scores with zero blast radius (see the sketch after this list).
  • Canary with real KPIs: Use Argo Rollouts or Flagger; tie progression to metrics and eval scores, not vibes.
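
Shadow traffic doesn't need special infrastructure: schedule a fire-and-forget call to the candidate prompt/model next to the real one and record its scores. A minimal sketch, assuming async candidate_generate, grade, and record helpers (all hypothetical, app-specific):

# shadow.py, a sketch: score the candidate path off the request path with zero blast radius
import asyncio

async def shadow(question: str, docs: list[str], candidate_generate, grade, record) -> None:
    try:
        answer = await candidate_generate(question, docs)   # new prompt/model under test
        await record("shadow_grounded_score", await grade(answer, docs))
    except Exception:
        await record("shadow_error", 1.0)  # candidate failures are data, not user-facing incidents

# On the live request path, schedule it and move on; never await it:
# asyncio.create_task(shadow(question, docs, candidate_generate, grade, record))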

Argo Rollouts with Prometheus guard:

# rollouts-canary.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: gen-assistant
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: llm-quality
      stableService: gen-assistant-stable
      canaryService: gen-assistant-canary
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: llm-quality
spec:
  metrics:
    - name: hallucination_rate
      interval: 2m
      successCondition: result[0] < 0.05
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(llm_output_hallucination_total{service="gen-assistant"}[5m]))
            /
            sum(rate(llm_requests_total{service="gen-assistant"}[5m]))
    - name: p95_latency
      interval: 2m
      successCondition: result[0] < 1500
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: histogram_quantile(0.95, sum(rate(llm_latency_ms_bucket{service="gen-assistant"}[5m])) by (le))
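
Those queries assume the service exports llm_requests_total, llm_output_hallucination_total, and an llm_latency_ms histogram. A minimal sketch of emitting them with prometheus_client (bucket boundaries are illustrative):

# metrics.py, a sketch of the llm_* metrics the AnalysisTemplate queries expect
from prometheus_client import Counter, Histogram

LLM_REQUESTS = Counter("llm_requests_total", "LLM requests", ["service"])
LLM_HALLUCINATIONS = Counter("llm_output_hallucination_total",
                             "Outputs flagged as ungrounded", ["service"])
LLM_LATENCY = Histogram("llm_latency_ms", "End-to-end LLM latency in ms", ["service"],
                        buckets=(100, 250, 500, 1000, 1500, 3000, 5000))

def record_llm_call(service: str, latency_ms: float, hallucinated: bool) -> None:
    LLM_REQUESTS.labels(service=service).inc()
    LLM_LATENCY.labels(service=service).observe(latency_ms)
    if hallucinated:
        LLM_HALLUCINATIONS.labels(service=service).inc()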

Wire traces with OpenTelemetry so you can see retrieval, model, and tool calls in one span tree.

// express-middleware.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';
// hash, retrieve, callLLM, and grade are app-specific helpers imported elsewhere.

export async function llmTrace(req: Request, res: Response, next: NextFunction) {
  const tracer = trace.getTracer('gen-assistant');
  await tracer.startActiveSpan('llm.request', async span => {
    try {
      span.setAttribute('model', req.body.model);
      span.setAttribute('prompt_hash', hash(req.body.prompt));
      const start = Date.now();

      const retrievalSpan = tracer.startSpan('retrieval');
      const docs = await retrieve(req.body.query);
      retrievalSpan.end();

      const llmSpan = tracer.startSpan('inference');
      const out = await callLLM(req.body.prompt, docs);
      llmSpan.setAttribute('tokens_in', out.usage?.prompt_tokens || 0);
      llmSpan.setAttribute('tokens_out', out.usage?.completion_tokens || 0);
      llmSpan.end();

      const evalSpan = tracer.startSpan('self_eval');
      const score = await grade(out.text, docs);
      evalSpan.setAttribute('grounded_score', score);
      evalSpan.end();

      span.setAttribute('latency_ms', Date.now() - start);
      span.setAttribute('grounded_score', score);
      res.locals.llm = { out, score };
      span.end();
      next();
    } catch (e) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
      span.end();
      next(e);
    }
  });
}

Latency spikes? Add circuit breaker + backoff and a semantic cache:

# Envoy cluster snippet (circuit breaker)
clusters:
  - name: llm-provider
    circuit_breakers:
      thresholds:
        - max_connections: 100
          max_pending_requests: 1000
          max_retries: 3

// naive semantic cache key (exact-match only; a true semantic cache does a
// nearest-neighbor lookup over the query embedding instead)
const key = hash(embeddings(query) + model + promptTemplateVersion);
const cached = await cache.get(key);
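
For the backoff half, a bounded retry with full jitter keeps a slow provider from stacking up work. A sketch, where call_provider is a hypothetical wrapper around your LLM client:

# backoff.py, a sketch: bounded retries with exponential backoff and full jitter
import random
import time

def with_backoff(call_provider, attempts: int = 3, base: float = 0.5, cap: float = 4.0):
    for attempt in range(attempts):
        try:
            return call_provider()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # let the circuit breaker and fallback path take over
            time.sleep(min(cap, base * 2 ** attempt) * random.random())  # full jitter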

If p95 breaches SLO or hallucination rate rises, the rollout should halt and auto-rollback. Don’t let dashboards be advisory; make them the gate.

After release: continuous validation and drift detection

Models, prompts, embeddings, and providers drift. Your users drift, too. Keep score in prod.

  • Run low-volume reference tasks hourly against the live stack and track score deltas (see the sketch after this list).
  • Monitor input distribution (embedding centroid shift) and output distribution (schema error rate, sentiment, toxicity).
  • Track provider version and prompt hash in every log line; correlate incidents with changes.
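
For the hourly reference tasks, one option is a small job that pushes the pass rate to a Prometheus Pushgateway so it lands in the llm_reference_pass_rate series used by the alert below. A sketch, assuming a Pushgateway at pushgateway:9091 and a run_reference_tasks() helper you already have:

# reference_job.py, a sketch: export reference-task results for alerting
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_reference_tasks() -> float:
    """Hypothetical: run the golden set against the live stack, return pass rate 0..1."""
    raise NotImplementedError

registry = CollectorRegistry()
gauge = Gauge("llm_reference_pass_rate", "Pass rate on reference tasks", registry=registry)
gauge.set(run_reference_tasks())
push_to_gateway("pushgateway:9091", job="llm_reference_tasks", registry=registry)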

Prometheus + Grafana are fine. For drift, you’ll likely need a small job computing stats and exporting metrics:

# drift_job.py
import numpy as np
from scipy.stats import ks_2samp

# load last-7d embedding sample vs last-1h
last_7d = np.load('embeddings_7d.npy')
last_1h = np.load('embeddings_1h.npy')

# crude 1-D proxy: pool all embedding dimensions and compare the two samples
stat, p = ks_2samp(last_7d.flatten(), last_1h.flatten())
print(f"llm_embedding_drift_ks {stat}")
print(f"llm_embedding_drift_p {p}")

Alert when the KS statistic crosses a threshold and you see quality drop on reference tasks. Trigger an auto-remediation: revert prompt, pin model version, or route to fallback provider.

# Prometheus rule
- alert: LLMDriftSuspected
  expr: llm_reference_pass_rate < 0.9 and llm_embedding_drift_ks > 0.2
  for: 15m
  labels:
    severity: page
  annotations:
    description: Reference tasks failing with embedding drift; revert prompt or pin model.
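
To make the auto-remediation concrete, one option is an Alertmanager webhook receiver that pins the last known-good model and prompt when this alert fires. A sketch, assuming Alertmanager's webhook receiver posts to /alerts and that pin_model_version and revert_prompt write to your own config store (both hypothetical):

# remediation_webhook.py, a sketch of acting on LLMDriftSuspected automatically
from fastapi import FastAPI, Request

app = FastAPI()

def pin_model_version(version: str) -> None:
    """Hypothetical: write a pinned model version to your config store / feature flags."""
    print(f"pinning model to {version}")

def revert_prompt(prompt_hash: str) -> None:
    """Hypothetical: roll the prompt template back to a previous hash."""
    print(f"reverting prompt to {prompt_hash}")

@app.post("/alerts")
async def handle_alerts(request: Request):
    payload = await request.json()
    for alert in payload.get("alerts", []):
        if alert.get("status") == "firing" and \
           alert.get("labels", {}).get("alertname") == "LLMDriftSuspected":
            pin_model_version("last-known-good")  # look this up from your release log
            revert_prompt("previous")             # previous prompt hash from your structured logs
    return {"status": "ok"}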

Guardrails that fail safe, not loud

I’ve seen teams trust “temperature 0” as their safety plan. Don’t. Put hard edges around outputs.

  • Schema enforcement for tool use and extraction

# pydantic schema guardrail
from pydantic import BaseModel, condecimal

class Invoice(BaseModel):
    invoice_id: str
    total: condecimal(gt=0)
    due: str

# llm_output_json is the raw JSON string returned by the model; a ValidationError
# here should fail closed, never ship the unvalidated text
parsed = Invoice.model_validate_json(llm_output_json)
  • Grounding and citations for RAG

    • Reject answers without citations: if not contains('[source:') -> block.
    • Re-ask with higher retrieval depth if grounding score < threshold (see the sketch after this list).
  • Content safety and PII

# OPA policy (Rego) to block PII
package ai.output

violation[msg] {
  re_match(`(?i)ssn|social security`, input.text)
  msg := "PII detected in output"
}
  • Tooling timeouts and retries with idempotency keys. Don’t let a calculator or vector store hang your span and nuke UX.
  • Provider sandboxing: pin versions and validate outputs; fail closed to a safe canned response if validation fails.
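
A minimal sketch of the re-ask-then-fail-closed loop from the grounding bullet above; retrieve, generate, and grade are app-specific callables passed in, and the fallback copy is yours to write:

# guarded_generate.py, a sketch: cite-or-retry, then fail closed
from typing import Callable

FALLBACK = "I can't answer that reliably right now; routing you to a human."

def guarded_generate(question: str,
                     retrieve: Callable[[str, int], list[str]],
                     generate: Callable[[str, list[str]], str],
                     grade: Callable[[str, list[str]], float],
                     min_grounding: float = 0.8) -> str:
    for depth in (8, 16):                      # second pass re-asks with deeper retrieval
        docs = retrieve(question, depth)
        answer = generate(question, docs)
        if "[source:" not in answer:           # reject uncited answers outright
            continue
        if grade(answer, docs) >= min_grounding:
            return answer
    return FALLBACK                            # fail closed to a safe canned response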

If you can’t parse it, you can’t trust it. If you can’t trace it, you can’t debug it.

The reference architecture: what we actually ship

We’ve implemented this harness at banks, marketplaces, and SaaS teams. The pattern holds:

  • Inference layer: FastAPI/Express with OpenTelemetry, emitting llm_* metrics.
  • Retrieval: pgvector or Milvus, with query/result logs for eval.
  • Evaluators: promptfoo for CI, Ragas/TruLens for grounding at runtime.
  • Feature flags: LaunchDarkly or Unleash.
  • Rollouts: Argo Rollouts + Prometheus AnalysisTemplates; GitOps with ArgoCD.
  • Safety: OPA for policy, Pydantic schemas, provider content filters.
  • Observability: Prometheus + Grafana + Tempo/Jaeger traces.

Minimal structured logging (add to every response):

{
  "request_id": "b0c6...",
  "user_segment": "beta",
  "provider": "openai",
  "model": "gpt-4o-mini-2024-09-12",
  "prompt_hash": "9f12...",
  "retrieval_docs": 8,
  "latency_ms": 742,
  "tokens_in": 356,
  "tokens_out": 182,
  "cost_usd": 0.0031,
  "grounded_score": 0.83,
  "schema_valid": true,
  "hallucination": false
}

Hook this into a Grafana dashboard with panels for p50/p95 latency, cost per request, pass rate, schema error rate, and hallucination rate. The first time you catch a provider regression at rollout step 1/5, you’ll wonder how you ever shipped without it.

What to measure (and what to do when it breaks)

Quality and safety SLOs should be first-class:

  • Latency: p95 under 1.5s for assistive UI; under 3s for retrieval+generation; alert on burn rate.
  • Quality: reference task pass rate ≥ 95%; grounded score ≥ 0.8; schema error rate < 1%.
  • Cost: average cost/request under target; rate-limit bursts to protect against bill shock.
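
Cost per request is just arithmetic over the token counts you are already logging. A sketch with placeholder prices (pull real rates from your provider contract):

# cost.py, a sketch: per-request cost from token usage; prices are placeholders
PRICES_PER_1K_TOKENS = {
    "gpt-4o-mini": {"in": 0.00015, "out": 0.0006},  # assumption: substitute your negotiated rates
}

def cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    price = PRICES_PER_1K_TOKENS[model]
    return tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]

# e.g. cost_usd("gpt-4o-mini", tokens_in=356, tokens_out=182), logged as cost_usd per request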

When it breaks:

  • Hallucination spike → raise retrieval depth, enforce citation, fail closed to fallback answer.
  • Latency spike → switch provider region, reduce max output tokens, trip circuit breaker, fall back to cache.
  • Drift detected → pin model, roll back prompt hash, schedule re-index or retrain; open incident with owner and playbook.
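
For the latency playbook, the provider fallback can be a thin wrapper around two pinned clients. A sketch, with call_primary and call_fallback as hypothetical client functions:

# provider_fallback.py, a sketch: route to a pinned fallback when the primary misbehaves
def generate_with_fallback(prompt: str, call_primary, call_fallback, timeout_s: float = 2.0) -> str:
    try:
        return call_primary(prompt, timeout=timeout_s)    # pinned primary model version
    except (TimeoutError, ConnectionError):
        # log the degradation, then serve from the pinned fallback provider
        return call_fallback(prompt, timeout=timeout_s)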

This isn’t theoretical. We cut MTTR for a marketplace support bot from 3 hours to 20 minutes by correlating hallucination rate with a misconfigured retriever after an index change. The harness told us where to look.


Key takeaways

  • Bake an evaluation harness into your CI/CD and runtime. Don’t bolt it on after a PR disaster.
  • Track both product metrics (task success, factuality) and system metrics (latency, cost, error rate) with the same rigor as any SLO.
  • Gate rollouts with canary + shadow traffic and automatic rollback based on eval scores and golden tasks.
  • Detect drift with input/output distribution checks and reference tasks; retrain or re-prompt on explicit triggers.
  • Enforce guardrails at the edges: schema validation, PII blockers, tool-timeouts, and content filters—fail closed when necessary.

Implementation checklist

  • Define golden tasks and acceptance thresholds tied to business KPIs.
  • Add structured logging for every LLM call: prompt hash, model/version, latency, tokens, cost, eval scores.
  • Wire OpenTelemetry traces across the AI flow (retrieval, synthesis, tools).
  • Introduce canary and shadow deployments with automatic rollback using Argo Rollouts/Flagger.
  • Continuously run reference evals in prod (low-volume) to detect drift.
  • Add output guardrails: JSON schema validation, content safety, PII scrubbing, grounding citations.
  • Set SLOs for latency and quality; page on breach and budget your error-budget burn accordingly.

Questions we hear from teams

What tools should I start with if I have nothing today?
Start simple: promptfoo in CI for golden sets; OpenTelemetry for traces; Prometheus + Grafana for metrics; LaunchDarkly for gating; Argo Rollouts for canary. Add Ragas/TruLens for grounding scores when you introduce RAG. Layer on OPA and Pydantic for guardrails.
How do I pick thresholds for pass/fail?
Back-solve from business KPIs. If a 2% hallucination rate leads to 10% ticket escalations, your threshold is below 2%. Shadow for a week to calibrate baselines, then set conservative gates for canary and tighten after you see stability.
Won’t all this slow us down?
It speeds you up after the second incident you avoid. We’ve seen teams ship weekly instead of quarterly because the harness catches regressions early and rollbacks are automatic. Your on-call sleeps better, too.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Talk to an engineer about your eval harness
  • See our AI Observability reference stack
