The Evaluation Harness That Keeps GenAI Honest—Before, During, and After Release

If your LLM feature ships without an evaluation harness, you’re not launching a product; you’re running an experiment on your users. Here are the instrumentation and guardrails that actually work in production.

> You don’t control the model. You control the harness. Invest there.

Key takeaways

  • Ship an evaluation harness, not just a model: pre-release tests, during-release gates, post-release observability.
  • Instrument LLM calls with OpenTelemetry and export metrics to Prometheus/Grafana; trace prompt and model versions.
  • Define SLOs for hallucination rate, latency, and cost; enforce canary gates via Argo Rollouts and circuit breakers.
  • Continuously evaluate with golden sets and adversarial tests; automate drift detection on embeddings and content.
  • Use guardrails (schema validation, content filters, policy checks) and deterministic fallbacks when violations occur.

Implementation checklist

  • Tag every LLM call with prompt_id, prompt_version, model, temperature, and user/session IDs (see the instrumentation sketch after this checklist).
  • Stand up OpenTelemetry + Prometheus + Grafana for latency, error rate, token/cost, and guardrail violation metrics.
  • Create golden datasets and adversarial prompts; run pre-merge evals with RAGAS/TruLens/Promptfoo.
  • Release with shadow traffic then canary; gate promotions with Prometheus AnalysisTemplates in Argo Rollouts.
  • Implement circuit breakers and kill switches via feature flags; define fallback paths that degrade gracefully (a circuit-breaker sketch follows this checklist).
  • Run post-release drift checks with Evidently; alert on embedding and response distribution shifts (a simplified drift check is sketched after this checklist).
  • Version prompts and retrieval indexes; store traces and outputs for replay and root cause analysis.
  • Publish an LLM scorecard weekly with SLOs, incidents, costs, and top failing prompts.
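
A minimal sketch of the tagging and metrics items above, using the OpenTelemetry Python API. `call_model` stands in for your provider SDK (assumed here to return a dict with a `total_tokens` field), and the span attribute and metric names are illustrative rather than an official convention; the point is that every call emits the same ones.

```python
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("llm.harness")
meter = metrics.get_meter("llm.harness")

# Instruments exported to Prometheus via the OTel Collector.
llm_latency_ms = meter.create_histogram("llm.request.duration", unit="ms")
llm_tokens = meter.create_counter("llm.tokens.total")


def traced_llm_call(call_model, *, prompt_id, prompt_version, model,
                    temperature, user_id, messages):
    """Wrap a provider call so every request carries the same span attributes."""
    with tracer.start_as_current_span("llm.call") as span:
        # Attribute names are illustrative; consistency across services is what matters.
        span.set_attribute("llm.prompt_id", prompt_id)
        span.set_attribute("llm.prompt_version", prompt_version)
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.temperature", temperature)
        span.set_attribute("enduser.id", user_id)

        start = time.monotonic()
        response = call_model(model=model, messages=messages, temperature=temperature)
        elapsed_ms = (time.monotonic() - start) * 1000

        labels = {"model": model, "prompt_id": prompt_id}
        llm_latency_ms.record(elapsed_ms, attributes=labels)
        llm_tokens.add(response.get("total_tokens", 0), attributes=labels)
        return response
```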
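
A sketch of the kill-switch and circuit-breaker item, assuming a generic `flag_enabled()` lookup (LaunchDarkly or similar) and a `deterministic_fallback()` you own; the window size and failure-rate threshold are illustrative.

```python
from collections import deque


class LLMCircuitBreaker:
    def __init__(self, window=50, max_failure_rate=0.2):
        self.results = deque(maxlen=window)  # rolling window of True/False outcomes
        self.max_failure_rate = max_failure_rate

    def is_open(self):
        # Only trip once the window is full, to avoid flapping on low traffic.
        if len(self.results) < self.results.maxlen:
            return False
        return self.results.count(False) / len(self.results) > self.max_failure_rate

    def record(self, ok):
        self.results.append(ok)


breaker = LLMCircuitBreaker()


def answer(request, flag_enabled, call_llm, deterministic_fallback):
    # Kill switch: flipped by ops/product without a deploy.
    if not flag_enabled("llm_answers") or breaker.is_open():
        return deterministic_fallback(request)
    try:
        response = call_llm(request)
        breaker.record(ok=True)
        return response
    except Exception:
        breaker.record(ok=False)
        return deterministic_fallback(request)
```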
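
A simplified stand-in for the Evidently drift check: compare the centroid of a current window of embeddings against a frozen reference window and alert when they diverge. The similarity threshold and the `alert()` hook are assumptions; Evidently gives you richer distribution reports once traffic is steady.

```python
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def embedding_drift(reference: np.ndarray, current: np.ndarray, min_similarity=0.98):
    """reference/current: (n_samples, dim) arrays of query or response embeddings."""
    similarity = cosine(reference.mean(axis=0), current.mean(axis=0))
    return similarity, similarity < min_similarity


# Example wiring in a nightly job (alert() is yours to define):
# similarity, drifted = embedding_drift(ref_embeddings, todays_embeddings)
# if drifted:
#     alert(f"embedding drift: centroid similarity {similarity:.3f}")
```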

Questions we hear from teams

What’s the minimal viable evaluation harness for a small team?
OpenTelemetry traces tagged with prompt/model/version, Prometheus metrics for latency, errors, and cost, a Promptfoo CI eval on a golden set, and a LaunchDarkly kill switch. Add RAGAS if you’re doing RAG and Evidently for drift once you have steady traffic.
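
For the golden-set piece, a pytest-style check works as a minimal stand-in until you adopt Promptfoo or RAGAS in CI. `golden_set.jsonl`, the `call_llm` fixture, and the keyword assertions are assumptions; swap in stricter graders as they mature.

```python
import json

import pytest


def load_golden_set(path="golden_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]


@pytest.mark.parametrize("case", load_golden_set())
def test_golden_case(case, call_llm):
    # call_llm is a fixture you define in conftest.py wrapping your provider SDK.
    answer = call_llm(case["prompt"])
    # Cheapest useful assertion: required keywords must appear in the answer.
    for keyword in case["must_contain"]:
        assert keyword.lower() in answer.lower(), f"missing '{keyword}' in case {case['id']}"
```
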
How do I measure hallucination in production?
Sample 1–5% of traffic for human labeling each day, score with RAGAS/TruLens on RAG flows, and track a rolling hallucination rate. Use this as a canary metric in Argo; fail promotion if the labeled sample’s hallucination rate exceeds the budget.
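
A sketch of the sampling and budget check, using hash-based sampling so the decision is deterministic per request; the sample rate, budget, and label store are illustrative.

```python
import hashlib

SAMPLE_RATE = 0.02           # 2% of traffic goes to human labelers
HALLUCINATION_BUDGET = 0.05  # fail the canary if the labeled rate exceeds 5%


def should_sample(request_id: str) -> bool:
    # Deterministic per request, so retries don't double-sample.
    digest = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < SAMPLE_RATE


def rolling_hallucination_rate(labels: list[bool]) -> float:
    """labels: True = labeled as hallucination, over the last N reviewed samples."""
    return sum(labels) / len(labels) if labels else 0.0


def canary_ok(labels: list[bool]) -> bool:
    return rolling_hallucination_rate(labels) <= HALLUCINATION_BUDGET
```
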
Should I build or buy guardrails?
Start with in-house schema validation and moderation APIs. If your policy surface is complex (regulated domains, tool-use constraints), adopt NeMo Guardrails or similar for policy authoring and auditability.
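
A sketch of the in-house starting point: validate the model’s JSON output against a schema (pydantic v2 here) and fall back deterministically on violation. The `RefundDecision` schema, `record_violation` hook, and fallback policy are assumptions for illustration.

```python
from pydantic import BaseModel, ValidationError


class RefundDecision(BaseModel):
    approve: bool
    amount_usd: float
    reason: str


def parse_or_fallback(raw_llm_output: str, record_violation) -> RefundDecision:
    try:
        return RefundDecision.model_validate_json(raw_llm_output)
    except ValidationError as exc:
        record_violation("schema", str(exc))  # feeds the guardrail-violation metric
        # Deterministic fallback: never auto-approve on a malformed response.
        return RefundDecision(approve=False, amount_usd=0.0, reason="needs human review")
```
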
What about vendor lock-in on observability?
Use OpenTelemetry everywhere. Export to Grafana Cloud/Datadog now; you can change backends later. The key is consistent span attributes and metric names.
How do I control cost without killing quality?
Track cost per 1k tokens and per request. Cap max tokens, cache with semantic hashing for frequent queries, and route low-risk paths to smaller models. Enforce budget SLOs in your canary gates.
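
A sketch of the routing and budgeting logic, with made-up model names, prices, and risk heuristic; wire in your provider’s real price sheet and your own risk signal.

```python
PRICE_PER_1K_TOKENS = {"big-model": 0.03, "small-model": 0.002}  # USD, illustrative
MAX_OUTPUT_TOKENS = 512


def estimate_cost(model: str, prompt_tokens: int, max_output_tokens: int = MAX_OUTPUT_TOKENS) -> float:
    # Worst-case cost per request, assuming the output cap is hit.
    return (prompt_tokens + max_output_tokens) / 1000 * PRICE_PER_1K_TOKENS[model]


def choose_model(risk_score: float, prompt_tokens: int) -> str:
    # Low-risk, short requests go to the cheaper model; everything else escalates.
    if risk_score < 0.3 and prompt_tokens < 2000:
        return "small-model"
    return "big-model"
```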

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Talk to GitPlumbers about wiring your eval harness
  • Get our LLM Evaluation Checklist (PDF)
