The Night Your LLM Went Off-Script: Shipping Bias Detection and Fairness Monitoring That Actually Works

If you don’t instrument bias and safety like a first-class SLO, your AI will fail silently—until it fails publicly.


The mess you’ve seen (and I’ve been paged for)

We shipped an LLM assistant for a fintech. First week, usage doubled. Great. Then the 2 a.m. page: responses started recommending ineligible loan products to a protected group due to a retrieval bug and a prompt tweak that boosted “persuasiveness.” Latency p99 went north of 4s, cache hit rate cratered, and hallucination rate tripled. No single bug—just missing guardrails and zero fairness visibility. I’ve seen this pattern at startups and at FAANG scale: if bias and safety aren’t first-class SLOs with real instrumentation, they fail silently until they fail publicly.

This is the blueprint we use at GitPlumbers to keep AI systems honest, fast, and defendable—without turning your on-call into an ethics board at midnight.

What actually breaks in production

You know the list, but tie it to metrics:

  • Hallucination: Answer not grounded in your corpus. Watch a measured hallucination_rate from evals or citation checks. Mitigate with RAG, citation requirements, and fallback to “I don’t know.”
  • Bias & unfairness: Disparate outcomes by protected groups or proxies. Track disparate_impact and equalized_odds deltas from a curated audit dataset, run daily.
  • Data/Concept drift: Input distributions shift (PSI/KS tests), or the mapping from input->label changes. Detect and gate rollouts.
  • Latency spikes: p95/p99 blown by model queueing, cold starts, or vendor wobble. Budget them and degrade gracefully.
  • Prompt injection & PII leaks: Inputs weaponized to exfiltrate system prompts or internal docs; outputs spilling sensitive data. Guard and redact.

If you don’t set SLOs for these, you’re telling the org they don’t matter.

Instrument everything with real traces and metrics

You can’t fix what you can’t see. Add OpenTelemetry spans around every hop: API ingress -> retrieval -> re-ranker -> model call -> post-process -> cache/store. Tag spans with the knobs that change behavior.

# app.py (FastAPI) — OTel + Prometheus + structured attributes
from fastapi import FastAPI, Request
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from prometheus_client import Counter, Histogram
import time, hashlib

app = FastAPI()
FastAPIInstrumentor().instrument_app(app)
tracer = trace.get_tracer(__name__)

REQS = Counter('ai_requests_total', 'AI requests', ['route','model','vendor'])
LAT = Histogram('ai_latency_seconds', 'AI latency', ['route','model','vendor'])
HALLUC = Counter('ai_hallucinations_total', 'Hallucinations', ['route','model'])

@app.post('/answer')
async def answer(req: Request):
    body = await req.json()
    prompt = body.get('prompt','')
    model = body.get('model','gpt-4o')
    vendor = body.get('vendor','openai')
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:12]

    start = time.time()
    with tracer.start_as_current_span('llm.request') as span:
        span.set_attribute('ai.model', model)
        span.set_attribute('ai.vendor', vendor)
        span.set_attribute('ai.prompt_hash', prompt_hash)
        span.set_attribute('ai.temperature', body.get('temperature',0.2))
        span.set_attribute('ai.user_tier', body.get('tier','free'))
        # call your retrieval + model here
        answer_text, grounded, citations = await call_llm_and_check(prompt, model)
        span.set_attribute('ai.grounded', grounded)
        span.set_attribute('ai.citations_count', len(citations))

    dur = time.time() - start
    REQS.labels('/answer', model, vendor).inc()
    LAT.labels('/answer', model, vendor).observe(dur)
    if not grounded:
        HALLUC.labels('/answer', model).inc()
    return {"answer": answer_text, "citations": citations}

Log only hashes, IDs, and counters in hot paths; keep raw text in a privacy-safe store or don’t keep it at all. If you need fairness audits, log a joinable user_id_hash and do the sensitive attribute join offline in a controlled environment.
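One way to produce that joinable key is a keyed hash, so raw user IDs never land in request logs but the restricted analytics project (holding the same salt) can recompute the key and join offline. A minimal sketch; the helper name and salt handling are assumptions, and the salt belongs in your secrets manager:

```python
import hashlib
import hmac

# Assumption: the salt lives in a vault and is shared only with the
# restricted analytics project. Rotate it in lockstep or joins break.
FAIRNESS_SALT = b"replace-with-secret-from-vault"

def user_join_key(user_id: str) -> str:
    """Stable, non-reversible join key: same user always maps to the
    same key, but the raw ID can't be recovered from logs."""
    return hmac.new(FAIRNESS_SALT, user_id.encode(), hashlib.sha256).hexdigest()[:16]
```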

Ship OTel to your collector and Prometheus:

# otel-collector.yaml (snippet)
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # scraped by your Prometheus server
  logging: {}
service:
  pipelines:
    traces: { receivers: [otlp], exporters: [logging] }
    metrics: { receivers: [otlp], exporters: [prometheus] }

Make fairness a build- and run-time gate, not a dashboard

Don’t argue philosophy during an incident. Decide thresholds up front and automate checks.

  • Data: Maintain an audit dataset with labels and protected attributes (or vetted proxies). Keep it in a restricted project; never in general logs.
  • Metrics: At minimum, track disparate_impact (selection rate ratios) and equalized_odds (TPR/FPR differences). Many teams use 0.8x–1.25x ratio bands.
  • Automation: Run daily in CI/CD or Airflow; fail the deploy or trigger rollback when thresholds are crossed.

Example with fairlearn for selection bias:

# fairness_audit.py — batch job
import pandas as pd
from fairlearn.metrics import selection_rate, MetricFrame

pred = pd.read_parquet('predictions_today.parquet')  # columns: user_hash, y_pred, y_true, group_attr
sr = MetricFrame(metrics=selection_rate,
                 y_true=pred['y_true'],
                 y_pred=pred['y_pred'],
                 sensitive_features=pred['group_attr'])

disp_impact = sr.by_group / sr.overall
print('Disparate impact by group:\n', disp_impact)

# simple gate
if (disp_impact.min() < 0.8) or (disp_impact.max() > 1.25):
    raise SystemExit('FAIL: Fairness threshold breached')

If you can’t store group_attr, compute it in a secure notebook and only store aggregated metrics + timestamps. Tools worth knowing: Fairlearn, IBM AIF360, Evidently fairness reports, WhyLabs/Arize/Fiddler managed monitors.

Stop hallucinations at the edge: guardrails and groundedness

I’ve seen teams rely on “it seems fine in staging.” That’s how you end up with the model inventing refund policies on Black Friday. Bake guardrails into the request path:

  1. Pre-filters: PII redaction (presidio), content safety (Azure Content Safety, OpenAI moderation, Perspective API). Reject or route to human.
  2. Retrieval grounding: RAG with strict citation requirements. If no citations found above a confidence threshold, respond with “I don’t know.”
  3. Post-checks: LLM-as-judge or regex/policy checks. Store a grounded=false flag to increment your hallucination_rate.
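Step 1's PII redaction can start as simple pattern matching before you wire in Presidio. A sketch with illustrative patterns only; regexes miss plenty (names, addresses, partial numbers), so treat this as a floor, not a solution:

```python
import re

# Illustrative patterns, not exhaustive -- use a real PII library
# (e.g. presidio) in production.
PII_PATTERNS = {
    'email': re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'),
    'ssn':   re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
}

def redact_pii(text: str) -> tuple[str, bool]:
    """Replace matches with typed placeholders; return (text, found_any)
    so the caller can route flagged requests to review."""
    found = False
    for name, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f'[REDACTED_{name.upper()}]', text)
        found = found or n > 0
    return text, found
```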

Simple post-check with citations requirement:

def post_check(answer_text: str, citations: list[str]) -> tuple[str,bool]:
    if len(citations) == 0:
        return ("I don’t know. I couldn’t find a source in our docs.", False)
    return (answer_text, True)

Feature-flag any prompt or retrieval change using LaunchDarkly/Unleash and roll forward only when evals hold. Canary new prompts/models to 1–5% of traffic. Tie the flag to automated rollback when hallucination_rate or fairness metrics regress.
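The rollback decision itself should be dumb and auditable. A sketch of the canary-vs-control comparison; the class, function names, and thresholds are hypothetical, and disable-the-flag is whatever your LaunchDarkly/Unleash client exposes:

```python
from dataclasses import dataclass

@dataclass
class ArmStats:
    hallucinations: int
    requests: int

    @property
    def rate(self) -> float:
        return self.hallucinations / max(self.requests, 1)

def should_rollback(canary: ArmStats, control: ArmStats,
                    min_requests: int = 500,
                    max_regression: float = 0.01) -> bool:
    """Roll back when the canary has enough traffic to judge and its
    hallucination rate exceeds control by more than max_regression."""
    if canary.requests < min_requests:
        return False  # not enough signal yet
    return canary.rate - control.rate > max_regression
```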

Detect drift before customers do

Your model’s “truth” decays. Watch both input distributions and embedding space.

  • Input drift: Use PSI/KS on key features; run hourly/daily. Evidently makes this easy.
  • Embedding drift: Track cosine distance between current batch centroid and a reference window; alert on jumps.
  • Concept drift: If you have labels, monitor accuracy/TPR over time and look for breaks.

Example PSI with evidently:

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

ref = pd.read_parquet('ref_window.parquet')
cur = pd.read_parquet('today.parquet')
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref, current_data=cur)
summary = report.as_dict()['metrics'][0]['result']
if summary['dataset_drift']:  # drifting-column share exceeded the preset threshold
    raise SystemExit('Drift detected: block rollout and page on-call')

For LLM retrieval, also track retrieval_recall@k on a labeled QA set. If recall dips, your index or embeddings changed under you.
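The embedding-drift check from the list above is a few lines of numpy. A sketch; the alert threshold is an assumption you calibrate against historical reference windows:

```python
import numpy as np

def embedding_drift(reference: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between batch centroids; 0.0 means the centroids
    point the same way. Inputs are (n_vectors, dim) embedding arrays."""
    ref_c = reference.mean(axis=0)
    cur_c = current.mean(axis=0)
    cos = np.dot(ref_c, cur_c) / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return float(1.0 - cos)

# Assumed threshold -- tune on your own windows before paging anyone.
DRIFT_THRESHOLD = 0.05
```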

Latency spikes: budget them and degrade gracefully

If your p99 depends on a vendor model’s bad day, you need exits.

  • SLOs: p95 < 800ms, p99 < 1.5s for tier-1 routes; error budget 1% per week. Publish it.
  • Traffic shaping: Token budgets, request queueing, and concurrency caps per tenant.
  • Adaptive routing: If queue depth grows, route to a cheaper/faster model or cached answer; stream partials.
  • Circuit breakers: Use resilience4j/Envoy/Istio for per‑upstream breakers and timeouts.
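The adaptive-routing bullet above reduces to a small, testable decision function. Model names and queue-depth thresholds here are illustrative assumptions, not recommendations:

```python
def pick_route(queue_depth: int, cache_hit: bool) -> str:
    """Pick the cheapest exit that still meets the latency budget."""
    if cache_hit:
        return 'cache'            # cheapest exit: serve the cached answer
    if queue_depth > 200:
        return 'fallback-static'  # shed load: canned answer or "try again"
    if queue_depth > 50:
        return 'gpt-4o-mini'      # degrade to the cheaper/faster model
    return 'gpt-4o'               # normal path
```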

Prometheus alert rule example:

# alerts.yaml
groups:
- name: ai-slo
  rules:
  - alert: AIP99LatencyBreached
    expr: histogram_quantile(0.99, sum(rate(ai_latency_seconds_bucket[5m])) by (le,route)) > 1.5
    for: 10m
    labels: { severity: page }
    annotations:
      summary: "p99 latency > 1.5s on {{ $labels.route }}"
  - alert: HallucinationRateSpike
    expr: increase(ai_hallucinations_total[10m]) / increase(ai_requests_total[10m]) > 0.03
    for: 5m
    labels: { severity: page }
    annotations:
      summary: "Hallucination rate > 3% in last 10m"

Ship it safely: GitOps, canaries, and dashboards that matter

  • GitOps: Version prompts, retrieval configs, and guardrail policies in Git. Roll with ArgoCD/Flux; diffs are your audit trail.
  • Canary/Shadow: Shadow test challenger models; graduate with a feature flag if fairness/latency/hallucination SLOs pass.
  • Dashboards: One Grafana board with: p95/p99 latency, error rate, cache hit rate, hallucination rate, fairness disparity ratio, drift indicators, and cost per 1k requests. No 20-tab scavenger hunts.
  • Runbooks: For “Hallucination Spike,” “Fairness Threshold Breach,” “Drift Exceedance,” “Vendor Latency Outage.” Include rollback commands and owners.

If it isn’t in Git and it isn’t alerting, it doesn’t exist when you’re on-call.

What “good” looks like in 30 days

We did this with a marketplace client migrating to RAG on Azure OpenAI + Pinecone:

  • Latency p99 from 2.2s → 1.1s via caching, parallel retrieval, and adaptive routing to gpt-4o-mini on spikes.
  • Hallucination rate on eval set 6.4% → 1.1% with mandatory citations and fallback.
  • Disparate impact tightened from 0.72–1.48 → 0.88–1.12 using prompt neutralization and training data fixes; deploys now auto-block when outside band.
  • Drift detection caught a vendor embedding version flip within 2 hours; rollout paused automatically.

Could you DIY this? Sure. Will it take three quarters and five outages? Also sure. We’ve already stepped on the rakes.


Key takeaways

  • Treat bias and safety like SLOs with budgets, alerts, and runbooks—not feel‑good dashboards.
  • Instrument every AI hop with OpenTelemetry and ship metrics to Prometheus/Grafana; tag spans with model/version/prompt hash/temperature.
  • Automate fairness audits offline with Fairlearn or AIF360; track disparity ratios and fail builds or rollbacks when thresholds are exceeded.
  • Mitigate hallucination and prompt injection using retrieval grounding, policy guardrails, LLM-as-judge, and safe fallbacks.
  • Detect drift with PSI/KS tests and embedding drift metrics; gate rollouts with champion/challenger and feature flags.
  • Control latency spikes with circuit breakers, queueing, adaptive model routing, and p95/p99 SLOs tied to customer impact.

Implementation checklist

  • Define SLOs for fairness, hallucination rate, and latency; publish them next to uptime SLOs.
  • Instrument with OpenTelemetry across API -> retrieval -> model -> post-processing; export to Prometheus/Grafana.
  • Log joinable IDs to compute fairness offline; avoid storing protected attributes in raw logs.
  • Automate daily fairness reports with Fairlearn; alert on disparate impact ratio < 0.8 or your org’s threshold.
  • Add real-time guardrails: content filters, citation checks, and “I don’t know” fallbacks.
  • Detect drift with Evidently or Alibi Detect; block deploys when PSI exceeds threshold.
  • Use feature flags to canary model/prompt changes; roll back automatically on KPI regressions.
  • Write runbooks for incident classes: hallucination spike, latency p99 breach, drift exceedance.

Questions we hear from teams

How do we monitor fairness without storing sensitive attributes in logs?
Log only a stable join key (e.g., user_id_hash). In a restricted analytics project, join that key to sensitive attributes under access controls, compute fairness aggregates (disparate impact, equalized odds), and export only metrics/time series. Never persist raw attributes in request logs.
What thresholds should we use for fairness?
Start with disparate impact between 0.8 and 1.25 and equalized odds gaps < 0.1 absolute, then tune to your domain and risk tolerance. Crucially, encode thresholds in code and CI/CD, not in a slide deck.
Do we need human-in-the-loop for safety?
For high-risk actions (financial advice, medical, compliance), yes—route low-confidence or unsafe outputs to review. For medium-risk flows, automated guardrails + strong fallbacks (including answering “I don’t know”) are usually sufficient.
Which tools do you recommend for a v1?
OpenTelemetry for tracing, Prometheus/Grafana for metrics, Fairlearn for fairness audits, Evidently for drift, LaunchDarkly/Unleash for flags, and a managed LLM provider with content safety APIs. Add WhyLabs/Arize/Fiddler when you need managed monitoring or cross-model comparisons.
How do we test hallucination?
Maintain a labeled QA set and compute groundedness via citation checks and spot review. Use LLM-as-judge carefully with calibration, and backstop with rule-based checks (required citations, policy keywords). Track hallucination_rate over time and gate deploys on it.
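The citation-check half of that metric is straightforward: an answer counts as grounded when it cites at least one source labeled relevant for that question. A sketch; the record shape is an assumption about how you store eval results:

```python
def hallucination_rate(results: list[dict]) -> float:
    """results: [{'citations': [...], 'relevant_sources': [...]}, ...]
    An answer is ungrounded if none of its citations appear in the
    labeled relevant sources for that question."""
    if not results:
        return 0.0
    ungrounded = sum(
        1 for r in results
        if not set(r['citations']) & set(r['relevant_sources'])
    )
    return ungrounded / len(results)
```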

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

