Stop Burning GPUs: Cost Controls for AI Inference That Don’t Tank Quality
I’ve watched teams 10x their AI bills overnight without moving a single KPI. Here’s how to design cost optimization into your AI stack—without shipping garbage.
Your GPU bill is a product decision, not an infra accident.
The $220k invoice and the missing gauges
I’ve seen this movie three times: a team ships a helpful AI assistant, traffic climbs, and the CFO calls about the surprise cloud bill. One client burned six figures in a month because a chatty microservice retried on 429 without jitter and never cached a single response. Worse: no one could say where the tokens were going. They had dashboards for CPU and heap—nothing for tokens, cache hit rate, or model selection.
If you can’t answer “what’s our cost per successful task at P95 < 800ms and < 1% safety blocks?” you’re not running AI in production—you’re vibe coding with GPUs.
This is the playbook we implement at GitPlumbers to slash spend without sacrificing quality.
Design for unit economics, not vibes
You can’t optimize what you don’t measure. Start with business-level unit economics, then bind them to technical SLOs.
- KPIs that matter:
  - Cost per successful task (not per request)
  - P50/P95 latency and timeout rate
  - Cache hit rate and token savings
  - Safety block rate and false-positive rate
  - Quality score from evals (exact-match, BLEU/ROUGE for gen, task success for agents)
- SLOs (example):
  - P95 latency ≤ 800ms for retrieval Q&A; error rate < 0.5%
  - Quality ≥ baseline within 2% on golden set; safety blocks < 1%
- Budgets:
  - Tokens/request target by route (e.g., ≤ 1.5k tokens for search answers, ≤ 6k for synthesis)
  - GPU utilization target 55–70% at steady state to keep room for spikes
Write these down, put them in Grafana, and treat them like real constraints, not “goals.”
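To make “cost per successful task” concrete, here’s a minimal sketch of the math; the per-1k-token prices, route names, and example numbers are placeholders, not real rates:
# unit_economics.py — sketch; substitute your actual contract rates and routes
PRICE_PER_1K = {
    'cheap':    (0.0002, 0.0006),   # (input, output) USD per 1k tokens — placeholder
    'standard': (0.0010, 0.0030),
    'premium':  (0.0050, 0.0150),
}

def cost_per_successful_task(tokens_in: int, tokens_out: int, route: str, successes: int) -> float:
    """Cost per *successful* task, not per request — failed requests still burn tokens."""
    p_in, p_out = PRICE_PER_1K[route]
    spend = (tokens_in / 1000) * p_in + (tokens_out / 1000) * p_out
    return spend / max(successes, 1)

# Example: 40M input / 8M output tokens on 'standard', 52k successful tasks
print(cost_per_successful_task(40_000_000, 8_000_000, 'standard', 52_000))
Recompute this per route in a nightly job or recording rule so the number sits right next to your latency SLO in Grafana.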
Instrument everything: tokens, latency, and cache hits
Your traces must cross the boundary into the model. Use OpenTelemetry spans for every AI call. Emit custom Prometheus metrics for tokens, cache efficacy, model route, and guardrail outcomes.
# instrumentation.py
from time import time

from opentelemetry import trace
from prometheus_client import Counter, Histogram

tracer = trace.get_tracer(__name__)

TOKENS_IN = Counter('ai_tokens_in_total', 'Input tokens', ['model', 'route'])
TOKENS_OUT = Counter('ai_tokens_out_total', 'Output tokens', ['model', 'route'])
CACHE_HIT = Counter('ai_cache_hits_total', 'Cache hits', ['route'])
REQS = Counter('ai_requests_total', 'AI requests', ['model', 'route', 'status'])
LAT = Histogram('ai_latency_seconds', 'Latency', ['model', 'route'])

def call_model(client, prompt, route):
    start = time()
    with tracer.start_as_current_span('ai.call', attributes={'route': route}):
        try:
            resp = client.complete(prompt)
            TOKENS_IN.labels(client.model, route).inc(resp.usage.prompt_tokens)
            TOKENS_OUT.labels(client.model, route).inc(resp.usage.completion_tokens)
            LAT.labels(client.model, route).observe(time() - start)
            REQS.labels(client.model, route, 'ok').inc()
            return resp.text
        except Exception:
            REQS.labels(client.model, route, 'err').inc()
            raise

Scrape and graph it. Yes, boring plumbing. It’s also how you stop fires.
# prometheus-scrape.yaml
scrape_configs:
  - job_name: 'ai-service'
    kubernetes_sd_configs: [ { role: pod } ]
    relabel_configs:
      - action: keep
        source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: 'true'
    metric_relabel_configs:
      - source_labels: [route]
        action: replace
        target_label: ai_route

Add trace links to logs so a single user complaint can be traced to a model, prompt, and route.
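The log side of that link is one small helper; here’s a sketch using OpenTelemetry’s current span (the field names are whatever your log pipeline expects):
# log_correlation.py — sketch: stamp the active trace ID onto structured logs
import json, logging
from opentelemetry import trace

logger = logging.getLogger('ai-service')

def log_ai_event(route: str, model: str, status: str):
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        'route': route,
        'model': model,
        'status': status,
        'trace_id': format(ctx.trace_id, '032x'),  # same hex ID your trace backend shows
        'span_id': format(ctx.span_id, '016x'),
    }))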
Kill waste before you buy more GPUs
Nine times out of ten, spend drops when you remove obvious waste.
- Cache responses. If the same query pops up, don’t pay twice.
- Batch where latency allows. Tokenizers and GPUs love batches.
- Stream tokens to the client (see the sketch after this list). Perceived latency wins you headroom for batching.
- Retrieval hygiene. Bad RAG recalls cost tokens and hurt quality—dedupe, chunk right, and filter.
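Streaming is mostly plumbing. A sketch with FastAPI and an OpenAI-style streaming client — the endpoint, model name, and client setup are illustrative, not a drop-in:
# streaming.py — sketch; assumes an OpenAI-style client with stream=True
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get('/answer')
def answer(q: str):
    def tokens():
        stream = client.chat.completions.create(
            model='gpt-4o-mini',  # placeholder; pick per the routing section below
            messages=[{'role': 'user', 'content': q}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta
    return StreamingResponse(tokens(), media_type='text/plain')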
# cache.py
import hashlib, json, redis

from instrumentation import CACHE_HIT, call_model  # metrics and model wrapper from above

r = redis.Redis(host='redis', port=6379)

def cache_key(route, prompt):
    h = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    return f"{route}:{h}"

def cached_or_run(route, prompt, ttl=600):
    key = cache_key(route, prompt)
    if (val := r.get(key)):
        CACHE_HIT.labels(route).inc()
        return json.loads(val)
    resp = call_model(client, prompt, route)  # `client` is your configured model client
    r.setex(key, ttl, json.dumps(resp))
    return resp

For self-hosting, pick an inference stack that doesn’t fight you:
- vLLM for high throughput, paged KV cache, and continuous batching.
- NVIDIA Triton when you need mixed backends (PyTorch, TensorRT) and strict SLAs.
- Ray Serve/KServe for orchestration and autoscaling.
Enable their caches and tune batch sizes—don’t leave defaults.
# vLLM example
command:
  - "python"
  - "-m"
  - "vllm.entrypoints.api_server"
  - "--model=/models/mistral-7b-instruct"
  - "--gpu-memory-utilization=0.7"
  - "--max-num-seqs=64"              # batch size
  - "--max-num-batched-tokens=4096"
  - "--disable-log-requests"

Right-size the model per request
Stop sending every request to the most expensive model. Route based on difficulty and risk.
- Tiered models: small/quantized for easy classification; bigger for reasoning.
- Distillation: train a smaller model on your prompts/answers to handle the 80% path.
- Quantization: bitsandbytes 4/8-bit, AWQ, or TensorRT-LLM can cut inference cost 30–60% with tolerable quality loss.
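For reference, a 4-bit load with bitsandbytes via transformers looks roughly like this (the model ID is illustrative; re-run your golden set before routing real traffic to a quantized variant):
# quantized_model.py — sketch: 4-bit load with bitsandbytes via transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'mistralai/Mistral-7B-Instruct-v0.2'  # illustrative
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map='auto')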
A simple router goes a long way.
// router.ts
import { z } from 'zod';

const Route = z.enum(['cheap', 'standard', 'premium']);

export function chooseRoute(input: { task: string; risk: 'low' | 'med' | 'high'; contextSize: number }) {
  if (input.risk === 'high' || input.contextSize > 6000) return Route.enum.premium; // e.g., gpt-4o or Llama 70B
  if (input.task.includes('classify') || input.contextSize < 800) return Route.enum.cheap; // e.g., q4_0 quantized 7B
  return Route.enum.standard; // e.g., 13B distilled
}

Measure quality by route. If cheap underperforms on the golden set by >2%, flip the feature flag and investigate. This is where LaunchDarkly or Unleash earns its keep.
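A sketch of how that flag and the golden-set scores gate the router at runtime — `flags` stands in for your LaunchDarkly/Unleash client, and the flag key and baseline number are made up:
# route_guard.py — sketch; `flags` is a stand-in for a LaunchDarkly/Unleash client
BASELINE_TASK_SUCCESS = 0.92  # premium baseline on the golden set (example number)

def cheap_route_healthy(latest_scores: dict, tolerance: float = 0.02) -> bool:
    """True if the cheap route sits within `tolerance` of baseline on the golden set."""
    return latest_scores.get('task_success', 0.0) >= BASELINE_TASK_SUCCESS - tolerance

def guard_route(proposed: str, flags, latest_scores: dict) -> str:
    """Kill switch: a disabled flag or regressed evals push traffic back to standard."""
    if proposed != 'cheap':
        return proposed
    if not flags.is_enabled('cheap-route-enabled'):  # hypothetical flag key
        return 'standard'
    if not cheap_route_healthy(latest_scores):
        return 'standard'
    return proposed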
Guardrails that catch hallucinations, drift, and spikes
I’ve watched teams save 40% on spend and then lose it all cleaning up bad outputs. Guardrails are cost controls because they prevent downstream rework.
- Hallucination control:
  - Use retrieval grounding; show sources; reject when source_confidence < threshold.
  - Add a self-check prompt step for high-risk flows; it’s cheaper than a support ticket.
- Drift detection:
  - Track answer distribution, embedding drift, and the safety block rate trend. Sudden changes mean retraining or a route tweak.
- Latency spikes:
  - Add circuit breakers and bulkheads so one model’s hiccup doesn’t take the app down. Retry with backoff, then fall back to a smaller model or cached answer.
# guardrails.py
from tenacity import retry, stop_after_attempt, wait_random_exponential

from cache import cached_or_run
from instrumentation import call_model

@retry(stop=stop_after_attempt(3), wait=wait_random_exponential(multiplier=0.1, max=1.5))
def grounded_answer(query, docs):
    prompt = f"Use only these docs to answer. If insufficient, say 'I don't know'.\n\nDocs:\n{docs}\n\nQuestion: {query}"
    resp = call_model(client, prompt, route='standard')  # `client` as configured above
    if not has_citations(resp):  # has_citations: your own check that sources were cited
        return {'status': 'insufficient', 'text': "I don't know."}
    return {'status': 'ok', 'text': resp}

# circuit breaker-like fallback
try:
    result = grounded_answer(q, retrieved_docs)
except Exception:
    result = cached_or_run('cheap', fallback_prompt(q))

And yes, you need evals. Build a harness and use canaries.
# run-evals.sh
pytest -q tests/evals/ --model-route standard --golden-set data/gold.csv \
  --metrics exact,bleu,task_success --thresholds exact=0.85 task_success=0.9

Gate deployments in CI/CD (ArgoCD, GitOps) on evals passing and SLO budgets staying green.
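The gate itself can be a short script the pipeline runs after the eval job — a sketch, assuming the harness writes a JSON file of scores:
# check_eval_gate.py — sketch; the results path and score keys are assumptions
import json, sys

THRESHOLDS = {'exact': 0.85, 'task_success': 0.90}

def main(path: str = 'eval-results.json') -> int:
    scores = json.load(open(path))
    failures = {k: scores.get(k, 0.0) for k, v in THRESHOLDS.items() if scores.get(k, 0.0) < v}
    if failures:
        print(f"Eval gate failed: {failures} vs thresholds {THRESHOLDS}")
        return 1  # non-zero exit blocks the sync / promotion step
    print("Eval gate passed")
    return 0

if __name__ == '__main__':
    sys.exit(main(*sys.argv[1:]))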
Infra choices that won’t bite you at 3 a.m.
- Managed vs self-hosted: If your traffic is spiky and compliance allows, OpenAI/Azure OpenAI/Anthropic with strict budgets and retries is fine. If you need lower P95 or data control, self-host vLLM/Triton on K8s.
- Autoscaling: HPA with GPU metrics; keep warm pods around to avoid cold-start spikes.
- Spot/preemptible: Use them for overflow with capacity buffers and quick drains.
# gpu-hpa.yaml (KServe or generic Deployment with NVIDIA DCGM metrics)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-hpa
spec:
  scaleTargetRef:            # point at your inference Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Pods
      pods:
        metric:
          name: nvidia_gpu_duty_cycle
        target:
          type: AverageValue
          averageValue: "70"   # duty cycle is 0–100; AverageValue takes a quantity, not a percent

Add a watchdog that pages only when budgets break.
# cost-watchdog.sh (pseudo; wire to CUR/BigQuery or provider API)
TARGET_CPD=3500  # cost per day
ACTUAL=$(python scripts/tokens_cost.py --since 24h)
if (( $(echo "$ACTUAL > $TARGET_CPD * 1.15" | bc -l) )); then
  curl -X POST $PAGERDUTY_EVENT --data "{\"summary\":\"AI spend >115% budget\"}"
fi

Finally, keep your vector DB bills sane. Chunk to ~512–1k tokens, dedupe aggressively, and store embeddings in a tiered index (hot vs cold). Half the RAG spend I see is duplicate documents and over-chunky pages.
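Dedupe before you embed and chunk to a token budget; a minimal sketch using tiktoken for token counts (the 768-token target and hash-based dedupe are starting points, not gospel):
# rag_hygiene.py — sketch: hash-based dedupe, then chunk to a token budget
import hashlib
import tiktoken

enc = tiktoken.get_encoding('cl100k_base')

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates (after trivial normalization) before embedding."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

def chunk(text: str, max_tokens: int = 768) -> list[str]:
    """Split into ~max_tokens chunks; tune per corpus and retriever."""
    toks = enc.encode(text)
    return [enc.decode(toks[i:i + max_tokens]) for i in range(0, len(toks), max_tokens)]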
What “good” looks like in 30 days
We’ve run this play at growth-stage SaaS and stodgy enterprises alike. Typical 30-day outcomes:
- 25–60% lower inference cost via caching, batching, and routing
- P95 latency down 20–40% with streaming and warm pools
- 0.5–1.5% improvement on golden-set task success after cleaning retrieval
- On-call pages drop because circuit breakers and fallbacks do their job
The trick isn’t a single silver bullet. It’s a boring, well-instrumented flywheel.
If it’s not in a dashboard with an SLO, it’s a rumor.
If I had to start from zero tomorrow
- Define SLOs and budgets (latency, quality, tokens/request). Put them in Grafana.
- Add OTel traces + Prom custom metrics for tokens, cache, route, guardrail status.
- Enable Redis response cache; measure hit rate and token savings.
- Introduce a tiered router (cheap/standard/premium) with feature flags.
- Add guardrails: retrieval grounding, safety filters, circuit breakers, and fallbacks.
- Stand up vLLM or Triton with batching and KV cache tuned; HPA on GPU duty cycle.
- Build a minimal eval harness and gate deploys with a canary / A/B.
When the graphs stabilize and quality holds, start experimenting with quantization and distillation to take the next bite out of spend.
If you want a second set of eyes, GitPlumbers lives in this trench—instrumentation, guardrails, and code rescue after the AI-fueled fire drill.
Key takeaways
- Know your unit economics: tokens, cache hit rate, per-request cost, and P95 latency tied to SLOs.
- Instrument everything: tokens in/out, model/route chosen, cache hits, guardrail outcomes, and error reasons.
- Kill waste first: caching, batching, streaming, and retrieval hygiene before buying more GPUs.
- Right-size the model per request with routing, quantization, and distillation—then measure quality impact.
- Guardrails must be first-class: circuit breakers, evals, canaries, and policy checks reduce risk and rework.
- Pick infra intentionally: inference servers, autoscaling, and spot capacity with resiliency and observability.
- Make drift and hallucinations visible: eval harnesses, golden sets, and rollback paths wired to alerts.
Implementation checklist
- Define SLOs for latency and quality; tie budgets to token and GPU utilization metrics.
- Add OpenTelemetry traces around every AI call; emit custom metrics for tokens, cache, and routing.
- Implement response caching with Redis; set TTLs per use case and log cache effectiveness.
- Batch requests where latency budgets allow; stream partial tokens to keep UX responsive.
- Introduce a model router with fallbacks; include quantized and distilled variants.
- Add guardrails: content filters, injection checks, circuit breakers, and automatic fallbacks.
- Adopt canaries and A/B for model changes; evaluate with golden datasets before 100% rollout.
- Right-size infra: use vLLM/Triton, GPU-aware autoscaling, and spot instances with buffer capacity.
Questions we hear from teams
- How do we know if routing to smaller models is safe?
- Create a golden dataset that mirrors your top tasks and risks, then A/B routes against it. Set thresholds (e.g., task success within 2% of baseline, safety blocks <1%). Gate rollouts in CI/CD; start with 5–10% traffic canaries and watch guardrail metrics.
- Is vLLM actually cheaper than managed APIs?
- At steady, predictable volume with batch-friendly workloads, vLLM’s throughput and KV cache can be 2–5x better per GPU than naive hosting, often beating API costs. If your load is spiky and compliance allows, managed APIs with strict budgets might win overall due to elasticity and lower ops overhead.
- Where do teams overspend first?
- No caching, no batching, and sending everything to a premium model. Also: sloppy RAG (duplicate chunks), missing timeouts, and retries without jitter. These are table stakes fixes before you touch quantization or distillation.
- How do we detect model drift in production?
- Track distribution shifts (embedding stats, answer length, topic mix), quality on rolling golden sets, and rising safety block rates. Alert on deltas beyond thresholds and trigger re-evals. Tools like WhyLabs/Arize help; you can also roll your own with Prometheus + offline jobs.
- What if our legal team blocks spot instances?
- Keep on-demand for baseline capacity and restrict spot to overflow. Use priority classes and preemption-safe draining so you don’t violate SLOs. If spot is a hard no, squeeze cost out of routing, caching, and quantization—those typically deliver bigger wins anyway.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
