The GPU Bill That Ate Your Roadmap: Instrument, Gate, and Route LLMs Without Losing Quality

You don’t need a bigger cluster—you need better visibility, smarter routing, and guardrails that prevent expensive mistakes.

“Optimize what you can see. Instrument first, then gate and route. Only then touch the GPUs.”

The moment the GPU bill shows up

I’ve watched more than a few teams ship an AI feature, bask in the dopamine for a week, then get a DM from finance: “Why did our cloud spend double?” And it’s always the same movie—no per-request cost telemetry, no gating, and a production path that can’t tell a 2-sentence query from a complex multi-doc analysis. When latency spikes or the model hallucinates, they just retry… into the same bottleneck. That’s how you set money on fire.

Here’s what actually works to cut cost without cratering quality: instrument first, then gate and route, and only then mess with infra. We’ve used this playbook at GitPlumbers on teams running gpt-4o, claude-3.5, or on-prem Llama 3.1 via vLLM/Triton. 30–60% cost reduction is typical in 2–6 weeks, with equal or better accuracy and tighter latency tails.

Instrument first or you’re flying blind

If you can’t answer “how much did this user request cost and why?” you’re guessing. Put traces and metrics everywhere.

  • Trace every hop with OpenTelemetry: retrieval spans, prompt construction, inference call, post-processing, validators.
  • Emit Prometheus metrics that map to dollars: tokens_in, tokens_out, $ per request, retry_count, cache_hit, eval_score.
  • Tag by model, route, and customer tier to see where the waste lives.

A tiny example that pays for itself fast:

# instrumentation.py
from opentelemetry import trace
from prometheus_client import Counter, Histogram

tracer = trace.get_tracer("ai-pipeline")
TOKENS_IN = Counter("tokens_in", "Prompt tokens", ["model","route"])
TOKENS_OUT = Counter("tokens_out", "Completion tokens", ["model","route"])
REQ_COST_USD = Counter("req_cost_usd", "Estimated USD per request", ["model","route"])
LATENCY = Histogram("llm_latency_seconds", "LLM call latency", ["model","route"])

PRICE_PER_1M = {"gpt-4o": (5.0, 15.0), "gpt-4o-mini": (0.15, 0.60)}  # USD per 1M tokens (in, out)

def record(model, route, in_toks, out_toks, seconds):
    TOKENS_IN.labels(model, route).inc(in_toks)
    TOKENS_OUT.labels(model, route).inc(out_toks)
    LATENCY.labels(model, route).observe(seconds)
    cost = (PRICE_PER_1M[model][0] * in_toks + PRICE_PER_1M[model][1] * out_toks) / 1_000_000
    REQ_COST_USD.labels(model, route).inc(cost)

Tie this into your trace spans with span.set_attribute("tokens_in", in_toks) etc. Ship to Grafana/Tempo/Loki. Tools like LangSmith or W&B are fine, but keep the raw counters in Prometheus—you’ll need them for SLOs and autoscaling later.
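For the span side, here's a minimal sketch of what that looks like; traced_llm_call and call_fn are illustrative names, and it assumes the record helper above plus an OTel exporter you've already configured:

# tracing.py
import time
from opentelemetry import trace
from instrumentation import record  # the Prometheus helper above

tracer = trace.get_tracer("ai-pipeline")

def traced_llm_call(model, route, call_fn, prompt):
    # call_fn is your existing client call; assumed to return
    # (text, prompt_tokens, completion_tokens) from the provider's usage fields.
    with tracer.start_as_current_span("llm.inference") as span:
        start = time.perf_counter()
        text, in_toks, out_toks = call_fn(model, prompt)
        seconds = time.perf_counter() - start
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.route", route)
        span.set_attribute("tokens_in", in_toks)
        span.set_attribute("tokens_out", out_toks)
        record(model, route, in_toks, out_toks, seconds)  # same numbers land in Prometheus
        return text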

Right-size with gated routing (stop using a hammer for every nail)

Most requests don’t need your most expensive model. Use a cheap classifier or heuristics to decide which lane to take.

  • Three-tier pattern:
    • Tier A: *-mini/small models for simple Q&A, template fills.
    • Tier B: mid-tier (gpt-4o-mini, claude-3.5-sonnet, or a tuned Llama 3.1 8B) for moderate tasks.
    • Tier C: premium (gpt-4o, claude-3.5-sonnet/opus) for complex/ambiguous.
  • Gate with a classifier: a fast model predicts complexity and risk. If the input is short and low-risk, take Tier A; otherwise escalate (see the sketch after this list).
  • Eval before switching traffic: offline eval harness (Ragas for RAG, task-specific checks) plus a 5–10% canary via Argo Rollouts.
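A minimal version of that gate, to make the shape concrete; the tier map, thresholds, and classify_complexity heuristic are illustrative stand-ins for your fast classifier:

# router.py -- pick a tier per request; escalate on risk or low confidence
TIERS = {"A": "gpt-4o-mini", "B": "llama-3.1-8b-tuned", "C": "gpt-4o"}

def classify_complexity(text: str) -> tuple[str, float]:
    # Stand-in for a small, fast classifier model; returns (label, confidence).
    if len(text) < 400:
        return "simple", 0.9
    if len(text) < 2000:
        return "moderate", 0.7
    return "complex", 0.8

def choose_tier(prompt: str, risk_score: float) -> str:
    label, confidence = classify_complexity(prompt)
    if risk_score > 0.5 or confidence < 0.6:
        return "C"  # high risk or low confidence: escalate to premium
    return {"simple": "A", "moderate": "B", "complex": "C"}[label]

def route(prompt: str, risk_score: float = 0.0) -> str:
    tier = choose_tier(prompt, risk_score)
    # Log the decision so under-routing shows up in the confusion matrix below.
    return TIERS[tier]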

We did this at a fintech that was defaulting to claude-3.5-sonnet for everything. After adding a haiku-based gate and a tuned Llama 3.1 8B mid-tier, 68% of traffic moved off premium with no significant drop in answer accuracy. p95 latency fell from 2.1s to 1.3s. Net: 52% cost reduction per request.

Don’t argue about it—measure. If the confusion matrix says the gate occasionally under-routes, add a safe fallback and log escalations.

Batch, cache, and retrieve: kill waste before it hits the GPU

If you’re paying per token and per cold start, eliminate duplicate work and pack the hardware.

  • Batching: use vLLM (continuous batching, paged KV cache) or Triton (dynamic batching) for on-prem/self-hosted models.
# vLLM with batching and tensor parallelism
python -m vllm.entrypoints.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 8192 \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --gpu-memory-utilization 0.90
  • Semantic cache: Redis/Valkey + FAISS/PGVector; key by hash(prompt_template+doc_version) and keep a vector index for fuzzy hits. Evict based on content drift, not just time (sketch after this list).
  • RAG to prune tokens: better retrieval beats fancier prompts. Keep chunks ~300–600 tokens, add citations, and compress system prompts. You’ll remove ~20–40% of input tokens immediately.
  • KV cache reuse: for multi-turn, keep the conversation window small and summarize. Use response_format or tool schemas to avoid expensive retries.
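Here's a minimal sketch of the exact-match layer of that cache, assuming redis-py; the key bakes in the doc version so content drift invalidates entries, and the fuzzy FAISS/PGVector lookup sits behind the miss path:

# semantic_cache.py -- exact-match layer keyed by prompt template + doc version
import hashlib
import json
import redis  # redis-py; Valkey speaks the same protocol

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def _key(prompt_template: str, doc_version: str, user_input: str) -> str:
    raw = f"{prompt_template}|{doc_version}|{user_input}"
    return "llmcache:" + hashlib.sha256(raw.encode()).hexdigest()

def get_cached(prompt_template, doc_version, user_input):
    hit = r.get(_key(prompt_template, doc_version, user_input))
    # On a miss, fall through to the vector index for fuzzy hits.
    return json.loads(hit) if hit else None

def put_cached(prompt_template, doc_version, user_input, answer, ttl_s=86400):
    # Bumping doc_version is the real eviction lever; TTL is just a backstop.
    r.set(_key(prompt_template, doc_version, user_input), json.dumps(answer), ex=ttl_s)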

I’ve seen teams flip on vLLM and a 2-level cache and watch GPU utilization jump from 25% to 70% while per-request dollars drop ~35% overnight.

Control the tail: SLOs, circuit breakers, and queue-aware autoscaling

Latency spikes hurt quality and your bill—retries and timeouts pile up. Put the rails on.

  • SLOs that matter:
    • p95 latency per route (e.g., Tier A ≤ 800ms, Tier C ≤ 2.5s)
    • Error budget for retries/timeouts (≤ 2%)
    • Hallucination proxy (fraction of answers failing validators ≤ 1%)
  • Enforce with Istio: connection pools and outlier detection on the DestinationRule below; put request timeouts on the VirtualService.
# istio-destinationrule.yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: llm-backend
spec:
  host: llm.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
    tls:
      mode: ISTIO_MUTUAL
  • Autoscale on work, not CPU: use KEDA to scale by queue depth or tokens/sec.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-worker
spec:
  scaleTargetRef:
    name: llm-worker
  pollingInterval: 5
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: queue_depth
        query: sum(queue_depth)   # PromQL the scaler evaluates; required by the prometheus trigger
        threshold: "200"
  • Backpressure beats “infinite” concurrency: cap concurrent requests per pod, queue quickly, fail fast when budgets are gone. Your p99 will thank you.
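A minimal in-process sketch of that backpressure pattern (the limits and names are illustrative; the same idea usually also lives at the gateway or queue layer):

# backpressure.py -- cap in-flight requests, queue briefly, then shed load
import asyncio

MAX_IN_FLIGHT = 8        # per-pod concurrency cap
QUEUE_TIMEOUT_S = 0.5    # how long a request may wait for a slot before we shed it

_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def call_with_backpressure(call_fn, *args):
    try:
        # Queue briefly for a slot; if the pod is saturated, fail fast so the
        # caller can degrade or route elsewhere instead of growing the p99 tail.
        await asyncio.wait_for(_slots.acquire(), timeout=QUEUE_TIMEOUT_S)
    except asyncio.TimeoutError:
        raise RuntimeError("overloaded: shedding request")
    try:
        return await call_fn(*args)
    finally:
        _slots.release()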

Safety guardrails that save money (not just your reputation)

Hallucinations and drift aren’t just risky—they’re expensive. The worst pattern is “didn’t like that answer, retry 3x.” Add guardrails so you validate, refuse, or escalate instead of guessing.

  • Structural validators: use JSON schema or tool schemas. With OpenAI, use response_format={"type": "json_schema", ...} to avoid parse errors and retries (see the sketch after this list).
  • Retrieval coverage checks: if citations don’t back the claim (RAG), ask for clarification or escalate to Tier C. Don’t blindly answer.
  • Safety filters: classify PII/compliance risk up front; route to a stricter model or a human for certain verticals (health/finance/legal).
  • High-risk playbook: for tasks exceeding a “risk score,” require dual model agreement or human review. It’s cheaper than a lawsuit.
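To make the validate-or-escalate path concrete, a minimal sketch using the jsonschema package; the schema and the escalate hook are assumptions here, not part of any provider SDK:

# guardrails.py -- validate structured output; escalate instead of blind retries
import json
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "citations": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "citations"],
}

def check_or_escalate(raw_text: str, escalate):
    # escalate() is whatever your router exposes: re-run on a stricter tier
    # or hand off to a human. Never blind-retry on the same tier.
    try:
        payload = json.loads(raw_text)
        validate(instance=payload, schema=ANSWER_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as exc:
        return escalate(reason=f"invalid structure: {exc}")
    if not payload["citations"]:
        return escalate(reason="no supporting citations")
    return payload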

In a healthcare support bot, adding citation checks and a refusal pattern cut hallucination incidents 80% and reduced retries by 40%, slashing both risk and spend.

Infra levers: quantize, compile, and bin-pack responsibly

Once routing and caching are in place, squeeze the metal—but with a safety net.

  • Quantization: INT8/INT4 (bitsandbytes, AWQ/GPTQ) on 7–13B models can halve memory and improve throughput. Guardrail: run your evals; if accuracy drops >1–2% on key tasks or hallucination rises, roll back (see the gate sketch after this list).
  • Compilation: TensorRT-LLM or OpenVINO for NVIDIA/CPU paths. Expect 1.3–2.0x speedups on stable shapes.
  • Bin packing: use GPU partitioning (A100 MIG), node affinities, and QoS classes. Keep latency-sensitive and batch jobs separate.
  • Spot/preemptibles: great for batch embedding or offline eval; dangerous for low-latency APIs unless you overprovision. Use checkpointing and warm spares.
  • Observability on the kernel: watch SM occupancy, H2D/D2H transfer, KV cache hit rate. If you don’t have nvidia-dcgm-exporter wired into Prometheus, do it.
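The eval gate from the quantization bullet above, as a minimal sketch; the metric names and thresholds are illustrative and should come from your own harness:

# eval_gate.py -- block a model swap or quantization unless deltas stay inside budget
MAX_ACCURACY_DROP = 0.01         # <= 1% absolute accuracy delta
MAX_HALLUCINATION_RISE = 0.005   # <= 0.5% rise in validator failures

def passes_gate(baseline: dict, candidate: dict) -> bool:
    acc_drop = baseline["accuracy"] - candidate["accuracy"]
    hall_rise = candidate["hallucination_rate"] - baseline["hallucination_rate"]
    return acc_drop <= MAX_ACCURACY_DROP and hall_rise <= MAX_HALLUCINATION_RISE

# Example: gate an INT8 candidate against the fp16 baseline (numbers made up)
baseline = {"accuracy": 0.912, "hallucination_rate": 0.008}
candidate = {"accuracy": 0.905, "hallucination_rate": 0.015}
if not passes_gate(baseline, candidate):
    print("gate failed: keep the fp16 baseline, skip the cutover")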

This is where execs want to start. Don’t. Get routing and caching right first; then these levers amplify the savings safely.

A one-week blueprint to cut 30% safely

You don’t need a platform rewrite. Do this in order.

  1. Wire up OpenTelemetry traces and Prometheus counters for tokens, dollars, latency, retries. Add dashboards by route/model/customer.
  2. Add a semantic cache with TTL tied to content versioning. Measure hit rate and saved tokens.
  3. Split traffic into three tiers with a fast classifier. Build an offline eval (Ragas + task-specific checks). Canary with Argo Rollouts.
  4. Turn on batching (vLLM or Triton) for self-hosted; tune max-num-batched-tokens. Cap concurrency, set Istio timeouts, add outlier detection.
  5. Add guardrails: JSON schema responses, citation checks, explicit refusal paths. Define and start tracking SLOs.
  6. If you self-host, test INT8 quantization or TensorRT-LLM on a shadow route. Ship only if eval deltas are green.

A B2B SaaS client running gpt-4o by default went from $220k/month to $108k in three weeks. Accuracy stayed flat; p95 latency improved 38%. The only re-architecture was routing + caching + batching. The rest were guardrails and SLOs.

What we ship at GitPlumbers

When we’re called into an “LLM bill is out of control” situation, we typically deliver:

  • A trace-first map of the AI flow with per-hop cost attribution.
  • A gated router with Argo Rollouts canaries and Istio policies.
  • Batching + semantic cache + RAG cleanup.
  • SLOs wired to PagerDuty via Prometheus alerts; KEDA autoscaling on queue depth.
  • A safety layer: schema validators, citation checks, and a human-in-the-loop switch for high-risk flows.
  • A regression harness for quantization/compilation and model swaps.

No silver bullets, just plumbing that works. If you want help shipping this without derailing delivery, you know where to find us.


Key takeaways

  • Instrument tokens, dollars, and quality signals at every hop—no visibility, no optimization.
  • Use gated routing: cheap models for easy requests, premium models only when a classifier says it’s needed.
  • Batching, caching, and retrieval do more for cost than any single infra tweak.
  • Control the tail with explicit SLOs, circuit breakers, and queue-aware autoscaling (KEDA).
  • Guardrails reduce both hallucination and re-run waste; validate, refuse, or escalate instead of guessing twice.
  • Quantize and compile models only with an eval harness that catches drift and quality regressions.

Implementation checklist

  • Add `OpenTelemetry` traces across RAG, prompt build, inference, and post-processing.
  • Export Prometheus metrics: `tokens_in`, `tokens_out`, `req_cost_usd`, cache hit rate, p95 latency, retry rate.
  • Introduce a 3-tier router (small/mid/premium) with an offline eval and canary via `Argo Rollouts`.
  • Turn on batching (`vLLM` or Triton) and a semantic cache (Redis/FAISS) with TTL tied to content drift.
  • Set SLOs and enforce with `Istio` outlier detection and timeouts; autoscale with `KEDA` on queue length.
  • Quantize/compile (INT8/4, TensorRT-LLM) only after pass/fail gates on accuracy and hallucination risk.
  • Wire safety guardrails: schema validation, citation checks, and human-in-the-loop for high-risk flows.

Questions we hear from teams

What’s the fastest way to get a per-request cost number?
Export `tokens_in` and `tokens_out` to Prometheus and multiply by current provider prices at ingestion time. Add the total as `req_cost_usd{model,route}` in the same metric family. Display it per request in Grafana and in your trace viewer (Tempo/Jaeger) with span attributes.
How do we know a smaller/quantized model is ‘good enough’?
Build an offline eval harness with task-specific metrics (exact match/F1/BLEU for structured tasks; Ragas precision/faithfulness for RAG). Set acceptance gates (e.g., ≤1% accuracy delta, ≤0.5% hallucination rise). Canary 5–10% live traffic with `Argo Rollouts` and compare p95/p99 and user-rated quality before full cutover.
What causes latency spikes and how do we stop them?
Common culprits: bursty traffic with no queue, unbounded concurrency, cold starts, and long prompts. Fix with queue-aware autoscaling (KEDA on queue depth), Istio timeouts/outlier detection, capped concurrency per pod, and smaller prompts via RAG/prompt compression.
Can we use spot instances for real-time inference?
Yes, but only if you overprovision capacity, use rapid rebalancing, and accept higher cost variance. It’s safer for batch/embeddings. For low-latency APIs, keep a baseline on on-demand and let spot cover overflow with quick failover.

