Stop Paying for Idle Tokens: Cost‑Optimizing AI Compute Without Breaking Quality
You don’t need a bigger GPU budget—you need instrumentation, routing, and guardrails that keep quality high while killing waste.
“Your GPU budget is not a strategy. Instrumentation, routing, and guardrails are.”
The bill spiked, the quality didn’t: what went wrong
I’ve watched teams ship a decent AI feature on gpt-4 and then wake up to a 4× bill, p95 latency crossing 3s, and customer complaints about “weird answers.” The root cause is usually the same: we over-provisioned compute while under-instrumenting quality. If you can’t see tokens, context size, cache hit rate, routing decisions, and safety outcomes per request, you’re flying blind.
This is the playbook we use at GitPlumbers to cut AI compute spend 25–60% without sacrificing quality—or your weekends.
Instrument everything: tokens, latency, quality, safety
If it touches an LLM, it gets a span. No exceptions. You want per-tenant, per-feature visibility across the entire RAG/agent chain.
- Trace every step: retrieval, re-ranking, prompt build, model call, post-processing, safety checks.
- Emit metrics: `tokens_in`, `tokens_out`, `total_cost_usd`, `latency_ms`, `cache_hit`, `safety_block`, `quality_score` (from the eval harness).
- Tag spans: `tenant_id`, `feature`, `model`, `route`, `prompt_hash`, `context_bytes`.
// typescript: express + openai + otel + prometheus client
import { OpenAI } from 'openai';
import { trace, SpanStatusCode } from '@opentelemetry/api';
import prom from 'prom-client';
import Redis from 'ioredis';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const redis = new Redis(process.env.REDIS_URL!);
// Prom metrics
const llmTokens = new prom.Counter({ name: 'llm_tokens_total', help: 'LLM tokens by direction', labelNames: ['dir','model','tenant'] });
const llmLatency = new prom.Histogram({ name: 'llm_latency_seconds', help: 'LLM latency', labelNames: ['model','route','tenant'], buckets: [0.1,0.2,0.5,1,2,3,5,8] });
const safetyBlocks = new prom.Counter({ name: 'safety_block_total', help: 'Blocks by policy', labelNames: ['policy','tenant'] });
export async function answer(req, res) {
const tracer = trace.getTracer('ai-service');
const { tenantId, question } = req.body;
const span = tracer.startSpan('qa_request', { attributes: { tenantId, feature: 'qa' } });
try {
const promptKey = `qa:${tenantId}:${Buffer.from(question).toString('base64')}`;
const cached = await redis.get(promptKey);
if (cached) { span.end(); return res.json({ fromCache: true, answer: JSON.parse(cached) }); }
// lightweight safety pre-check (PII regex or content filter)
if (/ssn:\s*\d{3}-\d{2}-\d{4}/i.test(question)) {
safetyBlocks.inc({ policy: 'pii_precheck', tenant: tenantId });
span.setStatus({ code: SpanStatusCode.ERROR, message: 'PII detected' });
span.end();
return res.status(400).json({ error: 'PII detected' });
}
const start = process.hrtime.bigint();
// route to cheapest viable model
const model = question.length < 280 ? 'gpt-4o-mini' : 'gpt-4o';
const resp = await openai.chat.completions.create({
model,
messages: [{ role: 'system', content: 'Be concise.' }, { role: 'user', content: question }],
temperature: 0.2,
});
const end = process.hrtime.bigint();
const sec = Number(end - start) / 1e9;
llmLatency.observe({ model, route: 'length_based', tenant: tenantId }, sec);
const usage = resp.usage ?? { prompt_tokens: 0, completion_tokens: 0 };
llmTokens.inc({ dir: 'in', model, tenant: tenantId }, usage.prompt_tokens);
llmTokens.inc({ dir: 'out', model, tenant: tenantId }, usage.completion_tokens);
const answer = resp.choices[0].message?.content ?? '';
await redis.setex(promptKey, 300, JSON.stringify(answer));
span.setAttribute('model', model);
span.setAttribute('tokens_in', usage.prompt_tokens);
span.setAttribute('tokens_out', usage.completion_tokens);
span.end();
res.json({ answer });
} catch (e) {
span.recordException(e);
span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
span.end();
res.status(500).json({ error: 'llm_error' });
}
}

Scrape the metrics with Prometheus and plot Grafana dashboards by model, route, tenant, and feature.
# prometheus scrape config
scrape_configs:
- job_name: 'ai-service'
static_configs:
- targets: ['ai-service.default.svc.cluster.local:9464']

Route to the cheapest model that meets the SLO
I’ve seen teams hardcode gpt-4 everywhere because “quality.” Then 70% of traffic is trivial queries. Use a router that picks the smallest model that satisfies the use case and escalates based on uncertainty.
- Start with tiers: `lite` (e.g., gpt-4o-mini, claude-3-haiku), `standard` (gpt-4o, claude-3.5-sonnet), `heavy` (tool use, long context).
- Confidence-based escalation: classify intent and estimate difficulty; only escalate on low confidence.
- Shadow + canary: run new routes in shadow, then canary 5–10% with Argo Rollouts/Istio.
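The tiering-plus-escalation logic is small enough to sketch directly. This is a minimal illustration, not our production router — the model names, the length-based difficulty heuristic, and the 0.7 confidence threshold are all assumptions you'd replace with your own classifier and price/quality data:

```python
# Hypothetical tiered router: pick the cheapest tier whose estimated
# difficulty fits, and escalate exactly one tier on low classifier confidence.
from dataclasses import dataclass

TIERS = ["lite", "standard", "heavy"]  # cheapest -> most capable
MODELS = {"lite": "gpt-4o-mini", "standard": "gpt-4o", "heavy": "gpt-4o-long"}  # illustrative

@dataclass
class RouteDecision:
    tier: str
    model: str
    escalated: bool

def estimate_difficulty(question: str) -> str:
    # Stand-in heuristic; in production this is an intent/difficulty classifier.
    if len(question) > 1000 or "step by step" in question.lower():
        return "heavy"
    if len(question) > 280:
        return "standard"
    return "lite"

def route(question: str, confidence: float, threshold: float = 0.7) -> RouteDecision:
    tier = estimate_difficulty(question)
    escalated = False
    if confidence < threshold and tier != "heavy":
        tier = TIERS[TIERS.index(tier) + 1]  # escalate one tier, never jump straight to heavy
        escalated = True
    return RouteDecision(tier=tier, model=MODELS[tier], escalated=escalated)
```

Log every `RouteDecision` as a span attribute so the escalation rate becomes a dashboard metric instead of folklore.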
# argo rollouts for routing service (canary 10%)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: llm-router
spec:
replicas: 6
strategy:
canary:
canaryService: llm-router-canary
stableService: llm-router-stable
steps:
- setWeight: 10
- pause: { duration: 10m }
- setWeight: 25
- pause: { duration: 20m }

Escalation should be a data-backed exception, not a developer’s fear of a support ticket.
Kill context bloat: prompt and RAG optimization
Most waste isn’t model choice—it’s context abuse. I’ve seen 30KB prompts feeding 200B tokens/month.
- Slim the prompt: remove “style” fluff; move long-lived instructions to the system prompt once.
- Retrieve less: top-k=3–5 with a re-ranker (Cohere Rerank, `bge-reranker`) instead of k=20.
- Chunk smarter: 200–500 token chunks with overlap tuned by evals.
- Cache everything: prompt templates (by hash), retrieval results (by query embedding), and final answers (short TTL).
- Deduplicate: semantic cache before the model.
# python: semantic cache + RAG trim
from redis import Redis
from sentence_transformers import SentenceTransformer, util
import json
redis = Redis.from_url("redis://...")
embedder = SentenceTransformer('all-MiniLM-L6-v2')
SEM_THRESHOLD = 0.92
def semantic_get(key, text):
e = embedder.encode([text], normalize_embeddings=True)
cand = redis.json().get(key)
if not cand: return None
if util.cos_sim(e, cand['embedding']).item() > SEM_THRESHOLD:
return cand['value']
return None
def semantic_set(key, text, value, ttl=300):
e = embedder.encode([text], normalize_embeddings=True)
redis.json().set(key, '$', { 'embedding': e.tolist(), 'value': value })
redis.expire(key, ttl)
# trim RAG context
def trim_context(chunks, max_tokens=1500):
ranked = sorted(chunks, key=lambda c: c['score'], reverse=True)
out, total = [], 0
for c in ranked:
if total + c['tokens'] > max_tokens: break
out.append(c['text']); total += c['tokens']
return out

If you self-host, use vLLM or TGI with quantization and tensor parallelism. Right-size the context window—don’t pay for 128k if you rarely exceed 8k.
# vLLM tuned for throughput
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mixtral-8x7B-Instruct \
--tensor-parallel-size 2 \
--max-num-batched-tokens 8192 \
--enforce-eager \
--quantization awq

Latency spikes? Batch, stream, and tune concurrency
Queues blow up at peak because we fire 1:1 requests and block on long generations. Fix the pipeline.
- Batching: combine homogeneous requests. Bedrock/OpenAI batch endpoints or your own micro-batcher.
- Server-side streaming: render partials to users; reduce p95 perceived latency and timeouts.
- Concurrency controls: backpressure and per-tenant rate limits; autoscale on queue depth or RPS.
- HTTP/2 + keepalive: reduce connection churn.
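A micro-batcher is just a bounded queue flushed by size or deadline. Here’s a threading-based sketch of the shape — `process_batch` is a stand-in for one batched model call serving many requests, and the size/deadline defaults are illustrative:

```python
# Minimal micro-batcher: flush when the batch is full or when the oldest
# request has waited max_wait_ms, whichever comes first.
import threading

class MicroBatcher:
    def __init__(self, process_batch, max_size=8, max_wait_ms=50):
        self.process_batch = process_batch  # fn: list[request] -> list[result]
        self.max_size = max_size
        self.max_wait = max_wait_ms / 1000
        self.pending = []  # list of (request, done_event, result_box)
        self.lock = threading.Lock()

    def submit(self, request):
        done, box = threading.Event(), {}
        flush_now = False
        with self.lock:
            self.pending.append((request, done, box))
            if len(self.pending) >= self.max_size:
                batch, self.pending = self.pending, []
                flush_now = True
            elif len(self.pending) == 1:
                # first item in a fresh batch starts the deadline timer
                threading.Timer(self.max_wait, self._flush).start()
        if flush_now:
            self._run(batch)
        done.wait()
        return box["result"]

    def _flush(self):
        with self.lock:
            batch, self.pending = self.pending, []
        if batch:
            self._run(batch)

    def _run(self, batch):
        results = self.process_batch([r for r, _, _ in batch])
        for (_, done, box), res in zip(batch, results):
            box["result"] = res
            done.set()
```

In production you’d add per-tenant fairness, backpressure when `pending` grows, and metrics on batch size and queue wait — those are the knobs that trade p50 latency for throughput.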
# OpenAI Batch API: upload a JSONL file of requests, then create the batch
curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="batch" \
  -F file="@requests.jsonl"
# create the batch using the file id returned above
curl https://api.openai.com/v1/batches \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input_file_id": "<file_id>", "endpoint": "/v1/chat/completions", "completion_window": "24h"}'

# Istio rate limit with Envoy Filter (per-tenant header)
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: tenant-ratelimit
spec:
configPatches:
- applyTo: HTTP_FILTER
match: { context: SIDECAR_INBOUND }
patch:
operation: INSERT_BEFORE
value:
name: envoy.filters.http.ratelimit
typed_config:
'@type': type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
domain: llm
request_type: both

Guardrails for hallucination, drift, and safety
Prompt nagging (“don’t make things up”) is not a control. Put checks in the path.
- Hallucination: require citations for RAG answers; verify citations exist in retrieved docs; fallback to “I don’t know” on low grounding score.
- Drift: nightly evals on a fixed benchmark set; flag degradation and auto-rollback router weights.
- Safety/PII: run content filters (Azure AI Content Safety, LlamaGuard) and PII redaction pre- and post-model.
- Tool safety: enforce allowlists and argument schemas; timeouts and circuit breakers.
// simple grounding check (extractKeyPhrase is a stand-in for your key-phrase extractor)
function grounded(answer: string, sources: string[]): boolean {
return sources.some(s => answer.includes(extractKeyPhrase(s)));
}
if (!grounded(modelAnswer, retrievedSnippets)) {
safetyBlocks.inc({ policy: 'ungrounded', tenant: tenantId });
return { answer: "I don't have enough information to answer confidently.", safe: false };
}

For structured-output guardrails, Guardrails, Rebuff, or a plain JSON Schema validator work well. Keep checks deterministic and log every block with a reason.
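The tool-safety bullet above doesn’t need a framework either. A sketch of a hand-rolled gate — the tool names and schemas here are illustrative, not a real registry:

```python
# Hypothetical tool gate: only allowlisted tools run, and their arguments
# must match a declared schema (required keys + types) before execution.
ALLOWED_TOOLS = {
    "search_docs": {"query": str, "top_k": int},
    "get_invoice": {"invoice_id": str},
}

def validate_tool_call(name, args):
    if name not in ALLOWED_TOOLS:
        return False, f"tool not allowlisted: {name}"
    schema = ALLOWED_TOOLS[name]
    for key, typ in schema.items():
        if key not in args:
            return False, f"missing argument: {key}"
        if not isinstance(args[key], typ):
            return False, f"bad type for {key}: expected {typ.__name__}"
    extra = set(args) - set(schema)
    if extra:
        return False, f"unexpected arguments: {sorted(extra)}"
    return True, "ok"
```

Every rejection increments a `safety_block` counter with the reason, so a model that starts inventing tool calls shows up on a dashboard, not in an incident review.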
Make AI SLOs first-class and observable
What you don’t measure, you’ll pay for. Treat AI flows like any other critical path.
- SLOs that matter:
  - p95 latency per feature: e.g., `<= 1.5s` for autocomplete, `<= 3s` for support answers.
  - Error rate: `<= 0.5%` (decide up front whether accepted safety blocks count as errors or as intentional denials).
  - Cost SLO: `tokens_per_request_p50` and `usd_per_session` budgets.
  - Quality SLO: benchmark accuracy/grounding via the eval harness.
- Dashboards: one pane for latency, error, cost, quality; drill down by tenant and route.
- Release process: shadow new models/routes, run offline evals, then canary. Roll back on SLO breach.
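The release-process bullet implies a gate between offline evals and canary. A minimal sketch of that gate, assuming your eval harness emits per-metric scores in [0, 1] (the metric names and tolerance here are placeholders):

```python
# Release gate sketch: a candidate model/route proceeds to canary only if
# every eval metric meets or beats the baseline within a small tolerance.
def passes_eval_gate(baseline_scores, candidate_scores, tolerance=0.01):
    """Each input maps metric name -> score in [0, 1]; higher is better."""
    for metric, base in baseline_scores.items():
        cand = candidate_scores.get(metric, 0.0)
        if cand < base - tolerance:
            return False, f"{metric} regressed: {cand:.3f} < {base:.3f}"
    return True, "ok"

# Example: accuracy improved but grounding regressed -> block the canary
baseline = {"accuracy": 0.86, "grounding": 0.91}
candidate = {"accuracy": 0.87, "grounding": 0.84}
```

Wire the gate into CI so a cheaper model only reaches Argo Rollouts after it has cleared the same benchmark the incumbent did.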
# Prometheus recording + alerting rules for the latency SLO (Sloth can generate these)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: ai-slo
spec:
groups:
- name: ai
rules:
- record: slo:llm_latency_p95
expr: histogram_quantile(0.95, sum(rate(llm_latency_seconds_bucket[5m])) by (le))
- alert: LlmLatencyP95TooHigh
expr: slo:llm_latency_p95 > 3
for: 10m
labels: { severity: 'page' }

Cost governance: budgets, quotas, and kill switches
This is where a lot of teams get religion. Put hard controls around spend.
- Budgets per tenant with soft alerts at 80% and hard blocks at 100%.
- OPA/Rego policies that check expected cost before executing a request (based on prompt+context tokens).
- Per-tenant concurrency + rate limits: keep noisy neighbors from wrecking your SLOs and wallet.
- Feature flags: kill switch for specific flows and models via Unleash/LaunchDarkly.
# OPA/Rego: deny when projected tokens exceed quota
package ai.quota
default allow = false
allow { projected_tokens <= input.tenant.quota_remaining }
projected_tokens := input.prompt_tokens + input.context_tokens + input.max_completion_tokens
violation[msg] {
not allow
msg := sprintf("quota exceeded: need %v, have %v", [projected_tokens, input.tenant.quota_remaining])
}

Wire policy checks into the router. Log every denial with tenant/contact so finance and CS aren’t blindsided.
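The Rego policy handles hard blocks; the soft-alert tier from the first bullet can live next to it in the router. A sketch — the price table and thresholds are illustrative assumptions, not real rates:

```python
# Budget check mirroring the policy: soft alert at 80% of budget, hard block
# at 100%. Cost is projected from token counts and a per-model price table.
PRICE_PER_1K_USD = {"gpt-4o-mini": 0.0006, "gpt-4o": 0.0100}  # illustrative blended rates

def projected_cost_usd(model, prompt_tokens, context_tokens, max_completion_tokens):
    total = prompt_tokens + context_tokens + max_completion_tokens
    return total / 1000 * PRICE_PER_1K_USD[model]

def budget_decision(spent_usd, budget_usd, request_cost_usd, soft=0.8, hard=1.0):
    after = spent_usd + request_cost_usd
    if after > budget_usd * hard:
        return "block"   # hard stop: deny the request, page nobody at 3am
    if after > budget_usd * soft:
        return "alert"   # soft threshold: allow, but notify finance/CS
    return "allow"
```

Because the projection runs before the model call, a runaway agent loop burns its budget in denials, not in tokens.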
Results we actually see
- Reduced token spend 30–55% with confidence-based routing and context trimming.
- Cut p95 latency 25–40% via caching, batching, and streaming.
- Fewer incidents and better MTTR with proper SLOs + canary/rollback.
- Safety incidents down >80% after in-path guardrails.
If your codebase is half AI-generated “vibe code” and half legacy glue, we’ve done the vibe code cleanup before. GitPlumbers can stand up the instrumentation, router, and guardrails in weeks, not quarters.
Key takeaways
- Instrument every AI call with token, latency, quality, and safety signals—no metrics, no mercy.
- Route to the cheapest model that meets the SLO; escalate only on uncertainty, not vibes.
- Kill context bloat. Retrieve less, chunk better, cache aggressively, and dedupe prompts.
- Batch, stream, and tune concurrency to avoid latency spikes and over-provisioning.
- Codify guardrails for hallucination, drift, and safety; don’t rely on prompt nagging.
- Make AI SLOs first-class (p95 latency, error rate, budget burn). Canary everything.
- Put hard budgets and per-tenant quotas behind policy, not hallway conversations.
Implementation checklist
- Add OpenTelemetry spans around every LLM/RAG step with token counters and user/tenant tags.
- Ship Prometheus metrics: llm_tokens_total, llm_requests_total, llm_latency_seconds, safety_block_total.
- Stand up a router with confidence-based model selection and cache-first semantics.
- Slim prompts and RAG context; add semantic and result caching via Redis/Vector DB.
- Enable batching/streaming; right-size concurrency with autoscaling on backpressure.
- Deploy content safety, PII redaction, and hallucination/consistency checks in-line.
- Define SLOs and wire canaries with Argo Rollouts; enable shadow traffic before cutovers.
- Enforce cost quotas via OPA/Rego; add per-tenant circuit breakers and kill switches.
Questions we hear from teams
- How do we prove cost savings without risking quality regressions?
- Shadow new routes against production traffic, collect side-by-side eval metrics (accuracy, grounding, CSAT proxies), and only canary once quality meets or exceeds baseline. Track cost per request/session as a first-class metric in the dashboard. Roll back automatically on SLO breach.
- What’s the quickest win if we have nothing today?
- Add token and latency instrumentation plus a Redis result cache. It’s a 2–3 day change that usually drops spend 10–20% and cuts p95 by a few hundred milliseconds—before you touch models or prompts.
- Self-host or vendor APIs?
- If your workload is spiky and you’re early, stick to OpenAI/Anthropic/Vertex/Bedrock. If you have steady, high-throughput workloads and strong infra chops, `vLLM`/`TGI` with quantization on A100/H100 can save 30–50%—but only if you batch and keep utilization high.
- How do we detect model drift?
- Run nightly/weekly evals on a fixed benchmark set and compare to a moving window of prod samples. Alert on statistically significant drops. Version prompts, retrieval configs, and models in Git (GitOps with ArgoCD) so you can bisect and roll back.
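"Statistically significant" can be a ten-line check when your benchmark is pass/fail. A stdlib-only sketch using a one-sided two-proportion z-test (the 1.645 critical value ≈ one-sided 95% confidence is an assumption; pick your own alpha):

```python
# Drift check sketch: flag when the current window's benchmark pass rate is
# significantly below the reference window's (one-sided two-proportion z-test).
import math

def drift_detected(ref_pass, ref_n, cur_pass, cur_n, z_crit=1.645):
    p_ref, p_cur = ref_pass / ref_n, cur_pass / cur_n
    p_pool = (ref_pass + cur_pass) / (ref_n + cur_n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / ref_n + 1 / cur_n))
    if se == 0:
        return False  # both windows identical and degenerate
    z = (p_ref - p_cur) / se  # positive when the current window is worse
    return z > z_crit
```

Run it nightly against the fixed benchmark set and alert into the same channel as your SLO breaches.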
- We have a lot of AI-generated code—does that change the plan?
- Yes. “Vibe coding” leaves dead paths and hidden costs. Start with a code rescue: delete unused chains/agents, consolidate routers, and add tests. Then layer in instrumentation and guardrails. We’ve cleaned this up for clients in weeks, not quarters.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
