The Performance Playbooks We Run When Prod Is Melting: CPU, I/O, Locks, and the Service Mesh

No silver bullets. Just repeatable playbooks with checkpoints, metrics, and configs you can ship today.

If you don’t baseline and classify the bottleneck first, every fix is just performance theater.

Key takeaways

  • Define SLOs and repeatable baselines before tuning; otherwise you’re just moving noise around.
  • Use targeted playbooks for CPU, I/O/DB, contention, and network—don’t mix fixes without a hypothesis.
  • Instrument first: tracing + profiling + load gen beats hunches every time.
  • Bake guardrails (timeouts, retries, circuit breakers) into the mesh and app, then GitOps them.
  • Automate verification: load-test gates, burn-rate alerts, and canaries keep you honest post-merge.

Implementation checklist

  • Write down the SLO and error budget for the user journey you’re tuning.
  • Capture a 10–15 minute load profile with tracing and CPU profiles.
  • Classify the bottleneck: CPU, I/O/DB, lock/contention, or network.
  • Apply the corresponding playbook with checkpoints and rollback plan.
  • Verify improvement with the same load and dashboards, then GitOps the config.
  • Add guardrails: burn-rate alert, canary rollout, and a runbook entry.
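The "verify with the same load" step can be automated as a CI gate. A minimal sketch, assuming your load tool can export per-request latencies in milliseconds as a plain list; the SLO threshold and sample values are illustrative placeholders:

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a latency sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def gate(latencies_ms, slo_p95_ms):
    """Return True (pass) when the run's p95 is within the SLO."""
    return p95(latencies_ms) <= slo_p95_ms

# Example run: have CI fail the pipeline when gate() returns False.
sample = [120, 95, 310, 140, 88, 102, 99, 450, 130, 110]
passed = gate(sample, slo_p95_ms=400)
```

Wire the boolean into your pipeline's exit code so a regression blocks the merge instead of surfacing in prod.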

Questions we hear from teams

How do I pick the first bottleneck to chase?
Trace a single hot user journey end-to-end and find the largest contiguous slice of time on the critical path. Tackle CPU vs. I/O vs. network separately. If you can’t classify it in 30 minutes with profiles and traces, your observability is the real blocker.
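That 30-minute classification can be mechanical. A hedged sketch, assuming your tracer can export spans as `(category, self_time_ms)` pairs; the category labels are illustrative:

```python
from collections import Counter

def classify(spans):
    """Return the category holding the most critical-path self-time."""
    totals = Counter()
    for category, self_time_ms in spans:
        totals[category] += self_time_ms
    category, _ = totals.most_common(1)[0]
    return category

# Illustrative span export from one hot user journey.
spans = [("db", 220.0), ("cpu", 80.0), ("network", 45.0), ("db", 130.0)]
dominant = classify(spans)  # "db" here, so the I/O/DB playbook goes first
```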
What’s a realistic improvement target?
For a single playbook cycle, 20–50% p95 improvement is common without major rewrites. Bigger wins come from removing whole round-trips (N+1s), batching, and establishing mesh guardrails.
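Removing whole round-trips is where the big wins live. A minimal sketch of collapsing an N+1 into one batched lookup; `db` is a stand-in dict and the query counter is only there to make the round-trip count visible, so swap in your real ORM or data layer:

```python
def render_orders_n_plus_1(orders, db, counter):
    # One lookup per order: N round-trips for N orders.
    out = []
    for order in orders:
        counter["queries"] += 1
        out.append((order["id"], db[order["user_id"]]))
    return out

def render_orders_batched(orders, db, counter):
    # One bulk lookup total, regardless of N (think WHERE id IN (...)).
    counter["queries"] += 1
    users = {uid: db[uid] for uid in {o["user_id"] for o in orders}}
    return [(o["id"], users[o["user_id"]]) for o in orders]
```

Same output, one round-trip instead of N; on a chatty ORM path this alone can deliver the p95 win.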
We’re small—do we really need Istio/ArgoCD?
You need the behaviors—timeouts, retries, circuit breakers, GitOps—not necessarily the heaviest tools. NGINX or an API gateway can do timeouts; simple workflows can GitOps via Terraform + CI. Start lean, keep the discipline.
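If you skip the mesh, the behaviors still have to live somewhere, and the app is a fine place to start. A minimal in-app sketch of retries plus a failure-count circuit breaker; the class, thresholds, and `call` hook are illustrative, not from any particular library:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def call_with_guardrails(call, breaker, retries=2):
    """Run call(); retry on failure; trip the breaker on repeated errors."""
    if not breaker.allow():
        raise RuntimeError("circuit open: failing fast")
    for attempt in range(retries + 1):
        try:
            result = call()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt == retries or not breaker.allow():
                raise
```

Pair this with a hard client timeout on the underlying call (e.g., your HTTP client's timeout parameter) so a slow dependency can't hold threads hostage while the breaker decides.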
What about AI-written code performance pitfalls?
We routinely see AI generate chatty ORMs, quadratic JSON munging, and unbounded concurrency. Add profiling to your PR template and watch for allocation-heavy code. A quick AI code refactoring pass can claw back your error budget.
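The quadratic-munging pattern is usually an O(n) membership check inside a loop. A sketch of the shape we see in review and the linear fix; both functions produce identical output:

```python
def dedupe_quadratic(records):
    out = []
    for r in records:
        if r not in out:   # list membership: O(n) per check, O(n^2) overall
            out.append(r)
    return out

def dedupe_linear(records):
    seen, out = set(), []
    for r in records:
        if r not in seen:  # set membership: O(1) average
            seen.add(r)
            out.append(r)
    return out
```

On a few thousand parsed JSON records the difference is invisible; at a few million it's the whole incident.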
How do I prove the fix to the business?
Before/after dashboards with the same load profile, annotated deploys, and SLO/error budget math. Tie it to revenue proxies: conversion rate, abandonment, or orders/min. If p95 drops 40% and orders/min rise, the conversation is easy.
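The SLO math itself fits in a few lines. A sketch assuming a 99.9% availability SLO over a 30-day window; the bad-minute fraction is illustrative:

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed bad minutes in the rolling window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(bad_fraction, slo):
    """How many times faster than 'exactly on budget' you are burning."""
    return bad_fraction / (1 - slo)

budget = error_budget_minutes(0.999)              # 43.2 minutes per 30 days
rate = burn_rate(bad_fraction=0.004, slo=0.999)   # 4x burn: budget gone in ~7.5 days
```

A "we cut burn rate from 4x to under 1x" sentence, next to the conversion graph, is the business case.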

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Turn your war room notes into real playbooks. Get Vibe Coding Help for AI-generated performance issues.
