The “One Tiny PR” That Made Checkout 800ms Slower (and Nobody Noticed for 6 Months)
Gradual performance regressions don’t show up as outages. They show up as churn, lower conversion, and a support queue full of “site feels slow” tickets. Here’s how to detect them early—automatically—using user-facing metrics that map to revenue.
Gradual performance regressions aren’t outages. They’re a tax on every customer interaction—and they compound until someone finally notices the revenue graph.
The slow leak nobody puts in the postmortem
I’ve watched teams with spotless uptime still bleed revenue because the app got a little slower every week. No SEV-1s. No pager storms. Just:
- Checkout completion drops 1–2% over a quarter
- Organic search slides because CWV fails on mobile
- Support tickets: “It’s laggy now” (aka: you can’t reproduce it on your MacBook on office Wi‑Fi)
The killer is that gradual regressions look “normal” in day-to-day development. A new npm dependency adds 40KB. A “temporary” debug log becomes permanent. An ORM upgrade changes a query plan. An AI-generated helper function adds an extra round trip to Redis “for safety.”
If you want to prevent this class of failure, you need performance regression detection that behaves like a seatbelt: it’s there every PR, every deploy, and it stops small mistakes from becoming expensive trends.
Use metrics your CFO would care about (even if they don’t know the acronyms)
If performance work isn’t tied to user impact, it dies in prioritization. The metrics that win budget are the ones you can map to conversion, retention, and support cost.
A pragmatic starter set:
- Core Web Vitals (p75): `LCP`, `INP`, `CLS` by page template and device class
- Backend latency: `p95`/`p99` for key endpoints (login, cart, checkout, search)
- Business funnel timings: time to add-to-cart, time to purchase, search-to-result time
- Failure amplification: error rate + latency (slow errors are conversion killers)
Two rules I’ve learned the hard way:
- Percentiles beat averages. p95 tells you what your real customers feel.
- Tag everything by route + release SHA. If you can’t answer “what deploy did this,” you’re doing archaeology, not engineering.
Callout: Performance regressions are rarely uniform. One route, one segment (Android WebView), one geo, one experiment bucket. Instrumentation needs enough labels to isolate without exploding cardinality.
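To make that concrete, here's a minimal sketch of route + release labeling in a Node service using prom-client; the middleware shape, bucket values, and the `RELEASE_SHA` env var are illustrative assumptions, not prescriptions:

```ts
// metrics.ts: sketch of labeling latency by route template + release SHA, not raw URL
import { Histogram } from 'prom-client';

export const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration by route template and release',
  labelNames: ['route', 'method', 'status', 'release'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const RELEASE = process.env.RELEASE_SHA ?? 'unknown';

// Express-style middleware: req.route?.path is the matched template (e.g. '/checkout/:step'),
// which keeps label cardinality bounded no matter how many IDs flow through the URL.
export function timeRequests(req: any, res: any, next: () => void) {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({
      route: req.route?.path ?? 'unmatched',
      method: req.method,
      status: String(res.statusCode),
      release: RELEASE,
    });
  });
  next();
}
```

Mount it ahead of your routes and every latency series can answer "which deploy did this" without a join.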
CI gates: stop obvious regressions before they hit real users
CI isn’t perfect for performance (shared runners, noisy neighbors), but it’s excellent at catching directional regressions: bundle bloat, obvious LCP/TTI hits, API slowdowns from a new query.
Lighthouse CI for front-end regressions
Run Lighthouse against main and against the PR build, and fail if key metrics regress beyond a tolerance.
```yaml
# .github/workflows/perf.yml
name: perf-regression
on:
  pull_request:
jobs:
  lighthouse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run build
      - run: npm run start & npx wait-on http://localhost:3000
      - name: Lighthouse CI
        run: |
          npm install -g @lhci/cli@0.13.x
          lhci autorun --config=./lighthouserc.json
```
```json
// lighthouserc.json
{
  "ci": {
    "collect": {
      "url": [
        "http://localhost:3000/",
        "http://localhost:3000/checkout"
      ],
      "numberOfRuns": 3,
      "settings": {
        "preset": "desktop"
      }
    },
    "assert": {
      "assertions": {
        "categories:performance": ["error", { "minScore": 0.75 }],
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
        "total-blocking-time": ["warn", { "maxNumericValue": 300 }]
      }
    }
  }
}
```

Make it survivable:
- Run 3–5 times and take the median to reduce flake
- Use deltas vs `main` when possible, not only hard thresholds (see the sketch below)
- Gate on the pages that matter: home is vanity; checkout is payroll
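Lighthouse CI doesn't hand you "delta vs `main`" out of the box, but you can fake it cheaply: store the median LCP per URL from a `main` build as an artifact, then compare the PR's runs against it. A sketch, where the baseline path, tolerance, and output file naming are assumptions (LHR field names can shift between Lighthouse versions):

```ts
// scripts/check-lcp-delta.ts: hypothetical delta gate against a stored main baseline
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

const TOLERANCE = 1.05; // allow up to +5% vs main

// Baseline produced on main, e.g. { "http://localhost:3000/checkout": 2100 } (median LCP in ms)
const baseline: Record<string, number> = JSON.parse(readFileSync('perf-baseline/lcp.json', 'utf8'));

// lhci collect drops one Lighthouse result (LHR) JSON per run into .lighthouseci/
const dir = '.lighthouseci';
const byUrl = new Map<string, number[]>();
for (const file of readdirSync(dir).filter((f) => f.startsWith('lhr-') && f.endsWith('.json'))) {
  const lhr = JSON.parse(readFileSync(join(dir, file), 'utf8'));
  const lcp = lhr.audits['largest-contentful-paint'].numericValue as number;
  byUrl.set(lhr.requestedUrl, [...(byUrl.get(lhr.requestedUrl) ?? []), lcp]);
}

const median = (xs: number[]) => [...xs].sort((a, b) => a - b)[Math.floor(xs.length / 2)];

let failed = false;
for (const [url, values] of byUrl) {
  const base = baseline[url];
  if (!base) continue; // no baseline for this page yet
  const current = median(values);
  if (current > base * TOLERANCE) {
    console.error(`LCP regression on ${url}: ${Math.round(current)}ms vs baseline ${base}ms`);
    failed = true;
  }
}
process.exit(failed ? 1 : 0);
```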
k6 smoke tests for API regressions
You don’t need a full load test per PR. You need a cheap tripwire that catches “this endpoint went from 120ms to 400ms.”
```js
// k6/smoke.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 5,
  duration: '30s',
  thresholds: {
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(95)<300']
  }
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/checkout/quote`);
  check(res, {
    'status is 200': (r) => r.status === 200
  });
  sleep(1);
}
```

Run it in CI against a preview environment. Keep it short, deterministic, and focused on tier‑1 endpoints.
Production regression detection: real-user metrics, real-world chaos
CI catches “obvious.” Production catches “truth.” You need both.
Ship RUM for Web Vitals (and tie it to deploys)
If you’re on the web, measure LCP/INP/CLS p75 in production with a release tag.
```ts
// rum/webVitals.ts
import { onLCP, onINP, onCLS } from 'web-vitals';

type Vital = { name: string; value: number };

function send(v: Vital) {
  navigator.sendBeacon(
    '/rum',
    JSON.stringify({
      ...v,
      path: location.pathname,
      release: (window as any).__RELEASE_SHA__,
      device: /Mobi/.test(navigator.userAgent) ? 'mobile' : 'desktop'
    })
  );
}

onLCP((m) => send({ name: 'LCP', value: m.value }));
onINP((m) => send({ name: 'INP', value: m.value }));
onCLS((m) => send({ name: 'CLS', value: m.value }));
```

Then aggregate by page template (not raw URL) to avoid cardinality blowups. The win is being able to say: “Release a1b2c3 increased checkout INP p75 by 80ms on mobile.” That’s actionable.
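“Page template” can be as dumb as a path normalizer in front of the beacon; the patterns below are placeholders for whatever your router actually serves:

```ts
// rum/normalizeRoute.ts: illustrative templates; collapse dynamic segments before they become labels
const TEMPLATES: Array<[RegExp, string]> = [
  [/^\/product\/[^/]+$/, '/product/:id'],
  [/^\/order\/[^/]+\/status$/, '/order/:id/status'],
  [/^\/checkout(\/.*)?$/, '/checkout'],
];

export function normalizeRoute(pathname: string): string {
  for (const [pattern, template] of TEMPLATES) {
    if (pattern.test(pathname)) return template;
  }
  // Heuristic fallback: strip numeric IDs and UUID-looking segments so raw URLs never explode cardinality
  return pathname.replace(
    /\/(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})(?=\/|$)/gi,
    '/:id'
  );
}
```

Swap it into `send()` above (`path: normalizeRoute(location.pathname)`) and your RUM labels stay bounded.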
Alert on deltas, not noise
Absolute thresholds are useful (“p95 > 1s is bad”), but gradual degradation needs baseline comparison.
In Prometheus/Grafana terms, record p95 latency per route and alert when it’s up meaningfully vs a trailing baseline.
```yaml
# prometheus recording + alert (illustrative)
groups:
  - name: api-latency
    rules:
      - record: route:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
          )
  - name: api-regressions
    rules:
      - alert: ApiLatencyRegression
        expr: |
          (route:http_request_duration_seconds:p95)
          >
          (avg_over_time(route:http_request_duration_seconds:p95[7d]) * 1.15)
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "p95 latency regressed >15% vs 7d baseline for {{ $labels.route }}"
```

Key detail: tie alerts to deploys. If you use ArgoCD, Flux, Spinnaker, or plain old GitHub deploys, emit a deploy event and overlay it on latency graphs. Otherwise you’re staring at squiggles.
Budgets that don’t become theater: how to keep gates from getting disabled
I’ve seen performance gates get added with enthusiasm… and removed two sprints later because they were flaky or blocked “important” launches.
Here’s what actually works:
- Start with warn-only for 2 weeks. Calibrate noise and fix flaky tests.
- Gate only tier‑1 routes. Checkout, auth, search—things customers feel.
- Use a two-tier budget:
  - Per-PR delta: e.g., no more than +5% on p95 in smoke tests
  - Weekly trend: e.g., no more than +10% week-over-week in production p75/p95
- Define an escape hatch with a cost. Allow override, but require:
  - a Jira ticket
  - an owner
  - a rollback plan
This keeps you honest without turning performance into a religion.
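If it helps to see the two-tier budget as code rather than policy prose, the decision logic is genuinely this small. A sketch, with thresholds mirroring the examples above; wire the inputs to your k6 summary and production metrics store:

```ts
// perf/budget.ts: illustrative two-tier budget evaluation
type BudgetResult = { ok: boolean; reason?: string };

// Tier 1: per-PR delta on smoke-test p95 vs the main baseline (e.g. +5%)
export function checkPrDelta(currentP95Ms: number, baselineP95Ms: number, maxDelta = 0.05): BudgetResult {
  const delta = (currentP95Ms - baselineP95Ms) / baselineP95Ms;
  return delta > maxDelta
    ? { ok: false, reason: `p95 up ${(delta * 100).toFixed(1)}% vs main (limit ${maxDelta * 100}%)` }
    : { ok: true };
}

// Tier 2: weekly trend on production p75/p95 (e.g. +10% week-over-week)
export function checkWeeklyTrend(thisWeekMs: number, lastWeekMs: number, maxDelta = 0.1): BudgetResult {
  const delta = (thisWeekMs - lastWeekMs) / lastWeekMs;
  return delta > maxDelta
    ? { ok: false, reason: `weekly latency up ${(delta * 100).toFixed(1)}% (limit ${maxDelta * 100}%)` }
    : { ok: true };
}
```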
Optimizations that reliably move the needle (with measurable outcomes)
When GitPlumbers gets called in, it’s often after months of “it’s only 50ms.” The fixes that pay back fastest are boring—and that’s why they work.
1) Kill payload bloat
- Add bundle-size budgets (Webpack/Next.js analyzers); see the sketch below
- Remove accidental polyfills and duplicate libraries (`moment` + `dayjs` is a classic)
- Prefer server-side aggregation over chatty clients
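A bundle-size budget doesn't need a SaaS. Here's a sketch that fails CI when gzipped first-load JS crosses a line; the build directory and budget file are assumptions for a Next.js-style setup, so point them at your bundler's output:

```ts
// scripts/check-bundle-size.ts: hypothetical budget file and build dir; adjust to your setup
import { readdirSync, readFileSync, statSync } from 'node:fs';
import { join } from 'node:path';
import { gzipSync } from 'node:zlib';

// e.g. { "maxGzipKb": 220 }, checked against the total gzipped size of JS chunks
const budget = JSON.parse(readFileSync('perf-budget.json', 'utf8')) as { maxGzipKb: number };

const buildDir = '.next/static/chunks'; // assumption: Next.js; swap for dist/, build/, etc.
let totalGzip = 0;
for (const file of readdirSync(buildDir)) {
  if (!file.endsWith('.js')) continue;
  const path = join(buildDir, file);
  if (!statSync(path).isFile()) continue;
  totalGzip += gzipSync(readFileSync(path)).length;
}

const totalKb = Math.round(totalGzip / 1024);
if (totalKb > budget.maxGzipKb) {
  console.error(`JS bundle budget exceeded: ${totalKb}KB gzip > ${budget.maxGzipKb}KB`);
  process.exit(1);
}
console.log(`JS bundle OK: ${totalKb}KB gzip (budget ${budget.maxGzipKb}KB)`);
```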
Measurable outcomes we routinely see:
- 10–30% smaller JS bundles → improved LCP on mid-tier Android
- Lower INP from less main-thread work
2) Cache like you mean it (CDN + app-level)
- Use `Cache-Control` correctly for static assets (`public, max-age=31536000, immutable`)
- Add CDN caching for anonymous GETs (with safe `Vary` headers)
- Cache expensive backend reads with explicit TTLs and stampede protection
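For the app-level piece, the core of “TTL plus stampede protection” is a single-flight cache: concurrent misses share one origin call instead of piling on. A minimal in-memory sketch; production versions usually sit on Redis with a lock or early-refresh jitter:

```ts
// cache/singleFlight.ts: minimal TTL cache with stampede protection (illustrative)
type Entry<T> = { value: T; expiresAt: number };

const values = new Map<string, Entry<unknown>>();
const inFlight = new Map<string, Promise<unknown>>();

export async function cached<T>(key: string, ttlMs: number, load: () => Promise<T>): Promise<T> {
  const hit = values.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value as T;

  // Stampede protection: concurrent misses reuse the same in-flight load
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;

  const promise = load()
    .then((value) => {
      values.set(key, { value, expiresAt: Date.now() + ttlMs });
      return value;
    })
    .finally(() => inFlight.delete(key));

  inFlight.set(key, promise);
  return promise;
}

// usage sketch (names hypothetical): const quote = await cached(`quote:${cartId}`, 30_000, () => fetchQuote(cartId));
```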
Outcomes:
- TTFB down 50–200ms on cacheable pages
- Origin load reduction that delays your next infra spend
3) Stop query plan roulette
Performance regressions love ORM upgrades and “harmless” migrations.
- Track slow queries (`pg_stat_statements` for Postgres)
- Use `EXPLAIN (ANALYZE, BUFFERS)` on regressions (see the sketch below)
- Add or adjust indexes intentionally; avoid “index everything” cargo cult
- Pin/query-shape hot paths (yes, sometimes raw SQL is the adult move)
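To make plan checks repeatable, capture the `EXPLAIN (ANALYZE, BUFFERS)` output for a hot path in staging and diff it across releases. A sketch assuming node-postgres; the query and table are placeholders, and you should never run ANALYZE casually against a loaded production primary:

```ts
// scripts/explain-hot-path.ts: capture a query plan for diffing (query/table names are placeholders)
import { Client } from 'pg';

async function main() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const { rows } = await client.query(
      "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at DESC LIMIT 20"
    );
    // Each row is one line of the text plan; commit or archive it so plan changes show up in review
    for (const row of rows) console.log(row['QUERY PLAN']);
  } finally {
    await client.end();
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```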
Outcomes:
- p95 API latency improvements of 2–10x on a single hot endpoint
- Reduced DB CPU and fewer noisy-neighbor incidents
4) Put timeouts and circuit breakers where AI-generated code forgot them
A lot of AI-assisted code “works” but quietly turns retries into latency multipliers.
- Set client timeouts (`axios`, `fetch`, gRPC) explicitly
- Cap retries; use exponential backoff with jitter
- Use bulkheads/circuit breakers (Envoy, `resilience4j`, or service mesh policies)
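Here's what the first two look like for plain `fetch`; the timeout, retry cap, and backoff numbers are illustrative defaults, not gospel:

```ts
// http/resilientFetch.ts: explicit timeout, capped retries, exponential backoff with jitter
// (assumes a runtime with global fetch and AbortSignal.timeout, e.g. modern Node or browsers)
export async function resilientFetch(
  url: string,
  init: RequestInit = {},
  { timeoutMs = 2000, maxRetries = 2 } = {}
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
      // Retry only on 5xx; retrying 4xx just adds latency for the same answer
      if (res.status < 500) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    if (attempt < maxRetries) {
      // Full jitter caps the worst case instead of multiplying it
      const backoff = Math.random() * Math.min(1000 * 2 ** attempt, 4000);
      await new Promise((resolve) => setTimeout(resolve, backoff));
    }
  }
  throw lastError;
}
```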
Outcomes:
- Better tail latency (p99) under partial failures
- Lower support volume during provider blips
How to roll this out without derailing delivery
If you try to boil the ocean, you’ll get nothing but meetings.
A rollout plan that’s survived real orgs:
- Week 1: pick KPIs + add deploy markers + build the first dashboard
- Week 2: add CI gates (warn-only), fix flake, document escape hatch
- Week 3: enforce gates on 1–2 tier‑1 routes, add baseline regression alerting
- Week 4: expand coverage, start a weekly “perf triage” (30 minutes, strict)
The business payoff shows up faster than you’d think when you focus on customer paths:
- Higher conversion from faster checkout
- Better SEO and ad landing page quality scores
- Fewer “it’s slow” tickets (which are expensive because they’re hard to reproduce)
If you’re already deep in the swamp—legacy frontend, microservices, and a bit of vibe-coded glue—GitPlumbers can help you set up regression detection that engineers won’t disable and product will actually care about.
Next step: make one critical path (checkout/search/login) impossible to slow down silently.
Key takeaways
- Gradual performance degradation is a product and revenue problem, not a “nice-to-have” engineering metric.
- Use user-facing metrics (CWV, p95/p99, conversion funnel timings) and treat them as release gates.
- Regression detection needs two layers: CI (prevent obvious foot-guns) and production (catch real-user reality).
- Budgets must be statistical (percentiles, deltas, variance), not a single hard-coded number that flakes.
- Instrumentation and alerting should point to the commit, route, and change type—not just “latency is up.”
- A small set of repeatable optimizations (payload, caching, DB plan stability, front-end budgets) usually pays back within weeks.
Implementation checklist
- Pick 3–5 user-facing performance KPIs (e.g., LCP p75, INP p75, API p95, checkout p95, error rate).
- Define a regression policy: allowed delta per KPI per PR and per week.
- Add CI gates: Lighthouse CI for front-end, k6 smoke for APIs, bundle-size checks.
- Baseline against `main` or last release and compare deltas—not absolute numbers only.
- Ship RUM (Web Vitals + route tags) and correlate with deploy SHA.
- Create Prometheus recording rules for p95/p99 and alert on sustained deltas vs 7-day baseline.
- Roll out with canaries and auto-rollback on budget breach.
- Review regression trends weekly with product + eng; treat performance like reliability.
Questions we hear from teams
- Should we gate on synthetic metrics or real-user metrics?
  - Both. Use CI synthetic gates (Lighthouse/k6) to block obvious regressions before merge, then use production RUM and p95/p99 SLOs to catch real-world issues CI can’t model (devices, networks, third-party tags, geo).
- How do we avoid flaky performance tests in CI?
  - Run multiple iterations and take the median, limit pages/endpoints to tier‑1 flows, compare PR deltas vs `main` rather than absolute numbers only, and start warn-only while you calibrate variance.
- What’s a reasonable regression budget?
  - For CI smoke tests, start with a small allowed delta like +5% on p95 duration for tier‑1 endpoints. In production, alert on sustained deltas like +10–15% vs a 7-day baseline for at least 15–30 minutes to avoid noise.
- We’re drowning in dashboards—what’s the minimal setup?
  - One dashboard: CWV p75 by page template + API p95 for tier‑1 endpoints + deploy markers. One alert: sustained p95 delta vs baseline on checkout/search. Add more only after those are stable and trusted.
- How does GitPlumbers typically engage on this?
  - We usually start with a short diagnostic: instrument the critical path, add deploy correlation, set budgets, and ship the first CI+prod regression gates. Then we tackle the top 2–3 bottlenecks (payload, caching, DB plan, or tail-latency resilience) with measurable before/after results.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
