The “One Tiny PR” That Made Checkout 800ms Slower (and Nobody Noticed for 6 Months)
Gradual performance regressions don’t show up as outages. They show up as churn, lower conversion, and a support queue full of “site feels slow” tickets. Here’s how to detect them early—automatically—using user-facing metrics that map to revenue.
Gradual performance regressions aren’t outages. They’re a tax on every customer interaction—and they compound until someone finally notices the revenue graph.
The slow leak nobody puts in the postmortem
I’ve watched teams with spotless uptime still bleed revenue because the app got a little slower every week. No SEV-1s. No pager storms. Just:
- Checkout completion drops 1–2% over a quarter
- Organic search slides because CWV fails on mobile
- Support tickets: “It’s laggy now” (aka: you can’t reproduce it on your MacBook on office Wi‑Fi)
The killer is that gradual regressions look “normal” in day-to-day development. A new npm dependency adds 40KB. A “temporary” debug log becomes permanent. An ORM upgrade changes a query plan. An AI-generated helper function adds an extra round trip to Redis “for safety.”
If you want to prevent this class of failure, you need performance regression detection that behaves like a seatbelt: it’s there every PR, every deploy, and it stops small mistakes from becoming expensive trends.
Use metrics your CFO would care about (even if they don’t know the acronyms)
If performance work isn’t tied to user impact, it dies in prioritization. The metrics that win budget are the ones you can map to conversion, retention, and support cost.
A pragmatic starter set:
- Core Web Vitals (p75): `LCP`, `INP`, `CLS` by page template and device class
- Backend latency: `p95`/`p99` for key endpoints (login, cart, checkout, search)
- Business funnel timings: time to add-to-cart, time to purchase, search-to-result time
- Failure amplification: error rate + latency (slow errors are conversion killers)
Two rules I’ve learned the hard way:
- Percentiles beat averages. p95 tells you what your real customers feel.
- Tag everything by route + release SHA. If you can’t answer “what deploy did this,” you’re doing archaeology, not engineering.
Callout: Performance regressions are rarely uniform. One route, one segment (Android WebView), one geo, one experiment bucket. Instrumentation needs enough labels to isolate without exploding cardinality.
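To make that concrete, here's a minimal sketch of route + release labeling in a Node service using prom-client; the middleware shape, bucket values, and the `RELEASE_SHA` env var are illustrative assumptions, not prescriptions:

```ts
// metrics.ts: sketch of labeling latency by route template + release SHA, not raw URL
import { Histogram } from 'prom-client';

export const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration by route template and release',
  labelNames: ['route', 'method', 'status', 'release'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const RELEASE = process.env.RELEASE_SHA ?? 'unknown';

// Express-style middleware: req.route?.path is the matched template (e.g. '/checkout/:step'),
// which keeps label cardinality bounded no matter how many IDs flow through the URL.
export function timeRequests(req: any, res: any, next: () => void) {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({
      route: req.route?.path ?? 'unmatched',
      method: req.method,
      status: String(res.statusCode),
      release: RELEASE,
    });
  });
  next();
}
```

Mount it ahead of your routes and every latency series can answer "which deploy did this" without a join.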
CI gates: stop obvious regressions before they hit real users
CI isn’t perfect for performance (shared runners, noisy neighbors), but it’s excellent at catching directional regressions: bundle bloat, obvious LCP/TTI hits, API slowdowns from a new query.
Lighthouse CI for front-end regressions
Run Lighthouse against main and against the PR build, and fail if key metrics regress beyond a tolerance.
```yaml
# .github/workflows/perf.yml
name: perf-regression
on:
  pull_request:
jobs:
  lighthouse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run build
      - run: npm run start & npx wait-on http://localhost:3000
      - name: Lighthouse CI
        run: |
          npm install -g @lhci/cli@0.13.x
          lhci autorun --config=./lighthouserc.json
```
```json
// lighthouserc.json
{
  "ci": {
    "collect": {
      "url": [
        "http://localhost:3000/",
        "http://localhost:3000/checkout"
      ],
      "numberOfRuns": 3,
      "settings": {
        "preset": "desktop"
      }
    },
    "assert": {
      "assertions": {
        "categories:performance": ["error", { "minScore": 0.75 }],
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
        "total-blocking-time": ["warn", { "maxNumericValue": 300 }]
      }
    }
  }
}
```

Make it survivable:
- Run 3–5 times and take the median to reduce flake
- Use deltas vs `main` when possible, not only hard thresholds (see the sketch below)
- Gate on the pages that matter: home is vanity; checkout is payroll
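Lighthouse CI doesn't hand you "delta vs `main`" out of the box, but you can fake it cheaply: store the median LCP per URL from a `main` build as an artifact, then compare the PR's runs against it. A sketch, where the baseline path, tolerance, and output file naming are assumptions (LHR field names can shift between Lighthouse versions):

```ts
// scripts/check-lcp-delta.ts: hypothetical delta gate against a stored main baseline
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

const TOLERANCE = 1.05; // allow up to +5% vs main

// Baseline produced on main, e.g. { "http://localhost:3000/checkout": 2100 } (median LCP in ms)
const baseline: Record<string, number> = JSON.parse(readFileSync('perf-baseline/lcp.json', 'utf8'));

// lhci collect drops one Lighthouse result (LHR) JSON per run into .lighthouseci/
const dir = '.lighthouseci';
const byUrl = new Map<string, number[]>();
for (const file of readdirSync(dir).filter((f) => f.startsWith('lhr-') && f.endsWith('.json'))) {
  const lhr = JSON.parse(readFileSync(join(dir, file), 'utf8'));
  const lcp = lhr.audits['largest-contentful-paint'].numericValue as number;
  byUrl.set(lhr.requestedUrl, [...(byUrl.get(lhr.requestedUrl) ?? []), lcp]);
}

const median = (xs: number[]) => [...xs].sort((a, b) => a - b)[Math.floor(xs.length / 2)];

let failed = false;
for (const [url, values] of byUrl) {
  const base = baseline[url];
  if (!base) continue; // no baseline for this page yet
  const current = median(values);
  if (current > base * TOLERANCE) {
    console.error(`LCP regression on ${url}: ${Math.round(current)}ms vs baseline ${base}ms`);
    failed = true;
  }
}
process.exit(failed ? 1 : 0);
```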
k6 smoke tests for API regressions
You don’t need a full load test per PR. You need a cheap tripwire that catches “this endpoint went from 120ms to 400ms.”
```js
// k6/smoke.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 5,
  duration: '30s',
  thresholds: {
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(95)<300']
  }
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL}/api/checkout/quote`);
  check(res, {
    'status is 200': (r) => r.status === 200
  });
  sleep(1);
}
```

Run it in CI against a preview environment. Keep it short, deterministic, and focused on tier‑1 endpoints.
Production regression detection: real-user metrics, real-world chaos
CI catches “obvious.” Production catches “truth.” You need both.
Ship RUM for Web Vitals (and tie it to deploys)
If you’re on the web, measure LCP/INP/CLS p75 in production with a release tag.
```ts
// rum/webVitals.ts
import { onLCP, onINP, onCLS } from 'web-vitals';

type Vital = { name: string; value: number };

function send(v: Vital) {
  navigator.sendBeacon(
    '/rum',
    JSON.stringify({
      ...v,
      path: location.pathname,
      release: (window as any).__RELEASE_SHA__,
      device: /Mobi/.test(navigator.userAgent) ? 'mobile' : 'desktop'
    })
  );
}

onLCP((m) => send({ name: 'LCP', value: m.value }));
onINP((m) => send({ name: 'INP', value: m.value }));
onCLS((m) => send({ name: 'CLS', value: m.value }));
```

Then aggregate by page template (not raw URL) to avoid cardinality blowups. The win is being able to say: “Release a1b2c3 increased checkout INP p75 by 80ms on mobile.” That’s actionable.
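“Page template” can be as dumb as a path normalizer in front of the beacon; the patterns below are placeholders for whatever your router actually serves:

```ts
// rum/normalizeRoute.ts: illustrative templates; collapse dynamic segments before they become labels
const TEMPLATES: Array<[RegExp, string]> = [
  [/^\/product\/[^/]+$/, '/product/:id'],
  [/^\/order\/[^/]+\/status$/, '/order/:id/status'],
  [/^\/checkout(\/.*)?$/, '/checkout'],
];

export function normalizeRoute(pathname: string): string {
  for (const [pattern, template] of TEMPLATES) {
    if (pattern.test(pathname)) return template;
  }
  // Heuristic fallback: strip numeric IDs and UUID-looking segments so raw URLs never explode cardinality
  return pathname.replace(
    /\/(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})(?=\/|$)/gi,
    '/:id'
  );
}
```

Swap it into `send()` above (`path: normalizeRoute(location.pathname)`) and your RUM labels stay bounded.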
Alert on deltas, not noise
Absolute thresholds are useful (“p95 > 1s is bad”), but gradual degradation needs baseline comparison.
In Prometheus/Grafana terms, record p95 latency per route and alert when it’s up meaningfully vs a trailing baseline.
```yaml
# prometheus recording + alert (illustrative)
groups:
  - name: api-latency
    rules:
      - record: route:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
          )
  - name: api-regressions
    rules:
      - alert: ApiLatencyRegression
        expr: |
          (route:http_request_duration_seconds:p95)
          >
          (avg_over_time(route:http_request_duration_seconds:p95[7d]) * 1.15)
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "p95 latency regressed >15% vs 7d baseline for {{ $labels.route }}"
```

Key detail: tie alerts to deploys. If you use ArgoCD, Flux, Spinnaker, or plain old GitHub deploys, emit a deploy event and overlay it on latency graphs. Otherwise you’re staring at squiggles.
Budgets that don’t become theater: how to keep gates from getting disabled
I’ve seen performance gates get added with enthusiasm… and removed two sprints later because they were flaky or blocked “important” launches.
Here’s what actually works:
- Start with warn-only for 2 weeks. Calibrate noise and fix flaky tests.
- Gate only tier‑1 routes. Checkout, auth, search—things customers feel.
- Use a two-tier budget:
  - Per-PR delta: e.g., no more than +5% on p95 in smoke tests
  - Weekly trend: e.g., no more than +10% week-over-week in production p75/p95
- Define an escape hatch with a cost. Allow override, but require:
  - a Jira ticket
  - an owner
  - a rollback plan
This keeps you honest without turning performance into a religion.
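If it helps to see the two-tier budget as code rather than policy prose, the decision logic is genuinely this small. A sketch, with thresholds mirroring the examples above; wire the inputs to your k6 summary and production metrics store:

```ts
// perf/budget.ts: illustrative two-tier budget evaluation
type BudgetResult = { ok: boolean; reason?: string };

// Tier 1: per-PR delta on smoke-test p95 vs the main baseline (e.g. +5%)
export function checkPrDelta(currentP95Ms: number, baselineP95Ms: number, maxDelta = 0.05): BudgetResult {
  const delta = (currentP95Ms - baselineP95Ms) / baselineP95Ms;
  return delta > maxDelta
    ? { ok: false, reason: `p95 up ${(delta * 100).toFixed(1)}% vs main (limit ${maxDelta * 100}%)` }
    : { ok: true };
}

// Tier 2: weekly trend on production p75/p95 (e.g. +10% week-over-week)
export function checkWeeklyTrend(thisWeekMs: number, lastWeekMs: number, maxDelta = 0.1): BudgetResult {
  const delta = (thisWeekMs - lastWeekMs) / lastWeekMs;
  return delta > maxDelta
    ? { ok: false, reason: `weekly latency up ${(delta * 100).toFixed(1)}% (limit ${maxDelta * 100}%)` }
    : { ok: true };
}
```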
Optimizations that reliably move the needle (with measurable outcomes)
When GitPlumbers gets called in, it’s often after months of “it’s only 50ms.” The fixes that pay back fastest are boring—and that’s why they work.
1) Kill payload bloat
- Add bundle-size budgets (Webpack/Next.js analyzers); see the sketch below
- Remove accidental polyfills and duplicate libraries (`moment` + `dayjs` is a classic)
- Prefer server-side aggregation over chatty clients
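A bundle-size budget doesn't need a SaaS. Here's a sketch that fails CI when gzipped first-load JS crosses a line; the build directory and budget file are assumptions for a Next.js-style setup, so point them at your bundler's output:

```ts
// scripts/check-bundle-size.ts: hypothetical budget file and build dir; adjust to your setup
import { readdirSync, readFileSync, statSync } from 'node:fs';
import { join } from 'node:path';
import { gzipSync } from 'node:zlib';

// e.g. { "maxGzipKb": 220 }, checked against the total gzipped size of JS chunks
const budget = JSON.parse(readFileSync('perf-budget.json', 'utf8')) as { maxGzipKb: number };

const buildDir = '.next/static/chunks'; // assumption: Next.js; swap for dist/, build/, etc.
let totalGzip = 0;
for (const file of readdirSync(buildDir)) {
  if (!file.endsWith('.js')) continue;
  const path = join(buildDir, file);
  if (!statSync(path).isFile()) continue;
  totalGzip += gzipSync(readFileSync(path)).length;
}

const totalKb = Math.round(totalGzip / 1024);
if (totalKb > budget.maxGzipKb) {
  console.error(`JS bundle budget exceeded: ${totalKb}KB gzip > ${budget.maxGzipKb}KB`);
  process.exit(1);
}
console.log(`JS bundle OK: ${totalKb}KB gzip (budget ${budget.maxGzipKb}KB)`);
```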
Measurable outcomes we routinely see:
- 10–30% smaller JS bundles → improved LCP on mid-tier Android
- Lower INP from less main-thread work
2) Cache like you mean it (CDN + app-level)
- Use `Cache-Control` correctly for static assets (`public, max-age=31536000, immutable`)
- Add CDN caching for anonymous GETs (with safe `Vary` headers)
- Cache expensive backend reads with explicit TTLs and stampede protection
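For the app-level piece, the core of “TTL plus stampede protection” is a single-flight cache: concurrent misses share one origin call instead of piling on. A minimal in-memory sketch; production versions usually sit on Redis with a lock or early-refresh jitter:

```ts
// cache/singleFlight.ts: minimal TTL cache with stampede protection (illustrative)
type Entry<T> = { value: T; expiresAt: number };

const values = new Map<string, Entry<unknown>>();
const inFlight = new Map<string, Promise<unknown>>();

export async function cached<T>(key: string, ttlMs: number, load: () => Promise<T>): Promise<T> {
  const hit = values.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value as T;

  // Stampede protection: concurrent misses reuse the same in-flight load
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;

  const promise = load()
    .then((value) => {
      values.set(key, { value, expiresAt: Date.now() + ttlMs });
      return value;
    })
    .finally(() => inFlight.delete(key));

  inFlight.set(key, promise);
  return promise;
}

// usage sketch (names hypothetical): const quote = await cached(`quote:${cartId}`, 30_000, () => fetchQuote(cartId));
```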
Outcomes:
- TTFB down 50–200ms on cacheable pages
- Origin load reduction that delays your next infra spend
3) Stop query plan roulette
Performance regressions love ORM upgrades and “harmless” migrations.
- Track slow queries (`pg_stat_statements` for Postgres)
- Use `EXPLAIN (ANALYZE, BUFFERS)` on regressions (see the sketch below)
- Add or adjust indexes intentionally; avoid “index everything” cargo cult
- Pin/query-shape hot paths (yes, sometimes raw SQL is the adult move)
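To make plan checks repeatable, capture the `EXPLAIN (ANALYZE, BUFFERS)` output for a hot path in staging and diff it across releases. A sketch assuming node-postgres; the query and table are placeholders, and you should never run ANALYZE casually against a loaded production primary:

```ts
// scripts/explain-hot-path.ts: capture a query plan for diffing (query/table names are placeholders)
import { Client } from 'pg';

async function main() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const { rows } = await client.query(
      "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at DESC LIMIT 20"
    );
    // Each row is one line of the text plan; commit or archive it so plan changes show up in review
    for (const row of rows) console.log(row['QUERY PLAN']);
  } finally {
    await client.end();
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```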
Outcomes:
- p95 API latency improvements of 2–10x on a single hot endpoint
- Reduced DB CPU and fewer noisy-neighbor incidents
4) Put timeouts and circuit breakers where AI-generated code forgot them
A lot of AI-assisted code “works” but quietly turns retries into latency multipliers.
- Set client timeouts (`axios`, `fetch`, gRPC) explicitly
- Cap retries; use exponential backoff with jitter
- Use bulkheads/circuit breakers (Envoy, `resilience4j`, or service mesh policies)
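Here's what the first two look like for plain `fetch`; the timeout, retry cap, and backoff numbers are illustrative defaults, not gospel:

```ts
// http/resilientFetch.ts: explicit timeout, capped retries, exponential backoff with jitter
// (assumes a runtime with global fetch and AbortSignal.timeout, e.g. modern Node or browsers)
export async function resilientFetch(
  url: string,
  init: RequestInit = {},
  { timeoutMs = 2000, maxRetries = 2 } = {}
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
      // Retry only on 5xx; retrying 4xx just adds latency for the same answer
      if (res.status < 500) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    if (attempt < maxRetries) {
      // Full jitter caps the worst case instead of multiplying it
      const backoff = Math.random() * Math.min(1000 * 2 ** attempt, 4000);
      await new Promise((resolve) => setTimeout(resolve, backoff));
    }
  }
  throw lastError;
}
```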
Outcomes:
- Better tail latency (p99) under partial failures
- Lower support volume during provider blips
How to roll this out without derailing delivery
If you try to boil the ocean, you’ll get nothing but meetings.
A rollout plan that’s survived real orgs:
- Week 1: pick KPIs + add deploy markers + build the first dashboard
- Week 2: add CI gates (warn-only), fix flake, document escape hatch
- Week 3: enforce gates on 1–2 tier‑1 routes, add baseline regression alerting
- Week 4: expand coverage, start a weekly “perf triage” (30 minutes, strict)
The business payoff shows up faster than you’d think when you focus on customer paths:
- Higher conversion from faster checkout
- Better SEO and ad landing page quality scores
- Fewer “it’s slow” tickets (which are expensive because they’re hard to reproduce)
If you’re already deep in the swamp—legacy frontend, microservices, and a bit of vibe-coded glue—GitPlumbers can help you set up regression detection that engineers won’t disable and product will actually care about.
Next step: make one critical path (checkout/search/login) impossible to slow down silently.
Key takeaways
- Gradual performance degradation is a product and revenue problem, not a “nice-to-have” engineering metric.
- Use user-facing metrics (CWV, p95/p99, conversion funnel timings) and treat them as release gates.
- Regression detection needs two layers: CI (prevent obvious foot-guns) and production (catch real-user reality).
- Budgets must be statistical (percentiles, deltas, variance), not a single hard-coded number that flakes.
- Instrumentation and alerting should point to the commit, route, and change type—not just “latency is up.”
- A small set of repeatable optimizations (payload, caching, DB plan stability, front-end budgets) usually pays back within weeks.
Implementation checklist
- Pick 3–5 user-facing performance KPIs (e.g., LCP p75, INP p75, API p95, checkout p95, error rate).
- Define a regression policy: allowed delta per KPI per PR and per week.
- Add CI gates: Lighthouse CI for front-end, k6 smoke for APIs, bundle-size checks.
- Baseline against `main` or last release and compare deltas—not absolute numbers only.
- Ship RUM (Web Vitals + route tags) and correlate with deploy SHA.
- Create Prometheus recording rules for p95/p99 and alert on sustained deltas vs 7-day baseline.
- Roll out with canaries and auto-rollback on budget breach.
- Review regression trends weekly with product + eng; treat performance like reliability.
Questions we hear from teams
- Should we gate on synthetic metrics or real-user metrics?
  - Both. Use CI synthetic gates (Lighthouse/k6) to block obvious regressions before merge, then use production RUM and p95/p99 SLOs to catch real-world issues CI can’t model (devices, networks, third-party tags, geo).
- How do we avoid flaky performance tests in CI?
  - Run multiple iterations and take the median, limit pages/endpoints to tier‑1 flows, compare PR deltas vs `main` rather than absolute numbers only, and start warn-only while you calibrate variance.
- What’s a reasonable regression budget?
  - For CI smoke tests, start with a small allowed delta like +5% on p95 duration for tier‑1 endpoints. In production, alert on sustained deltas like +10–15% vs a 7-day baseline for at least 15–30 minutes to avoid noise.
- We’re drowning in dashboards—what’s the minimal setup?
  - One dashboard: CWV p75 by page template + API p95 for tier‑1 endpoints + deploy markers. One alert: sustained p95 delta vs baseline on checkout/search. Add more only after those are stable and trusted.
- How does GitPlumbers typically engage on this?
  - We usually start with a short diagnostic: instrument the critical path, add deploy correlation, set budgets, and ship the first CI+prod regression gates. Then we tackle the top 2–3 bottlenecks (payload, caching, DB plan, or tail-latency resilience) with measurable before/after results.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
