The Perf “Improvement” That Tanked Conversion: Automating Tests That Prove Real Gains

Stop celebrating green Lighthouse badges and start gating deploys on user-facing metrics that move revenue.

If you can’t prove the optimization moved p95 or conversion, it’s just a refactor with better vibes.

The day Lighthouse went green and revenue went red

We shipped a “performance sprint” at a retail client—first paint looked fantastic on staging. Lighthouse: 95+. Everyone high‑fived. Prod rollout? Checkout conversion dipped 1.8 points. Why? We improved lab metrics while quietly increasing p95 server latency under load and shoving third‑party tags earlier in the critical path. I’ve seen this movie more than once.

What actually works: automate performance tests that validate optimizations against metrics users feel and the business cares about. Gate merges and rollouts on those numbers. No hero dashboards, no manual rituals—just boring, reliable automation.

Measure what users and the business feel

Optimize what you can measure end‑to‑end and tie to money.

  • User-centric web metrics: LCP, INP, CLS, TTFB. RUM from @vercel/analytics, Elastic RUM, or Sentry Performance beats lab‑only.
  • Service latency: p95/p99 per endpoint. Prometheus histograms or OpenTelemetry traces (http.server.duration); a minimal instrumentation sketch follows this list.
  • Business KPIs: checkout conversion, search click-through, funnel abandonment, AOV. Pull from Snowflake/Looker mixed with telemetry.
  • Error budget burn: SLOs for key journeys—burn rate alerts beat absolute thresholds.
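
A minimal sketch of that histogram instrumentation with Express and prom-client, assuming an existing `app`; the metric name and bucket boundaries are assumptions, so match them to whatever your dashboards already query:

// Express middleware recording request latency into a Prometheus histogram
const client = require('prom-client');

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds', // assumed name; keep it consistent with your queries
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.25, 0.4, 1, 2.5], // 0.4s bucket lines up with the p95 < 400ms target
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
  });
  next();
});

// Expose /metrics for Prometheus to scrape
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});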

Map them explicitly:

  • LCP <= 2.5s on PDP → +X% product views to add‑to‑cart.
  • p95 /api/checkout/pay < 400ms → fewer timeouts → conversion lift.
  • INP < 200ms on cart → less rage clicking → higher completion.

If you can’t draw a line from the metric to revenue, it’s a vanity metric.

Wire perf into CI like you wire tests

Set minimum bars and fail the PR if they regress. This is table stakes now.

  • Tools: Lighthouse CI for web vitals in lab; k6 for API latency and throughput; GitHub Actions/GitLab CI to orchestrate.
  • Data: Seed with prod‑like fixtures, stable network emulation, consistent container sizes.
  • Thresholds: Start strict enough to catch regressions, loosen only with evidence.

Example GitHub Action that builds a preview, runs Lighthouse CI assertions, then a k6 smoke with thresholds:

name: perf-ci
on:
  pull_request:
jobs:
  perf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Build and start preview
        run: |
          npm run build
          npm run start & npx wait-on http://localhost:3000
      - name: Lighthouse CI
        run: npx @lhci/cli autorun --config=./.lighthouserc.json
      - name: k6 smoke
        uses: grafana/k6-action@v0.2.0
        with:
          filename: test/perf/smoke.js
        env:
          BASE_URL: http://localhost:3000

.lighthouserc.json:

{
  "ci": {
    "collect": { "url": ["http://localhost:3000/"], "numberOfRuns": 3 },
    "assert": {
      "assertions": {
        "categories.performance": ["error", { "minScore": 0.9 }],
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
        "interactive": ["warn", { "maxNumericValue": 3800 }]
      }
    },
    "upload": { "target": "temporary-public-storage" }
  }
}

k6 script with p95 and error thresholds:

import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 50 },
    { duration: '3m', target: 200 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(95)<400'],
  },
};

export default function () {
  const res = http.get(`${__ENV.BASE_URL || 'http://localhost:3000'}/api/catalog`);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}

This won’t tell you everything—but it will stop obvious regressions before they hit staging.

Guard rollouts in prod with real traffic

Lab is necessary; prod is truth. Use canaries plus automated analysis to validate improvements under actual user behavior.

  • Canary strategy: Argo Rollouts or Flagger to shift traffic gradually.
  • Automated analysis: Query Prometheus for p95, error rate, even LCP from RUM exporters.
  • Abort on regression: Roll back automatically if thresholds break.

Minimal AnalysisTemplate + canary snippet:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-latency-check
spec:
  metrics:
  - name: p95-latency
    interval: 1m
    count: 5
    successCondition: result[0] < 0.4
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="checkout",version="canary"}[1m])) by (le))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 6
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 2m}
      - analysis:
          templates:
          - templateName: p95-latency-check
      - setWeight: 30
      - pause: {duration: 2m}
      - analysis:
          templates:
          - templateName: p95-latency-check
      - setWeight: 100

We’ve also wired RUM LCP into Prometheus via web-vitals exporters; same pattern applies. If the canary’s LCP p75 > 2.5s or checkout p95 > 400ms for two intervals, Rollouts aborts. No pager, no debate.
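
If you want to roll that RUM wiring yourself, here is a hedged client-side sketch with the web-vitals package; the /rum collector endpoint is an assumption, and on the server you would feed the values into the same Prometheus histogram pattern shown earlier:

import { onLCP, onINP } from 'web-vitals';

function report(metric) {
  const body = JSON.stringify({ name: metric.name, value: metric.value, page: location.pathname });
  // sendBeacon survives page unloads; fall back to fetch with keepalive
  if (!(navigator.sendBeacon && navigator.sendBeacon('/rum', body))) {
    fetch('/rum', { method: 'POST', body, keepalive: true });
  }
}

onLCP(report);
onINP(report);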

Three optimizations that consistently move needles (and how to prove it)

I don’t care how clever the code is—if we can’t show measured gains, it’s just vibe coding. Here are fixes that survive real traffic, with tests to validate.

  1. Cache the boring stuff (CDN + API)
  • Add Cache-Control and stale-while-revalidate for product lists and CMS content. Front door: Cloudflare/Akamai; back door: Redis edge or app layer.
// Express example
app.get('/api/catalog', async (req, res) => {
  res.set('Cache-Control', 'public, max-age=60, stale-while-revalidate=300');
  const data = await getCatalog();
  res.json(data);
});
  • Validate with k6 under load and assert that TTFB and p95 improved. We routinely see 40–60% p95 latency reductions on cached endpoints and 25–35% LCP improvements on list pages.
  2. Fix query plans and N+1s
  • Add covering indexes; eliminate N+1 in GraphQL with dataloader and batch APIs (batching sketch after this list).
-- Speed account order history by filtering/sorting on indexed columns
CREATE INDEX CONCURRENTLY idx_orders_account_created
  ON orders (account_id, created_at DESC)
  INCLUDE (total, status);
  • Validate with a targeted k6 scenario for /api/orders?account_id=… and assert p95 < 300ms. In one case, p99 dropped from 1.2s → 320ms, checkout conversion +2.3 pts.
  3. Defer third‑party and ship less JS
  • Lazy‑load non‑critical tags and preconnect critical CDNs.
// React – defer a third‑party tag until the browser is idle
// (requestIdleCallback is missing in Safari, so fall back to a timeout)
useEffect(() => {
  const schedule = (cb) =>
    'requestIdleCallback' in window ? window.requestIdleCallback(cb) : setTimeout(cb, 2000);
  schedule(() => {
    const s = document.createElement('script');
    s.src = 'https://tags.example-ads.com/tag.js'; // placeholder tag URL
    s.async = true;
    document.body.appendChild(s);
  });
}, []);
<link rel="preconnect" href="https://cdn.shop.com" crossorigin />
  • Split bundles and compress: Brotli, React.lazy, image formats (AVIF, WebP). Validate with Lighthouse CI budgets: LCP <= 2.5s, total-byte-weight < 300KB for mobile. Typical wins: LCP p75 −400–800ms, INP −50–150ms.
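
The batching sketch promised in item 2, as a rough outline: `getOrdersByAccountIds` is a hypothetical helper that runs a single `WHERE account_id = ANY(...)` query, and in practice you would create the loader per request:

const DataLoader = require('dataloader');

const ordersByAccount = new DataLoader(async (accountIds) => {
  const rows = await getOrdersByAccountIds(accountIds); // hypothetical: one query for the whole batch
  const grouped = new Map(accountIds.map((id) => [id, []]));
  rows.forEach((row) => grouped.get(row.account_id).push(row));
  // DataLoader expects results in the same order as the keys
  return accountIds.map((id) => grouped.get(id));
});

const resolvers = {
  Account: {
    // Many Account.orders fields in one GraphQL request collapse into a single query
    orders: (account) => ordersByAccount.load(account.id),
  },
};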

Prove each change with before/after CI runs plus a canary in prod. If conversion or funnel abandonment doesn’t improve, roll back and try the next lever.
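
One way to make the before/after comparison mechanical: export a k6 summary from the baseline and candidate runs (k6 run --summary-export=baseline-summary.json test/perf/smoke.js) and fail when p95 regresses beyond a tolerance. The JSON shape below assumes k6's legacy summary export; the file names and 5% tolerance are placeholders:

// compare-p95.js: fail the job if the candidate run's p95 regressed vs. baseline
const fs = require('fs');

const TOLERANCE = 0.05; // 5%, tune to your noise floor
const p95 = (file) =>
  JSON.parse(fs.readFileSync(file, 'utf8')).metrics.http_req_duration['p(95)'];

const baseline = p95('baseline-summary.json');
const candidate = p95('candidate-summary.json');

if (candidate > baseline * (1 + TOLERANCE)) {
  console.error(`p95 regressed: ${candidate.toFixed(1)}ms vs baseline ${baseline.toFixed(1)}ms`);
  process.exit(1);
}
console.log(`p95 OK: ${candidate.toFixed(1)}ms (baseline ${baseline.toFixed(1)}ms)`);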

Make it stick with SLOs, dashboards, and budgets as gates

Tie deploy decisions to error budgets—not gut feel.

  • SLOs: e.g., checkout p95 < 400ms and availability >= 99.9% monthly. Track burn rate with Prometheus or Sloth.
  • Recording rule to simplify queries:
# Prometheus recording rule (lives inside a rule group)
groups:
  - name: checkout-slo
    rules:
      - record: service:p95_latency_seconds
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
  • CI budget check: lightweight guard that queries Prometheus before deploying to prod.
# Fail the deploy job if checkout p95 (from the recording rule above) is over budget
THRESH=0.4
VAL=$(curl -s "http://prom:9090/api/v1/query?query=service:p95_latency_seconds" | jq -r '.data.result[0].value[1]')
awk -v v="$VAL" -v t="$THRESH" 'BEGIN { exit !(v < t) }' || { echo "p95 ${VAL}s >= ${THRESH}s budget"; exit 1; }
  • Dashboards: Put LCP p75, INP p75, p95 per key endpoint, and conversion on the same Grafana row. If an engineer can’t see perf and revenue together, it won’t drive behavior.

A pragmatic rollout plan (4–6 weeks, small team)

  1. Inventory critical flows and define SLOs (checkout, PDP, search). Document current LCP, INP, p95.
  2. Add tracing (OpenTelemetry) and Prometheus histograms to those endpoints (tracing bootstrap sketch after this list).
  3. Land CI perf gates: Lighthouse CI assertions, k6 with thresholds. Stabilize with prod‑like data and network emulation.
  4. Introduce canary analysis (Argo Rollouts) for the top two services.
  5. Ship one optimization per layer: CDN caching, DB index/N+1 fix, bundle split. Validate each with CI and canary.
  6. Publish a single Grafana board with perf + conversion. Make it the deploy gate.
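
For step 2, a minimal Node.js tracing bootstrap, assuming the standard OpenTelemetry packages and a placeholder collector address (the histogram side was sketched earlier under "Measure what users and the business feel"):

// tracing.js: load before the app starts, e.g. node -r ./tracing.js server.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'checkout',
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }), // placeholder address
  instrumentations: [getNodeAutoInstrumentations()], // auto-instruments HTTP, Express, pg, and more
});

sdk.start();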

You’ll catch regressions in PRs, abort bad rollouts automatically, and bank measurable wins in under two sprints.

Lessons learned (and what I’d do differently next time)

  • Don’t chase perfect lab scores. Set realistic budgets aligned with device mix and traffic.
  • Control variability. Ephemeral envs with known data beat staging that changes hourly.
  • Keep thresholds tight, but only on stable metrics. INP can be noisy; use p75 instead of p95 initially.
  • Bind perf flags to canary ramps (LaunchDarkly/Unleash). You want a kill switch tied to metrics, not intuition.
  • Treat “AI‑generated optimizations” like interns’ code: review, benchmark, and backtest. We do a lot of vibe code cleanup at GitPlumbers after “AI refactors” quietly doubled TTFB.

If you need a second set of hands to wire this up without grinding your team to a halt, that’s literally what we do at GitPlumbers.


Key takeaways

  • Green lab scores mean nothing if p95 and conversion don’t move; gate changes on user-facing metrics.
  • Automate performance checks in CI with Lighthouse CI and k6 and fail fast on regressions.
  • Use canaries with automated analysis (Argo Rollouts + Prometheus) to validate impact in real traffic.
  • Tie perf to dollars: set SLOs for key flows (search, PDP, checkout) and watch error budget burn.
  • Optimize what matters: caching, query plans, payloads/3P tags—prove gains with thresholds and deltas.
  • Keep it boring and repeatable: ephemeral envs, trace sampling, consistent data, and tight thresholds.

Implementation checklist

  • Instrument user-facing metrics: LCP, INP, TTFB, p95/p99 service latency, error rate.
  • Stand up CI perf gates: Lighthouse CI assertions and k6 thresholds.
  • Create ephemeral envs seeded with prod-like data; stabilize test variability.
  • Add canary analysis in prod (Argo Rollouts or Flagger) with Prometheus queries.
  • Define SLOs for top journeys and wire error budget burn to deploy decisions.
  • Target three concrete optimizations and verify gains with automated tests before/after.
  • Publish dashboards that show perf deltas alongside conversion and abandonment.
  • Document a rollback plan that triggers on perf regressions automatically.

Questions we hear from teams

How do we avoid flaky performance tests in CI?
Stabilize the environment: fixed CPU/mem runners, network emulation (e.g., Chrome’s ‘slow 4G’), seeded data, warm caches, and 3–5 Lighthouse runs with median selection. For k6, use steady-state stages and minimum sample sizes. Fail only on stable metrics (p95) with small tolerances (e.g., 3–5%).
Synthetic (Lighthouse/k6) or RUM (real-user monitoring)?
Both. Synthetic catches regressions pre-merge and pre-prod. RUM tells you what real devices and networks feel. Use synthetic in CI; use RUM to gate canaries and monitor SLOs. Automate decisions in both places.
Can we run perf checks on every PR without blowing up build times?
Yes: run a fast Lighthouse (1–2 URLs, 3 runs) and a k6 smoke (2–3 minutes). Nightly, run deeper scenarios. Parallelize jobs. If builds are still slow, only block on deltas > N% and let small signals warn but not fail.
What about backend services without a UI?
Use k6/Gatling with thresholds on `http_req_duration` and `error_rate`. Add OpenTelemetry spans and Prometheus histograms. Gate merges on p95. In prod, canary with Argo Rollouts + Prometheus queries.
We don’t have Prometheus/Argo—can this still work?
Yes. Use whatever you have: Datadog Synthetics + CI, New Relic SLOs as gates, LaunchDarkly for canary flags. The principle is the same: automate checks on user-facing metrics and abort when they regress.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a pragmatic perf testing pipeline in 2 sprints
See how we cut checkout p95 by 45%
