The “Optimized” PR That Tanked Conversion — Automating Performance Tests That Prove What’s Better
If you can’t prove it’s faster for users, it’s not an optimization. Here’s the CI wiring, budgets, and load models that catch regressions before they hit revenue.
The “Optimized” PR That Tanked Conversion
I’ve watched a team shave 12% off p95 CPU with a “clever” cache tweak and still lose money. Why? The change increased TTFB on the product page by ~200ms and pushed p75 LCP over 3s on mid-range Android. Add a promo weekend, and cart conversion dropped 3.2%. We rolled back on a Sunday.
I’ve seen this movie across Shopify apps, fintech dashboards, and B2B SaaS. The pattern is predictable: engineers optimize server metrics, ship, and discover user-facing performance (and revenue) got worse. The fix is also predictable: automate performance testing around what users feel and what the business cares about, and gate merges on those signals.
Measure What Users Feel, Not What Servers Brag About
Track these user-centric metrics and tie them to dollars:
- Web vitals: LCP p75, INP p75, CLS. Set budgets per key route (home, PLP, PDP, checkout).
- Network and server latency: TTFB p75, API p95, and error rate. p95 catches the pain your execs see on hotel Wi‑Fi.
- Business signals: conversion rate, bounce, funnel step completion, session length.
Rules of thumb I’ve seen hold up:
- Dropping LCP p75 by 500ms on product pages often yields +2–5% conversion for retail. Amazon wrote this in 2009; it’s still true.
- API p95 under 400ms keeps dashboards snappy; over 800ms and users pogo-stick.
- Focus on p75/p95, not averages. That’s where the experience (and churn) is.
If you can’t map a perf change to one of these, you’re optimizing for your ego, not your users.
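To make the percentile point concrete, here's a tiny illustrative snippet (the latency samples are made up) showing how a mean hides tail pain. It uses a simple nearest-rank percentile:

```javascript
// Why p75/p95 tell a different story than the mean.
// Nearest-rank percentile: sort, then index by ceil(p% * n) - 1.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Hypothetical route latencies in ms: most requests are fast, a few are awful.
const latenciesMs = [120, 130, 140, 150, 160, 170, 180, 900, 950, 1000];
const mean = latenciesMs.reduce((a, b) => a + b, 0) / latenciesMs.length;

console.log(mean);                        // 390 — looks "fine"
console.log(percentile(latenciesMs, 75)); // 900 — the pain a quarter of users feel
console.log(percentile(latenciesMs, 95)); // 1000
```

The mean says the route is healthy; p75 says a quarter of your users are waiting nearly a second.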
Baseline With RUM + Synthetic You Can Trust
Start by measuring reality, then simulate it repeatably.
- RUM: Use Datadog RUM, New Relic Browser, Sentry Performance, or SpeedCurve RUM to get real-device, real-network, real-route metrics. Establish p50/p75/p95 for your top flows over 7–14 days.
- Synthetic: Use Lighthouse CI for web vitals on critical pages and k6 (or Gatling, Locust) for API latency and error budgets. Synthetic gives deterministic regressions you can gate in CI.
Pin down noise early:
- Run Lighthouse 5–7 times and assert on the aggregate across runs (LHCI supports median/optimistic/pessimistic aggregation). Use the same throttling profile each run.
- Warm caches and DB; run a 30–60s warm-up in k6 before measuring.
- Use consistent data fixtures—avoid random payloads that blow cache hit rates.
Example Lighthouse CI config with budgets-as-code:
{
"ci": {
"collect": {
"url": [
"http://localhost:3000/",
"http://localhost:3000/product/123"
],
"numberOfRuns": 5,
"settings": {
"formFactor": "desktop",
"screenEmulation": { "mobile": false },
"throttlingMethod": "simulate"
}
},
"assert": {
"assertions": {
"categories:performance": ["error", { "minScore": 0.9 }],
"largest-contentful-paint": ["error", { "maxNumericValue": 2500, "aggregationMethod": "median" }],
"interactive": ["error", { "maxNumericValue": 3000, "aggregationMethod": "median" }],
"cumulative-layout-shift": ["error", { "maxNumericValue": 0.1, "aggregationMethod": "median" }]
}
},
"upload": { "target": "temporary-public-storage" }
}
}

CI That Fails Fast on Real Regressions
Wire the checks into CI so bad perf never merges. I like GitHub Actions with LHCI + k6 because it’s boring and works. Keep budgets and scenarios versioned in the repo.
# .github/workflows/perf.yml
name: perf-checks
on:
pull_request:
paths:
- 'web/**'
- 'api/**'
- '.github/workflows/perf.yml'
jobs:
perf:
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install tools
run: |
npm i -g @lhci/cli@0.13.0
- name: Start test stack
run: |
docker compose -f docker-compose.test.yml up -d --build
./scripts/wait-for-http.sh http://localhost:3000 180
- name: Lighthouse CI (perf budgets)
run: lhci autorun --config ./.lighthouserc.json
- name: k6 API scenario
uses: grafana/k6-action@v0.3.1
with:
filename: perf/k6/api-smoke.js
env:
BASE_URL: http://localhost:8080
- name: Publish summary
if: always()
run: node scripts/perf-summary.js # comment PR with deltas

Make CI failures helpful, not noisy:
- Assert on a few meaningful budgets (LCP p75, CLS, API p95, error rate). Not 40 assertions.
- Comment PRs with before/after deltas and links to dashboards.
- Flaky? Pin CI runners, disable turbo-boost, or move perf checks to controlled self-hosted runners.
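The workflow above calls a scripts/perf-summary.js helper to comment deltas on the PR. Here's a minimal sketch of what that can look like — the metric names and JSON shape are illustrative assumptions (real numbers would come from your LHCI and k6 artifacts), and the actual comment posting is left to the workflow:

```javascript
// Hypothetical sketch of scripts/perf-summary.js: turn baseline-vs-PR metrics
// into a markdown table. A real script would read the values from artifact files.
function formatDelta(name, base, current, unit) {
  const delta = current - base;
  const sign = delta > 0 ? '+' : '';
  const flag = delta > 0 ? '🔺' : '✅'; // worse vs same-or-better
  return `| ${name} | ${base}${unit} | ${current}${unit} | ${sign}${delta}${unit} ${flag} |`;
}

function summarize(baseline, current) {
  const rows = Object.keys(baseline).map((name) =>
    formatDelta(name, baseline[name].value, current[name].value, baseline[name].unit)
  );
  return ['| Metric | Baseline | This PR | Delta |', '| --- | --- | --- | --- |', ...rows].join('\n');
}

// Illustrative data only.
const baseline = { 'LCP p75': { value: 2400, unit: 'ms' }, 'API p95': { value: 380, unit: 'ms' } };
const current = { 'LCP p75': { value: 2550, unit: 'ms' }, 'API p95': { value: 360, unit: 'ms' } };
console.log(summarize(baseline, current));
```

A reviewer who sees "+150ms 🔺" next to LCP p75 asks the right question before merging, not after the rollback.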
Model Load Like Production, Not Hello-World
Your load test should look like your traffic graph, not a synthetic wave.
- Mix endpoints: 60% reads, 30% personalized reads, 10% writes.
- Add realistic think times and cache behavior. Preload CDN and app caches.
- Seed test data so caches hit at similar rates as prod.
Example k6 scenario with p95 thresholds and realistic arrival pattern:
// perf/k6/api-smoke.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
scenarios: {
web_mix: {
executor: 'ramping-arrival-rate',
startTime: '60s', // start after the warmup scenario finishes
startRate: 10, // RPS
timeUnit: '1s',
preAllocatedVUs: 50,
maxVUs: 200,
stages: [
{ duration: '1m', target: 50 },
{ duration: '4m', target: 150 },
{ duration: '1m', target: 0 },
],
},
warmup: {
executor: 'constant-arrival-rate',
rate: 30,
timeUnit: '1s',
duration: '60s',
preAllocatedVUs: 20,
exec: 'warm',
}
},
thresholds: {
http_req_failed: ['rate<0.01'],
http_req_duration: ['p(95)<400'], // API p95 under 400ms
'checks{type:read}': ['rate>0.99'],
},
};
export function warm() {
http.get(`${__ENV.BASE_URL}/api/health`);
}
export default function () {
// Search (cacheable)
let res = http.get(`${__ENV.BASE_URL}/api/search?q=shoes`, { tags: { type: 'read' } });
check(res, { 'search 200': (r) => r.status === 200 }, { type: 'read' }); // tag checks so the threshold sees them
// Product detail (personalized)
res = http.get(`${__ENV.BASE_URL}/api/product/123?user=42`, { tags: { type: 'read' } });
check(res, { 'product 200': (r) => r.status === 200 }, { type: 'read' });
// Add to cart (write)
if (Math.random() < 0.1) {
res = http.post(`${__ENV.BASE_URL}/api/cart`, JSON.stringify({ sku: 'ABC', qty: 1 }), {
headers: { 'Content-Type': 'application/json' },
tags: { type: 'write' },
});
check(res, { 'cart 2xx': (r) => r.status >= 200 && r.status < 300 });
}
sleep(1 + Math.random() * 2);
}

For web, run Lighthouse on top routes; for APIs, keep scenarios representative. If you need real traffic shapes, replay sampled prod traffic with Gor or mizu in a staging VPC. Just scrub PII and tokens.
Close the Loop: Dashboards, Burn Rates, and Dollar Signs
Performance that doesn’t hit the business is theater. Publish dashboards that map budgets to outcomes.
- Grafana: p75 LCP by route, INP, API p95, error rate, and conversion. Slice by country/device.
- Prometheus SLOs with burn-rate alerts for latency and error budget.
- Annotate releases and PRs so deltas are traceable.
Example Prometheus rule for API latency SLO and burn alert:
# prometheus-rules.yaml
groups:
- name: api-slo
rules:
- record: slo:api_latency_budget_burn:rate5m
expr: (
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)) > bool 0.4
)
- alert: APIHighLatencyBurn
expr: slo:api_latency_budget_burn:rate5m == 1
for: 10m
labels:
severity: page
annotations:
summary: "API p95 latency over 400ms (burning budget)"
runbook: "https://wiki.example.com/runbooks/api-latency"

Tie performance to money:
- Track conversion vs. p75 LCP by route. If your LCP improved 600ms and conversion didn’t move, you probably optimized the wrong path.
- Show infra spend vs. latency. We’ve cut autoscaling churn and saved 10–20% on EC2 with simple caching + query fixes verified under load.
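A minimal sketch of that conversion-vs-LCP join: bucket sessions by LCP and compute conversion per bucket. The session data here is made up, and a real version would pull from your RUM and analytics stores:

```javascript
// Illustrative RUM sessions joined with a conversion flag per session.
const sessions = [
  { lcpMs: 1800, converted: true },
  { lcpMs: 2100, converted: true },
  { lcpMs: 2600, converted: false },
  { lcpMs: 3200, converted: false },
  { lcpMs: 1900, converted: true },
  { lcpMs: 3500, converted: false },
];

// Group sessions into fixed-width LCP buckets and compute conversion per bucket.
function conversionByLcpBucket(rows, bucketMs = 1000) {
  const buckets = {};
  for (const { lcpMs, converted } of rows) {
    const key = Math.floor(lcpMs / bucketMs) * bucketMs;
    buckets[key] = buckets[key] || { total: 0, converted: 0 };
    buckets[key].total += 1;
    if (converted) buckets[key].converted += 1;
  }
  return Object.entries(buckets).map(([lcp, b]) => ({
    bucket: `${lcp}-${Number(lcp) + bucketMs}ms`,
    conversion: b.converted / b.total,
  }));
}

console.log(conversionByLcpBucket(sessions));
```

Plot those buckets per route and the "every 100ms of LCP costs X% conversion" conversation becomes data, not folklore.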
Ship Optimizations Safely: A Short Playbook
- Pick the top 3 user flows and define budgets: LCP p75 < 2.5s, INP p75 < 200ms, API p95 < 400ms, error rate < 1%.
- Baseline with RUM for two weeks; confirm the biggest offenders.
- Add Lighthouse CI and k6 to CI; assert on budgets. Store trend data or upload to Datadog/Grafana.
- Build prod-like test data; warm caches in tests; run 5–7 Lighthouse iterations; measure p75.
- Optimize in small PRs; each PR must beat or meet budgets. If not, it doesn’t merge.
- Canary in prod: 5–10% traffic with feature flags (LaunchDarkly, Unleash, or Flagsmith); monitor RUM and SLOs.
- Roll forward with confidence; lock in improvements with alerts that page on budget burns.
- Quarterly, tighten budgets by what you achieved; keep the ratchet moving.
What Actually Works (And What We’ve Seen Fail)
What works:
- Budgets in the repo, reviewed like code. Engineers own them; PMs see them.
- Few, meaningful assertions. LCP p75, INP p75, API p95, error rate. The rest is dashboard fodder.
- Canary + RUM validation before 100% rollout. Synthetic finds regressions; RUM catches device/network reality.
- Fixing the boring stuff first: Cache-Control headers, image sizes, DB indexes, N+1 queries. They move the needle.
What fails:
- Micro-benchmarks that ignore the network. Users aren’t on loopback.
- Optimizing for averages. Your unhappy path lives at p95.
- CI perf on noisy shared runners with turbo-boost on. Either control the box or accept flakiness.
- Rolling out big-bang “perf refactors” without canaries. Hope is not a strategy.
Results You Can Expect
- Retail marketplace: LCP p75 on PDP from 3.8s → 2.4s; API p95 520ms → 360ms; +4.1% conversion, -15% EC2 from smoother autoscaling.
- Fintech dashboard: INP p75 280ms → 130ms; p95 backend latency 900ms → 430ms; -23% support tickets tagged “slow”.
- B2B SaaS: CLS < 0.1 across all key routes via font loading + layout fixes; +7% trial-to-paid on faster onboarding.
None of this required exotic tech. It required making performance a first-class, automated test with budgets tied to real user experience.
Key takeaways
- Measure user-facing metrics (LCP, INP, p95) and connect them to revenue, not just CPU and RPS.
- Automate baseline + regression checks with Lighthouse CI and k6 in CI; fail PRs on budget violations.
- Model realistic traffic (think time, cache-warm, percentiles) to avoid false positives/negatives.
- Version performance budgets in-repo; publish dashboards that tie perf to conversion and error budgets.
- Use prod-like canaries and RUM to validate in the wild, then lock in improvements with SLOs.
Implementation checklist
- Define user-facing metrics and budgets (LCP p75, INP p75, p95 API latency, error rate).
- Stand up RUM (Datadog/New Relic/Sentry/SpeedCurve RUM) to capture real user baselines.
- Add Lighthouse CI and k6 to CI with thresholds that fail PRs on regressions.
- Model realistic load: think times, warm caches, data fixtures, and p95 focus.
- Automate canary checks and burn-rate alerts linked to perf budgets.
- Publish dashboards tying perf deltas to conversion, bounce, and infra spend.
- Make performance budgets code-reviewed artifacts in the repo.
Questions we hear from teams
- How do we keep perf checks stable in CI?
- Pin the environment: dedicated runners, CPU frequency scaling off, fixed Lighthouse throttling, 5–7 runs with p75 assertions, warm-up phases for k6, and controlled test data. If that’s not feasible, move heavy perf checks to a nightly pipeline and keep a light smoke on PRs.
- What if staging isn’t close to prod?
- Don’t pretend. Use canaries with RUM in prod behind feature flags (5–10% traffic), and replay sanitized traffic into a staging VPC for functional/veracity checks. Budget asserts in CI ensure deltas trend the right way; canaries validate reality.
- Which tools should we start with?
- For web: Lighthouse CI + your RUM of choice. For APIs: k6 + Prometheus/Grafana. Keep it boring. Add SpeedCurve, Sitespeed.io, or Gatling later if you need depth.
- How do we connect perf to revenue credibly?
- Instrument conversion/bounce metrics and join them with web vitals and API latency per route. Run controlled rollouts and A/B tests where possible. Publish the cost curve: every 100ms of LCP change on PDP vs. conversion delta—leaders will care.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
