The Optimization Isn’t Real Until CI Says So: Automating Performance Proof with User-Centric Metrics
If it doesn’t move LCP, INP, and checkout latency for real users, it’s not an optimization. Wire performance tests into CI/CD and gate merges with budgets tied to revenue.
Performance isn’t faster until the customer says it is—and CI enforces it.
The release that felt faster but cost us revenue
I’ve watched teams “speed up” an app and ship a regression. One fintech client shaved 20% off server CPU after a gRPC hop refactor. Perf graphs looked great in Grafana… until revenue dipped. Turns out we had added a blocking font load and lazy-loaded the hero image incorrectly. LCP got worse on mid-tier Android in LATAM, p95 checkout time went up, and conversions fell 3.1% in two days. No one noticed until finance asked why the daily run rate was off.
I’ve seen this fail at unicorns and mom-and-pop SaaS: people optimize what’s easy to measure (CPU, GC, container cost) and ignore what users feel (LCP, INP, p95 checkout latency). The fix is boring but effective: automate performance tests around user-facing metrics and make merges fail if we don’t improve—or if we regress.
If your CI doesn’t gate on user-centric budgets, your “optimization” is a rumor.
Measure what users feel, not what servers brag about
Servers love to brag about CPU and throughput. Users care about when the page draws and when taps respond. Track these:
- Core Web Vitals: `LCP` (<2.5s p75), `INP` (<200ms p75), `CLS` (<0.1 p75).
- TTFB: p75 under ~200–400ms depending on region.
- Journey latency: p95 time from product page to payment success.
- API p95/p99 by endpoint and region (e.g., `/cart/checkout` p95 < 800ms).
- Apdex per user flow (search, add-to-cart, pay), not just per service.
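Apdex is worth pinning down, since we gate on it per flow: samples at or under a target T are satisfied, samples up to 4T are tolerating, anything slower is frustrated. A minimal sketch in plain JavaScript (the 500ms threshold and sample latencies are illustrative):

```javascript
// Apdex = (satisfied + tolerating / 2) / total for a target threshold T:
// satisfied <= T, tolerating <= 4T, frustrated > 4T.
function apdex(samplesMs, thresholdMs) {
  const satisfied = samplesMs.filter((t) => t <= thresholdMs).length;
  const tolerating = samplesMs.filter(
    (t) => t > thresholdMs && t <= 4 * thresholdMs
  ).length;
  return (satisfied + tolerating / 2) / samplesMs.length;
}

// Illustrative checkout-flow samples (ms) against a 500ms target:
const checkout = [320, 410, 480, 900, 1200, 2600];
console.log(apdex(checkout, 500).toFixed(2)); // → 0.67
```

Computing it per journey (search, add-to-cart, pay) instead of per service is the whole point: one number that degrades when any hop in the flow degrades.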
Map metrics to money:
- A 100ms faster `LCP` on product pages raised a retailer’s conversion by 1.6% (our client; 8-figure annual run-rate gain).
- Reducing p95 checkout from 1.2s to 700ms dropped abandonment by 2.4pp.
Set budgets per route/API and device class. Example budgets:
- `LCP` p75: Desktop 1.8s, Mid-Android 2.5s, Low-end 3.0s.
- `INP` p75: 150ms on product pages; 200ms on account pages.
- API `/pricing` p95: 300ms NA/EU; 450ms APAC.
These budgets become gates in CI and canary analysis. No green, no merge.
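One way to keep those gates maintainable is to store budgets as data rather than hard-coding thresholds in each CI job. A sketch using the example numbers above (the route keys and metric names are our own convention, not a standard format):

```javascript
// Budgets keyed by route × metric × device class/region, so CI jobs and
// dashboards read from one source of truth instead of duplicating numbers.
const budgets = {
  '/product': {
    lcpP75Ms: { desktop: 1800, 'mid-android': 2500, 'low-end': 3000 },
    inpP75Ms: 150,
  },
  '/account': { inpP75Ms: 200 },
  '/pricing': { apiP95Ms: { 'na-eu': 300, apac: 450 } },
};

// Returns true when no budget applies or the measured value is within it.
function withinBudget(route, metric, segment, valueMs) {
  const b = budgets[route] && budgets[route][metric];
  const limit = typeof b === 'object' ? b[segment] : b;
  return limit === undefined || valueMs <= limit;
}

console.log(withinBudget('/product', 'lcpP75Ms', 'mid-android', 2300)); // → true
console.log(withinBudget('/product', 'inpP75Ms', 'desktop', 180)); // → false
```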
Make tests repeatable and truthful
Synthetic tests catch regressions fast; Real User Monitoring (RUM) proves impact. Use both.
- Synthetic: `Lighthouse CI` for Web Vitals, `k6` for API latency and throughput. Run on fixed hardware or cloud workers with network shaping.
- RUM: the `web-vitals` library in your app; ship to `Prometheus` or a vendor (SpeedCurve, Splunk RUM, Datadog). Slice by device, OS, region, and release version.
- Truthful data: test with production-like payload sizes, images, and cache headers. Use real cookies/feature flags. Cover warm and cold cache scenarios.
- Traffic replay for APIs: sample production requests with `gor` (GoReplay) or service mesh mirrors (Istio `mirrorPercentage`) against staging/canary.
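On the RUM side, the `web-vitals` library hands each metric to a callback; the only code you own is a small reporter that attaches release/device/region labels before beaconing. A sketch, with the `/rum` endpoint path and label names as our assumptions:

```javascript
// Turn a web-vitals Metric ({ name, value, rating, id }) into a labeled
// payload so RUM can be sliced by release, device class, and region.
function toPayload(metric, ctx) {
  return {
    metric: metric.name.toLowerCase(), // 'lcp' | 'inp' | 'cls'
    value: metric.value,
    rating: metric.rating, // 'good' | 'needs-improvement' | 'poor'
    release: ctx.release,
    device: ctx.device,
    region: ctx.region,
  };
}

// Browser wiring (sketch): with the web-vitals library you would do
//   import { onLCP, onINP, onCLS } from 'web-vitals';
//   const ctx = { release: APP_VERSION, device: deviceClass(), region: userRegion() };
//   const report = (m) => navigator.sendBeacon('/rum', JSON.stringify(toPayload(m, ctx)));
//   onLCP(report); onINP(report); onCLS(report);
// where deviceClass()/userRegion() are your own helpers.

const p = toPayload(
  { name: 'LCP', value: 2180, rating: 'good', id: 'v3-123' },
  { release: '1.42.0', device: 'mid-android', region: 'latam' }
);
console.log(p.metric, p.value); // → lcp 2180
```

The release label is the part teams forget, and it is what lets you attribute a p75 shift to a specific deploy.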
Keep it cheap and sane:
- Run quick smoke perf checks on PRs, deeper runs nightly or on release candidates.
- Use small, deterministic datasets and seeded test accounts to remove noise.
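For deterministic datasets, a seeded PRNG is enough: the same seed yields the same SKUs and cart contents on every CI run. A sketch using the well-known mulberry32 generator (the SKU naming is illustrative):

```javascript
// mulberry32: tiny seeded PRNG; the same seed produces the same sequence,
// so perf runs request identical payloads and run-to-run noise stays low.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Seed once per suite; derive test SKUs from it.
const rand = mulberry32(42);
const skus = Array.from({ length: 5 }, () => `sku-${Math.floor(rand() * 1000)}`);
console.log(skus); // same five SKUs on every run
```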
Wire it into CI/CD with hard budgets
Here’s the combo we deploy a lot at GitPlumbers: Lighthouse CI for web, k6 for APIs, Prometheus for budgets/SLOs, enforced through CI and canary.
A minimal k6 script with thresholds that fail the build on regression:
```javascript
// tests/perf/api-smoke.js
import http from 'k6/http';
import { check, group, sleep } from 'k6';

export const options = {
  thresholds: {
    'http_req_waiting{page:home}': ['p(95)<300'],
    'http_req_waiting{page:product}': ['p(95)<400'],
    checks: ['rate>0.99'],
  },
  scenarios: {
    smoke: {
      executor: 'constant-vus',
      vus: 10,
      duration: '1m',
    },
  },
};

export default function () {
  group('home', () => {
    const res = http.get(`${__ENV.BASE_URL}/`, { tags: { page: 'home' } });
    check(res, { '200': (r) => r.status === 200 });
  });
  group('product', () => {
    const res = http.get(`${__ENV.BASE_URL}/product/123`, { tags: { page: 'product' } });
    check(res, { '200': (r) => r.status === 200 });
  });
  sleep(1);
}
```

A Lighthouse CI config that enforces Web Vitals budgets:
```jsonc
// lighthouserc.json
{
  "ci": {
    "collect": {
      "url": ["https://staging.example.com/", "https://staging.example.com/product/123"],
      "numberOfRuns": 3,
      "settings": { "preset": "desktop" }
    },
    "assert": {
      "assertions": {
        "categories:performance": ["error", { "minScore": 0.9 }],
        "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
        "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1 }],
        "interactive": ["warn", { "maxNumericValue": 3800 }]
      }
    }
  }
}
```

A GitHub Actions job that runs both and fails the PR if budgets blow up:
```yaml
name: perf-gates
on: [pull_request]
jobs:
  web-vitals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - name: Run Lighthouse CI
        run: npx @lhci/cli autorun --upload.target=temporary-public-storage
  api-latency:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run k6
        uses: grafana/k6-action@v0.3.0
        with:
          filename: tests/perf/api-smoke.js
        env:
          BASE_URL: https://staging.example.com
```

If a change bumps LCP over 2.5s p75 or pushes API p95 beyond budget, the PR turns red. You didn’t “optimize”; you broke the budget. Fix it or flag-gate it.
Validate optimizations with experiments, not vibes
Pick an optimization. Prove it. Then ship.
- Hypothesis: “Serving images as `AVIF` with `srcset` and proper `sizes` will drop `LCP` 20% on product pages.”
- Implement behind a feature flag (e.g., `LaunchDarkly`, `Unleash`).
- Synthetic proof: `LHCI` drops LCP from 2.9s → 2.1s on staging.
- Canary to 5% of traffic, mobile-only, via `Argo Rollouts`.
- RUM proof: `web-vitals` shows p75 LCP 2.8s → 2.2s on mid Android.
- Business impact: conversion +1.4%, add-to-cart +2.1% in 48 hours.
- Roll out to 100%, then ratchet budgets 10% tighter.
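The flag-gated part of that experiment can be as small as a pure function the rendering path calls. A sketch (the `avif-images` flag name and file paths are hypothetical; in production the flag value comes from LaunchDarkly/Unleash and the Accept header from the request):

```javascript
// Pick the hero image variant: AVIF only when the flag is on AND the
// client advertises support, otherwise the safe JPEG fallback.
function heroImageSrc(flags, acceptHeader) {
  const supportsAvif = acceptHeader.includes('image/avif');
  if (flags['avif-images'] && supportsAvif) {
    return { src: '/img/hero.avif', type: 'image/avif' };
  }
  return { src: '/img/hero.jpg', type: 'image/jpeg' };
}

console.log(heroImageSrc({ 'avif-images': true }, 'image/avif,image/webp,*/*').src);
// → /img/hero.avif
console.log(heroImageSrc({ 'avif-images': false }, 'image/avif,*/*').src);
// → /img/hero.jpg
```

Keeping the decision pure makes the kill switch trivial: flip the flag off and every request is back on JPEG, no deploy needed.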
We often add edge improvements:
- TTFB: cache HTML for anonymous users at the CDN with `stale-while-revalidate`. Saved a publisher 90ms p75 TTFB globally.
- JavaScript diet: split `vendors.js` by route; defer third-party tags. INP p75 dropped 30–60ms.
- API hot paths: precompute price breakdowns into Redis; `/pricing` p95 fell from 480ms → 190ms.
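The `/pricing` precompute is a plain read-through/refresh-on-write pattern. A sketch with a `Map` standing in for Redis (`computeBreakdown` is a hypothetical stand-in for the expensive pricing logic):

```javascript
// Read-through cache for the /pricing hot path: serve reads from cache,
// compute on a miss, refresh on writes. A Map stands in for Redis here.
const cache = new Map();

function computeBreakdown(sku) {
  // Stand-in for the expensive computation (DB joins, tax rules, discounts).
  return { sku, base: 100, tax: 8, total: 108 };
}

function getPricing(sku) {
  if (!cache.has(sku)) {
    cache.set(sku, computeBreakdown(sku)); // in Redis: SET with a short TTL
  }
  return cache.get(sku);
}

// Writers refresh the entry whenever the underlying price changes,
// so readers never pay the recompute cost on the hot path.
function onPriceChange(sku) {
  cache.set(sku, computeBreakdown(sku));
}

console.log(getPricing('sku-1').total); // → 108
```

The p95 win comes from moving the recompute off the request path; the TTL (or explicit invalidation) bounds how stale a price can get.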
Document results in a perf changelog that ties code diffs to metric shifts and revenue. Future-you will thank you when finance asks “what changed?”
Guardrails: SLOs, canaries, and auto-rollback
CI gates stop bad merges; SLOs stop bad rollouts. Two concrete pieces:
- Prometheus alert based on RUM (not server CPU):
```yaml
# prom-rules.yaml
groups:
  - name: perf-slo
    rules:
      - record: slo:lcp_p75_seconds
        expr: histogram_quantile(0.75, sum(rate(web_vitals_lcp_bucket{service="web",env="prod"}[5m])) by (le))
      - alert: WebVitalsLCPSLOViolation
        expr: slo:lcp_p75_seconds > 2.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "LCP p75 above SLO ({{ $value }}s > 2.5s)"
          runbook: https://gitplumbers.com/runbooks/web-lcp
```

- Argo Rollouts analysis that aborts a canary when LCP degrades:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-lcp
spec:
  metrics:
    - name: lcp-p75
      interval: 1m
      successCondition: result[0] < 2.5
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.75, sum(rate(web_vitals_lcp_bucket{service="web",env="canary"}[5m])) by (le))
```

If your canary pushes LCP above 2.5s for 10 minutes, the rollout halts automatically. No 2am firefight.
A one-week rollout plan
We’ve set this up in a week for teams from Series A startups to Fortune 100.
- Day 1: Pick journeys and KPIs. Define budgets for product page, search, checkout, and the `/pricing`, `/cart`, `/orders` APIs. Agree on p75/p95 targets by region and device class.
- Day 2: Add `web-vitals` to the app. Start shipping RUM to Prometheus with release labels.
- Day 3: Add `Lighthouse CI` to PRs for key pages. Fail if `LCP > 2.5s` or performance score < 0.9.
- Day 4: Add `k6` smoke with p95 thresholds for key APIs. Run against a staging env seeded with realistic data.
- Day 5: Create Prometheus SLO alerts and a Grafana dashboard that shows budgets vs actuals by release.
- Day 6: Wire Argo Rollouts or Flagger for canary + analysis based on RUM metrics.
- Day 7: Pilot an optimization (e.g., image format switch). Validate via CI → canary → RUM. Publish the perf changelog and tighten budgets 10%.
What I’d tell my past self: start with budgets and CI gates, not dashboards. Dashboards are for browsing; gates are for shipping safely.
Key takeaways
- If it doesn’t improve user-facing metrics (LCP, INP, p95 checkout latency), it didn’t happen.
- Automate performance checks in CI/CD with hard budgets that block merges.
- Use both synthetic tests and RUM; validate on canary before full rollout.
- Tie budgets to business targets (Apdex, conversion, retention) and measure the impact.
- Make rollback automatic when SLOs drift—no heroics required.
Implementation checklist
- Define user-centric KPIs: LCP, INP, CLS, p95 API latency, Apdex per journey.
- Create performance budgets per route/API and per region/device class.
- Add Lighthouse CI for web vitals in PRs; fail builds on budget breaches.
- Add k6 for API p95/p99 thresholds; run on ephemeral env or canary.
- Collect RUM (e.g., via Boomerang/Web-Vitals) and export to Prometheus.
- Use Argo Rollouts or Flagger to canary with automated analysis/abort.
- Wire Prometheus alerts to SLOs, not CPU/GC noise.
- Publish a weekly perf scorecard that maps tech changes to revenue/SLAs.
Questions we hear from teams
- Why not just rely on server-side metrics (CPU, GC, qps)?
- Because users don’t feel CPU. They feel LCP, INP, and checkout latency. Server metrics are necessary but insufficient. We’ve seen servers look healthy while RUM showed LCP p75 blowing past 3s on mid Android due to a font swap and 3rd-party tags.
- Synthetic or RUM—do I need both?
- Yes. Synthetic (Lighthouse, k6) is fast and deterministic for CI gating. RUM proves impact on real devices, networks, and geos, and powers SLO-based canaries and rollbacks.
- Won’t performance tests slow down CI?
- Run quick smoke checks (30–90 seconds) on PRs and deeper tests nightly. Gate only critical routes/APIs per PR; everything else can be async. The cost of a perf regression in prod is higher than a 2–3 minute CI job.
- How do I pick budgets?
- Start with Google’s Web Vitals guidance (LCP <2.5s, INP <200ms) and your current p75/p95. Set budgets 10–20% tighter than current, then ratchet down after each win. Make them per route, per device class, and per region.
- What about microservices backends?
- Budget at the edge of the user journey. Then add contracts for hot APIs (p95/p99 by region). Use k6 with tags per endpoint, and validate the journey in canary with RUM. Don’t drown in 200 service-level charts—users don’t care which microservice was slow.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
