The Pager Didn’t Go Off—Your Checkout Still Got Slow: Monitoring That Catches Bottlenecks Before Users Do
Most teams alert on CPU and 500s, then act surprised when conversion drops. Here’s how to wire performance monitoring to user-facing metrics, catch bottlenecks early, and ship optimizations with measurable business impact.
If your dashboards say “healthy” while conversion drops, you’re not monitoring performance—you’re monitoring infrastructure.
The slow-burn outage nobody pages you for
I’ve watched this movie at least a dozen times: ops dashboards are green, CPU is fine, error rate is flat… and meanwhile checkout latency creeps from 350ms to 1.2s p95 over two weeks. No one notices until the growth team shows up with the slide: conversion down 4%.
This is the performance failure mode modern stacks are great at hiding. Auto-scaling masks inefficiency, retries smear errors into “slowness,” and your alerting is still stuck in a 2012 mindset: “page me when we’re down.” Users don’t care that you’re “up.” They care that the site feels fast.
If you want to catch bottlenecks before users do, you need monitoring that’s wired to user-perceived performance and business impact, not just infrastructure vitals.
Measure what users feel (and what the business bleeds)
Start with a small set of user-facing SLIs you can defend in a prioritization meeting.
- Core Web Vitals (RUM, not just lab)
  - `LCP` (Largest Contentful Paint): “did the page load fast enough?”
  - `INP` (Interaction to Next Paint): “does it respond when I click?”
  - `CLS` (Cumulative Layout Shift): “did the UI jump around and make me miss?”
- Backend latency percentiles for key journeys
  - `p50`, `p95`, `p99` for endpoints like `POST /checkout`, `POST /login`, `GET /search`
- Apdex (simple, effective for leadership; see the sketch after this list)
  - Example: satisfied < 300ms, tolerating < 1.2s, frustrated > 1.2s
- Availability at the journey level
  - Not “service up,” but “checkout success rate” and “payment authorization success rate”
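Apdex is simple enough to compute yourself if your APM doesn’t expose it. A minimal sketch using the thresholds from the example above (the thresholds are yours to tune):

```ts
// Apdex over a window of request latencies, in milliseconds.
const SATISFIED_MS = 300;
const TOLERATING_MS = 1200;

export function apdex(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 1; // no traffic, nobody is frustrated
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= SATISFIED_MS) satisfied++;
    else if (ms <= TOLERATING_MS) tolerating++;
    // anything slower counts as "frustrated" and contributes 0
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}

// apdex([120, 250, 900, 3000]) => (2 + 0.5) / 4 = 0.625
```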
Now connect SLIs to business KPIs. You don’t need a PhD model—just get directionally honest:
- `p95` checkout latency ↔ conversion rate, revenue/session
- Search latency ↔ product views/session, add-to-cart rate
- `INP` regressions ↔ bounce rate, support tickets (“the button is broken”)
The only performance dashboard that survives QBR season is the one that shows latency and dollars on the same screen.
Instrumentation that actually helps you debug (RUM + tracing + metrics)
I’ve seen teams buy an APM, ship an agent, and still spend hours guessing. The missing piece is correlation: the slow user session should point to the exact trace, query, and downstream call.
A stack that works (and is vendor-flexible):
- RUM: Grafana Faro, Datadog RUM, New Relic Browser, Sentry Performance
- Tracing: `OpenTelemetry` → Grafana Tempo / Jaeger / Datadog APM
- Metrics: Prometheus → Grafana (with exemplars)
- Logs: Loki / ELK (but don’t make logs your primary performance tool)
Concrete example: OpenTelemetry in Node.js (Express)
```ts
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // capture headers needed for correlation (be careful with PII)
        headersToSpanAttributes: {
          server: {
            requestHeaders: ['user-agent', 'x-request-id', 'traceparent'],
          },
        },
      },
    }),
  ],
});

sdk.start();
```

Make sure you propagate correlation IDs end-to-end:

- `traceparent` (W3C) across gateways, services, and queues
- A stable `x-request-id` for log correlation
- A user/session identifier in RUM (hashed, no PII) that can be linked to traces
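The piece teams most often skip is minting and echoing the `x-request-id` at the edge. A minimal Express middleware sketch (the attribute key and file name are ours, not a standard):

```ts
// request-id.ts
import { randomUUID } from 'node:crypto';
import { trace } from '@opentelemetry/api';
import type { NextFunction, Request, Response } from 'express';

export function requestId(req: Request, res: Response, next: NextFunction) {
  // Reuse the caller's ID if present, otherwise mint one.
  const id = (req.headers['x-request-id'] as string | undefined) ?? randomUUID();
  res.setHeader('x-request-id', id);

  // Attach it to the active server span so logs, traces, and RUM
  // events can all be joined on the same identifier.
  trace.getActiveSpan()?.setAttribute('http.request_id', id);
  next();
}
```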
Prometheus histogram for p95 you can alert on
If you can’t compute p95 reliably, you’ll argue about performance forever.
```yaml
# prometheus alert example
- alert: CheckoutLatencyP95Regression
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_server_request_duration_seconds_bucket{route="POST /checkout"}[5m]))
      by (le)
    ) > 0.9
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Checkout p95 latency > 900ms for 10m"
    runbook: "https://gitplumbers.com/runbooks/checkout-latency"
```

This is better than CPU alerts because it matches what users experience.
Detect bottlenecks early: burn-rate alerts and regression tripwires
Teams usually set static thresholds (“page me at 2s”). That’s how you miss slow creep.
What actually works in practice:
- Error-budget burn rate for latency SLOs
  - Example SLO: “`POST /checkout` p95 < 800ms, 99% of the time, rolling 30 days”
  - Alert when you’re burning budget fast (fast/slow windows; see the sketch after this list)
- Regression detection against a baseline
  - Alert if `p95` is +20% week-over-week or +15% since the last deploy
- Segmentation so you don’t average away pain
  - Break down by `region`, `device`, `browser`, `customer tier`, `feature flag`
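The fast/slow window trick deserves a concrete shape. A sketch of the multiwindow burn-rate check (the 14.4x factor and the 5m/1h windows are the commonly cited defaults from the SRE workbook, not magic numbers):

```ts
// burn-rate.ts: illustrative; in practice this logic lives in your
// Prometheus/Grafana alert rules rather than application code.
interface WindowStats {
  badEvents: number;   // requests that blew the SLO (latency or errors)
  totalEvents: number;
}

const SLO_TARGET = 0.99; // 99% good => 1% error budget

function burnRate(w: WindowStats): number {
  const badRatio = w.totalEvents === 0 ? 0 : w.badEvents / w.totalEvents;
  return badRatio / (1 - SLO_TARGET); // 1.0 = spending budget exactly on pace
}

// Page only when BOTH windows are hot: the 5m window proves it's happening
// now, the 1h window proves it isn't a blip. A sustained 14.4x burn eats
// ~2% of a 30-day budget in a single hour.
export function shouldPage(fiveMin: WindowStats, oneHour: WindowStats): boolean {
  return burnRate(fiveMin) > 14.4 && burnRate(oneHour) > 14.4;
}
```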
A pattern I like: one dashboard panel per journey showing:
- p95 latency (with deployment markers)
- success rate
- throughput
- conversion (or the closest proxy)
- A link to “top traces” for the slowest 1% (Tempo/Jaeger)
When you can click from “checkout p95 spiked at 10:42” to “top span is SELECT ... taking 480ms,” you’ve won.
Optimization techniques that move the needle (with measurable outcomes)
This is where performance work usually derails: people chase micro-optimizations while the real bottleneck is a bad query or an accidental cache bypass.
Here are high-ROI fixes I’ve seen repeatedly, with the kinds of outcomes you can expect.
1) Kill the N+1s and add the index you’re pretending you don’t need
If you’re on PostgreSQL, the winning loop is: trace → identify slow span → EXPLAIN (ANALYZE, BUFFERS) → index/shape query.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM order_items
WHERE order_id = $1
ORDER BY created_at DESC
LIMIT 50;

-- Typical fix
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_order_items_order_id_created_at
  ON order_items (order_id, created_at DESC);
```

Measurable outcomes I’ve seen in real systems:
- DB time per request down 40–80%
- `p95` for the checkout path down 200–600ms
- Fewer lock/contention incidents during peak
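The index handles the slow-query half of that heading. For the N+1 half, the fix is usually batching the per-row lookups into one round trip. A sketch with `node-postgres`; the table mirrors the example above, but the extra column names and the helper are illustrative:

```ts
import { Pool } from 'pg';

const pool = new Pool(); // reads PG* connection env vars

// Before: one query per order inside a loop (the N+1).
// After: a single query for every order on the page.
export async function loadItemsForOrders(orderIds: number[]) {
  const { rows } = await pool.query(
    `SELECT order_id, sku, quantity, created_at
       FROM order_items
      WHERE order_id = ANY($1)
      ORDER BY created_at DESC`,
    [orderIds]
  );

  // Group rows back by order_id so callers keep their shape.
  const byOrder = new Map<number, typeof rows>();
  for (const row of rows) {
    const list = byOrder.get(row.order_id) ?? [];
    list.push(row);
    byOrder.set(row.order_id, list);
  }
  return byOrder;
}
```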
2) Cache the boring stuff (and stop stampedes)
Common failure: teams add Redis, then accidentally create a stampede at TTL boundaries.
What works:
- Request coalescing (single-flight), sketched after this list
- Stale-while-revalidate for non-critical data
- Tiered caching: CDN → edge → app cache → DB
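A minimal in-process single-flight sketch (per instance only; for a fleet you’d implement the same idea with a short-lived lock in Redis):

```ts
// Coalesce concurrent requests for the same key into one upstream call.
const inFlight = new Map<string, Promise<unknown>>();

export function singleFlight<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const p = fn().finally(() => inFlight.delete(key)); // always clear, even on error
  inFlight.set(key, p);
  return p;
}

// Usage: a cache-miss stampede becomes one DB read per key per instance.
// const product = await singleFlight(`product:${id}`, () => loadProductFromDb(id));
```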
Outcomes:
- Origin request volume down 30–70%
- `p95` latency down 100–400ms for read-heavy endpoints
- Cost reduction: fewer DB replicas and a smaller autoscaling footprint
3) Move work off the critical path
The fastest request is the one that doesn’t do the work.
- Push email, analytics, and webhook fanout to a queue (`SQS`, `Kafka`, `RabbitMQ`)
- Precompute aggregates (nightly job + incremental updates)
- Use timeouts + circuit breakers so one slow downstream doesn’t poison everything
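Timeouts are the cheapest of the three. A sketch using the built-in `fetch` and `AbortSignal.timeout` (Node 18+); the downstream URL and the fallback behavior are assumptions you’d decide per call site:

```ts
// Give a nice-to-have downstream a hard deadline instead of inheriting its latency.
export async function getRecommendations(userId: string): Promise<unknown[]> {
  try {
    const res = await fetch(`https://recs.internal/users/${userId}`, {
      signal: AbortSignal.timeout(250), // checkout should not wait on recommendations
    });
    if (!res.ok) return []; // degrade instead of failing the whole request
    return (await res.json()) as unknown[];
  } catch {
    // timeout or network error: ship the page without recommendations
    return [];
  }
}
```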
Outcomes:
- Tail latency (`p99`) collapses when downstream jitter is isolated
- MTTR improves because incidents are smaller and localized
4) Frontend wins that show up in revenue dashboards
If you’re only watching backend APM, you’re missing the biggest lever: render + interactivity.
- Ship less JavaScript (tree-shake, split chunks, remove zombie dependencies)
- Set performance budgets (bundle size, `LCP`, `INP`) per route
- Use CDN caching and correct `Cache-Control` headers for static assets
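For the static-asset half, the usual pattern is long-lived, immutable caching for content-hashed filenames and revalidation for the HTML shell. A sketch with `express.static` (assumes your build emits hashed filenames like `app.3f9c2a.js`):

```ts
import express from 'express';

const app = express();

// Hashed bundles can be cached "forever": a new deploy changes the filename,
// so stale content is never served.
app.use('/assets', express.static('dist/assets', {
  maxAge: '1y',
  immutable: true, // adds the Cache-Control "immutable" directive
}));

// The HTML shell must revalidate on every request so users pick up new hashes.
app.get('/', (_req, res) => {
  res.set('Cache-Control', 'no-cache');
  res.sendFile('index.html', { root: 'dist' });
});
```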
Typical outcomes when done seriously:
- `LCP` down 0.5–1.5s on mid-tier devices
- Bounce rate down 2–8% (varies by traffic source)
- Conversion lifts that finance actually notices
Make performance regressions hard to ship (CI budgets + canary SLO gates)
Relying on “we’ll monitor after deploy” is how you end up with a slow site and a lot of meeting invites.
CI performance smoke with k6
```bash
k6 run --vus 20 --duration 30s \
  -e BASE_URL=https://staging.example.com \
  perf/checkout_smoke.js
```

```js
// perf/checkout_smoke.js
import http from 'k6/http';
import { check } from 'k6';

export default function () {
  const res = http.post(`${__ENV.BASE_URL}/checkout`, JSON.stringify({ items: [1, 2, 3] }), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, {
    'status is 200': (r) => r.status === 200,
    'TTFB under 800ms': (r) => r.timings.waiting < 800,
  });
}
```

Canary gate on SLOs
- Deploy 5% canary
- Require `p95` and error rate to stay within guardrails for 15–30 minutes
- Auto-roll back if the burn rate spikes
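What the guardrail check can look like as a pipeline step: query Prometheus for the canary’s `p95` and fail the deploy if it breaches the SLO threshold. A sketch that assumes a `deployment="canary"` label, a `PROM_URL` environment variable, and the metric from earlier; your labels and rollout tooling will differ:

```ts
// canary-gate.ts: run from CI/CD after shifting ~5% of traffic to the canary.
const PROM_URL = process.env.PROM_URL ?? 'http://prometheus:9090';

const QUERY = `histogram_quantile(0.95,
  sum(rate(http_server_request_duration_seconds_bucket{route="POST /checkout",deployment="canary"}[5m]))
  by (le)
)`;

async function canaryP95Seconds(): Promise<number> {
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(QUERY)}`);
  const body = (await res.json()) as any;
  // Instant-query vectors come back as [timestamp, "value"] pairs.
  const value = body?.data?.result?.[0]?.value?.[1];
  return value ? parseFloat(value) : NaN;
}

async function main() {
  const p95 = await canaryP95Seconds();
  if (!(p95 < 0.8)) { // the 800ms guardrail from the SLO example above
    console.error(`Canary p95=${p95}s breaches the guardrail; roll back`);
    process.exit(1); // non-zero exit fails the pipeline / triggers rollback
  }
  console.log(`Canary p95=${p95}s within guardrail; continue rollout`);
}

main();
```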
This is boring. It also prevents the “Friday night surprise” that turns into a Monday postmortem.
What GitPlumbers does when performance is already on fire
At GitPlumbers, we usually get called after the pain is visible: conversion is down, cloud spend is up, and nobody trusts the graphs. The fastest path to sanity is consistent:
- Define journey-level SLIs/SLOs leadership can align on
- Instrument with `OpenTelemetry` and wire RUM → traces → logs
- Build a bottleneck backlog ranked by user impact (not engineer vibes)
- Ship a few high-ROI fixes (DB, caching, critical path) and lock them in with CI gates
If you want a second set of eyes from people who’ve debugged this in everything from creaky Rails monoliths to service-mesh sprawl, we’re here.
Next step: pick one journey (usually checkout or login) and make it observable end-to-end in a week. Everything gets easier from there.
Key takeaways
- If you’re not alerting on user-facing SLIs (p95, Apdex, Core Web Vitals), you’re optimizing blind.
- Tie performance alerts to business KPIs (conversion, revenue per session, churn) so prioritization is obvious.
- Use OpenTelemetry to correlate slow requests to traces, DB queries, and downstream calls—then fix the highest-impact path.
- Prevent regressions with performance budgets in CI and canary SLO gates; don’t rely on “we’ll watch it after deploy”.
- Most high-ROI wins are boring: caching, DB indexing, eliminating N+1s, right-sizing payloads, and moving work off the critical path.
Implementation checklist
- Define 3–5 user-facing SLIs: `p95`/`p99` for key endpoints, Apdex, Core Web Vitals (`LCP`, `INP`, `CLS`).
- Create 1–2 SLOs per critical journey (e.g., checkout) with an error budget and alert on burn rate.
- Instrument services with `OpenTelemetry` traces + metrics; propagate `traceparent` through gateways and workers.
- Add RUM to capture real device/network behavior; segment by geo/device/browser.
- Build a “top user journeys” dashboard: latency, errors, and conversion on the same screen.
- Add exemplars linking Grafana metrics → Tempo traces → logs (`Loki`) for one-click root cause.
- Set alerts on *changes* (regression detection): `p95` +20% over baseline, not just absolute thresholds.
- Add CI performance budgets (bundle size, LCP lab checks) and a load test smoke (`k6`) per release.
- Run a weekly bottleneck review: top 3 regressions, top 3 wins, and time-to-fix (MTTR) trend.
Questions we hear from teams
- Should we start with RUM or backend APM?
- Start with the journey you care about most (usually checkout/login). If you can only do one first, ship backend tracing/metrics with `OpenTelemetry` so you can debug quickly, then add RUM to catch device/network realities and segment the pain. The win is correlation, not picking a tool.
- What percentile should we alert on: p95 or p99?
- Use **p95** for primary alerting (less noisy) and track **p99** for “VIP pain” and tail-latency work. Tail issues often come from GC pauses, DB lock contention, or downstream jitter—p99 is where those show up.
- How do we avoid alert fatigue with performance monitoring?
- Alert on **burn rate** and **regressions**, not every small spike. Tie alerts to a specific journey SLO, add segmentation (region/device), and require a minimum duration (e.g., 10–15 minutes) so you page on sustained user impact.
- How do we prove performance work is worth it?
- Put latency and business KPIs side-by-side: `p95 checkout` next to conversion/revenue/session. Then run controlled rollouts (canary or feature flag) and compare cohorts. Even a simple before/after with deployment markers is usually enough to stop the “performance is a nice-to-have” debate.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
