The Pager Didn’t Go Off—Your Checkout Still Got Slow: Monitoring That Catches Bottlenecks Before Users Do
Most teams alert on CPU and 500s, then act surprised when conversion drops. Here’s how to wire performance monitoring to user-facing metrics, catch bottlenecks early, and ship optimizations with measurable business impact.
If your dashboards say “healthy” while conversion drops, you’re not monitoring performance—you’re monitoring infrastructure.
The slow-burn outage nobody pages you for
I’ve watched this movie at least a dozen times: ops dashboards are green, CPU is fine, error rate is flat… and meanwhile checkout latency creeps from 350ms to 1.2s p95 over two weeks. No one notices until the growth team shows up with the slide: conversion down 4%.
This is the performance failure mode modern stacks are great at hiding. Auto-scaling masks inefficiency, retries smear errors into “slowness,” and your alerting is still stuck in a 2012 mindset: “page me when we’re down.” Users don’t care that you’re “up.” They care that the site feels fast.
If you want to catch bottlenecks before users do, you need monitoring that’s wired to user-perceived performance and business impact, not just infrastructure vitals.
Measure what users feel (and what the business bleeds)
Start with a small set of user-facing SLIs you can defend in a prioritization meeting.
- Core Web Vitals (RUM, not just lab)
  - `LCP` (Largest Contentful Paint): “did the page load fast enough?”
  - `INP` (Interaction to Next Paint): “does it respond when I click?”
  - `CLS` (Cumulative Layout Shift): “did the UI jump around and make me miss?”
- Backend latency percentiles for key journeys
  - `p50`, `p95`, `p99` for endpoints like `POST /checkout`, `POST /login`, `GET /search`
- Apdex (simple, effective for leadership; see the sketch after this list)
  - Example: satisfied < 300ms, tolerating < 1.2s, frustrated > 1.2s
- Availability at the journey level
  - Not “service up,” but “checkout success rate” and “payment authorization success rate”
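Apdex is simple enough to compute yourself if your APM doesn’t expose it. A minimal sketch using the thresholds from the example above (the thresholds are yours to tune):

```ts
// Apdex over a window of request latencies, in milliseconds.
const SATISFIED_MS = 300;
const TOLERATING_MS = 1200;

export function apdex(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 1; // no traffic, nobody is frustrated
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= SATISFIED_MS) satisfied++;
    else if (ms <= TOLERATING_MS) tolerating++;
    // anything slower counts as "frustrated" and contributes 0
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}

// apdex([120, 250, 900, 3000]) => (2 + 0.5) / 4 = 0.625
```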
Now connect SLIs to business KPIs. You don’t need a PhD model—just get directionally honest:
- `p95` checkout latency ↔ conversion rate, revenue/session
- Search latency ↔ product views/session, add-to-cart rate
- `INP` regressions ↔ bounce rate, support tickets (“the button is broken”)
The only performance dashboard that survives QBR season is the one that shows latency and dollars on the same screen.
Instrumentation that actually helps you debug (RUM + tracing + metrics)
I’ve seen teams buy an APM, ship an agent, and still spend hours guessing. The missing piece is correlation: the slow user session should point to the exact trace, query, and downstream call.
A stack that works (and is vendor-flexible):
- RUM: Grafana Faro, Datadog RUM, New Relic Browser, Sentry Performance
- Tracing: `OpenTelemetry` → Grafana Tempo / Jaeger / Datadog APM
- Metrics: Prometheus → Grafana (with exemplars)
- Logs: Loki / ELK (but don’t make logs your primary performance tool)
Concrete example: OpenTelemetry in Node.js (Express)
```ts
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        // capture headers needed for correlation (be careful with PII)
        headersToSpanAttributes: {
          server: {
            requestHeaders: ['user-agent', 'x-request-id', 'traceparent'],
          },
        },
      },
    }),
  ],
});

sdk.start();
```

Make sure you propagate correlation IDs end-to-end:

- `traceparent` (W3C) across gateways, services, and queues
- A stable `x-request-id` for log correlation
- A user/session identifier in RUM (hashed, no PII) that can be linked to traces
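The piece teams most often skip is minting and echoing the `x-request-id` at the edge. A minimal Express middleware sketch (the attribute key and file name are ours, not a standard):

```ts
// request-id.ts
import { randomUUID } from 'node:crypto';
import { trace } from '@opentelemetry/api';
import type { NextFunction, Request, Response } from 'express';

export function requestId(req: Request, res: Response, next: NextFunction) {
  // Reuse the caller's ID if present, otherwise mint one.
  const id = (req.headers['x-request-id'] as string | undefined) ?? randomUUID();
  res.setHeader('x-request-id', id);

  // Attach it to the active server span so logs, traces, and RUM
  // events can all be joined on the same identifier.
  trace.getActiveSpan()?.setAttribute('http.request_id', id);
  next();
}
```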
Prometheus histogram for p95 you can alert on
If you can’t compute p95 reliably, you’ll argue about performance forever.
```yaml
# prometheus alert example
- alert: CheckoutLatencyP95Regression
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_server_request_duration_seconds_bucket{route="POST /checkout"}[5m]))
      by (le)
    ) > 0.9
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Checkout p95 latency > 900ms for 10m"
    runbook: "https://gitplumbers.com/runbooks/checkout-latency"
```

This is better than CPU alerts because it matches what users experience.
Detect bottlenecks early: burn-rate alerts and regression tripwires
Teams usually set static thresholds (“page me at 2s”). That’s how you miss slow creep.
What actually works in practice:
- Error-budget burn rate for latency SLOs
  - Example SLO: “`POST /checkout` p95 < 800ms, 99% of the time, rolling 30 days”
  - Alert when you’re burning budget fast (fast/slow windows; see the sketch after this list)
- Regression detection against a baseline
  - Alert if `p95` is +20% week-over-week or +15% since the last deploy
- Segmentation so you don’t average away pain
  - Break down by `region`, `device`, `browser`, `customer tier`, `feature flag`
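The fast/slow window trick deserves a concrete shape. A sketch of the multiwindow burn-rate check (the 14.4x factor and the 5m/1h windows are the commonly cited defaults from the SRE workbook, not magic numbers):

```ts
// burn-rate.ts: illustrative; in practice this logic lives in your
// Prometheus/Grafana alert rules rather than application code.
interface WindowStats {
  badEvents: number;   // requests that blew the SLO (latency or errors)
  totalEvents: number;
}

const SLO_TARGET = 0.99; // 99% good => 1% error budget

function burnRate(w: WindowStats): number {
  const badRatio = w.totalEvents === 0 ? 0 : w.badEvents / w.totalEvents;
  return badRatio / (1 - SLO_TARGET); // 1.0 = spending budget exactly on pace
}

// Page only when BOTH windows are hot: the 5m window proves it's happening
// now, the 1h window proves it isn't a blip. A sustained 14.4x burn eats
// ~2% of a 30-day budget in a single hour.
export function shouldPage(fiveMin: WindowStats, oneHour: WindowStats): boolean {
  return burnRate(fiveMin) > 14.4 && burnRate(oneHour) > 14.4;
}
```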
A pattern I like: one dashboard panel per journey showing:
- p95 latency (with deployment markers)
- success rate
- throughput
- conversion (or the closest proxy)
- A link to “top traces” for the slowest 1% (Tempo/Jaeger)
When you can click from “checkout p95 spiked at 10:42” to “top span is SELECT ... taking 480ms,” you’ve won.
Optimization techniques that move the needle (with measurable outcomes)
This is where performance work usually derails: people chase micro-optimizations while the real bottleneck is a bad query or an accidental cache bypass.
Here are high-ROI fixes I’ve seen repeatedly, with the kinds of outcomes you can expect.
1) Kill the N+1s and add the index you’re pretending you don’t need
If you’re on PostgreSQL, the winning loop is: trace → identify slow span → EXPLAIN (ANALYZE, BUFFERS) → index/shape query.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM order_items
WHERE order_id = $1
ORDER BY created_at DESC
LIMIT 50;

-- Typical fix
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_order_items_order_id_created_at
  ON order_items (order_id, created_at DESC);
```

Measurable outcomes I’ve seen in real systems:
- DB time per request down 40–80%
- `p95` for the checkout path down 200–600ms
- Fewer lock/contention incidents during peak
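The index handles the slow-query half of that heading. For the N+1 half, the fix is usually batching the per-row lookups into one round trip. A sketch with `node-postgres`; the table mirrors the example above, but the extra column names and the helper are illustrative:

```ts
import { Pool } from 'pg';

const pool = new Pool(); // reads PG* connection env vars

// Before: one query per order inside a loop (the N+1).
// After: a single query for every order on the page.
export async function loadItemsForOrders(orderIds: number[]) {
  const { rows } = await pool.query(
    `SELECT order_id, sku, quantity, created_at
       FROM order_items
      WHERE order_id = ANY($1)
      ORDER BY created_at DESC`,
    [orderIds]
  );

  // Group rows back by order_id so callers keep their shape.
  const byOrder = new Map<number, typeof rows>();
  for (const row of rows) {
    const list = byOrder.get(row.order_id) ?? [];
    list.push(row);
    byOrder.set(row.order_id, list);
  }
  return byOrder;
}
```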
2) Cache the boring stuff (and stop stampedes)
Common failure: teams add Redis, then accidentally create a stampede at TTL boundaries.
What works:
- Request coalescing (single-flight), sketched after this list
- Stale-while-revalidate for non-critical data
- Tiered caching: CDN → edge → app cache → DB
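A minimal in-process single-flight sketch (per instance only; for a fleet you’d implement the same idea with a short-lived lock in Redis):

```ts
// Coalesce concurrent requests for the same key into one upstream call.
const inFlight = new Map<string, Promise<unknown>>();

export function singleFlight<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const p = fn().finally(() => inFlight.delete(key)); // always clear, even on error
  inFlight.set(key, p);
  return p;
}

// Usage: a cache-miss stampede becomes one DB read per key per instance.
// const product = await singleFlight(`product:${id}`, () => loadProductFromDb(id));
```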
Outcomes:
- Origin request volume down 30–70%
- `p95` latency down 100–400ms for read-heavy endpoints
- Cost reduction: fewer DB replicas and a smaller autoscaling footprint
3) Move work off the critical path
The fastest request is the one that doesn’t do the work.
- Push email, analytics, and webhook fanout to a queue (`SQS`, `Kafka`, `RabbitMQ`)
- Precompute aggregates (nightly job + incremental updates)
- Use timeouts + circuit breakers so one slow downstream doesn’t poison everything
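Timeouts are the cheapest of the three. A sketch using the built-in `fetch` and `AbortSignal.timeout` (Node 18+); the downstream URL and the fallback behavior are assumptions you’d decide per call site:

```ts
// Give a nice-to-have downstream a hard deadline instead of inheriting its latency.
export async function getRecommendations(userId: string): Promise<unknown[]> {
  try {
    const res = await fetch(`https://recs.internal/users/${userId}`, {
      signal: AbortSignal.timeout(250), // checkout should not wait on recommendations
    });
    if (!res.ok) return []; // degrade instead of failing the whole request
    return (await res.json()) as unknown[];
  } catch {
    // timeout or network error: ship the page without recommendations
    return [];
  }
}
```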
Outcomes:
- Tail latency (`p99`) collapses when downstream jitter is isolated
- MTTR improves because incidents are smaller and localized
4) Frontend wins that show up in revenue dashboards
If you’re only watching backend APM, you’re missing the biggest lever: render + interactivity.
- Ship less JavaScript (tree-shake, split chunks, remove zombie dependencies)
- Set performance budgets (bundle size, `LCP`, `INP`) per route
- Use CDN caching and correct `Cache-Control` headers for static assets
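For the static-asset half, the usual pattern is long-lived, immutable caching for content-hashed filenames and revalidation for the HTML shell. A sketch with `express.static` (assumes your build emits hashed filenames like `app.3f9c2a.js`):

```ts
import express from 'express';

const app = express();

// Hashed bundles can be cached "forever": a new deploy changes the filename,
// so stale content is never served.
app.use('/assets', express.static('dist/assets', {
  maxAge: '1y',
  immutable: true, // adds the Cache-Control "immutable" directive
}));

// The HTML shell must revalidate on every request so users pick up new hashes.
app.get('/', (_req, res) => {
  res.set('Cache-Control', 'no-cache');
  res.sendFile('index.html', { root: 'dist' });
});
```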
Typical outcomes when done seriously:
- `LCP` down 0.5–1.5s on mid-tier devices
- Bounce rate down 2–8% (varies by traffic source)
- Conversion lifts that finance actually notices
Make performance regressions hard to ship (CI budgets + canary SLO gates)
Relying on “we’ll monitor after deploy” is how you end up with a slow site and a lot of meeting invites.
CI performance smoke with k6
```bash
k6 run --vus 20 --duration 30s \
  -e BASE_URL=https://staging.example.com \
  perf/checkout_smoke.js
```

```js
// perf/checkout_smoke.js
import http from 'k6/http';
import { check } from 'k6';

export default function () {
  const res = http.post(`${__ENV.BASE_URL}/checkout`, JSON.stringify({ items: [1, 2, 3] }), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, {
    'status is 200': (r) => r.status === 200,
    'TTFB under 800ms': (r) => r.timings.waiting < 800,
  });
}
```

Canary gate on SLOs
- Deploy 5% canary
- Require `p95` and error rate to stay within guardrails for 15–30 minutes
- Auto-roll back if the burn rate spikes
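What the guardrail check can look like as a pipeline step: query Prometheus for the canary’s `p95` and fail the deploy if it breaches the SLO threshold. A sketch that assumes a `deployment="canary"` label, a `PROM_URL` environment variable, and the metric from earlier; your labels and rollout tooling will differ:

```ts
// canary-gate.ts: run from CI/CD after shifting ~5% of traffic to the canary.
const PROM_URL = process.env.PROM_URL ?? 'http://prometheus:9090';

const QUERY = `histogram_quantile(0.95,
  sum(rate(http_server_request_duration_seconds_bucket{route="POST /checkout",deployment="canary"}[5m]))
  by (le)
)`;

async function canaryP95Seconds(): Promise<number> {
  const res = await fetch(`${PROM_URL}/api/v1/query?query=${encodeURIComponent(QUERY)}`);
  const body = (await res.json()) as any;
  // Instant-query vectors come back as [timestamp, "value"] pairs.
  const value = body?.data?.result?.[0]?.value?.[1];
  return value ? parseFloat(value) : NaN;
}

async function main() {
  const p95 = await canaryP95Seconds();
  if (!(p95 < 0.8)) { // the 800ms guardrail from the SLO example above
    console.error(`Canary p95=${p95}s breaches the guardrail; roll back`);
    process.exit(1); // non-zero exit fails the pipeline / triggers rollback
  }
  console.log(`Canary p95=${p95}s within guardrail; continue rollout`);
}

main();
```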
This is boring. It also prevents the “Friday night surprise” that turns into a Monday postmortem.
What GitPlumbers does when performance is already on fire
At GitPlumbers, we usually get called after the pain is visible: conversion is down, cloud spend is up, and nobody trusts the graphs. The fastest path to sanity is consistent:
- Define journey-level SLIs/SLOs leadership can align on
- Instrument with `OpenTelemetry` and wire RUM → traces → logs
- Build a bottleneck backlog ranked by user impact (not engineer vibes)
- Ship a few high-ROI fixes (DB, caching, critical path) and lock them in with CI gates
If you want a second set of eyes from people who’ve debugged this in everything from creaky Rails monoliths to service-mesh sprawl, we’re here.
Next step: pick one journey (usually checkout or login) and make it observable end-to-end in a week. Everything gets easier from there.
Key takeaways
- If you’re not alerting on user-facing SLIs (p95, Apdex, Core Web Vitals), you’re optimizing blind.
- Tie performance alerts to business KPIs (conversion, revenue per session, churn) so prioritization is obvious.
- Use OpenTelemetry to correlate slow requests to traces, DB queries, and downstream calls—then fix the highest-impact path.
- Prevent regressions with performance budgets in CI and canary SLO gates; don’t rely on “we’ll watch it after deploy”.
- Most high-ROI wins are boring: caching, DB indexing, eliminating N+1s, right-sizing payloads, and moving work off the critical path.
Implementation checklist
- Define 3–5 user-facing SLIs: `p95`/`p99` for key endpoints, Apdex, Core Web Vitals (`LCP`, `INP`, `CLS`).
- Create 1–2 SLOs per critical journey (e.g., checkout) with an error budget and alert on burn rate.
- Instrument services with `OpenTelemetry` traces + metrics; propagate `traceparent` through gateways and workers.
- Add RUM to capture real device/network behavior; segment by geo/device/browser.
- Build a “top user journeys” dashboard: latency, errors, and conversion on the same screen.
- Add exemplars linking Grafana metrics → Tempo traces → logs (`Loki`) for one-click root cause.
- Set alerts on *changes* (regression detection): `p95` +20% over baseline, not just absolute thresholds.
- Add CI performance budgets (bundle size, LCP lab checks) and a load test smoke (`k6`) per release.
- Run a weekly bottleneck review: top 3 regressions, top 3 wins, and time-to-fix (MTTR) trend.
Questions we hear from teams
- Should we start with RUM or backend APM?
- Start with the journey you care about most (usually checkout/login). If you can only do one first, ship backend tracing/metrics with `OpenTelemetry` so you can debug quickly, then add RUM to catch device/network realities and segment the pain. The win is correlation, not picking a tool.
- What percentile should we alert on: p95 or p99?
- Use **p95** for primary alerting (less noisy) and track **p99** for “VIP pain” and tail-latency work. Tail issues often come from GC pauses, DB lock contention, or downstream jitter—p99 is where those show up.
- How do we avoid alert fatigue with performance monitoring?
- Alert on **burn rate** and **regressions**, not every small spike. Tie alerts to a specific journey SLO, add segmentation (region/device), and require a minimum duration (e.g., 10–15 minutes) so you page on sustained user impact.
- How do we prove performance work is worth it?
- Put latency and business KPIs side-by-side: `p95 checkout` next to conversion/revenue/session. Then run controlled rollouts (canary or feature flag) and compare cohorts. Even a simple before/after with deployment markers is usually enough to stop the “performance is a nice-to-have” debate.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
