Status Page Green, Revenue Red: Synthetic Monitors That Predict Incidents and Gate Rollouts
Stop waiting for Twitter to tell you you’re down. Build synthetics that behave like pissed‑off users, surface leading indicators, and automatically pause bad rollouts.
If a synthetic can’t point you to a trace and a rollback in two clicks, it’s not done.
The night our status page was green while checkout was dead
I’ve lived this one more than once. Kubernetes happy. CPU fine. Ping checks green. And yet, revenue flatlined because the token refresh endpoint returned 401s to users with an expired session cookie. No one paged us because the 200 OK health probe was still passing. Twitter did.
We fixed it by treating synthetics as first‑class users. Not a ping to /healthz, but a scripted journey: land → login → add to cart → checkout. We tagged every request, pushed metrics for each step, and made those signals block bad rollouts. Since then, we haven’t learned about incidents from customers.
Measure what predicts pain, not what flatters dashboards
The golden signals still matter, but synthetics should emit leading indicators that move before the outage hits the status page.
Track these for each journey and step:
- Success ratio (last 5–10m): successful_steps / total_steps. Early warning when auth or payments start flaking.
- Tail latency (p95/p99): high tails predict user rage quits long before averages budge.
- Retry rate and 429/503 rate: retry storms and soft timeouts precede hard failures.
- Auth anomalies: spikes in 401/403 from login/token refresh are canaries for expired certs, bad feature flags, or IdP drift.
- Dependency lag: TLS handshake duration, DNS resolution time, queue depth, and DB pool saturation hint at downstream pain.
- Front‑end UX markers: LCP/TTI and console error rate on key screens.
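Tail latency is cheap to compute in the runner itself before metrics ship anywhere. A minimal nearest-rank percentile sketch (the function name is illustrative, not from any library):

```typescript
// Nearest-rank percentile: sort samples, take the ceil(p% * n)-th value.
// Good enough for per-step p95/p99 inside a synthetic runner.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

Feed it the per-step timings your journey script collects and emit the result as a gauge alongside the raw success counts.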
Ignore these vanity metrics in synthetics:
- Average response time without percentiles
- CPU/memory of the synthetic runner
- “Pages visited” without success semantics
Design synthetics like a pissed‑off user
If you’ve only got budget for a handful, pick the money paths:
- Consumer: login, browse, search, add to cart, checkout
- B2B: SSO (SAML/OIDC), report export, webhook delivery
- Platform: API key create, write, read, list
Principles that keep them useful and stable:
- Trace and tag: inject traceparent and synthetic=true so you can click from alert → trace → service.
- Deterministic data: dedicated test accounts, idempotent cart SKUs, non‑expiring payment tokens in non‑prod, and synthetic SKUs hidden from real users.
- Time budgets: Step timeouts that match real UX expectations (e.g., login p95 < 1.2s, checkout p95 < 2s).
- Multi‑region: Run from where users are. Latency regressions hide in “close to cluster” probes.
- Noisy dependency tolerance: A flaky CDN PoP shouldn’t page you alone. Use multi‑signal correlation or require consecutive failures.
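The "consecutive failures plus quorum" rule is easy to encode once and reuse across probes. A minimal sketch, assuming each run records per-region pass/fail (types and names are illustrative):

```typescript
// Page only when the last N runs ALL had at least `quorum` regions failing.
// A single flaky CDN PoP or one bad run won't trip this.
type ProbeResult = { region: string; ok: boolean };

export function shouldPage(
  history: ProbeResult[][], // one entry per run, each with per-region results
  consecutive = 3,          // required consecutive failing runs
  quorum = 2                // regions that must fail in the same run
): boolean {
  const recent = history.slice(-consecutive);
  if (recent.length < consecutive) return false; // not enough evidence yet
  return recent.every(run => run.filter(r => !r.ok).length >= quorum);
}
```

Tune `consecutive` against your probe interval: three failures at a one-minute cadence means about three minutes of sustained pain before anyone is woken up.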
Here’s a Playwright example that injects traceparent and tags requests. It measures step latencies and logs a structured result.
// playwright-synthetic.spec.ts
import { test, expect } from '@playwright/test';
import { randomBytes } from 'node:crypto';

// W3C traceparent: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
function randHex(bytes: number) { return randomBytes(bytes).toString('hex'); }
function makeTraceparent() { return `00-${randHex(16)}-${randHex(8)}-01`; }
const ENV = process.env.ENV || 'prod';
const RELEASE = process.env.RELEASE || 'unknown';
test('checkout flow', async ({ context, page }) => {
const traceparent = makeTraceparent();
await context.setExtraHTTPHeaders({
'traceparent': traceparent,
'x-synthetic': 'true',
'x-release': RELEASE,
});
const timings: Record<string, number> = {};
const step = async (name: string, fn: () => Promise<void>) => {
const t0 = Date.now();
await fn();
timings[name] = Date.now() - t0;
};
await step('load_home', async () => {
await page.goto('https://shop.example.com');
await expect(page.getByRole('banner')).toBeVisible();
});
await step('login', async () => {
await page.click('text=Sign in');
await page.fill('#email', process.env.SYN_USER!);
await page.fill('#password', process.env.SYN_PASS!);
await page.click('button[type=submit]');
await expect(page.getByText('Welcome')).toBeVisible();
});
await step('add_to_cart', async () => {
await page.click('[data-sku="SYNTHETIC-SKU"]');
await page.click('text=Add to cart');
await expect(page.getByText('1 item')).toBeVisible();
});
await step('checkout', async () => {
await page.click('text=Checkout');
await expect(page.getByText('Order confirmed')).toBeVisible({ timeout: 8000 });
});
console.log(JSON.stringify({ env: ENV, release: RELEASE, traceparent, timings }));
});

Run it from a controlled runner (k8s CronJob, Checkly, or CI) with stable secrets from Vault/Secrets Manager.
Tag, trace, and store: make telemetry debuggable
Synthetics are only useful if the failure is actionable. That means first‑class telemetry.
- Propagate context: your services should respect W3C Trace Context. When the synthetic injects traceparent, your API gateway/app should start a trace and add span attributes: synthetic=true, flow=checkout, release=<sha>, canary=true|false.
- Export to TSDB: push synthetic metrics (success, latency, retries) to Prometheus, Mimir, or Datadog with tags. Prefer remote write or StatsD/OTLP from the runner.
- Dashboards that answer “why”: one panel per step with success %, p95, error types, and links to example traces. If your graph doesn’t have a “click to trace” exemplar, you’re wasting time.
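On the service side, the tagging can be as simple as deriving span attributes from the probe’s headers and calling `span.setAttributes()` in middleware. A hedged sketch of the pure part (x-flow is an assumed header; x-synthetic and x-release match the probes above):

```typescript
// Derive span attributes for synthetic traffic from request headers.
// Returns null for real user traffic so callers can skip tagging entirely.
export function syntheticSpanAttributes(
  headers: Record<string, string | undefined>
): Record<string, string | boolean> | null {
  if (headers['x-synthetic'] !== 'true') return null;
  return {
    synthetic: true,
    flow: headers['x-flow'] ?? 'unknown',       // x-flow is an assumed header
    release: headers['x-release'] ?? 'unknown',
  };
}
```

In an OpenTelemetry-instrumented app you would pass the result to the active span (e.g. `trace.getActiveSpan()?.setAttributes(attrs)`), which is what makes the alert → trace click possible.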
A lightweight k6 runner can produce Prometheus metrics you can gate on. Note the W3C trace header and tags for env/flow/release.
// checkout-synthetic.k6.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Trend, Rate } from 'k6/metrics';
export const options = {
scenarios: {
continuous: { executor: 'constant-vus', vus: 1, duration: '10m' },
},
tags: { env: `${__ENV.ENV || 'prod'}`, flow: 'checkout', release: `${__ENV.RELEASE || 'unknown'}` },
thresholds: {
'synthetic_flow_success': ['rate>0.995'],
'step_checkout': ['p(95)<2000'],
},
};
const success = new Rate('synthetic_flow_success');
const step_home = new Trend('step_home');
const step_login = new Trend('step_login');
const step_checkout = new Trend('step_checkout');
function traceparent() {
// W3C format: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>.
// Math.random is fine here; synthetic trace IDs don't need a CSPRNG.
function hex(bytes) {
let s = '';
for (let i = 0; i < bytes; i++) s += ('0' + Math.floor(Math.random() * 256).toString(16)).slice(-2);
return s;
}
return `00-${hex(16)}-${hex(8)}-01`;
}
export default function () {
// Send JSON bodies explicitly; k6 form-encodes plain objects by default.
const headers = { traceparent: traceparent(), 'x-synthetic': 'true', 'x-release': `${__ENV.RELEASE || 'unknown'}`, 'Content-Type': 'application/json' };
const r1 = http.get('https://shop.example.com', { headers });
step_home.add(r1.timings.duration);
const r2 = http.post('https://shop.example.com/api/login', JSON.stringify({ user: __ENV.SYN_USER, pass: __ENV.SYN_PASS }), { headers });
step_login.add(r2.timings.duration);
const r3 = http.post('https://shop.example.com/api/checkout', JSON.stringify({ sku: 'SYNTHETIC-SKU' }), { headers });
step_checkout.add(r3.timings.duration);
const ok = check(r3, { 'order ok': (res) => res.status === 200 });
success.add(ok);
sleep(5);
}

Enable the Prometheus remote‑write receiver and ship k6 metrics straight in:
# Prometheus
--web.enable-remote-write-receiver
# Run k6 and push metrics with tags
ENV=prod RELEASE=$(git rev-parse --short HEAD) \
K6_PROMETHEUS_RW_SERVER_URL=http://prom.monitoring:9090/api/v1/write \
k6 run -o experimental-prometheus-rw checkout-synthetic.k6.js

Prometheus recording rules create the gates you’ll reuse everywhere:
# recording-rules.yaml
groups:
  - name: synthetic
    rules:
      # NOTE: metric names must match your k6 → Prometheus remote-write mapping.
      - record: synthetic:flow_success:ratio5m
        # Success samples are 0/1, so a windowed average is the success ratio.
        # Keep release so rollout gates can select the canary's series.
        expr: |
          avg by (env, flow, release) (
            avg_over_time(synthetic_flow_success{env="prod",flow="checkout"}[5m])
          )
      - record: synthetic:checkout_p95_ms:5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, env, release) (rate(step_checkout_bucket{env="prod"}[5m]))
          )

Make synthetics gate your rollouts (not just page you)
This is where most teams stop short. Don’t. Progressive delivery without synthetic gates is a hope‑and‑pray strategy.
With Argo Rollouts, add an AnalysisTemplate that queries Prometheus for synthetic success and latency, and pause/rollback automatically when it degrades.
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-synthetic-gate
spec:
  args:
    - name: release
  metrics:
    - name: synthetic-success
      interval: 30s
      count: 10
      successCondition: result[0] >= 0.995
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            synthetic:flow_success:ratio5m{flow="checkout",env="prod",release="{{args.release}}"}
    - name: checkout-p95
      interval: 30s
      count: 10
      successCondition: result[0] < 2000
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            synthetic:checkout_p95_ms:5m{env="prod",release="{{args.release}}"}

Wire it into a canary Rollout:
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: shop-frontend
spec:
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 60 }
        - analysis:
            templates:
              - templateName: checkout-synthetic-gate
            args:
              - name: release
                valueFrom:
                  fieldRef: { fieldPath: metadata.labels['rollouts-pod-template-hash'] }
        - setWeight: 50
        - pause: { duration: 60 }
        - analysis:
            templates:
              - templateName: checkout-synthetic-gate
            args:
              - name: release
                valueFrom:
                  fieldRef: { fieldPath: metadata.labels['rollouts-pod-template-hash'] }

When the canary trips the synthetic gate, the rollback happens without a war room. I’ve seen this save a Friday afternoon more than once.
Triage without thrash: alerts that lead to answers
Paging on “synthetic failed” is noise. Page on user‑visible pain with links to answers.
- Alert on burn, gate on leading indicators: page when your checkout success SLO is burning fast; gate rollouts on synthetic success/latency thresholds.
- Enrich alerts: include env, flow, region, release, and a link to a trace search for synthetic=true AND release=<sha>.
- Runbooks that work: the alert should link to a short, versioned runbook with probable causes (IdP, payments, cache), dashboards, and the rollback command.
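Building the trace link inside the alerting pipeline keeps it identical everywhere. A tiny sketch, assuming a Tempo-style search endpoint (the URL shape matches the example below):

```typescript
// Build a deep link into trace search for one release's synthetic traffic.
// The query syntax here is an assumption; adapt to your tracing backend.
export function traceSearchUrl(base: string, release: string): string {
  const query = encodeURIComponent(`synthetic=true release=${release}`);
  return `${base}/search?query=${query}`;
}
```

Generate the link once at alert time and embed it in the page, the Slack message, and the runbook header, so on-call never hand-assembles a query at 3 a.m.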
Datadog/Grafana alert title example:
name: "[PROD][CHECKOUT][CANARY {{release}}] Synthetic gate failing (p95>2s or success<99.5%)"
message: |
Flow: checkout
Release: {{release}}
Trace search: https://tempo.example.com/search?query=synthetic%3Dtrue%20release%3D{{release}}
Runbook: https://internal.wiki/runbooks/checkout

Keep it boring: control flake and cost
The fastest way to lose trust in synthetics is flaky checks and surprise bills.
- Stable data: Rotate synthetic credentials with automation; ensure carts reset. Use feature flags to expose synthetic SKUs across environments.
- Timeouts and retries: Mirror client behavior—retry idempotent requests with jitter. Don’t retry non‑idempotent steps like “charge card”.
- Schedule smart: Run critical flows continuously at low VUs; run broader coverage on a 5–10 minute cadence. Burst during rollouts only.
- Cost control: Prefer open‑source runners (k6, Playwright) for high‑frequency checks; use SaaS synthetics (Checkly, Datadog) for multi‑region and managed uptime.
- Governance: PR‑review synthetic changes, tag by service owner, and delete dead flows monthly. Nothing ages like a forgotten login script.
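The retry guidance above can be encoded once and shared by every runner. A minimal sketch of the two decisions involved: which verbs are safe to retry, and how long to wait (full-jitter backoff; names are illustrative):

```typescript
// Only retry verbs that are idempotent by HTTP semantics; never
// replay a "charge card" POST because a synthetic timed out.
export function isIdempotent(method: string): boolean {
  return ['GET', 'HEAD', 'PUT', 'DELETE'].includes(method.toUpperCase());
}

// Full-jitter backoff: random delay up to a capped exponential ceiling,
// which spreads retries out instead of synchronizing a retry storm.
export function backoffMs(attempt: number, baseMs = 200, capMs = 5000): number {
  const ceiling = Math.min(baseMs * 2 ** attempt, capMs);
  return Math.random() * ceiling;
}
```

Wire these into the journey script’s step helper so retry behavior mirrors your real clients rather than masking the exact failure mode you’re trying to detect.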
Starter kit you can steal today
If you want a quick start without analysis paralysis:
- Write one Playwright test for your top revenue flow. Inject traceparent and synthetic=true.
- Add a k6 probe that hits your API steps with tags, pushing to Prometheus remote write.
- Create two recording rules: synthetic:flow_success:ratio5m and synthetic:<flow>_p95_ms:5m.
- Wire an Argo Rollouts AnalysisTemplate that requires success >= 99.5% and p95 < 2s during canary.
- Add a PagerDuty alert on SLO burn; send synthetic failures to Slack with trace/runbook links.
Minimal Prometheus Blackbox Exporter probe for a cheap HTTP canary (still useful for TLS/DNS regressions):
# blackbox.yaml
modules:
  https_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: ip4
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      tls_config:
        insecure_skip_verify: false

And a ServiceMonitor for it:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: shop-blackbox
spec:
  endpoints:
    - interval: 30s
      port: metrics
      path: /probe  # blackbox exporter serves probes on /probe
      params:
        module: [https_2xx]
        target: ["https://shop.example.com/health"]
  selector:
    matchLabels:
      app: blackbox-exporter

It won’t replace user‑journey synthetics, but it will catch cert rot and DNS drift before your users do.
The rule of thumb: If a synthetic can’t point you to a trace and a rollback in two clicks, it’s not done.
Key takeaways
- Uptime checks don’t catch real user pain. Model critical journeys (login, search, checkout) and track leading indicators like auth 4xx, retries, and tail latency.
- Tag and trace synthetic traffic. Inject `traceparent`, tag spans with `synthetic=true`, and make every failed step clickable to a trace.
- Export synthetic metrics to your TSDB and write SLO‑ish queries (success ratio, p95 step latency, retry rate) as rollout gates.
- Use Argo Rollouts (or Flagger) to pause/rollback automatically when synthetics or SLO burn degrade—no heroics required.
- Control flakiness: stable data, idempotent accounts, deterministic test runners, and noisy‑alert suppression with multi‑signal correlation.
- Make runbooks discoverable from alerts. Every alert title should encode env, flow, region, release, and link to dashboards/traces.
Implementation checklist
- Pick 3-5 revenue-critical journeys and write synthetics for each.
- Emit and store metrics for success ratio, tail latency, retries, and step errors with rich tags (env, region, release, canary).
- Inject `traceparent` and `synthetic=true` into requests; verify traces in Jaeger/Tempo/Datadog.
- Create Prometheus recording rules for success ratio and p95 by flow and release.
- Wire those rules into Argo Rollouts AnalysisTemplates as canary gates.
- Route alerts to PagerDuty/Slack with runbook links and example traces.
- Control flake: stable test data, retries with backoff, timeouts matching user patience, and daytime dry runs.
- Review weekly: delete dead checks, add new ones for fresh features, and tune SLO gates.
Questions we hear from teams
- Isn’t real user monitoring (RUM) enough?
- RUM tells you what happened to real users. It’s reactive and confounded by user behavior and ad blockers. Synthetics are proactive, deterministic, and can run during rollouts to catch issues before they hit real users. You need both: synthetics for gates and leading indicators; RUM for long‑tail analysis and cohort impact.
- Aren’t synthetics flaky and expensive?
- They are if you treat them like Selenium from 2010. Use Playwright/k6 with deterministic data, inject trace context, and run at low concurrency continuously with bursts during rollouts. Keep flows short, idempotent, and owned. Use open‑source runners for frequency and a SaaS for multi‑region coverage. Review and prune monthly.
- How do we avoid paging on noise?
- Page on SLO burn and user‑visible pain. For synthetics, require consecutive failures, use multi‑region quorum, and correlate with dependency error rates. Enrich alerts with runbook links and example traces so the on‑call can act in minutes, not after a Slack archeology dig.