The Synthetic Checks That Saved Our Canary: Leading Indicators Wired to Argo Rollouts

Stop shipping blind. Build synthetic monitors that mirror real user journeys, surface leading indicators, and auto-drive canary decisions before customers ever feel it.

“If your monitor can’t place an order, neither can your users. Measure the journey, not the server.”

The outage you didn’t see coming

If you’ve ever watched a Grafana dashboard stay green while Stripe conversions fell to zero, you know the pain. I’ve seen Nginx 200s and low CPU lull teams into complacency while a frontend bundle bloat pushed checkout to a 7s LCP on Android over 4G. Real users were gone long before the first Sev-1 page.

The fix wasn’t “more dashboards.” We built synthetic checks that walked the exact revenue path—sign-in, add-to-cart, checkout—every minute from multiple regions. We published those timings as Prometheus histograms, tied SLO burn alerts to runbooks, and wired Argo Rollouts to automatically abort bad canaries. MTTR dropped from 40m to under 8m. More importantly, MTTD went from “Twitter reports it” to 30s.

This is how to do it without boiling the ocean.

Measure what predicts pain, not what flatters

Vanity: CPU < 60%, requests/sec, “homepage up.” Your CFO doesn’t care.

Leading indicators that actually predict incidents:

  • Journey p95/p99 latency per critical flow (login, search, checkout) as histograms.
  • TTFB and LCP on synth browsers, not just server timers.
  • Auth and payment success rate (HTTP 200 != success).
  • 3rd‑party dependency timing and error rate (CDN, IdP, payments, feature flag SDKs).
  • Error budget burn for each journey (multi-window, multi-burn).
  • Saturation precursors: queue depth, circuit-breaker opens, retry storms.

Define SLOs on these SLIs. Example: “Checkout: 99% of synthetic runs have p95 < 3s and success rate ≥ 99.9%.” Do not hide behind averages.

Design synthetic checks that mirror reality

You don’t need 100 scripts. You need 3–5 that reflect revenue or retention paths.

  • Keep steps deterministic: fixed test user, seeded catalog, stable selectors.
  • Assert outcomes: “Order placed” text exists, analytics event fired.
  • Emit metrics inside the script. Avoid scraping logs for timings after the fact.
  • Run from at least 2 regions and 2 networks (cloud + last‑mile provider if you can).
  • Budget: finish in < 60s; fail fast on hangs; retry once max.
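
That last bullet is worth encoding once and reusing. Here's a minimal sketch of a budget wrapper (runWithBudget is a hypothetical helper, not part of the scripts below):

```typescript
// Hypothetical helper: run a journey with a hard timeout and at most one
// retry, so a hung check fails fast instead of eating the cadence budget.
export async function runWithBudget<T>(
  journey: () => Promise<T>,
  timeoutMs = 60_000,
  retries = 1,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      return await Promise.race([
        journey(),
        new Promise<never>((_, reject) => {
          timer = setTimeout(() => reject(new Error('journey timed out')), timeoutMs);
        }),
      ]);
    } catch (err) {
      // One retry max; after that, surface the failure to the runner.
      if (attempt >= retries) throw err;
    } finally {
      clearTimeout(timer); // don't leave the timeout pending after success
    }
  }
}
```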

Here’s a lightweight Playwright check that walks login -> add to cart -> checkout and exports timings to Prometheus via the Pushgateway.

// synthetics/checkout.spec.ts
import { test, expect } from '@playwright/test';

const PUSHGATEWAY = process.env.PUSHGATEWAY || 'http://pushgateway:9091';
const SITE_URL = process.env.SITE_URL || 'https://shop.example.com';

async function publishMetrics(journey: string, p: { durationMs: number, ttfbMs: number }) {
  const body = `# HELP synthetic_nav_duration_seconds End-to-end journey wall time\n` +
`# TYPE synthetic_nav_duration_seconds histogram\n` +
// rudimentary buckets; in prod, emit real buckets from k6 or bucketize here
`synthetic_nav_duration_seconds_bucket{journey="${journey}",le="1"} ${p.durationMs <= 1000 ? 1 : 0}\n` +
`synthetic_nav_duration_seconds_bucket{journey="${journey}",le="3"} ${p.durationMs <= 3000 ? 1 : 0}\n` +
`synthetic_nav_duration_seconds_bucket{journey="${journey}",le="5"} ${p.durationMs <= 5000 ? 1 : 0}\n` +
`synthetic_nav_duration_seconds_bucket{journey="${journey}",le="+Inf"} 1\n` +
`synthetic_nav_duration_seconds_sum{journey="${journey}"} ${p.durationMs/1000}\n` +
`synthetic_nav_duration_seconds_count{journey="${journey}"} 1\n` +
`synthetic_ttfb_seconds{journey="${journey}"} ${p.ttfbMs/1000}\n` +
`synthetic_success_total{journey="${journey}",outcome="good"} 1\n`;
  await fetch(`${PUSHGATEWAY}/metrics/job/synthetics`, { method: 'POST', body, headers: { 'Content-Type': 'text/plain' } });
}

test('checkout journey (synthetic)', async ({ page }) => {
  const t0 = Date.now();
  const resp = await page.goto(SITE_URL, { waitUntil: 'domcontentloaded' });
  // Response has no timing(); use request().timing(), whose fields are ms
  // offsets relative to startTime (-1 when unavailable).
  const timing = resp ? resp.request().timing() : null;
  const ttfbMs = timing && timing.responseStart > 0 ? timing.responseStart : 0;

  await page.getByRole('link', { name: 'Sign in' }).click();
  await page.getByLabel('Email').fill(process.env.SYNTH_USER || 'synthetic@test.local');
  await page.getByLabel('Password').fill(process.env.SYNTH_PASS || 'notsecret');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await page.getByText('Welcome').waitFor({ timeout: 10000 });

  await page.getByText('Add to cart').first().click();
  await page.getByRole('link', { name: 'Checkout' }).click();
  await page.getByRole('button', { name: 'Place order' }).click();
  await page.getByText('Order confirmed').waitFor({ timeout: 10000 });

  const durationMs = Date.now() - t0;
  await publishMetrics('checkout', { durationMs, ttfbMs });
  expect(durationMs).toBeLessThan(5000);
});
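
The hand-rolled buckets in publishMetrics are fine for a demo, but real dashboards want consistent boundaries. A sketch of a generic bucketizer you could swap in (histogramLines is a hypothetical helper; the default boundaries are illustrative):

```typescript
// Hypothetical helper: render a single observation as cumulative
// Prometheus histogram lines in exposition format.
export function histogramLines(
  name: string,
  labels: string, // e.g. 'journey="checkout"'
  valueSeconds: number,
  boundaries: number[] = [0.5, 1, 2, 3, 5, 10],
): string {
  const lines: string[] = [];
  for (const le of boundaries) {
    // Cumulative histogram: every bucket whose bound >= value counts the observation.
    lines.push(`${name}_bucket{${labels},le="${le}"} ${valueSeconds <= le ? 1 : 0}`);
  }
  lines.push(`${name}_bucket{${labels},le="+Inf"} 1`);
  lines.push(`${name}_sum{${labels}} ${valueSeconds}`);
  lines.push(`${name}_count{${labels}} 1`);
  return lines.join('\n') + '\n';
}
```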

If you prefer load-oriented synthetics with native histograms, k6 with Prometheus remote write is rock solid:

// synthetics/checkout.js
import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics'; // Trend lives in k6/metrics, not k6

export const options = {
  thresholds: {
    http_req_duration: ['p(99)<800'],
    journey_checkout: ['p(95)<3000'],
  },
};

const journey_checkout = new Trend('journey_checkout');

export default function () {
  const url = __ENV.SITE_URL || 'https://shop.example.com';
  const res = http.get(url);
  check(res, {
    'status is 200': (r) => r.status === 200,
    'ttfb<300ms': (r) => r.timings.waiting < 300,
  });
  journey_checkout.add(res.timings.duration);
}

# send metrics to Prometheus via experimental remote write
K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write \
k6 run -o experimental-prometheus-rw synthetics/checkout.js

Either way, get your synthetics publishing metrics your SRE stack understands.

From timings to SLOs: Prometheus rules that mean something

Dashboards are passive. SLOs with burn-rate alerts are active and predictive.

Example Prometheus rules for a checkout journey using histograms:

# prometheus/rules/synthetic-slo.yaml
groups:
- name: synthetics
  rules:
  - record: sli:checkout_p99
    expr: |
      histogram_quantile(0.99,
        sum(rate(synthetic_nav_duration_seconds_bucket{journey="checkout"}[5m])) by (le))

  - record: sli:checkout_fast_ratio
    expr: |
      sum(rate(synthetic_nav_duration_seconds_bucket{journey="checkout",le="3"}[5m]))
      /
      sum(rate(synthetic_nav_duration_seconds_count{journey="checkout"}[5m]))

  # Binary SLI: 1 when p99 < 3s, else 0. Use `bool` so the comparison emits
  # 0/1 instead of filtering the series away when it fails.
  - record: sli:checkout_good
    expr: |
      sli:checkout_p99 < bool 3

  - record: slo:checkout_error_budget_burn_5m
    expr: |
      (1 - avg_over_time(sli:checkout_good[5m])) / 0.01  # 99% SLO -> 1% budget

  - record: slo:checkout_error_budget_burn_1h
    expr: |
      (1 - avg_over_time(sli:checkout_good[1h])) / 0.01

  - alert: SyntheticCheckoutBudgetBurn
    expr: |
      slo:checkout_error_budget_burn_5m > 2 and slo:checkout_error_budget_burn_1h > 2
    for: 10m
    labels:
      severity: page
      team: web
      journey: checkout
    annotations:
      summary: 'Checkout synthetic SLO burning too fast (p99 >= 3s)'
      runbook_url: 'https://internal.wiki/runbooks/checkout-synthetic'

  • We compute p99 from the histogram and a binary SLI for “good.”
  • Two windows catch both fast spikes and slower regressions.
  • runbook_url goes straight into Slack/PagerDuty messages.

Alertmanager routes it where it belongs with context:

# alertmanager/config.yaml
route:
  receiver: sre-slack
  routes:
  - matchers:
    - severity="page"
    receiver: pagerduty
receivers:
- name: sre-slack
  slack_configs:
  - channel: '#oncall'
    title: '{{ .CommonLabels.alertname }}: {{ .CommonLabels.journey }}'
    text: |-
      {{ range .Alerts }}*{{ .Annotations.summary }}*
      p99 target: 3s
      Runbook: {{ .Annotations.runbook_url }}
      Labels: {{ .Labels }}{{ end }}
- name: pagerduty
  pagerduty_configs:
  - routing_key: ${PAGERDUTY_KEY}

Now when checkout gets slow in Virginia but not Frankfurt, the right humans get paged with the exact runbook.
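
Per-region triage only works if each runner stamps its region into the metrics. One way, assuming the Pushgateway setup from the Playwright script above, is a grouping label in the push path (pushUrl is a hypothetical helper):

```typescript
// Hypothetical helper: build a Pushgateway push URL with a region grouping
// label, so pushes from different runners land in separate metric groups
// instead of overwriting each other.
export function pushUrl(base: string, job: string, region: string): string {
  return `${base}/metrics/job/${encodeURIComponent(job)}/region/${encodeURIComponent(region)}`;
}

// e.g. POST the exposition-format body to:
// pushUrl('http://pushgateway:9091', 'synthetics', process.env.SYNTH_REGION || 'us-east')
```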

Tie telemetry to triage, not just charts

Synthetics should stitch into your triage muscle memory:

  • Correlate with traces: tag synthetic requests with userType=synthetic and a fixed header like X-Synthetic: checkout. Filter in Datadog/APM or Jaeger and pivot from alert to the slow span.
  • Link runbooks: your alert already carries runbook_url. Keep it versioned in Git next to the rule.
  • Enrich incidents: include last_good_build, region, and 3p_dependency_status labels so oncall doesn’t have to guess.
  • Noise discipline: if a synthetic is flaky twice in a month, fix or delete it. Flakes erode trust, and page fatigue kills.
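
For the first bullet, the marker can live in one constant shared by the synthetic and the server-side filter. A minimal sketch (names are hypothetical):

```typescript
// Hypothetical: one fixed marker header for all synthetic traffic. Hand it to
// Playwright's extraHTTPHeaders context option; filter on it server-side so
// synthetic runs never pollute real-user SLIs or analytics.
export const SYNTHETIC_HEADERS: Record<string, string> = {
  'X-Synthetic': 'checkout',
};

export function isSyntheticRequest(headers: Record<string, string>): boolean {
  // HTTP header names are case-insensitive, so compare lowercased keys.
  return Object.keys(headers).some((k) => k.toLowerCase() === 'x-synthetic');
}
```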

If you’re on OpenTelemetry, add a minimal span in your synthetic:

// add a span around the journey, export with OTLP/HTTP to your collector
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('synthetics');
await tracer.startActiveSpan('journey.checkout', async (span) => {
  span.setAttributes({ 'synthetic.journey': 'checkout', env: 'prod' });
  try {
    // ... run steps ...
    span.setAttribute('synthetic.p99.target', 3.0);
  } finally {
    span.end(); // always end the span, even if a step throws
  }
});

Now your alert links to a trace that shows the smoking gun (that new feature flag SDK call blocking on DNS…).

Close the loop: let synthetics drive rollouts

This is where it gets fun. Use your synthetic SLIs to gate canaries. If the journey regresses, the rollout pauses or rolls back—no human in the loop.

With Argo Rollouts:

# argo/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-latency
spec:
  args:
  - name: threshold
    value: "3"
  metrics:
  - name: synthetic-checkout-p99
    interval: 1m
    count: 5
    successCondition: result[0] < {{args.threshold}}
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99, sum(rate(synthetic_nav_duration_seconds_bucket{journey="checkout"}[1m])) by (le))

Attach it to your canary:

# argo/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-checkout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 60 }
      - analysis:
          templates:
          - templateName: checkout-latency
          args:
          - name: threshold
            value: "3"
      - setWeight: 50
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: checkout-latency

If your p99 blows past 3s during the canary, Argo pauses or aborts the rollout. We’ve set this up at a unicorn fintech; it prevented a Friday deploy from melting conversion when a “harmless” SVG optimization killed LCP.

Prefer Flagger on Istio/NGINX? Same concept—MetricTemplate against Prometheus and analysis checks on increments.
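
For reference, a Flagger MetricTemplate over the same query might look like this (a sketch; the namespace and metric name are assumptions):

```yaml
# Hypothetical Flagger equivalent of the Argo AnalysisTemplate above
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: checkout-p99
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    histogram_quantile(0.99,
      sum(rate(synthetic_nav_duration_seconds_bucket{journey="checkout"}[1m])) by (le))
```

Reference it from the Canary's analysis with a templateRef and a thresholdRange, and Flagger checks it on each traffic increment.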

A reference path that actually works

Here’s the pragmatic stack we deploy for clients when we need results in a week:

  • Synthetics: Playwright for multi-step UX; k6 for timing histograms and light load. Runs on GitHub Actions or a small K8s CronJob, geo-distributed via Actions runners or Fly.io machines.
  • Metrics: Prometheus + Alertmanager. Remote write to Grafana Cloud or your existing TSDB if needed.
  • Tracing: OpenTelemetry Collector shipping to Tempo/Jaeger/Datadog.
  • Automation: Argo Rollouts (or Flagger) gating canaries using synthetic SLIs.
  • Process: Tests, rules, and runbooks in the same repo. PRs touching checkout must update the synthetic and SLO if they change behavior.

Example GitHub Actions workflow to run synthetics from two regions (note: Actions cron schedules bottom out at every 5 minutes and can lag under load; use a K8s CronJob if you need true 1-minute cadence):

# .github/workflows/synthetics.yml
name: synthetics
on:
  schedule:
  - cron: '*/5 * * * *'  # GitHub Actions' minimum schedule interval is 5 minutes
jobs:
  checkout:
    strategy:
      matrix:
        include:
        - region: us
          pushgateway: https://pushgw-us.gitplumbers.net
        - region: eu
          pushgateway: https://pushgw-eu.gitplumbers.net
    runs-on: ubuntu-latest
    env:
      SITE_URL: https://shop.example.com
      PUSHGATEWAY: ${{ matrix.pushgateway }}
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with: { node-version: '20' }
    - run: npm ci
    - run: npx playwright install --with-deps
    - run: npx playwright test synthetics/checkout.spec.ts

Keep it boring. Boring ships.

What we learned and how to start Monday

A few scars and patterns from the field:

  • Own flake rate: alert if a synthetic fails >2% of runs with no correlated real-user impact. Your test is busted.
  • Budget the suite: 3–5 journeys, < 60s each, 1–5 minute cadence. More than that and you’ll spend life babysitting.
  • Tag everything: journey, region, env, build_sha. During an incident, tags save minutes. Minutes save money.
  • Don’t hide 3rd parties: measure your IdP, CDN, and payment providers with isolated synthetics so you can triage blame quickly.
  • Review like code: failing SLO? PR to change targets should be harder than the PR that caused the regression.

If you want to move the needle this week:

  1. Pick your top journey and write a Playwright script that asserts the actual business outcome.
  2. Publish p95/p99 timings and a binary “good/bad” to Prometheus.
  3. Add a 99% SLO and a dual-window burn-rate alert with a runbook link.
  4. Gate your next canary with an Argo AnalysisTemplate on that SLI.

You’ll catch the 4 p.m. incident before it wakes you at 2 a.m. I’ve watched this save seven-figure weekends.

Key takeaways

  • Ping checks aren’t user experience. Synthetics must mirror your revenue paths and emit real SLIs.
  • Use leading indicators: p99 journey latency, TTFB, auth success rates, third‑party dependency timing, and error budget burn.
  • Publish synthetics into Prometheus with histograms so you can compute SLOs and burn rates.
  • Wire alerts to runbooks and labels that route to the right humans (or bots) instantly.
  • Close the loop: feed synthetic SLIs into Argo Rollouts/Flagger to automatically pause or rollback canaries.
  • Keep the suite small, reliable, and fast—own it like production code.

Implementation checklist

  • Map top 3 user journeys to synthetic scripts with explicit success criteria.
  • Emit timing and outcome metrics as Prometheus histograms/counters per journey.
  • Define SLOs and multi-window burn-rate alerts for synthetic SLIs.
  • Tag alerts with team, service, runbook_url; route via Alertmanager.
  • Integrate Argo Rollouts/Flagger AnalysisTemplate against Prometheus queries.
  • Version everything (tests, rules, runbooks) in Git; review alongside app changes.
  • Continuously prune or repair flaky synthetics; enforce a strict timeout budget.

Questions we hear from teams

Why not just use real user monitoring (RUM)?
Use both. RUM tells you what happened to real users; synthetics tell you what will happen next from controlled locations with zero sampling bias. Synthetics are your early-warning system and a safe canary gate.
Won’t synthetic tests be flaky and noisy?
Only if you treat them like afterthoughts. Keep the suite tiny, stabilize selectors, seed data, enforce strict timeouts, and page on burn rate across two windows. We delete or fix any synthetic that flakes twice without correlated impact.
Can I do this with Datadog/New Relic instead of Prometheus?
Yes. Datadog Synthetics and NR Synthetics ship browser checks and SLOs out of the box. The key is the same: emit leading indicators (p95/p99, success rate), define burn-rate alerts, and wire webhook/Argo integrations to gate rollouts.
Is Pushgateway safe for production signals?
It’s fine for low-frequency, short-lived jobs like synthetics if you treat it as an ingestion edge. Do not rely on it for high-cardinality or high-volume metrics. If you can, prefer native remote write (k6) or a lightweight sidecar that pushes OTLP to a collector.
How do I avoid false positives from regional blips?
Run from at least two regions and require both to violate the SLO before paging (route-level inhibition). Keep a ‘control’ synthetic that hits a static asset to distinguish app vs. network/CDN issues.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about wiring synthetics into your rollouts.
Get our starter repo: Playwright + k6 + Prometheus + Argo.
