The Payroll Run That Didn’t Page Us: Observability That Stopped a Cascade Before It Started

How a pragmatic OpenTelemetry + Prometheus overhaul caught a Kafka-induced latency spiral 22 minutes before customers felt it—and kept the Friday payroll run green.

“For the first time, our Friday run was boring. The alert told us exactly what to do, and it worked.” — Head of SRE, Fintech Client

The payroll spike that used to melt everything

A fintech client (Series C, SOC 2, PCI-in-scope) processed 4–6x normal traffic on the first Friday of the month. Historically, that payroll run was a slot machine: sometimes green, sometimes a 45-minute brownout. They ran a hybrid stack: a Java monolith handling auth and payouts, plus 18 Go/Node microservices on EKS, Kafka for events, Postgres (RDS), and a smattering of Lambda-based glue. Observability was a greatest-hits album of partial attempts: Datadog dashboards no one trusted, logs everywhere, and alerts tuned by survivor bias.

Why this mattered and the constraints

This wasn’t a vanity refactor. A missed payroll is a headline and a churn machine. Constraints:

  • Compliance: PII everywhere; no full-payload logs outside VPC.

  • Budget: Keep existing Datadog APM for billing for now; reduce overall spend, don’t increase it.

  • Org reality: Two SREs, five feature teams, and a product calendar that didn’t care about our perfect dashboard plan.

  • Tech debt: Mixed SDK versions, sidecar sprawl, HPA tuning that assumed CPU == performance (it didn’t).

What we changed: pragmatic observability that actually shipped

I’ve seen teams try to “do observability” with a six-month migration and a whitepaper. It fails. We did a four-week, critical-path-first makeover and shipped behind feature flags via ArgoCD.

What we installed and why:

  • OpenTelemetry for traces/metrics; Tempo for traces; Prometheus for metrics; Loki for logs; Grafana on top. Vendor-neutral, cheap to run, easy to swap.

  • SLOs on the externally visible APIs (payouts, auth) and the internal bottlenecks (Kafka consumer billing-worker).

  • Burn-rate alerts over multi-windows to catch fast and slow regressions without noise.

  • Autoscaling on Kafka lag via KEDA, not CPU.

  • Runbooks attached to alerts with exact PromQL and trace links. No guesswork at 2 a.m.

Instrumentation highlights:


// Node/Express payout API (TypeScript)
import express from 'express'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes as SRA } from '@opentelemetry/semantic-conventions'
import { registerInstrumentations } from '@opentelemetry/instrumentation'

const provider = new NodeTracerProvider({
  resource: new Resource({ [SRA.SERVICE_NAME]: 'payout-api', 'service.version': '1.12.3' })
})
// Batch spans and ship them to the in-cluster collector over OTLP/gRPC
provider.addSpanProcessor(new BatchSpanProcessor(
  new OTLPTraceExporter({ url: 'http://otel-collector:4317' })
))
provider.register()

registerInstrumentations({ instrumentations: [getNodeAutoInstrumentations()] })

const app = express()
app.post('/v1/payouts', async (req, res) => {
  // business logic; spans are auto-created for HTTP + DB calls
  res.status(202).json({ ok: true })
})
app.listen(8080)

OpenTelemetry Collector pipeline (Tempo + Prometheus metrics, with tail-based sampling so we keep errors and high-latency traces):


receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
processors:
  batch: {}
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 500 }
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  prometheus:
    endpoint: 0.0.0.0:9464
service:
  pipelines:
    traces:
      receivers: [otlp]
      # sample first, then batch what survives
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
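The two sampling policies OR together: a trace is kept if any span errored or the end-to-end duration crosses 500ms, and everything else is discarded after the 5-second decision wait. A minimal model of that decision (illustrative TypeScript, not the collector's actual implementation):

```typescript
// Illustrative model of the tail-sampling policies above: keep a trace if
// any span has ERROR status OR the whole trace took longer than 500ms.
interface Span {
  status: 'OK' | 'ERROR' | 'UNSET'
  startMs: number
  endMs: number
}

function keepTrace(spans: Span[], latencyThresholdMs = 500): boolean {
  const hasError = spans.some((s) => s.status === 'ERROR')
  const durationMs =
    Math.max(...spans.map((s) => s.endMs)) - Math.min(...spans.map((s) => s.startMs))
  return hasError || durationMs > latencyThresholdMs
}
```

Dropping the healthy, fast majority of traces is where most of the trace-storage savings came from.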

Prometheus scraping and SLO burn rules (GitOps’d via kube-prometheus-stack):


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payout-api
spec:
  selector: { matchLabels: { app: payout-api } }
  endpoints:
    - port: http-metrics
      interval: 15s

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payout-slo-burn
spec:
  groups:
    - name: availability.slo
      rules:
        # 99.9% SLO -> 0.1% error budget; each alert pairs a long and a
        # short window so pages fire fast but clear quickly once fixed
        - alert: PayoutAPIErrorBudgetBurnFast
          expr: |
            (
              sum(rate(http_requests_total{job="payout-api",status=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="payout-api"}[1h]))
            ) > (0.001 * 14.4)
            and
            (
              sum(rate(http_requests_total{job="payout-api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payout-api"}[5m]))
            ) > (0.001 * 14.4)
          for: 5m
          labels: { severity: page }
          annotations:
            summary: "Fast burn (1h/5m) on payout API"
            runbook: "https://runbooks/payout-api-slo"
        - alert: PayoutAPIErrorBudgetBurnSlow
          expr: |
            (
              sum(rate(http_requests_total{job="payout-api",status=~"5.."}[6h]))
              / sum(rate(http_requests_total{job="payout-api"}[6h]))
            ) > (0.001 * 6)
            and
            (
              sum(rate(http_requests_total{job="payout-api",status=~"5.."}[30m]))
              / sum(rate(http_requests_total{job="payout-api"}[30m]))
            ) > (0.001 * 6)
          for: 30m
          labels: { severity: warn }
          annotations:
            summary: "Slow burn (6h/30m) on payout API"
            runbook: "https://runbooks/payout-api-slo"
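The 14.4 and 6 multipliers aren't magic; they follow the SRE-workbook convention that a burn-rate factor is the fraction of the 30-day budget you're willing to spend in a given window, scaled by the window length. A quick sketch of the arithmetic (the 2%/1h and 5%/6h choices are the standard workbook defaults):

```typescript
// Burn-rate factor: how many times faster than "exactly on budget" the
// service is failing. Spending `budgetFraction` of a 30-day budget inside
// `windowHours` implies a factor of budgetFraction * (30 * 24) / windowHours.
const HOURS_PER_MONTH = 30 * 24 // 720

function burnRateFactor(budgetFraction: number, windowHours: number): number {
  return (budgetFraction * HOURS_PER_MONTH) / windowHours
}

// Fast page: 2% of the monthly budget gone in 1 hour  -> factor 14.4
// Slow warn: 5% of the monthly budget gone in 6 hours -> factor 6
```

Multiply the factor by the budget (0.001 for a 99.9% SLO) and you get the error-rate thresholds in the rules above.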

Kafka lag and autoscaling (scale the actual bottleneck):


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: billing-worker
spec:
  scaleTargetRef: { name: billing-worker }
  pollingInterval: 10
  cooldownPeriod: 300
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: billing-worker
        topic: payouts
        lagThreshold: "2000"
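KEDA's Kafka scaler roughly targets one replica per lagThreshold messages of total consumer-group lag, capped at the topic's partition count, since a consumer beyond that would sit idle. A simplified model of the sizing (illustrative, not KEDA's exact algorithm):

```typescript
// Simplified sketch of lag-based sizing: desired replicas grow with total
// consumer-group lag, but never past the partition count or configured max.
function desiredReplicas(
  totalLag: number,
  lagThreshold: number,
  partitions: number,
  maxReplicas: number
): number {
  const byLag = Math.ceil(totalLag / lagThreshold)
  return Math.max(1, Math.min(byLag, partitions, maxReplicas))
}
```

With the 2,000-message threshold above, a 20,000-message backlog on a 12-partition topic asks for 10 replicas; the cap is why adding partitions is sometimes the real fix.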

Logs without breaking the bank (route only what’s actionable to Loki):


scrape_configs:
  - job_name: kubernetes-pods
    pipeline_stages:
      - match:
          selector: '{app="payout-api"}'
          stages:
            # named capture group feeds the labels stage below
            - regex:
                expression: '.*(?P<level>ERROR|WARN).*'
            - labels:
                level:
      # drop everything below WARN so only actionable lines reach Loki
      - match:
          selector: '{app="payout-api"} !~ "ERROR|WARN"'
          action: drop

Alertmanager routes that page humans only when SLOs burn, everything else to Slack:


route:
  receiver: 'slack'
  routes:
    - match: { severity: 'page' }
      receiver: 'pagerduty'
      continue: false
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: ${PD_KEY}
  - name: 'slack'
    slack_configs:
      - channel: '#alerts'

The moment it paid off: stopping a cascade before customers felt it

Two Fridays later, 08:07 UTC, the new fast burn alert fired. The runbook link jumped straight into a Grafana dashboard with:

  • p95 latency creeping from 210ms to 480ms on POST /v1/payouts.

  • Kafka consumer lag for billing-worker climbing past 2,400 (we’d set 2,000).

  • Traces showed a payment-gateway dependency adding 300–400ms on calls tagged sandbox-us-east-1.

What happened next (and why it worked):

  1. The on-call SRE ran the runbook’s PromQL: billing-worker CPU was holding at 70% while queue lag kept climbing, the classic sign that more replicas, not bigger pods, were needed.

  2. KEDA scaled billing-worker from 4 -> 10 replicas within 90 seconds.

  3. A feature flag toggled a circuit breaker for the flaky payment-gateway sandbox calls (we pre-wired a fallback tokenization path).

  4. A synthetic check (Grafana cloud-prober/blackbox) confirmed external availability stayed 99.99%.

Net: p95 fell back under 250ms in 7 minutes. No customer impact. Error-budget graph never crossed the hard page threshold again that day.

For the skeptical: before this work, the same pattern took down payouts for ~45 minutes. The difference wasn’t magic; it was signal over noise and the ability to scale the actual bottleneck fast.

Results that matter (not vanity charts)

Measurable outcomes over 30 days:

  • Zero customer-facing incidents during payroll spikes; SLO (99.9% availability) maintained.

  • MTTR dropped from 68 minutes to 9 minutes. Root-cause time fell because traces pinned the slow hop immediately.

  • Alert noise down 63%; pages per engineer per week went from 3.2 to 1.1.

  • Logging cost down 35% via Loki routing + trace sampling (kept error/slow traces, dropped the rest).

  • p95 latency on payouts improved 28% during spikes due to autoscaling on lag, not CPU.

  • Change failure rate fell from 18% to 8% after we GitOps’d the dashboards, rules, and alert routes. No more “works on my Grafana.”

Playbook you can steal this quarter

If you’ve been burned by “let’s buy $BIGTOOL and call it observability,” here’s what actually works:

  • Start with one customer journey. For us: POST /v1/payouts -> Kafka -> billing-worker -> gateway. Draw the map, then instrument that line end-to-end.

  • Define one SLO per interface. Keep it boring: availability or latency. Publish it. Tie alerts to error budget burn, not CPU spikes.

  • Autoscale on real work. Use KEDA for Kafka, or RPS/queue depth for HTTP workers. CPU ≠ demand.

  • Standardize labels. service.name, env, version. Your queries and dashboards will stop lying.

  • Attach runbooks to alerts. Include PromQL and a “first mitigations” checklist. Don’t page a human without a play.

  • GitOps the whole thing. Dashboards, rules, routes, collectors. We used Terraform + ArgoCD; PRs reviewed by SRE + service owners.
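The “standardize labels” step is cheap to enforce in CI. A hedged sketch of the kind of guard we mean (the attribute names follow OTel semantic conventions; the check itself and its function name are ours, not an SDK API):

```typescript
// CI-style guard: every service must declare the resource attributes our
// dashboards and PromQL joins depend on. Plain TypeScript, no SDK required.
const REQUIRED_ATTRIBUTES = ['service.name', 'service.version', 'deployment.environment']

function missingAttributes(resource: Record<string, string>): string[] {
  return REQUIRED_ATTRIBUTES.filter((key) => !resource[key] || resource[key].trim() === '')
}
```

Fail the pipeline when the returned list is non-empty and the “works on my Grafana” class of bug mostly disappears.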

Useful PromQL snippets we baked into runbooks:


# Request error rate
sum(rate(http_requests_total{job="payout-api",status=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="payout-api"}[5m]))

# Kafka consumer lag (max over 5m to catch bursts)
sum by (consumergroup, topic) (
  max_over_time(kafka_consumergroup_lag{consumergroup="billing-worker"}[5m])
)

# Saturation (CPU) paired with latency: returns the CPU average only while
# p95 latency exceeds 400ms ("and on()" joins the two label-less vectors)
avg(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{pod=~"billing-worker.*"})
and on()
histogram_quantile(0.95, sum by (le) (rate(http_server_request_duration_seconds_bucket{job="payout-api"}[5m]))) > 0.4

What I’d do differently next time

We got a lot right, but hindsight is 20/10:

  • I would push golden signals into a single dashboard earlier. Teams still had pet views.

  • We should have chaos-tested the gateway flaky path with a 2% error injection before the event; we did it after.

  • I’d add continuous profiling (eBPF, Parca/pyroscope) for the hot services to shave 10–15% CPU during spikes.

  • We kept Datadog APM for billing too long; consolidating sooner would have simplified both the runbooks and the bill.

Bottom line: you don’t need a platform migration to prevent an outage. You need the right signals, measured against promises you’ve made (SLOs), and an on-ramp for humans to act quickly. That’s what we ship at GitPlumbers.


Key takeaways

  • Instrument the critical path first; don’t boil the ocean.
  • Alert on SLO burn, not raw CPU or pod restarts.
  • Trace + metrics + logs in one view cuts MTTR more than any single tool.
  • Autoscale on real bottlenecks (queue lag), not CPU.
  • GitOps the observability stack so it evolves safely with the app.

Implementation checklist

  • Map the business-critical journey and define a single SLO per interface.
  • Add OpenTelemetry SDKs to critical services; export traces to a vendor-neutral collector.
  • Create RED-based dashboards and multi-window SLO burn alerts.
  • Instrument Kafka lag and downstream saturation; autoscale on backlog, not CPU.
  • Consolidate logs with Loki; sample traces strategically to control cost.
  • Codify everything (PrometheusRules, ServiceMonitors, dashboards) and ship via GitOps.
  • Write runbooks that pair alerts with one-click queries and mitigation steps.

Questions we hear from teams

Do I have to migrate off my current APM to get these results?
No. We often start by layering OpenTelemetry + Prometheus alongside your existing APM. We instrument the critical path and route telemetry where it’s most actionable. You can phase out vendors on your timeline.
How long did this take end-to-end?
Four weeks to MVP: 2 weeks for instrumentation and collector rollout, 1 week for SLOs/alerts/dashboards, 1 week for runbooks and a game day. We scheduled the work so it didn’t block ongoing feature delivery.
Will this work if we’re not on Kubernetes?
Yes. The patterns (SLOs, burn-rate alerts, OTel traces, autoscaling on actual demand signals) apply on VMs, ECS, or serverless. The tooling changes slightly, the playbook doesn’t.
What about security and compliance for PII?
We scrub PII at the source using OTel processors, restrict egress from collectors, and keep payloads out of traces/logs. We’ve passed SOC 2 and PCI assessments with this setup at fintechs and healthtechs.
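“Scrub at the source” in practice means an attribute processor in the collector or a small redaction step before export. A minimal sketch of the redaction rule (the key patterns are examples; tune the denylist to your own data model):

```typescript
// Redact attributes whose keys look like PII before telemetry leaves the
// process. The key list is illustrative, not exhaustive.
const PII_KEYS = /(ssn|tax_id|account_number|email|phone|dob)/i

function scrubAttributes(attrs: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {}
  for (const [key, value] of Object.entries(attrs)) {
    out[key] = PII_KEYS.test(key) ? '[REDACTED]' : value
  }
  return out
}
```

Pair this with collector-level egress restrictions so even a misconfigured SDK can't leak payloads outside the VPC.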

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Fix your observability where it matters: book a 30-minute technical consult.
