The First 15 Minutes: Instrument Release Health to Catch Regressions Before Customers Do
You don’t need more dashboards. You need the right leading signals wired into your rollout controller so bad releases stop themselves.
Bad releases shouldn’t depend on heroics—they should fail to roll out.
The first 15 minutes decide the next 15 hours
Two years ago, a team asked us why customers were rage-refreshing during checkout. Their dashboards all said “green.” What wasn’t green: a quiet spike in `retry_rate` from the payment client as a new build rolled out. Retries hid the failures for a bit, saturated thread pools, and then p95 latency drifted up 35%. By the time the 5xx graph twitched, Twitter had already noticed. We rebuilt their release health around leading indicators and wired it into Argo Rollouts. The next time a regression showed up, the rollout stopped itself at 10%.
If your release health can’t stop a bad rollout in under 5 minutes, it’s not release health—just wall art.
You don’t need more charts. You need the few signals that predict incidents, tagged by release, and connected to automation that can pause, roll back, or flip a flag before customers ever see it.
Leading indicators that predict incidents (and the ones that don’t)
Stop watching vanity metrics like overall CPU or request count. We care about early warning signals that move before users feel pain:
- Latency distribution shifts (p95/p99): compare canary vs baseline. Small drifts (5–10%) often precede hard failures.
- Retry rate and timeouts: `http_client_retry_total`, `timeout_total`. Retries are a canary in the coal mine and they amplify load.
- Saturation: CPU throttling (`container_cpu_cfs_throttled_seconds_total`), thread pool queue length, Node event loop lag, goroutine growth.
- Dependency slowness: DB `slow_query_count`, cache hit rate drops, upstream 5xx/429/`x-envoy-ratelimited`.
- Queue lag: Kafka consumer lag or SQS age-of-oldest. Lag growth without a throughput increase is an early red flag.
- GC pauses and memory slope: sudden increases predict latency spikes and OOM in the next phase of rollout.
- Synthetic and dark traffic: smoke transactions and shadow reads catch obvious regressions before exposing real users.
- Early business leading signal: step drop in “add-to-cart → checkout” conversion on canary cohort. Don’t wait for revenue graphs.
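For the synthetic checks above, one low-effort option is probing a smoke endpoint with the Prometheus blackbox exporter before you ramp real users. A minimal sketch, assuming the exporter runs at `blackbox-exporter.monitoring:9115`, an `http_2xx` module is configured, and `https://payments-canary.internal/healthz` is a hypothetical canary smoke endpoint:

```yaml
scrape_configs:
  - job_name: canary-smoke
    metrics_path: /probe
    params:
      module: [http_2xx]                               # blackbox module expecting an HTTP 2xx
    static_configs:
      - targets:
          - https://payments-canary.internal/healthz   # hypothetical canary smoke URL
    relabel_configs:
      # The exporter probes the target passed via ?target=..., so move the address
      # into that parameter and point the actual scrape at the exporter itself.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.monitoring:9115
```

Alert on `probe_success == 0` or a rising `probe_duration_seconds` for this target and you have a cheap leading signal before any real user touches the canary.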
What not to rely on:
- Global error rates without cohorting by `service.version` and `endpoint`.
- A single p50 latency. It means nothing when the tails are melting.
- Dashboard averages across regions or instance types.
Define “normal” using a rolling baseline from the previous stable release and compare the canary’s deltas, not absolutes.
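One way to make those baseline comparisons cheap and consistent is to precompute the stable release’s ratios as Prometheus recording rules. A minimal sketch, assuming `http_requests_total` carries the `version` label from the next section; the rule names are illustrative:

```yaml
groups:
  - name: release-baseline
    rules:
      # Rolling 30m error ratio of the current stable release
      - record: job:error_ratio:stable_30m
        expr: |
          sum(rate(http_requests_total{status=~"5..",version="stable"}[30m]))
          /
          sum(rate(http_requests_total{version="stable"}[30m]))
      # Short-window error ratio of the canary, for delta checks
      - record: job:error_ratio:canary_5m
        expr: |
          sum(rate(http_requests_total{status=~"5..",version="canary"}[5m]))
          /
          sum(rate(http_requests_total{version="canary"}[5m]))
```

Golden queries and analysis templates can then compare `job:error_ratio:canary_5m > 2 * job:error_ratio:stable_30m` instead of re-evaluating the raw expressions everywhere.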
Tag everything with the release — traces, metrics, logs
If you can’t slice by release, you can’t prove causality. Tag it at the source:
- OpenTelemetry resource attributes: set once per process:
  OTEL_RESOURCE_ATTRIBUTES="service.name=payments,service.version=2025.10.03,git.sha=3f2c1d,deployment.environment=prod"
- Propagate the release across calls: add `baggage` or an `x-release` header and copy it in sidecars/gateways (`Envoy`, `NGINX`).
- Prometheus labels: surface `version` from your deployment spec and logs:
# Kubernetes deployment manifest excerpt
metadata:
  labels:
    app: payments
    version: "2025.10.03"
# Prometheus relabel (kubernetes_sd) to attach the pod label "version"
- action: replace
  source_labels: [__meta_kubernetes_pod_label_version]
  target_label: version
- Client and RUM: tag mobile/web errors with `release` (e.g., Sentry `release: 2.14.0+42`).
- Logs: include `service.version` and `git.sha` for quick grepping, and ensure your log pipeline (e.g., `Loki`, `Datadog Logs`) indexes them.
Now every query, trace waterfall, or log search can filter `version=2025.10.03` against `version=2025.09.27` and tell you if the release is the cause or just a bystander.
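If a service can’t be rebuilt to set that environment variable, you can also stamp the release centrally in an OpenTelemetry Collector. A minimal sketch, assuming an OTLP pipeline; the gateway endpoint is hypothetical and the version value would be templated in by CI:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  resource:
    attributes:
      - key: service.version
        value: "2025.10.03"        # injected per deployment by CI
        action: upsert
      - key: git.sha
        value: "3f2c1d"
        action: upsert
exporters:
  otlphttp:
    endpoint: https://otel-gateway.internal:4318   # hypothetical backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]
```

Add the same processor to the metrics and logs pipelines so all three signals carry the same `service.version`.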
Golden queries and fast-burn SLOs you can actually automate
You don’t need 100 alerts. You need 8–12 battle-tested queries the rollout controller can read. Examples using Prometheus:
- Error rate (canary vs baseline)
sum(rate(http_requests_total{status=~"5..",version="canary"}[5m]))
/
sum(rate(http_requests_total{version="canary"}[5m]))
>
2 * (
sum(rate(http_requests_total{status=~"5..",version="stable"}[30m]))
/
sum(rate(http_requests_total{version="stable"}[30m]))
)
- p95 latency delta (histogram)
# delta of canary p95 vs stable p95, as a ratio of the stable baseline
(
  histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{version="canary"}[5m])) by (le))
  - histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{version="stable"}[30m])) by (le))
) / histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{version="stable"}[30m])) by (le)) > 0.1
- Retry rate
sum(rate(http_client_retry_total{version="canary"}[5m])) > 0
and
sum(rate(http_client_retry_total{version="canary"}[5m]))
> 2 * sum(rate(http_client_retry_total{version="stable"}[30m]))
- Saturation and throttling
rate(container_cpu_cfs_throttled_seconds_total{pod=~"payments-.*-canary"}[5m]) > 0.2
- Queue lag growth
deriv(kafka_consumer_group_lag{group="payments",version="canary"}[5m]) > 100
- Fast-burn SLO: if your SLO is 99.5% success, the allowed error rate is 0.5%. Use burn rate multipliers to catch issues in minutes:
# 14x burn rate over the 5m short window (pair with a 1h long window in production to suppress blips)
(
1 - (sum(rate(success_total{version="canary"}[5m])) / sum(rate(requests_total{version="canary"}[5m])))
) > 14 * (1 - 0.995)
Datadog equivalent (monitor example):
avg(last_5m): (sum:service.http.errors{env:prod,version:canary}.as_count() / sum:service.http.requests{env:prod,version:canary}.as_count())
> 2 * (avg(last_30m): sum:service.http.errors{env:prod,version:stable}.as_count() / sum:service.http.requests{env:prod,version:stable}.as_count())
Honeycomb trick: run a BubbleUp on `duration_ms` with `where version = canary` to surface which fields (route, customer_tier, region) regress. Automate this by precomputing derived columns and exposing them via the Query API.
Keep these queries short-window (5–10 minutes), cohort by release, and compare against a rolling baseline so shifts show up before customers scream.
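To make the fast-burn check consumable by humans and automation alike, wrap it in a Prometheus alerting rule that Alertmanager and your rollout tooling can share. A minimal sketch, reusing the `success_total`/`requests_total` counters from the example above; the alert name and runbook URL are illustrative:

```yaml
groups:
  - name: release-health
    rules:
      - alert: CanaryFastBurn
        expr: |
          (
            1 - (
              sum(rate(success_total{version="canary"}[5m]))
              /
              sum(rate(requests_total{version="canary"}[5m]))
            )
          ) > 14 * (1 - 0.995)
        for: 2m                      # two consecutive evaluations to cut noise
        labels:
          severity: page
          service: payments
        annotations:
          summary: "Canary error budget burning 14x faster than the 99.5% SLO allows"
          runbook_url: https://runbooks.internal/payments/release-health   # hypothetical
```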
Wire telemetry into the rollout controller (pause, abort, or continue)
If it can’t act on telemetry, it’s just monitoring. Make your rollouts decision-driven.
- Argo Rollouts with Prometheus Analysis
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-release-health
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      # result is a vector; treat "no samples / no errors" as healthy
      successCondition: len(result) == 0 || result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{status=~"5..",version="canary"}[1m]))
            /
            sum(rate(http_requests_total{version="canary"}[1m]))
    - name: p95-delta
      interval: 1m
      count: 5
      # the query only returns a sample when canary p95 is >10% above stable
      successCondition: len(result) == 0
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            (
              histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{version="canary"}[1m])) by (le))
              -
              histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{version="stable"}[30m])) by (le))
            ) / histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket{version="stable"}[30m])) by (le)) > 0.1
Attach to your rollout steps:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: payments-release-health
        - setWeight: 25
        - analysis:
            templates:
              - templateName: payments-release-health
        - setWeight: 50
        - analysis:
            templates:
              - templateName: payments-release-health
- Flagger (Kubernetes + Prometheus)
apiVersion: flagger.app/v1beta1
kind: Canary
spec:
  analysis:
    interval: 1m
    threshold: 1
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: error-rate
        thresholdRange:
          max: 1
        interval: 1m
        query: |
          100 * (
            sum(rate(http_requests_total{status=~"5..",version="canary"}[1m]))
            /
            sum(rate(http_requests_total{version="canary"}[1m]))
          )
- Feature flags as a kill switch: `LaunchDarkly`/`Unleash` default variations to off for canary users. Hook alert webhooks to auto-toggle the flag for the new code path on regression (a minimal Alertmanager hookup is sketched at the end of this section).
- Spinnaker Kayenta: if you’re on Spinnaker, run Kayenta canary analysis against your metrics providers and stop the pipeline when the canary score drops below your threshold.
The point: make metrics first-class citizens in your rollout spec, not just things humans look at after the blast radius grows.
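For the feature-flag kill switch above, the webhook side can be as simple as routing the canary alert to a small internal service that calls your flag provider’s API. A minimal Alertmanager sketch, reusing the `CanaryFastBurn` alert sketched earlier; `flag-killswitch.internal` is a hypothetical service you’d own:

```yaml
route:
  receiver: default
  routes:
    - matchers:
        - alertname = "CanaryFastBurn"
      receiver: flag-killswitch
      continue: true                 # still fan out to Slack/paging via the default receiver
receivers:
  - name: default
    # your normal Slack/PagerDuty configs live here
  - name: flag-killswitch
    webhook_configs:
      - url: https://flag-killswitch.internal/hooks/alertmanager   # hypothetical flag toggler
        send_resolved: false
```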
Triage playbook: from alert to rollback in under 5 minutes
When the fast-burn SLO fires, don’t improvise. Execute the play.
- Freeze traffic: let `Argo Rollouts`/`Flagger` hold or roll back automatically. If manual, set the canary weight to 0 or flip the feature flag.
- Single source of truth: open the “Release Health” Grafana dashboard filtered by `service=payments`, `version=canary`. Link this exact slice in the alert.
- Check leading indicators first: retries, p95 delta, upstream 5xx, queue lag. If retries are more than 2x baseline, you likely have a dependency or throttling issue.
- Trace it: pull a couple of slow traces tagged with `service.version=canary`. Look for new DB calls, N+1 queries, misconfigured timeouts, or added hops (e.g., a new sidecar policy).
- Decide:
  - Consistent regression in leading indicators? Keep the rollback, open a ticket, attach traces.
  - Transient and recovers within 2–3 analysis intervals? Resume the rollout but watch the burn rate.
- Notify: the Slack alert posts to `#prod-releases` with a runbook link and buttons: `Rollback`, `Pause`, `Continue`. Page via PagerDuty only if the automated rollback fails.
- Annotate: create an incident record with `git.sha`, `service.version`, and the top suspect dimensions (route, region). Attach the timeline from your observability tool.
If this takes longer than 5 minutes, automate the slow step. Every minute is new users exposed.
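The Slack side of that play is mostly configuration. A minimal Alertmanager sketch; the deep links are hypothetical, and the buttons are plain links, so true one-click rollback needs a small bot or the `kubectl argo rollouts` plugin behind those URLs:

```yaml
receivers:
  - name: prod-releases-slack
    slack_configs:
      - channel: "#prod-releases"
        send_resolved: true
        title: "{{ .CommonAnnotations.summary }}"
        text: "service: {{ .CommonLabels.service }} | runbook: {{ .CommonAnnotations.runbook_url }}"
        actions:
          - type: button
            text: "Rollback"
            url: https://rollouts.internal/payments/abort      # hypothetical deep links
          - type: button
            text: "Pause"
            url: https://rollouts.internal/payments/pause
          - type: button
            text: "Continue"
            url: https://rollouts.internal/payments/promote
```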
What we see in the field (metrics and war stories)
- A fintech client swapped a TLS library, triggering a subtle server-side retry storm. `retry_rate` rose 3x with no 5xx change for 7 minutes. Argo paused at 10% and rolled back at 12 minutes. No customer tickets. They used to find these at 50% traffic with a 45-minute MTTR; after this setup, median rollback was 4 minutes and the incident rate dropped 38% quarter-over-quarter.
- An ecommerce platform saw a 12% drop in cache hit rate on canary due to a key change. p95 moved 8% before errors did. Flagger caught it at 25% weight; rollback took 3 minutes. Previously, this would have surfaced as a nightly revenue dip.
- A mobile team (React Native + Sentry) shipped a release that increased `JS Heap` by 20%. RUM tagged with `release: 2.14.0+42` showed startup p95 up 15%, and conversion on the canary cohort dipped 3%. The feature flag killed the code path at 2% of users before it hit 10%.
- With `Honeycomb`, we routinely find the “one route, one customer tier” issue in under 10 minutes with BubbleUp on canary vs stable. That’s where vanity dashboards fail: the outliers hide in the averages.
Common pitfalls and a short checklist
Pitfalls I’ve seen (and fixed):
- No release tags: if you can’t filter by `service.version`, you’re guessing.
- Only error rate: misses the slow-burn performance regressions that kill conversion before errors do.
- Static thresholds: use deltas vs stable baseline and percentiles.
- Long windows: 30m medians will never stop a rollout in time.
- Alert spam: one noisy alert desensitizes the team. Treat false positives like bugs and fix the query or the threshold.
- Ignoring cost: high-cardinality telemetry isn’t free. Sample traces smartly (tail-based sampling around the canary; a collector sketch follows this list), keep high-resolution metrics for the first 24–48h, and downsample after.
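A minimal OpenTelemetry Collector sketch for that tail-based sampling, assuming traces already carry `service.version`; the version value and percentages are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep every trace from the canary release (a small share of total traffic)
      - name: canary-release
        type: string_attribute
        string_attribute:
          key: service.version
          values: ["2025.10.03"]
      # Always keep errors and slow requests from any version
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      # Sample the remaining healthy stable traffic down to 5%
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Policies are ORed: a trace is kept if any policy matches, so canary traffic stays at full fidelity while stable traffic is downsampled.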
Quick checklist:
- Tag metrics, traces, and logs with `service.version`, `git.sha`, `deployment.environment`.
- Canary by percent and region; run synthetic checks and dark traffic first.
- Golden queries for p95/p99 delta, retry rate, saturation, queue lag, dependency slowness, and fast-burn SLO.
- Rollout automation via `Argo Rollouts` or `Flagger` plus a feature flag kill switch.
- One-click triage: Alert → Dashboard slice → Runbook → Rollback.
- Continuous tuning: review alerts weekly, delete or fix noisy ones.
If you want help making this boringly reliable, GitPlumbers has done this across Kubernetes, ECS, Spinnaker, and even crusty VM fleets. We’ll wire your telemetry into your release process so you sleep at night.
Key takeaways
- Track deltas by release, not absolutes: tag telemetry with `service.version` and compare canary vs baseline.
- Use leading indicators: latency distributions, retry rate, saturation, queue lag, cache hit rate, and dependency slowdowns.
- Automate rollouts with analysis: plug Prometheus/Honeycomb queries into `Argo Rollouts` or `Flagger` for canary checks and auto-rollback.
- Design fast-burn SLOs for the first 15 minutes and wire them to halt new traffic.
- Triage flow should be one click: Slack alert → dashboard slice by release → runbook → rollback/flag kill switch.
Implementation checklist
- Tag every metric, span, and log with `service.version`, `git.sha`, and `deployment.environment`.
- Define golden queries that compare canary vs baseline for latency, errors, retries, saturation, and queue lag.
- Implement fast-burn SLO alerts with short windows (5–15m) and high burn multipliers.
- Wire analysis into rollouts (`Argo Rollouts`/`Flagger`) and feature flags (e.g., `LaunchDarkly`).
- Automate triage: Slack alerts with deep links, incident auto-labeling, and runbook links.
- Run synthetic checks and dark traffic before ramping real users.
- Continuously prune noisy alerts; treat every false positive like a defect in your automation.
Questions we hear from teams
- What’s the minimum viable set of metrics for release health?
- Latency distribution (p95/p99), error rate, retry rate, saturation (CPU throttling/thread queues), and one dependency health signal (DB slow queries or cache hit rate). Add queue lag if you have async workers, plus a small synthetic transaction.
- How do we avoid false positives from small traffic canaries?
- Compare canary vs stable deltas and use short windows with multiple evaluation periods (e.g., 1m interval, 5 counts). Require both a directional change and a magnitude threshold. If traffic is extremely low, rely more on synthetic checks and dark traffic before ramping.
- Won’t tagging everything with `service.version` blow up our costs?
- Index only what you need. Metrics labels for `service.version` are cheap. Use tail-based sampling for traces that up-sample on errors/latency for the new version. Keep high-res data only during rollout windows, then downsample. Logs: index metadata, store bodies cold.
- We’re not on Kubernetes. Does this still work?
- Yes. ECS, Nomad, VMs—same pattern. Export `service.version` via environment, propagate headers, tag telemetry, and integrate with your deploy tool (Spinnaker, CodeDeploy, Octopus) using webhooks or canary analysis (Kayenta).
- How big should the canary be and how fast should we ramp?
- Start 5–10%, hold for at least two analysis intervals (5–10m), then ramp by 15–25% with checks after each step. Sensitive systems (payments, auth) should hold longer and require both technical and business signal checks (conversion, decline rates).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.