Stop Orchestrating Outages: Automating Multi‑Service Releases with GitOps, Rollouts, and Real Gates
Complex releases don’t have to be chaos. Treat the release as code, gate with real signals, and standardize the checklists. Your change failure rate, lead time, and recovery time will finally trend the right way.
“If you can’t ship on a Tuesday at 4 p.m., you don’t have automation — you have ritual.”
The Friday release that took down 14 services
I’ve watched a fintech try to ship a “simple” payments update across 14 services on a Friday. Terraform had drifted, ArgoCD force-synced the wrong namespace, a backward-incompatible DB migration slipped through, and Istio sent 20% of prod traffic into a black hole. MTTR was five hours. Change failure rate that month was 28%. Leadership swore off Friday releases; the real fix was release engineering.
Here’s what actually works when multi-service releases are your norm: treat the release as code, let GitOps do the heavy lifting, gate promotions with real signals, and make the checklists so dead simple anyone on rotation can run them at 2 a.m.
North-star metrics:
- Change failure rate (CFR): target <10% initially
- Lead time for change: from merged code to prod; target hours, not days
- Recovery time (MTTR): target minutes, not hours
If your automation moves those three, you’re winning.
Make the manifest the contract
Multi-service releases fail when the “plan” lives in Slack. You need a single, auditable source of truth: a release manifest stored in Git.
- It declares services, versions (immutable SHAs), dependencies, and gates.
- It’s the unit of change you promote between environments.
- Pipelines, dashboards, and ChatOps all read the same file.
Example `release.yaml`:
```yaml
version: 2025-10-19
release: R-2025.10.19
environment: staging
services:
  - name: payments-api
    image: ghcr.io/acme/payments-api@sha256:8a9c...f1
    chart: charts/payments-api
    dependsOn: [users-api, ledger]
    migrations:
      - tool: liquibase
        script: db/changelog/2025-10-19.xml
    flags:
      - key: payments.v2-routing
        state: off
  - name: users-api
    image: ghcr.io/acme/users-api@sha256:2f3b...c9
    chart: charts/users-api
  - name: ledger
    image: ghcr.io/acme/ledger@sha256:aa12...77
    chart: charts/ledger
policy:
  rolloutStrategy: canary
  waves: 3
  gates:
    - type: prometheus
      name: payments-5xx
      query: rate(http_requests_total{app="payments-api",status=~"5.."}[5m]) < 0.01
notes: "Introduce v2 payments routing (flagged)."
```
Put this in a release-metadata repo and make promotions PR-driven. Your lead time drops because everyone aligns on one artifact, and CFR drops because the graph and gates are explicit.
Orchestrate by dependency, not by repo
Stop trying to choreograph 20 pipelines by hand. Drive orchestration from the manifest and let GitOps (ArgoCD) reconcile reality. Use sync waves to respect dependencies.
- Wave 0: backward-compatible DB migrations, configmaps, CRDs
- Wave 1+: services in dependency order (`dependsOn`)
- Final wave: traffic routing, flag flips
ArgoCD “app-of-apps” with sync waves:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: release-r-2025-10-19
spec:
  project: default
  source:
    repoURL: https://github.com/acme/env-staging.git
    path: apps
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
---
# Example child app with wave annotation
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: users-api
  annotations:
    argocd.argoproj.io/sync-wave: "10"
spec:
  source:
    repoURL: https://github.com/acme/users-api-deploy.git
    path: charts/users-api
    helm:
      values: |
        image: ghcr.io/acme/users-api@sha256:2f3b...c9
  destination:
    server: https://kubernetes.default.svc
    namespace: users
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  annotations:
    argocd.argoproj.io/sync-wave: "20"
spec:
  source:
    repoURL: https://github.com/acme/payments-api-deploy.git
    path: charts/payments-api
    helm:
      values: |
        image: ghcr.io/acme/payments-api@sha256:8a9c...f1
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
```
Two keys that keep CFR low:
- Single writer to prod: only ArgoCD mutates cluster state. Humans submit PRs to env repos.
- Deterministic waves: use annotations and a generated app list from the manifest. No manual click-ops during a release.
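One way to keep waves deterministic is to derive each Application’s `sync-wave` value from the manifest’s `dependsOn` graph instead of assigning it by hand. A minimal sketch (the wave numbering scheme is an assumption; it mirrors the example apps above):
```python
# derive_waves.py - compute ArgoCD sync-wave values from release.yaml (illustrative)
import yaml  # PyYAML

manifest = yaml.safe_load(open("release.yaml"))
deps = {s["name"]: s.get("dependsOn", []) for s in manifest["services"]}

def wave(name, seen=()):
    """Wave 10 for leaf services, +10 per level of dependency depth; raises on cycles."""
    if name in seen:
        raise ValueError(f"dependency cycle involving {name}")
    if not deps.get(name):
        return 10
    return 10 + max(wave(d, seen + (name,)) for d in deps[name])

for svc in sorted(deps):
    # e.g. ledger=10, users-api=10, payments-api=20 for the manifest above
    print(f"{svc}: argocd.argoproj.io/sync-wave={wave(svc)}")
```
Feed that output into whatever templating step renders the child Applications, so a dependency change in the manifest automatically reorders the waves.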
Progressive delivery with metrics, not vibes
If your rollback plan is “hope,” your MTTR will always be ugly. Use Argo Rollouts for canaries and gate each step with Prometheus metrics. No approval clicks until the data says it’s safe.
Rollout for payments-api with an analysis template:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: payments-api-canary
      stableService: payments-api-stable
      steps:
        - setWeight: 5
        - pause: { duration: 120 }
        - analysis:
            templates:
              - templateName: payments-slo
            args:
              - name: service
                value: payments-api
        - setWeight: 25
        - pause: { duration: 180 }
        - analysis:
            templates:
              - templateName: payments-slo
            args:
              - name: service
                value: payments-api
        - setWeight: 50
        - pause: { duration: 300 }
        - analysis:
            templates:
              - templateName: payments-slo
            args:
              - name: service
                value: payments-api
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/payments-api@sha256:8a9c...f1
```
Analysis gating on error rate and latency budgets:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-slo
  namespace: payments
spec:
  args:
    - name: service   # supplied by the Rollout's analysis step
  metrics:
    - name: http-5xx-rate
      interval: 30s
      count: 10
      successCondition: result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{app="{{args.service}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="{{args.service}}"}[2m]))
    - name: p95-latency
      interval: 30s
      count: 10
      successCondition: result[0] < 0.35
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{app="{{args.service}}"}[2m])))
```
- Tie gates to your SLOs. If you’re burning error budget too fast, Rollouts aborts and reverts. MTTR drops because rollback is automated and pre-tested.
- Add Istio destination rules and circuit breakers so even failed canaries don’t crush backends.
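For the circuit-breaker point above, a minimal DestinationRule sketch for the canary host (the thresholds are illustrative; tune them against your real traffic):
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-api-canary
  namespace: payments
spec:
  host: payments-api-canary
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # cap the queue so a slow canary can't pile up requests
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5          # eject a misbehaving canary pod quickly
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```
Even while the analysis run is deciding whether to abort, ejected canary endpoints stop taking traffic, so a bad canary can’t drag down the backends behind it.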
If you can’t ship on a Tuesday at 4 p.m., you don’t have automation — you have ritual.
Build once, promote everywhere (with provenance)
The fastest way to inflate lead time is rebuilding per environment. Don’t. Build once, generate an SBOM, sign the image, and promote the same digest from dev → staging → prod.
- Use `cosign` for signing and policy enforcement.
- Store artifacts in an OCI registry and reference by digest in your manifest.
- Validate signatures at deploy time with an admission controller (e.g., Kyverno, OPA Gatekeeper) and aim for SLSA Level 2+.
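For the deploy-time validation, a minimal Kyverno policy sketch using keyless verification (the subject and issuer values assume GitHub Actions OIDC signing; swap in your own identity):
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-acme-images
spec:
  validationFailureAction: Enforce   # block unsigned images instead of just auditing
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences:
            - "ghcr.io/acme/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/acme/*"
                    issuer: "https://token.actions.githubusercontent.com"
```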
Signing and verifying:
```bash
# sign (keyless with Fulcio/Rekor)
COSIGN_EXPERIMENTAL=1 cosign sign ghcr.io/acme/payments-api@sha256:8a9c...f1

# verify in CI before promotion
COSIGN_EXPERIMENTAL=1 cosign verify ghcr.io/acme/payments-api@sha256:8a9c...f1
```
Pair this with Terraform for infra changes: plan once per env, apply in a gated wave (never alongside app canaries). CFR drops because infra and app changes stop stepping on each other.
Databases and flags: the two footguns
I’ve seen more outages from schemas and feature flags than anything else. Treat both as first-class citizens in the release.
- DB migrations: use expand/contract. Wave 0 applies additive changes; services deploy; Wave N removes old columns. Tools: `liquibase`, `flyway`, `prisma migrate`. (A minimal expand-phase changelog sketch follows this list.)
- Data backfills: run idempotent jobs with rate limits. Monitor write amplification and QPS.
- Feature flags: use `OpenFeature` or LaunchDarkly. Flags are in the manifest and flipped in waves, not YOLO’d in prod.
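For the expand phase, the Wave 0 change is purely additive so old and new service versions can run side by side. A minimal Liquibase changelog sketch (shown in Liquibase’s YAML format; table and column names are illustrative):
```yaml
# expand phase: add the new column with a safe default; the contract phase drops the old path later
databaseChangeLog:
  - changeSet:
      id: 2025-10-19-add-routing-version
      author: release-bot
      changes:
        - addColumn:
            tableName: payments
            columns:
              - column:
                  name: routing_version
                  type: varchar(16)
                  defaultValue: "v1"
                  constraints:
                    nullable: true
```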
Flag-driven routing example (Istio virtual service excerpt):
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts: ["payments.acme.internal"]
  http:
    - match:
        - headers:
            x-flag-paymentsv2:   # Istio header match keys must be lowercase
              exact: "on"
      route:
        - destination:
            host: payments-api-canary
          weight: 100
    - route:
        - destination:
            host: payments-api-stable
          weight: 100
```
Flags let you decouple deploy from release. That alone can cut CFR in half for UI/API changes.
The checklists that scale with headcount
Your pipeline is only as good as the runbooks glued to it. Keep them short, versioned, and executable via ChatOps.
Release checklist (condensed):
- Validate `release.yaml` (schema + dependency graph)
- Verify artifact signatures and SBOMs
- Pre-flight SLOs healthy; error budget non-zero; no Sev-2+ open
- Wave 0: apply additive DB migrations
- Wave 1+: ArgoCD sync by dependency; Argo Rollouts canaries with gates
- Observe metrics and logs; auto-rollback on breach
- Gradually flip feature flags (1%, 5%, 25%, 50%, 100%)
- Post-release: remove deprecated paths; schedule contract phase; update service catalog
Rollback checklist (condensed):
- Hit `rollback release R-2025.10.19` in ChatOps
- Confirm Rollouts reverted and ArgoCD shows stable sync
- Verify SLOs recovered; incident timeline captured; root cause noted in release PR
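Under the hood, the ChatOps command doesn’t need to be clever. A sketch of the steps it might run (repo, app, and namespace names are illustrative):
```bash
# 1. Put the env repo back on the last known-good manifest (GitOps stays the source of truth)
git -C env-staging revert --no-edit <release-merge-sha>
git -C env-staging push origin main

# 2. Stop any in-flight canary immediately rather than waiting for analysis to fail
kubectl argo rollouts abort payments-api -n payments

# 3. Confirm ArgoCD has reconciled back to the stable state before closing the incident
argocd app wait release-r-2025-10-19 --health --timeout 600
```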
Codify both checklists as markdown in the repo and link them from the manifest and your on-call runbook. Recovery time improves because there’s no decision fatigue at 2 a.m.
What good looks like in 90 days
At a B2B SaaS we worked with, releases spanned 11 services, 3 DBs, and 2 clusters. We implemented a manifest, ArgoCD apps with waves, Rollouts with Prometheus gates, and signed artifacts. We also replaced tribal knowledge with runbooks and checklists.
Results after 90 days:
- CFR: 22% → 6%
- Lead time: 2–3 days → 2–4 hours (merge to prod)
- MTTR: ~70 minutes → <15 minutes (automated rollback)
The surprise benefit: fewer meetings. The manifest and dashboards answered “what’s shipping?” without a status call. Engineering managers reclaimed ~4 hours/week.
If your release train still jumps the tracks, don’t add more people to push it. Fix the tracks, the schedule, and the signals. That’s release engineering. And that’s the work we do at GitPlumbers when the stakes are real.
Key takeaways
- Treat the release as code with a versioned manifest that encodes services, dependencies, and gates.
- Drive the pipeline off GitOps. Promote artifacts; don’t rebuild per environment.
- Use progressive delivery with metric-based gates to cut change failure rate and recovery time.
- Standardize checklists and runbooks as code so teams scale without reinventing process.
- Build once, sign, and verify provenance across environments to control blast radius and lead time.
Implementation checklist
- Define a versioned `release.yaml` manifest (services, versions, dependencies, gates).
- Build once and sign artifacts (e.g., `cosign`) and generate SBOMs.
- Pre-flight: schema drift check, dependency graph validation, SLO status, error budget.
- Wave 0: run backward-compatible DB migrations (expand) and dry-run Helm charts.
- Apply ArgoCD sync waves per dependency; enable Argo Rollouts canary steps.
- Gate with metrics (Prometheus) and error budgets; auto-rollback on breach.
- Flip feature flags gradually; monitor user and infra health.
- Document release notes and evidence; update service catalog; schedule contract cleanup (contract phase).
Questions we hear from teams
- Can we do this without ArgoCD/Argo Rollouts?
- Yes. The patterns matter more than the tools. You can do GitOps with Flux, progressive delivery with Flagger, or use Spinnaker/GitLab for orchestration. Keep the release manifest, immutable artifacts, metric gates, and checklists. Those cut CFR/MTTR regardless of vendor.
- What if we have a monolith plus a few services?
- Great. Start by putting the monolith and services into the same release manifest and apply canaries where it makes sense (e.g., edge services). Use feature flags to de-risk monolith releases. The goal is the same: one contract, gated promotions, automated rollback.
- How do we handle cross‑DB or cross‑region changes?
- Split into waves per data plane. Run expand/contract migrations first, then ship app changes with canaries. For multi‑region, promote per region using the same manifest; stagger by 30–60 minutes and treat the first region as a canary. Keep read/write routing and replication lag in your gates.
- Isn’t this overkill for a small team?
- Start small: a manifest, ArgoCD/Flux for GitOps, one Rollout with a Prometheus gate, and a two-page checklist. You’ll still reduce CFR and MTTR. The same patterns scale when you add teams and services.
- How do we measure success beyond CFR, lead time, and MTTR?
- Track deployment frequency, error budget burn rate, and percent of automated rollbacks vs. manual. Also measure time spent in status meetings. If the manifest and dashboards cut meetings, your process is working.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
