The 17-Service Release That Taught Us to Stop “Coordinating” and Start Automating
Multi-service releases don’t fail because engineers are dumb. They fail because the system is built on tribal knowledge, Slack pings, and “please don’t merge right now.” Here’s the automation pattern that actually moves your change failure rate, lead time, and recovery time.
When “just deploy it” turns into a change-failure factory
I’ve watched a lot of teams hit the same wall: they graduate from a single deployable to a constellation of services—api, worker, billing, notifications, frontend, plus a handful of “shared” libraries nobody owns. Releases start as a simple pipeline step and evolve into a Friday-night group chat.
The failure mode is painfully consistent:
- Someone merges a harmless-looking change to `orders-api`.
- It depends on a new field in Kafka or a backward-incompatible schema tweak.
- `payments-worker` is still on the old contract.
- The deploy is “coordinated” in Slack, which is not a control plane.
- You ship anyway because the sprint ends today.
That’s how you get a high change failure rate, longer lead time (because everyone starts batching changes “to be safe”), and brutal recovery time (because rollback is now a distributed negotiation).
Here’s what actually works: stop treating multi-service releases like a meeting, and start treating them like an artifact that your automation can reason about.
Pick north-star metrics that punish heroics
If your deployment automation doesn’t explicitly optimize these three, it’ll optimize for the wrong thing (usually “number of deploys”):
- Change failure rate: % of deployments causing incidents, rollbacks, hotfixes, or SLO violations.
- Lead time for changes: commit → production (and usable, not “deployed but broken behind flags”).
- Recovery time (MTTR/time to restore service): detection → mitigation → stable.
In practice, the best automation decisions are the ones that:
- Reduce blast radius (canaries, batching, per-service rollouts)
- Reduce ambiguity (pinned versions, explicit dependencies)
- Reduce human branching factor during incidents (one standard rollback path)
If your “release process” requires your best engineer to be awake, it’s not a process—it’s a liability.
At GitPlumbers, when we rescue release pipelines, we instrument these three metrics first. You can’t improve what you can’t see, and release work is notorious for hiding inside Slack and tribal knowledge.
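To make that concrete, here’s a minimal sketch of how those three numbers can fall out of a deploy-event log. The `deploys.json` format is hypothetical; the point is that every metric traces back to release metadata your pipeline already produces.

```bash
#!/usr/bin/env bash
# Sketch only: derive the three metrics from a deploy-event log.
# Assumes a hypothetical deploys.json, one entry per production deploy, e.g.:
#   {"commit_at": "2025-12-20T10:00:00Z", "deployed_at": "2025-12-20T14:00:00Z",
#    "failed": false, "restored_at": null}
set -euo pipefail

# Change failure rate: % of deploys flagged as failed (rollback, hotfix, SLO burn)
jq -r '"change failure rate: \(100 * ([.[] | select(.failed)] | length) / length)%"' deploys.json

# Lead time: average hours from commit to production
jq -r '"avg lead time: \([.[] | ((.deployed_at | fromdate) - (.commit_at | fromdate)) / 3600] | add / length) h"' deploys.json

# Recovery time: average minutes from a failed deploy to restored service
jq -r '[.[] | select(.failed) | ((.restored_at | fromdate) - (.deployed_at | fromdate)) / 60]
       | if length == 0 then "no failed deploys" else "avg time to restore: \(add / length) min" end' deploys.json
```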
The pattern: a release manifest + GitOps promotion + progressive delivery
The core move is simple: create a release manifest that is the source of truth for what constitutes a release across services.
Instead of “deploy whatever is on main,” you promote a manifest through environments. It pins:
- Service image digests (not tags)
- Helm chart versions / Kustomize overlays
- Feature flag toggles (or references to flag configs)
- Migration steps (and compatibility constraints)
- Verification steps and rollback strategy
A minimal release.yaml looks like this:
```yaml
apiVersion: release.gitplumbers.io/v1
kind: Release
metadata:
  name: checkout-2025-12-25.1
spec:
  environment: prod
  services:
    orders-api:
      image: ghcr.io/acme/orders-api@sha256:2f3c...
      chart: oci://ghcr.io/acme/charts/orders-api
      chartVersion: 1.42.0
      rollout:
        strategy: canary
        steps:
          - setWeight: 10
          - pause: { duration: 5m }
          - setWeight: 50
          - pause: { duration: 10m }
          - setWeight: 100
    payments-worker:
      image: ghcr.io/acme/payments-worker@sha256:9ab1...
      chart: oci://ghcr.io/acme/charts/payments-worker
      chartVersion: 3.8.4
  migrations:
    - name: orders-db-expand
      type: flyway
      target: "2025.12.25.1"
      constraint: backwardCompatible
  gates:
    - type: prometheus
      query: sum(rate(http_requests_total{job="orders-api",status=~"5.."}[5m]))
      threshold: "< 0.02"
    - type: synthetic
      check: checkout_happy_path
```

Then you promote that manifest using GitOps—ArgoCD is common, but Flux works too. The point is: the CD system reconciles desired state, and your “release” is a commit, not a button click.
This is where teams usually object: “But we have 60 services, this will be heavy.”
It’s the opposite. The manifest is how you stop shipping 60 independent mysteries.
Concrete automation: building, cutting, and promoting a multi-service release
A workable pipeline has three layers:
- Build layer: produce immutable artifacts (image digests), attach provenance (SBOM/signatures).
- Release-cut layer: generate and validate `release.yaml` (dependency checks, policy checks).
- Promotion layer: GitOps applies the manifest to staging/prod with progressive delivery.
Here’s a trimmed GitHub Actions example for the release-cut step:
```yaml
name: cut-release
on:
  workflow_dispatch:
    inputs:
      environment:
        required: true
        type: choice
        options: [staging, prod]
jobs:
  cut:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Install tools
        run: |
          curl -sSL https://github.com/mikefarah/yq/releases/download/v4.44.5/yq_linux_amd64 -o /usr/local/bin/yq
          chmod +x /usr/local/bin/yq
      - name: Generate release manifest
        run: |
          ./scripts/gen-release.sh --env ${{ inputs.environment }} > release.yaml
      - name: Policy checks
        run: |
          ./scripts/validate-release.sh release.yaml
      - name: Commit manifest
        run: |
          git config user.email "release-bot@acme.com"
          git config user.name "release-bot"
          git checkout -b release/${{ github.run_id }}
          mkdir -p releases/${{ inputs.environment }}
          cp release.yaml releases/${{ inputs.environment }}/release.yaml
          git add releases/${{ inputs.environment }}/release.yaml
          git commit -m "Cut release for ${{ inputs.environment }}"
          git push origin HEAD
```

What lives in `validate-release.sh` (the part that saves you) is typically:
- Verify every service references an image digest (reject `:latest` and floating tags); see the sketch after this list
- Ensure a migration plan exists if schema-affecting services changed
- Validate backward compatibility constraints (see next section)
- Confirm required SLO gates exist for tier-1 services
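The digest check, for example, is a few lines of `yq`; here’s a sketch assuming the `release.yaml` layout shown above:

```bash
# Sketch of the digest gate in validate-release.sh, assuming services are
# keyed by name under .spec.services as in the manifest above.
unpinned=$(yq '.spec.services[].image' release.yaml | grep -v '@sha256:' || true)
if [ -n "$unpinned" ]; then
  echo "ERROR: every image must be pinned by digest; found:"
  echo "$unpinned"
  exit 1
fi
```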
Then ArgoCD watches releases/prod/release.yaml and reconciles it into the cluster.
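A minimal sketch of that wiring, assuming a rendering layer (Helm/Kustomize or an Argo CD config plugin) that turns `release.yaml` into Kubernetes resources; the repo URL and names are illustrative:

```bash
# One Argo CD Application pointed at the promoted manifest path (sketch).
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-prod-release
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/release-manifests   # illustrative repo
    targetRevision: main
    path: releases/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
```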
If you want the “multi-service” part to not become a monolith, split ownership:
- A central release manifest ties versions together
- Each service team owns how their service deploys (chart, rollout strategy, health checks)
That’s how you scale without a release-engineering priesthood.
The thing everyone forgets: contracts, schema, and sequencing
Most multi-service incidents I’ve seen weren’t “Kubernetes flaked out.” They were contract breaks:
- REST/JSON payload changes without versioning
- Kafka/Avro schema updates without compatibility enforcement
- DB migrations that assume deploy order
The automation needs to force a safe pattern. The most boring one that works is expand/contract:
- Expand: add new fields/tables/columns in a backward-compatible way.
- Deploy services that can write both old and new.
- Contract: remove old fields after everything reads the new.
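As a purely illustrative example, expand and contract ship as two separate Flyway-style migrations, releases apart (file, table, and column names are made up):

```bash
mkdir -p sql

# Expand: additive and backward compatible; old code keeps working
cat > sql/V2025.12.25.1__expand_add_currency.sql <<'EOF'
ALTER TABLE orders ADD COLUMN currency TEXT;
EOF

# Contract: destructive; ships only after traffic confirms no old readers/writers remain
cat > sql/V2026.01.15.1__contract_drop_legacy_amount.sql <<'EOF'
ALTER TABLE orders DROP COLUMN legacy_amount;
EOF
```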
You can enforce some of this mechanically.
Example: gate a release if a Flyway migration is not marked compatible:
```bash
#!/usr/bin/env bash
set -euo pipefail

file="$1"

# Require every migration in a prod release to declare constraint: backwardCompatible
compat=$(yq '.spec.migrations[].constraint' "$file" | sort -u)

# Only fail if migrations exist and at least one of them is not backward compatible
if [ -n "$compat" ] && echo "$compat" | grep -vq 'backwardCompatible'; then
  echo "ERROR: prod release includes non-backward-compatible migration"
  exit 1
fi
```

For event contracts, teams often use Confluent Schema Registry compatibility modes (`BACKWARD` or `FULL`). Make it part of the gate:
- Query Schema Registry in CI
- Fail the release cut if compatibility would break existing consumers
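A sketch of that gate against the Schema Registry REST API; the subject name, registry URL, and schema file path are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Fail the release cut if a candidate schema would break existing consumers (sketch).
set -euo pipefail

SUBJECT="orders.events-value"             # assumed subject
REGISTRY="http://schema-registry:8081"    # assumed registry URL

# Wrap the candidate Avro schema as the JSON body the registry expects
payload=$(jq -Rs '{schema: .}' < schemas/orders_event.avsc)

compatible=$(curl -sS -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d "$payload" \
  "$REGISTRY/compatibility/subjects/$SUBJECT/versions/latest" | jq -r '.is_compatible')

if [ "$compatible" != "true" ]; then
  echo "ERROR: schema change for $SUBJECT is not compatible with existing consumers"
  exit 1
fi
```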
And for sequencing, don’t rely on “deploy service A then B” as human knowledge. Put it in the manifest:
- Migrations first (expand)
- Producers before consumers only when compatible
- Contract migrations only after traffic confirms no old readers
This is exactly where change failure rate gets won or lost.
Progressive delivery + automated rollback: where recovery time gets cut
If you’re still doing “all-at-once” deploys for tier-1 systems, you’re choosing higher MTTR.
Use progressive delivery (Argo Rollouts is the usual suspect) and make rollback a first-class command.
A Kubernetes Rollout with canary plus analysis looks like this:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders-api
spec:
  replicas: 20
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 300 }
        - analysis:
            templates:
              - templateName: orders-api-error-rate
        - setWeight: 50
        - pause: { duration: 600 }
        - setWeight: 100
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: ghcr.io/acme/orders-api@sha256:2f3c...
          ports:
            - containerPort: 8080
```

When the analysis fails, Rollouts aborts and (if configured) can automatically roll back to the stable ReplicaSet.
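For completeness, here’s a sketch of the `orders-api-error-rate` AnalysisTemplate the Rollout references; the Prometheus address and the 2% error-rate threshold are assumptions, not recommendations:

```bash
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: orders-api-error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.02   # assumed threshold
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090   # assumed address
          query: |
            sum(rate(http_requests_total{job="orders-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="orders-api"}[5m]))
EOF
```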
Operationally, this does two things:
- Lowers change failure rate by catching issues before 100% traffic
- Slashes recovery time because rollback isn’t a war room—it’s an automated state change
Your on-call runbook should be boring:
```bash
# See rollout status
kubectl argo rollouts get rollout orders-api

# Abort a bad canary
kubectl argo rollouts abort orders-api

# Roll back to last stable
kubectl argo rollouts undo orders-api
```

If your rollback requires rebuilding images or “finding the last good tag,” you’re going to have a bad night.
Repeatable checklists that scale past “everyone knows everything”
I’ve seen checklists dismissed as “process.” The teams that say that are usually the ones hemorrhaging reliability when headcount doubles.
The trick is to keep them short, enforceable, and tied to the three metrics.
Release-cut checklist (lead time + change failure rate)
- Confirm every service in the release is pinned by image digest
- Confirm each service declares:
  - healthcheck endpoint (`/healthz` and `/readyz`)
  - SLO gate query (or explicit exemption)
  - rollout strategy (`canary`, `blueGreen`, or `batch`)
- Confirm migrations are expand/contract and marked compatible
- Confirm feature flags are listed with default state and rollback plan
Deployment checklist (change failure rate)
- Progressive rollout enabled for tier-1 services
- Automated smoke/synthetic checks run against the canary
- Alert routing verified (PagerDuty/Slack channel not misconfigured)
- Error budget policy applied (don’t ship into a burning house)
Recovery checklist (recovery time)
- One-command rollback path exists (see the sketch after this list):
  - Git revert of manifest (preferred)
  - or `argo rollouts undo` for isolated service rollback
- DB rollback plan defined (or forward-fix plan if rollback isn’t possible)
- Incident timeline is captured automatically (deploy start/end, gate results, who approved)
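The “one command” in practice is usually just a Git operation; here’s a sketch (the SHA and branch are whatever cut the bad release):

```bash
# Roll back the whole release: revert the manifest commit and let GitOps
# reconcile back to the previously pinned versions.
git revert --no-edit <bad-release-commit-sha>
git push origin main

# Or roll back a single misbehaving service without touching the rest:
kubectl argo rollouts undo orders-api
```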
The scaling move: as the org grows, you don’t add meetings—you add policy-as-code.
- Use Open Policy Agent (OPA) / `conftest` to enforce manifest rules (sketch below)
- Use service catalog metadata (Backstage is common) to know what’s tier-1
- Standardize required SLO gates for tier-1, optional for tier-3
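Here’s a sketch of that policy layer with `conftest`; the Rego rules are illustrative and match the `release.yaml` structure shown earlier:

```bash
mkdir -p policy
cat > policy/release.rego <<'EOF'
package main

import rego.v1

# Every service must pin a chart version
deny contains msg if {
  some name
  svc := input.spec.services[name]
  not svc.chartVersion
  msg := sprintf("service %s must pin a chart version", [name])
}

# Prod releases must declare at least one SLO gate
deny contains msg if {
  input.spec.environment == "prod"
  count(input.spec.gates) == 0
  msg := "prod releases must declare at least one SLO gate"
}
EOF

conftest test release.yaml --policy policy/
```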
That’s how you keep lead time low without letting change failure rate creep up.
What this looks like when it’s working (and how GitPlumbers helps)
When teams implement this pattern well, the outcomes are predictable:
- Change failure rate drops because bad releases fail fast in canary or get blocked by contract/migration gates.
- Lead time improves because you stop batching “for safety” and go back to smaller, safer releases.
- Recovery time improves because rollback is a normal operation, not a bespoke incident.
Numbers vary, but the shape is consistent. On one engagement, we took a platform with ~30 services from “weekly coordinated releases” to daily promotions in ~6 weeks:
- Change failure rate: ~18% → ~6% (measured as deploys requiring rollback/hotfix)
- Lead time: ~5–7 days → < 24 hours for most services
- Recovery time: ~90 minutes median → ~20 minutes median (rollback + stable)
The hard part wasn’t Kubernetes or ArgoCD. It was deleting the tribal process and replacing it with a manifest, gates, and boring automation.
If you’re staring at a multi-service release that’s held together by Slack and bravery, GitPlumbers is the team that comes in, instruments the real metrics, and turns it into something repeatable. No silver bullets—just the stuff that actually survives on-call.
- Talk to us about release automation rescue: https://gitplumbers.com/services/release-engineering
- See how we stabilize legacy + AI-assisted codebases: https://gitplumbers.com/case-studies
Key takeaways
- Treat a multi-service release as a first-class artifact: a **release manifest** with pinned SHAs, configs, and gates.
- Optimize automation for **change failure rate**, **lead time**, and **recovery time**—not “how fast can we push buttons.”
- Use **GitOps promotion** (dev → staging → prod) so the deployment system is reproducible and auditable.
- Make progressive delivery and rollback boring: **health checks, SLO gates, and one-command revert**.
- Ship with checklists that scale: what’s manual at 5 engineers becomes enforced policy at 50.
Implementation checklist
- Release manifest created and reviewed (pinned SHAs, chart versions, config, migration plan)
- Automated preflight: dependency graph, schema compatibility, feature flag plan
- Build provenance: SBOM, image digest pinning, signed artifacts
- Progressive rollout configured (canary/batch) with SLO/error budget gates
- Automated post-deploy verification (synthetics + key metrics)
- Rollback plan tested (app rollback + DB rollback/forward plan)
- Audit trail captured (who approved, what changed, when, outcomes)
Questions we hear from teams
**Do we need a monorepo to do a release manifest?**
No. The manifest works with polyrepos too. The key is that the manifest pins immutable artifacts (image digests, chart versions), not branches. Your build system can publish artifacts from many repos; the release cut step assembles them into one promoted document.

**What if different services have different rollout needs?**
That’s normal. Put rollout strategy per service in the manifest (canary for tier-1, batch for tier-2, rolling for internal). Standardize the interface (health checks, gates), not the implementation.

**How do we handle database migrations safely?**
Default to expand/contract. Automate checks that prod releases only include backward-compatible changes unless explicitly approved. If rollback isn’t possible, document the forward-fix plan and make it part of the release gates.

**How do we measure change failure rate reliably?**
Pick an operational definition and automate it: deployments that trigger rollback, hotfix PRs, or SLO burn alerts within a time window (e.g., 24 hours). The exact definition matters less than consistency and tying it to release metadata.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
