The API Versioning Plan That Survives Real Clients (and Avoids a Breaking-Change Fire Drill)
A pragmatic, metrics-driven playbook for evolving APIs without detonating your biggest customers—or your on-call rotation.
The moment you realize “we can’t change that field” is your new product requirement
I’ve watched teams ship a “minor” response shape tweak on a Friday, only to spend the weekend learning which enterprise customer is still on a Java 8 SDK pinned to a 2019 model class. The postmortem always sounds the same: “But we bumped the version in the docs.”
Versioning isn’t documentation. Versioning is an operational contract. If you don’t treat it like one—with gates, telemetry, and a deprecation runway—you’ll end up with shadow clients, frozen schemas, and a permanent tax on delivery.
This guide assumes you already have production consumers and you want an approach that:
- Keeps backward compatibility as the default
- Makes breaking changes rare, explicit, and measurable
- Gives you proof points (dashboards + CI checks) that you’re not guessing
If you’re already in “v1 forever” hell (or dealing with AI-generated endpoints that don’t match the docs), GitPlumbers typically starts by building a compatibility test harness and an adoption dashboard before touching any production routing.
Pick a compatibility policy first (or you’ll bikeshed URI vs headers forever)
Before you argue about `/v2` paths vs `Accept:` headers, write down what your org considers breaking. If you skip this step, every PR becomes a debate.
A rubric that actually works in practice:
Breaking changes (require new major version or parallel endpoint):
- Removing an endpoint, field, or enum value
- Changing a field type (e.g., `string` → `object`)
- Changing semantics (e.g., `status=PAID` no longer means settled)
- Tightening validation (previously accepted payloads now rejected)
- Changing default sorting/pagination behavior
Non-breaking changes (allowed in-place):
- Adding optional fields
- Adding new endpoints
- Adding enum values if clients are required to ignore unknowns
- Adding new query params that default to old behavior
Checkpoint (policy):
- 1-pager: “What counts as breaking” approved by API owners + on-call lead
- Client guideline: “Ignore unknown fields; treat unknown enum values as `UNKNOWN`”
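That client guideline is cheap to implement. A minimal sketch in Python (field names and the `Order` shape are illustrative):

```python
from dataclasses import dataclass

KNOWN_STATUSES = {"PENDING", "PAID", "SHIPPED"}

@dataclass
class Order:
    id: str
    status: str

def parse_order(payload: dict) -> Order:
    """Tolerant reader: ignore unknown fields, map unknown enum values to UNKNOWN."""
    status = payload.get("status", "UNKNOWN")
    if status not in KNOWN_STATUSES:
        # Forward-compatible: a new server-side enum value won't crash this client.
        status = "UNKNOWN"
    return Order(id=payload["id"], status=status)

# A newer payload with extra fields and a new enum value still parses:
order = parse_order({"id": "o_1", "status": "REFUNDED", "riskScore": 0.2})
```

Clients written this way are the reason additive evolution works at all: the server can grow without coordinating every consumer.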
Metric to track:
- Breaking-change rate: `# of breaking changes / month` (goal: trending to near-zero)
Tooling suggestion:
- Use `Spectral` to enforce style + compatibility conventions at PR time.

```yaml
# .spectral.yaml
extends: ["spectral:oas"]
rules:
  no-unknown-enum-without-fallback:
    description: "Enums must include an UNKNOWN value for forward compatibility"
    given: "$.components.schemas..[?(@.enum)]"
    then:
      field: "enum"
      function: "schema"
      functionOptions:
        schema:
          type: array
          contains:
            const: "UNKNOWN"
```

Choose a versioning mechanism you can enforce at the edge
Here’s what I’ve seen “work in slides” vs “work at 2am”:
Option A: Version in the path (/v1/...)
Pros:
- Visible, cache-friendly, simple routing
- Easy to run dual-stack services
Cons:
- Versions leak into client code and bookmarks forever
Option B: Version in headers (Accept / custom header)
Common pattern:
```
Accept: application/vnd.acme.orders+json;version=2
```
Pros:
- Cleaner URLs, aligns with content negotiation
Cons:
- Harder to debug with curl/browser
- Some proxies and SDKs mangle headers
Option C: “No versions, only compatibility”
This is the “Stripe-style” dream. It can work if you have:
- A strict compatibility policy
- Strong client update control
- Heavy contract testing
Most orgs don’t have that maturity on day one.
What actually works for most teams:
- Major versions in the path (`/v1`, `/v2`) for routing and operations
- Minor evolution via additive changes within a major version
Checkpoint (decision):
- One standard across the org (don’t let each team invent their own)
- Gateway routing supports running `v1` and `v2` simultaneously
Example: NGINX routing with explicit upstreams:

```nginx
location /v1/ {
    proxy_pass http://orders-v1/;
}
location /v2/ {
    proxy_pass http://orders-v2/;
}
```

Metric to track:
- Traffic share by version (you can’t deprecate what you can’t measure)
Design for additive change (and stop breaking clients with “small” refactors)
Most “accidental breaking changes” come from the same handful of mistakes:
- Renaming JSON fields because “it reads better”
- Tightening validation without a migration path
- Reusing fields with new semantics
Concrete tactics that keep you shipping:
- Never rename a field in place. Add the new field, keep the old, deprecate later.
- Prefer new endpoints over overloading semantics. `/v1/orders/{id}/cancel` beats `status=CANCELLED` with hidden rules.
- Treat enums as open sets. Clients must tolerate unknown values.
- Make pagination stable. Cursor-based pagination is more version-resistant than offset.
- Use idempotency keys for write APIs. Backward compatibility includes retries.
OpenAPI example: additive field with clear deprecation markers:

```yaml
openapi: 3.1.0
info:
  title: Orders API
  version: "1.7.0"
paths:
  /v1/orders/{id}:
    get:
      responses:
        "200":
          description: OK
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Order"
components:
  schemas:
    Order:
      type: object
      required: [id, status]
      properties:
        id:
          type: string
        status:
          type: string
          enum: [PENDING, PAID, SHIPPED, UNKNOWN]
        totalCents:
          type: integer
          deprecated: true
          description: "Use totalMoney instead. Will be removed in v2."
        totalMoney:
          type: object
          properties:
            amount:
              type: string
            currency:
              type: string
```

Checkpoint (API review):
- Every change classified: additive vs breaking
- Every deprecation includes: replacement, target removal version, and date
Metrics to track:
- Client-impacting error rate: increase in `4xx` after deploy, segmented by version
- p95 latency by version (v2 shouldn’t be “new and slower”)
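On the consumer side, the `totalCents` → `totalMoney` migration above is a read-both pattern: prefer the new field, fall back to the deprecated one, so the same client build works against old and new servers. A sketch (the USD fallback is an assumption about the legacy field's semantics):

```python
from decimal import Decimal

def order_total(payload: dict) -> tuple[Decimal, str]:
    """Prefer the new totalMoney field; fall back to the deprecated totalCents."""
    money = payload.get("totalMoney")
    if money is not None:
        return Decimal(money["amount"]), money["currency"]
    # Assumption for illustration: legacy integer-cent totals were always USD.
    return Decimal(payload["totalCents"]) / 100, "USD"
```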
Put compatibility gates in CI: spec diff + contract tests + fuzz
If your compatibility strategy relies on humans noticing a subtle change in a 2,000-line OpenAPI file… I’ve seen that fail. Repeatedly.
A CI pipeline that catches the real problems:
1) OpenAPI breaking-change detection
Use oasdiff (or openapi-diff) to block breaking changes on main for a given major version.
```bash
# compare base branch spec to PR spec; exit non-zero on breaking changes
oasdiff breaking ./openapi-base.yaml ./openapi-pr.yaml --fail-on ERR
```

2) Lint conventions (Spectral)
```bash
npx @stoplight/spectral-cli lint openapi.yaml
```

3) Consumer-driven contract tests (Pact)
If you have top-tier consumers (mobile app, partner integrations, internal platform SDKs), treat their expectations as first-class.
- Producers publish pacts
- Consumers verify against producer builds
- Breaking changes get caught before merge
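If standing up a Pact broker is too much on day one, start with the core idea: a consumer-owned expectation that the producer's build must satisfy. A deliberately minimal sketch (contract shape and field names are illustrative, not Pact's API):

```python
def verify_contract(response: dict, consumer_needs: dict) -> list[str]:
    """Return violations: fields the consumer relies on that are missing
    or have the wrong JSON type in the producer's response."""
    violations = []
    for field, expected_type in consumer_needs.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations

# The mobile app's expectations of GET /v1/orders/{id} (illustrative):
MOBILE_CONTRACT = {"id": str, "status": str, "totalCents": int}
```

Run this in the producer's CI against a real response fixture and you already catch the two most common breakages (removed fields, type changes) before merge; Pact adds the broker, versioned pacts, and request matching on top.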
4) Runtime property testing (Schemathesis)
This catches the “docs say one thing, implementation does another” drift.
```bash
schemathesis run http://localhost:8080/openapi.yaml \
  --checks all \
  --hypothesis-max-examples 200
```

Checkpoint (CI):
- PR cannot merge if a breaking diff is detected for the current major version
- Pact verification required for top N consumers
- Schemathesis run in nightly (or per-release) against staging
Metrics to track:
- Escaped breaking changes: breaking diffs that reached production (goal: 0)
- Spec/impl drift incidents per quarter
Run versions in parallel, route at the gateway, and migrate with telemetry
The cleanest migrations I’ve seen look boring:
- `v2` exists alongside `v1`
- Traffic shifts gradually
- Old version is shut off on purpose, not by accident
Step-by-step migration plan
- Ship v2 behind the gateway (no public announcement yet)
- Canary internal clients (or a friendly partner)
- Measure SLOs per version (error rate, latency, saturation)
- Publish a migration guide + SDK updates
- Add deprecation headers on v1
- Throttle new signups to v1 (soft pressure)
- Sunset v1 when adoption hits your threshold
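Step 5 should be a one-line concern in middleware, not per-endpoint code. A framework-agnostic sketch over plain header dicts (the path check and constants are assumptions about your setup):

```python
SUNSET = "Wed, 31 Dec 2026 23:59:59 GMT"
MIGRATION_GUIDE = "https://api.acme.com/docs/migrate-to-v2"

def add_deprecation_headers(path: str, headers: dict) -> dict:
    """Stamp v1 responses with Deprecation/Sunset/Link so SDKs and
    log scrapers can detect the deprecated surface mechanically."""
    if path.startswith("/v1/"):
        headers["Deprecation"] = "true"
        headers["Sunset"] = SUNSET
        headers["Link"] = f'<{MIGRATION_GUIDE}>; rel="deprecation"'
    return headers
```

Doing it centrally means you can't forget an endpoint, and the headers become a machine-readable migration signal for clients that log them.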
Deprecation headers example:
```http
HTTP/1.1 200 OK
Deprecation: true
Sunset: Wed, 31 Dec 2026 23:59:59 GMT
Link: <https://api.acme.com/docs/migrate-to-v2>; rel="deprecation"
```

Gateway routing example (Envoy) for explicit version prefixes:

```yaml
virtual_hosts:
  - name: orders
    domains: ["api.acme.com"]
    routes:
      - match: { prefix: "/v1/orders" }
        route: { cluster: orders_v1 }
      - match: { prefix: "/v2/orders" }
        route: { cluster: orders_v2 }
```

Checkpoint (operational readiness):
- Separate dashboards and alerts per version
- Separate rollbacks per version (don’t tie v1/v2 to the same deploy)
Metrics to track (minimum viable):
- Traffic share by version (requests/min)
- 4xx and 5xx rate by version
- p95 latency by version
- Adoption slope: % migrated per week
- Deprecation burn-down: # clients remaining on v1 (tracked via API keys or auth claims)
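The adoption math is simple enough to keep honest in code. A sketch over per-version request counters (where the counts come from is an assumption — Prometheus, gateway logs, or usage plans all work):

```python
def traffic_share(counts: dict[str, int]) -> dict[str, float]:
    """Requests per version -> each version's share of total traffic."""
    total = sum(counts.values()) or 1
    return {version: n / total for version, n in counts.items()}

def ready_to_sunset(counts: dict[str, int], old: str = "v1", threshold: float = 0.01) -> bool:
    """Gate the shutdown decision on measured share, not gut feel."""
    return traffic_share(counts).get(old, 0.0) < threshold
```

Wire the same threshold into an alert so the sunset date is a dashboard fact, not a meeting debate.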
Tooling suggestions:
- `Prometheus` + `Grafana` dashboards with a `version` label
- `OpenTelemetry` traces with `http.route` including `/v1` vs `/v2`
- `Kong` / `Apigee` / `AWS API Gateway` usage plans to identify top laggards
Deprecation is a product: set budgets, dates, and escalation paths
If you don’t put constraints around old versions, you end up supporting them forever. I’ve seen “temporary” v1 endpoints still running during a cloud migration five years later because nobody owned the shutdown.
A deprecation process that doesn’t rely on vibes:
- Define support windows
- Example: “Major versions supported for 24 months; security fixes for 30 months.”
- Require explicit sunset dates for new majors
- Publish customer comms templates (email + docs + changelog)
- Create an escalation ladder
- 90 days: warn
- 30 days: targeted outreach to top consumers
- 7 days: require exception approval to stay on old version
- Budget migration engineering time
- If you don’t fund migration help, you’re choosing permanent legacy.
Checkpoint (governance):
- A single owner (role, not a person) for version lifecycle
- Quarterly review: which versions can be retired this quarter
Metrics to track:
- Time-to-deprecate: major release → old major retired
- Exception count: # clients granted extended support (this is hidden cost)
What GitPlumbers does when versioning is already messy
When we get called in, the common pattern is: multiple versions, inconsistent routing, AI-assisted code changes that didn’t update the OpenAPI, and no one can answer “who is still on v1?”
The fastest path to sanity usually looks like:
- Build an adoption dashboard (by API key, tenant, or auth claim)
- Add CI breaking-change gates (`oasdiff` + Spectral)
- Stand up dual routing at the gateway and canary safely
- Add contract tests for your top consumers
Internal links:
- API and legacy rescue: https://gitplumbers.com/services/api-rescue
- Code rescue for AI-generated code: https://gitplumbers.com/services/vibe-code-cleanup
- Case studies: https://gitplumbers.com/case-studies
If you want a second set of eyes, bring us one service’s OpenAPI spec, your gateway config, and a list of top consumers. We can usually tell in a week whether you’re versioning safely or just hoping.
Next step: run oasdiff against your last 20 merges and see how many “breaking” changes slipped through. That number is your starting line.
Key takeaways
- Versioning is a policy problem first: define what counts as “breaking” and enforce it with tooling.
- Prefer additive evolution; when you must break, route versions at the edge and run both in parallel long enough to measure adoption.
- Track version adoption, error rates by version, and breaking-change rate—treat them like SLOs, not documentation.
- Automate compatibility gates in CI using OpenAPI diffs + consumer contracts (Pact) + runtime probes (Schemathesis).
- Deprecation without telemetry is fantasy; build dashboards and budgets for migration time.
Implementation checklist
- Define a “breaking change” rubric and publish it internally (and ideally to customers).
- Choose a versioning surface area (path vs header vs media type) and standardize it across services.
- Add CI gates: `oasdiff`/`openapi-diff` + Spectral rules for compatibility.
- Implement consumer-driven contract tests (Pact) for top consumers.
- Instrument metrics: traffic share by version, 4xx/5xx by version, p95 latency by version, deprecation burn-down.
- Implement deprecation headers and a predictable sunset process (with dates).
- Run dual-stack (v1 + v2) behind the gateway and canary traffic until SLOs hold.
- Delete old versions intentionally—after your dashboards say it’s safe.
Questions we hear from teams
- Should we use semantic versioning for APIs?
- Use SemVer concepts, but be explicit: most HTTP APIs only meaningfully support **major** (breaking) and **minor** (additive) evolution. The important part is enforcing what “breaking” means with automated diffs and contracts, not arguing about whether a change is 1.6.3 vs 1.7.0.
- Is header-based versioning better than `/v1` paths?
- Header/media-type versioning is elegant, but path-based majors are easier to route, debug, and operate across gateways and caches. For most orgs with multiple clients and imperfect tooling, `/v1` and `/v2` is the least surprising option.
- How do we know when it’s safe to shut off v1?
- When you can prove (1) traffic share for v1 is below a threshold you set (often <1–5%), (2) top tenants have migrated, and (3) v2 meets SLOs. If you can’t identify remaining clients by key/tenant, you’re not ready to sunset—add that telemetry first.
- What’s the fastest win if we’re already breaking clients accidentally?
- Add an OpenAPI breaking-change gate (`oasdiff -fail-on=breaking`) to CI for the current major version, and start tracking 4xx/5xx by version. That alone prevents a lot of “oops” releases and gives you hard data for prioritizing cleanup.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
