The API Versioning Plan That Stops 2 a.m. Rollbacks (and Keeps Old Clients Alive)
A pragmatic, battle-tested playbook for evolving public and internal APIs without lighting up PagerDuty. Includes concrete checkpoints, metrics, and tooling for REST, gRPC, and GraphQL.
Versioning isn’t a URL scheme. It’s a commitment you enforce with CI gates, runtime metrics, and a deprecation calendar you actually follow.
The failure mode you’re trying to avoid
I’ve watched “simple” API changes trigger week-long fire drills: a mobile app pinned to an old JSON shape, an enterprise customer running a six-month release train, and a backend team that quietly changed pagination semantics because “it still returns 200.” The result is always the same: backward compatibility becomes an incident, and versioning turns into tribal lore.
The trick isn’t just adding /v2. The trick is designing a system where:
- Additive changes ship weekly without coordinating with every consumer.
- Breaking changes are rare, deliberate, and measurable.
- Old clients don’t die unexpectedly (but they also don’t live forever).
At GitPlumbers, most “API versioning projects” we get called into are really API behavior archaeology plus guardrails. Here’s what actually works.
Step 1: Write down your compatibility contract (or you don’t have one)
Before you touch routing or URL paths, define what you promise clients. This becomes the rulebook your CI enforces.
- Define what counts as a breaking change for your API style.
- Decide what changes are conditionally breaking (depends on client assumptions).
- Document “safe” evolution patterns.
REST/JSON breaking changes (common list):
- Removing a field or endpoint
- Renaming fields
- Changing type (`string` → `number`)
- Making an optional field required
- Changing enum values (removing or repurposing)
- Changing pagination defaults/limits in ways that break loops
- Changing auth behavior (scopes/claims) without a transition plan
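To make the list above answerable in seconds, it can help to encode it as a small rule table. The sketch below is a hand-rolled illustration, not how any particular diff tool works: the `FieldSpec`/`Schema` model and the `classifyChanges` name are hypothetical, and it only covers field removal, type changes, and optional-to-required transitions.

```typescript
// Hypothetical, simplified schema model: field name -> type and required flag.
type FieldType = "string" | "number" | "boolean" | "object" | "array";
type FieldSpec = { type: FieldType; required: boolean };
type Schema = Record<string, FieldSpec>;
type Finding = { field: string; breaking: boolean; reason: string };

// Compare a schema before and after a change and label each difference.
function classifyChanges(before: Schema, after: Schema): Finding[] {
  const findings: Finding[] = [];
  for (const field of Object.keys(before)) {
    const oldSpec = before[field];
    const newSpec: FieldSpec | undefined = after[field];
    if (newSpec === undefined) {
      // A rename looks like a removal to clients that still read the old name.
      findings.push({ field, breaking: true, reason: "removed or renamed" });
      continue;
    }
    if (newSpec.type !== oldSpec.type) {
      findings.push({
        field,
        breaking: true,
        reason: `type changed from ${oldSpec.type} to ${newSpec.type}`,
      });
    }
    if (!oldSpec.required && newSpec.required) {
      findings.push({ field, breaking: true, reason: "optional field became required" });
    }
  }
  for (const field of Object.keys(after)) {
    if (before[field] === undefined) {
      // New optional fields are additive; new required fields break old senders.
      const breaking = after[field].required;
      findings.push({
        field,
        breaking,
        reason: breaking ? "new required field" : "new optional field (additive)",
      });
    }
  }
  return findings;
}
```

A table like this will not catch enum, pagination, or auth changes; those still need a written rule and a human reviewer.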
Semantics are real breaking changes even if the schema looks compatible:
- Sorting order changes
- Idempotency changes (a POST that was safe to retry becomes “sometimes” safe to retry)
- Timezone/format changes for timestamps
Checkpoint:
- You can answer: “If we do X, is it breaking?” in under 30 seconds.
Metrics to start tracking now:
- Breaking-change rate (count/month)
- Time-to-detect accidental breaking changes (should trend to minutes via CI)
- MTTR for client-impacting regressions
Step 2: Pick your versioning surface (and don’t mix schemes)
There are only a few sane places to put a version. Choose based on your clients and infrastructure.
Option A: URI versioning (/v1/...)
- Best for: public APIs, human debugging, broad client diversity, simple gateways
- Pros: easy routing, visible in logs, trivial docs
- Cons: encourages “forking” instead of evolving; people cargo-cult `/v3` forever
Example:
GET /v1/orders/123
GET /v2/orders/123
Option B: Media type / content negotiation (Accept: application/vnd...)
- Best for: APIs behind strong gateways (Kong/Envoy), internal platforms, when you want stable URLs
- Pros: more flexible; URL stays stable
- Cons: more work in clients; debugging is harder; some tooling is clunkier
Example:
GET /orders/123
Accept: application/vnd.acme.orders+json;version=2
Option C: Header versioning (X-API-Version: 2)
- Best for: internal APIs where you control clients
- Pros: easy to add; routing is simple
- Cons: not cache-friendly by default; less standard than media types
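For the header option, the cache problem is usually handled by sending `Vary: X-API-Version` so shared caches key on the version header. The sketch below shows one way to resolve and validate the header against a bounded set. The `resolveApiVersion` name, the supported set `[1, 2]`, the default of `1`, and the 400 response are all assumptions for illustration.

```typescript
// Assumed bounded set of supported versions; adjust to your API.
const SUPPORTED_VERSIONS: readonly number[] = [1, 2];

type VersionResolution =
  | { ok: true; version: number; responseHeaders: Record<string, string> }
  | { ok: false; status: 400; message: string };

// Resolve the X-API-Version request header to a supported version.
function resolveApiVersion(
  headerValue: string | undefined,
  defaultVersion: number = 1
): VersionResolution {
  // Vary tells caches that the response depends on this request header.
  const responseHeaders: Record<string, string> = { Vary: "X-API-Version" };

  if (headerValue === undefined || headerValue.trim() === "") {
    // Clients that send nothing stay on the default (oldest stable) version.
    return { ok: true, version: defaultVersion, responseHeaders };
  }

  const trimmed = headerValue.trim();
  const version = /^\d+$/.test(trimmed) ? Number(trimmed) : NaN;
  if (SUPPORTED_VERSIONS.indexOf(version) === -1) {
    return {
      ok: false,
      status: 400,
      message: `Unsupported X-API-Version "${trimmed}"; supported: ${SUPPORTED_VERSIONS.join(", ")}`,
    };
  }
  return { ok: true, version, responseHeaders };
}
```

Rejecting unknown versions loudly, rather than silently falling back, keeps a typo in a client config from turning into a quiet downgrade.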
gRPC note
For gRPC, versioning usually lives in protobuf package names (and/or service names). Don’t fight the ecosystem.
syntax = "proto3";
package acme.orders.v1;
service OrdersService {
  rpc GetOrder(GetOrderRequest) returns (GetOrderResponse);
}
Checkpoint:
- One API = one primary scheme. You’re not supporting both `/v1` and `X-API-Version` unless you enjoy ambiguity.
Tooling suggestion:
- If you’re using an API gateway (`Kong`, `NGINX`, `Envoy`), pick the scheme it can route and log cleanly.
Step 3: Design “v1 to vNext” as an additive migration, not a rewrite
The fastest way to blow up backward compatibility is to treat v2 as a blank slate. The boring (successful) approach is:
- Keep `v1` behavior stable.
- Implement `v2` as v1 + additive capabilities.
- Use translation layers only where you must.
REST example: keep old shape, add new fields
If v1 has:
{ "id": "123", "total": 4200 }
v2 can safely add:
{ "id": "123", "total": 4200, "currency": "USD", "lineItems": [] }
Rule: clients must ignore unknown fields. If you have clients that choke on extra fields (yes, Jackson rejects unknown properties unless `FAIL_ON_UNKNOWN_PROPERTIES` is disabled), fix the client or put those clients on a stricter negotiated response.
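On the client side, the "ignore unknown fields" rule is often called a tolerant reader. Here is a minimal sketch of one for the v1 order shape above; `readOrderV1` and the `OrderV1` type are illustrative names, not part of any SDK.

```typescript
// The only fields a v1 client depends on.
type OrderV1 = { id: string; total: number };

// Tolerant reader: validate the fields we need, silently ignore everything else.
function readOrderV1(payload: unknown): OrderV1 {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("order payload must be a JSON object");
  }
  const { id, total } = payload as Record<string, unknown>;
  if (typeof id !== "string") throw new Error("order.id must be a string");
  if (typeof total !== "number") throw new Error("order.total must be a number");
  // Extra fields such as currency or lineItems are dropped, not rejected.
  return { id, total };
}
```

A client written this way keeps working when it is handed the v2 payload, which is exactly the property the additive approach relies on.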
Express routing example (URI versioning)
import express from "express";

const app = express();

// loadOrder is assumed to be your existing data-access function (not shown here).

// v1: the original response shape, kept stable.
app.get("/v1/orders/:id", async (req, res) => {
  const order = await loadOrder(req.params.id);
  res.json({ id: order.id, total: order.totalCents });
});

// v2: the v1 shape plus additive fields only.
app.get("/v2/orders/:id", async (req, res) => {
  const order = await loadOrder(req.params.id);
  res.json({
    id: order.id,
    total: order.totalCents,
    currency: order.currency,
    lineItems: order.items,
  });
});
Gateway routing example (NGINX)
location ~ ^/v(?<ver>\d+)/orders/ {
    proxy_set_header X-Api-Version $ver;
    proxy_pass http://orders-service;
}
Checkpoint:
- You can deploy `v2` with zero changes required from at least one real `v1` client.
Metrics:
- Per-version request volume (v1 vs v2)
- 4xx rate by version (a sudden v2-only spike is an adoption problem; a v1 spike is a regression)
- p95 latency by version (translation layers can hurt)
Step 4: Put compatibility enforcement in CI (so humans don’t have to remember)
I’ve seen “we’ll be careful” fail at Google-scale and startup-scale. You need an automated bouncer.
REST/OpenAPI: diff the spec
- Generate or maintain an `OpenAPI 3.0/3.1` spec.
- In CI, compare the branch spec to `main`.
- Fail the build on unapproved breaking changes.
Using oasdiff (one popular option):
# compare two specs and exit non-zero on breaking changes
oasdiff breaking ./openapi-main.yaml ./openapi-branch.yaml --fail-on ERR
GitHub Actions sketch:
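name: api-compat
on: [pull_request]
jobs:
  oasdiff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Download URL as given; confirm the asset name against the oasdiff releases page.
      - run: curl -Ls https://github.com/Tufin/oasdiff/releases/latest/download/oasdiff-linux-amd64 -o oasdiff
      - run: chmod +x oasdiff
      # Assumes openapi-main.yaml (the spec from main) is already in the workspace.
      - run: ./oasdiff breaking openapi-main.yaml openapi.yaml --fail-on ERR
Add linting with Spectral to keep specs honest: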
spectral lint openapi.yaml
gRPC/protobuf: use buf
buf breaking --against '.git#branch=main'
Checkpoint:
- A PR that removes a response field or changes a type cannot merge without an explicit override and reviewer sign-off.
Proof-point metric:
- “Breaking changes caught in CI” should be non-zero early on. If it’s zero forever, either you’re perfect (unlikely) or the gate isn’t real.
Step 5: Build consumer reality checks (contracts + traffic)
Specs catch structural breaks. They don’t catch semantic breaks or the “weird client from 2017.” You need runtime signals.
Consumer-driven contract testing (CDC)
If you have a manageable set of known consumers, Pact is still one of the most practical tools.
- Consumers publish expectations.
- Provider verifies them in CI.
- You find out you broke Billing before Billing finds out you broke Billing.
Checkpoint:
- Your highest-risk consumers (mobile, partner integrations, revenue-critical services) are covered by CDC or equivalent golden tests.
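If Pact is more than you can adopt right now, the core idea still fits in a few lines. The sketch below is a hand-rolled illustration of a consumer expectation being verified against a provider response. It is not Pact's API, and `ConsumerContract`, `verifyContract`, and the "billing" consumer are hypothetical names.

```typescript
type JsonType = "string" | "number" | "boolean" | "object" | "array" | "null";

// What one consumer says it needs from a response: field name -> expected JSON type.
type ConsumerContract = { consumer: string; requiredFields: Record<string, JsonType> };

function jsonTypeOf(value: unknown): JsonType | "missing" {
  if (value === undefined) return "missing";
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  const t = typeof value;
  return t === "string" || t === "number" || t === "boolean" ? t : "object";
}

// Returns one message per violated expectation; an empty array means the contract holds.
function verifyContract(
  contract: ConsumerContract,
  providerResponse: Record<string, unknown>
): string[] {
  const violations: string[] = [];
  for (const field of Object.keys(contract.requiredFields)) {
    const expected = contract.requiredFields[field];
    const actual = jsonTypeOf(providerResponse[field]);
    if (actual !== expected) {
      violations.push(`${contract.consumer}: expected ${field} to be ${expected}, got ${actual}`);
    }
  }
  return violations;
}
```

Run in the provider's CI against a recorded or generated response, a non-empty result fails the build. Extra fields in the response are fine by design, which matches the additive rule from Step 3.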
Runtime metrics that matter
Instrument and dashboard:
- Requests by version (top consumers if you can identify them)
- 4xx/5xx by route + version
- Schema/validation errors (if you validate)
- Unknown field usage (if you can log/trace it)
If you’re already on Prometheus, make it easy:
# example Prometheus scrape label strategy (conceptual)
metric_relabel_configs:
  - source_labels: [__name__]
    regex: http_server_requests_seconds_count
    action: keep
And in your app, tag metrics with `api_version`, `route`, and `status`.
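The tagging itself is simple. The sketch below uses a plain in-memory counter to show the label shape and how a per-version 4xx rate falls out of it; it stands in for whatever metrics client you actually use, and `VersionedRequestCounter` is a hypothetical name rather than a Prometheus library class.

```typescript
type RequestLabels = { apiVersion: string; route: string; status: number };

// In-memory stand-in for a labeled counter (api_version, route, status class).
class VersionedRequestCounter {
  private counts = new Map<string, number>();

  record({ apiVersion, route, status }: RequestLabels): void {
    // Bucket by status class to keep label cardinality low.
    const statusClass = `${Math.floor(status / 100)}xx`;
    const key = JSON.stringify([apiVersion, route, statusClass]);
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Share of a version's requests that ended in a 4xx response.
  clientErrorRate(apiVersion: string): number {
    let total = 0;
    let clientErrors = 0;
    this.counts.forEach((count, key) => {
      const [version, , statusClass] = JSON.parse(key) as [string, string, string];
      if (version !== apiVersion) return;
      total += count;
      if (statusClass === "4xx") clientErrors += count;
    });
    return total === 0 ? 0 : clientErrors / total;
  }
}
```

Use the route template (`/v1/orders/:id`) rather than the raw path as the `route` label, otherwise every order ID becomes its own time series.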
SLO suggestion:
- Availability SLO per version (yes, per version): `99.9%` monthly for `v1` and `v2` until `v1` is formally retired.
Step 6: Deprecation that doesn’t turn into an eternal support burden
This is where most orgs lie to themselves. They say “we’ll turn off v1 in 90 days,” then a big customer shows up and suddenly it’s 900 days.
What works is boring governance plus tooling:
- Publish a deprecation policy (dates, criteria, exceptions).
- Emit explicit warnings to clients.
- Measure adoption weekly.
- Enforce the sunset.
Add deprecation signals to responses
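Use `Deprecation` and `Sunset` headers (standardized as RFC 9745 and RFC 8594), plus docs links. RFC 9745 defines the `Deprecation` value as a date such as `@1767225600` (2026-01-01 UTC); the boolean `true` form comes from earlier drafts, so check which form your clients and tooling parse.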
Deprecation: true
Sunset: Thu, 01 Oct 2026 00:00:00 GMT
Link: <https://docs.acme.com/apis/orders/v1-deprecation>; rel="deprecation"
In OpenAPI, mark operations as deprecated:
paths:
  /v1/orders/{id}:
    get:
      deprecated: true
      responses:
        "200":
          description: OK
Checkpoints for sunsetting
- 90 days out: headers + emails + dashboard shared
- 30 days out: error-budget review; confirm top consumers migrated
- 7 days out: final notice; increase alerting on v1 error spikes (clients scrambling)
- Sunset day: block or serve a clear error response with migration link
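The calendar above is easier to enforce when code, not memory, decides which phase you are in. The sketch below maps a date to a phase and to the response treatment for a deprecated version. The names (`sunsetPhase`, `handleDeprecatedVersion`) and the choice of `410 Gone` after sunset are assumptions; serving another clear error status would fit the same pattern.

```typescript
type SunsetPhase = "active" | "notice-90" | "review-30" | "final-7" | "sunset";

const DAY_MS = 24 * 60 * 60 * 1000;

// Map "now" to the checkpoint phase relative to the sunset date.
function sunsetPhase(now: Date, sunset: Date): SunsetPhase {
  const daysLeft = Math.ceil((sunset.getTime() - now.getTime()) / DAY_MS);
  if (daysLeft <= 0) return "sunset";
  if (daysLeft <= 7) return "final-7";
  if (daysLeft <= 30) return "review-30";
  if (daysLeft <= 90) return "notice-90";
  return "active";
}

type DeprecatedVersionOutcome = {
  phase: SunsetPhase;
  headers: Record<string, string>;
  // Non-null only once the version is past its sunset date.
  block: { status: number; body: { error: string; migrationGuide: string } } | null;
};

function handleDeprecatedVersion(
  now: Date,
  sunset: Date,
  migrationUrl: string
): DeprecatedVersionOutcome {
  const phase = sunsetPhase(now, sunset);
  // Warning headers start at the 90-day mark and stay on afterwards.
  const headers: Record<string, string> =
    phase === "active"
      ? {}
      : { Sunset: sunset.toUTCString(), Link: `<${migrationUrl}>; rel="deprecation"` };
  if (phase === "sunset") {
    return {
      phase,
      headers,
      block: {
        status: 410, // assumption: Gone; pick whatever clear error your policy names
        body: { error: "This API version has been retired.", migrationGuide: migrationUrl },
      },
    };
  }
  return { phase, headers, block: null };
}
```

Wiring `phase` into alert thresholds as well as headers covers the "increase alerting" item at the 7-day mark without a manual step.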
Metrics that keep you honest:
- % traffic on deprecated versions (goal: monotonic down)
- # of unique consumers still on v1 (goal: down)
- Support tickets related to migration (goal: spike early, then down)
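The first metric is cheap to compute from per-version request counts. As a rough sketch, assuming weekly samples shaped like the hypothetical `WeeklyTraffic` type below, this computes the deprecated share and flags any week where it went up instead of down:

```typescript
type WeeklyTraffic = { week: string; requestsByVersion: Record<string, number> };

// Fraction of a week's requests that hit deprecated versions.
function deprecatedShare(sample: WeeklyTraffic, deprecatedVersions: string[]): number {
  let total = 0;
  let deprecated = 0;
  for (const version of Object.keys(sample.requestsByVersion)) {
    const count = sample.requestsByVersion[version];
    total += count;
    if (deprecatedVersions.indexOf(version) !== -1) deprecated += count;
  }
  return total === 0 ? 0 : deprecated / total;
}

// Weeks that violate the "monotonic down" goal (share rose versus the prior week).
function weeksWhereShareRose(
  samples: WeeklyTraffic[],
  deprecatedVersions: string[]
): string[] {
  const regressions: string[] = [];
  for (let i = 1; i < samples.length; i++) {
    const previous = deprecatedShare(samples[i - 1], deprecatedVersions);
    const current = deprecatedShare(samples[i], deprecatedVersions);
    if (current > previous) regressions.push(samples[i].week);
  }
  return regressions;
}
```

A flagged week is worth a look before the next checkpoint: it usually means a new consumer was onboarded onto the old version or a migrated client rolled back.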
If you can’t name an owner for “v1 retirement,” congratulations—you’ve just created a permanent compatibility tax.
Step 7: The “don’t be clever” rules I wish more teams followed
These are scars talking:
- Don’t version for every change. Version for breaking changes and major semantic shifts. Everything else should be additive.
- Don’t fork the whole backend per version unless you absolutely must. Prefer shared core + thin adapters.
- Don’t ship an untestable API. If you can’t generate a spec or define contracts, you’re flying blind.
- Don’t let clients choose arbitrary versions forever. Support a bounded set (`v1`, `v2`) with published dates.
- Don’t ignore “behavior compatibility.” A schema diff won’t catch that you changed rounding, ordering, or idempotency.
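The "shared core + thin adapters" rule can be sketched in a few lines: one domain model, one serializer per supported version, and a lookup that only accepts the bounded set. The `Order` type and function names below are illustrative and reuse the field names from the Express example in Step 3.

```typescript
// Shared core: one domain model, loaded once regardless of API version.
type Order = {
  id: string;
  totalCents: number;
  currency: string;
  items: { sku: string; quantity: number }[];
};

// Thin adapters: each version is only a projection of the shared model.
const toV1 = (order: Order) => ({ id: order.id, total: order.totalCents });
const toV2 = (order: Order) => ({
  ...toV1(order), // v2 is defined as v1 plus additive fields
  currency: order.currency,
  lineItems: order.items,
});

// The bounded set of supported versions lives in one place.
const serializers = { v1: toV1, v2: toV2 };
type SupportedVersion = keyof typeof serializers;

function serializeOrder(order: Order, version: SupportedVersion) {
  return serializers[version](order);
}
```

Because `toV2` is built on top of `toV1`, a change that would alter the v1 shape shows up in both versions at once, which is the kind of mistake the CI gate in Step 4 is meant to catch.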
If you’re already in the weeds—three versions live, inconsistent payloads, and “vibe-coded” handlers nobody wants to touch—GitPlumbers typically starts by wiring up the spec + diff gates, then stabilizing behavior with contract tests and per-version dashboards. That’s the path from folklore to an actual system.
If you want a second set of eyes, here are the kinds of engagements we do: API rescue and hardening and legacy modernization.
Key takeaways
- Prefer additive evolution and strict compatibility rules; treat breaking changes like production incidents.
- Pick a versioning surface deliberately (URI vs header/media type) based on your client ecosystem and gateways.
- Automate breaking-change detection with OpenAPI/proto diffs in CI and enforce with a “no unreviewed breaking change” gate.
- Measure adoption and compatibility with per-version traffic, 4xx deltas, and client upgrade velocity.
- Deprecation needs dates, headers, dashboards, and an owner—otherwise you’ll run zombie versions forever.
Implementation checklist
- Define what “breaking” means for your API (types, fields, semantics, auth, pagination).
- Choose a single versioning scheme per API and document it (URI or media type; avoid mixing).
- Add CI gates: `oasdiff`/`openapi-diff` for REST, `buf` for protobuf, schema checks for GraphQL.
- Instrument per-version metrics: request volume, latency, 4xx/5xx rates, and top consumers.
- Ship compatibility tests (consumer-driven contracts or golden tests) before launching `vNext`.
- Publish a deprecation policy with timelines and enforce it with headers + dashboards.
- Use canaries and gradual rollout (gateway routing + feature flags) for risky changes.
Questions we hear from teams
- Should I use SemVer for REST APIs?
- Use SemVer thinking, not SemVer theater. SemVer works great for libraries; HTTP APIs evolve differently. Treat **breaking changes** as a major version bump (new `/v2` or negotiated version). Treat additive, backward-compatible changes as minor/patch and ship them continuously without forcing client upgrades.
- Is `/v1` in the URL always bad?
- No. URI versioning is often the most operationally transparent choice—especially for public APIs and partner integrations. The failure mode is using `/v2` as permission to rewrite everything. If you keep versions bounded, add CI compatibility checks, and run a real deprecation process, `/v1` is perfectly fine.
- How do we handle versioning with GraphQL?
- GraphQL typically avoids explicit versions by evolving the schema additively: add fields, deprecate old fields with `@deprecated`, and never change field meaning. If you need a true breaking change, you can introduce a parallel field or a new root type. The same rules apply: measure deprecated field usage and enforce sunsets.
- What’s the minimum tooling stack to prevent accidental breaking changes?
- For REST: an `OpenAPI` spec + `oasdiff` (or `openapi-diff`) in CI, plus basic `Prometheus`/`Grafana` per-version dashboards. For gRPC: `buf breaking`. Add CDC (`Pact`) for your top consumers when you can.
- How many API versions should we support concurrently?
- As few as you can, typically **2** (current + previous). Supporting 3+ versions usually means you lack a working deprecation mechanism or you’re papering over deeper coupling. If you must support more, make it time-bound and visible with adoption metrics and a retirement owner.
