The Monolith That Wouldn’t Die: How We Made a 12-Year Legacy App Ship Weekly Without a “Big Rewrite”
A real-world modernization: modularizing a Java/Spring monolith, carving out the right services, and turning “deploy day” from a fire drill into a routine.
The biggest win wasn’t microservices. It was making change safe again.
The situation: “Deploy day” was a weekly incident generator
I’ve seen this movie since the early 2000s: a monolith that started life as a reasonable Rails/Java/.NET app, then got “enterprise-ified” until every change required a priest, a pager, and a weekend.
This client was a 250-person B2B SaaS company in the payments/fintech orbit (read: PCI-ish controls, audits, and zero tolerance for data weirdness). The core app was a Java 8 + Spring MVC monolith with a single shared database (PostgreSQL) and a growing pile of “helpers” that every team touched.
Symptoms you’ll recognize:
- Release cadence: monthly (and still scary)
- Change-failure rate: ~30% of releases caused rollback/hotfix
- MTTR: ~6 hours (because reproducing prod issues locally was fantasy)
- p95 latency: ~900ms on core workflows during peak
- On-call: alert storms, shallow dashboards, deep blame
They’d already been pitched the classic consultant bingo card: “Just do microservices” + “move to Kubernetes” + “rewrite the backend in $NEW_THING.” They didn’t need hype. They needed the thing to stop breaking.
GitPlumbers came in with one constraint up front: no big-bang rewrite. Keep revenue flowing while we modernize.
Constraints that made this hard (and why most rewrites die here)
The most dangerous part of legacy modernization is pretending constraints don’t exist. These did:
- Regulated environment: audit trails and access controls mattered as much as throughput.
- Hard uptime requirements: maintenance windows were basically gone.
- Coupled data model: one schema served everything; “just split the database” wasn’t a plan.
- Org reality: teams were feature-focused, not platform-focused. Any approach requiring a 6-month freeze was dead on arrival.
- A splash of AI-assisted code: some newer modules were “vibe-coded” into existence—fast, inconsistent, and not always correct. We treated that as risk to quarantine, not a moral failing.
Here’s what I’ve learned the hard way: if you extract services before you have observability, contracts, and operational discipline, you don’t get microservices—you get distributed monolith pain.
What we did first: make the monolith safe to change
Before we carved anything out, we stabilized delivery. The goal was simple: reduce blast radius and increase confidence.
Baseline metrics (so we could prove improvement)
- DORA-style: deploy frequency, lead time, change-failure rate, MTTR
- Runtime: p95 latency, error rate, saturation (CPU, DB connections)
Add real observability
- `OpenTelemetry` tracing across HTTP + DB calls
- `Prometheus` metrics + `Grafana` dashboards
- Standardized log fields: `trace_id`, `customer_id`, `request_id`
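For the standardized log fields, here’s a minimal sketch of the shape this took: a servlet filter that copies the active OpenTelemetry trace ID and a couple of request headers into the SLF4J MDC. The filter, the header names, and having the OTel API on the classpath are illustrative assumptions, not the client’s exact code.

```java
// Hypothetical sketch: push trace/request/customer identifiers into the logging MDC
// so every log line carries the fields the dashboards and queries expect.
import java.io.IOException;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import io.opentelemetry.api.trace.Span;
import org.slf4j.MDC;
import org.springframework.web.filter.OncePerRequestFilter;

public class LogContextFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        try {
            // trace_id comes from the active OTel span (created by the javaagent shown below)
            MDC.put("trace_id", Span.current().getSpanContext().getTraceId());
            // header names are assumptions about what the edge already forwarded
            putIfPresent("request_id", request.getHeader("X-Request-Id"));
            putIfPresent("customer_id", request.getHeader("X-Customer-Id"));
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // don't leak context across pooled threads
        }
    }

    private static void putIfPresent(String key, String value) {
        if (value != null) {
            MDC.put(key, value);
        }
    }
}
```

The agent handles the traces; this just makes sure the logs line up with them.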
```bash
# Example: instrumenting a Spring service with OTel agent (no code changes to start)
java -javaagent:/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=core-monolith \
  -Dotel.exporter.otlp.endpoint=https://otel-collector.internal:4317 \
  -jar app.jar
```
Create boundaries inside the monolith (the part people skip)
- We introduced a modular monolith structure: `billing`, `accounts`, `orders`, `reporting`
- Enforced dependency rules (no “reach into another package’s guts”)
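Dependency rules only hold if something fails the build when they’re broken. One common way to do that on the JVM is ArchUnit; here’s a minimal sketch, with the root package and module layout assumed rather than taken from the client’s real tree.

```java
// Hypothetical sketch: fail the build when another module reaches into billing internals.
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;
import org.junit.jupiter.api.Test;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

class ModuleBoundaryTest {

    // assumed root package for the monolith's code
    private final JavaClasses classes = new ClassFileImporter().importPackages("com.example.app");

    @Test
    void onlyBillingTouchesBillingInternals() {
        ArchRule rule = noClasses()
                .that().resideOutsideOfPackage("..billing..")
                .should().dependOnClassesThat().resideInAPackage("..billing.internal..");
        rule.check(classes);
    }
}
```

A failing test in CI turns “please respect the boundary” from a code-review plea into a build error.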
```typescript
// Example: a boundary-friendly interface (conceptual)
export interface BillingService {
  createInvoice(customerId: string, orderId: string): Promise<{ invoiceId: string }>
}
// The key is not TypeScript vs Java—the key is *module ownership* and a stable interface.
```
CI/CD that fails fast
- Integration tests moved to ephemeral environments (Docker Compose for PRs)
- Contract tests added where we knew extraction would happen
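“Contract test” sounds heavyweight; the first versions weren’t. A minimal sketch of the idea: pin the JSON shape of an endpoint we knew we’d extract, so the monolith and the future billing service can’t silently drift apart. The controller, field names, and the Spring Boot test slice are assumptions here; on plain Spring MVC, `MockMvcBuilders.standaloneSetup` gives you the same check.

```java
// Hypothetical sketch: a response-shape check on a billing endpoint slated for extraction.
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.test.web.servlet.MockMvc;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

// InvoiceController is hypothetical; its collaborators would be @MockBean'd in the real test.
@WebMvcTest(InvoiceController.class)
class InvoiceContractShapeTest {

    @Autowired
    private MockMvc mvc;

    @Test
    void invoiceResponseKeepsItsShape() throws Exception {
        mvc.perform(get("/billing/invoices/{id}", "00000000-0000-0000-0000-000000000000"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.invoiceId").exists())
                .andExpect(jsonPath("$.customerId").exists())
                .andExpect(jsonPath("$.totalCents").isNumber());
    }
}
```

Full consumer-driven contracts came later; even this crude version caught shape regressions before they hit the edge.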
Result after 4 weeks: fewer “unknown unknowns.” Incidents didn’t stop yet—but we could finally see causes instead of guessing.
The architecture move: modular monolith first, then selective strangling
There’s a reason the pendulum swung back from “microservices everything.” Most orgs don’t need 40 services; they need clear seams and independent deployability where it actually pays.
We used a two-step strategy:
- Step A: modular monolith to stop internal sprawl
- Step B: Strangler Fig to extract the highest-churn domain (billing) behind a stable edge
We introduced an API edge that routed requests either to the monolith or new services. Nothing fancy—just a clean choke point.
```yaml
# Example: Kubernetes Ingress routing during strangler phase
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-edge
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /billing
            pathType: Prefix
            backend:
              service:
                name: billing-svc
                port:
                  number: 8080
          - path: /
            pathType: Prefix
            backend:
              service:
                name: monolith-svc
                port:
                  number: 8080
```
Why billing?
- High change rate (pricing experiments, invoicing rules)
- Clear domain boundaries (inputs/outputs were already semi-stable)
- Painful failure modes (billing bugs become money + trust bugs)
We explicitly did not extract low-churn domains just to say we did microservices. I’ve watched that vanity project eat roadmaps.
The data problem: shared database without “YOLO distributed transactions”
The shared DB is where most modernization efforts faceplant. If you split compute but keep one schema and let everyone write to everything, you’ve built a slower monolith with extra steps.
We did three practical things:
- Write ownership: billing service owned billing tables. The monolith stopped writing them.
- Outbox pattern: publish domain events reliably without two-phase commit.
- Idempotent consumers: because retries happen (especially at 2am); both halves are sketched in code after the topic config below.
```sql
-- Outbox table (owned by the billing service)
CREATE TABLE billing_outbox (
  id UUID PRIMARY KEY,
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at TIMESTAMPTZ
);
CREATE INDEX ON billing_outbox (published_at) WHERE published_at IS NULL;
```

```yaml
# Example: Kafka topic config (conceptual)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: billing-events
spec:
  partitions: 6
  replicas: 3
  config:
    min.insync.replicas: 2
    cleanup.policy: compact
```
And the non-negotiable rule:
- The monolith could subscribe to `billing-events`, but it could not “helpfully” reach into billing tables anymore.
That single rule prevented months of backsliding.
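To make the outbox and idempotent-consumer bullets concrete, here’s a minimal sketch of both halves using Spring’s `JdbcTemplate` and spring-kafka. The relay class, the `processed_events` table, and the wiring are illustrative assumptions, not the client’s actual code; what matters is the shape: publish from the outbox table, dedupe on event id at the consumer.

```java
// Hypothetical sketch of the two halves of the pattern:
//  1) a relay that publishes unsent billing_outbox rows to Kafka,
//  2) a consumer in the monolith that treats redelivered events as no-ops.
// Both would be registered as Spring beans in the real app.
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.transaction.annotation.Transactional;

public class BillingOutboxRelay {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    public BillingOutboxRelay(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    // The partial index on published_at IS NULL keeps this poll cheap. A production relay
    // confirms the send before stamping published_at; this sketch keeps it short. Either
    // way, duplicates are fine because consumers are idempotent.
    @Scheduled(fixedDelay = 1000)
    @Transactional
    public void publishPending() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT id, event_type, payload FROM billing_outbox "
                        + "WHERE published_at IS NULL ORDER BY created_at LIMIT 100");
        for (Map<String, Object> row : rows) {
            UUID id = (UUID) row.get("id");
            kafka.send("billing-events", id.toString(), row.get("payload").toString());
            jdbc.update("UPDATE billing_outbox SET published_at = now() WHERE id = ?", id);
        }
    }
}

class BillingEventsConsumer {

    private final JdbcTemplate jdbc;

    BillingEventsConsumer(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Dedupe on event id: processed_events (event_id as primary key, hypothetical table)
    // and the side effect share one transaction, so a retried delivery is a no-op
    // instead of a double-charge.
    @KafkaListener(topics = "billing-events", groupId = "core-monolith")
    @Transactional
    public void onBillingEvent(ConsumerRecord<String, String> record) {
        int inserted = jdbc.update(
                "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING",
                UUID.fromString(record.key()));
        if (inserted == 0) {
            return; // already processed this event
        }
        // ...apply the event from record.value() to local read models here...
    }
}
```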
Delivery mechanics: GitOps, progressive rollout, and fewer 2am surprises
Once billing was extractable, we had to make deploys boring. We’ve all seen teams “modernize” architecture but keep a 2009-era release process.
We implemented:
- `ArgoCD` for GitOps (cluster state matches Git, not tribal knowledge)
- Canary deployments for billing changes
- Feature flags for risky behavior changes (especially pricing rules)
- SLOs with alerting based on symptoms, not noise
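For the feature-flag bullet, the pattern was deliberately boring: new pricing behavior sits behind a flag so a bad rule gets turned off instead of rolled back. A minimal sketch follows; the `FeatureFlags` interface and the per-customer keying are stand-ins for whatever flag provider you already run, not a claim about the client’s setup.

```java
// Hypothetical sketch: risky pricing changes default to the legacy path unless the
// flag is on for this customer, so "turn it off" is the rollback.
public class PricingService {

    /** Stand-in for a real flag provider (LaunchDarkly, Unleash, a DB table, etc.). */
    public interface FeatureFlags {
        boolean isEnabled(String flagKey, String customerId);
    }

    private final FeatureFlags flags;

    public PricingService(FeatureFlags flags) {
        this.flags = flags;
    }

    public long invoiceTotalCents(String customerId, long baseCents) {
        if (flags.isEnabled("billing.new-pricing-rules", customerId)) {
            return applyNewPricingRules(baseCents); // canary cohort only
        }
        return applyLegacyPricingRules(baseCents);  // safe default for everyone else
    }

    private long applyNewPricingRules(long baseCents) {
        // ...rules being trialed behind the flag...
        return baseCents;
    }

    private long applyLegacyPricingRules(long baseCents) {
        // ...the rules customers are billed with today...
        return baseCents;
    }
}
```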
```yaml
# Example: Horizontal Pod Autoscaler for billing service
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
```

```bash
# Example: ArgoCD sync for a service (what “push button deploy” became)
argocd app sync billing-service-prod
argocd app wait billing-service-prod --health
```
We also added a simple but effective operational guardrail: error-budget-based release decisions. If the billing SLO was burning, experiments paused. It stopped the “ship more to fix the last ship” spiral.
Results: measurable outcomes after 90 days (and what changed culturally)
By day 90, we had not “killed the monolith.” We’d done something better: we made it maintainable and started extracting where it paid off.
Measured outcomes (production):
- Deploy frequency: monthly → weekly (billing service: multiple deploys/week)
- MTTR: `6 hours` → **45 minutes** (better signals + smaller blast radius)
- Change-failure rate: `30%` → **12%** (still room to improve, but no longer roulette)
- p95 latency (billing endpoints): `900ms` → **220ms** (cache + reduced contention)
- Incident volume: down ~40% (fewer cascading failures)
- Infra cost: down ~25% (targeted scaling instead of scaling the whole monolith)
What changed culturally was just as important:
- Teams stopped treating the monolith as a haunted house nobody touches.
- New development defaulted to clear module ownership.
- AI-assisted changes got safer because we had contracts, CI, and observability to catch “looks right” code.
The biggest win wasn’t microservices. It was making change safe again.
What actually worked (and what we’d do differently next time)
Here’s the blunt guidance I give leaders who’ve been burned before:
- Don’t start with Kubernetes. Start with boundaries, tests, and telemetry. K8s won’t fix spaghetti.
- Modular monolith buys you time. You’ll get 60–70% of the benefit without the distributed systems tax.
- Extract based on economics. High change rate + clear domain seam + high business risk = good candidate.
- Make data ownership explicit. Without it, your “services” are just remote method calls with latency.
- Use canaries/flags like grown-ups. If you can’t roll forward safely, you’re not actually shipping.
What we’d do differently:
- Add consumer-driven contract tests earlier. We added them once extraction started; earlier would’ve reduced rework.
- Start SLOs in week one. We added them around week four; those early weeks had too much “it feels better” decision-making.
If you’re staring at a legacy monolith and dreading the next roadmap cycle, GitPlumbers can help you modernize without betting the business on a rewrite.
Key takeaways
- Start by making the monolith safe to change (tests, boundaries, observability) before extracting services.
- Prefer a **modular monolith** over premature microservices; extract only where you have high change rate + clear domain seams.
- Use **contract tests** and an **outbox pattern** to avoid distributed data “mystery meat.”
- Tie modernization to measurable KPIs: deploy frequency, change-failure rate, MTTR, p95 latency, and infra cost.
Implementation checklist
- Baseline current delivery + reliability metrics (deploy frequency, MTTR, change-failure rate, p95 latency)
- Define bounded contexts and enforce module boundaries in-code (not in a slide deck)
- Introduce an API edge (`/api`) so you can strangle safely
- Add observability (`OpenTelemetry`, `Prometheus`, trace IDs) before you split anything
- Implement outbox + idempotent consumers before adopting async events
- Automate releases with CI/CD + GitOps (`ArgoCD`) and add progressive delivery (canary/flags)
- Extract 1-2 services max at a time; measure impact; rinse and repeat
Questions we hear from teams
- Should we always break a monolith into microservices?
- No. Most teams get faster by enforcing module boundaries and shipping a modular monolith first. Extract services only where the domain seam is clear and the change rate/business risk makes the distributed-systems tax worth it.
- How do we split a shared database without breaking everything?
- Start with explicit write ownership and stop cross-domain writes. Use an outbox pattern for reliable event publication and make consumers idempotent. Full DB-per-service can come later, but you can get major wins without it.
- What metrics should we track to prove modernization is working?
- Deploy frequency, lead time for changes, change-failure rate, MTTR, plus service SLOs (latency/error rate) and unit costs (e.g., cost per 1k requests or per customer workflow).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
