The Monolith We Didn’t Rewrite: Turning a 12‑Year Java App Into Something You Can Ship
A fintech’s Java/Oracle monolith was burning engineers and budgets. We didn’t rewrite it. We made it maintainable, safe to change, and fast to ship — in six months, under PCI constraints, without adding headcount.
“For the first time in two years, we shipped on Friday and went home.” — VP Eng, Ledgerly
The mess we walked into
A mid-market fintech (call them “Ledgerly”), ~$200M ARR, had a 12-year-old Java monolith: Java 8, Spring MVC 3.2, Hibernate 4, running on WebLogic 12c with an Oracle 11g backend (8 TB). Deploys were monthly, usually on a Saturday night with pizza and a prayer. CI was a chain of Jenkins freestyle jobs and shell scripts. Version control: SVN. Observability: grep and vibes.
- Incidents were up 40% YoY, pages averaged 18/week.
- MTTR sat around 140 minutes.
- Change failure rate hovered at 27%.
- p95 for the checkout flow was 1.4s on a good day.
- Compliance (PCI DSS) meant tight change windows and ugly audit trails.
Leadership wanted microservices yesterday. I’ve seen that movie. You spend 18 months and three re-orgs to get a slower, more expensive system. We pitched something different: keep the monolith, make it modular, carve only the hotspots, and make shipping safe and reversible. GitPlumbers got six months, no headcount increase, and a 99.9% SLA to maintain.
Constraints that made it interesting
If you’ve done regulated fintech, you know the constraints write half your playbook.
- No headcount increase, no freezing feature work for more than two sprints.
- Oracle 11g could not be upgraded within the window. Data gravity was real.
- PCI DSS and SOX auditability: change approvals, artifact provenance, least privilege.
- 99.9% SLO publicly committed, with quarter-end freeze windows we couldn’t touch.
- Vendor lock-in pockets (WebLogic, HAProxy, a proprietary fraud service) we had to accommodate.
- Batch jobs: nightly reconciliation took 8 hours and couldn’t be broken easily.
These constraints killed the “greenfield rewrite” fantasy. We needed incremental wins with measurable risk control.
What we changed first (first 90 days)
You can’t steer what you can’t see. We front-loaded telemetry, safety rails, and version control sanity. No architecture astronautics; just plumbing.
- Version control and CI/CD hygiene
  - Migrated SVN to GitHub Enterprise: trunk-based development with CODEOWNERS, protected branches, and required reviews.
  - Built CI in GitHub Actions with `mvn -T 1C -DskipTests=false` parallelization; average build time dropped from 18 to 7 minutes.
  - Added pre-commit hooks for formatting and basic static checks.
- Containerization and repeatable deploys
  - Wrapped the monolith in a container using a distroless base (`gcr.io/distroless/java`), signed images with cosign.
  - Provisioned EKS with Terraform, deployed via ArgoCD (GitOps). WebLogic went away; we ran the app as a plain Java process behind NGINX Ingress.
- Observability before refactoring
  - Dropped in OpenTelemetry Java auto-instrumentation, Prometheus scraping via ServiceMonitor, logs to Loki, traces to Tempo.
  - Created Grafana SLO dashboards for checkout, login, and reconciliation with clear error budgets.
- Safety switches
  - Introduced LaunchDarkly for feature flags. Every risky change shipped dark first (a minimal flag-gating sketch follows after this list).
  - Wrote runbooks and “two-button rollback” using ArgoCD app diffs and history.
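To make “shipped dark first” concrete, here is a minimal sketch of a flag-gated code path using LaunchDarkly’s server-side Java SDK. The flag key, context attributes, and helper methods are illustrative, not Ledgerly’s actual code.

```java
import com.launchdarkly.sdk.LDContext;
import com.launchdarkly.sdk.server.LDClient;

public class PaymentPathRouter {

    private final LDClient ldClient;

    public PaymentPathRouter(LDClient ldClient) {
        this.ldClient = ldClient;
    }

    // The flag key and context attributes are illustrative. The new path ships
    // dark (default false) and is enabled per segment from the flag dashboard,
    // which leaves an audit trail and needs no redeploy.
    public String authorize(String accountId, String region, String requestJson) {
        LDContext context = LDContext.builder(accountId)
                .set("region", region)
                .build();

        boolean useNewFacade = ldClient.boolVariation(
                "payments-new-gateway-facade", context, false);

        return useNewFacade
                ? callNewGatewayFacade(requestJson)
                : callLegacyGateway(requestJson);
    }

    private String callNewGatewayFacade(String requestJson) { return "new-path";    /* new code path */ }

    private String callLegacyGateway(String requestJson)    { return "legacy-path"; /* existing path */ }
}
```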
Result: in 12 weeks, we had shippable, instrumented, and rollback-able deployments. No heroics required.
Carving the monolith without rewriting it
We used a modular monolith strategy and the strangler pattern for the highest-cost domain: payments and reporting. Everything else stayed in the monolith, but with boundaries you could actually see.
- Capability mapping: We mapped modules to business capabilities and tagged them in code. Payments, Reporting, Accounts, and Admin became packages with explicit interfaces. We enforced boundaries with ArchUnit tests (see the sketch after this list).
- Hexagonal ports/adapters: We introduced ports for downstreams (fraud, ledger, notifications). Adapters lived at the edges, making it easier to swap without invasive changes.
- Contracts first: New boundaries were defined with OpenAPI specs and a thin anti-corruption layer inside the monolith.
- Data decoupling: We deployed Kafka (MSK) and Debezium CDC off Oracle redo logs. The reporting service consumed change events and wrote to its own PostgreSQL 14 (RDS). We used the outbox pattern for new events emitted by the monolith (sketched below).
- Selective extraction: Only two modules were pulled out:
  - reporting-svc (read-heavy, zero write contention) in Spring Boot 3 / Java 17.
  - payment-gateway façade (to isolate vendor churn), also Spring Boot 3 with Resilience4j guards.
- Leave the rest: Accounts and Admin stayed inside the monolith, but with module boundaries, tests, and telemetry. No one gets promoted for rewriting Admin screens.
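The boundary tests mentioned above are cheap to write. Below is a minimal ArchUnit sketch of the kind of rule we mean; the `com.ledgerly.*` package names are hypothetical stand-ins for the tagged module roots.

```java
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;
import org.junit.jupiter.api.Test;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

class ModuleBoundaryTest {

    // Import once; the package roots below are illustrative.
    private final JavaClasses classes =
            new ClassFileImporter().importPackages("com.ledgerly");

    @Test
    void paymentsInternalsAreOnlyUsedByPayments() {
        // Everything outside the payments module must go through payments.api,
        // never reach into payments.internal.
        ArchRule rule = noClasses()
                .that().resideOutsideOfPackage("com.ledgerly.payments..")
                .should().dependOnClassesThat()
                .resideInAPackage("com.ledgerly.payments.internal..");
        rule.check(classes);
    }

    @Test
    void reportingDoesNotDependOnAccountsInternals() {
        ArchRule rule = noClasses()
                .that().resideInAPackage("com.ledgerly.reporting..")
                .should().dependOnClassesThat()
                .resideInAPackage("com.ledgerly.accounts.internal..");
        rule.check(classes);
    }
}
```

Once a rule like this runs in CI, a boundary violation fails the build instead of surfacing six months later as a refactoring nightmare.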
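And here is a rough sketch of the outbox write inside the monolith, assuming Spring transactions and plain JDBC; the table and column names are made up for illustration. The point is that the business update and the event commit in one Oracle transaction, and Debezium turns the outbox table into the Kafka stream, so there is no dual write.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.UUID;

@Service
public class PaymentCaptureService {

    private final JdbcTemplate jdbc;

    public PaymentCaptureService(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // One local Oracle transaction: the business update and the outbox row
    // commit together. Debezium tails payment_outbox and publishes to Kafka,
    // so there is no separate (and failure-prone) write to the broker.
    @Transactional
    public void capturePayment(String paymentId, long amountCents, String payloadJson) {
        jdbc.update(
                "UPDATE payments SET status = 'CAPTURED', amount_cents = ? WHERE id = ?",
                amountCents, paymentId);

        jdbc.update(
                "INSERT INTO payment_outbox (id, aggregate_id, event_type, payload, created_at) "
                        + "VALUES (?, ?, ?, ?, SYSTIMESTAMP)",
                UUID.randomUUID().toString(), paymentId, "PaymentCaptured", payloadJson);
    }
}
```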
A few nuts-and-bolts decisions that mattered:
- Kept the monolith on Java 11 (upgraded from 8) to minimize blast radius; new services shipped on Java 17.
- Introduced G1GC and tuned `-XX:MaxRAMPercentage=60` to avoid container OOMs; p99 GC pauses dropped 40%.
- Replaced brittle SOAP calls with internal REST over HTTP/2 and TLS, keeping cert rotation in cert-manager.
- Added Resilience4j timeouts (500ms), retries (2 with jitter), bulkheads, and circuit breakers around the payment gateway boundary (a sketch follows below).
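For the payment gateway boundary, a minimal programmatic Resilience4j sketch looks roughly like this. The thresholds mirror the numbers above but are otherwise illustrative; in the actual Spring Boot 3 façade you would more likely wire this through the resilience4j Spring Boot starter and configuration properties. `PaymentGatewayClient` is a hypothetical vendor client.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public class PaymentGatewayGuards {

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    // 500ms hard timeout per call.
    private final TimeLimiter timeLimiter = TimeLimiter.of(
            TimeLimiterConfig.custom().timeoutDuration(Duration.ofMillis(500)).build());

    // 1 initial call + 2 retries, exponential backoff with jitter.
    private final Retry retry = Retry.of("payment-gateway", RetryConfig.custom()
            .maxAttempts(3)
            .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(Duration.ofMillis(100), 2.0))
            .build());

    // Trip the breaker at 50% failures; stay open for 30s before probing.
    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("payment-gateway",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .waitDurationInOpenState(Duration.ofSeconds(30))
                    .build());

    // Bulkhead: cap concurrent in-flight calls to the vendor.
    private final Bulkhead bulkhead = Bulkhead.of("payment-gateway",
            BulkheadConfig.custom().maxConcurrentCalls(25).build());

    public CompletableFuture<String> guardedAuthorize(PaymentGatewayClient client, String requestId) {
        return Decorators.ofCompletionStage(
                        () -> CompletableFuture.supplyAsync(() -> client.authorize(requestId)))
                .withTimeLimiter(timeLimiter, scheduler)
                .withCircuitBreaker(circuitBreaker)
                .withBulkhead(bulkhead)
                .withRetry(retry, scheduler)
                .get()
                .toCompletableFuture();
    }

    // Hypothetical blocking vendor client.
    public interface PaymentGatewayClient {
        String authorize(String requestId);
    }
}
```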
Shipping safely: GitOps, canaries, and failure budgets
Shipping needed to be boring. We used progressive delivery and let SLOs throttle ambition.
- GitOps all the way: ArgoCD watched environment repos. Changes to Helm values were the only path to prod. Rollbacks were as simple as `argocd app rollback checkout-service@N`.
- Progressive delivery: Argo Rollouts handled canaries for both the monolith container and the new services. Typical policy: 10% → 30% → 60% → 100% with automated analysis.
  - Automated analysis queried Prometheus: error rate (`rate(http_requests_total{status=~"5.."}[5m])`), p95 latency, and pod restarts.
  - If error budget burn clipped 2% in 30 minutes, rollouts paused automatically. Humans decided whether to proceed.
- Shadow traffic: For reporting-svc, we mirrored 5% of GET traffic using NGINX Ingress annotations for a week before flipping reads.
- Feature flags everywhere: Risky code paths were wrapped with LaunchDarkly. Toggling didn’t require redeploys and left an audit trail for PCI.
- Chaos where it counts: Basic chaos-mesh experiments in non-prod: kill pods, inject latency, and verify circuit breakers actually break.
This let us ship daily without turning on-call into a blood sport.
Reliability and observability you can take to the board
We turned “please wait while I grep” into graphs execs could understand, with SRE practices that held up under audit.
- SLOs and error budgets: Checkout SLO at 99.9% success and p95 < 400ms. Reporting SLO at 99.95% availability. Error budgets informed deploy gates.
- Golden signals: `http_request_duration_seconds`, `http_requests_total`, `jvm_memory_used_bytes`, `jvm_threads_states`, and `kafka_consumer_lag` were baseline. We correlated trace IDs through logs with OTel context propagation.
- Runbooks and alerts: Alerts mapped to SLO burn, not noisy host metrics. Paging dropped because we removed the “alert for everything” culture.
- Security and provenance: SBOMs via syft, scans via trivy, image signing with cosign. Satisfies PCI inquiries without three meetings and a PDF.
The point wasn’t Dashboard Theater. It was measurable risk reduction tied to delivery speed.
Results after six months
No fairy dust. Just plumbing and discipline. Here’s what moved, measured from 4-week baselines:
- Deployment frequency: monthly → daily (avg 5 prod deploys/week) for the monolith; on-demand for services.
- Lead time (commit→prod): ~7 days → ~4 hours for low-risk changes.
- MTTR: 140 min → 22 min (thanks to traceability and one-command rollbacks).
- Change failure rate: 27% → 6% (canaries + flags + tests that actually run).
- p95 checkout latency: 1.4s → 320ms after GC tuning, thread pool right-sizing, and payment façade.
- On-call pages: 18/week → 5/week.
- Batch reconciliation: 8h → 2h by moving reads to Postgres and parallelizing jobs.
- Infra cost: compute down ~12% via right-sizing and HPA; overall cost roughly flat after adding Kafka and observability.
- Audit friction: Change approval cycle time down 50% due to GitOps trails and signed artifacts.
Timeline highlights:
- Week 4: containerized monolith in EKS, GitOps live in staging.
- Week 8: OTel in prod, Prometheus/Grafana dashboards live, first SLOs signed.
- Week 12: first canary to prod, Saturday deployments eliminated.
- Week 18: reporting-svc live; shadow traffic complete, reads flipped.
- Week 22: payment-gateway façade live with circuit breakers; incident rate drops noticeably.
- Week 24: quarter-end freeze sailed by without a war room.
“For the first time in two years, we shipped on Friday and went home.” — VP Eng, Ledgerly
What we’d do again (and what we wouldn’t)
What worked:
- Modular monolith first. It’s amazing how far you can get by drawing lines and enforcing them with tests.
- Instrument, then refactor. We caught two nasty hotspots via traces that saved weeks of blind refactoring.
- GitOps + canaries. Boring deploys are a competitive advantage.
- Treat data as a product. CDC and outbox turned the database from a shackle into a pipeline.
What we’d change:
- We waited too long to set SLOs for batch jobs. Do it early — they drive prioritization.
- We underestimated the Ops lift of Kafka in a regulated shop. Budget runbooks and training up front.
- We should have introduced contract tests (Pact) earlier between monolith modules and new services.
Actionable guidance if you’re staring at a similar beast:
- Map capabilities, tag modules, and add ArchUnit tests to enforce boundaries.
- Add OpenTelemetry first. You’ll save months.
- Containerize and adopt ArgoCD. Make rollbacks a button, not a project.
- Pick one hotspot to strangle. Use Debezium and the outbox pattern; avoid dual writes.
- Wrap risky calls with Resilience4j — timeouts, retries with backoff, circuit breakers.
- Set SLOs and let error budgets control rollout cadence.
- Use LaunchDarkly and Argo Rollouts to keep the blast radius small.
- Don’t fetishize microservices. Extract surgically; leave the rest modular and boring.
Key takeaways
- You don’t need a full microservices rewrite to regain velocity — a modular monolith plus selective extractions is often the fastest path.
- Instrument first, then slice. Without telemetry and SLOs, you’re just moving risk around.
- Use feature flags, shadow traffic, and canary rollouts to de-risk every step; avoid “weekend big bangs.”
- Treat data as the real monolith. Use CDC (Debezium) and the outbox pattern to carve safely.
- Adopt GitOps and progressive delivery to make shipping boring and reversible.
Implementation checklist
- Map capabilities and hot spots; pick one or two to strangle, leave the rest in a modular monolith.
- Add OpenTelemetry tracing and Prometheus metrics before major refactors.
- Move to GitHub/GitLab, trunk-based dev, and protected branches; wire CI to run tests and smoke checks.
- Containerize the monolith, run it in Kubernetes with right-sized limits and HPA.
- Introduce ArgoCD and Argo Rollouts for declarative deploys and canaries with automated analysis.
- Implement Resilience4j timeouts, retries, bulkheads, and circuit breakers on risky boundaries.
- Use Debezium CDC and an outbox table to evolve data boundaries without dual writes.
- Define SLOs with error budgets; let them govern rollout speed and risk.
Questions we hear from teams
- Why didn’t you just move everything to microservices?
- Because the business needed results in months, not years. The modular monolith plus strangler pattern delivered measurable improvements with lower risk, especially under PCI constraints and tight change windows.
- How did you avoid dual writes when splitting data?
- We used Debezium CDC for existing tables and the outbox pattern for new events. Services consumed change streams and owned their read models (Postgres), avoiding risky dual writes.
- Is Istio required for this?
- No. We used NGINX Ingress and Argo Rollouts for canaries. A full service mesh adds operational load you might not need. If you do add one later, start with mTLS and traffic policies only.
- What if we can’t touch Oracle?
- Same here. Treat the database like an event source: capture changes with Debezium, move read-heavy use cases first, and isolate write paths behind a façade to reduce blast radius.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.