The Monolith We Didn’t Rewrite: Turning a 12‑Year Java App Into Something You Can Ship

A fintech’s Java/Oracle monolith was burning engineers and budgets. We didn’t rewrite it. We made it maintainable, safe to change, and fast to ship — in six months, under PCI constraints, without adding headcount.


The mess we walked into

A mid-market fintech (call them “Ledgerly”), ~$200M ARR, had a 12-year-old Java monolith: Java 8, Spring MVC 3.2, Hibernate 4, running on WebLogic 12c with an Oracle 11g backend (8 TB). Deploys were monthly, usually on a Saturday night with pizza and a prayer. CI was a chain of Jenkins freestyle jobs and shell scripts. Version control: SVN. Observability: grep and vibes.

  • Incidents were up 40% YoY; pages averaged 18/week.
  • MTTR sat around 140 minutes.
  • Change failure rate hovered at 27%.
  • p95 for the checkout flow was 1.4s on a good day.
  • Compliance (PCI DSS) meant tight change windows and ugly audit trails.

Leadership wanted microservices yesterday. I’ve seen that movie. You spend 18 months and three re-orgs to get a slower, more expensive system. We pitched something different: keep the monolith, make it modular, carve only the hotspots, and make shipping safe and reversible. GitPlumbers got six months, no headcount increase, and a 99.9% SLA to maintain.

Constraints that made it interesting

If you’ve done regulated fintech, you know the constraints write half your playbook.

  • No headcount increase, no freezing feature work for more than two sprints.
  • Oracle 11g could not be upgraded within the window. Data gravity was real.
  • PCI DSS and SOX auditability: change approvals, artifact provenance, least privilege.
  • 99.9% SLO publicly committed, with quarter-end freeze windows we couldn’t touch.
  • Vendor lock-in pockets (WebLogic, HAProxy, a proprietary fraud service) we had to accommodate.
  • Batch jobs: nightly reconciliation took 8 hours and couldn’t easily be broken up.

These constraints killed the “greenfield rewrite” fantasy. We needed incremental wins with measurable risk control.

What we changed first (first 90 days)

You can’t steer what you can’t see. We front-loaded telemetry, safety rails, and version control sanity. No architecture astronautics; just plumbing.

  1. Version control and CI/CD hygiene
    • Migrated SVN to GitHub Enterprise, trunk-based development with CODEOWNERS, protected branches, and required reviews.
    • Built CI in GitHub Actions with mvn -T 1C -DskipTests=false parallelization; average build time dropped from 18 to 7 minutes.
    • Added pre-commit hooks for formatting and basic static checks.
  2. Containerization and repeatable deploys
    • Wrapped the monolith in a container using a distroless base (gcr.io/distroless/java), signed images with cosign.
    • Provisioned EKS with Terraform, deployed via ArgoCD (GitOps). WebLogic went away; we ran the app as a plain Java process behind NGINX Ingress.
  3. Observability before refactoring
    • Dropped in OpenTelemetry Java auto-instrumentation, Prometheus scraping via ServiceMonitor, logs to Loki, traces to Tempo.
    • Created Grafana SLO dashboards for checkout, login, and reconciliation with clear error budgets.
  4. Safety switches
    • Introduced LaunchDarkly for feature flags. Every risky change shipped dark first (a minimal flag check follows this list).
    • Wrote runbooks and “two-button rollback” using ArgoCD app diffs and history.
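
To make “shipped dark first” concrete, here’s a minimal flag check, assuming LaunchDarkly’s server-side Java SDK (v6+). The flag key, context, and class name are placeholders for illustration, not Ledgerly’s real ones:

  import com.launchdarkly.sdk.LDContext;
  import com.launchdarkly.sdk.server.LDClient;

  public class PaymentFacadeToggle {

      private final LDClient ld;

      public PaymentFacadeToggle(LDClient ld) {
          this.ld = ld;
      }

      // Returns false (the old code path) if the flag is off or LaunchDarkly
      // is unreachable, so a dark launch fails closed.
      public boolean useNewFacade(String accountId) {
          LDContext ctx = LDContext.builder(accountId).build();
          return ld.boolVariation("payments-new-facade", ctx, false);
      }
  }

The default of false means the dark path fails closed, and flipping the flag needs no redeploy.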

Result: in 12 weeks, we had shippable, instrumented, and rollback-able deployments. No heroics required.

Carving the monolith without rewriting it

We used a modular monolith strategy and the strangler pattern for the highest-cost domains: payments and reporting. Everything else stayed in the monolith, but with boundaries you could actually see.

  • Capability mapping: We mapped modules to business capabilities and tagged them in code. Payments, Reporting, Accounts, and Admin became packages with explicit interfaces. We enforced boundaries with ArchUnit tests (a sample rule follows this list).
  • Hexagonal ports/adapters: We introduced ports for downstreams (fraud, ledger, notifications). Adapters lived at the edges, making it easier to swap without invasive changes.
  • Contracts first: New boundaries were defined with OpenAPI specs and a thin anti-corruption layer inside the monolith.
  • Data decoupling: We deployed Kafka (MSK) and Debezium CDC off Oracle redo logs. The reporting service consumed change events and wrote to its own PostgreSQL 14 (RDS). We used the outbox pattern for new events emitted by the monolith.
  • Selective extraction: Only two modules were pulled out:
    • reporting-svc (read-heavy, zero write contention) in Spring Boot 3 / Java 17.
    • payment-gateway façade (to isolate vendor churn), also Spring Boot 3 with Resilience4j guards.
  • Leave the rest: Accounts and Admin stayed inside the monolith, but with module boundaries, tests, and telemetry. No one gets promoted for rewriting Admin screens.
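
Here’s the shape of the ArchUnit rules mentioned above. A minimal sketch with illustrative package names; the real rules enumerate each capability’s internals:

  import com.tngtech.archunit.core.domain.JavaClasses;
  import com.tngtech.archunit.core.importer.ClassFileImporter;
  import com.tngtech.archunit.lang.ArchRule;
  import org.junit.jupiter.api.Test;

  import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

  class ModuleBoundaryTest {

      private final JavaClasses classes =
              new ClassFileImporter().importPackages("com.example.ledger");

      @Test
      void onlyPaymentsTouchesPaymentInternals() {
          // Anything outside the payments package may not reach into its internals.
          ArchRule rule = noClasses()
                  .that().resideOutsideOfPackage("..payments..")
                  .should().dependOnClassesThat().resideInAPackage("..payments.internal..");
          rule.check(classes);
      }
  }

Because the rules run as ordinary unit tests, a boundary violation fails the build like any other red test.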

A few nuts-and-bolts decisions that mattered:

  • Kept the monolith on Java 11 (upgrade from 8) to minimize blast radius; new services shipped on Java 17.
  • Introduced G1GC and tuned -XX:MaxRAMPercentage=60 to avoid container OOMs; p99 GC pauses dropped 40%.
  • Replaced brittle SOAP calls with internal REST over HTTP/2 and TLS, keeping cert rotation in cert-manager.
  • Added Resilience4j timeouts (500ms), retries (2 with jitter), bulkheads, and circuit breakers around the payment gateway boundary.
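
Wired up with Resilience4j’s Decorators, those guards look roughly like the sketch below. Names, pool sizes, and the vendor call are illustrative; the 500ms timeout comes from a TimeLimiter around an async call:

  import io.github.resilience4j.bulkhead.Bulkhead;
  import io.github.resilience4j.circuitbreaker.CircuitBreaker;
  import io.github.resilience4j.core.IntervalFunction;
  import io.github.resilience4j.decorators.Decorators;
  import io.github.resilience4j.retry.Retry;
  import io.github.resilience4j.retry.RetryConfig;
  import io.github.resilience4j.timelimiter.TimeLimiter;
  import io.github.resilience4j.timelimiter.TimeLimiterConfig;

  import java.time.Duration;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.CompletionStage;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.function.Supplier;

  class PaymentGatewayGuards {

      private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

      // Hard 500ms timeout on the downstream call.
      private final TimeLimiter timeLimiter = TimeLimiter.of(
              TimeLimiterConfig.custom().timeoutDuration(Duration.ofMillis(500)).build());

      // One initial attempt plus two retries, exponential backoff with jitter.
      private final Retry retry = Retry.of("payment-gateway", RetryConfig.custom()
              .maxAttempts(3)
              .intervalFunction(IntervalFunction.ofExponentialRandomBackoff())
              .build());

      private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("payment-gateway");
      private final Bulkhead bulkhead = Bulkhead.ofDefaults("payment-gateway");

      // Wraps any vendor call, e.g. guard(() -> vendorClient.authorize(request)).
      <T> CompletionStage<T> guard(Supplier<T> vendorCall) {
          Supplier<CompletionStage<T>> async =
                  () -> CompletableFuture.supplyAsync(vendorCall);
          return Decorators.ofCompletionStage(async)
                  .withTimeLimiter(timeLimiter, scheduler)
                  .withCircuitBreaker(breaker)
                  .withRetry(retry, scheduler)
                  .withBulkhead(bulkhead)
                  .get();
      }
  }

Order matters here: with the breaker wrapped inside the retry, every retry attempt is counted against the breaker, which is usually what you want.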

Shipping safely: GitOps, canaries, and failure budgets

Shipping needed to be boring. We used progressive delivery and let SLOs throttle ambition.

  • GitOps all the way: ArgoCD watched environment repos. Changes to Helm values were the only path to prod. Rollbacks were as simple as argocd app rollback checkout-service <history-id>.
  • Progressive delivery: Argo Rollouts handled canaries for both the monolith container and the new services. Typical policy: 10% → 30% → 60% → 100% with automated analysis.
    • Automated analysis queried Prometheus: error rate (rate(http_requests_total{status=~"5.."}[5m])), p95 latency, and pod restarts.
    • If a rollout burned more than 2% of the error budget within 30 minutes, it paused automatically. Humans decided whether to proceed.
  • Shadow traffic: For reporting-svc, we mirrored 5% of GET traffic using NGINX Ingress annotations for a week before flipping reads.
  • Feature flags everywhere: Risky code paths were wrapped with LaunchDarkly. Toggling didn’t require redeploys and left an audit trail for PCI.
  • Chaos where it counts: Basic chaos-mesh experiments in non-prod: kill pods, inject latency, and verify circuit breakers actually break.

This let us ship daily without turning on-call into a blood sport.

Reliability and observability you can take to the board

We turned “please wait while I grep” into graphs execs could understand, with SRE practices that held up under audit.

  • SLOs and error budgets: Checkout SLO at 99.9% success and p95 < 400ms. Reporting SLO at 99.95% availability. Error budgets informed deploy gates.
  • Golden signals: http_request_duration_seconds, http_requests_total, jvm_memory_used_bytes, jvm_threads_states, kafka_consumer_lag were baseline. We correlated trace IDs through logs with OTel context propagation (see the sketch after this list).
  • Runbooks and alerts: Alerts mapped to SLO burn, not noisy host metrics. Paging dropped because we removed the “alert for everything” culture.
  • Security and provenance: SBOMs via syft, scans via trivy, image signing with cosign. That satisfied PCI inquiries without three meetings and a PDF.
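
On trace-to-log correlation: the OTel Java agent can inject trace and span IDs into the logging MDC for Logback/Log4j; for spots the agent doesn’t reach, a hand-rolled helper looks roughly like this (the class and key names are ours, not part of the SDK):

  import io.opentelemetry.api.trace.Span;
  import io.opentelemetry.api.trace.SpanContext;
  import org.slf4j.MDC;

  public final class TraceLogging {

      private TraceLogging() {}

      // Copies the active trace/span IDs into the logging MDC so every log line
      // written on this thread can be joined to its trace in Tempo.
      public static void stampMdcFromCurrentSpan() {
          SpanContext ctx = Span.current().getSpanContext();
          if (ctx.isValid()) {
              MDC.put("trace_id", ctx.getTraceId());
              MDC.put("span_id", ctx.getSpanId());
          }
      }
  }

With trace_id on every log line, Loki queries and Tempo traces line up without guesswork.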

The point wasn’t Dashboard Theater. It was measurable risk reduction tied to delivery speed.

Results after six months

No fairy dust. Just plumbing and discipline. Here’s what moved, measured from 4-week baselines:

  • Deployment frequency: monthly → daily (avg 5 prod deploys/week) for the monolith; on-demand for services.
  • Lead time (commit→prod): ~7 days → ~4 hours for low-risk changes.
  • MTTR: 140 min → 22 min (thanks to traceability and one-command rollbacks).
  • Change failure rate: 27% → 6% (canaries + flags + tests that actually run).
  • p95 checkout latency: 1.4s → 320ms after GC tuning, thread pool right-sizing, and payment façade.
  • On-call pages: 18/week → 5/week.
  • Batch reconciliation: 8h → 2h by moving reads to Postgres and parallelizing jobs.
  • Infra cost: compute down ~12% via right-sizing and HPA; overall cost roughly flat after adding Kafka and observability.
  • Audit friction: Change approval cycle time down 50% due to GitOps trails and signed artifacts.

Timeline highlights:

  • Week 4: containerized monolith in EKS, GitOps live in staging.
  • Week 8: OTel in prod, Prometheus/Grafana dashboards live, first SLOs signed.
  • Week 12: first canary to prod, Saturday deployments eliminated.
  • Week 18: reporting-svc live; shadow traffic complete, reads flipped.
  • Week 22: payment-gateway façade live with circuit breakers; incident rate drops noticeably.
  • Week 24: quarter-end freeze sailed by without a war room.

“For the first time in two years, we shipped on Friday and went home.” — VP Eng, Ledgerly

What we’d do again (and what we wouldn’t)

What worked:

  • Modular monolith first. It’s amazing how far you can get by drawing lines and enforcing them with tests.
  • Instrument, then refactor. We caught two nasty hotspots via traces that saved weeks of blind refactoring.
  • GitOps + canaries. Boring deploys are a competitive advantage.
  • Treat data as a product. CDC and outbox turned the database from a shackle into a pipeline.

What we’d change:

  • We waited too long to set SLOs for batch jobs. Do it early — they drive prioritization.
  • We underestimated the Ops lift of Kafka in a regulated shop. Budget runbooks and training up front.
  • We should have introduced contract tests (Pact) earlier between monolith modules and new services.
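
If you’re adding contract tests now, a consumer-side test with Pact JVM’s JUnit 5 DSL looks roughly like the sketch below. Provider states, paths, and payloads are invented, and depending on your Pact release you may need to pin the pact spec version:

  import au.com.dius.pact.consumer.MockServer;
  import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
  import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
  import au.com.dius.pact.consumer.junit5.PactTestFor;
  import au.com.dius.pact.core.model.RequestResponsePact;
  import au.com.dius.pact.core.model.annotations.Pact;
  import org.junit.jupiter.api.Test;
  import org.junit.jupiter.api.extension.ExtendWith;

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.Map;

  import static org.junit.jupiter.api.Assertions.assertEquals;

  // The monolith is the consumer of reporting-svc's API.
  @ExtendWith(PactConsumerTestExt.class)
  @PactTestFor(providerName = "reporting-svc")
  class ReportingContractTest {

      @Pact(consumer = "ledger-monolith")
      RequestResponsePact reportById(PactDslWithProvider builder) {
          return builder
                  .given("report 42 exists")
                  .uponReceiving("a request for report 42")
                  .path("/reports/42")
                  .method("GET")
                  .willRespondWith()
                  .status(200)
                  .headers(Map.of("Content-Type", "application/json"))
                  .body("{\"id\":42,\"status\":\"READY\"}")
                  .toPact();
      }

      @Test
      void fetchesReport(MockServer mockServer) throws Exception {
          HttpResponse<String> response = HttpClient.newHttpClient().send(
                  HttpRequest.newBuilder(URI.create(mockServer.getUrl() + "/reports/42")).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
          assertEquals(200, response.statusCode());
      }
  }

The generated pact file then gets verified in the provider’s CI, which catches boundary drift before a canary has to.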

Actionable guidance if you’re staring at a similar beast:

  1. Map capabilities, tag modules, and add ArchUnit tests to enforce boundaries.
  2. Add OpenTelemetry first. You’ll save months.
  3. Containerize and adopt ArgoCD. Make rollbacks a button, not a project.
  4. Pick one hotspot to strangle. Use Debezium and the outbox pattern; avoid dual writes (outbox sketch after this list).
  5. Wrap risky calls with Resilience4j — timeouts, retries with backoff, circuit breakers.
  6. Set SLOs and let error budgets control rollout cadence.
  7. Use LaunchDarkly and Argo Rollouts to keep the blast radius small.
  8. Don’t fetishize microservices. Extract surgically; leave the rest modular and boring.
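
On point 4, the mechanics are one transaction: the domain write and the event insert commit together, and Debezium relays the outbox table to Kafka. A minimal Spring/JDBC sketch with invented table, column, and payload names (Debezium’s outbox event router can be configured to match whatever you pick):

  import org.springframework.jdbc.core.JdbcTemplate;
  import org.springframework.stereotype.Service;
  import org.springframework.transaction.annotation.Transactional;

  import java.sql.Timestamp;
  import java.time.Instant;
  import java.util.UUID;

  @Service
  public class PaymentCaptureService {

      private final JdbcTemplate jdbc;

      public PaymentCaptureService(JdbcTemplate jdbc) {
          this.jdbc = jdbc;
      }

      // The state change and the event land in the same database transaction.
      // Debezium tails OUTBOX_EVENTS and publishes to Kafka, so the application
      // never writes to two systems itself.
      @Transactional
      public void capturePayment(String paymentId, long amountCents) {
          jdbc.update("UPDATE PAYMENTS SET STATUS = 'CAPTURED' WHERE ID = ?", paymentId);

          jdbc.update(
              "INSERT INTO OUTBOX_EVENTS (ID, AGGREGATE_TYPE, AGGREGATE_ID, EVENT_TYPE, PAYLOAD, CREATED_AT) "
            + "VALUES (?, ?, ?, ?, ?, ?)",
              UUID.randomUUID().toString(),
              "payment",
              paymentId,
              "PaymentCaptured",
              "{\"paymentId\":\"" + paymentId + "\",\"amountCents\":" + amountCents + "}",
              Timestamp.from(Instant.now()));
      }
  }

Consumers build their own read models off the event stream; nothing downstream needs a connection back into the source database.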

Key takeaways

  • You don’t need a full microservices rewrite to regain velocity — a modular monolith plus selective extractions is often the fastest path.
  • Instrument first, then slice. Without telemetry and SLOs, you’re just moving risk around.
  • Use feature flags, shadow traffic, and canary rollouts to de-risk every step; avoid “weekend big bangs.”
  • Treat data as the real monolith. Use CDC (Debezium) and the outbox pattern to carve safely.
  • Adopt GitOps and progressive delivery to make shipping boring and reversible.

Implementation checklist

  • Map capabilities and hot spots; pick one or two to strangle, leave the rest in a modular monolith.
  • Add OpenTelemetry tracing and Prometheus metrics before major refactors.
  • Move to GitHub/GitLab, trunk-based dev, and protected branches; wire CI to run tests and smoke checks.
  • Containerize the monolith, run it in Kubernetes with right-sized limits and HPA.
  • Introduce ArgoCD and Argo Rollouts for declarative deploys and canaries with automated analysis.
  • Implement Resilience4j timeouts, retries, bulkheads, and circuit breakers on risky boundaries.
  • Use Debezium CDC and an outbox table to evolve data boundaries without dual writes.
  • Define SLOs with error budgets; let them govern rollout speed and risk.

Questions we hear from teams

Why didn’t you just move everything to microservices?
Because the business needed results in months, not years. The modular monolith plus strangler pattern delivered measurable improvements with lower risk, especially under PCI constraints and tight change windows.
How did you avoid dual writes when splitting data?
We used Debezium CDC for existing tables and the outbox pattern for new events. Services consumed change streams and owned their read models (Postgres), avoiding risky dual writes.
Is Istio required for this?
No. We used NGINX Ingress and Argo Rollouts for canaries. A full service mesh adds operational load you might not need. If you do add one later, start with mTLS and traffic policies only.
What if we can’t touch Oracle?
Same here. Treat the database like an event source: capture changes with Debezium, move read-heavy use cases first, and isolate write paths behind a façade to reduce blast radius.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

