The Black Friday Launch That Our Legacy Stack Couldn’t Survive—Until We Modernized Just Enough
A retail logistics platform was staring down a contractual go‑live with a national big-box retailer. The stack was a Java 8 monolith, shared database, and a Jenkins snowflake. We had eight weeks. Here’s what we changed—and what moved the needle.
The launch window we couldn’t miss
They had a signed integration date with a national retailer—think penalties per day if they slipped, plus marketing dollars already committed. Traffic models showed a 5–8x spike the first 72 hours. The stack? A Java 8 monolith on Tomcat 8.5, PostgreSQL 11 with a single writer, and a shared schema that looked like a spider web of implicit contracts. Deploys were weekly, manual, and brittle through a snowflake Jenkins job.
I’ve seen this movie: you don’t rewrite under a fixed date. You modernize just enough to remove the failure modes that will wreck you at the worst possible time.
“We don’t need shiny. We need safe,” the VP of Engineering told me on day one. Our kind of project.
What we walked into (and why it mattered)
Industry context:
- Retail logistics with SLAs tied to scan-to-ship latency and carrier label generation
- Contractual penalties and brand damage if onboarding slipped
- SOC 2 Type II renewal in 90 days—so we needed audit-friendly controls
Constraints:
- Eight weeks, no rewrite, no multi-quarter platform effort
- On-call burnout and change freeze pressure from execs
- EKS cluster on `v1.20` nearing EOL, and cost overruns in off-peak hours
Top risks we saw in week one:
- Single blast radius: the monolith coupled ingestion, pricing, and label generation
- Unbounded retries around third-party rate limits—classic thundering herd
- No canary path, only blue/green by hand, rollbacks took ~45 minutes
- Zero traceability across services; only app logs and a busy `CloudWatch` log group
- Slow queries in `order_events` (40M rows, no partitioning, missing composite index)
If we pushed a high-variance release the week of launch, we were one fat-finger away from a national retailer escalation and a postmortem in the Wall Street Journal. Seen it. Don’t recommend.
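The unbounded retries were the scariest item on that list. The standard fix is capped exponential backoff with full jitter, so that retrying clients spread out instead of stampeding the carrier API the moment it recovers. A minimal sketch, with limits that are illustrative rather than taken from the actual codebase:

```python
import random

def backoff_delay(attempt: int, base: float = 0.2, cap: float = 30.0) -> float:
    """Capped exponential backoff with full jitter.

    attempt is the 0-based retry count. Returns a delay in seconds drawn
    uniformly from [0, min(cap, base * 2**attempt)], which keeps
    synchronized clients from forming a thundering herd.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def retry_budget_exceeded(attempt: int, max_attempts: int = 6) -> bool:
    """Bound retries: past this, fail fast and let a circuit breaker take over."""
    return attempt >= max_attempts
```

Full jitter trades a slightly longer average wait for far better collision behavior than fixed or stepped backoff, which is exactly what you want in front of a rate-limited third party.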
The minimum modernization that mattered
We framed the work as: stabilize, make releases boring, then carve the first seam out of the monolith.
- SLOs and error budgets
  - Defined two SLOs: 99.95% success for `CreateLabel` within 800ms p95, and 99.9% success for `IngestWebhook` within 1.2s p95
  - Wired `Prometheus` + `Alertmanager` with burn-rate alerts; sampled traces via the `OpenTelemetry` SDK for Java
```yaml
# prometheus alert: fast burn (2% in 1h); slow burn (5% in 6h) is analogous
- alert: CreateLabelErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{route="/labels",status!~"2.."}[1h]))
      /
      sum(rate(http_requests_total{route="/labels"}[1h]))
    ) > 0.02
  for: 10m
  labels:
    severity: page
  annotations:
    summary: CreateLabel SLO fast burn
```

- GitOps and deploy safety
  - Moved deploys to `ArgoCD` with app-of-apps; Jenkins still built artifacts, but release rights moved to Git
  - Introduced `Argo Rollouts` for canary with metric checks
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: label-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}
        - analysis:
            templates:
              - templateName: p95-latency
        - setWeight: 50
        - pause: {duration: 120}
        - setWeight: 100
```

- Circuit breaker and backpressure
  - Dropped `Istio` in with mTLS and added a simple circuit breaker around the carrier API
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: carrier-api
spec:
  host: carrier.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

- Feature flags and a strangler façade
  - Added `Unleash` for `RecommendCarrierV2` and `PricingRulesV2`, off by default, to decouple code merge from feature exposure
  - Inserted `Kong` as an API façade to route `/labels/*` to either the monolith or the new `label-service` without changing clients
- Data decoupling via CDC
  - Stood up `Debezium` -> `Kafka` to replicate `order_events` so the new `label-service` could read from a topic, not the shared DB
  - Added the missing composite index and partitioning for the immediate win
```sql
CREATE INDEX CONCURRENTLY idx_order_events_ordertype_created_at
  ON order_events (order_type, created_at DESC);

-- time-based partitioning (declarative partitioning is native in PG11)
CREATE TABLE order_events_2024_11 PARTITION OF order_events
  FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');
```

- Autoscaling that actually followed load
  - Introduced `KEDA` to scale the monolith worker on Kafka lag; lowered off-peak nodes with `cluster-autoscaler`
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: label-worker
spec:
  scaleTargetRef:
    name: label-worker-deployment
  triggers:
    - type: kafka
      metadata:
        topic: order_events
        bootstrapServers: kafka:9092
        lagThreshold: "5000"
```

- Rehearse the bad days
- Two full rollback drills: failed canary, then DB failover; both under 15 minutes by the end
- Chaos-tested the circuit breaker by blackholing the carrier sandbox
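The Istio `DestinationRule` handles ejection at the mesh level; the same guard can live in application code as a small state machine, which is what the chaos test exercised. A minimal closed/open/half-open sketch, with thresholds echoing the mesh policy (class and names are illustrative, not from the actual codebase):

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe request through
        return False  # open: fail fast, shed load

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open
```

The `failure_threshold` and `reset_after` values mirror `consecutive5xxErrors` and `baseEjectionTime` in the Istio policy; having both layers means the app fails fast even when traffic bypasses the mesh.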
Shipping without burning the team
We kept the team out of hero mode by creating boring, repeatable paths:
- Runbooks in the repo; `make` targets for common flows
- Pre-flight checklist for release captains
- Synthetic traffic with `k6` to warm caches and verify SLOs before a cutover window
```shell
k6 run --vus 200 --duration 10m load/labels.js \
  -e BASE_URL=https://staging.api.example.com \
  -e AUTH_TOKEN=$STAGING_TOKEN
```

We cut the label-service canary to 10% three weeks pre-launch during off-hours. We watched p95s and error rates for a full hour, then walked it to 50%. No pages. We parked it there until the next day’s traffic before moving to 100%. The monolith route stayed behind a toggle for a one-click rollback in Kong.
The Friday before launch, we paused changes, ran the drill, and slept. That’s the part most teams skip.
Results that mattered (not vanity metrics)
- Launch hit on time; zero P1s during the first 72 hours
- Peak traffic 7.3x baseline; p95 for `CreateLabel` dropped from 1.8s to 450ms
- MTTR down from ~4h to 22m over the first month post-launch
- Release frequency up from weekly to 15 deploys/day (median) with `Argo Rollouts`
- Error rate during canaries <0.1%; two auto-aborts saved us from pushing a bad build during the surge
- Infra cost/transaction down 22% via autoscaling and idle node cleanup
- Staging environment provisioning down from 5 days to 45 minutes via `Terraform` modules and `ArgoCD`
Here’s a representative Terraform slice we standardized on to kill snowflakes:
```hcl
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_version = "1.27"
  cluster_name    = var.cluster_name
  manage_aws_auth = true

  eks_managed_node_group_defaults = {
    instance_types = ["m6i.large"]
    desired_size   = 3
    min_size       = 2
    max_size       = 12
  }
}
```

What we’d do differently (and what you can copy tomorrow)
Lessons learned:
- Don’t chase a full mesh right away. We used `Istio` only for what we needed: mTLS + circuit breaker + traffic policy. Everything else stayed off.
- Feature flags are only safe when you have telemetry on the code paths they gate. We added span attributes for `flag_key` so tracing showed which path users hit.
- CDC is a footgun if you don’t monitor lag and schema drift. We added alerts for Debezium connector errors and a contract test on the topic schema.
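A contract test on a CDC topic doesn’t need a schema registry to start paying off; even a stdlib-only check that each event carries the expected fields and types catches most drift before consumers break. A sketch with a hypothetical event shape (the real `order_events` columns differ):

```python
# Hypothetical contract for order_events CDC records; adjust to your schema.
ORDER_EVENT_CONTRACT = {
    "order_id": str,
    "order_type": str,
    "created_at": str,  # timestamp serialized as a string by the connector
}

def violates_contract(event: dict) -> list:
    """Return a list of human-readable violations; an empty list means the event conforms."""
    problems = []
    for field, expected_type in ORDER_EVENT_CONTRACT.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return problems
```

Run it in CI against sample records from the topic, and fail the producer’s build when the shape changes without a coordinated migration.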
Copy-paste playbook:
- Define 2–3 SLOs, page on burn rate, and celebrate when you don’t burn budget.
- Move deploy rights to Git with `ArgoCD`. Keep Jenkins as a build farm initially.
- Add `Argo Rollouts` canaries with a single metric check. Start at 10% for 60 minutes.
- Carve one seam out of the monolith behind a façade and flag. Not three. One.
- Add CDC to free the new service from the shared DB and throttle with backpressure.
- Rehearse failure. Twice. Make it boring.
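“Page on burn rate” is just arithmetic over the error budget: a burn rate of N means you are spending budget N times faster than the SLO allows. A quick sketch of the math behind the thresholds above (the SLO value matches the `CreateLabel` target; the rest is illustrative):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    error_ratio: observed fraction of failed requests over the window.
    slo: availability target, e.g. 0.9995 for 99.95%, leaving a 0.05% budget.
    """
    budget = 1.0 - slo
    return error_ratio / budget
```

At a 99.95% SLO, the 2%-errors-over-1h fast-burn condition is roughly a 40x burn rate; at that pace a 30-day error budget is gone in under a day, which is why it pages immediately.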
Where GitPlumbers fit and how to engage
We weren’t the hero coders. We were the force multipliers who installed seatbelts and taught the team how to drive faster without crashing. GitPlumbers ran a tight eight-week engagement:
- Weeks 1–2: SLOs, observability, canary path, deploy GitOps
- Weeks 3–5: Circuit breaker, flags, CDC, index/partition fixes
- Weeks 6–8: Cutovers, drills, performance tuning, hand-off
If you’re staring down a date you can’t miss and a stack you don’t entirely trust, we’ll help you modernize just enough to ship safely—and leave you with systems you can maintain.
- See how we approach modernization: Modernization Services
- Dig into our observability play: Observability & SRE
- More war stories: Case Studies
Related Resources
Key takeaways
- Stabilize before you optimize: set SLOs and kill the top 3 failure modes first.
- Use a strangler façade and feature flags to carve risk out of the monolith incrementally.
- Canary + circuit breaker beats heroics—especially under deadline pressure.
- Data decoupling (CDC) unlocks parallel delivery without a risky big-bang migration.
- GitOps and infra-as-code make reversibility your safety net during go-live.
Implementation checklist
- Define 2–3 critical SLOs with error budgets before touching code.
- Introduce feature flags to control blast radius (`Unleash`, `LaunchDarkly`, or `OpenFeature`).
- Stand up GitOps (ArgoCD/Flux) and move deploy rights out of Jenkins.
- Add canary deploys and a circuit breaker via `Argo Rollouts` + `Istio`.
- Instrument with `OpenTelemetry` and wire up `Prometheus`/`Grafana`/`Alertmanager`.
- Decouple writes with CDC (`Debezium` to Kafka) for the first shared-DB boundary.
- Practice two full rollback/recovery drills before launch.
Questions we hear from teams
- Why not just rewrite the monolith into microservices?
- Because deadlines don’t care about architecture purity. A rewrite would have blown the date and multiplied failure modes. We carved one seam behind a façade and flag, which produced immediate risk reduction without destabilizing the rest.
- Could we have skipped Istio?
- You could use NGINX plus retries/timeouts, but the outlier detection and circuit breaking we needed were easier with Istio in this environment. We kept it minimal: no fancy multi-tenant mesh, just what protected the integration.
- What if we don’t have Kafka or Debezium?
- Start with read replicas and a queue. If CDC is too heavy for your org right now, create a pub/sub contract from the monolith first. The goal is decoupling, not tech for its own sake.
- How do we apply this if we’re on ECS, not EKS?
- Same playbook: GitOps via CodePipeline/Spinnaker or ArgoCD-on-ECS, canaries via ALB weighted target groups, and circuit breaking via service mesh proxies or Envoy sidecars. Tools change, principles don’t.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
