The Black Friday Launch That Our Legacy Stack Couldn’t Survive—Until We Modernized Just Enough
A retail logistics platform was staring down a contractual go‑live with a national big-box retailer. The stack was a Java 8 monolith, shared database, and a Jenkins snowflake. We had eight weeks. Here’s what we changed—and what moved the needle.
The launch window we couldn’t miss
They had a signed integration date with a national retailer—think penalties per day if they slipped, plus marketing dollars already committed. Traffic models showed a 5–8x spike the first 72 hours. The stack? A Java 8 monolith on Tomcat 8.5, PostgreSQL 11 with a single writer, and a shared schema that looked like a spider web of implicit contracts. Deploys were weekly, manual, and brittle through a snowflake Jenkins job.
I’ve seen this movie: you don’t rewrite under a fixed date. You modernize just enough to remove the failure modes that will wreck you at the worst possible time.
“We don’t need shiny. We need safe,” the VP of Engineering told me on day one. Our kind of project.
What we walked into (and why it mattered)
Industry context:
- Retail logistics with SLAs tied to scan-to-ship latency and carrier label generation
- Contractual penalties and brand damage if onboarding slipped
- SOC 2 Type II renewal in 90 days—so we needed audit-friendly controls
Constraints:
- Eight weeks, no rewrite, no multi-quarter platform effort
- On-call burnout and change freeze pressure from execs
- EKS cluster on `v1.20` nearing EOL, and cost overruns in off-peak hours
Top risks we saw in week one:
- Single blast radius: the monolith coupled ingestion, pricing, and label generation
- Unbounded retries around third-party rate limits—classic thundering herd
- No canary path, only blue/green by hand, rollbacks took ~45 minutes
- Zero traceability across services; only app logs and a busy `CloudWatch` log group
- Slow queries in `order_events` (40M rows, no partitioning, missing composite index)
If we pushed a high-variance release the week of launch, we were one fat-finger away from a national retailer escalation and a postmortem in the Wall Street Journal. Seen it. Don’t recommend.
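The unbounded retries were the scariest item on that list. The standard fix is capped exponential backoff with full jitter, so that retrying clients spread out instead of stampeding the carrier API the moment it recovers. A minimal sketch, with limits that are illustrative rather than taken from the actual codebase:

```python
import random

def backoff_delay(attempt: int, base: float = 0.2, cap: float = 30.0) -> float:
    """Capped exponential backoff with full jitter.

    attempt is the 0-based retry count. Returns a delay in seconds drawn
    uniformly from [0, min(cap, base * 2**attempt)], which keeps
    synchronized clients from forming a thundering herd.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def retry_budget_exceeded(attempt: int, max_attempts: int = 6) -> bool:
    """Bound retries: past this, fail fast and let a circuit breaker take over."""
    return attempt >= max_attempts
```

Full jitter trades a slightly longer average wait for far better collision behavior than fixed or stepped backoff, which is exactly what you want in front of a rate-limited third party.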
The minimum modernization that mattered
We framed the work as: stabilize, make releases boring, then carve the first seam out of the monolith.
- SLOs and error budgets
  - Defined two SLOs: 99.95% success for `CreateLabel` within 800ms p95, and 99.9% success for `IngestWebhook` within 1.2s p95
  - Wired `Prometheus` + `Alertmanager` with burn-rate alerts; sampled traces via the `OpenTelemetry` SDK for Java
```yaml
# prometheus alert: fast burn (2% in 1h); slow burn (5% in 6h) is analogous
- alert: CreateLabelErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{route="/labels",status!~"2.."}[1h]))
      /
      sum(rate(http_requests_total{route="/labels"}[1h]))
    ) > 0.02
  for: 10m
  labels:
    severity: page
  annotations:
    summary: CreateLabel SLO fast burn
```

- GitOps and deploy safety
  - Moved deploys to `ArgoCD` with app-of-apps; Jenkins still built artifacts, but release rights moved to Git
  - Introduced `Argo Rollouts` for canary with metric checks
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: label-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}
        - analysis:
            templates:
              - templateName: p95-latency
        - setWeight: 50
        - pause: {duration: 120}
        - setWeight: 100
```

- Circuit breaker and backpressure
  - Dropped `Istio` in with mTLS and added a simple circuit breaker around the carrier API
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: carrier-api
spec:
  host: carrier.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```

- Feature flags and a strangler façade
  - Added `Unleash` for `RecommendCarrierV2` and `PricingRulesV2`, off by default, to decouple code merge from feature exposure
  - Inserted `Kong` as an API façade to route `/labels/*` to either the monolith or the new `label-service` without changing clients
- Data decoupling via CDC
  - Stood up `Debezium` -> `Kafka` to replicate `order_events` so the new `label-service` could read from a topic, not the shared DB
  - Added the missing composite index and partitioning for the immediate win
```sql
CREATE INDEX CONCURRENTLY idx_order_events_ordertype_created_at
  ON order_events (order_type, created_at DESC);

-- time-based partitioning (declarative partitioning is native in PG11)
CREATE TABLE order_events_2024_11 PARTITION OF order_events
  FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');
```

- Autoscaling that actually followed load
  - Introduced `KEDA` to scale the monolith worker on Kafka lag; lowered off-peak nodes with `cluster-autoscaler`
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: label-worker
spec:
  scaleTargetRef:
    name: label-worker-deployment
  triggers:
    - type: kafka
      metadata:
        topic: order_events
        bootstrapServers: kafka:9092
        lagThreshold: "5000"
```

- Rehearse the bad days
- Two full rollback drills: failed canary, then DB failover; both under 15 minutes by the end
- Chaos-tested the circuit breaker by blackholing the carrier sandbox
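The Istio `DestinationRule` handles ejection at the mesh level; the same guard can live in application code as a small state machine, which is what the chaos test exercised. A minimal closed/open/half-open sketch, with thresholds echoing the mesh policy (class and names are illustrative, not from the actual codebase):

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let one probe request through
        return False  # open: fail fast, shed load

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open
```

The `failure_threshold` and `reset_after` values mirror `consecutive5xxErrors` and `baseEjectionTime` in the Istio policy; having both layers means the app fails fast even when traffic bypasses the mesh.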
Shipping without burning the team
We kept the team out of hero mode by creating boring, repeatable paths:
- Runbooks in the repo; `make` targets for common flows
- Pre-flight checklist for release captains
- Synthetic traffic with `k6` to warm caches and verify SLOs before a cutover window
```shell
k6 run --vus 200 --duration 10m load/labels.js \
  -e BASE_URL=https://staging.api.example.com \
  -e AUTH_TOKEN=$STAGING_TOKEN
```

We cut the label-service canary to 10% three weeks pre-launch during off-hours. We watched p95s and error rates for a full hour, then walked it to 50%. No pages. We parked it there until the next day’s traffic before moving to 100%. The monolith route stayed behind a toggle for a one-click rollback in Kong.
The Friday before launch, we paused changes, ran the drill, and slept. That’s the part most teams skip.
Results that mattered (not vanity metrics)
- Launch hit on time; zero P1s during the first 72 hours
- Peak traffic 7.3x baseline; p95 for `CreateLabel` dropped from 1.8s to 450ms
- MTTR down from ~4h to 22m over the first month post-launch
- Release frequency up from weekly to 15 deploys/day (median) with `Argo Rollouts`
- Error rate during canaries <0.1%; two auto-aborts saved us from pushing a bad build during the surge
- Infra cost/transaction down 22% via autoscaling and idle node cleanup
- Staging environment provisioning down from 5 days to 45 minutes via `Terraform` modules and `ArgoCD`
Here’s a representative Terraform slice we standardized on to kill snowflakes:
```hcl
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_version = "1.27"
  cluster_name    = var.cluster_name
  manage_aws_auth = true

  eks_managed_node_group_defaults = {
    instance_types = ["m6i.large"]
    desired_size   = 3
    min_size       = 2
    max_size       = 12
  }
}
```

What we’d do differently (and what you can copy tomorrow)
Lessons learned:
- Don’t chase a full mesh right away. We used `Istio` only for what we needed: mTLS + circuit breaker + traffic policy. Everything else stayed off.
- Feature flags are only safe when you have telemetry on the code paths they gate. We added span attributes for `flag_key` so tracing showed which path users hit.
- CDC is a footgun if you don’t monitor lag and schema drift. We added alerts for Debezium connector errors and a contract test on the topic schema.
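A contract test on a CDC topic doesn’t need a schema registry to start paying off; even a stdlib-only check that each event carries the expected fields and types catches most drift before consumers break. A sketch with a hypothetical event shape (the real `order_events` columns differ):

```python
# Hypothetical contract for order_events CDC records; adjust to your schema.
ORDER_EVENT_CONTRACT = {
    "order_id": str,
    "order_type": str,
    "created_at": str,  # timestamp serialized as a string by the connector
}

def violates_contract(event: dict) -> list:
    """Return a list of human-readable violations; an empty list means the event conforms."""
    problems = []
    for field, expected_type in ORDER_EVENT_CONTRACT.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}: {type(event[field]).__name__}")
    return problems
```

Run it in CI against sample records from the topic, and fail the producer’s build when the shape changes without a coordinated migration.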
Copy-paste playbook:
- Define 2–3 SLOs, page on burn rate, and celebrate when you don’t burn budget.
- Move deploy rights to Git with `ArgoCD`. Keep Jenkins as a build farm initially.
- Add `Argo Rollouts` canaries with a single metric check. Start at 10% for 60 minutes.
- Carve one seam out of the monolith behind a façade and flag. Not three. One.
- Add CDC to free the new service from the shared DB and throttle with backpressure.
- Rehearse failure. Twice. Make it boring.
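“Page on burn rate” is just arithmetic over the error budget: a burn rate of N means you are spending budget N times faster than the SLO allows. A quick sketch of the math behind the thresholds above (the SLO value matches the `CreateLabel` target; the rest is illustrative):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent.

    error_ratio: observed fraction of failed requests over the window.
    slo: availability target, e.g. 0.9995 for 99.95%, leaving a 0.05% budget.
    """
    budget = 1.0 - slo
    return error_ratio / budget
```

At a 99.95% SLO, the 2%-errors-over-1h fast-burn condition is roughly a 40x burn rate; at that pace a 30-day error budget is gone in under a day, which is why it pages immediately.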
Where GitPlumbers fit and how to engage
We weren’t the hero coders. We were the force multipliers who installed seatbelts and taught the team how to drive faster without crashing. GitPlumbers ran a tight eight-week engagement:
- Weeks 1–2: SLOs, observability, canary path, deploy GitOps
- Weeks 3–5: Circuit breaker, flags, CDC, index/partition fixes
- Weeks 6–8: Cutovers, drills, performance tuning, hand-off
If you’re staring down a date you can’t miss and a stack you don’t entirely trust, we’ll help you modernize just enough to ship safely—and leave you with systems you can maintain.
- See how we approach modernization: Modernization Services
- Dig into our observability play: Observability & SRE
- More war stories: Case Studies
Related Resources
Key takeaways
- Stabilize before you optimize: set SLOs and kill the top 3 failure modes first.
- Use a strangler façade and feature flags to carve risk out of the monolith incrementally.
- Canary + circuit breaker beats heroics—especially under deadline pressure.
- Data decoupling (CDC) unlocks parallel delivery without a risky big-bang migration.
- GitOps and infra-as-code make reversibility your safety net during go-live.
Implementation checklist
- Define 2–3 critical SLOs with error budgets before touching code.
- Introduce feature flags to control blast radius (`Unleash`, `LaunchDarkly`, or `OpenFeature`).
- Stand up GitOps (ArgoCD/Flux) and move deploy rights out of Jenkins.
- Add canary deploys and a circuit breaker via `Argo Rollouts` + `Istio`.
- Instrument with `OpenTelemetry` and wire up `Prometheus`/`Grafana`/`Alertmanager`.
- Decouple writes with CDC (`Debezium` to Kafka) for the first shared-DB boundary.
- Practice two full rollback/recovery drills before launch.
Questions we hear from teams
- Why not just rewrite the monolith into microservices?
- Because deadlines don’t care about architecture purity. A rewrite would have blown the date and multiplied failure modes. We carved one seam behind a façade and flag, which produced immediate risk reduction without destabilizing the rest.
- Could we have skipped Istio?
- You could use NGINX plus retries/timeouts, but the outlier detection and circuit breaking we needed were easier with Istio in this environment. We kept it minimal: no fancy multi-tenant mesh, just what protected the integration.
- What if we don’t have Kafka or Debezium?
- Start with read replicas and a queue. If CDC is too heavy for your org right now, create a pub/sub contract from the monolith first. The goal is decoupling, not tech for its own sake.
- How do we apply this if we’re on ECS, not EKS?
- Same playbook: GitOps via CodePipeline/Spinnaker or ArgoCD-on-ECS, canaries via ALB weighted target groups, and circuit breaking via service mesh proxies or Envoy sidecars. Tools change, principles don’t.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
