The “Real-Time” Pipeline That Stalled at Lunch — And How We Stopped Losing Money by 12:05

Streaming data isn’t about speed; it’s about reliability under pressure. Here’s how we design Kafka/Flink/Iceberg stacks that absorb spikes, preserve quality, and ship decisions that move revenue.

The day your “real-time” pipeline died at noon

I’ve watched a retail client’s “real-time” Kafka pipeline faceplant exactly at lunch — when mobile orders spiked 10x. Fraud scores lagged, orders sat, and we lost authorization windows. Everyone blamed Kafka. The real issue: no contracts on the wire, a fan-out of Python consumers with no backpressure, and a “we’ll replay if needed” fantasy with seven-day retention on a topic doing 80 MB/s.

I’ve seen this fail. Here’s what actually works when the velocity is real, the money’s on the line, and you can’t page 30 people to babysit offsets.

Start with SLOs you can bet revenue on

Skip the “tool tour.” Begin with budgets the business cares about:

  • Decision latency SLO: e.g., fraud scoring P99 < 700 ms end-to-end.
  • Freshness SLO: inventory staleness < 2 seconds for 95% of updates.
  • Loss/error budget: ≤ 0.01% dropped/poison messages per day.
  • Replay RTO: reprocess last 24 hours within 3 hours at 3x normal throughput.

Make these explicit and put them in dashboards. Everything else (Kafka partitions, Flink checkpoints, Iceberg compaction, tiered storage) becomes a tactical choice to meet these budgets. If you can’t measure P99 end-to-end latency and consumer lag per decision path, you’re building on vibes.

If your SLOs aren’t in Grafana by week two, you’re not building a streaming system — you’re building a demo.
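
Measuring the end-to-end number does not require a platform project. A minimal sketch of a latency probe, assuming CreateTime timestamps and reusing a topic name from later examples (broker address and group id are illustrative); in production you would export a histogram to Prometheus instead of printing:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object LatencyProbe extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "b-1.msk:9092")   // illustrative broker address
  props.put("group.id", "fraud-scorer-latency-probe")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("payments.valid.v1"))

  val samples = scala.collection.mutable.ArrayBuffer.empty[Long]
  while (true) {
    for (r <- consumer.poll(Duration.ofMillis(500)).asScala) {
      // with message.timestamp.type=CreateTime this is produce-to-consume latency
      samples += System.currentTimeMillis() - r.timestamp()
    }
    if (samples.size >= 10000) {
      val sorted = samples.sorted
      val p99 = sorted(((sorted.size - 1) * 0.99).toInt)
      println(s"e2e p99 over ${samples.size} records: ${p99} ms") // push to a gauge/histogram in practice
      samples.clear()
    }
  }
}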

The boring, proven core: Kafka + Flink + Iceberg (and CDC that doesn’t lie)

A stack we’ve shipped repeatedly without heroics:

  • Ingress: Debezium for CDC off OLTP, Kafka Connect for managed connectors. Avoid cron-polling tables — it lies about change timing and shreds your DB.
  • Broker: Kafka (Confluent Cloud or AWS MSK). Use Schema Registry with Avro/Protobuf. Turn on idempotent producers and transactions.
  • Compute: Apache Flink for stateful, event-time processing and exactly-once. If your team’s deep on Spark, Structured Streaming can work, but watch state store blowups.
  • Storage: Iceberg on S3/GCS/ADLS for replayable, queryable history and batch/BI interoperability. Compaction + clustering keeps reads sane.
  • Serving: Kafka for low-latency consumers; Iceberg for analytics; optional ksqlDB for simple joins/filters when you don’t want to own a custom Flink job.

Topic strategy matters more than logo choices:

  • Naming: domain.entity.verb.vN (e.g., orders.payment.authorized.v2). Version on breaking changes.
  • Partitioning: pick keys that match reads/state (e.g., order_id for per-order joins; store_id for inventory aggregates).
  • Compaction: separate compacted topics for latest-state, retained topics for event history.
  • Retention: enough to meet replay RTO, with tiered storage for cost.
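
That last retention bullet deserves arithmetic, not vibes. A back-of-envelope sketch reusing the ~80 MB/s topic from the intro and the 24-hour replay RTO (all numbers illustrative):

// Hot-tier sizing sketch: keep 2x the replay window locally, tier the rest.
val bytesPerSec = 80L * 1024 * 1024      // ~80 MB/s ingest on the hot topic
val hotHours    = 48                     // 2x the 24h replay window for headroom
val replication = 3                      // broker-side copies
val hotBytes    = bytesPerSec * 3600L * hotHours * replication
println(f"hot tier needs ~${hotBytes / math.pow(1024, 4)}%.1f TiB")  // ≈ 39.6 TiB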

Example: Kafka producer properties that don’t lose money under load:

# java/kafka-clients >= 3.5
acks=all
enable.idempotence=true
retries=2147483647
max.in.flight.requests.per.connection=5
compression.type=snappy
linger.ms=10
batch.size=131072
request.timeout.ms=30000
delivery.timeout.ms=120000
transactional.id=payments-prod-01
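
Those settings only pay off if the client actually uses the transactional API. A minimal sketch of the produce path, assuming the properties file above is loaded as-is (the file path and the second topic are illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.KafkaException

// load the producer config shown above (acks=all, idempotence, transactional.id, ...)
val props = new Properties()
props.load(new java.io.FileInputStream("producer.properties"))  // hypothetical path
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
try {
  producer.beginTransaction()
  producer.send(new ProducerRecord("payments.valid.v1", "order-123", """{"amount": 42.00}"""))
  producer.send(new ProducerRecord("payments.audit.v1", "order-123", """{"status": "sent"}"""))
  producer.commitTransaction()   // both records become visible atomically to read_committed consumers
} catch {
  case e: KafkaException =>
    producer.abortTransaction()  // simplified: fenced/fatal errors should close() instead
    throw e
}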

Make data contracts real, not slideware

You can’t have quality without enforcing shape and semantics at the edge.

  1. Schema Registry mandatory. Producers must register Avro/Protobuf schemas; consumers fail fast on incompatible changes.
  2. Data contracts extend schema with semantics: required field sets, allowed ranges, PII classification, and dedupe keys.
  3. Quality gates at ingestion with DLQ and quarantine. Don’t poison downstream state.

Debezium connector for MySQL CDC with schema enforcement:

{
  "name": "mysql-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "*****",
    "database.server.id": "184054",
    "database.server.name": "commerce",
    "table.include.list": "orders.order,orders.payment",
    "tombstones.on.delete": "true",
    "transforms": "unwrap,route",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.fields": "op,ts_ms",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "(.*)\.([^.]+)\.([^.]+)",
    "transforms.route.replacement": "$2.$3.cdc.v1",
    "heartbeat.interval.ms": "5000",
    "snapshot.mode": "initial"
  }
}

Stream-time validation and DLQ in Flink (Scala) with an obvious dedupe key:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.flink.util.Collector

case class Payment(orderId: String, amount: BigDecimal, currency: String, ts: Long, eventId: String)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(10000) // 10s
env.getCheckpointConfig.setCheckpointTimeout(60000)

val payments: DataStream[Payment] = /* Kafka source with Avro deserialization */ ???

// validate: Payment => Either[ValidationError, Payment], defined elsewhere;
// invalid records go to a side output (the old split/select API is gone)
val dlqTag = OutputTag[Payment]("payments-dlq")

val good: DataStream[Payment] = payments.process(new ProcessFunction[Payment, Payment] {
  override def processElement(p: Payment,
                              ctx: ProcessFunction[Payment, Payment]#Context,
                              out: Collector[Payment]): Unit =
    validate(p) match {
      case Right(_) => out.collect(p)
      case Left(_)  => ctx.output(dlqTag, p)
    }
})
val bad: DataStream[Payment] = good.getSideOutput(dlqTag)

// route bad records to DLQ topic for quarantine
bad.addSink(kafkaSink("payments.dlq.v1", semantic = Semantic.EXACTLY_ONCE))

// dedupe by eventId with 15-min TTL state (ValueState + StateTtlConfig inside DeduplicateWithTTL)
val deduped = good
  .keyBy(_.eventId)
  .process(new DeduplicateWithTTL[Payment](ttlMinutes = 15))

deduped.addSink(kafkaSink("payments.valid.v1", semantic = Semantic.EXACTLY_ONCE))

Contract tests in CI for producer payloads (Great Expectations-esque checks) beat post-hoc data dogpiles. If AI-generated code is spitting JSON without schema evolution rules, quarantine it until it stops vibing.
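
A contract test can be as small as an Avro compatibility assertion plus a few semantic checks run on every producer PR. A sketch, assuming the previously registered schema is pinned in the repo (paths and field names are illustrative):

import java.io.File
import org.apache.avro.{Schema, SchemaCompatibility}
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType
import scala.jdk.CollectionConverters._

// "previous" is the schema currently in production; "proposed" is the PR's schema.
val previous = new Schema.Parser().parse(new File("contracts/payments.valid.v1.avsc"))
val proposed = new Schema.Parser().parse(new File("src/main/avro/payment.avsc"))

// Backward compatibility: existing consumers (readers on `previous`) can read data
// written with `proposed`. Fail the build on any breaking change.
val compat = SchemaCompatibility.checkReaderWriterCompatibility(previous, proposed)
assert(compat.getType == SchemaCompatibilityType.COMPATIBLE,
  s"breaking schema change: ${compat.getDescription}")

// Semantic contract beyond shape: required fields and the dedupe key must stay present.
val fields   = proposed.getFields.asScala.map(_.name()).toSet
val required = Set("orderId", "amount", "currency", "ts", "eventId")
assert(required.subsetOf(fields), s"missing contract fields: ${required.diff(fields)}")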

Exactly-once, replay, and cost: choose all three

You can have reliability without setting money on fire.

  • Exactly-once: Kafka idempotent producers + transactions, Flink checkpointing with transactional sinks (Kafka/Iceberg); see the sink sketch after this list. No “at-least-once + dedupe later” wishcasting on hot paths.
  • Replayable history: design for reprocessing. Retain events long enough and keep state externalized (Iceberg) so you can rebuild joins.
  • Cost controls: tier cold segments and compact aggressively.
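
On the Flink side, exactly-once to Kafka is mostly configuration, but two details bite people: the transactional id prefix and a transaction timeout longer than your worst-case checkpoint. A sketch with the KafkaSink builder (broker and topic reuse earlier examples; the string serializer stands in for your Avro setup):

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.base.DeliveryGuarantee
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}

// Exactly-once requires checkpointing on the job; the sink's two-phase commit
// is driven by checkpoints, so the Kafka transaction timeout must outlive them.
val paymentsSink: KafkaSink[String] = KafkaSink.builder[String]()
  .setBootstrapServers("b-1.msk:9092")
  .setRecordSerializer(
    KafkaRecordSerializationSchema.builder[String]()
      .setTopic("payments.valid.v1")
      .setValueSerializationSchema(new SimpleStringSchema())
      .build())
  .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  .setTransactionalIdPrefix("payments-flink")        // unique per job
  .setProperty("transaction.timeout.ms", "900000")   // > max checkpoint duration
  .build()

// stream.sinkTo(paymentsSink) on a DataStream[String]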

Iceberg for replay-friendly storage and batch/BI:

-- Iceberg SQL (Spark or Flink SQL)
CREATE TABLE lake.orders_events (
  order_id STRING,
  event_type STRING,
  amount DECIMAL(18,2),
  ts TIMESTAMP,
  event_id STRING
)
PARTITIONED BY (days(ts))
TBLPROPERTIES (
  'write.format.default'='parquet',
  'write.distribution-mode'='hash',
  'format-version'='2'
);

-- time travel for point-in-time replay (Spark SQL; VERSION AS OF takes a snapshot id instead)
SELECT * FROM lake.orders_events TIMESTAMP AS OF '2025-12-01 12:00:00';

Kafka with tiered storage (Confluent or OSS 3.6+) lets you keep weeks of history without a CFO incident. For compacted topics, use tombstones for deletes and measure compaction lag.
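
Tombstones are easy to get wrong in application code: a delete is the same key with a null value, not a sentinel payload. A minimal sketch (topic and key are illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "b-1.msk:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// null value => broker drops the key once compaction runs and delete.retention.ms expires
producer.send(new ProducerRecord[String, String]("inventory.item.latest.v1", "sku-123", null))
producer.flush()
producer.close()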

Terraforming topics so ops is a pull request, not tribal knowledge:

# Confluent Cloud example
resource "confluent_kafka_topic" "orders_events" {
  topic_name       = "orders.events.v2"
  partitions_count = 48
  config = {
    "cleanup.policy"          = "delete"
    "retention.ms"            = "604800000"   # 7 days on hot storage
    "segment.bytes"           = "1073741824"
    "min.insync.replicas"     = "2"
    "message.timestamp.type"  = "CreateTime"
    "compression.type"        = "snappy"
  }
}

If you can’t recreate downstream state from upstream events in a day, you don’t own your data — your outages do.

Observability you can page on at 3 a.m.

Watch the things that correlate with lost revenue, not just broker CPU.

  • Consumer lag per critical group with budgets. Burrow or native exporter.
  • End-to-end latency: Kafka timestamp → sink write time; watermarks vs processing time.
  • Contract violations: DLQ rate; sudden schema evolution failures.
  • Throughput vs checkpoint time: early warning for backpressure.
  • Drift: distribution changes on key fields (amount, currency, country) that predict fraud/ETL weirdness.
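
The drift item is the one teams skip. A lightweight score like PSI on a categorical field is enough to page on before the fraud model quietly degrades; a minimal sketch (sample counts are made up, and 0.1/0.25 are the usual warning/alert rules of thumb):

// Population Stability Index over a categorical field (e.g., currency):
// PSI = sum over buckets of (actual% - expected%) * ln(actual% / expected%).
def psi(expected: Map[String, Long], actual: Map[String, Long], floor: Double = 1e-6): Double = {
  val keys = expected.keySet ++ actual.keySet
  val eTot = expected.values.sum.toDouble.max(1.0)
  val aTot = actual.values.sum.toDouble.max(1.0)
  keys.iterator.map { k =>
    val e = (expected.getOrElse(k, 0L) / eTot).max(floor)   // floor avoids log(0)
    val a = (actual.getOrElse(k, 0L) / aTot).max(floor)
    (a - e) * math.log(a / e)
  }.sum
}

// yesterday's baseline vs the last hour
val baseline = Map("USD" -> 920000L, "EUR" -> 60000L, "GBP" -> 20000L)
val lastHour = Map("USD" -> 70000L, "EUR" -> 4000L, "BRL" -> 9000L)
println(f"currency PSI = ${psi(baseline, lastHour)}%.3f") // alert above ~0.25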

Prometheus alert that actually saves money:

# alert when fraud decisions risk breaching SLO in 10m
- alert: ConsumerLagBudgetBreached
  expr: (kafka_consumer_group_lag{group="fraud-scorer"} > 50000)
        or (histogram_quantile(0.99, sum(rate(stream_end_to_end_latency_seconds_bucket{pipeline="fraud"}[5m])) by (le)) > 0.7)
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Fraud pipeline nearing P99 latency SLO breach"
    description: "Lag {{ $value }} or e2e p99 > 700ms for 10m. Auto-scale consumers or shed non-critical enrichers."

If you’re on Kubernetes, scrape Flink and Kafka exporters, trace across producers/consumers with OpenTelemetry, and build runbooks: how to drain partitions, restart with savepoints, and replay safely.
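
Tracing across producer and consumer hops works by propagating context in record headers; the OpenTelemetry Kafka instrumentation injects and extracts it for you, but it helps to know what is on the wire. A hand-rolled sketch using the W3C traceparent header (wiring is illustrative):

import java.nio.charset.StandardCharsets
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.ProducerRecord

// Producer side: attach the current trace context before sending.
def inject[K, V](record: ProducerRecord[K, V], traceparent: String): ProducerRecord[K, V] = {
  record.headers().add("traceparent", traceparent.getBytes(StandardCharsets.UTF_8))
  record
}

// Consumer side: read it back and continue the same trace in the processing span.
def extract[K, V](record: ConsumerRecord[K, V]): Option[String] =
  Option(record.headers().lastHeader("traceparent"))
    .map(h => new String(h.value(), StandardCharsets.UTF_8))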

Reference implementation: AWS MSK + Flink + Iceberg (what we actually deploy)

When clients ask for something they can run without a PhD, we ship this pattern:

  • Kafka: AWS MSK with 3 AZs, 48 partitions for hot topics, 3x replication, tiered storage on.
  • Schema Registry: Confluent or AWS Glue Schema Registry; we prefer Confluent for evolution policies.
  • Flink: Native on K8s with flink-kubernetes-operator or managed (Ververica). Checkpoint to S3, HA via K8s.
  • Iceberg: On S3 with Glue/REST catalog. Compaction jobs nightly.
  • GitOps: ArgoCD deploys connectors/jobs; Terraform for MSK/Topics/ACLs; secrets in AWS Secrets Manager.

Flink SQL job for a practical join with event-time and exactly-once to Iceberg:

-- sources
CREATE TABLE payments (
  order_id STRING,
  amount DECIMAL(18,2),
  currency STRING,
  ts TIMESTAMP(3),
  event_id STRING,
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'payments.valid.v1',
  'properties.bootstrap.servers' = 'b-1.msk:9092',
  'format' = 'avro-confluent',
  'scan.startup.mode' = 'latest-offset'
);

CREATE TABLE auths (
  order_id STRING,
  auth_code STRING,
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'payments.authorized.v1',
  'properties.bootstrap.servers' = 'b-1.msk:9092',
  'format' = 'avro-confluent',
  'scan.startup.mode' = 'latest-offset'
);

CREATE TABLE fact_payments (
  order_id STRING,
  amount DECIMAL(18,2),
  auth_code STRING,
  event_ts TIMESTAMP(3)
) WITH (
  'connector'='iceberg',
  'catalog-name'='lake',
  'catalog-type'='hadoop',
  'warehouse'='s3://lakehouse/warehouse'
);

INSERT INTO fact_payments
SELECT p.order_id, p.amount, a.auth_code, p.ts
FROM payments p
LEFT JOIN auths a
ON p.order_id = a.order_id
AND a.ts BETWEEN p.ts - INTERVAL '2' MINUTE AND p.ts + INTERVAL '2' MINUTE;

Rollout plan that doesn’t bite:

  1. Canary consumers on 1 partition; compare outputs to batch truth.
  2. Shadow mode for new schemas; DLQ traffic under 0.01% for a week before cutover.
  3. Blue/green Flink jobs with savepoints, --fromSavepoint to rollback.
  4. Circuit breakers on enrichers (Resilience4j) so a slow third-party doesn’t drown your pipeline.
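
That last item is where most incidents start: wrap the enricher call, not the whole job. A minimal sketch with Resilience4j (thresholds and the enrich call are illustrative); when the breaker opens, score with base features instead of blocking the stream:

import java.time.Duration
import io.github.resilience4j.circuitbreaker.{CircuitBreaker, CircuitBreakerConfig}
import scala.util.Try

val config = CircuitBreakerConfig.custom()
  .failureRateThreshold(50f)                       // open after 50% failures...
  .slidingWindowSize(20)                           // ...over the last 20 calls
  .waitDurationInOpenState(Duration.ofSeconds(30)) // probe again after 30s
  .build()

val breaker = CircuitBreaker.of("third-party-enricher", config)

def enrich(orderId: String): String = ???          // slow third-party call (illustrative)

def enrichOrFallback(orderId: String): Option[String] =
  Try(breaker.executeSupplier(() => enrich(orderId))).toOption
  // None => score with base features only; don't let a slow vendor drown the hot path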

What good looks like in 90 days

We’ve done this at a fintech, a food delivery marketplace, and a logistics network. The numbers rhyme:

  • Cut P99 decision latency from 5–7 minutes to <800 ms.
  • Reduced consumer lag incidents by 93% with autoscaling and checkpoint tuning.
  • Brought schema-related outages from weekly to zero in 60 days with enforced contracts.
  • Replay of 24 hours of data completed in <2 hours using tiered storage and Iceberg compaction.
  • Finance stopped asking “why does the BI number not match the real-time number?” — because both read the same Iceberg tables.

A crisp 30/60/90 sequencing:

  • Days 0–30: SLOs in Grafana, Schema Registry live, CDC in shadow, topics and ACLs via Terraform, DLQ in place.
  • Days 31–60: Flink jobs in prod behind feature flags; end-to-end tracing; first replay drill; cost telemetry for retention/tiering.
  • Days 61–90: Self-serve patterns: connector templates, contract CI checks, replay runbooks, autoscaling policies. Decommission batch crons that were lying to you.

And yes, kill the AI-generated consumers that aren’t idempotent — we’ve cleaned too many of those with GitPlumbers’ code rescue. Replace with a standard sink that actually honors transactions.

Key takeaways

  • Start with business SLOs (latency, loss, freshness) before picking tech; enforce them as budget constraints.
  • Use a boring, proven spine: Kafka + Flink + Iceberg/Hudi/Delta, with Schema Registry and CDC via Debezium.
  • Make data contracts explicit and enforce them at ingestion; route violations to DLQ with a replay plan.
  • Enable end-to-end exactly-once and design for backpressure, reprocessing, and cost-aware retention/tiering.
  • Instrument the pipeline as a product: consumer lag, end-to-end latency, contract violations, and data drift with actionable alerts.
  • Ship in 90 days by sequencing: contracts and topics (30), core pipelines + observability (60), replay + self-serve patterns (90).

Implementation checklist

  • Define P99 latency, loss rate, and freshness SLOs tied to business decisions.
  • Stand up Schema Registry and enforce Avro/Protobuf contracts at producers and connectors.
  • Use CDC (Debezium) instead of polling; partition topics by access pattern and business key.
  • Turn on Kafka idempotent producers and Flink exactly-once sinks with checkpointing.
  • Add DLQs and quarantines with replay tooling; version events for evolution without downtime.
  • Instrument lag, watermark delays, and contract violations; alert on budgets, not noise.
  • Use tiered storage/compaction to control cost and keep replay feasible.
  • Deploy with GitOps (ArgoCD) and IaC (Terraform) for reproducible changes and rollbacks.

Questions we hear from teams

How do I choose between Flink and Spark Structured Streaming?
If you need low-latency event-time joins, complex state, exactly-once sinks, and predictable backpressure handling, Flink is the safer bet. Spark can work for micro-batch patterns and teams already deep on Spark, but watch state store size and checkpoint commit latency. We use Flink for hot paths and keep Spark for batch/ML training and serving Iceberg tables.
Can I get exactly-once with Kafka and Iceberg?
Yes. Use Kafka idempotent producers and transactions on the write path. In Flink, enable checkpointing and use transactional sinks for Kafka and Iceberg (Iceberg’s Flink sink commits atomically with snapshot isolation). Ensure your sinks support two-phase commits and your checkpoint storage is reliable (S3/GCS/ADLS), with timeouts tuned to your throughput.
How much retention do I need on Kafka topics?
Work backward from replay goals. If you need to reprocess 24 hours at 3x speed, keep at least 48 hours hot and push older segments to tiered storage. For compacted topics, size for compaction lag and tombstones. Monitor storage cost vs replay RTO and adjust; most teams under-retain until their first backfill fire drill.
Do I need a Schema Registry if I’m using JSON?
Yes — or stop using JSON for critical paths. Schema Registry with Avro/Protobuf gives you compatibility rules and evolution controls. JSON without a registry devolves into breaking changes at 2 a.m. If you must do JSON, use JSON Schema in a registry and enforce compatibility at producers.
What’s the fastest way to add quality checks without slowing down the pipeline?
Validate at the edge with lightweight checks (required fields, type, ranges) and route failures to a DLQ. Run deeper statistical checks (drift, anomalies) asynchronously off the DLQ or Iceberg tables. Keep hot-path logic deterministic and idempotent; do heavy lifts out-of-band.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
