From 8‑Minute Lag to 30‑Second Insights: A Streaming Data Backbone That Doesn’t Flinch

You don’t need more buzzwords. You need a streaming architecture that survives traffic spikes, keeps your data clean, and ships value on time.

We didn’t make it faster by being clever; we made it reliable by being boring—then it got fast.

The night the clickstream flooded Kafka

A retailer’s Black Friday pre-sale turned a “stable” pipeline into a bonfire. Clickstream volume tripled to 2B events/day, a couple of malformed payloads slipped through, and the stream jobs started thrashing. p95 end-to-end latency ballooned from 45s to 8 minutes, recommendations fell back to a stale model, and marketing paused spend mid-campaign. Been there? I have—more than once.

We rebuilt the streaming backbone in two weeks: kept the firehose, stopped the fires. The results: p95 latency under 30s, duplicate rate under 0.01%, data completeness at 99.98% during peak. Here’s what actually works when velocity gets real.

Architect for speed without breaking the truth

Streaming that survives peak loads is boring by design. The pattern we deploy at GitPlumbers looks like this:

  • Log-first (Kappa) architecture: one durable log (Kafka/Redpanda/Pulsar). No Lambda split-brain. Everything is an event.
  • Contracts at the edge: Schema Registry (Avro/Protobuf), BACKWARD_TRANSITIVE compatibility, producers fail fast on bad schemas.
  • Stateful processors: Flink 1.18 or Spark Structured Streaming with checkpointing, watermarks, and exactly-once sinks.
  • Idempotent writes: sinks that support upserts/merge (Iceberg, Delta, Snowflake Snowpipe Streaming, BigQuery Storage Write API).
  • Observability & SLOs: measure end-to-end latency, consumer lag, data freshness, completeness, duplicate rate, and MTTR. Alert on SLO burn, not CPU.

Boring architecture scales. Heroics don’t.

Get data in without corrupting it

Handling high velocity starts at ingestion. You don't want your producers inventing formats at 2 a.m. or your connectors spraying poison pills.

  1. Use CDC or the outbox pattern for databases. With Debezium 2.x, emit change events from the orders table as Protobuf with a stable key (a connector sketch follows this list).
  2. For app telemetry, front with Nginx/Envoy -> Kafka REST proxy or Kafka native producers. Enforce schema at the edge.
  3. Size partitions for throughput, not vibes. Target partition write rates under ~50 MB/s and keep consumer groups balanced.
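
For step 1, registering the connector through the Kafka Connect REST API is a few lines. A minimal sketch in Python, assuming Debezium 2.x with the Confluent Protobuf converter; hostnames, credentials, and table names are illustrative:

import requests

# Sketch: register a Debezium 2.x Postgres connector that emits Protobuf change
# events for the orders table. Hostnames, credentials, and names are illustrative.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",  # use a secrets/config provider in production
        "database.dbname": "shop",
        "topic.prefix": "orders",          # Debezium 2.x replacement for database.server.name
        "table.include.list": "public.orders",
        "key.converter": "io.confluent.connect.protobuf.ProtobufConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter": "io.confluent.connect.protobuf.ProtobufConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()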

Minimal producer config for exactly-once into Kafka (or Redpanda):

acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=1
linger.ms=20
batch.size=32768
compression.type=zstd
transactional.id=orders-producer-v1

Kafka topic hygiene for clickstream:

# 12 partitions, 7-day retention for clickstream; compaction for user profiles
kafka-topics --bootstrap-server "$BROKERS" --create --topic clickstream.events.v1 \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=$((7*24*60*60*1000))

kafka-topics --bootstrap-server "$BROKERS" --create --topic user.profile.v1 \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact

Guardrails with Schema Registry (Confluent or AWS Glue):

# Enforce backward-transitive compatibility on the subject (Confluent Schema Registry)
curl -X PUT http://schema-registry/config/clickstream.events.v1-value \
  -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
  -d '{"compatibility": "BACKWARD_TRANSITIVE"}'

Reliability in motion: exactly-once, backpressure, and replay

I’ve seen more outages from “fast but dirty” streams than from anything else. You need to engineer for failure on day one.

  • Exactly-once semantics (EOS): In Flink, enable checkpointing and two-phase commits. Don’t hand-wave this.
  • Backpressure: Let the framework slow upstream operators instead of dropping events. Watch watermarks and operator busy time.
  • DLQ with context: Route bad records with headers: source, offset, schemaVersion, validationErrors.
  • Replayability: Keep 7–14 days in the log. Reprocess by consumer group or by offset range.
  • Deduplication: Idempotent keys + windowed dedup so retries don’t cause double charges.
  • Circuit breakers: When a sink degrades (warehouse throttling), buffer to a staging topic and degrade gracefully.

Flink essentials that keep you out of the ditch:

# flink-conf.yaml
execution.checkpointing.interval: 30s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.savepoints.dir: s3://my-bucket/flink/savepoints
execution.checkpointing.max-concurrent-checkpoints: 1
pipeline.object-reuse: true

Dedup in Flink with watermarking (pseudo-Java):

stream
  .assignTimestampsAndWatermarks(wmStrategy.withIdleness(Duration.ofMinutes(1))) // watermark before keying
  .keyBy(e -> e.userId())
  .process(new DedupWithinWindow(Duration.ofMinutes(10))) // keep seen ids in keyed RocksDB state
  .uid("user-dedup");

Kafka Connect with DLQ and retries:

{
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "clickstream.dlq",
  "errors.deadletterqueue.context.headers.enable": true,
  "errors.retry.timeout": 600000,
  "errors.retry.delay.max.ms": 60000
}
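
Replay should be routine, not heroic. A minimal DLQ replay sketch with confluent_kafka; the group id, partition, offset, and topic names are illustrative:

from confluent_kafka import Consumer, TopicPartition

# Re-consume clickstream.dlq from a known offset and hand records to a repair path.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "dlq-replay-2024-11-29",   # fresh group so live consumers are untouched
    "enable.auto.commit": False,
})

# Start at an explicit offset on partition 0; repeat per partition for a full replay.
consumer.assign([TopicPartition("clickstream.dlq", 0, 142_000)])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break  # stop at the first quiet second; fine for a bounded replay
    if msg.error():
        continue
    # DLQ context headers carry source, offset, schemaVersion, validationErrors.
    headers = dict(msg.headers() or [])
    # ...repair the payload and re-produce to clickstream.events.v1...

consumer.close()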

Quality at speed: contracts and in-stream tests

Bad data at 10k events/sec is still bad data. Catch it before it sinks.

  • Data contracts: Team owns the schema and SLAs, not just the code. Protobuf with required fields for keys; optional fields for evolvability.
  • In-stream assertions: Great Expectations/Deequ on the stream job. Fail closed on critical checks, divert to DLQ on non-critical.
  • Watermarks and lateness: Define what “on time” means. If click events can be 5 minutes late, watermark at 6–7 minutes.
  • PII governance: Classify at ingest. Mask or tokenize in-stream. Encrypt at rest and in transit. Delete by key on request (GDPR/CCPA).

Example: lightweight Great Expectations validation in Spark Structured Streaming, run per micro-batch:

from great_expectations.dataset import SparkDFDataset
from pyspark.sql.functions import col, to_timestamp

def validate_batch(batch_df, batch_id):
    # Streaming DataFrames can't be validated directly; assert on each micro-batch.
    batch = SparkDFDataset(batch_df.withColumn("ts", to_timestamp(col("ts"))))
    assert batch.expect_column_values_to_not_be_null("user_id").success
    assert batch.expect_column_values_to_match_regex("event_type", "^[a-z_]+$").success

query = raw_stream.writeStream.foreachBatch(validate_batch).start()

Data contracts in practice:

  • Producers must bump the schema version on breaking changes; CI checks candidate schemas against the Schema Registry (a gate sketch follows this list).
  • Consumers pin to a major version; deploy canary consumers before switching traffic.
  • Alert if unknown_fields_rate > 0.1% or schema_rejections > 50/min.
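
The CI gate for that check can be a few lines. A sketch assuming the Confluent Schema Registry compatibility API; the subject, host, and file path are illustrative:

import sys
import requests

REGISTRY = "http://schema-registry:8081"
SUBJECT = "clickstream.events.v1-value"

with open("schemas/clickstream_event.proto") as f:
    candidate = f.read()

# Ask the registry whether the candidate schema is compatible with the latest version.
resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schemaType": "PROTOBUF", "schema": candidate},
    timeout=10,
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    sys.exit("Breaking change detected: bump the major version and create a new subject.")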

If your “contract” is a wiki page, it’s not a contract.

Ship value, not just events

Events are only valuable when they drive decisions. Wire the last mile.

  • Lakehouse sinks: Stream into Apache Iceberg or Delta Lake for time-travel and ACID upserts. Partition by event date/hour; merge-on-read for cost.
  • Warehouse sinks: Use Snowflake Snowpipe Streaming or the BigQuery Storage Write API for low-latency analytics instead of brittle micro-batching.
  • Materialized views: ksqlDB for quick aggregates (e.g., sessions per minute), or precompute with Flink and publish to a serving topic.
  • Feature store: Emit real-time features to Feast or your in-house store for online inference.
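
For the feature-store path above, pushing a fresh feature row to the online store is short once the Feast repo is defined. A sketch assuming a push source named user_click_features_push, which is illustrative:

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at your Feast repo

# One freshly computed feature row keyed by user_id (values are illustrative).
features = pd.DataFrame({
    "user_id": ["u_123"],
    "clicks_last_10m": [14],
    "event_timestamp": [pd.Timestamp.now(tz="UTC")],
})

store.push("user_click_features_push", features)  # lands in the online store for inference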

Concrete pattern for orders to Iceberg (Flink SQL):

CREATE TABLE orders_iceberg (
  order_id STRING,
  user_id STRING,
  amount DECIMAL(10,2),
  status STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '7' MINUTE,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'='iceberg',
  'catalog-name'='prod',
  'warehouse'='s3://datalake/warehouse',
  'format-version'='2'
);

What “good” looked like for the retailer:

  • p95 end-to-end latency: 30s (down from 8m)
  • Freshness SLA: 99% of events available in analytics under 2 minutes
  • Duplicate rate: <0.01% (down from ~4%)
  • Recommendation CTR: +6.3% with fresher features
  • Ops: MTTR for stream failures under 15 minutes

Operate it like SREs, not hobbyists

If you can’t see it, you can’t fix it. Run your streams with the same rigor as your APIs.

  • Metrics: consumer_lag, end_to_end_latency_ms, dropped_records, watermark_delay_ms, dead_letter_rate, checkpoint_duration_ms (an emitter sketch follows this list).
  • Tracing: OpenTelemetry spans from producer -> processor -> sink. Propagate trace_id in headers.
  • Dashboards & alerts: Prometheus + Grafana SLO burn alerts. Don’t alert on CPU; alert on user pain.
  • Capacity management: Autoscale on lag and operator busy time. Keep headroom for 2× spikes.
  • Deployments: GitOps with ArgoCD. Canary a new Flink job using a forked topic or a savepoint-based blue/green.
  • Runbooks: One-pagers for “replay DLQ”, “rebuild state from savepoint”, “switch sink to staging”.
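
One way to get the latency signal above without waiting on a platform team: have consumers emit it themselves. A sketch with prometheus_client, assuming the producer timestamp rides along in the event; the metric name and port are illustrative:

import time

from prometheus_client import Histogram, start_http_server

E2E_LATENCY = Histogram(
    "end_to_end_latency_seconds",
    "Producer-to-consumer latency",
    buckets=(1, 5, 10, 30, 60, 120, 300),
)

start_http_server(9404)  # scrape target for Prometheus

def record_latency(event_ts_epoch: float) -> None:
    # Observe how long the event took from production to this consumer.
    E2E_LATENCY.observe(max(time.time() - event_ts_epoch, 0.0))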

Sample SLOs that survive reality:

  • p95 end-to-end latency ≤ 60s; alert if burn rate > 2× for 10 minutes
  • Data completeness ≥ 99.95% over 1 hour
  • Duplicate rate ≤ 0.05% per day
  • MTTR ≤ 30 minutes for P1 data pipeline incidents
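
To make "burn rate > 2×" concrete: burn rate is the observed error rate divided by the rate your SLO allows. A minimal sketch using the completeness SLO above; the event counts are illustrative:

def burn_rate(bad_events: int, total_events: int, slo_target: float = 0.9995) -> float:
    """Error-budget burn rate over a window: observed error rate / allowed error rate."""
    error_budget = 1.0 - slo_target             # 0.05% allowed incompleteness
    observed = bad_events / max(total_events, 1)
    return observed / error_budget

# 120 late or missing events out of 100k burns the budget at 2.4x; sustained
# for 10 minutes, that should page someone.
print(burn_rate(bad_events=120, total_events=100_000))  # 2.4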

A canary pattern that avoids consumer-group collisions:

  1. Mirror live topic to topic.canary via MirrorMaker 2 or Flink tee.
  2. Run the new job against topic.canary and compare outputs to the baseline (a comparison sketch follows this list).
  3. Flip producer routing to canary path (feature flag) once deltas are <0.1% for an hour.
  4. Promote and decommission the old job via savepoint.
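
The step-2 comparison doesn't need a framework. A minimal sketch that diffs per-minute aggregates from the baseline and canary sinks; the bucket values are illustrative:

def delta_rate(baseline: dict, canary: dict) -> float:
    """Fraction of time buckets where the canary output differs from the baseline."""
    keys = baseline.keys() | canary.keys()
    diffs = sum(1 for k in keys if baseline.get(k) != canary.get(k))
    return diffs / max(len(keys), 1)

# Per-minute order counts pulled from each sink (illustrative values).
baseline = {"2024-11-29T00:01": 1200, "2024-11-29T00:02": 1310}
canary = {"2024-11-29T00:01": 1200, "2024-11-29T00:02": 1310}

assert delta_rate(baseline, canary) < 0.001, "deltas >= 0.1%: hold the canary"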

What I’d do again (and what I wouldn’t)

  • Do again: contracts at the edge, EOS end-to-end, idempotent sinks, observability first. These bought us reliability.
  • Do again: Kappa over Lambda. One truth beats two half-truths.
  • Wouldn’t: rely on warehouse CDC for real time; throttling will back you into batch.
  • Wouldn’t: accept “we’ll fix the schema later.” Later never comes in peak season.
  • Do next time: formal error budgets for data freshness tied to marketing spend. It focuses the roadmap.


Key takeaways

  • Treat streaming as a product: define SLOs, data contracts, and rollback plans before you write a line of code.
  • Prefer a Kappa-style architecture: one event log (Kafka/Redpanda/Pulsar), stateful stream processing (Flink/Spark), and idempotent sinks.
  • Enforce schemas and quality checks in-flight; don’t wait for the warehouse to catch bad data.
  • Design for backpressure, retries, DLQs, and replayability—assume producers and consumers will misbehave.
  • Prove business value with latency, freshness, and completeness metrics tied to revenue-driving use cases.

Implementation checklist

  • Define SLOs: end-to-end p95 latency, data completeness %, duplicate rate, MTTR.
  • Pick the log: `Kafka`/`Redpanda` sized by partitions and throughput; enable idempotent, transactional producers.
  • Enforce data contracts with `Schema Registry` and `BACKWARD_TRANSITIVE` compatibility.
  • Implement stateful processing with `Flink` or `Spark Structured Streaming` (exactly-once, checkpointing).
  • Add inline quality checks (`Great Expectations`/`Deequ`), DLQ with context, and replay tooling.
  • Write idempotently to sinks (`Iceberg`/`Delta`/warehouse) and support backfills.
  • Instrument everything: `Prometheus`/`Grafana`, `OpenTelemetry`, consumer lag, watermark skew.
  • Ship via GitOps (`ArgoCD`) with canary pipelines and runbooks for rollback.

Questions we hear from teams

Do I need Flink, or will Spark Structured Streaming do?
Both can work. If you need low-latency stateful processing with fine-grained backpressure and native exactly-once sinks, Flink 1.18 is hard to beat. Spark Structured Streaming is great when you already run Spark and your latency SLO is ~seconds, not tens of milliseconds.
Is Confluent Kafka required?
No. We’ve deployed on OSS Kafka, Redpanda, and Pulsar. Confluent adds managed ops and a mature Schema Registry. Pick the platform your team can operate reliably. The patterns—contracts, EOS, DLQ, replay—matter more than the vendor.
How do you handle GDPR deletes in a streaming world?
Emit delete commands to a dedicated `privacy.deletes` topic keyed by user ID. Apply them in-stream to your online stores and periodically run compaction or merge operations in your lakehouse/warehouse to hard-delete historical records.
What’s the fastest path to value if we’re all batch today?
Start with one revenue-impacting use case (e.g., recommendations freshness). Stand up the log, a single Flink/Spark job, and one sink. Prove p95 < 60s and completeness > 99.9% for that path before you boil the ocean.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer about your streaming SLOs, or get a 2-week streaming stability assessment.
