200k msgs/sec Without the Lies: Streaming Data That Stays Clean, Fast, and Auditable
Throughput is cheap. Trust is expensive. Here’s how to design streaming pipelines that don’t gaslight the business when the firehose opens.
The midnight paging party you’ve probably lived through
You ship a “near real-time” clickstream. The demo pops. Then a Friday deploy doubles traffic and a hot partition melts your consumers. Duplicate events snake into finance, a backfill replays a week of data, and the CFO’s dashboard argues with itself. You’re up at 1:37 a.m. tracing ghost updates because someone “optimized” a producer without keys and the topic’s min.insync.replicas was set to vibes. I’ve seen this movie at adtechs pushing 200k msgs/sec, and at retailers where inventory swings meant real dollars.
This isn’t a scale problem; it’s a truth problem. Throughput is easy. Telling the truth fast is the hard part.
The backbone: contracts over heroics
Pick a boring, durable backbone: Kafka (Confluent/MSK) or Kinesis/Pub/Sub. Then lock the rules of the road with data contracts.
- Keys and partitioning decide your fate. If you care about order per `customer_id`, key by `customer_id`. If you key by `null`, enjoy entropy.
- Replication and durability matter more than throughput graphs. Set RF=3, `min.insync.replicas=2`, and stop pretending `acks=0` is fine.
- Schema governance: use Schema Registry with Protobuf or Avro. Enforce `BACKWARD` or `FULL` compatibility. Breaking changes go through CI, not Slack approvals.
- CDC where possible: Debezium from OLTP gives you facts with built-in idempotency signals (`op`, LSN). CDC + compaction beats polling.
Create topics like you mean it:
```bash
# Kafka topic tuned for durability and ordered keys
kafka-topics.sh --create \
  --topic orders.events.v1 \
  --partitions 48 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config cleanup.policy=delete \
  --config retention.ms=604800000 \
  --config segment.bytes=1073741824 \
  --config compression.type=zstd
```

And guard your schemas:
```bash
# Enforce backward compatibility for the topic's value subject
curl -s -X PUT \
  -H "Content-Type: application/json" \
  --data '{"compatibility": "BACKWARD"}' \
  http://schema-registry:8081/config/orders.events.v1-value
```

Reliability by design: idempotency, backpressure, and DLQs
If your system can’t be re-run safely, it’s a demo, not a platform.
- Idempotency: carry stable event IDs and version counters. Use upserts or de-dup windows on the sink. For Kafka Streams, set `processing.guarantee=exactly_once_v2` and write to transactional sinks.
- Backpressure: let your runtime say “not now.” Flink and Spark do this well; Kinesis + Lambda without a buffer does not.
- DLQs as first-class: quarantine poison pills and invalid payloads with metadata for triage. Paging on every failure is noise; paging on DLQ rate spikes is signal.
- Checkpointing and replay: store checkpoints in something resilient (S3/GCS); don’t pin them to ephemeral disks.
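A de-dup window at the sink is the cheapest idempotency you can buy. A minimal in-memory sketch of the idea (a production sink would back this with the store's own upsert or a TTL'd key set; the class and sizes here are illustrative):

```python
from collections import OrderedDict

class DedupWindow:
    """Remembers the last `capacity` event IDs; replayed events are dropped.
    In production, prefer an upsert keyed on the event ID, or a TTL'd set."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.seen = OrderedDict()

    def admit(self, event_id):
        if event_id in self.seen:
            return False                    # duplicate: safe to skip on replay
        self.seen[event_id] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)   # evict the oldest ID
        return True

window = DedupWindow(capacity=3)
events = ["e1", "e2", "e1", "e3"]           # "e1" arrives twice (replay)
delivered = [e for e in events if window.admit(e)]
# delivered == ["e1", "e2", "e3"]
```

The window only protects against duplicates closer together than its capacity; for replays of a whole week, you need the upsert path.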
Dead-lettering that ops can trust:
```json
{
  "name": "orders-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders.events.v1",
    "connection.url": "jdbc:postgresql://dw:5432/warehouse",
    "insert.mode": "upsert",
    "pk.mode": "record_value",
    "pk.fields": "order_id",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "orders.events.dlq",
    "errors.deadletterqueue.context.headers.enable": "true",
    "errors.retry.timeout": "60000",
    "errors.retry.delay.max.ms": "5000"
  }
}
```

For stateful stream processors, flip on real checkpointing and externalized state:
```yaml
# Flink application config (values excerpt)
state.savepoints.dir: s3://flink/state/savepoints
state.checkpoints.dir: s3://flink/state/checkpoints
execution.checkpointing.interval: 10000   # 10s
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
```

Quality in motion: validate before it hits the CFO
Batch-only validation is how bad data gets the weekend off. Validate in flight.
- Contracts catch shape changes; validations catch bad values. You need both.
- Business rules in-stream: `quantity >= 0`, `country` in the ISO list, `status in {PLACED, SHIPPED, ...}`.
- Quarantine invalid events to `*.quarantine` topics; sample and alert on rates.
A quick and dirty validation inside a micro-batch with Spark Structured Streaming + Great Expectations:
```python
from pyspark.sql import SparkSession, functions as F
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

orders = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "orders.events.v1")
    .load())

# parse value -> JSON columns... (omitted)

def validate_and_write(batch_df, batch_id):
    # Batch-level expectations: feed the results into metrics/alerting
    gdf = SparkDFDataset(batch_df)
    gdf.expect_column_values_to_not_be_null("order_id")
    gdf.expect_column_values_to_be_between("quantity", min_value=0, max_value=10000)
    gdf.expect_column_values_to_match_regex("country", r"^[A-Z]{2}$")
    results = gdf.validate()  # results.success and per-expectation stats

    # Row-level split: the same rules as filters, so bad rows get quarantined
    rules = (F.col("order_id").isNotNull()
             & F.col("quantity").between(0, 10000)
             & F.col("country").rlike(r"^[A-Z]{2}$"))
    valid = batch_df.filter(rules)
    invalid = batch_df.subtract(valid)
    valid.write.mode("append").format("delta").save("/mnt/delta/orders_silver")
    invalid.write.mode("append").format("delta").save("/mnt/delta/orders_quarantine")

(orders.writeStream
    .foreachBatch(validate_and_write)
    .option("checkpointLocation", "/mnt/ckpt/orders")
    .start()
    .awaitTermination())
```

Tip: add lightweight statistical checks (e.g., “orders/min within 3σ of last week”) alongside hard rules. Turn deltas into alerts, not just dashboards.
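That 3σ check is just a rolling mean and standard deviation over a trailing window. A minimal sketch (the window, rates, and threshold are illustrative):

```python
import statistics

def rate_alert(history, current, sigmas=3.0):
    """Alert when the current orders/min falls outside mean +/- sigmas*stdev
    of the trailing history (e.g., the same window last week)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    lo, hi = mean - sigmas * stdev, mean + sigmas * stdev
    return not (lo <= current <= hi)

last_week = [980, 1010, 995, 1005, 990, 1000, 1015]   # orders/min samples
assert rate_alert(last_week, 400) is True    # collapse: page someone
assert rate_alert(last_week, 1002) is False  # within band: stay quiet
```

Seasonality matters: compare against the same hour last week, not a flat daily mean, or Monday mornings will page you forever.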
Also, capture lineage so you can explain where a number came from. OpenLineage with Marquez gives you runs and datasets you can query during incidents.
```bash
# Spark with OpenLineage
export OPENLINEAGE_URL=https://marquez.example/api
export OPENLINEAGE_NAMESPACE=streaming
spark-submit \
  --packages io.openlineage:openlineage-spark:1.10.0 ...
```

Storage and reprocessing: make truth replayable
You will replay. Design for it.
- Write raw events to an immutable “bronze” table (e.g., Delta/Iceberg/Hudi). No transformations, just normalized schemas and partitioning by event date.
- “Silver” is where you de-duplicate and enforce contracts; “Gold” is what BI/ML touches. Keep the lineage explicit.
- Use time travel to reconstruct a report exactly as it was last Tuesday.
Example with Iceberg time travel when finance asks, “what did we show on 2025-08-15?”
```sql
-- Athena/Trino on Iceberg
SELECT * FROM gold.order_revenue FOR TIMESTAMP AS OF TIMESTAMP '2025-08-15 18:00:00 UTC'
```

CDC + compaction: use Debezium’s change events to upsert into an Iceberg table keyed by business ID. You get exactness without batched cron backfills.
From raw to revenue: serving patterns that move the needle
If the business waits on batch windows, “real-time” doesn’t matter. Serve the answers where they’re consumed.
- Operational analytics: roll up aggregates into Pinot/ClickHouse/Druid for sub-second dashboards and APIs.
- Features for ML: keep online/offline parity with streams to a feature store (Feast, Tecton) and the same bronze/silver contracts.
- Search/indexing: change logs into Elasticsearch/OpenSearch with idempotent updates.
Concrete, boring wins we’ve delivered:
- Reduced P99 end-to-end order latency from 7m → 45s by moving from cron-based batch to stream compaction + Pinot rollups.
- Cut inventory mismatches 92% by enforcing keyed messages and idempotent upserts at the sink.
- Dropped infra cost 28% by turning on `zstd`, compaction for changelog topics, and right-sizing partitions (from 200 → 64) after measuring consumer parallelism.
Operate like an SRE: SLOs, alerts, and cost controls
Observability is the contract with your future self.
- SLOs that matter:
- P99 end-to-end latency (ingest → gold table or serving index)
- Max data loss (target <0.01% events/day)
- Consumer lag per group (thresholds by partition count and TPS)
- Validation pass rate (keep >99.5%, alert on drops)
- Alerts on symptoms, not components. A green broker with a red backlog is still red.
- Autoscale consumers based on effective throughput, not CPU. If you can’t increase partitions, you can’t scale.
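The autoscaling rule can be made concrete: size the consumer group from how fast you need to drain lag, capped by partition count (all numbers below are illustrative):

```python
import math

def desired_consumers(lag, per_consumer_tps, drain_seconds, partitions):
    """Consumer instances needed to drain `lag` messages within
    `drain_seconds`, never exceeding the partition count (extras sit idle)."""
    needed = math.ceil(lag / (per_consumer_tps * drain_seconds))
    return min(max(needed, 1), partitions)

# 2.4M messages behind, 5k msgs/sec per instance, 120s drain target, 48 partitions
assert desired_consumers(2_400_000, 5_000, 120, 48) == 4

# When the cap binds, scaling out does nothing: repartition or fix the sink.
assert desired_consumers(60_000_000, 5_000, 120, 48) == 48
```

This is why “if you can’t increase partitions, you can’t scale”: the formula saturates at the partition count no matter what your autoscaler wants.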
Prometheus rules that save weekends:
```yaml
groups:
  - name: streaming-slo
    rules:
      - alert: ConsumerLagHigh
        expr: kafka_consumergroup_group_lag{group="orders-silver"} > 200000
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Consumer lag high for orders-silver"
          description: "Lag >200k for 5m. Investigate hot partitions or downstream slowness."
      - alert: ValidationPassRateDrop
        expr: (sum(rate(stream_valid_records_total[5m])) / sum(rate(stream_records_total[5m]))) < 0.995
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Validation pass rate below SLO"
          description: "<99.5% valid records over 10m. Check quarantine rates and schema changes."
```

Cost sanity:
- Retention budgets by topic: most producers over-retain. Keep raw 7–14 days, compaction for changelogs, and archive to cheap storage.
- Compression everywhere (
zstdorlz4) and batch sizes that fill your MTUs. - Partition strategy tied to consumer parallelism. More partitions aren’t free.
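Retention budgets are arithmetic, not vibes. A sketch of what a topic actually costs on disk (the compression ratio and rates below are illustrative assumptions, not measurements from the original post):

```python
def topic_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_days,
                     replication_factor=3, compression_ratio=0.25):
    """Approximate on-disk footprint of a topic: ingest rate x retention,
    multiplied by replication, shrunk by compression (~0.25 for zstd on JSON
    is an assumption -- measure your own ratio)."""
    raw = msgs_per_sec * avg_msg_bytes * retention_days * 86_400
    return int(raw * compression_ratio * replication_factor)

# 10k msgs/sec of 1 KB events, 7-day retention, RF=3, zstd
gb = topic_disk_bytes(10_000, 1_024, 7) / 1e9
# roughly 4.6 TB on disk: halve retention or archive to object storage
```

Run this per topic and the over-retainers fall out of the list immediately.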
A sane rollout: 90 days from demo to dependable
Don’t big-bang. Build a vertical slice that proves truth, not tech.
- Week 0–2: pick a single data product with business value (e.g., real-time revenue). Define the data contract and SLOs with the product owner.
- Week 3–6: implement backbone + bronze/silver, schema registry in CI, DLQ + validation, lineage. Shadow the existing batch.
- Week 7–10: add serving store (Pinot/ClickHouse), dashboards, alerts. Run canaries at 10%, simulate failure (broker down, schema break, poison pill).
- Week 11–12: flip traffic, hold rollback plan, publish runbook, and commit to a weekly reliability review.
Security and governance (because legal will ask):
- PII tokenization at the edge; field-level encryption for high-risk columns.
- Access via service accounts + short-lived creds. Auditable lineage + immutable logs.
- Disaster testing: kill a broker, corrupt an event, roll keys. Measure MTTR.
What actually worked (and what didn’t)
What worked:
- Treating the data contract as a product artifact, reviewed during API changes, not a data-team afterthought.
- Idempotent sinks and transactional guarantees only where money moves. Elsewhere, at-least-once plus dedupe windows was good enough.
- DLQ triage as a weekly ritual with defect metrics. We cut recurring DLQ causes by 70% in a quarter at a mid-market fintech.
- Lineage + time travel to speed audits. We shaved data incident MTTR from 6h → 55m at a marketplace by pairing OpenLineage with Iceberg snapshots.
What didn’t:
- “We’ll clean it in the warehouse.” Too late; the business already saw garbage.
- “Serverless will scale it for us.” Without backpressure and concurrency control, it scales your pager.
- “Let’s start with microservices per event type.” Start with teams and contracts; topology can evolve.
When GitPlumbers parachutes in, we don’t sell magic. We stabilize the backbone, wire contracts and validation, and set SLOs you can live with. Then we prove it with measurable outcomes in 90 days or less.
Key takeaways
- Design for failure first: idempotency, backpressure, DLQs, and replayability beat raw throughput benchmarks.
- Data contracts and schema governance are the leverage points for reliability and speed across teams.
- Measure the right things: end-to-end latency, data loss rate, consumer lag SLOs, and validation pass rates.
- Keep state close and auditable: CDC + Lakehouse tables (Iceberg/Delta) enable time travel and reprocessing.
- Operational excellence matters: alerts on lag/throughput, autoscaling triggers, and predictable retention budgets.
Implementation checklist
- Pick a durable event backbone with replication and contracts (Kafka/Kinesis + Schema Registry).
- Enforce keys, partitioning, and retention configs that match your access patterns.
- Build idempotent consumers and exactly-once producers where it matters (financials, inventory).
- Introduce DLQs with mandatory triage and clear SLAs; track defect density and MTTR.
- Wire schema compatibility (BACKWARD or FULL) and automated contract checks in CI/CD.
- Validate in-stream: statistical and business-rule checks, with quarantine topics for failures.
- Capture lineage and operational metadata (OpenLineage) for audits and fast incident response.
- Set SLOs: P99 end-to-end latency, max data loss %, consumer lag thresholds, and cost per million events.
Questions we hear from teams
- Do I really need exactly-once everywhere?
- No. It’s expensive and brittle at boundaries. Use exactly-once (Kafka Streams `exactly_once_v2`, Flink checkpoints) for financial transfers, inventory, and billing. Everywhere else, at-least-once with idempotent upserts and small dedupe windows is cheaper and good enough.
- How many Kafka partitions should I create?
- Start from consumer parallelism and target throughput. If a consumer instance can process 5k msgs/sec and you need 100k msgs/sec with headroom, 24–32 partitions is reasonable. Avoid hundreds of partitions by default—controller churn and rebalances will tax you.
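That sizing rule is a one-liner; a sketch with the numbers from the answer (the 1.5x headroom factor is an assumption):

```python
import math

def partition_count(target_tps, per_consumer_tps, headroom=1.5):
    """Partitions = peak throughput with headroom / per-consumer throughput.
    Consumer parallelism can never exceed partition count, so size for peak."""
    return math.ceil(target_tps * headroom / per_consumer_tps)

# 100k msgs/sec target, 5k msgs/sec per consumer, 1.5x headroom -> 30 partitions
assert partition_count(100_000, 5_000) == 30
```

Round up to a number with many divisors (24, 32, 48) so consumer counts divide evenly during scale-out.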
- Where should I validate data: producer or consumer?
- Both, but differently. Producers enforce the data contract (shape/types); consumers enforce business rules (ranges, referential integrity). Put light checks at the edge to fail fast, and heavier checks in stream processors with quarantine paths.
- Can I keep the warehouse as my source of truth?
- Keep the warehouse for analytics, but your source of truth for facts is the event log + bronze tables. That’s what you replay. Warehouses are for aggregates and history, not the canonical event timeline.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
