Your ‘Real‑Time’ Stream Is 47 Minutes Late: How We Fixed It for Good
Designing streaming data architectures that survive high velocity without torching reliability, quality, or your on-call rotation.
You don’t get real-time by wishing for it. You get it by designing for backpressure, replay, and contracts—and then proving it with SLOs.
Key takeaways
- Durable, replayable logs (Kafka/Redpanda/Kinesis/Pub/Sub) and correct partitioning are the backbone of high‑velocity streams; a keyed-producer sketch follows this list.
- Treat data quality as a contract: schema registry, compatibility rules, validation, and dead‑letter queues.
- Stateful processing needs watermarks, checkpoints, and idempotent sinks; Flink or Kafka Streams make this manageable.
- Define and monitor SLOs for latency, freshness, duplication, and DQ pass rate; automate scaling/responders.
- Roll out incrementally with CDC mirroring, canaries, and backfills; measure outcomes in minutes, not quarters.
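To make the first takeaway concrete, here is a minimal sketch of a keyed, idempotent producer using the confluent-kafka Python client. The broker address, topic name, and key field are illustrative assumptions, not a prescription.

```python
# Minimal sketch: keyed, idempotent producer with confluent-kafka (Python).
# Broker address, topic name, and key field are illustrative assumptions.
import json
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumption: local broker
    "enable.idempotence": True,              # broker de-dupes producer retries
    "acks": "all",                           # wait for full ISR acknowledgment
    "linger.ms": 20,                         # small batching window for throughput
})

def delivery_report(err, msg):
    """Surface produce failures instead of silently dropping events."""
    if err is not None:
        print(f"delivery failed for key={msg.key()}: {err}")

def publish(event: dict) -> None:
    # Key by the entity whose ordering matters (here: account_id), so all
    # events for one account land on one partition, in order.
    producer.produce(
        topic="payments.events.v1",          # assumption: versioned topic name
        key=str(event["account_id"]).encode(),
        value=json.dumps(event).encode(),
        on_delivery=delivery_report,
    )
    producer.poll(0)  # serve delivery callbacks

publish({"event_id": "evt-123", "account_id": 42, "amount_cents": 1999})
producer.flush()
```

The design decision that matters most is the key: pick the entity whose ordering you must preserve, because every consumer downstream inherits that partitioning.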
Implementation checklist
- Pick a durable log (Kafka/Redpanda/Kinesis/Pub/Sub) and set retention/replication appropriately.
- Define partitioning keys to balance throughput and preserve necessary ordering.
- Adopt `Schema Registry` with `BACKWARD` or `FULL` compatibility; ship Protobuf/Avro contracts with code.
- Add stream DQ checks and DLQs; make DQ failures visible and actionable (see the consumer sketch after this checklist).
- Use Flink/Kafka Streams with event time, watermarks, and checkpoints; sinks must be idempotent/exactly-once.
- Instrument lag, end-to-end latency, DQ pass rate; set SLOs and alerts; automate scaling (KEDA/HPA).
- Plan a 30/60/90 rollout: mirror, contract, canary, backfill, cut-over; publish before/after metrics.
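Below is a minimal sketch of the DQ-plus-DLQ step from the checklist, again using the confluent-kafka Python client. The topic names, required-field list, and validation rules are assumptions you would replace with your own contracts.

```python
# Minimal sketch: consumer-side DQ validation with a dead-letter queue.
# Topic names, required fields, and broker address are assumptions.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: local broker
    "group.id": "orders-dq-gate",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,              # commit only after the event is handled
})
dlq = Producer({"bootstrap.servers": "localhost:9092"})

REQUIRED = ("event_id", "order_id", "amount_cents", "event_time")

def validate(event: dict) -> list[str]:
    """Return DQ violations for one event; an empty list means it passes."""
    errors = [f"missing:{field}" for field in REQUIRED if field not in event]
    if "amount_cents" in event and event["amount_cents"] < 0:
        errors.append("negative:amount_cents")
    return errors

consumer.subscribe(["orders.events.v1"])      # assumption: versioned topic name
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue                              # real code: log/alert on consumer errors
    event = json.loads(msg.value())
    errors = validate(event)
    if errors:
        # Keep the failure visible and actionable: original payload plus the reasons.
        dlq.produce(
            "orders.events.v1.dlq",
            key=msg.key(),
            value=msg.value(),
            headers={"dq_errors": ",".join(errors).encode()},
        )
        dlq.poll(0)
    else:
        print(f"ok {event['event_id']}")      # real code: enrich and write to the sink
    consumer.commit(msg)
```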
Questions we hear from teams
- Kafka vs. Redpanda vs. Kinesis/PubSub—what should I pick?
- If you want full control and an open ecosystem, pick Kafka (MSK if you’re okay with AWS quirks). If you want simpler ops with Kafka compatibility, Redpanda is excellent. If you’re deep in AWS or GCP and want a managed service, Kinesis or Pub/Sub are fine, but mind throughput/partitioning limits and per-message costs.
- Do I really get exactly-once?
- You can get end-to-end exactly-once semantics with Flink’s two-phase commit to supported sinks and idempotent producers. Practically, target idempotency at the sink keyed by event_id; treat ‘exactly-once’ as a design goal but verify with duplication metrics.
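One way to get that sink-side idempotency is a minimal upsert keyed by event_id; the sketch below uses Postgres, and the table, columns, and connection string are assumptions.

```python
# Minimal sketch: idempotent sink keyed by event_id via a Postgres upsert.
# Table/column names and connection details are illustrative assumptions.
import psycopg2

UPSERT = """
INSERT INTO order_events (event_id, order_id, amount_cents, event_time)
VALUES (%(event_id)s, %(order_id)s, %(amount_cents)s, %(event_time)s)
ON CONFLICT (event_id) DO NOTHING;   -- replays and retries become no-ops
"""

def write_batch(events: list[dict]) -> None:
    """Write a micro-batch; duplicates from redelivery are absorbed by the key."""
    with psycopg2.connect("dbname=analytics user=stream") as conn:
        with conn.cursor() as cur:
            cur.executemany(UPSERT, events)
    # the connection context manager commits on success, rolls back on exception

write_batch([
    {"event_id": "evt-123", "order_id": 7, "amount_cents": 1999,
     "event_time": "2024-05-01T12:00:00Z"},
    {"event_id": "evt-123", "order_id": 7, "amount_cents": 1999,
     "event_time": "2024-05-01T12:00:00Z"},  # duplicate delivery: second insert is a no-op
])
```

Then track the duplicate rate at the sink; if the upsert is absorbing more than a trickle, something upstream is retrying too aggressively.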
- How do I handle out-of-order and late events?
- Use event time with watermarks. Decide an allowed lateness (e.g., 5 minutes) for aggregations. Late events can trigger updates (MERGE) or be routed to a ‘late’ side output for separate processing, depending on business rules.
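To make the watermark and lateness decision concrete, here is a plain-Python sketch of the bookkeeping Flink does for you: the watermark trails the maximum observed event time, and events that fall behind it by more than the allowed lateness are diverted to a late output. The bounds and event shape are illustrative assumptions.

```python
# Minimal sketch of the event-time bookkeeping a framework like Flink automates.
# The 30s out-of-orderness bound and 5 min allowed lateness are assumptions.
from datetime import datetime, timedelta, timezone

OUT_OF_ORDERNESS = timedelta(seconds=30)
ALLOWED_LATENESS = timedelta(minutes=5)

max_event_time = datetime.min.replace(tzinfo=timezone.utc)
live, late = [], []

def route(event: dict) -> None:
    """Advance the watermark, then aggregate live or divert to the late output."""
    global max_event_time
    event_time = datetime.fromisoformat(event["event_time"])
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - OUT_OF_ORDERNESS
    if event_time >= watermark - ALLOWED_LATENESS:
        live.append(event)   # still eligible to update its window (e.g., via MERGE)
    else:
        late.append(event)   # beyond allowed lateness: handle per business rules

route({"event_id": "a", "event_time": "2024-05-01T12:00:00+00:00"})
route({"event_id": "b", "event_time": "2024-05-01T12:10:00+00:00"})
route({"event_id": "c", "event_time": "2024-05-01T12:06:00+00:00"})  # behind the watermark, within lateness
route({"event_id": "d", "event_time": "2024-05-01T11:30:00+00:00"})  # too late: routed to the late output
print(f"live={len(live)} late={len(late)}")
```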
- What about schema changes without breaking consumers?
- Use Schema Registry compatibility (BACKWARD/FULL). Add new fields with defaults; avoid changing field semantics. Ship schema diffs in PRs and run consumer contract tests in CI before deploy.
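Here is a minimal sketch of that workflow: add the new field with a default, then ask the registry whether the change is compatible before merging. The subject name, fields, and registry URL are assumptions, and it presumes a recent confluent-kafka Python client that exposes test_compatibility.

```python
# Minimal sketch: evolve an Avro schema backward-compatibly and verify it in CI.
# Subject name, fields, and registry URL are assumptions; test_compatibility
# assumes a recent confluent-kafka client that exposes it.
import json
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# New optional field added with a default, so existing consumers keep working.
order_v2 = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "amount_cents", "type": "long"},
        {"name": "channel", "type": ["null", "string"], "default": None},  # new in v2
    ],
}

client = SchemaRegistryClient({"url": "http://localhost:8081"})
candidate = Schema(json.dumps(order_v2), schema_type="AVRO")

# Run the same check in CI that the registry will enforce at register time.
compatible = client.test_compatibility("orders.events.v1-value", candidate)
if not compatible:
    raise SystemExit("schema change is not backward compatible; fix before merging")
```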
- How do I prove business value quickly?
- Pick one P0 use case (fraud signals, inventory, pricing). Ship a freshness dashboard tied to a KPI. Publish before/after: p99 latency, DQ pass rate, duplicate rate, and MTTR. Execs don’t need Kafka diagrams—they need metrics.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.