Real-Time Data Pipelines That Don’t Lie: Decisions You Can Bet the Quarter On

If your “real-time” dashboard is 12 minutes stale, you don’t have a pipeline — you have a liability. Here’s the architecture, guardrails, and roll-out plan we use to ship reliable, business-critical streams.

Real-time isn’t about speed; it’s about trust. If the stream isn’t trusted, decisions degrade and the pager lights up.

The 2 a.m. page: real-time that wasn’t

A few quarters back, a marketplace’s “real-time” revenue dashboard showed green. Finance approved a promo ramp. Ops woke up to stockouts, refunds, and a 7-figure variance. Root cause? A silent consumer failure in their Kafka pipeline masked by retries and an autoscaling quirk. The data looked current but was 18 minutes stale. I’ve seen this movie too many times.

If your business makes pricing, inventory, or risk decisions off streaming data, “eventual consistency” without guardrails is just a ticking pager. Let’s talk about what actually works when the stakes are real.

Define real-time in SLOs, not vibes

“Near real-time” means nothing. Write SLOs that make the tradeoffs explicit:

  • End-to-end latency (p99): ingestion to serving < 120s for decisions, < 10s for alerts

  • Freshness: max(event_time) within 2 minutes of wall clock

  • Correctness: 99.9% deduplicated, ordered-in-window, schema-compliant events

  • Completeness: ≥ 99.5% of expected events processed within 5 minutes

  • Availability: pipeline components 99.9% (with degraded modes defined)

Measure these with real signals: consumer lag, watermark age, max event time vs. now, and reconciliation against a source of truth (CDC counts vs. sink counts). Agree on which decisions the stream supports. If finance is approving spend, your SLOs can’t be “best effort.”

Architecture that survives reality

Here’s the stack we deploy when the pager needs to stay quiet:

  • Ingest (CDC): Debezium off PostgreSQL/MySQL for orders, payments, inventory. It turns row changes into ordered, replayable events. Avoid DIY triggers.

  • Transport: Kafka or Redpanda (if you want less ops) with a Schema Registry (Confluent or Apicurio) enforcing compatibility (BACKWARD for consumers; lock down FULL on critical topics).

  • Processing: Flink or ksqlDB for stateful transforms, joins, and windows with watermarks. Avoid cron-y consumers for anything business-critical.

  • Storage/Serve: ClickHouse or Apache Pinot for sub-second analytics and operational dashboards; optionally Delta Lake/Iceberg for replay/archival.

  • Orchestration & CD: Terraform + GitOps (ArgoCD) so changes are reviewed, diffed, and repeatable.

  • Observability: Prometheus + Grafana + OpenTelemetry metrics from stream jobs and sinks.

We bias for idempotent writes, exactly-once semantics where affordable, and a clean replay story. A normal day still includes network hiccups, schema drift, backfills, and a junior dev’s “innocent” column rename.

Example: creating a real topic for orders

kafka-topics --bootstrap-server $BROKERS \
  --create --topic orders.v1 \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config cleanup.policy=compact,delete \
  --config segment.ms=3600000 \
  --config retention.ms=172800000

  • Partitions sized to peak throughput + headroom

  • min.insync.replicas for durability

  • Compaction for idempotent upserts; the delete policy bounds retention so disk usage and replay windows stay manageable
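On the producer side, idempotence plus keying by order_id is what makes compaction and per-order ordering actually work. A hedged sketch — the config dict uses confluent-kafka (librdkafka) option names, and the partitioner is a deterministic stand-in to illustrate the key-to-partition property (Kafka’s real default partitioner uses murmur2, not MD5):

```python
import hashlib

# Producer config sketch (confluent-kafka / librdkafka option names):
producer_config = {
    "bootstrap.servers": "broker:9092",
    "enable.idempotence": True,   # broker dedupes retries; no duplicate appends
    "acks": "all",                # wait for min.insync.replicas acks
    "linger.ms": 20,              # small batching window for throughput
}

def partition_for(key: str, num_partitions: int = 12) -> int:
    """Illustrative key -> partition mapping: the same order_id always lands
    on the same partition, which preserves per-order event ordering."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

The property that matters: two events for the same order can never race each other across partitions.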

Debezium CDC for Postgres

{
  "name": "postgres-orders",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "*****",
    "database.dbname": "app",
    "slot.name": "orders_slot",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.orders,public.order_items",
    "tombstones.on.delete": "false",
    "topic.prefix": "cdc",
    "heartbeat.interval.ms": "5000",
    "snapshot.mode": "initial_only",
    "decimal.handling.mode": "string",
    "time.precision.mode": "connect"
  }
}

Data contract with Avro + Schema Registry

{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.acme.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "event_type", "type": {"type":"enum","name":"EventType","symbols":["CREATED","PAID","CANCELLED","FULFILLED"]}},
    {"name": "amount", "type": {"type":"bytes","logicalType":"decimal","precision":12,"scale":2}},
    {"name": "currency", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "event_time", "type": {"type":"long","logicalType":"timestamp-millis"}},
    {"name": "ingest_time", "type": {"type":"long","logicalType":"timestamp-millis"}, "default": 0}
  ]
}

Lock the compatibility mode. Don’t let a Friday deploy change amount from DECIMAL(12,2) to DOUBLE. I’ve seen that wipe out margin reporting for a week.
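Locking the mode is one registry call. A sketch against the Confluent Schema Registry REST API — `SCHEMA_REGISTRY` is assumed to point at your registry, and `orders.v1-value` is the subject for this post’s example topic:

```shell
# Pin compatibility on the critical subject; breaking schema deploys now fail at registration.
curl -s -X PUT "${SCHEMA_REGISTRY}/config/orders.v1-value" \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "FULL"}'
```

Do this in Terraform or CI, not by hand, so the setting survives a registry rebuild.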

Reliability patterns that catch you when it matters

  • Watermarks + late data: In Flink, define watermarks so windows close predictably and late events route to a DLQ or compensation path.

-- Flink SQL example: paid revenue by minute with a 2m allowed lateness
CREATE TABLE orders_src (
  order_id STRING,
  event_type STRING,
  amount DECIMAL(12,2),
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '2' MINUTE
) WITH (
  'connector'='kafka',
  'topic'='orders.v1',
  'properties.bootstrap.servers'='${BROKERS}',
  'format'='avro-confluent',
  'avro-confluent.schema-registry.url'='${SCHEMA_REGISTRY}'
);

-- Sink table (assumes a Flink ClickHouse connector; option names vary by connector)
CREATE TABLE revenue_minute (
  minute TIMESTAMP(3),
  paid_orders BIGINT,
  revenue DECIMAL(18,2)
) WITH (
  'connector'='clickhouse',
  'url'='jdbc:clickhouse://clickhouse:8123/rt',
  'sink.batch-size'='5000',
  'sink.semantic'='exactly-once'
);

INSERT INTO revenue_minute
SELECT TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS minute,
       SUM(CASE WHEN event_type='PAID' THEN 1 ELSE 0 END) AS paid_orders,
       SUM(CASE WHEN event_type='PAID' THEN amount ELSE 0 END) AS revenue
FROM orders_src
GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE);

  • Idempotency and replays: Use keys and compaction to dedupe; ensure sinks are idempotent (primary keys in ClickHouse, UPSERT to Pinot).

  • Canary pipelines: Mirror input to orders.v1.canary, run new job, and diff outputs against prod. Cut over when diffs < agreed threshold for N hours.

  • Circuit breakers: If quality checks fail, stop propagating to decision surfaces. Fail closed. Your ops director will thank you.

  • Backfill strategy: Version topics (orders.v1 → orders.v2), run a one-shot reprocessor, and sunset old consumers with a deadline. Don’t hotfix schemas in place.
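The dedupe-on-replay idea above boils down to last-write-wins keyed state — a stand-in for what compaction plus an upsert sink buys you. Field names here match this post’s order events; the dict is playing the role of the sink:

```python
def upsert_latest(events):
    """Last-write-wins by (order_id, event_type): replaying the same batch
    is a no-op, which is exactly what makes replays and backfills safe."""
    state = {}
    for e in events:
        key = (e["order_id"], e["event_type"])
        # Keep only the newest version of each logical event.
        if key not in state or e["event_time"] >= state[key]["event_time"]:
            state[key] = e
    return list(state.values())
```

If your sink can’t express this (no primary key, append-only), fix that before you build a replay story on top of it.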

Make data quality a deploy gate

I don’t ship without tests at the edges: expectations at ingress and assertions at egress. If a test fails, the pipeline halts or degrades gracefully.

Expectations with Great Expectations

# great_expectations/expectations/orders_expectations.yml
expectations:
  - expect_column_values_to_not_be_null:
      column: order_id
  - expect_column_values_to_be_between:
      column: amount
      min_value: 0
  - expect_column_values_to_match_regex:
      column: currency
      regex: "^[A-Z]{3}$"
  - expect_column_values_to_be_in_set:
      column: event_type
      value_set: ["CREATED","PAID","CANCELLED","FULFILLED"]

Wire this to your consumer job as a pre-processing gate. On failure: page, park bad events to DLQ (orders.bad.v1), and pause downstream writes.
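A minimal fail-closed gate along those lines — hand-rolled checks standing in for the Great Expectations suite, with `dlq` as a plain list where production code would produce to orders.bad.v1. The 0.1% threshold mirrors the correctness SLO and is an assumption, not a library default:

```python
import re

VALID_TYPES = {"CREATED", "PAID", "CANCELLED", "FULFILLED"}

def validate(event) -> bool:
    """Mirror the ingress expectations: non-null id, non-negative amount,
    ISO-4217-shaped currency, known event type."""
    return (
        event.get("order_id") is not None
        and isinstance(event.get("amount"), (int, float)) and event["amount"] >= 0
        and re.fullmatch(r"[A-Z]{3}", event.get("currency", "")) is not None
        and event.get("event_type") in VALID_TYPES
    )

def gate(events, max_bad_ratio=0.001):
    """Park invalid events; halt (fail closed) if the bad ratio exceeds threshold."""
    good, dlq = [], []
    for e in events:
        (good if validate(e) else dlq).append(e)
    if events and len(dlq) / len(events) > max_bad_ratio:
        raise RuntimeError(f"Quality gate tripped: {len(dlq)}/{len(events)} bad events")
    return good, dlq
```

The raise is the point: bad data should stop the line, not flow into a dashboard finance trusts.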

dbt assertions at the serving layer

# models/rt/fact_orders_rt.yml
version: 2
models:
  - name: fact_orders_rt
    columns:
      - name: order_id
        tests:
          - not_null
      - name: event_type
        tests:
          - accepted_values:
              values: [CREATED, PAID, CANCELLED, FULFILLED]
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [order_id, event_type]

For freshness on sources, use dbt’s source freshness checks and make them part of CI/CD. No merge to main if the canary environment is stale by > 2 minutes.
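That freshness check is a few lines of dbt source config. A sketch — the source/table names match this post’s examples, and `loaded_at_field: ingest_time` assumes the sink records the contract’s ingest_time column:

```yaml
# models/rt/sources.yml
version: 2
sources:
  - name: rt
    tables:
      - name: fact_orders_rt
        loaded_at_field: ingest_time
        freshness:
          warn_after: {count: 2, period: minute}
          error_after: {count: 5, period: minute}
```

`dbt source freshness` then fails CI when the canary environment goes stale, which is exactly the merge gate you want.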

Fail closed in orchestration

# Airflow task sketch
from airflow.exceptions import AirflowFailException

quality_passed = run_great_expectations()
if not quality_passed:
    raise AirflowFailException("Quality gate failed: halting downstream tasks")

Observe everything: lag, freshness, correctness

Dashboards with green boxes don’t mean “healthy.” Instrument the pipeline with metrics that map to your SLOs.

  • Lag: Consumer group lag per topic/partition; alert if > threshold for > N minutes.

  • Freshness: now - max(event_time) at the sink; page if > SLO.

  • Watermark age: now - watermark; catches slow partitions or skewed keys.

  • Reconciliation: CDC counts vs sink counts in sliding windows.

  • Checkpoint health: Flink checkpoint duration and age; stuck checkpoints precede outages.
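The reconciliation bullet reduces to comparing counts in aligned windows with a small tolerance for in-flight events. A sketch with illustrative thresholds:

```python
def reconcile(cdc_count: int, sink_count: int, tolerance: float = 0.005) -> bool:
    """True if sink and CDC counts agree within tolerance for a window.
    The tolerance absorbs events still in flight when the window closed."""
    if cdc_count == 0:
        return sink_count == 0
    drift = abs(cdc_count - sink_count) / cdc_count
    return drift <= tolerance

def windows_out_of_sync(windows, tolerance=0.005):
    """windows: iterable of (window_start, cdc_count, sink_count) tuples."""
    return [w for (w, cdc, sink) in windows if not reconcile(cdc, sink, tolerance)]
```

Run it on sliding 5-minute windows and alert on any window that stays out of sync after a grace period — that’s the check that would have caught the 18-minute-stale dashboard from the intro.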

Prometheus alert rules

groups:
- name: realtime-pipeline
  rules:
  - alert: KafkaConsumerLagHigh
    expr: sum by (consumer_group, topic) (kafka_consumergroup_lag{topic="orders.v1"}) > 5000
    for: 5m
    labels: {severity: page}
    annotations:
      summary: "High consumer lag on {{ $labels.consumer_group }}/{{ $labels.topic }}"
      description: "Lag > 5k for 5m. Investigate throughput, partitioning, or stuck consumers."
  - alert: DataFreshnessSLOViolation
    expr: (time() - max by (stream) (rt_stream_max_event_time_seconds{stream="orders"})) > 120
    for: 2m
    labels: {severity: page}
    annotations:
      summary: "Orders stream freshness > 2m"
      description: "End-to-end freshness SLO violated"

Quick sink-side sanity in ClickHouse

-- Check latest event_time vs wall clock
SELECT now() - max(event_time) AS staleness
FROM fact_orders_rt;

-- Reconciliation: events processed in last 5 minutes
SELECT count() AS sink_count
FROM fact_orders_rt
WHERE event_time > now() - INTERVAL 5 MINUTE;

Prove business value in weeks, not quarters

The CFO doesn’t care about exactly_once_v2. They care about fewer bad decisions and faster good ones. Tie the pipeline to outcomes:

  • Decision latency: “Inventory rebalancing decisions updated every 60s p99”

  • Data downtime: “< 0.1% monthly data downtime on critical streams”

  • MTTR: “Recovered streaming incidents in < 15 minutes median”

  • Correctness: “< 0.1% duplicate/late beyond watermark; no schema-breaking deploys in 90 days”

A recent GitPlumbers engagement for a mid-market retailer replaced batch+cron with CDC+streaming. In 6 weeks:

  • End-to-end p99 latency from 20 minutes → 90 seconds

  • Data downtime from 9 hours/month → 18 minutes/month

  • Inventory misallocation incidents dropped 60% quarter over quarter

  • Finance greenlit promos based on live gross margin instead of yesterday’s CSVs

A rollout you can run in 6 weeks

  1. Week 1: Define SLOs and contracts
    • Identify decision surfaces (pricing, fraud, ops). Write SLOs for latency/freshness/correctness.
    • Draft Avro/JSON Schemas for 3 critical events; lock registry compatibility.
  2. Week 2: Stand up core infra
    • Provision Kafka/Redpanda, Schema Registry, and ClickHouse/Pinot with Terraform.
    • Create topics with partitioning aligned to keys (e.g., order_id). Enable idempotent producers.
  3. Week 3: Ingest and a simple stream
    • Deploy Debezium for CDC of orders and payments.
    • Implement Flink/ksqlDB job for a single, valuable metric (e.g., paid revenue by minute).
  4. Week 4: Quality gates + observability
    • Add Great Expectations at ingress and dbt tests at egress. Wire fail-closed behavior.
    • Instrument with Prometheus and create SLO alerts for lag/freshness.
  5. Week 5: Canary and diff
    • Mirror to *.canary topics; run new pipeline in parallel.
    • Build a nightly diff job to compare counts/sums/distributions; burn down divergences.
  6. Week 6: Cutover and report
    • Flip consumers to the v1 → v2 topics; keep the replay plan documented.
    • Publish a one-page impact report: before/after SLOs and business KPIs.
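The week-5 diff job can start embarrassingly simple: compare per-window aggregates between the prod and canary outputs. A sketch — real jobs also diff distributions, and the 0.1% relative tolerance is an assumption you should agree on up front:

```python
def diff_outputs(prod: dict, canary: dict, rel_tol: float = 0.001):
    """Compare per-window aggregates (e.g. minute -> revenue) between prod
    and canary pipelines; return windows whose relative difference exceeds
    rel_tol, plus windows missing from either side."""
    divergent = []
    for window in sorted(set(prod) | set(canary)):
        p, c = prod.get(window), canary.get(window)
        if p is None or c is None:
            divergent.append((window, p, c))  # missing on one side
        elif p != c and abs(p - c) / max(abs(p), abs(c)) > rel_tol:
            divergent.append((window, p, c))
    return divergent
```

Cut over only when this returns empty (or below your agreed threshold) for N consecutive hours.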

Optional: materialized aggregates for serving

-- ClickHouse: real-time aggregate for UI/API
-- (assumes agg_paid_orders exists, e.g. a SummingMergeTree table)
CREATE MATERIALIZED VIEW mv_paid_orders
TO agg_paid_orders AS
SELECT
  toStartOfMinute(event_time) AS minute,
  countIf(event_type='PAID') AS paid_orders,
  sumIf(amount, event_type='PAID') AS revenue
FROM orders_stream
GROUP BY minute;

What I’d do differently next time

  • Bake in a replay story day 1. You will need it.
  • Treat schema changes like API changes. Changelogs, deprecations, deadlines.
  • Page on “freshness SLO violated,” not “CPU > 80%.”
  • Use Redpanda if your team is thin on Kafka ops; it’s been boring in the best way.
  • Don’t over-index on tool fashion. Flink, ksqlDB, or Kafka Streams can all work if you honor the SLOs and contracts.

Real-time isn’t about speed; it’s about trust. If decision-makers don’t trust the stream, they’ll go back to spreadsheets — and you’ll go back to firefighting.

Key takeaways

  • Define “real-time” with SLOs: end-to-end p99 latency, freshness, and correctness — not vibes.
  • Anchor on a survivable architecture: CDC → Kafka/Redpanda with schema registry → stream processing with watermarks → OLAP store → serving layer.
  • Enforce data quality as a deploy gate using data contracts and expectations; fail closed when it counts.
  • Instrument the pipeline: lag, watermark latency, freshness, and reconciliation. Alert on SLOs, not just CPU.
  • Roll out with canaries, dual writes, and automated output diffs to prove zero-regret changes.
  • Tie the pipeline to dollars: decisions, saved incidents, and reduced data downtime — not just throughput.

Implementation checklist

  • Write SLOs: p99 end-to-end latency, freshness, completeness, and correctness.
  • Stand up CDC with Debezium; create partitioned topics with idempotent producers.
  • Define Avro/JSON Schema contracts; enforce via schema registry compatibility rules.
  • Add windowing + watermarks in Flink/ksqlDB; design for idempotency and replays.
  • Implement expectations (Great Expectations/dbt) and fail closed on critical violations.
  • Instrument with Prometheus + Grafana; alert on lag and freshness SLOs.
  • Run a canary stream; diff outputs and burn down divergences before cutting over.
  • Report business impact: decision latency reduced, MTTR down, data downtime down.

Questions we hear from teams

How “real-time” do I actually need?
Start from the decision. If a human action or automated control loop needs an update within 2 minutes to avoid loss or capture upside, set p99 end-to-end latency at 120s and back into partitioning, parallelism, and windowing that meet it. For fraud or ads, 1–5 seconds might be warranted. Document degraded modes if you miss it.
Do I really need Flink?
If you have stateful joins, exactly-once sinks, and watermark logic, yes — Flink or ksqlDB makes it maintainable. If it’s simple ETL with low state, Kafka Streams can be fine. Don’t write a bespoke consumer unless you’re ready to rebuild state, checkpointing, and backpressure handling.
Why not just use my data warehouse for everything?
Warehouses (BigQuery/Snowflake) are great for analytics, not sub-second operational decisions. You can land streams there for broader analysis, but your real-time control surfaces want a low-latency OLAP store like ClickHouse/Pinot with materialized views and streaming sinks.
How do I handle schema evolution without breaking consumers?
Data contracts + schema registry with strict compatibility. Use additive changes, version topics for breaking changes (v1 → v2), and run a canary with output diffing. Communicate deprecation dates like an API; don’t hot-swap types in place.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about your real-time pipeline
See how we fix broken data platforms