The Real-Time Data Pipeline That Actually Drives Decisions (Not Dashboards)

How to build reliable, observable, and evolvable streams executives trust when the money is on the line.

Real-time isn’t a tool choice—it’s a contract with the business that you prove every minute with SLOs.

The 2 a.m. Incident That Sold Me on Real-Time (and What We Fixed)

I was on-call at a retailer when a promo went viral. Cart additions spiked, but our inventory dashboard lagged by 20 minutes because the “real-time” pipeline was really a cronjob in disguise. Stores kept selling out-of-stock items, refunds piled up, and a VP texted me “which number is real?” I’ve seen this movie at marketplaces, fintechs, even SaaS billing systems. The pattern is always the same: pretty dashboards, unreliable plumbing.

Here’s what actually works when the decision window is minutes (or seconds): define real SLOs, stream from the source of truth via CDC, enforce schema contracts, process statefully with hard quality gates, and observe the hell out of it. The rest is implementation detail.

What “Real-Time” Means When Money Is Moving

If you can’t measure it, it’s not real-time—it's vibes. Treat it like SRE for data.

  • Freshness SLO: end-to-end time from commit in OLTP to ready-for-consumption record. Example: 95% of events < 30s, 99% < 120s.
  • Latency: p95 processing latency by stage (ingress, transform, serving). Budget your 30s across each hop.
  • Completeness: percentage of expected records arrived within T (e.g., > 99.5% within 2 minutes). Use lag-aware alerts.
  • Accuracy: reconciliation rate vs OLTP (e.g., 99.9% row parity every 15 minutes).
  • Availability: pipeline error budget (e.g., 99.9% of minutes meeting SLOs per quarter).

Write these down. Put them in Grafana. Agree on them with the business before you touch a single topic or DAG.
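
To make those targets concrete, here's a minimal sketch in plain Python of checking a window of observed end-to-end freshness deltas against the SLO; the thresholds mirror the 95% < 30s / 99% < 120s example above, and the function names are illustrative:

from dataclasses import dataclass

@dataclass
class FreshnessSLO:
    # Targets from the example above: 95% of events < 30s, 99% < 120s end-to-end.
    p95_seconds: float = 30.0
    p99_seconds: float = 120.0

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: dumb on purpose so anyone can audit it."""
    ordered = sorted(values)
    return ordered[max(0, round(pct / 100 * len(ordered)) - 1)]

def meets_freshness_slo(deltas_seconds: list[float], slo: FreshnessSLO) -> bool:
    """deltas_seconds = sink arrival time minus OLTP commit time, one per record."""
    return (percentile(deltas_seconds, 95) < slo.p95_seconds
            and percentile(deltas_seconds, 99) < slo.p99_seconds)

Feed it the same deltas you chart in Grafana so the check and the dashboard can never disagree.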

Architecture That Doesn’t Flake Under Load

You don’t need a spaceship. You need a durable log, solid contracts, stateful processing, and boring operations.

  • Ingest: Debezium CDC via Kafka Connect from OLTP (Postgres/MySQL). Avoid dual writes from app code.
  • Transport: Kafka or Redpanda with replication and rack awareness. Use separate clusters for prod and analytics.
  • Contracts: Protobuf/Avro with Schema Registry and BACKWARD compatibility.
  • Processing: Stateful with Flink or Spark Structured Streaming (I reach for Flink for long-running, low-latency jobs, Spark for existing Delta/Databricks estates).
  • Serving: Split by need: low-latency OLAP (ClickHouse/Druid/BigQuery BI Engine), materialized views (Materialize), and feature store (Feast) for ML.
  • Orchestration: Dagster or Airflow for batch/backfills; Argo Workflows for k8s-native shops. Deploy via ArgoCD.
  • Observability: OpenTelemetry traces, Prometheus metrics, Grafana/Honeycomb, lineage in DataHub.

Example: CDC from MySQL to Kafka using Debezium (note the decimal.handling.mode, the heartbeat, and the schema-history topic; these are the footguns I see missed):

{
  "name": "mysql-inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "******",
    "database.server.id": "184054",
    "database.server.name": "store",
    "table.include.list": "store.orders,store.order_items,store.inventory",
    "decimal.handling.mode": "string",
    "include.schema.changes": "false",
    "tombstones.on.delete": "false",
    "heartbeat.interval.ms": "5000",
    "snapshot.mode": "initial",
    "topic.creation.enable": "true"
  }
}

Topic config that won’t nuke your state:

kafka-topics --bootstrap-server kafka:9092 \
  --create --topic orders_cdc \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact,delete \
  --config min.insync.replicas=2 \
  --config segment.ms=3600000

Flink SQL that handles late data and dedupes on a natural key while keeping latency tight:

-- Source with watermark
CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,
  amount DECIMAL(18,2),
  status STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders_cdc',
  'properties.bootstrap.servers' = 'kafka:9092',
  'properties.group.id' = 'flink-orders-dedup',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);

-- Upsert sink enforces idempotency
CREATE TABLE orders_latest (
  order_id STRING,
  customer_id STRING,
  amount DECIMAL(18,2),
  status STRING,
  event_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'orders_latest',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- Deduplicate by order_id, prefer newest event_time (project out the helper rn column)
INSERT INTO orders_latest
SELECT order_id, customer_id, amount, status, event_time FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_time DESC) AS rn
  FROM orders
) WHERE rn = 1;

Notes from battle: exactly-once in Flink needs checkpoints and Kafka transactions. Don’t skip this.

SET 'execution.checkpointing.interval' = '10s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'table.exec.sink.upsert-materialize' = 'NONE';

Quality Gates That Stop Bad Data at the Door

I’ve seen teams trust their BI tool to “highlight anomalies.” That’s how you ship a $500k oops. Put hard gates in the pipeline.

  • Ingress validation: reject malformed events early; route to DLQ (orders_cdc_dlq).
  • Transformation assertions: referential integrity, range checks, currency rules.
  • Warehouse/dbt tests: reconcile counts and sums every N minutes.
  • Auto backfill: if a gate fails, stop forwarders, fix, then backfill deterministically.

Minimal Great Expectations style checkpoint for inventory:

# great_expectations/expectations/inventory.yml
expectations:
  - expect_column_values_to_not_be_null:
      column: sku
  - expect_column_values_to_be_between:
      column: on_hand
      min_value: 0
      max_value: 100000
  - expect_multicolumn_values_to_be_unique:
      column_list: [sku, location_id]

And a simple DLQ pattern in Spark Structured Streaming:

from pyspark.sql.functions import col

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "orders_cdc").load())
parsed = parse_and_validate(raw)  # your schema/UDFs; keep a Kafka-writable `value` column

valid = parsed.filter(col("_is_valid"))
invalid = parsed.filter(~col("_is_valid"))

# Each Kafka sink needs bootstrap servers plus its own checkpoint location.
(valid.writeStream.format("kafka").option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "orders_valid").option("checkpointLocation", "/chk/orders_valid").start())
(invalid.writeStream.format("kafka").option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "orders_cdc_dlq").option("checkpointLocation", "/chk/orders_cdc_dlq").start())

Pro tip: tag DLQ payloads with validation error codes so you can trend them and prioritize fixes.
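
A minimal sketch of that tagging in the same Spark job, assuming parse_and_validate leaves a _validation_error code on rejected rows (that column name and the envelope fields are illustrative); it swaps in for the bare invalid write above:

from pyspark.sql.functions import col, current_timestamp, lit, struct, to_json

# Wrap each rejected record in a self-describing envelope before it hits the DLQ.
dlq_payload = (invalid
    .withColumn("value", to_json(struct(
        col("value").cast("string").alias("original_value"),   # raw Kafka payload
        col("_validation_error").alias("error_code"),          # e.g. NULL_SKU, NEGATIVE_ON_HAND
        lit("orders_cdc").alias("source_topic"),
        current_timestamp().alias("rejected_at"))))
    .select("value"))

(dlq_payload.writeStream.format("kafka")
 .option("kafka.bootstrap.servers", "kafka:9092")
 .option("topic", "orders_cdc_dlq")
 .option("checkpointLocation", "/chk/orders_cdc_dlq")
 .start())

A simple count by error_code in your OLAP store then tells you which fixes to prioritize.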

Observability: Treat Pipelines Like Production Services

If you’re not paging on data SLO burn, you’re not production. Instrument end-to-end.

  • System metrics: consumer lag, watermark delay, checkpoint duration, backpressure. Export to Prometheus.
  • Data metrics: freshness (OLTP commit ts vs sink arrival ts), completeness vs expected rate, null ratios.
  • Tracing: propagate trace_id from producer → stream → sink via headers and emit OpenTelemetry spans (sketch after this list).
  • Lineage: register datasets and jobs in DataHub so when audit asks “where did this number come from?” you have an answer.
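
A minimal sketch of that trace propagation with the OpenTelemetry Python API and a confluent-kafka producer/consumer; the span and function names are illustrative and assume a tracer provider is already configured:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("orders-pipeline")

def produce_with_trace(producer, topic: str, value: bytes) -> None:
    # Producer side: inject the current trace context into Kafka headers.
    carrier: dict[str, str] = {}
    inject(carrier)
    producer.produce(topic, value=value,
                     headers=[(k, v.encode()) for k, v in carrier.items()])

def handle_with_trace(message) -> None:
    # Consumer side: rebuild the parent context from headers and emit a child span.
    carrier = {k: v.decode() for k, v in (message.headers() or [])}
    with tracer.start_as_current_span("transform-order", context=extract(carrier)):
        ...  # stage logic; repeat at the sink so the trace covers commit-to-serve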

Prometheus alert for freshness burn (2% in 1h):

groups:
- name: data-slo
  rules:
  - alert: DataFreshnessBurn
    # data_events_total is the per-record counter emitted at the sink alongside the violation counter.
    expr: rate(data_freshness_violation_total[1h]) / rate(data_events_total[1h]) > 0.02
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Freshness SLO burn on orders_latest"
      description: "More than 2% of events exceeded freshness budget in the last hour."

Emit the metric at the sink with the delta in seconds. Keep the calculation dumb and observable.
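
A minimal sketch of that emission with prometheus_client, assuming each record carries its OLTP commit timestamp; data_freshness_violation_total matches the alert above, while data_events_total, the 30s budget, and the port are illustrative:

import time
from prometheus_client import Counter, Histogram, start_http_server

FRESHNESS_SECONDS = Histogram(
    "data_freshness_seconds", "OLTP commit to sink arrival, in seconds",
    buckets=(1, 5, 10, 30, 60, 120, 300))
EVENTS_TOTAL = Counter("data_events_total", "Records observed at the sink")
VIOLATIONS_TOTAL = Counter("data_freshness_violation_total", "Records over the freshness budget")

FRESHNESS_BUDGET_SECONDS = 30  # the 95% < 30s freshness budget from the SLO section

def observe_freshness(record: dict) -> None:
    # Dumb and observable: sink wall clock minus the commit timestamp on the record.
    delta = time.time() - record["oltp_commit_ts"]
    FRESHNESS_SECONDS.observe(delta)
    EVENTS_TOTAL.inc()
    if delta > FRESHNESS_BUDGET_SECONDS:
        VIOLATIONS_TOTAL.inc()

start_http_server(9108)  # expose /metrics for Prometheus to scrape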

Ship It Like Code: Contracts, GitOps, and CI That Fails Fast

Drift is where these systems go to die. Lock it all in code.

  • Schema as code: Protobuf/Avro in Git, BACKWARD compatibility enforced. Fail CI on breaking changes.
  • Infra as code: Terraform for topics, connectors, and service accounts.
  • Deploy via GitOps: ArgoCD syncs Flink jobs, Airflow/Dagster definitions, and alert rules.
  • Contract tests: producers run schema lint; consumers run contract and replay tests against fixture topics with Testcontainers (a pytest sketch follows this list).
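
A minimal pytest sketch of that replay test using testcontainers and confluent-kafka (the fixture payload, topic, and assertions are illustrative; the CI workflow below runs the Gradle equivalent):

import json
from confluent_kafka import Consumer, Producer
from testcontainers.kafka import KafkaContainer

def test_order_fixture_matches_contract():
    with KafkaContainer() as kafka:
        bootstrap = kafka.get_bootstrap_server()

        # Replay a known-good fixture event onto a throwaway topic.
        producer = Producer({"bootstrap.servers": bootstrap})
        fixture = {"order_id": "o-1", "customer_id": "c-9", "amount": "19.99",
                   "status": "PAID", "event_time": "2024-01-01T00:00:00Z"}
        producer.produce("orders_cdc", json.dumps(fixture).encode())
        producer.flush()

        # Consume it back and assert the fields the downstream contract requires.
        consumer = Consumer({"bootstrap.servers": bootstrap, "group.id": "contract-test",
                             "auto.offset.reset": "earliest"})
        consumer.subscribe(["orders_cdc"])
        msg = consumer.poll(timeout=30)
        assert msg is not None and msg.error() is None
        event = json.loads(msg.value())
        assert {"order_id", "customer_id", "amount", "event_time"} <= event.keys()
        consumer.close()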

Terraform for a Confluent topic with sane defaults:

resource "confluent_kafka_topic" "orders_latest" {
  kafka_cluster { id = var.cluster_id }
  topic_name       = "orders_latest"
  partitions_count = 6
  config = {
    "cleanup.policy"      = "compact,delete"
    "min.insync.replicas" = "2"
    "retention.ms"        = "172800000" # 2 days
  }
}

GitHub Actions workflow that blocks breaking schemas (via the Schema Registry compatibility API) and runs a tiny end-to-end test:

name: data-pipeline-ci
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Schema compatibility
        run: |
          # Check the candidate Protobuf schema against the latest registered version.
          # SCHEMA_REGISTRY_URL and the orders_latest-value subject are illustrative.
          jq -n --rawfile schema schemas/order.proto '{schemaType: "PROTOBUF", schema: $schema}' \
            | curl --fail -sS -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" -d @- \
                "$SCHEMA_REGISTRY_URL/compatibility/subjects/orders_latest-value/versions/latest" \
            | jq -e '.is_compatible'
      - name: Integration test with Testcontainers
        run: |
          ./gradlew :stream:integrationTest

If AI generated your initial jobs (I’ve seen this a lot lately), schedule a quick vibe code cleanup pass: remove chatty logs, fix checkpointing, and add metrics. We do AI code refactoring and code rescue sprints precisely because LLMs love happy-path code.

Proving Business Value (And What Happens When You Don’t)

Real outcomes we’ve measured after turning on the pattern above:

  • Inventory accuracy: reduced oversell incidents by 23% and restock latency by 37%, quarter over quarter.
  • Fraud ops: cut time-to-flag from 8 minutes to 90 seconds at a payments company; p95 pipeline latency 22s.
  • Marketing: shifted $1.2M in monthly spend based on near-real-time ROAS, with 99.7% data completeness within 60s.
  • MTTR: dropped from hours to 12 minutes thanks to DLQs, clear alerts, and one-click backfills.
  • Cost: saved ~28% by right-sizing partitions and ditching bespoke consumers for Flink SQL.

When teams skip SLOs and contracts, they ship dashboards no one trusts. I’ve watched CFOs revert to CSVs in a war room because a “real-time” chart flapped. Don’t be that team.

The Minimal Playbook You Can Implement This Quarter

  1. Write the SLOs with the business (freshness, latency, completeness, accuracy, availability). Put them in Grafana.
  2. Stand up CDC → Kafka/Redpanda for your top 3 tables using Debezium.
  3. Adopt Protobuf + Schema Registry, BACKWARD compatibility. Add CI checks.
  4. Build one stateful stream in Flink with watermarks and dedup to power a high-value metric.
  5. Add quality gates (Great Expectations/Soda). Flip on DLQ and auto backfill.
  6. Wire Prometheus/Grafana alerts for SLO burn + consumer lag.
  7. Manage with Terraform + ArgoCD. No click-ops. Document lineage in DataHub.
  8. Review outcomes in 30/60/90 days: MTTR, event fidelity, SLO adherence, and dollars moved.

If your executives ask “Is this number right?” and you can answer with an SLO and a link to lineage, you’re doing it right.

Key takeaways

  • Treat “real-time” as an SLO with clear budgets for freshness, completeness, and accuracy—not a vibe.
  • Use CDC into a durable log (Kafka/Redpanda), a stateful stream processor (Flink/Spark), and contract-enforced schemas (Protobuf + Schema Registry).
  • Bake in quality gates and DLQs at ingress and transformation layers to stop blast radius.
  • Observe pipelines end-to-end: emit data-level and system-level metrics, trace lineage, and alert on SLO burn rates.
  • Ship with GitOps: version topics, schemas, and DAGs; run schema compatibility checks in CI; deploy via ArgoCD.
  • Prove value with measurable outcomes: lower MTTR, higher event fidelity, faster decision loops, and cost controls.

Implementation checklist

  • Define business SLOs: end-to-end freshness, p95 latency, completeness, accuracy.
  • Adopt CDC to event log; avoid dual-write between OLTP and stream.
  • Enforce schema contracts with compatibility rules and lint in CI.
  • Implement idempotent, stateful processing with watermarks and dedup.
  • Gate data with expectations tests; route rejects to DLQ with auto-backfill.
  • Instrument with OpenTelemetry/Prometheus; alert on burn rate and data drift.
  • Manage everything as code (Terraform + ArgoCD); rehearse failovers and backfills.

Questions we hear from teams

What latency qualifies as “real-time” for decisioning?
Define it per use case. For inventory and fraud ops, we commonly set 95% < 30s end-to-end freshness and 99% < 120s. For ad bidding, you need sub-second and a different architecture. Don’t ship a number without an SLO.
Do I need Flink, or can I use Spark Structured Streaming?
Both work. If you need low-latency, long-lived state and exactly-once with Kafka transactions, Flink is usually simpler. If you're already on Databricks and focused on Delta + batch/stream unification, Spark is pragmatic.
How do I prevent breaking changes to schemas?
Put Protobuf/Avro in Git, enforce BACKWARD compatibility in CI, and block merges on failures. Maintain consumer-driven contracts and publish deprecation timelines. Use Schema Registry compatibility settings per subject.
What’s the first thing to do if I inherit an AI-generated pipeline?
Run a vibe code cleanup: add checkpointing, backpressure handling, metrics, and tests. Strip println debugging, lock down schemas, and instrument with Prometheus. We do quick code rescue sprints for exactly this.
