When ‘Real‑Time’ Lies to Finance: Building Streaming Pipelines You Can Take to the Board

Your CEO doesn’t care about Kafka. They care that the number on the dashboard is right, now. Here’s how to build real‑time data pipelines that are fast, correct, and tied to business value.

“If you can’t replay to the same answer, your pipeline isn’t real‑time—it’s gambling.”

The 3 a.m. page you remember

I’ve watched a VP Finance stare at a “real‑time” revenue dashboard that read flat while orders were actually spiking 28% during a promotion. The culprit wasn’t Kafka throughput; it was incorrect joins and late-arriving events silently dropped by a streaming job after a schema change. Sales lost their nerve, throttled marketing, and left money on the table. I’ve seen this fail across stacks: bespoke Spark jobs, over‑engineered KStreams, and even Snowflake tasks pretending to be streams.

Real‑time isn’t a badge. It’s a promise that numbers are both fast and correct when a business decision is on the line. If Finance and Ops can’t bet the quarter on your pipeline, it isn’t real‑time—it’s expensive cosplay.

Non‑negotiables for real‑time truth

If you don’t write these down, you will violate them.

  • Latency SLO: P99 end‑to‑end (producer → dashboard) under 5s for ops decisions; under 60s for finance. Define the exact path and measure it.
  • Freshness: Data age on dashboard under 10s for ops; under 2m for finance rollups.
  • Completeness: 100% of events eventually processed; late data window explicitly defined (e.g., 15m watermark) with zero data loss by design.
  • Accuracy: Aggregations reconcile to source-of-truth within ±0.1% daily for finance; zero double‑counting in ops.
  • MTTR: Stream failures detected and mitigated in <10m with auto-fallbacks.

Write SLOs where the business can see them. Tie them to dashboards. Alert only on SLO burn, not every hiccup.

A reference architecture that doesn’t lie

You don’t need every CNCF logo. You need a small set that plays nicely under pressure.

  • Ingest (CDC): Debezium via Kafka Connect pulling from Postgres/MySQL/SQL Server. Avoid DIY binlog parsers.
  • Broker & Contracts: Kafka with Schema Registry using protobuf or avro and BACKWARD_TRANSITIVE compatibility.
  • Stream Processing: Apache Flink with checkpointing and exactly-once sinks; watermarking based on event time.
  • Quality & Lineage: In‑stream assertions + Great Expectations/Deequ at the sinks; lineage via OpenLineage/Marquez.
  • Storage & Serving:
    • Operational analytics: ClickHouse or Apache Pinot for sub‑second queries.
    • Finance & historical: Snowflake/BigQuery with Iceberg/Delta for versioned lake and replay.
    • Feature serving: Redis/Feast if you’re doing ML decisions inline.
  • Orchestration & Ops: Terraform for infra; ArgoCD for GitOps; Prometheus + Grafana + Alertmanager for signals.

If you can’t replay from durable history to the exact same results, you didn’t build a pipeline—you built a one‑off ETL script with a Slack channel.

Concrete bits that matter:

  • Kafka topics configured with min.insync.replicas=2, cleanup.policy=compact or delete as appropriate, and partition counts sized to consumer concurrency; producers with acks=all and idempotence enabled (producer sketch below).
  • Flink checkpoints to durable object storage (e.g., s3://company-flink/ckpts) every 5s, exactly-once sinks, and two-phase commit.
  • Idempotent keys for events (order_id, event_id) and dedupe logic at the sink.
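
For the producer side of that first bullet, here is a minimal sketch using the plain Java client; the broker address comes from the examples below, and StringSerializer is a stand‑in for the Avro/Protobuf Schema Registry serializer you would actually use:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

// Producer-side counterpart to the topic guardrails: acks=all pairs with
// min.insync.replicas=2, and idempotence removes duplicates introduced by retries.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);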

Example: Debezium connector for the orders tables:

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db",
    "database.port": "5432",
    "database.user": "cdc",
    "database.password": "***",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.order_items",
    "tombstones.on.delete": "false",
    "snapshot.mode": "initial_only",
    "topic.prefix": "cdc.orders",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changelog"
  }
}

Flink job sketch with exactly-once Kafka sink:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000);

KafkaSource<OrderEvent> source = ...; // cdc.orders topic
WatermarkStrategy<OrderEvent> wm = WatermarkStrategy
  .<OrderEvent>forBoundedOutOfOrderness(Duration.ofMinutes(15))
  .withTimestampAssigner((e, ts) -> e.eventTimeMillis);

DataStream<OrderEvent> events = env.fromSource(source, wm, "orders");

DataStream<OrderAgg> aggs = events
  .keyBy(e -> e.orderId)
  .process(new DedupAndValidate())
  .keyBy(e -> e.sku)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  .aggregate(new RevenueAgg());

KafkaSink<OrderAgg> sink = KafkaSink.<OrderAgg>builder()
  .setBootstrapServers("kafka:9092")
  .setRecordSerializer(...)
  .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  .setTransactionalIdPrefix("orders-agg") // required for EXACTLY_ONCE (Kafka transactions)
  .build();

aggs.sinkTo(sink);
env.execute("orders-agg");
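
The DedupAndValidate step above carries the completeness and accuracy SLOs, so here is a minimal sketch of one way to implement it, assuming OrderEvent exposes eventId and totalAmount (the same hypothetical fields used in the guardrail later in this post) and the stream is keyed by orderId as a String:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by orderId upstream. Drops duplicate deliveries (same eventId already seen)
// and obviously invalid records before they can poison the windowed aggregate.
public class DedupAndValidate extends KeyedProcessFunction<String, OrderEvent, OrderEvent> {

  private transient MapState<String, Boolean> seenEventIds;

  @Override
  public void open(Configuration parameters) {
    seenEventIds = getRuntimeContext().getMapState(
        new MapStateDescriptor<>("seen-event-ids", String.class, Boolean.class));
  }

  @Override
  public void processElement(OrderEvent e, Context ctx, Collector<OrderEvent> out) throws Exception {
    if (e.eventId == null || e.totalAmount == null || e.totalAmount < 0) {
      return; // in the real job this goes to the DLQ, not the void
    }
    if (seenEventIds.contains(e.eventId)) {
      return; // at-least-once delivery upstream; drop the duplicate
    }
    seenEventIds.put(e.eventId, Boolean.TRUE);
    out.collect(e);
  }
}

In production, put a TTL on that keyed state (Flink's StateTtlConfig) so dedupe state doesn't grow without bound, and route rejects to the DLQ instead of dropping them silently.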

Create Kafka topics with guardrails:

kafka-topics.sh --create --topic cdc.orders --bootstrap-server kafka:9092 \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2 --config cleanup.policy=delete --config retention.ms=259200000

kafka-topics.sh --create --topic curated.orders_agg_v1 --bootstrap-server kafka:9092 \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2 --config cleanup.policy=compact

Quality gates, canaries, and circuit breakers

Quality is not a nightly job. It lives in the stream.

  • Data contracts in Git: Define payloads in protobuf or avro; enforce via Schema Registry with BACKWARD_TRANSITIVE compatibility. PRs that change schemas must include migration notes and consumer impact.
  • In‑stream assertions: Null‑rate, range checks, categorical domain checks. Fail closed when critical fields misbehave.
  • DLQs: Route rejects with full context to a -dlq topic; cap the rate so you don’t DDoS yourself.
  • Canary pipelines: Mirror topics (topic-canary) and run the same job against a small % of traffic. Promote only when checks pass.
  • Circuit breakers: If error budget burn rate > threshold, trip to a safe fallback (e.g., last known good aggregate); a minimal sketch follows the guardrail below.

Sample Great Expectations check for a curated table:

expectation_suite_name: orders_agg_suite
expectations:
  - expect_column_values_to_not_be_null:
      column: order_id
  - expect_column_values_to_be_between:
      column: total_amount
      min_value: 0
      max_value: 100000
  - expect_column_values_to_match_regex:
      column: currency
      regex: "^(USD|EUR|GBP)$"

Simple Flink guardrail for null‑blast:

if (event.totalAmount == null || event.totalAmount < 0) {
  metrics.counter("rejects").inc();
  dlqProducer.send(event.withReason("bad_total_amount"));
  return; // fail fast; don't poison aggregates
}
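
The circuit breaker itself doesn't need a framework; here is a minimal sketch of the trip logic, with thresholds and the wiring to your last-known-good fallback left to you:

// Trips when the reject ratio over the current evaluation window burns the quality error budget.
public class QualityCircuitBreaker {

  private final double maxRejectRatio; // e.g., 0.02 trips at a 2% reject rate
  private final long minEvents;        // don't trip on tiny samples
  private long accepted;
  private long rejected;
  private boolean open;                // open == serve the fallback, not fresh aggregates

  public QualityCircuitBreaker(double maxRejectRatio, long minEvents) {
    this.maxRejectRatio = maxRejectRatio;
    this.minEvents = minEvents;
  }

  public void record(boolean passed) {
    if (passed) accepted++; else rejected++;
    long total = accepted + rejected;
    if (!open && total >= minEvents && (double) rejected / total > maxRejectRatio) {
      open = true;
    }
  }

  public boolean isOpen() { return open; }

  // Call on each evaluation tick (e.g., every minute); closing the breaker after an
  // incident should be a deliberate decision, not an automatic reset.
  public void resetWindow() { accepted = 0; rejected = 0; }
}

Call record() next to the guardrail above; when isOpen() flips, stop publishing fresh aggregates, serve the last known good number, and page on SLO burn rather than letting a half-broken figure reach an exec dashboard.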

We’ve blocked countless 3 a.m. incidents with this trio: contracts, canaries, and circuit breakers. It’s boring in the best way.

Observability: measure what the business cares about

Infra metrics are table stakes. Tie telemetry to decisions.

  • End‑to‑end latency: Stamp ingest_ts at the producer and propagate to sinks; expose P50/P95/P99 in Grafana by stream and by dashboard (see the sketch after this list).
  • Freshness: Max event_time vs now per table; alert when staleness > SLO.
  • Consumer lag and watermark delay: Alert on sustained lag + growing watermark delay; don’t alert on momentary spikes.
  • Quality drift: Null‑rate, out‑of‑domain, schema violations. Alert on burn rate (e.g., 2x over 15m) not single blips.
  • KPI correlation: Track downstream KPI deltas (e.g., revenue vs previous hour) to spot breakage even when systems say “green.”
  • Lineage: OpenLineage/Marquez to show which dashboard depends on which topic/job/table. Crucial during incidents.
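
To make the end‑to‑end latency bullet concrete: a minimal sketch of measuring producer‑to‑job lag inside the Flink job, assuming events carry a hypothetical ingestTsMillis stamped by the producer. A gauge keeps the sketch short; in practice you would register a histogram to get P50/P95/P99.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;

// Pass-through map that exposes the age of each event (producer stamp -> now)
// as a Flink metric, scraped by Prometheus and graphed per stream in Grafana.
public class IngestLagMeter extends RichMapFunction<OrderEvent, OrderEvent> {

  private long lastLagMs;

  @Override
  public void open(Configuration parameters) {
    getRuntimeContext().getMetricGroup().gauge("ingestLagMs", (Gauge<Long>) () -> lastLagMs);
  }

  @Override
  public OrderEvent map(OrderEvent e) {
    lastLagMs = System.currentTimeMillis() - e.ingestTsMillis; // ingestTsMillis: hypothetical producer stamp
    return e;
  }
}

Wire it in right after env.fromSource(...); the last hop to the dashboard is covered by the freshness check (max event_time vs now) in the serving store.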

A runbook that actually helps:

  1. Page on SLO burn (freshness/latency/completeness), not CPU.
  2. Check lineage to find the broken edge first.
  3. Inspect DLQ samples; if schema drift, pin producers via feature flag.
  4. Trip canary circuit breaker if quality error budget is burning.
  5. Initiate replay from Iceberg/Delta snapshot. Validate with reconciliation query.

We’ve consistently taken MTTR from hours to minutes by moving from infra‑only alerts to decision‑centric SLOs.

Governance, replay, and DR you’ll actually use

The day you need replay is not the day to discover you can’t.

  • Versioned storage: Land curated datasets to Iceberg/Delta with snapshot IDs. Tag snapshots with job commit IDs.
  • Deterministic jobs: Pure functions from input → output keyed by event IDs; no hidden wall‑clock dependencies (see the sketch after this list).
  • Replays: Store raw CDC in long‑retention Kafka + lake. Reprocess with a fixed container image/tag. Write to a new versioned table orders_agg_v2 and flip consumers.
  • Schema evolution: BACKWARD_TRANSITIVE means you can add fields, not remove or repurpose. Deprecate fields with a sunset date.
  • DR: Cross‑cluster replication with MirrorMaker 2 or Cluster Linking; periodic failovers. Replicate both topics and schemas.
  • Access controls: Kafka ACLs + IAM; only curated topics feed analytics. Producers cannot write directly to curated.
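
To make "deterministic" concrete, the cheapest habit is deriving output keys purely from event data and the event‑time window. A minimal sketch, with AggKey as a hypothetical helper and the sku/window names borrowed from the job earlier in this post:

// Deterministic upsert key: a pure function of the event data and the event-time
// window start -- never wall-clock time, UUIDs, or per-run hash seeds -- so replaying
// the same input through the same image produces the same keys and the sink
// overwrites instead of double-counting.
public final class AggKey {

  private AggKey() {}

  public static String of(String sku, long windowStartMillis) {
    return sku + "|" + windowStartMillis;
  }
}

Inside a ProcessWindowFunction the window start is available via context.window().getStart(); land rows keyed this way in an upsert‑capable table (ClickHouse ReplacingMergeTree, Iceberg MERGE) and a replay into orders_agg_v2 overwrites rather than double‑counts.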

Terraform a boring, repeatable baseline:

module "kafka" {
  source = "terraform-aws-modules/msk/aws"
  kafka_version = "3.6.0"
  number_of_broker_nodes = 6
  broker_node_client_subnets = var.private_subnets
}

resource "argocd_application" "orders_agg" {
  metadata { name = "orders-agg" }
  spec {
    source { repo_url = var.repo url = "git@github.com:company/data-pipelines.git" path = "flink/orders-agg" target_revision = "v1.4.2" }
    destination { server = "https://kubernetes.default.svc" namespace = "streams" }
    sync_policy { automated { prune = true self_heal = true } }
  }
}

Ship value in 90 days: a playbook we’ve used

Pick one decision loop. Ship it. Measure it. Then expand.

Days 0–15: Proof of correctness

  • Instrument source systems to emit stable IDs and timestamps. Stand up Debezium → Kafka with Schema Registry.
  • Define data contracts in Git; enable BACKWARD_TRANSITIVE compatibility.
  • Build a minimal Flink job with watermarking and exactly-once to produce a curated orders_agg_v1.
  • Stand up ClickHouse with a materialized view for the target dashboard.
  • Add Great Expectations checks and DLQ; wire Prometheus metrics for lag/freshness.

Days 16–45: Productionizing

  • GitOps the pipeline via ArgoCD; checkpoint Flink to S3; size partitions and consumers.
  • Add canary topics; wire circuit breakers to freeze on quality burn.
  • Build lineage with OpenLineage and a Grafana dashboard for SLOs.
  • Run a game‑day: inject schema drift and late events; practice replay.
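
For the late-event injection, a plain Kafka producer is enough. A sketch where the topic name and JSON shape are placeholders; match whatever envelope your job actually deserializes (for CDC topics that is the Debezium change-event format):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Emits an event stamped 20 minutes in the past, beyond the 15m watermark,
// to verify it hits the late/DLQ path instead of silently vanishing.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
  long lateTs = System.currentTimeMillis() - 20 * 60 * 1000;
  String payload = "{\"order_id\":\"gameday-1\",\"event_id\":\"gameday-1-e1\",\"total_amount\":42.0,\"event_time\":" + lateTs + "}";
  producer.send(new ProducerRecord<>("cdc.orders-canary", null, lateTs, "gameday-1", payload));
  producer.flush();
}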

Days 46–90: Business integration

  • Reconcile finance rollups daily; publish accuracy deltas.
  • Integrate with a real decision: inventory reallocation, discount throttling, or fraud review routing.
  • Set alerting to SLO burn only; define escalation policy with runbooks.
  • Document the “break glass” replay procedure and do a DR failover test.

Measured outcomes we’ve delivered with this approach:

  • End‑to‑end P99 latency from >60s to 3.2s under peak.
  • Freshness SLO adherence from 78% to 99.3%.
  • Reconciliation variance vs source from ±1.2% to ±0.08% for daily revenue.
  • MTTR from 2h+ to 11m median via lineage + DLQ triage.
  • A promotion decision loop that increased conversion +3.4% by reacting in near real‑time instead of waiting on batch.

If you want the engineering to matter, wire it to a dollar sign. Then scale the pattern to more decision loops. GitPlumbers has done this across retail, fintech, B2B SaaS, and marketplaces. The tech is repeatable; the domain logic is where you win.


Key takeaways

  • Design for truth, not just speed: define explicit data SLOs for latency, completeness, and accuracy.
  • Use CDC + schema contracts + exactly-once processing to stop silent data corruption.
  • Put quality gates and canaries in the stream so bad data can’t reach exec dashboards.
  • Instrument end-to-end with lineage and latency metrics tied to business KPIs, not just system health.
  • Make reprocessing first-class with versioned storage (Iceberg/Delta) and deterministic jobs.
  • Ship value in 90 days by scoping to one decision loop and measuring revenue, margin, or risk impact.

Implementation checklist

  • Write SLOs: P99 end-to-end latency, freshness window, completeness, and accuracy targets.
  • Adopt data contracts in Git with `protobuf`/`avro` and `BACKWARD_TRANSITIVE` compatibility.
  • Ingest with CDC (`Debezium`/`Kafka Connect`) and enforce idempotent keys.
  • Process with `Flink` using checkpoints and exactly-once sinks; checkpoint to durable storage.
  • Create DLQs and canary topics; block promotions on failed data quality tests.
  • Measure consumer lag, watermark delay, null-rate, schema violations, and downstream KPI drift.
  • Enable lineage via `OpenLineage`/`Marquez`; alert on broken dependencies.
  • Use `Iceberg`/`Delta` for versioned lake to reprocess and backfill deterministically.
  • Manage infra with `Terraform`; deploy jobs/operators via `ArgoCD` (GitOps).
  • Implement cross-cluster replication (`MirrorMaker 2`/Cluster Linking) for DR.
  • Stand up fast query stores (`ClickHouse`/`Pinot`) for operational analytics.
  • Run monthly game-days to rehearse replays, failovers, and schema evolution.

Questions we hear from teams

How do we choose between Flink, Spark Structured Streaming, and ksqlDB?
If you need low latency, exactly-once guarantees, and watermarking that scales, `Flink` is the safest bet today. `Spark Structured Streaming` is fine for higher-latency (seconds to minutes) and simpler semantics, especially if you’re already a Spark shop. `ksqlDB` works for stream-native SQL on Kafka but gets tricky with complex joins and late data. Pick the tool your team can operate at 3 a.m., not the shiniest one.
Can Snowflake/BigQuery alone support real-time?
They can do near‑real‑time with micro‑batching, but for sub‑second decisions and robust late‑data handling you still want a streaming tier (Kafka/Flink/Pinot or ClickHouse). We often land curated streams into Snowflake for finance while powering ops dashboards off Pinot/ClickHouse.
How do we handle schema evolution without breaking consumers?
Use `protobuf`/`avro` with `Schema Registry` enforcing `BACKWARD_TRANSITIVE`. Add fields; never repurpose or remove. Version topics (`orders_agg_v1` → `_v2`) for breaking changes, run both during migration, and flip consumers deliberately.
What’s the minimal set of metrics to alert on?
Alert only on SLO burn: end‑to‑end latency, freshness, and completeness. Add sustained consumer lag, watermark delay, and quality error burn rate. Everything else is dashboard‑only. This cuts noise and MTTR.
How does GitPlumbers engage on this?
We run an assessment to define your decision loops and SLOs, stand up a reference stack (Kafka/Flink/Schema Registry/ClickHouse/Snowflake), wire quality gates and observability, then ship one decision loop in 90 days. After that, we help your team own it via playbooks, game‑days, and handover.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a Real‑Time Readiness Assessment · Download the Real‑Time Pipeline Checklist
