The Real-Time Dashboard That Lied: Building Pipelines You Can Bet Revenue On
Real-time data is only useful if it’s correct, explainable, and on time. Here’s the battle-tested pattern we use to ship streaming pipelines that don’t wake up your on-call—or mislead your exec team.
The “real-time” incident nobody forgets
I’ve watched this movie more times than I care to admit: the CEO is staring at a “live” revenue dashboard, the numbers are moving, everyone feels modern… and the decision gets made on data that’s wrong by just enough to matter.
Common culprits:
- Late events (mobile clients, edge services, batchy upstreams) that arrive 20–90 minutes after you’ve already acted
- Duplicates from retries, Kafka rebalances, at-least-once delivery, or a “helpful” producer resend
- Schema drift that silently nulls a column and turns your funnel into modern art
- Quality regressions where “nulls increased” is technically not a crash, so nobody gets paged
The business outcome is predictable:
- Pricing changes based on bogus conversion rates
- Inventory decisions based on partially ingested orders
- Fraud rules tuned on missing/duplicated transactions
If your pipeline supports business-critical decision making, the bar isn’t “streams are cool.” The bar is reliability with proof: freshness, completeness, correctness, and traceability—measured continuously.
Start with the decision, then design the SLO
Here’s what actually works: treat each real-time dataset like a production service and give it an SLO. Not a hand-wavy “near real-time,” but an explicit contract with the business.
Define:
- Freshness SLO: e.g., “`orders_analytics` is ≤ 2 minutes behind source at p95, ≤ 5 minutes at p99”
- Completeness SLO: e.g., “≥ 99.9% of order IDs seen in OLTP appear in the warehouse within 10 minutes”
- Validity SLO: e.g., “< 0.1% of events fail domain checks (negative amounts, invalid currency, impossible states)”
- Availability SLO: e.g., “pipeline runs without manual intervention 99.95% monthly”
- MTTR target: e.g., “data incident resolved or mitigated within 30 minutes”
Then map each SLO to how you’ll measure it.
- Freshness: event-time vs processing-time lag, per partition/topic
- Completeness: reconciliation counts against source-of-truth tables or CDC offsets
- Validity: rule failures as metrics, not log lines
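To make the mapping concrete, here’s a minimal sketch (function and metric names are illustrative, not from any particular library) of the freshness and completeness math you’d feed into these SLOs:

```python
from statistics import quantiles

def freshness_lags(event_times, processed_times):
    """Per-record lag in seconds: processing time minus event time (epoch seconds)."""
    return [p - e for e, p in zip(event_times, processed_times)]

def p95(values):
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    return quantiles(values, n=100)[94]

def completeness_ratio(source_ids, warehouse_ids):
    """Share of source-of-truth IDs that made it into the warehouse."""
    source = set(source_ids)
    return len(source & set(warehouse_ids)) / len(source) if source else 1.0
```

In production these numbers come from your stream processor’s metrics and a scheduled reconciliation query, but the arithmetic is exactly this.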
This is the difference between a “streaming pipeline” and a decisioning platform.
A reference architecture that survives retries, replays, and humans
If you’re building something that matters, the architecture should make failures boring.
A pattern we’ve deployed repeatedly (and used to clean up a lot of AI-assisted “vibe code” pipelines) looks like this:
- CDC from OLTP with `Debezium` into `Kafka`
- Schema governance with `Confluent Schema Registry` (or equivalent)
- Stream processing with `Apache Flink` (or `Spark Structured Streaming`)
- Idempotent sink into `Delta Lake`/`Iceberg` (or `BigQuery`/`Snowflake` with merge/upsert)
- Semantic metrics layer (dbt metrics / LookML / MetricFlow) so every dashboard uses the same definitions
The non-negotiables:
- Event keys must be real (stable primary keys, not random UUIDs generated in the pipeline)
- Exactly-once semantics are a system property, not a checkbox. You still need idempotent writes.
- Backfills are first-class. If replaying a day of traffic is scary, it’s not production-ready.
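To show what the second point means in practice, here’s a toy in-memory sketch of an idempotent merge (the dict stands in for an upsert-capable table like Delta/Iceberg; field names are ours): replays and duplicate deliveries converge to the same sink state because writes are keyed on a stable primary key and ordered by a version such as the CDC log sequence number.

```python
def merge_events(sink, events):
    """Idempotent upsert: keyed on a stable primary key, last-writer-wins by version.

    sink:   dict of order_id -> row (stand-in for a MERGE target table)
    events: iterable of rows with 'order_id' and a monotonically increasing
            'version' (e.g., the CDC log sequence number)
    """
    for row in events:
        current = sink.get(row["order_id"])
        # Stale or duplicate versions are skipped, so retries and replays are no-ops
        if current is None or row["version"] > current["version"]:
            sink[row["order_id"]] = row
    return sink
```

Replaying a whole day of traffic through a merge like this is safe by construction, which is what makes backfills boring.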
A concrete Debezium connector example for Postgres:
```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.prod",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${secrets:dbz_pg_password}",
    "database.dbname": "commerce",
    "slot.name": "orders_cdc_slot",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.orders,public.order_items",
    "topic.prefix": "cdc",
    "tombstones.on.delete": "false",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "true"
  }
}
```

This gets you:
- Ordered change events per key
- A stable stream you can replay
- Schema enforcement (Avro/Protobuf) instead of “JSON and prayers”
Data contracts: stop schema drift before it hits prod
I’ve seen teams spend six figures on a streaming platform and still get taken out by a renamed column.
Treat schemas like APIs:
- Register schemas per topic
- Enforce compatibility (usually BACKWARD for analytics consumers)
- Block breaking changes in CI
Example Schema Registry compatibility config:
```bash
curl -X PUT \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility":"BACKWARD"}' \
  http://schema-registry:8081/config/cdc.public.orders-value
```

And here’s the part people skip: business-level contracts. Not just types, but meaning:
- `order_total` is in minor units? major units?
- `status` is an enum with valid transitions?
- `created_at` is event time or processing time?
Write those down, version them, and test them.
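As a sketch of what “test them” can look like (the transition table and field names below are hypothetical, not a standard), encode the business contract as data and assert against it in CI:

```python
# Hypothetical business contract for the orders domain: amounts are in
# minor units (cents), and status may only move along these edges.
VALID_TRANSITIONS = {
    "CREATED": {"PAID", "CANCELLED"},
    "PAID": {"SHIPPED", "CANCELLED"},
    "SHIPPED": set(),
    "CANCELLED": set(),
}

def check_order_event(event, previous_status=None):
    """Return a list of contract violations for one change event."""
    errors = []
    if event["order_total_cents"] < 0:
        errors.append("negative_amount")
    if event["status"] not in VALID_TRANSITIONS:
        errors.append("unknown_status")
    elif previous_status is not None and event["status"] not in VALID_TRANSITIONS[previous_status]:
        errors.append(f"illegal_transition:{previous_status}->{event['status']}")
    return errors
```

The same table can drive the stream’s quarantine rules, so CI and the runtime enforce one definition instead of two drifting copies.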
Quality gates that work in streaming (without killing throughput)
“Just run Great Expectations on the stream” sounds nice until you’ve melted your job with per-row Python checks.
What actually works is a two-layer approach:
- Inline checks: cheap, fast, produce metrics and route bad records
- Post-sink audit checks: heavier validation on micro-batches or windows
Example: Flink SQL with a quarantine side-output pattern (conceptually):
```sql
-- Pseudocode-ish Flink SQL pattern
CREATE TABLE orders_raw (
  order_id STRING,
  user_id STRING,
  status STRING,
  order_total_cents BIGINT,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH (...);

CREATE TABLE orders_quarantine (
  order_id STRING,
  reason STRING,
  raw_payload STRING,
  event_time TIMESTAMP(3)
) WITH (...);

CREATE TABLE orders_clean (
  order_id STRING,
  user_id STRING,
  status STRING,
  order_total_cents BIGINT,
  event_time TIMESTAMP(3)
) WITH (...);

INSERT INTO orders_clean
SELECT * FROM orders_raw
WHERE order_total_cents >= 0
  AND status IN ('CREATED','PAID','SHIPPED','CANCELLED');

INSERT INTO orders_quarantine
SELECT order_id,
       'invalid_domain_values' AS reason,
       CAST(ROW(order_id, user_id, status, order_total_cents, event_time) AS STRING) AS raw_payload,
       event_time
FROM orders_raw
WHERE order_total_cents < 0
   OR status NOT IN ('CREATED','PAID','SHIPPED','CANCELLED');
```

Post-sink, run dbt tests (fast, deterministic) and a small number of expectation suites for audits.
dbt example:
```yaml
version: 2
models:
  - name: fct_orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [order_id]
    columns:
      - name: order_total_cents
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "order_total_cents >= 0"
      - name: status
        tests:
          - accepted_values:
              values: ['CREATED','PAID','SHIPPED','CANCELLED']
```

Measurable outcome we aim for: bad data is visible within 2–5 minutes, and quarantine volume is tracked like errors, not ignored like “data stuff.”
Observability: page on SLO breaches, not on vibes
Most “real-time” pipelines fail silently. The job is green, but your exec dashboard is stale.
Instrument the pipeline like any other service:
- Lag (event-time and consumer lag)
- Throughput (records/sec)
- Error rate (parse failures, contract violations)
- Quarantine rate (bad records/sec)
- Freshness at the serving layer (warehouse table updated_at vs now)
Pipe metrics via OpenTelemetry and alert with Prometheus/Alertmanager.
Example alert rule for freshness breach:
```yaml
groups:
  - name: data-pipeline-slos
    rules:
      - alert: OrdersAnalyticsFreshnessBreach
        expr: |
          (time() - max(orders_table_last_success_unix_seconds)) > 300
        for: 5m
        labels:
          severity: page
          service: realtime-analytics
        annotations:
          summary: "Orders analytics is >5m stale"
          description: "Downstream dashboards are likely wrong. Check Kafka lag, Flink checkpoints, and sink merge backlog."
```

Two practical tips from the trenches:
- Track event-time lag, not just Kafka consumer lag. Late events are where truth goes to die.
- Build a one-command replay runbook. When incidents happen (they will), you don’t want bespoke heroics.
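To illustrate the first tip: consumer lag can read zero while the data is badly stale (a quiet topic, or an upstream flushing old events). A minimal sketch of tracking both, with illustrative metric names:

```python
import time

def consumer_lag(log_end_offset, committed_offset):
    """Records not yet consumed. Zero here can still hide stale data."""
    return log_end_offset - committed_offset

def event_time_lag_seconds(latest_event_ts, now=None):
    """How far behind wall-clock time the newest processed event actually is."""
    now = time.time() if now is None else now
    return now - latest_event_ts
```

Alert on both: a flat consumer-lag graph next to a climbing event-time-lag graph usually means the producer, not the pipeline, is the problem.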
Business value delivery: define metrics once, then make them hard to misuse
Even if your pipeline is perfect, you can still lose the plot with metric chaos:
- “Revenue” means gross in one dashboard, net in another
- “Active user” includes bots in one place, excludes them elsewhere
- Teams build parallel pipelines because they don’t trust the main one
The fix is boring—and effective:
- Publish curated tables (or views) as data products with owners
- Implement a semantic layer (`dbt` metrics, `MetricFlow`, `LookML`, `Cube`) so dashboards reuse definitions
- Version metrics like code, and review changes
When we do this with clients, the measurable outcomes are usually:
- 30–60% reduction in duplicate pipelines within a quarter
- Fewer Sev-2 “data incidents” because drift is caught at the contract layer
- Faster decision cycles: e.g., promo performance visible in < 3 minutes instead of “tomorrow morning”
One retail client (multi-region, heavy promos) went from ~45-minute staleness during peak to p95 freshness under 90 seconds, and their on-call stopped getting nightly pages for “dashboard looks off.” Not because the data got magically better—because we made correctness measurable and enforced it.
A rollout plan that doesn’t explode your roadmap
You don’t need a six-month platform rebuild. You need one pipeline done properly, then repeat.
- Pick one decision that matters (pricing, fraud, fulfillment SLA).
- Implement CDC → stream processing → idempotent sink for that domain.
- Add contracts, tests, and SLO-based alerting.
- Publish governed metrics and migrate the dashboards that drive decisions.
- Run one game day: simulate late events, duplicates, schema change, and a sink outage.
If you can’t do step 5, you don’t have reliability—you have optimism.
At GitPlumbers, this is where we’re usually called: a real-time pipeline exists, but it’s fragile, untestable, and full of AI-generated glue code that “works on my laptop.” We come in, make it replayable, observable, contract-driven, and boring—in the best way.
Real-time is a feature. Reliability is the product.
If your exec dashboards can move money, they should be held to the same standard as the systems that move the money.
Key takeaways
- “Real-time” only matters when you can prove freshness, completeness, and correctness per business KPI.
- Use CDC + Kafka with explicit keys and idempotent sinks to survive retries, replays, and backfills.
- Treat schema changes as deployments: enforce data contracts with Schema Registry compatibility and CI gates.
- Quality checks belong in-stream (for fast feedback) and post-sink (for auditability).
- Run your data platform like production software: define SLOs, page on breaches, and keep MTTR low with replayable pipelines.
Implementation checklist
- Define the business decision and its tolerance: max staleness, acceptable error rate, and blast radius.
- Pick one source of truth per entity and a stable primary key for CDC events.
- Enforce schema compatibility in `Schema Registry` (or equivalent) and block breaking changes in CI.
- Make every sink write idempotent (dedupe keys + upsert semantics + deterministic merges).
- Add freshness/completeness/validity metrics and alert on SLO breaches (not vibes).
- Create a replay/backfill procedure that does not require heroics or SSH-ing into prod.
- Publish metrics via a governed semantic layer and track adoption + decision impact.
Questions we hear from teams
- Do we need exactly-once semantics to make real-time analytics reliable?
- Not strictly—but you do need deterministic, idempotent outcomes. Flink’s exactly-once with checkpointing helps a lot, but you should still design sinks with upserts/merges on stable keys and handle replays without double-counting.
- Kafka vs Kinesis vs Pub/Sub: does the broker choice matter most?
- Broker choice matters, but most failures we see are upstream: bad keys, no contracts, no SLOs, and no replay plan. Pick a broker your team can operate, then invest in schema governance, idempotency, and observability.
- How do we handle late-arriving events without breaking dashboards?
- Use event-time processing with watermarks, publish both “provisional” and “final” metrics where needed, and make freshness visible. Also track late-event rate as a first-class metric—if it spikes, something upstream changed.
- What’s the fastest way to improve trust in existing real-time pipelines?
- Add freshness/completeness SLOs and page on breaches, implement quarantine for invalid records, and enforce schema compatibility in CI. You’ll usually cut MTTR dramatically before you touch any fancy new tooling.
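The late-event answer above can be sketched end to end. Here’s a toy windowed counter (window size, allowed lateness, and the max-event-time watermark are all simplifying assumptions): it keeps folding tolerably late events into their event-time window, flags windows as provisional until the watermark clears them, and surfaces too-late events as a metric instead of silently dropping them.

```python
from collections import defaultdict

class WindowedCounter:
    """Tumbling event-time windows with provisional/final flags (toy sketch)."""

    def __init__(self, window=60, allowed_lateness=120):
        self.window = window                  # seconds per tumbling window (assumed)
        self.allowed_lateness = allowed_lateness
        self.counts = defaultdict(int)
        self.watermark = float("-inf")        # toy watermark: max event time seen
        self.late = 0

    def add(self, ts):
        self.watermark = max(self.watermark, ts)
        if ts < self.watermark - self.allowed_lateness:
            self.late += 1                    # too late to fold in; track as a metric
            return
        self.counts[ts - ts % self.window] += 1

    def snapshot(self):
        """Per-window counts; a window is final once the watermark clears it."""
        return {
            start: {
                "count": n,
                "final": self.watermark >= start + self.window + self.allowed_lateness,
            }
            for start, n in self.counts.items()
        }
```

Serve the provisional numbers with that flag attached, and dashboards can show movement in real time without pretending a still-open window is the truth.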
