The Real-Time Dashboard That Lied: Building Pipelines You Can Bet Revenue On
Real-time data is only useful if it’s correct, explainable, and on time. Here’s the battle-tested pattern we use to ship streaming pipelines that don’t wake up your on-call—or mislead your exec team.
The “real-time” incident nobody forgets
I’ve watched this movie more times than I care to admit: the CEO is staring at a “live” revenue dashboard, the numbers are moving, everyone feels modern… and the decision gets made on data that’s wrong by just enough to matter.
Common culprits:
- Late events (mobile clients, edge services, batchy upstreams) that arrive 20–90 minutes after you’ve already acted
- Duplicates from retries, Kafka rebalances, at-least-once delivery, or a “helpful” producer resend
- Schema drift that silently nulls a column and turns your funnel into modern art
- Quality regressions where “nulls increased” is technically not a crash, so nobody gets paged
The business outcome is predictable:
- Pricing changes based on bogus conversion rates
- Inventory decisions based on partially ingested orders
- Fraud rules tuned on missing/duplicated transactions
If your pipeline supports business-critical decision making, the bar isn’t “streams are cool.” The bar is reliability with proof: freshness, completeness, correctness, and traceability—measured continuously.
Start with the decision, then design the SLO
Here’s what actually works: treat each real-time dataset like a production service and give it an SLO. Not a hand-wavy “near real-time,” but an explicit contract with the business.
Define:
- Freshness SLO: e.g., “`orders_analytics` is ≤ 2 minutes behind source at p95, ≤ 5 minutes at p99”
- Completeness SLO: e.g., “≥ 99.9% of order IDs seen in OLTP appear in the warehouse within 10 minutes”
- Validity SLO: e.g., “< 0.1% of events fail domain checks (negative amounts, invalid currency, impossible states)”
- Availability SLO: e.g., “pipeline runs without manual intervention 99.95% monthly”
- MTTR target: e.g., “data incident resolved or mitigated within 30 minutes”
Then map each SLO to how you’ll measure it.
- Freshness: event-time vs processing-time lag, per partition/topic
- Completeness: reconciliation counts against source-of-truth tables or CDC offsets
- Validity: rule failures as metrics, not log lines
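To make the mapping concrete, here’s a minimal sketch (function and metric names are illustrative, not from any particular library) of the freshness and completeness math you’d feed into these SLOs:

```python
from statistics import quantiles

def freshness_lags(event_times, processed_times):
    """Per-record lag in seconds: processing time minus event time (epoch seconds)."""
    return [p - e for e, p in zip(event_times, processed_times)]

def p95(values):
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    return quantiles(values, n=100)[94]

def completeness_ratio(source_ids, warehouse_ids):
    """Share of source-of-truth IDs that made it into the warehouse."""
    source = set(source_ids)
    return len(source & set(warehouse_ids)) / len(source) if source else 1.0
```

In production these numbers come from your stream processor’s metrics and a scheduled reconciliation query, but the arithmetic is exactly this.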
This is the difference between a “streaming pipeline” and a decisioning platform.
A reference architecture that survives retries, replays, and humans
If you’re building something that matters, the architecture should make failures boring.
A pattern we’ve deployed repeatedly (and used to clean up a lot of AI-assisted “vibe code” pipelines) looks like this:
- CDC from OLTP with `Debezium` into `Kafka`
- Schema governance with `Confluent Schema Registry` (or equivalent)
- Stream processing with `Apache Flink` (or `Spark Structured Streaming`)
- Idempotent sink into `Delta Lake`/`Iceberg` (or `BigQuery`/`Snowflake` with merge/upsert)
- Semantic metrics layer (dbt metrics / LookML / MetricFlow) so every dashboard uses the same definitions
The non-negotiables:
- Event keys must be real (stable primary keys, not random UUIDs generated in the pipeline)
- Exactly-once semantics are a system property, not a checkbox. You still need idempotent writes.
- Backfills are first-class. If replaying a day of traffic is scary, it’s not production-ready.
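To show what the second point means in practice, here’s a toy in-memory sketch of an idempotent merge (the dict stands in for an upsert-capable table like Delta/Iceberg; field names are ours): replays and duplicate deliveries converge to the same sink state because writes are keyed on a stable primary key and ordered by a version such as the CDC log sequence number.

```python
def merge_events(sink, events):
    """Idempotent upsert: keyed on a stable primary key, last-writer-wins by version.

    sink:   dict of order_id -> row (stand-in for a MERGE target table)
    events: iterable of rows with 'order_id' and a monotonically increasing
            'version' (e.g., the CDC log sequence number)
    """
    for row in events:
        current = sink.get(row["order_id"])
        # Stale or duplicate versions are skipped, so retries and replays are no-ops
        if current is None or row["version"] > current["version"]:
            sink[row["order_id"]] = row
    return sink
```

Replaying a whole day of traffic through a merge like this is safe by construction, which is what makes backfills boring.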
A concrete Debezium connector example for Postgres:
```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.prod",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${secrets:dbz_pg_password}",
    "database.dbname": "commerce",
    "slot.name": "orders_cdc_slot",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.orders,public.order_items",
    "topic.prefix": "cdc",
    "tombstones.on.delete": "false",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "true"
  }
}
```

This gets you:
- Ordered change events per key
- A stable stream you can replay
- Schema enforcement (Avro/Protobuf) instead of “JSON and prayers”
Data contracts: stop schema drift before it hits prod
I’ve seen teams spend six figures on a streaming platform and still get taken out by a renamed column.
Treat schemas like APIs:
- Register schemas per topic
- Enforce compatibility (usually BACKWARD for analytics consumers)
- Block breaking changes in CI
Example Schema Registry compatibility config:
```bash
curl -X PUT \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility":"BACKWARD"}' \
  http://schema-registry:8081/config/cdc.public.orders-value
```

And here’s the part people skip: business-level contracts. Not just types, but meaning:
- `order_total` is in minor units? major units?
- `status` is an enum with valid transitions?
- `created_at` is event time or processing time?
Write those down, version them, and test them.
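As a sketch of what “test them” can look like (the transition table and field names below are hypothetical, not a standard), encode the business contract as data and assert against it in CI:

```python
# Hypothetical business contract for the orders domain: amounts are in
# minor units (cents), and status may only move along these edges.
VALID_TRANSITIONS = {
    "CREATED": {"PAID", "CANCELLED"},
    "PAID": {"SHIPPED", "CANCELLED"},
    "SHIPPED": set(),
    "CANCELLED": set(),
}

def check_order_event(event, previous_status=None):
    """Return a list of contract violations for one change event."""
    errors = []
    if event["order_total_cents"] < 0:
        errors.append("negative_amount")
    if event["status"] not in VALID_TRANSITIONS:
        errors.append("unknown_status")
    elif previous_status is not None and event["status"] not in VALID_TRANSITIONS[previous_status]:
        errors.append(f"illegal_transition:{previous_status}->{event['status']}")
    return errors
```

The same table can drive the stream’s quarantine rules, so CI and the runtime enforce one definition instead of two drifting copies.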
Quality gates that work in streaming (without killing throughput)
“Just run Great Expectations on the stream” sounds nice until you’ve melted your job with per-row Python checks.
What actually works is a two-layer approach:
- Inline checks: cheap, fast, produce metrics and route bad records
- Post-sink audit checks: heavier validation on micro-batches or windows
Example: Flink SQL with a quarantine side-output pattern (conceptually):
```sql
-- Pseudocode-ish Flink SQL pattern
CREATE TABLE orders_raw (
  order_id STRING,
  user_id STRING,
  status STRING,
  order_total_cents BIGINT,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH (...);

CREATE TABLE orders_quarantine (
  order_id STRING,
  reason STRING,
  raw_payload STRING,
  event_time TIMESTAMP(3)
) WITH (...);

CREATE TABLE orders_clean (
  order_id STRING,
  user_id STRING,
  status STRING,
  order_total_cents BIGINT,
  event_time TIMESTAMP(3)
) WITH (...);

INSERT INTO orders_clean
SELECT * FROM orders_raw
WHERE order_total_cents >= 0
  AND status IN ('CREATED','PAID','SHIPPED','CANCELLED');

INSERT INTO orders_quarantine
SELECT order_id,
       'invalid_domain_values' AS reason,
       CAST(ROW(order_id, user_id, status, order_total_cents, event_time) AS STRING) AS raw_payload,
       event_time
FROM orders_raw
WHERE order_total_cents < 0
   OR status NOT IN ('CREATED','PAID','SHIPPED','CANCELLED');
```

Post-sink, run dbt tests (fast, deterministic) and a small number of expectation suites for audits.
dbt example:
```yaml
version: 2
models:
  - name: fct_orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [order_id]
    columns:
      - name: order_total_cents
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "order_total_cents >= 0"
      - name: status
        tests:
          - accepted_values:
              values: ['CREATED','PAID','SHIPPED','CANCELLED']
```

Measurable outcome we aim for: bad data is visible within 2–5 minutes, and quarantine volume is tracked like errors, not ignored like “data stuff.”
Observability: page on SLO breaches, not on vibes
Most “real-time” pipelines fail silently. The job is green, but your exec dashboard is stale.
Instrument the pipeline like any other service:
- Lag (event-time and consumer lag)
- Throughput (records/sec)
- Error rate (parse failures, contract violations)
- Quarantine rate (bad records/sec)
- Freshness at the serving layer (warehouse table updated_at vs now)
Pipe metrics via OpenTelemetry and alert with Prometheus/Alertmanager.
Example alert rule for freshness breach:
```yaml
groups:
  - name: data-pipeline-slos
    rules:
      - alert: OrdersAnalyticsFreshnessBreach
        expr: |
          (time() - max(orders_table_last_success_unix_seconds)) > 300
        for: 5m
        labels:
          severity: page
          service: realtime-analytics
        annotations:
          summary: "Orders analytics is >5m stale"
          description: "Downstream dashboards are likely wrong. Check Kafka lag, Flink checkpoints, and sink merge backlog."
```

Two practical tips from the trenches:
- Track event-time lag, not just Kafka consumer lag. Late events are where truth goes to die.
- Build a one-command replay runbook. When incidents happen (they will), you don’t want bespoke heroics.
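To illustrate the first tip: consumer lag can read zero while the data is badly stale (a quiet topic, or an upstream flushing old events). A minimal sketch of tracking both, with illustrative metric names:

```python
import time

def consumer_lag(log_end_offset, committed_offset):
    """Records not yet consumed. Zero here can still hide stale data."""
    return log_end_offset - committed_offset

def event_time_lag_seconds(latest_event_ts, now=None):
    """How far behind wall-clock time the newest processed event actually is."""
    now = time.time() if now is None else now
    return now - latest_event_ts
```

Alert on both: a flat consumer-lag graph next to a climbing event-time-lag graph usually means the producer, not the pipeline, is the problem.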
Business value delivery: define metrics once, then make them hard to misuse
Even if your pipeline is perfect, you can still lose the plot with metric chaos:
- “Revenue” means gross in one dashboard, net in another
- “Active user” includes bots in one place, excludes them elsewhere
- Teams build parallel pipelines because they don’t trust the main one
The fix is boring—and effective:
- Publish curated tables (or views) as data products with owners
- Implement a semantic layer (`dbt` metrics, `MetricFlow`, `LookML`, `Cube`) so dashboards reuse definitions
- Version metrics like code, and review changes
When we do this with clients, the measurable outcomes are usually:
- 30–60% reduction in duplicate pipelines within a quarter
- Fewer Sev-2 “data incidents” because drift is caught at the contract layer
- Faster decision cycles: e.g., promo performance visible in < 3 minutes instead of “tomorrow morning”
One retail client (multi-region, heavy promos) went from ~45-minute staleness during peak to p95 freshness under 90 seconds, and their on-call stopped getting nightly pages for “dashboard looks off.” Not because the data got magically better—because we made correctness measurable and enforced it.
A rollout plan that doesn’t explode your roadmap
You don’t need a six-month platform rebuild. You need one pipeline done properly, then repeat.
- Pick one decision that matters (pricing, fraud, fulfillment SLA).
- Implement CDC → stream processing → idempotent sink for that domain.
- Add contracts, tests, and SLO-based alerting.
- Publish governed metrics and migrate the dashboards that drive decisions.
- Run one game day: simulate late events, duplicates, schema change, and a sink outage.
If you can’t do step 5, you don’t have reliability—you have optimism.
At GitPlumbers, this is where we’re usually called: a real-time pipeline exists, but it’s fragile, untestable, and full of AI-generated glue code that “works on my laptop.” We come in, make it replayable, observable, contract-driven, and boring—in the best way.
Real-time is a feature. Reliability is the product.
If your exec dashboards can move money, they should be held to the same standard as the systems that move the money.
Key takeaways
- “Real-time” only matters when you can prove freshness, completeness, and correctness per business KPI.
- Use CDC + Kafka with explicit keys and idempotent sinks to survive retries, replays, and backfills.
- Treat schema changes as deployments: enforce data contracts with Schema Registry compatibility and CI gates.
- Quality checks belong in-stream (for fast feedback) and post-sink (for auditability).
- Run your data platform like production software: define SLOs, page on breaches, and keep MTTR low with replayable pipelines.
Implementation checklist
- Define the business decision and its tolerance: max staleness, acceptable error rate, and blast radius.
- Pick one source of truth per entity and a stable primary key for CDC events.
- Enforce schema compatibility in `Schema Registry` (or equivalent) and block breaking changes in CI.
- Make every sink write idempotent (dedupe keys + upsert semantics + deterministic merges).
- Add freshness/completeness/validity metrics and alert on SLO breaches (not vibes).
- Create a replay/backfill procedure that does not require heroics or SSH-ing into prod.
- Publish metrics via a governed semantic layer and track adoption + decision impact.
Questions we hear from teams
- Do we need exactly-once semantics to make real-time analytics reliable?
- Not strictly—but you do need deterministic, idempotent outcomes. Flink’s exactly-once with checkpointing helps a lot, but you should still design sinks with upserts/merges on stable keys and handle replays without double-counting.
- Kafka vs Kinesis vs Pub/Sub: does the broker choice matter most?
- Broker choice matters, but most failures we see are upstream: bad keys, no contracts, no SLOs, and no replay plan. Pick a broker your team can operate, then invest in schema governance, idempotency, and observability.
- How do we handle late-arriving events without breaking dashboards?
- Use event-time processing with watermarks, publish both “provisional” and “final” metrics where needed, and make freshness visible. Also track late-event rate as a first-class metric—if it spikes, something upstream changed.
- What’s the fastest way to improve trust in existing real-time pipelines?
- Add freshness/completeness SLOs and page on breaches, implement quarantine for invalid records, and enforce schema compatibility in CI. You’ll usually cut MTTR dramatically before you touch any fancy new tooling.
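The late-event answer above can be sketched end to end. Here’s a toy windowed counter (window size, allowed lateness, and the max-event-time watermark are all simplifying assumptions): it keeps folding tolerably late events into their event-time window, flags windows as provisional until the watermark clears them, and surfaces too-late events as a metric instead of silently dropping them.

```python
from collections import defaultdict

class WindowedCounter:
    """Tumbling event-time windows with provisional/final flags (toy sketch)."""

    def __init__(self, window=60, allowed_lateness=120):
        self.window = window                  # seconds per tumbling window (assumed)
        self.allowed_lateness = allowed_lateness
        self.counts = defaultdict(int)
        self.watermark = float("-inf")        # toy watermark: max event time seen
        self.late = 0

    def add(self, ts):
        self.watermark = max(self.watermark, ts)
        if ts < self.watermark - self.allowed_lateness:
            self.late += 1                    # too late to fold in; track as a metric
            return
        self.counts[ts - ts % self.window] += 1

    def snapshot(self):
        """Per-window counts; a window is final once the watermark clears it."""
        return {
            start: {
                "count": n,
                "final": self.watermark >= start + self.window + self.allowed_lateness,
            }
            for start, n in self.counts.items()
        }
```

Serve the provisional numbers with that flag attached, and dashboards can show movement in real time without pretending a still-open window is the truth.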
