Real-Time Pipelines That Don’t Lie: Shipping Decision‑Grade Data Under SLA, Not Vibes
If your CFO is making calls off streaming dashboards, you can’t afford “eventual” truth. Here’s the blueprint we deploy when reliability, quality, and minutes‑level latency actually matter.
Real-time is easy. Decision-grade real-time is an engineering discipline with SLOs, contracts, and brakes.
The 1:07 AM dashboard that lied (and almost cost a quarter)
We’d just launched a promo. The CFO was watching a real-time dashboard in Looker; product was throttling ad spend based on conversion. The graph flatlined for 18 minutes. Engineering got paged. Finance paused campaigns. Later we found it: a “harmless” schema change (amount from INT to DECIMAL) broke a consumer, backpressure cascaded, and the batch backfill contaminated same-day numbers. Classic.
I’ve seen this fail at unicorns and incumbents: everyone wants "real-time" until the first incident reveals the pipeline isn’t decision-grade. If your data drives pricing, fraud, inventory, or SLA credits, "eventual truth" is just a polite way to say "we’re guessing."
Decision‑grade means SLOs, not vibes
Forget “near real-time.” Define SLOs the business can sign.
- Freshness: P95 lag < 120s from DB commit to curated table availability.
- Completeness: > 99.9% events delivered within 5m; < 0.1% late beyond watermark.
- Accuracy: < 0.05% validation failures on guarded fields (amount, status).
- Lineage: 100% of published assets have upstream lineage in OpenLineage.
- Availability: Pipeline publish path SLO 99.9%; partial degradation allowed (read-only).
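Freshness only becomes an enforceable SLO when the publish path emits a commit timestamp that Prometheus can compare against the clock. A minimal sketch with prometheus_client; the gauge name matches the alert rule below, while publish_batch and the writer stub are illustrative, not a real API:

# freshness instrumentation sketch; writer and publish hook are placeholders
import time
from prometheus_client import Gauge, start_http_server

LAST_COMMIT = Gauge(
    "orders_curated_last_commit_timestamp_seconds",
    "Unix timestamp of the last successful commit to orders_curated",
)

def write_to_curated_table(rows):
    pass  # placeholder for your Iceberg/Delta/warehouse writer

def publish_batch(rows):
    write_to_curated_table(rows)
    LAST_COMMIT.set(time.time())  # freshness lag is measured from this commit timestamp

if __name__ == "__main__":
    start_http_server(9102)  # expose /metrics for Prometheus to scrape
    publish_batch([])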
Put these in writing. Instrument them with Prometheus and page when violated, not “when someone notices Slack is loud.” Example freshness alert:
# prometheus rule
groups:
  - name: data-freshness
    rules:
      - alert: OrdersFreshnessSLOViolated
        expr: (time() - max_over_time(orders_curated_last_commit_timestamp_seconds[15m])) > 180
        for: 5m
        labels:
          severity: page
        annotations:
          description: "orders_curated data lag > 3m for 5m"

A reference architecture that survives 3 AM pages
This is the boring, proven stack we deploy when reliability matters.
- Ingress (CDC): Debezium on Kafka Connect pulls from Postgres/MySQL into Kafka (3.6+). No cron scrapers; CDC or bust.
- Contracts: Avro or Protobuf in Confluent Schema Registry with BACKWARD compatibility. Producers can’t break consumers.
- Stream processing: Flink 1.17+ (or Spark Structured Streaming if you already run Spark). Use watermarks and exactly-once sinks (or at-least-once with idempotent writes).
- Storage: Iceberg/Delta on S3/GCS/ADLS for ACID, time travel, and efficient upserts.
- Transform/Model: dbt 1.8 for semantic models; Great Expectations 0.18 for guardrails.
- Orchestration: Dagster or Airflow 2.8 emitting OpenLineage events.
- Ops: ArgoCD GitOps, Terraform for infra, Prometheus/Grafana for SLOs.
Debezium connector (Postgres CDC) we ship repeatedly:
{
  "name": "orders-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/opt/connect/creds.properties:PG_PASSWORD}",
    "database.dbname": "sales",
    "database.server.name": "salesdb",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders,public.order_items",
    "tombstones.on.delete": "false",
    "transforms": "unwrap,route",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3"
  }
}

Kafka topic via Strimzi (replication + compaction where it counts):
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: prod-kafka
spec:
  partitions: 12
  replicas: 3
  config:
    cleanup.policy: compact,delete
    min.insync.replicas: 2
    retention.ms: 172800000
    segment.bytes: 1073741824

Guardrails that make bad data hard to ship
I don’t trust pipelines that rely on “discipline.” Build brakes into the road.
- Schema contracts: Enforce compatibility at the registry. Reject producers at publish time, not at 2 AM.
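A minimal sketch of enforcing that contract at the registry itself, using the Confluent Schema Registry compatibility endpoint (URL and subject name are illustrative):

# pin BACKWARD compatibility on a subject; run this from CI, not by hand
import requests

SCHEMA_REGISTRY_URL = "https://schema-registry.internal:8081"  # illustrative

def enforce_backward(subject: str) -> None:
    resp = requests.put(
        f"{SCHEMA_REGISTRY_URL}/config/{subject}",
        json={"compatibility": "BACKWARD"},
        timeout=10,
    )
    resp.raise_for_status()  # fail the pipeline step loudly if the registry says no

if __name__ == "__main__":
    enforce_backward("orders-value")  # breaking schema registrations now fail at publish time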
- Watermarks + dedup: Late data happens. Model it. Don’t guess. Flink SQL that de-dupes by order_id and filters invalid states:
CREATE TABLE orders_raw (
  order_id STRING,
  user_id STRING,
  amount DECIMAL(10,2),
  status STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

CREATE TABLE orders_curated (
  order_id STRING,
  user_id STRING,
  amount DECIMAL(10,2),
  status STRING,
  event_date DATE,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'prod',
  'catalog-type' = 'hive',
  'warehouse' = 's3://datalake/prod'
);

INSERT INTO orders_curated
SELECT
  order_id,
  user_id,
  amount,
  status,
  CAST(event_time AS DATE) AS event_date
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_time DESC) AS rn
  FROM orders_raw
)
WHERE rn = 1 AND status IN ('PAID','SHIPPED');

- Quality gates: Fail closed with Great Expectations. No green, no publish:
from great_expectations.data_context import DataContext

context = DataContext()
batch_request = {
    "datasource_name": "spark_stream",
    "data_connector_name": "curated",
    "data_asset_name": "orders_curated"
}
result = context.run_checkpoint(
    checkpoint_name="orders_curated_checkpoint",
    batch_request=batch_request,
    run_name="flink-commit-{{ds_nodash}}"
)
if not result["success"]:
    raise SystemExit("GE failed: blocking publish")

- dbt tests for semantic layer:
version: 2
models:
  - name: fct_orders
    description: "Decision-grade orders facts"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              strictly: true

- Lineage: Emit OpenLineage from Airflow/Dagster so you can blame the right job fast:
# airflow values for Helm
openlineage:
  enabled: true
  transport:
    type: http
    url: https://openlineage.yourcompany.com
    endpoint: api/v1/lineage
    apiKey: ${OPENLINEAGE_API_KEY}

- Circuit breakers: If null-rate on amount spikes > 0.5% in 5m, drop publishes and page. Prefer a brief freeze over silent corruption.
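A minimal fail-closed sketch of that breaker, assuming a pre-commit hook where the staged batch is available in memory (threshold and paging hook are illustrative):

# circuit breaker on amount null-rate: block the publish, page a human
NULL_RATE_THRESHOLD = 0.005  # 0.5%

def page_oncall(message: str) -> None:
    print(f"PAGE: {message}")  # placeholder: wire to PagerDuty/Alertmanager

def check_amount_null_rate(staged_rows: list[dict]) -> None:
    if not staged_rows:
        return
    nulls = sum(1 for row in staged_rows if row.get("amount") is None)
    null_rate = nulls / len(staged_rows)
    if null_rate > NULL_RATE_THRESHOLD:
        page_oncall(f"amount null-rate {null_rate:.2%} breached the 0.5% threshold")
        raise SystemExit("Circuit breaker open: blocking publish")  # brief freeze beats silent corruption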
Rollout playbook: ship without burning the business
Don’t big-bang a real-time cutover. This is the sequence that’s saved us repeatedly.
- Shadow pipeline: Mirror CDC topics, build the new stream to a shadow …_canary dataset.
- Data-diff: Compare canary vs prod with Datafold or dbt-expectations rollups. Track row count deltas, aggregates, and key joins (see the sketch after this list).
- Replayability: Confirm you can reprocess the last 7–30 days from Kafka. Retention is cheaper than reputation.
- Canary consumers: Point 5–10% of dashboards/services to canary. They don’t decide anything yet.
- Backfill: Use Spark batch with the same business logic as your stream (shared library) to fill historical gaps.
- Promotion: Flip consumer groups, keep the old pipeline warm for 24–72 hours.
- Rollback: Pre-bake the revert plan. It’s not a rollback if you haven’t practiced it.
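A minimal data-diff sketch for the canary step above, assuming both tables are reachable over one DB-API connection (table names and tolerance are illustrative):

# compare canary vs prod rollups before promotion; promote only on an empty result
ROLLUP_SQL = """
    SELECT event_date, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM {table}
    WHERE event_date >= CURRENT_DATE - 7
    GROUP BY event_date
"""

def rollup(conn, table: str) -> dict:
    cur = conn.cursor()
    cur.execute(ROLLUP_SQL.format(table=table))
    return {day: (rows, total) for day, rows, total in cur.fetchall()}

def data_diff(conn, tolerance: float = 0.001) -> list[str]:
    prod = rollup(conn, "analytics.fct_orders")
    canary = rollup(conn, "analytics.fct_orders_canary")
    problems = []
    for day, (prod_rows, prod_total) in prod.items():
        canary_rows, canary_total = canary.get(day, (0, 0.0))
        if abs(canary_rows - prod_rows) > tolerance * max(prod_rows, 1):
            problems.append(f"{day}: row count {prod_rows} vs {canary_rows}")
        if abs(float(canary_total or 0) - float(prod_total or 0)) > tolerance * max(float(prod_total or 1), 1.0):
            problems.append(f"{day}: amount {prod_total} vs {canary_total}")
    return problems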
Pro tip: Treat data model changes like API changes. Version your topics/models and deprecate with a schedule, not a memo.
What “good” looks like in numbers
At a fintech we helped last quarter:
- Freshness: P95 end-to-end fell from 47m (batch) to 85s (stream). P99 to 140s.
- Completeness: > 99.95% events inside watermark; late data reconciled nightly.
- MTTR: 62% reduction after OpenLineage + Prometheus + runbooks.
- Cost: 28% Snowflake credit reduction by pushing joins/filters upstream in Flink and landing curated Iceberg tables.
- Business: Fraud model moved from 30m windows to rolling 5m features; chargeback loss down 11% QoQ.
We didn’t add magic. We removed “vibe coding,” set SLOs, and built brakes.
Landmines I’d avoid again
- DIY parquet lakes: You’ll reinvent Iceberg/Delta poorly. Use ACID tables with manifests.
- Ignoring schema registry: JSON with “flexible schemas” is a horror movie.
- Pretending exactly-once is free: Understand Flink checkpoints, idempotent sinks, and transactional writes. Test broker failover.
- Time zones: Normalize to UTC at ingress. Localize at presentation only.
- Infinite fan-out: Don’t create 30 downstream topics per table. Curate. Model. Govern.
- Silent late data: Use watermarks and side outputs for extreme late events. Don’t let them poison aggregates (see the sketch after this list).
- AI-generated pipeline code without review: We’ve rescued too many “vibe coded” DAGs that passed CI but failed reality. LLMs can scaffold, but enforce contracts, tests, and reviews.
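A minimal sketch of the side-output idea from the late-data landmine above: compare event time against the current watermark and route extreme stragglers to a reconciliation topic instead of the hot path (topic names and the lateness window are illustrative):

# route extremely late events to a side output so they never poison live aggregates
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=5)  # illustrative window

def route(event: dict, watermark: datetime) -> str:
    if event["event_time"] < watermark - ALLOWED_LATENESS:
        return "orders_late"     # side output: reconciled nightly
    return "orders_curated"      # hot path feeding decision-grade aggregates

if __name__ == "__main__":
    wm = datetime(2024, 1, 1, 12, 0, 0)
    print(route({"event_time": datetime(2024, 1, 1, 11, 58)}, wm))  # orders_curated
    print(route({"event_time": datetime(2024, 1, 1, 11, 40)}, wm))  # orders_late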
If you want a second set of eyes on a fragile stream or an AI-coded DAG, we do code rescue and vibe code cleanup weekly at GitPlumbers.
Key takeaways
- Define decision-grade SLOs for freshness, completeness, accuracy, and lineage; wire them to alerts that page humans.
- Use CDC into Kafka with schema contracts, process with Flink/Spark, land to Iceberg/Delta, and validate with Great Expectations/dbt.
- Prevent bad data from shipping: schema compatibility gates, expectation checks, circuit breakers, and lineage-aware rollbacks.
- Shadow, canary, and data-diff before promotion; always keep a replay path (Kafka) and a backfill plan.
- Track business outcomes: P95 freshness, incident MTTR, and cost per decision—then iterate.
Implementation checklist
- Write data SLOs: freshness, completeness, accuracy, volume, lineage, and availability.
- Stand up CDC with Debezium into Kafka; register schemas in Schema Registry with enforced compatibility.
- Pick a streaming engine (Flink 1.17+ or Spark 3.5 Structured Streaming) and implement watermarks + dedup.
- Use Iceberg/Delta for ACID lakehouse sinks; avoid hand-rolled parquet folders.
- Gate publishes with Great Expectations and dbt tests; fail closed on violations.
- Emit OpenLineage events from jobs for traceability and root cause.
- Add Prometheus metrics and alerts for data lag, null-rate, and schema rejections.
- Roll out via GitOps (ArgoCD) with shadow pipelines, canaries, and data-diff before cutover.
- Keep replay/backfill runbooks; test them quarterly like DR.
- Instrument business KPIs (conversion, fraud catch-rate) tied to pipeline deployments.
Questions we hear from teams
- Do we really need streaming, or is micro-batch “right-time” enough?
- If your decisions can tolerate 5–15 minutes of delay (e.g., BI dashboards), micro-batch with aggressive scheduling may be fine. But if you’re driving pricing, fraud prevention, inventory allocation, or SLA credits, you need event-driven CDC, watermarks, and low-latency error handling. We often start with micro-batch during shadowing, then promote hot paths to Flink once the SLO delta is clear.
- How do you guarantee exactly-once?
- You don’t, universally. You engineer for idempotency. In practice: Kafka idempotent producers + transactions where possible; Flink checkpointed state and transactional sinks (Iceberg/Snowflake MERGE with deterministic keys); dedup on keys + event-time ordering with watermarks. Then prove it with chaos testing and data-diff (there’s a minimal dedup-and-upsert sketch after these questions).
- Snowflake vs Iceberg/Delta for the curated layer?
- If your org is Snowflake-first, use Snowpipe Streaming and MERGE for upserts, but keep Kafka retention for replay. If you want open formats and cheap long-term storage, Iceberg/Delta on S3/GCS with Flink writes is excellent. We deploy hybrids: stream to Iceberg for bronze/silver, expose gold marts in Snowflake/BigQuery for analysts.
- Our pipelines were partially generated by an LLM. Is that a problem?
- Not inherently. It’s a problem when it ships without contracts, tests, and lineage. We do AI code refactoring and vibe code cleanup: add schema contracts, GE/dbt tests, SLO instrumentation, and runbooks. Treat AI output like a junior engineer’s draft — useful, but verify.
- What’s the fastest path from zero to a decision-grade MVP?
- 30–45 days if you’ve got cloud access: Terraform a managed Kafka (Confluent/Redpanda), Debezium for 1–2 critical tables, Flink SQL for the hot path, Iceberg on S3, Great Expectations gate, dbt model, Airflow/Dagster with OpenLineage, Prometheus alerts. Shadow for two weeks, then canary. Keep scope brutally narrow.
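A minimal sketch of the idempotency answer above: dedup on a deterministic key, latest event time wins, and an upsert that is safe to replay (the in-memory dict stands in for an Iceberg/Snowflake MERGE):

# engineer for idempotency: replaying the same batch twice changes nothing
from datetime import datetime

def dedup_latest(events: list[dict]) -> dict[str, dict]:
    latest: dict[str, dict] = {}
    for event in events:
        key = event["order_id"]
        if key not in latest or event["event_time"] > latest[key]["event_time"]:
            latest[key] = event
    return latest

def upsert(curated: dict[str, dict], batch: list[dict]) -> None:
    for key, row in dedup_latest(batch).items():
        existing = curated.get(key)
        if existing is None or row["event_time"] >= existing["event_time"]:
            curated[key] = row  # MERGE ... WHEN MATCHED / WHEN NOT MATCHED in a real sink

if __name__ == "__main__":
    table: dict[str, dict] = {}
    batch = [{"order_id": "o1", "event_time": datetime(2024, 1, 1, 0, 0, 5), "status": "PAID"}]
    upsert(table, batch)
    upsert(table, batch)  # replay: no change, which is the whole point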
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
