Stop Blaming the Model: Build a Feature Store That Doesn’t Lie in Prod
Your model is fine. Your features aren’t. Here’s the feature store architecture and guardrails we deploy when teams keep getting burned by drift, hallucinations, and latency spikes.
Boring prod is good prod. Your feature store is either making that possible—or making your pager go off.
The incident we walked into last quarter
A Series D fintech called us after their fraud declines spiked 18% overnight. Same model, new feature set. Offline tests looked perfect. In prod, p99 latency for feature lookups popped from 45ms to 310ms, cache hit rate cratered, and the last_30d_txn_count feature started showing up as 0 for loyal customers.
Root cause: classic offline/online skew. The offline pipeline (Spark on Delta Lake) used a fillna(0) on missing counts; the online feature service (Redis-backed) did not. To make it worse, the TTL on Redis was 24h, but the underlying Kafka topic had late events and out-of-order updates. Features went stale; the model got confidently wrong; declines shot up.
We didn’t retrain. We rebuilt the feature store path: unified transformations, point-in-time joins, stricter TTLs, and hard guardrails. Declines normalized in 48 hours; p95 lookup latency back under 80ms; incident closed without touching model weights.
If your model behaves differently in prod than in your notebook, 9 times out of 10 it’s your feature store lying to you.
What a production feature store actually is
Skip the vendor bingo for a second. A usable production feature store has these parts:
- Offline store: your truth for historical features (e.g., Delta Lake on S3/ADLS, BigQuery, Snowflake). Supports backfills, point-in-time correctness, and lineage.
- Online store: a low-latency KV for serving (e.g., Redis, DynamoDB, Cassandra, Bigtable). Handles upserts, TTL, and high QPS with predictable tail latency.
- Registry: versioned metadata: feature definitions, sources, ownership, SLAs.
- Transformation layer: batch (Spark) and streaming (Flink/Kafka Streams) jobs that produce identical features for offline and online.
- Materialization: jobs that keep the online store warm with fresh features.
- Access API: a consistent SDK/HTTP surface for models and services.
Tools that make this sane:
- OSS: Feast (+ Spark/Flink), Kafka, Redis, Delta Lake.
- Managed: Tecton, Databricks Feature Store, SageMaker Feature Store, Vertex AI Feature Store.
Here’s what a minimal Feast setup looks like:
```yaml
# feature_store.yaml
project: risk_scoring
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: redis:6379,db=1
offline_store:
  type: file
  path: s3://my-bucket/feature_repo
```

```python
# features.py
from datetime import timedelta

from feast import FeatureView, Field, FileSource
from feast.types import Float32, Int64

source = FileSource(
    path="s3://my-bucket/transactions.parquet",
    timestamp_field="event_ts",
)

transactions_fv = FeatureView(
    name="tx_30d",
    entities=["customer_id"],
    ttl=timedelta(hours=12),
    schema=[
        Field(name="txn_count_30d", dtype=Int64),
        Field(name="avg_ticket_30d", dtype=Float32),
    ],
    online=True,
    source=source,
)
```

The important part isn't the YAML—it's enforcing that the same transformation code and time semantics produce features for both offline evaluation and online serving.
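One way to enforce that parity is to define each feature's computation, null handling included, exactly once and import it from both the batch job and the online service. A minimal sketch, assuming illustrative function names (the `txn_count_30d` default of 0 mirrors the `fillna(0)` divergence from the incident above):

```python
# transforms.py: single source of truth for feature logic.
# Both the batch backfill and the online service import from here,
# so null handling can never diverge between the two paths.

def txn_count_30d(raw_count):
    """30-day transaction count; missing data is explicitly 0."""
    return int(raw_count) if raw_count is not None else 0

def avg_ticket_30d(total_amount, txn_count):
    """Average ticket size; undefined (None) when there are no transactions."""
    if not txn_count:
        return None
    return float(total_amount) / txn_count

def build_features(raw: dict) -> dict:
    """One entry point used by offline backfills and online serving alike."""
    count = txn_count_30d(raw.get("txn_count_30d"))
    return {
        "txn_count_30d": count,
        "avg_ticket_30d": avg_ticket_30d(raw.get("total_amount_30d", 0.0), count),
    }
```

With this in place, a dueling-`fillna` bug becomes structurally impossible: there is only one place where the default lives, and one set of tests covering it.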
Design the data path for parity and speed
You want to optimize for two things that fight each other: parity (no training/serving skew) and speed (sub-100ms lookup at p95). What actually works:
Single source of transformation truth
- Put aggregations in one repo, one language. If you're using Spark/Flink for batch/streaming, generate both from shared libs. Feast's on-demand features or Tecton's transforms help.
- Kill dueling fillna behavior. Define null handling in the feature definition, with tests.
Point-in-time correctness
- No peeking. Use built-in point-in-time joins (e.g., Feast's get_historical_features) so training sees only data available at time t.
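Under the hood, a point-in-time join pairs each training label with the latest feature value whose timestamp is at or before the label's timestamp. A stdlib-only sketch of that semantics (Feast's get_historical_features does this at scale against the offline store):

```python
from bisect import bisect_right

def point_in_time_value(history, label_ts):
    """history: list of (event_ts, value) tuples sorted by event_ts.
    Returns the latest value with event_ts <= label_ts, or None.
    Values after label_ts are invisible: no peeking into the future."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx else None

# txn_count_30d for one customer over time; a label at t=250 must see 5, not 9.
history = [(100, 3), (200, 5), (300, 9)]
```

The failure mode this prevents is leakage: a training row at t=250 joined against the t=300 value looks great offline and falls apart in prod.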
Materialize predictably
- Stream updates from Kafka/Flink to the online store for hot features; batch backfill via Airflow/Argo for daily aggregates.
- Use idempotent upserts. Deduplicate on (entity_id, event_ts).
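Idempotent here means replaying the same event, or receiving one out of order, can never move a feature backwards in time. A minimal in-memory sketch of the upsert rule, keyed on (entity_id, event_ts):

```python
def upsert(store, entity_id, event_ts, value):
    """Write only if the event is at least as new as what's stored.
    Replays and out-of-order duplicates become no-ops, so the job
    can be rerun safely after a failure."""
    current = store.get(entity_id)
    if current is None or event_ts >= current[0]:
        store[entity_id] = (event_ts, value)
    return store
```

This is exactly the guard the incident above was missing: late Kafka events overwrote fresher values, and the online store quietly went stale.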
Set realistic TTLs and freshness SLAs
- Align TTL with your business logic. If a 30-day aggregate changes hourly, TTL of 12h is wrong. Track freshness lag explicitly.
Budget your latency
- End-to-end SLO: 200ms p95? Save 80ms for feature fetches, 80ms for model inference, 40ms for network/overhead.
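A simple way to hold that budget is a deadline object threaded through the request: each stage gets at most its share of whatever time remains, so a slow feature fetch fails fast instead of eating the model's slice. A sketch under the 200ms example above (the Deadline class is illustrative, not a library API):

```python
import time

class Deadline:
    """Tracks remaining time for one request against an end-to-end budget."""

    def __init__(self, budget_s: float):
        self.expires = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires - time.monotonic())

    def slice(self, cap_s: float) -> float:
        """Give a stage at most cap_s, never more than what's left."""
        return min(cap_s, self.remaining())

# 200ms end-to-end budget: 80ms features, 80ms inference, rest for overhead.
deadline = Deadline(0.200)
feature_timeout = deadline.slice(0.080)  # pass as the Redis/HTTP client timeout
```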
Example materialization run:
```bash
# Backfill last 7 days into Redis with Feast
feast materialize 2025-09-27T00:00:00 2025-10-04T00:00:00

# Then keep it warm with streaming (Flink job) or periodic refresh
feast materialize-incremental 2025-10-04T00:00:00
```

For RAG systems, treat embeddings as features: version your encoder, track the corpus snapshot ID, and set a freshness SLA for the vector index (e.g., pgvector, Weaviate). Stale embeddings are just drift with better branding.
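Checking that SLA can be as simple as comparing the index's last refresh timestamp and pinned versions against what the serving path expects. A sketch, assuming illustrative metadata field names rather than any specific vector DB's API:

```python
import time

def index_is_fresh(meta, pinned_encoder, pinned_corpus, max_lag_s=3600):
    """Stale if the encoder or corpus snapshot drifted from what the
    model was pinned to, or the refresh lag blew the freshness SLA."""
    if meta["encoder_version"] != pinned_encoder:
        return False
    if meta["corpus_snapshot_id"] != pinned_corpus:
        return False
    return (time.time() - meta["last_refresh_ts"]) <= max_lag_s

# Metadata you'd store alongside the vector index (illustrative values).
index_meta = {
    "encoder_version": "encoder-v2",
    "corpus_snapshot_id": "corpus-2025-10-01",
    "last_refresh_ts": time.time() - 600,  # refreshed 10 minutes ago
}
```

Run this on a schedule and page when it returns False; an encoder bump without a re-embed is the RAG equivalent of dueling fillna.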
Instrumentation and observability that save your weekend
Feature stores aren’t “data infra.” They’re user-facing prod services. Instrument like one.
Track these at minimum:
- Latency: histogram for feature lookup p50/p95/p99.
- Freshness lag: seconds since last successful materialization per feature.
- Null/NaN rate: by feature, by entity, over time.
- Skew checks: offline vs. online distribution deltas.
- Cache hit rate: Redis hits/misses.
- Error budget burn: tie SLO to paging policy.
Quick-and-dirty Python example for a FastAPI feature service:
```python
# app.py
import redis
from fastapi import FastAPI
from prometheus_client import Counter, Gauge, Histogram, generate_latest
from starlette.responses import Response
from opentelemetry import trace

app = FastAPI()
redis_client = redis.Redis(host="redis", port=6379, db=1, decode_responses=True)

H_FEATURE_LATENCY = Histogram("feature_lookup_seconds", "Feature fetch latency")
C_FEATURE_ERRORS = Counter("feature_lookup_errors_total", "Feature fetch errors")
G_FRESHNESS_LAG = Gauge("feature_freshness_seconds", "Seconds since last update", ["feature"])
G_NULL_RATE = Gauge("feature_null_ratio", "Null ratio", ["feature"])

tracer = trace.get_tracer(__name__)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")

@app.get("/features/{customer_id}")
def get_features(customer_id: str):
    with H_FEATURE_LATENCY.time():
        with tracer.start_as_current_span("feature_lookup") as span:
            try:
                feats = redis_client.hgetall(f"cust:{customer_id}")
                span.set_attribute("feature.count", len(feats))
                if not feats:
                    G_NULL_RATE.labels("txn_count_30d").set(1.0)
                    raise KeyError("no features")
                # Update gauges (in real code, compute per-feature)
                G_NULL_RATE.labels("txn_count_30d").set(0.0)
                return feats
            except Exception:
                C_FEATURE_ERRORS.inc()
                raise
```

Drift monitoring: run a scheduled job comparing recent online distributions to offline training baselines. EvidentlyAI, Arize, Fiddler, WhyLabs—take your pick. Example with Evidently:
```python
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric

report = Report(metrics=[ColumnDriftMetric(column_name="txn_count_30d")])
report.run(reference_data=offline_df, current_data=online_sample_df)

res = report.as_dict()
if res["metrics"][0]["result"]["drift_detected"]:
    # alert + auto reduce traffic via rollout controller
    print("Drift detected for txn_count_30d")
```

Wire alerts in Prometheus/Grafana for:
- feature_lookup_seconds_bucket{le="0.1"} under your target percentile
- feature_freshness_seconds{feature="*"} > SLA
- Sudden jumps in feature_null_ratio
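The FastAPI service above hardcodes a single null-rate gauge; in real code you'd compute the ratio per feature over a rolling sample of lookups and set one gauge label per feature. A stdlib sketch of that computation:

```python
def null_ratios(samples):
    """samples: list of feature dicts from recent lookups.
    Returns per-feature null ratio; a missing key counts as null,
    so a silently dropped column shows up as 100% nulls, not zero."""
    if not samples:
        return {}
    features = set().union(*(s.keys() for s in samples))
    return {
        f: sum(1 for s in samples if s.get(f) is None) / len(samples)
        for f in features
    }
```

Feed the result into the `feature_null_ratio` gauge on a timer, and the "last_30d_txn_count suddenly reads 0 for loyal customers" class of incident pages you before the decline rate does.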
Guardrails for hallucinations, drift, and latency spikes
You won’t stop every incident, but you can make them boring.
- Timeouts and retries with budgets: Don’t let feature lookups eat your latency SLO. Timebox requests and fail fast.
- Circuit breakers: Eject bad upstreams and shed load.
- Fallbacks: Serve a simpler model or cached score if features are missing.
- Content safety for LLMs: Ground outputs with retrieval, add moderation/guardrails, and log prompt/response pairs for audit.
Istio example: sane timeouts/retries and outlier detection around your feature API.
```yaml
# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: feature-svc
spec:
  hosts: ["feature-svc"]
  http:
    - route:
        - destination:
            host: feature-svc
      timeout: 250ms
      retries:
        attempts: 2
        perTryTimeout: 100ms
        retryOn: 5xx,gateway-error,connect-failure,refused-stream
```

```yaml
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: feature-svc
spec:
  host: feature-svc
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
```

Fallback in code when features are late or missing:
```python
def missing_rate(feats: dict) -> float:
    """Fraction of expected features that came back null or absent."""
    expected = ["txn_count_30d", "avg_ticket_30d"]
    return sum(1 for f in expected if feats.get(f) is None) / len(expected)

def score(customer_id: str):
    try:
        feats = get_features(customer_id)
        if missing_rate(feats) > 0.3:
            raise ValueError("too many missing features")
        return model.predict(feats)
    except Exception:
        # Safe fallback: cached score or rules-based baseline
        return rules_engine_score(customer_id)
```
- Retrieve top-k docs with RAG and pass citations; reject answers with low retrieval score.
- Use NeMo Guardrails/Guardrails AI/Llama Guard for safety and PII redaction.
- Add an automatic evaluator (e.g., ragas) on a sample of traffic; page on drop in groundedness.
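The rejection rule in the first bullet is worth making explicit in code: if the best retrieved document scores below a threshold, return a refusal instead of letting the model answer ungrounded. A sketch, with an illustrative threshold and response shape:

```python
MIN_RETRIEVAL_SCORE = 0.55  # illustrative; tune against labeled traffic

def answer_or_refuse(question, retrieved, generate):
    """retrieved: list of (doc_id, score) from the vector index.
    Refuse rather than hallucinate when grounding is weak; otherwise
    generate only from docs above threshold and return them as citations."""
    if not retrieved or max(score for _, score in retrieved) < MIN_RETRIEVAL_SCORE:
        return {"answer": None, "reason": "insufficient grounding", "citations": []}
    top = [doc_id for doc_id, score in retrieved if score >= MIN_RETRIEVAL_SCORE]
    return {"answer": generate(question, top), "citations": top}
```

The refusal path should feed the same error-budget metrics as feature-fetch failures: a spike in refusals usually means a stale index, not a chatty model.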
Deployment patterns that don’t blow up SLOs
If you flip straight to 100% traffic, you’re asking for a retro.
- Shadow traffic: New feature pipeline reads real requests but doesn’t affect responses. Compare distributions and latency off-path.
- Canary deployments: 1% -> 5% -> 25% with automated analysis.
- Feature flags: Use LaunchDarkly/Unleash to gate new features by cohort.
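Cohort gating doesn't require a vendor to prototype: a stable hash of the customer ID into percentage buckets gives deterministic, gradually widenable rollout (the same idea behind gradual-rollout strategies in tools like Unleash; this helper is illustrative):

```python
import hashlib

def in_rollout(customer_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a customer into 0-99.
    Stable across requests and hosts, and cohorts are nested:
    everyone in the 5% cohort stays in when you widen to 25%."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Widen from 5% to 25% by changing only `percent`; no re-bucketing.
use_new_pipeline = in_rollout("cust-42", "tx_30d_v2", percent=5)
```

Hashing on feature name plus customer ID means different features roll out to different cohorts, so one bad pipeline doesn't always hit the same unlucky customers.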
Argo Rollouts example with a Prometheus analysis on p95 latency:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: feature-svc
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: p95-latency
        - setWeight: 25
        - pause: {duration: 600}
        - setWeight: 50
      trafficRouting:
        istio:
          virtualService:
            name: feature-svc
  template:
    spec:
      containers:
        - name: feature-svc
          image: ghcr.io/acme/feature-svc:1.8.3
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-latency
spec:
  metrics:
    - name: feature_p95
      interval: 1m
      failureLimit: 1
      successCondition: result[0] <= 0.12
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95, sum(rate(feature_lookup_seconds_bucket[5m])) by (le))
```

If p95 goes above 120ms, the rollout pauses or rolls back automatically. No heroics.
Operating the store: contracts, validation, and hygiene
What bites teams isn’t fancy ML—it’s boring data hygiene.
- Data contracts & schema registry: Define Avro/Protobuf schemas; register them in Confluent Schema Registry. No breaking changes without a migration plan.
- Validation: Use Great Expectations/Deequ on both offline and streaming paths.
- Governance: PII tagging, RBAC, audit logs. Encrypt at rest and in transit. Treat feature stores like prod DBs.
- Backfills & replays: Tested, idempotent, and throttled. Recompute under a new feature version.
Example Great Expectations test snippet:
```python
import great_expectations as ge

df = ge.from_pandas(online_sample_df)
df.expect_column_values_to_not_be_null("txn_count_30d")
df.expect_column_values_to_be_between("avg_ticket_30d", min_value=0, max_value=10000)

res = df.validate()
assert res.success, "Feature validation failed"
```
- Pin model artifacts in MLflow with the exact feature registry version and corpus snapshot (for RAG).
- Store a "data build ID" on each request so you can replay prod behavior in a notebook.
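A "data build ID" can be as small as a serialized tuple of the versions that produced a prediction, stamped on every response. A sketch with illustrative field names:

```python
import json

def data_build_id(registry_version, model_version, corpus_snapshot=None):
    """Everything needed to replay this prediction in a notebook later:
    which feature definitions, which model, and (for RAG) which corpus."""
    parts = {"registry": registry_version, "model": model_version}
    if corpus_snapshot:
        parts["corpus"] = corpus_snapshot
    return json.dumps(parts, sort_keys=True)

def respond(score, registry_version, model_version):
    """Attach the build ID to the scoring response for later replay."""
    return {
        "score": score,
        "data_build_id": data_build_id(registry_version, model_version),
    }
```

Log the ID with every request; when an incident hits, you can reconstruct the exact offline environment instead of arguing about which pipeline version was live.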
What I’d do again tomorrow
I’ve seen teams throw months at retraining when their real problem was a leaky feature pipe. The playbook that actually works:
- Start with clear SLOs for features and end-to-end latency.
- Pick a feature store with a credible offline/online story and don’t DIY the hard parts unless you must.
- Unify transforms; enforce point-in-time correctness; set TTLs based on business, not vibes.
- Instrument like it matters. Because it does.
- Add guardrails so incidents degrade gracefully instead of paging everyone.
- Deploy with canaries and shadow. Let metrics decide.
Do this, and your “mysterious model regressions” turn into routine ops work. That’s the point. Boring prod is good prod.
Key takeaways
- Design for offline/online parity from day one: same transformations, same time semantics, same versioned data.
- Instrument feature pipelines like any critical service: latency, freshness, null rates, and drift on dashboards with alerts tied to SLOs.
- Guardrails matter: circuit breakers, rate limits, fallbacks, and safety checks for generative outputs reduce fire drills.
- Deploy with canaries and shadow traffic; measure p95/p99 of feature lookups, not just model inference.
- Automate validation with data contracts, schema registry, and pre-prod checks for feature quality and leakage.
Implementation checklist
- Define AI SLOs: feature lookup p95, end-to-end latency budget, freshness max lag, error budget.
- Choose a feature store with clear offline/online story (e.g., Feast + Redis + Delta/S3 or Tecton/Databricks FS).
- Unify transformations via a single code path; enforce point-in-time correctness and TTLs.
- Instrument feature services with OpenTelemetry traces and Prometheus metrics (null rates, freshness lag, cache hit).
- Add guardrails: Istio/Envoy timeouts, retries with jitter, circuit breakers, and safe fallbacks.
- Deploy via canary + shadow using Argo Rollouts; auto-rollback on SLO regression.
- Continuously monitor drift (Evidently/WhyLabs/Arize) and validate data with Great Expectations/Deequ.
- Lock in governance: schema registry, data contracts, PII handling, RBAC, audit logs.
Questions we hear from teams
- Do I need a feature store if we’re only serving an LLM?
- If you’re doing RAG or re-ranking, you already have features: embeddings, recency, click signals. Version your encoder and corpus, set freshness SLAs for the index, and track groundedness. A lightweight store (Feast + Redis) works; managed options reduce toil.
- Feast vs. Tecton vs. Databricks Feature Store—what should I pick?
- If you have strong platform engineering and want OSS control, Feast is solid. If you want enterprise support, lineage, and governance out of the box, Tecton is strong. If you’re all-in on Databricks (Delta/Unity Catalog), their Feature Store keeps everything in one plane. The real win is standardizing transforms and enforcing parity—tooling is second.
- What SLOs should we set for features?
- Common targets: p95 feature lookup under 100–150ms, freshness lag under 2–5 minutes for streaming features, null rate under 1% per feature, and error budget aligned with your customer SLO. Track cache hit rate and distribution skew alerts.
- How do we avoid training/serving skew?
- Single code path for transforms, point-in-time joins for training, versioned feature definitions, and pre-prod validation comparing offline vs. shadow online samples. Block deploys if skew exceeds thresholds.
- How do we deploy safely without stalling the roadmap?
- Shadow traffic + canary with Argo Rollouts. Automate rollback based on Prometheus metrics (p95/p99, freshness, null rate). Feature flag new features to cohorts. This adds hours, not weeks, and saves you days of incident response.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
