Feature Stores That Don’t Gaslight You: Serving the Same Truth Online and Offline
If your model returns different answers depending on the path that fed it features, you don’t have an AI system — you have improv. Here’s the feature store architecture and guardrails that actually hold up in prod.
The day your model started lying
I’ve watched a credit decisioning team ship a “minor” feature tweak on Friday: swapped a streaming enrichment for an hourly batch. Same model, same traffic, suddenly approval rates drifted by 7% and p99 jumped 300ms. Monday was a blamestorm. Root cause: training used clean, point-in-time data. Serving stitched features from three code paths with different freshness and defaulting rules. The model wasn’t wrong — the features were inconsistent.
If your features aren’t consistent online/offline, your model isn’t deterministic; it’s improv.
This is the class of problem a real feature store solves — but only if you treat instrumentation and guardrails as table stakes, not “we’ll add later.”
Architect the feature store for the world you deploy to
The happy-path slideware says “single pane of glass.” Reality: you need a boring, predictable pipeline where training and serving share the exact same logic and data contracts.
What actually works:
- One feature definition used by both training and serving (no drift-y forks in notebooks vs microservices).
- Point-in-time correctness for training (ban ‘as of now’ joins).
- Offline store in your lakehouse (Parquet/Delta/Iceberg on S3/GCS/ADLS).
- Online store for low-latency serving (Redis, DynamoDB, Scylla/Cassandra).
- Streaming source (Kafka/PubSub/Kinesis) for freshness where needed.
- Orchestration with Airflow/Dagster and lineage via OpenLineage/Marquez.
If you’re greenfield, Feast is a solid starting point. Tecton and Hopsworks are battle-tested managed options. Databricks Feature Store works well if you’re already all-in on DBX.
Minimal Feast-style config:
```yaml
# feature_store.yaml
project: credit_scoring
registry: data/registry.db
provider: local
offline_store:
  type: file
  path: ./data
online_store:
  type: redis
  connection_string: redis://redis:6379/0
```

Feature definitions in code (shared by training and serving):
```python
# features.py
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

customers = Entity(name="customer_id", join_keys=["customer_id"])

transactions_source = FileSource(
    path="data/transactions.parquet",
    timestamp_field="event_ts",
)

avg_spend_30d = FeatureView(
    name="avg_spend_30d",
    entities=[customers],
    ttl=None,
    schema=[
        Field(name="avg_spend", dtype=Float32),
        Field(name="txn_count", dtype=Int64),
    ],
    source=transactions_source,
)
```

Training with point-in-time joins (no leakage):
```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")
training_df = store.get_historical_features(
    entity_df=my_labels_df,  # includes event_ts, customer_id, label
    features=[
        "avg_spend_30d:avg_spend",
        "avg_spend_30d:txn_count",
    ],
).to_df()
```

For auditability, I still like an explicit point-in-time SQL check:
```sql
SELECT t.customer_id, t.event_ts, f.avg_spend
FROM labels t
LEFT JOIN features f
  ON f.customer_id = t.customer_id
 AND f.event_ts <= t.event_ts
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY t.customer_id, t.event_ts ORDER BY f.event_ts DESC
) = 1;
```

The serving path must use the same transformation code and materialize to the online store with the same defaults/TTLs. No "re-implemented in Go for speed" unless it's literally generated from the same spec.
```bash
# materialize features online
feast materialize-incremental $(date +%Y-%m-%d)
```

Rules that save weekends:
- Default values are explicit and versioned.
- TTLs are part of the feature contract.
- Backfills are replayable from Kafka/object storage.
- Every feature is traceable: who computed it, from what, when.
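That traceability rule is easiest to enforce when every request carries a deterministic fingerprint of the feature contract it was served under. A minimal sketch, assuming your definitions (versions, defaults, TTLs) can be expressed as a plain dict — the field names here are illustrative, not a Feast API:

```python
import hashlib
import json

def feature_set_hash(definitions: dict) -> str:
    """Deterministic fingerprint of the feature contract (names, versions,
    defaults, TTLs). Any change to the contract changes the hash."""
    canonical = json.dumps(definitions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Example contract: explicit defaults and TTLs are part of the definition.
defs = {
    "avg_spend_30d:avg_spend": {"version": 3, "default": 0.0, "ttl_s": 86400},
    "avg_spend_30d:txn_count": {"version": 3, "default": 0, "ttl_s": 86400},
}
fingerprint = feature_set_hash(defs)
```

Log the fingerprint on every inference request; two requests with different hashes were, by definition, served different feature logic.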
Instrumentation first: observe the features, not just the model
I’ve seen teams graph model accuracy while flying blind on feature freshness. Don’t do that. You need first-class metrics on the data plane, emitted per request and aggregated per feature.
Track at minimum:
- Feature freshness (seconds since last update) and staleness SLOs.
- Null/default ratios per feature.
- Online lookup latency p50/p95/p99.
- Feature version and feature set hash on every inference request.
- End-to-end trace: retrieval → model → policy → downstream side effects.
A thin Python serving layer with Prometheus and OpenTelemetry:
```python
# app.py
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, Gauge, make_asgi_app
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry import trace
import time

# model, online_store_read, and hash_features are your own objects/helpers
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
tracer = trace.get_tracer(__name__)

FEATURE_LOOKUP_LATENCY = Histogram(
    "feature_lookup_seconds", "Latency of online feature reads", ["feature"]
)
FEATURE_FRESHNESS = Gauge(
    "feature_freshness_seconds", "Seconds since last update", ["feature"]
)
PREDICTION_LATENCY = Histogram("prediction_seconds", "Model latency")
PREDICTIONS = Counter("predictions_total", "Count", ["model_version", "feature_hash"])

# Mount the Prometheus ASGI app; returning make_asgi_app() from a route handler won't work
app.mount("/metrics", make_asgi_app())

@app.post("/score")
async def score(req: Request):
    payload = await req.json()
    with tracer.start_as_current_span("score"):
        t0 = time.time()
        features = {}
        for f in ["avg_spend_30d:avg_spend", "avg_spend_30d:txn_count"]:
            s = time.time()
            val, freshness = online_store_read(f, payload["customer_id"])  # your client
            FEATURE_LOOKUP_LATENCY.labels(feature=f).observe(time.time() - s)
            FEATURE_FRESHNESS.labels(feature=f).set(freshness)
            features[f] = val
        pred = model.predict(features)
        PREDICTION_LATENCY.observe(time.time() - t0)
        PREDICTIONS.labels(model_version=model.version, feature_hash=hash_features(features)).inc()
        return {"score": float(pred)}
```

Add OTel baggage like feature_set_version so traces tie back to what was served. When something goes sideways, you want a single trace showing which features were stale, not a Slack thread guessing.
Safety guardrails for AI flows (LLMs and classic models)
Hallucination, PII leakage, jailbreaks — you can’t “train those away.” You wrap inference with strict contracts and fallbacks.
What we deploy in practice:
- Schema validation on outputs (pydantic with strict types/lengths). Reject non-conformant outputs.
- Retrieval-augmented generation (RAG) with citation binding (outputs must reference retrieved IDs).
- Confidence thresholds (calibrated scores or entailment checks) → fallback to deterministic templates/search.
- Content safety filters (OpenAI/Azure/Google or open source like guardrails-ai/guardrails).
- Rate limits + cost guards per tenant.
Example: force JSON with citations and block ungrounded answers.
```python
from pydantic import BaseModel, Field, ValidationError

class Answer(BaseModel):
    summary: str = Field(..., max_length=512)
    citations: list[str] = Field(..., min_length=1)  # pydantic v2 (min_items on v1)

# retriever, llm, format_ctx, fallback_search_answer, toxicity are your own
retrieved = retriever(query)
raw = llm.invoke({
    "query": query,
    "context": format_ctx(retrieved),
}, response_format={"type": "json_object"})  # use JSON mode if available
try:
    ans = Answer.model_validate_json(raw)
except ValidationError:
    return fallback_search_answer(query)  # deterministic path
if not all(cid in {d.id for d in retrieved} for cid in ans.citations):
    return fallback_search_answer(query)  # ungrounded citation → bail
if toxicity(ans.summary) > 0.01:
    return "I can’t answer that."
return ans.model_dump()
```

For classic classifiers/regressors, enforce a minimum confidence (after calibration with Platt/isotonic) and route low-confidence cases to a human or a safer heuristic. Log rejections as first-class events; that data becomes your next training set.
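A sketch of that confidence routing with scikit-learn; the toy data, the 0.8 threshold, and the `human_review` route name are assumptions to replace with your own model and risk tolerance:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Toy data standing in for your real training set
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Cross-validated isotonic calibration (method="sigmoid" gives Platt scaling)
calibrated = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=3).fit(X, y)

def decide(x, threshold=0.8):
    """Route low-confidence predictions to a safer path instead of guessing."""
    p = calibrated.predict_proba(x.reshape(1, -1))[0]
    conf, label = float(p.max()), int(p.argmax())
    if conf < threshold:
        return {"route": "human_review", "confidence": conf}
    return {"route": "auto", "label": label, "confidence": conf}
```

The point of calibrating first is that the threshold then means something: 0.8 approximates "right 4 times out of 5", which you can defend to the business.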
Mitigate drift before it pages you at 3am
Drift isn’t a binary event; it’s a slow leak that becomes an outage. Watch both data and model behavior, and act before SLOs burn.
- Data drift: PSI/KL on key features; alert if sustained over N windows.
- Concept drift: monitor label distribution, error rates per segment.
- Feature store drift: null/default ratios, freshness decay.
- Automated triggers: open a retrain ticket/PR when drift crosses thresholds.
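PSI itself is a few lines of numpy. A minimal sketch — the common rules of thumb are < 0.1 stable, 0.1–0.25 moderate shift, > 0.25 act, but those cut-offs are conventions, not laws:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature.
    Bin edges come from the reference window; eps avoids log(0)."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
stable = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000))
```

`stable` sits near zero; `shifted` (a half-sigma mean shift) clears the 0.1 warning line. Values outside the reference bin edges are dropped by `np.histogram`, which is acceptable for a sketch but worth handling with open-ended edge bins in production.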
Lightweight drift job with Evidently that exports Prometheus metrics:
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from prometheus_client import Gauge, start_http_server
import pandas as pd

start_http_server(8001)  # keep this process alive so Prometheus can scrape it
DRIFT_SCORE = Gauge("feature_psis", "PSI by feature", ["feature"])

ref = pd.read_parquet("s3://bucket/ref_window.parquet")
curr = pd.read_parquet("s3://bucket/curr_window.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref, current_data=curr)
for m in report.as_dict()["metrics"]:
    if m["metric"].endswith("DataDriftTable"):
        for row in m["result"]["drift_by_columns"].values():
            DRIFT_SCORE.labels(feature=row["column_name"]).set(row["drift_score"])
```

If you need online drift/anomaly detection, alibi-detect with KSDrift or MMDDrift on feature vectors works well in a sidecar.
Tie alerts to SLOs that matter (e.g., "approval rate delta > 2% for 30m" or "LLM refusal rate > 5%, p95 latency > 2s"). Don't spam PagerDuty with every wobble.
Latency spikes: how to stop the thundering herd
Most AI outages I see are latency avalanches: bursty traffic + cold caches + downstream retries. Fix the topology, not just the instance size.
- Cache hot features/responses (Redis with per-feature TTLs and negative caching on misses).
- Warmers for embeddings/LLM sessions.
- Timeouts, retries, jitter everywhere; never infinite retries.
- Bulkheads: threadpools/queues per dependency; backpressure.
- Circuit breakers and canaries with automated rollback.
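The timeout/retry rule above can be sketched as a small helper: capped attempts, exponential backoff, full jitter. The delays and the retried exception type are placeholders to tune per dependency:

```python
import random
import time

def retry_with_jitter(fn, attempts=3, base_delay=0.05, max_delay=1.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call fn with capped, jittered exponential backoff. Never infinite:
    after `attempts` failures the last exception propagates to the caller."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Full jitter: uniform(0, min(cap, base * 2^attempt))
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

Full jitter is what breaks retry synchronization: without it, every client that timed out together retries together, and the herd thunders again.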
Istio policy that’s saved us more than once:
```yaml
# destinationrule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-svc
spec:
  host: model-svc.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 200
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 3m
      maxEjectionPercent: 50
```

Canary with Argo Rollouts and Prometheus guardrails:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-svc
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 300}
        - setWeight: 50
        - pause: {duration: 600}
      analysis:
        templates:
          - templateName: prometheus
        analysisRunMetadata: {}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: prometheus
spec:
  metrics:
    - name: error-rate
      successCondition: result[0] < 0.02
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090  # your Prometheus endpoint
          query: |
            sum(rate(http_requests_total{app="model-svc",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="model-svc"}[5m]))
    - name: p95-latency
      successCondition: result[0] < 0.8
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: histogram_quantile(0.95, sum(rate(prediction_seconds_bucket[5m])) by (le))
```

Note the error rate is a ratio of 5xx to total requests, not a raw requests-per-second rate; a bare `rate()` would make the 0.02 threshold meaningless. If either metric trips, Argo halts and rolls back. Set thresholds to business tolerance, not vanity numbers.
Runbook: roll out a feature store without nuking your roadmap
- Inventory every feature feeding your top 2-3 models. Document owners, sources, refresh cadence, defaults.
- Pick a stack that fits your infra gravity: Feast + Redis + S3 covers 80% of cases; managed if your team is small.
- Port 1-2 high-impact features first. Define them once in code; kill the duplication in notebooks/microservices.
- Enforce point-in-time training. Add a data test that fails CI on leakage.
- Materialize to the online store. Wrap a thin client that injects feature version/hash into logs and traces.
- Add metrics: freshness, null ratios, lookup/model latency, error rates. Wire Prometheus + OpenTelemetry day one.
- Gate outputs with schema/thresholds and add deterministic fallbacks. Log all rejections and fallbacks.
- Roll out behind a canary. Define rollback rules on error rate and p95. Don’t promote without passing drift checks.
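The "fails CI on leakage" data test above can be a few lines of pandas. A sketch — `event_ts` matches the earlier examples, but `feature_event_ts` is a hypothetical column name for the joined feature's timestamp; swap in your own:

```python
import pandas as pd

def assert_point_in_time(training_df: pd.DataFrame,
                         label_ts: str = "event_ts",
                         feature_ts: str = "feature_event_ts") -> None:
    """Fail CI if any joined feature was computed after its label's timestamp."""
    leaked = training_df[training_df[feature_ts] > training_df[label_ts]]
    if not leaked.empty:
        raise AssertionError(f"{len(leaked)} rows leak future data:\n{leaked.head()}")

# Passes: every feature timestamp is at or before its label timestamp.
df = pd.DataFrame({
    "customer_id": [1, 2],
    "event_ts": pd.to_datetime(["2024-05-02", "2024-05-03"]),
    "feature_event_ts": pd.to_datetime(["2024-05-01", "2024-05-02"]),
})
assert_point_in_time(df)
```

Run it against a sample of every generated training set in CI; it's the cheapest insurance you'll ever buy against "model looked great offline, fell over in prod".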
Results we routinely see with this pattern:
- 25–40% reduction in MTTR on data-caused incidents.
- 20–50% reduction in p99 spikes under bursty traffic.
- Measurable business stability: approval/refusal rates hold steady across deploys.
If you can’t show that last bullet, you don’t have control — you have excuses.
Key takeaways
- Design your feature store around point-in-time correctness and a single feature definition used by both training and serving.
- Instrument features, not just models: freshness, staleness, drift, and per-feature null ratios need first-class metrics.
- Wrap AI flows with safety guardrails: schema validation, content filters, confidence thresholds, and deterministic fallbacks.
- Use tracing to connect feature retrieval, model inference, and downstream effects; set SLOs where business pain is felt.
- Plan for drift and latency spikes with proactive monitors, canaries, circuit breakers, and backpressure.
Implementation checklist
- Define features once; use the exact same transformations for training and serving.
- Guarantee point-in-time joins; ban ‘as of now’ leakage in training.
- Add Prometheus metrics for feature freshness, staleness, null ratios, and p95/p99 latencies.
- Trace requests with OpenTelemetry across retrieval → model → policy → response.
- Implement output schema validation and confidence thresholds with deterministic fallbacks.
- Detect drift (PSI/KL) and retrain triggers with alerts tied to SLOs.
- Protect against latency spikes with caches, timeouts, retries, and Istio circuit breakers.
- Roll out changes with Argo canaries and automatic rollback on error/latency regressions.
Questions we hear from teams
- Do we need a managed feature store or is Feast enough?
- If you already run Redis/Kafka/S3 and have platform/SRE coverage, Feast plus a small amount of glue code is plenty for a first production system. If your team is lean or compliance is heavy (PII controls, fine-grained RBAC, HIPAA/PCI), managed platforms like Tecton or Hopsworks reduce blast radius and speed up audits.
- How do we prevent training-serving skew in practice?
- Define features once in code and use the exact same transformation logic for historical and online materialization. Enforce point-in-time joins in training, set explicit defaults/TTLs, and block PRs that introduce alternate code paths. Add a CI check that compares feature schema and versions between training and serving artifacts.
- What’s the minimum viable observability for AI in prod?
- Prometheus metrics for feature freshness, null ratios, and p95/p99 latencies; model counters with version/feature-hash labels; OpenTelemetry traces across retrieval → model → policy; and drift metrics (PSI/KL) on top features. Everything else is a nice-to-have until this is green.
- How do we guard against LLM hallucinations without killing UX?
- Constrain outputs with strict schemas, require citations tied to retrieved context, set confidence thresholds, and implement fast fallbacks (search/templates). Users prefer a reliable, occasionally conservative response over confident nonsense. Log rejections and refine prompts/contexts with those cases.
- What SLOs make sense for feature stores?
- Start with: online lookup p95 < 25ms, freshness SLO per feature (e.g., 95% of reads < 2 minutes old), null/default ratio < 1% for critical features, materialization lag < 5 minutes for streaming features. Tie business SLOs (approval/answer rates) to deploy gates in canaries.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
