Stop Blaming the Model: Build a Feature Store That Doesn’t Lie in Prod
Your model is fine. Your features aren’t. Here’s the feature store architecture and guardrails we deploy when teams keep getting burned by drift, hallucinations, and latency spikes.
Boring prod is good prod. Your feature store is either making that possible—or making your pager go off.
The incident we walked into last quarter
A Series D fintech called us after their fraud declines spiked 18% overnight. Same model, new feature set. Offline tests looked perfect. In prod, p99 latency for feature lookups popped from 45ms to 310ms, cache hit rate cratered, and the last_30d_txn_count feature started showing up as 0 for loyal customers.
Root cause: classic offline/online skew. The offline pipeline (Spark on Delta Lake) used a fillna(0) on missing counts; the online feature service (Redis-backed) did not. To make it worse, the TTL on Redis was 24h, but the underlying Kafka topic had late events and out-of-order updates. Features went stale; the model got confidently wrong; declines shot up.
We didn’t retrain. We rebuilt the feature store path: unified transformations, point-in-time joins, stricter TTLs, and hard guardrails. Declines normalized in 48 hours; p95 lookup latency back under 80ms; incident closed without touching model weights.
If your model behaves differently in prod than in your notebook, 9 times out of 10 it’s your feature store lying to you.
What a production feature store actually is
Skip the vendor bingo for a second. A usable production feature store has these parts:
- Offline store: your truth for historical features (e.g., Delta Lake on S3/ADLS, BigQuery, Snowflake). Supports backfills, point-in-time correctness, and lineage.
- Online store: a low-latency KV for serving (e.g., Redis, DynamoDB, Cassandra, Bigtable). Handles upserts, TTL, and high QPS with predictable tail latency.
- Registry: versioned metadata: feature definitions, sources, ownership, SLAs.
- Transformation layer: batch (Spark) and streaming (Flink/Kafka Streams) jobs that produce identical features for offline and online.
- Materialization: jobs that keep the online store warm with fresh features.
- Access API: a consistent SDK/HTTP surface for models and services.
Tools that make this sane:
- OSS: Feast (+ Spark/Flink), Kafka, Redis, Delta Lake.
- Managed: Tecton, Databricks Feature Store, SageMaker Feature Store, Vertex AI Feature Store.
Here’s what a minimal Feast setup looks like:
```yaml
# feature_store.yaml
project: risk_scoring
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: redis:6379,db=1
offline_store:
  type: file
  path: s3://my-bucket/feature_repo
```

```python
# features.py
from datetime import timedelta

from feast import FeatureView, Field, FileSource
from feast.types import Float32, Int64

source = FileSource(
    path="s3://my-bucket/transactions.parquet",
    timestamp_field="event_ts",
)

transactions_fv = FeatureView(
    name="tx_30d",
    entities=["customer_id"],
    ttl=timedelta(hours=12),
    schema=[
        Field(name="txn_count_30d", dtype=Int64),
        Field(name="avg_ticket_30d", dtype=Float32),
    ],
    online=True,
    source=source,
)
```

The important part isn't the YAML—it's enforcing that the same transformation code and time semantics produce features for both offline evaluation and online serving.
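One way to enforce that parity is to define each feature's computation, null handling included, exactly once and import it from both the batch job and the online service. A minimal sketch, assuming illustrative function names (the `txn_count_30d` default of 0 mirrors the `fillna(0)` divergence from the incident above):

```python
# transforms.py: single source of truth for feature logic.
# Both the batch backfill and the online service import from here,
# so null handling can never diverge between the two paths.

def txn_count_30d(raw_count):
    """30-day transaction count; missing data is explicitly 0."""
    return int(raw_count) if raw_count is not None else 0

def avg_ticket_30d(total_amount, txn_count):
    """Average ticket size; undefined (None) when there are no transactions."""
    if not txn_count:
        return None
    return float(total_amount) / txn_count

def build_features(raw: dict) -> dict:
    """One entry point used by offline backfills and online serving alike."""
    count = txn_count_30d(raw.get("txn_count_30d"))
    return {
        "txn_count_30d": count,
        "avg_ticket_30d": avg_ticket_30d(raw.get("total_amount_30d", 0.0), count),
    }
```

With this in place, a dueling-`fillna` bug becomes structurally impossible: there is only one place where the default lives, and one set of tests covering it.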
Design the data path for parity and speed
You want to optimize for two things that fight each other: parity (no training/serving skew) and speed (sub-100ms lookup at p95). What actually works:
Single source of transformation truth
- Put aggregations in one repo, one language. If you're using Spark/Flink for batch/streaming, generate both from shared libs. Feast's on-demand features or Tecton's transforms help.
- Kill dueling fillna behavior. Define null handling in the feature definition, with tests.
Point-in-time correctness
- No peeking. Use built-in point-in-time joins (e.g., Feast's get_historical_features) so training sees only data available at time t.
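Under the hood, a point-in-time join pairs each training label with the latest feature value whose timestamp is at or before the label's timestamp. A stdlib-only sketch of that semantics (Feast's get_historical_features does this at scale against the offline store):

```python
from bisect import bisect_right

def point_in_time_value(history, label_ts):
    """history: list of (event_ts, value) tuples sorted by event_ts.
    Returns the latest value with event_ts <= label_ts, or None.
    Values after label_ts are invisible: no peeking into the future."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx else None

# txn_count_30d for one customer over time; a label at t=250 must see 5, not 9.
history = [(100, 3), (200, 5), (300, 9)]
```

The failure mode this prevents is leakage: a training row at t=250 joined against the t=300 value looks great offline and falls apart in prod.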
Materialize predictably
- Stream updates from Kafka/Flink to the online store for hot features; batch backfill via Airflow/Argo for daily aggregates.
- Use idempotent upserts. Deduplicate on (entity_id, event_ts).
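Idempotent here means replaying the same event, or receiving one out of order, can never move a feature backwards in time. A minimal in-memory sketch of the upsert rule, keyed on (entity_id, event_ts):

```python
def upsert(store, entity_id, event_ts, value):
    """Write only if the event is at least as new as what's stored.
    Replays and out-of-order duplicates become no-ops, so the job
    can be rerun safely after a failure."""
    current = store.get(entity_id)
    if current is None or event_ts >= current[0]:
        store[entity_id] = (event_ts, value)
    return store
```

This is exactly the guard the incident above was missing: late Kafka events overwrote fresher values, and the online store quietly went stale.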
Set realistic TTLs and freshness SLAs
- Align TTL with your business logic. If a 30-day aggregate changes hourly, TTL of 12h is wrong. Track freshness lag explicitly.
Budget your latency
- End-to-end SLO: 200ms p95? Save 80ms for feature fetches, 80ms for model inference, 40ms for network/overhead.
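A simple way to hold that budget is a deadline object threaded through the request: each stage gets at most its share of whatever time remains, so a slow feature fetch fails fast instead of eating the model's slice. A sketch under the 200ms example above (the Deadline class is illustrative, not a library API):

```python
import time

class Deadline:
    """Tracks remaining time for one request against an end-to-end budget."""

    def __init__(self, budget_s: float):
        self.expires = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires - time.monotonic())

    def slice(self, cap_s: float) -> float:
        """Give a stage at most cap_s, never more than what's left."""
        return min(cap_s, self.remaining())

# 200ms end-to-end budget: 80ms features, 80ms inference, rest for overhead.
deadline = Deadline(0.200)
feature_timeout = deadline.slice(0.080)  # pass as the Redis/HTTP client timeout
```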
Example materialization run:
```bash
# Backfill last 7 days into Redis with Feast
feast materialize 2025-09-27T00:00:00 2025-10-04T00:00:00

# Then keep it warm with streaming (Flink job) or periodic refresh
feast materialize-incremental 2025-10-04T00:00:00
```

For RAG systems, treat embeddings as features: version your encoder, track the corpus snapshot ID, and set a freshness SLA for the vector index (e.g., pgvector, Weaviate). Stale embeddings are just drift with better branding.
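Checking that SLA can be as simple as comparing the index's last refresh timestamp and pinned versions against what the serving path expects. A sketch, assuming illustrative metadata field names rather than any specific vector DB's API:

```python
import time

def index_is_fresh(meta, pinned_encoder, pinned_corpus, max_lag_s=3600):
    """Stale if the encoder or corpus snapshot drifted from what the
    model was pinned to, or the refresh lag blew the freshness SLA."""
    if meta["encoder_version"] != pinned_encoder:
        return False
    if meta["corpus_snapshot_id"] != pinned_corpus:
        return False
    return (time.time() - meta["last_refresh_ts"]) <= max_lag_s

# Metadata you'd store alongside the vector index (illustrative values).
index_meta = {
    "encoder_version": "encoder-v2",
    "corpus_snapshot_id": "corpus-2025-10-01",
    "last_refresh_ts": time.time() - 600,  # refreshed 10 minutes ago
}
```

Run this on a schedule and page when it returns False; an encoder bump without a re-embed is the RAG equivalent of dueling fillna.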
Instrumentation and observability that save your weekend
Feature stores aren’t “data infra.” They’re user-facing prod services. Instrument like one.
Track these at minimum:
- Latency: histogram for feature lookup p50/p95/p99.
- Freshness lag: seconds since last successful materialization per feature.
- Null/NaN rate: by feature, by entity, over time.
- Skew checks: offline vs. online distribution deltas.
- Cache hit rate: Redis hits/misses.
- Error budget burn: tie SLO to paging policy.
Quick-and-dirty Python example for a FastAPI feature service:
```python
# app.py
import redis
from fastapi import FastAPI
from prometheus_client import Counter, Gauge, Histogram, generate_latest
from starlette.responses import Response
from opentelemetry import trace

app = FastAPI()
redis_client = redis.Redis(host="redis", port=6379, db=1, decode_responses=True)

H_FEATURE_LATENCY = Histogram("feature_lookup_seconds", "Feature fetch latency")
C_FEATURE_ERRORS = Counter("feature_lookup_errors_total", "Feature fetch errors")
G_FRESHNESS_LAG = Gauge("feature_freshness_seconds", "Seconds since last update", ["feature"])
G_NULL_RATE = Gauge("feature_null_ratio", "Null ratio", ["feature"])

tracer = trace.get_tracer(__name__)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type="text/plain")

@app.get("/features/{customer_id}")
def get_features(customer_id: str):
    with H_FEATURE_LATENCY.time():
        with tracer.start_as_current_span("feature_lookup") as span:
            try:
                feats = redis_client.hgetall(f"cust:{customer_id}")
                span.set_attribute("feature.count", len(feats))
                if not feats:
                    G_NULL_RATE.labels("txn_count_30d").set(1.0)
                    raise KeyError("no features")
                # Update gauges (in real code, compute per-feature)
                G_NULL_RATE.labels("txn_count_30d").set(0.0)
                return feats
            except Exception:
                C_FEATURE_ERRORS.inc()
                raise
```

Drift monitoring: run a scheduled job comparing recent online distributions to offline training baselines. EvidentlyAI, Arize, Fiddler, WhyLabs—take your pick. Example with Evidently:
```python
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric

report = Report(metrics=[ColumnDriftMetric(column_name="txn_count_30d")])
report.run(reference_data=offline_df, current_data=online_sample_df)

res = report.as_dict()
if res["metrics"][0]["result"]["drift_detected"]:
    # alert + auto reduce traffic via rollout controller
    print("Drift detected for txn_count_30d")
```

Wire alerts in Prometheus/Grafana for:
- feature_lookup_seconds_bucket{le="0.1"} under your target percentile
- feature_freshness_seconds{feature="*"} > SLA
- Sudden jumps in feature_null_ratio
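The FastAPI service above hardcodes a single null-rate gauge; in real code you'd compute the ratio per feature over a rolling sample of lookups and set one gauge label per feature. A stdlib sketch of that computation:

```python
def null_ratios(samples):
    """samples: list of feature dicts from recent lookups.
    Returns per-feature null ratio; a missing key counts as null,
    so a silently dropped column shows up as 100% nulls, not zero."""
    if not samples:
        return {}
    features = set().union(*(s.keys() for s in samples))
    return {
        f: sum(1 for s in samples if s.get(f) is None) / len(samples)
        for f in features
    }
```

Feed the result into the `feature_null_ratio` gauge on a timer, and the "last_30d_txn_count suddenly reads 0 for loyal customers" class of incident pages you before the decline rate does.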
Guardrails for hallucinations, drift, and latency spikes
You won’t stop every incident, but you can make them boring.
- Timeouts and retries with budgets: Don’t let feature lookups eat your latency SLO. Timebox requests and fail fast.
- Circuit breakers: Eject bad upstreams and shed load.
- Fallbacks: Serve a simpler model or cached score if features are missing.
- Content safety for LLMs: Ground outputs with retrieval, add moderation/guardrails, and log prompt/response pairs for audit.
Istio example: sane timeouts/retries and outlier detection around your feature API.
```yaml
# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: feature-svc
spec:
  hosts: ["feature-svc"]
  http:
    - route:
        - destination:
            host: feature-svc
      timeout: 250ms
      retries:
        attempts: 2
        perTryTimeout: 100ms
        retryOn: 5xx,gateway-error,connect-failure,refused-stream
```

```yaml
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: feature-svc
spec:
  host: feature-svc
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
```

Fallback in code when features are late or missing:
```python
def missing_rate(feats: dict) -> float:
    """Fraction of expected features that came back null or absent."""
    expected = ["txn_count_30d", "avg_ticket_30d"]
    return sum(1 for f in expected if feats.get(f) is None) / len(expected)

def score(customer_id: str):
    try:
        feats = get_features(customer_id)
        if missing_rate(feats) > 0.3:
            raise ValueError("too many missing features")
        return model.predict(feats)
    except Exception:
        # Safe fallback: cached score or rules-based baseline
        return rules_engine_score(customer_id)
```
- Retrieve top-k docs with RAG and pass citations; reject answers with low retrieval score.
- Use NeMo Guardrails/Guardrails AI/Llama Guard for safety and PII redaction.
- Add an automatic evaluator (e.g., ragas) on a sample of traffic; page on drop in groundedness.
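The rejection rule in the first bullet is worth making explicit in code: if the best retrieved document scores below a threshold, return a refusal instead of letting the model answer ungrounded. A sketch, with an illustrative threshold and response shape:

```python
MIN_RETRIEVAL_SCORE = 0.55  # illustrative; tune against labeled traffic

def answer_or_refuse(question, retrieved, generate):
    """retrieved: list of (doc_id, score) from the vector index.
    Refuse rather than hallucinate when grounding is weak; otherwise
    generate only from docs above threshold and return them as citations."""
    if not retrieved or max(score for _, score in retrieved) < MIN_RETRIEVAL_SCORE:
        return {"answer": None, "reason": "insufficient grounding", "citations": []}
    top = [doc_id for doc_id, score in retrieved if score >= MIN_RETRIEVAL_SCORE]
    return {"answer": generate(question, top), "citations": top}
```

The refusal path should feed the same error-budget metrics as feature-fetch failures: a spike in refusals usually means a stale index, not a chatty model.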
Deployment patterns that don’t blow up SLOs
If you flip straight to 100% traffic, you’re asking for a retro.
- Shadow traffic: New feature pipeline reads real requests but doesn’t affect responses. Compare distributions and latency off-path.
- Canary deployments: 1% -> 5% -> 25% with automated analysis.
- Feature flags: Use LaunchDarkly/Unleash to gate new features by cohort.
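Cohort gating doesn't require a vendor to prototype: a stable hash of the customer ID into percentage buckets gives deterministic, gradually widenable rollout (the same idea behind gradual-rollout strategies in tools like Unleash; this helper is illustrative):

```python
import hashlib

def in_rollout(customer_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a customer into 0-99.
    Stable across requests and hosts, and cohorts are nested:
    everyone in the 5% cohort stays in when you widen to 25%."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Widen from 5% to 25% by changing only `percent`; no re-bucketing.
use_new_pipeline = in_rollout("cust-42", "tx_30d_v2", percent=5)
```

Hashing on feature name plus customer ID means different features roll out to different cohorts, so one bad pipeline doesn't always hit the same unlucky customers.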
Argo Rollouts example with a Prometheus analysis on p95 latency:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: feature-svc
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: p95-latency
        - setWeight: 25
        - pause: {duration: 600}
        - setWeight: 50
      trafficRouting:
        istio:
          virtualService:
            name: feature-svc
  template:
    spec:
      containers:
        - name: feature-svc
          image: ghcr.io/acme/feature-svc:1.8.3
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-latency
spec:
  metrics:
    - name: feature_p95
      interval: 1m
      failureLimit: 1
      successCondition: result[0] <= 0.12
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95, sum(rate(feature_lookup_seconds_bucket[5m])) by (le))
```

If p95 goes above 120ms, the rollout pauses or rolls back automatically. No heroics.
Operating the store: contracts, validation, and hygiene
What bites teams isn’t fancy ML—it’s boring data hygiene.
- Data contracts & schema registry: Define Avro/Protobuf schemas; register them in Confluent Schema Registry. No breaking changes without a migration plan.
- Validation: Use Great Expectations/Deequ on both offline and streaming paths.
- Governance: PII tagging, RBAC, audit logs. Encrypt at rest and in transit. Treat feature stores like prod DBs.
- Backfills & replays: Tested, idempotent, and throttled. Recompute under a new feature version.
Example Great Expectations test snippet:
```python
import great_expectations as ge

df = ge.from_pandas(online_sample_df)
df.expect_column_values_to_not_be_null("txn_count_30d")
df.expect_column_values_to_be_between("avg_ticket_30d", min_value=0, max_value=10000)

res = df.validate()
assert res.success, "Feature validation failed"
```
- Pin model artifacts in MLflow with the exact feature registry version and corpus snapshot (for RAG).
- Store a "data build ID" on each request so you can replay prod behavior in a notebook.
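A "data build ID" can be as small as a serialized tuple of the versions that produced a prediction, stamped on every response. A sketch with illustrative field names:

```python
import json

def data_build_id(registry_version, model_version, corpus_snapshot=None):
    """Everything needed to replay this prediction in a notebook later:
    which feature definitions, which model, and (for RAG) which corpus."""
    parts = {"registry": registry_version, "model": model_version}
    if corpus_snapshot:
        parts["corpus"] = corpus_snapshot
    return json.dumps(parts, sort_keys=True)

def respond(score, registry_version, model_version):
    """Attach the build ID to the scoring response for later replay."""
    return {
        "score": score,
        "data_build_id": data_build_id(registry_version, model_version),
    }
```

Log the ID with every request; when an incident hits, you can reconstruct the exact offline environment instead of arguing about which pipeline version was live.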
What I’d do again tomorrow
I’ve seen teams throw months at retraining when their real problem was a leaky feature pipe. The playbook that actually works:
- Start with clear SLOs for features and end-to-end latency.
- Pick a feature store with a credible offline/online story and don’t DIY the hard parts unless you must.
- Unify transforms; enforce point-in-time correctness; set TTLs based on business, not vibes.
- Instrument like it matters. Because it does.
- Add guardrails so incidents degrade gracefully instead of paging everyone.
- Deploy with canaries and shadow. Let metrics decide.
Do this, and your “mysterious model regressions” turn into routine ops work. That’s the point. Boring prod is good prod.
Key takeaways
- Design for offline/online parity from day one: same transformations, same time semantics, same versioned data.
- Instrument feature pipelines like any critical service: latency, freshness, null rates, and drift on dashboards with alerts tied to SLOs.
- Guardrails matter: circuit breakers, rate limits, fallbacks, and safety checks for generative outputs reduce fire drills.
- Deploy with canaries and shadow traffic; measure p95/p99 of feature lookups, not just model inference.
- Automate validation with data contracts, schema registry, and pre-prod checks for feature quality and leakage.
Implementation checklist
- Define AI SLOs: feature lookup p95, end-to-end latency budget, freshness max lag, error budget.
- Choose a feature store with clear offline/online story (e.g., Feast + Redis + Delta/S3 or Tecton/Databricks FS).
- Unify transformations via a single code path; enforce point-in-time correctness and TTLs.
- Instrument feature services with OpenTelemetry traces and Prometheus metrics (null rates, freshness lag, cache hit).
- Add guardrails: Istio/Envoy timeouts, retries with jitter, circuit breakers, and safe fallbacks.
- Deploy via canary + shadow using Argo Rollouts; auto-rollback on SLO regression.
- Continuously monitor drift (Evidently/WhyLabs/Arize) and validate data with Great Expectations/Deequ.
- Lock in governance: schema registry, data contracts, PII handling, RBAC, audit logs.
Questions we hear from teams
- Do I need a feature store if we’re only serving an LLM?
- If you’re doing RAG or re-ranking, you already have features: embeddings, recency, click signals. Version your encoder and corpus, set freshness SLAs for the index, and track groundedness. A lightweight store (Feast + Redis) works; managed options reduce toil.
- Feast vs. Tecton vs. Databricks Feature Store—what should I pick?
- If you have strong platform engineering and want OSS control, Feast is solid. If you want enterprise support, lineage, and governance out of the box, Tecton is strong. If you’re all-in on Databricks (Delta/Unity Catalog), their Feature Store keeps everything in one plane. The real win is standardizing transforms and enforcing parity—tooling is second.
- What SLOs should we set for features?
- Common targets: p95 feature lookup under 100–150ms, freshness lag under 2–5 minutes for streaming features, null rate under 1% per feature, and error budget aligned with your customer SLO. Track cache hit rate and distribution skew alerts.
- How do we avoid training/serving skew?
- Single code path for transforms, point-in-time joins for training, versioned feature definitions, and pre-prod validation comparing offline vs. shadow online samples. Block deploys if skew exceeds thresholds.
- How do we deploy safely without stalling the roadmap?
- Shadow traffic + canary with Argo Rollouts. Automate rollback based on Prometheus metrics (p95/p99, freshness, null rate). Feature flag new features to cohorts. This adds hours, not weeks, and saves you days of incident response.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
