Point-in-Time Feature Store Playbook
Preventing Training-Serving Skew in Real Systems
Date: 2026-02-27
Category: software / ml-systems
Use case fit: Quant signals, ranking/recommendation, risk scoring, anomaly detection
1) Why this matters
Most model failures in production are not model-architecture failures. They are data-timing failures:
- training rows saw information that would not have existed at prediction time (leakage)
- online features are computed with a different definition than offline features
- late-arriving events silently change historical feature values
- backfills rewrite “truth” and make evaluation look better than reality
If your model says “90% confidence” but your feature pipeline is time-inconsistent, you have expensive fiction.
2) Core principle: “As-of” truth
Every feature value must answer one strict question:
“What would this value have been as of prediction timestamp T, given only data available by T?”
This implies two time axes per event:
- event_time: when the real-world event happened
- ingest_time (or publish_time): when your system could first see/use it
For point-in-time correctness, feature retrieval for label timestamp T must use:
- event_time <= T
- ingest_time <= T (or an allowed cutoff window)
Using only event_time is a common leak in delayed-data systems.
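A minimal sketch of the two-axis visibility rule (function name and timestamps are illustrative, not from any library):

```python
from datetime import datetime

def visible_as_of(event_time, ingest_time, T):
    """A row is usable for a prediction at time T only if it both
    happened by T and was ingested by T."""
    return event_time <= T and ingest_time <= T

T = datetime(2026, 1, 20)
# Happened before T but arrived after T: an event_time-only filter leaks it.
event_time = datetime(2026, 1, 18)
ingest_time = datetime(2026, 1, 22)

assert event_time <= T                                # passes the naive check
assert not visible_as_of(event_time, ingest_time, T)  # correctly excluded
```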
3) Minimum data contract for a feature store
For each feature table/entity key, store at least:
- entity_id
- feature_name (or column)
- feature_value
- event_time
- ingest_time
- feature_version (logic version)
- source_version (schema/source lineage)
Without feature_version, you cannot explain drift caused by code changes.
Without ingest_time, you cannot defend against delayed-data leakage.
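The contract above can be pinned down as a record type (a sketch; field names follow this playbook, not any particular feature store product):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class FeatureRow:
    entity_id: str
    feature_name: str
    feature_value: float
    event_time: datetime    # when the real-world event happened
    ingest_time: datetime   # when the system could first see/use it
    feature_version: str    # logic version (code that computed it)
    source_version: str     # schema/source lineage

row = FeatureRow("u1", "spend_7d", 42.0,
                 datetime(2026, 1, 1), datetime(2026, 1, 2),
                 "v3", "src-v1")
```

Keeping the row frozen (immutable) matches the backfill policy in section 8: corrections become new versioned rows, not in-place edits.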
4) Offline training set construction (the safe pattern)
Given labels (entity_id, label_time, y):
- Build a spine of prediction events (entity_id, label_time)
- For each feature group, do an as-of join against history
- Pick the latest row where:
  - same entity_id
  - event_time <= label_time
  - ingest_time <= label_time
- Apply deterministic fallback (null/default/last known within TTL)
- Persist the materialized training set with:
- feature logic commit hash
- data snapshot id
- generation timestamp
SQL sketch (warehouse style)
WITH candidates AS (
SELECT
s.entity_id,
s.label_time,
f.feature_value,
f.event_time,
f.ingest_time,
ROW_NUMBER() OVER (
PARTITION BY s.entity_id, s.label_time
ORDER BY f.event_time DESC, f.ingest_time DESC
) AS rn
FROM spine s
LEFT JOIN feature_history f  -- LEFT JOIN keeps spine rows with no visible feature (fallback applies)
ON f.entity_id = s.entity_id
AND f.event_time <= s.label_time
AND f.ingest_time <= s.label_time
)
SELECT *
FROM candidates
WHERE rn = 1;
Do not “simplify” to a latest snapshot join. That usually leaks future state.
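The same as-of join can be sketched in pandas on toy data (column names follow the contract in section 3). The late-arriving row shows why the ingest_time predicate matters:

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": [1, 1],
    "label_time": pd.to_datetime(["2026-01-10", "2026-01-20"]),
})
feature_history = pd.DataFrame({
    "entity_id": [1, 1, 1],
    "feature_value": [10.0, 12.0, 99.0],
    "event_time": pd.to_datetime(["2026-01-05", "2026-01-15", "2026-01-18"]),
    # The 2026-01-18 event arrived late: ingested after the 01-20 label.
    "ingest_time": pd.to_datetime(["2026-01-06", "2026-01-16", "2026-01-22"]),
})

# Candidate rows: both time predicates against each label_time
cand = spine.merge(feature_history, on="entity_id")
cand = cand[(cand["event_time"] <= cand["label_time"])
            & (cand["ingest_time"] <= cand["label_time"])]

# Latest visible row per (entity_id, label_time)
cand = cand.sort_values(["event_time", "ingest_time"])
training_set = cand.groupby(["entity_id", "label_time"], as_index=False).last()
```

For the 2026-01-20 label this yields 12.0, not 99.0: the fresher value existed in the real world but was not ingestable yet. An event_time-only join would have leaked it.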
5) Online serving parity
Training-serving parity is impossible if offline and online features follow different logic paths.
Preferred architecture
- One feature definition spec (declarative)
- Two executors:
- offline materializer
- online low-latency serving path
- Shared test vectors for both paths
Practical guardrails
- Keep online transforms minimal and deterministic
- Push heavy logic upstream (stream/batch precompute)
- Enforce feature TTLs (stale features should fail closed or degrade explicitly)
- Log served feature values + versions for replay/audit
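A sketch of the shared-test-vector guardrail: one feature definition, exercised by the same fixtures in both the offline materializer's CI and the online service's CI (function name and vectors are hypothetical):

```python
def rolling_spend_7d(amounts):
    """Single feature definition: sum of the last 7 daily amounts.
    This is the only place the logic lives; both executors call it."""
    return float(sum(amounts[-7:]))

# Shared fixtures, including the empty/cold-start edge case,
# asserted against both the offline and online code paths.
TEST_VECTORS = [
    ([1.0] * 10, 7.0),    # window truncates older days
    ([], 0.0),            # cold entity: explicit default
    ([5.0, 5.0], 10.0),   # shorter history than window
]

for inputs, expected in TEST_VECTORS:
    assert rolling_spend_7d(inputs) == expected
```

The design point is that "mathematically equivalent" reimplementations (see the anti-patterns in section 11) are replaced by a single definition plus fixtures that both paths must pass.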
6) Training-serving skew taxonomy
A. Definition skew
Different formulas in offline vs online code.
Signal: distribution shift immediately after deploy, even with stable traffic.
B. Time skew
Online uses fresher (or staler) data than what training assumed.
Signal: strong hourly/sessional error patterns.
C. Null/default skew
Offline imputed mean; online uses zero/null; or opposite.
Signal: error spikes on cold entities / sparse cohorts.
D. Entity-resolution skew
Different key mapping logic (user merge, symbol rename, account hierarchy).
Signal: cohort-specific degradation with high join-miss rates.
E. Version skew
Model expects feature v3; service emits v2 after partial rollback.
Signal: sudden quality cliff with no model change.
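Null/default skew (item C) is easy to reproduce in miniature. A sketch, assuming offline mean-imputation versus an online zero default:

```python
import statistics

train_values = [10.0, 12.0, None, 14.0]

# Offline: impute missing values with the training mean
mean = statistics.mean(v for v in train_values if v is not None)
offline = [v if v is not None else mean for v in train_values]

# Online: a different default (0.0) for the same missing value
online = [v if v is not None else 0.0 for v in train_values]

# Same entity, same moment, different feature value at serve time.
assert offline[2] == 12.0 and online[2] == 0.0
```

The error concentrates exactly where the signal predicted: cold entities and sparse cohorts, where the default path fires most often.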
7) Monitoring that actually catches skew
Track this per feature and per important cohort:
- Freshness lag: now - feature_event_time
- Availability rate: non-null ratio in serving
- Join hit rate (offline build + online requests)
- Population Stability Index (PSI) or JS divergence between train and serve
- Definition parity checks on sampled requests (recompute offline)
- Version mismatch count (model_feature_spec_version != served_feature_version)
Set page-worthy alerts for:
- high-impact feature missing rate
- version mismatches
- sudden freshness lag jump
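A minimal PSI implementation for the train-vs-serve divergence check, with bins fixed on the training sample (clipping constant and bin count are conventional choices, not from a standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) sample
    and a serving (actual) sample, binned on training quantiles."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range serving values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e = np.clip(e, 1e-6, None)              # avoid log(0) on empty bins
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

Commonly used rules of thumb treat PSI below about 0.1 as stable and above about 0.2 as worth investigating; tune thresholds per feature and cohort rather than globally.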
8) Backfill and replay policy (often ignored)
Backfills are where many teams quietly corrupt evaluation.
Rules:
- Backfill writes must be versioned, never silent overwrite
- Keep old and new derivations side-by-side during validation
- Recompute benchmark windows before promoting backfill output
- Tag experiments with feature snapshot ids
If you cannot reproduce the exact features used by a model last month, you do not have governance.
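The "versioned, never silent overwrite" rule can be enforced mechanically. A toy sketch (store layout and names are hypothetical):

```python
# Versioned store keyed by (feature, derivation_version)
store = {
    ("f_spend", "v1"): {"2026-01": 100.0},
}

def backfill(store, feature, new_version, values):
    """Write a backfill as a new version; refuse to overwrite history."""
    key = (feature, new_version)
    assert key not in store, "backfills must not silently overwrite"
    store[key] = values

backfill(store, "f_spend", "v2", {"2026-01": 98.0})

# Old and new derivations stay side-by-side during validation;
# promotion is a pointer flip after benchmark windows are recomputed.
assert ("f_spend", "v1") in store and ("f_spend", "v2") in store
```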
9) Quant-specific notes (execution/risk models)
For trading/execution contexts:
- distinguish exchange timestamp, gateway receive timestamp, strategy decision timestamp
- features near market events (auction, open/close, halts) need stricter ingest-time constraints
- avoid retrospective data vendor “clean-up” values in training unless those values were truly available live
- maintain “live-available” and “revised” datasets separately
A model trained on revised bars/ticks can look brilliant and fail at the open.
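The three-timestamp distinction in practice, as a sketch (values are hypothetical): the exchange timestamp alone would make a tick look usable that the strategy could not actually have seen by decision time.

```python
from datetime import datetime

# Three distinct timestamps for one tick
exchange_ts = datetime(2026, 2, 2, 9, 30, 0, 100_000)  # matched at the venue
gateway_ts  = datetime(2026, 2, 2, 9, 30, 0, 350_000)  # received by our gateway
decision_ts = datetime(2026, 2, 2, 9, 30, 0, 200_000)  # strategy decision time

# Live, the strategy can only use ticks received by decision time.
usable_live = gateway_ts <= decision_ts

assert exchange_ts <= decision_ts  # exchange_ts alone would wrongly include it
assert not usable_live             # gateway_ts correctly excludes it
```

The 250-microsecond exchange-to-gateway gap here is invented, but near opens, closes, auctions, and halts, gaps of this kind are exactly where event-time-only features go wrong.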
10) Rollout checklist
Before shipping a new feature or model:
- point-in-time unit tests include late-arrival scenarios
- offline/online parity tests pass on shared fixtures
- feature/version lineage visible in prediction logs
- null/default behavior is explicitly documented
- skew monitors and alerts are active
- rollback plan includes feature version pinning
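The first checklist item as a concrete unit test, in the style of the earlier sketches (lookup function and fixture names are hypothetical):

```python
from datetime import datetime

def latest_visible(history, T):
    """Point-in-time lookup: latest value whose event_time and
    ingest_time are both at or before T."""
    visible = [h for h in history
               if h["event_time"] <= T and h["ingest_time"] <= T]
    visible.sort(key=lambda h: (h["event_time"], h["ingest_time"]))
    return visible[-1]["value"] if visible else None

def test_late_arrival_is_excluded():
    history = [
        {"value": 1.0, "event_time": datetime(2026, 1, 1),
         "ingest_time": datetime(2026, 1, 2)},
        # Happened before T but ingested after T
        {"value": 9.0, "event_time": datetime(2026, 1, 5),
         "ingest_time": datetime(2026, 1, 9)},
    ]
    assert latest_visible(history, datetime(2026, 1, 6)) == 1.0

test_late_arrival_is_excluded()
```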
11) Anti-patterns to avoid
- “Latest snapshot join is close enough.”
- “We’ll add ingest_time later.”
- “Backfill replaced history, but metrics improved, so it’s fine.”
- “Online code path is different but mathematically equivalent.” (Usually false in edge cases.)
12) Bottom line
A feature store is not mainly a storage product. It is a time-consistency and reproducibility system.
If you protect point-in-time correctness and definition parity, model quality becomes a real signal. If you don’t, your AUC/Sharpe/precision can be mostly a pipeline artifact.
Suggested references
- Feast architecture/docs (offline-online feature consistency patterns)
- Uber Michelangelo feature platform papers/talks
- Tecton point-in-time correctness materials
- “Leakage in Data Mining” literature (classic framing)
- Industry postmortems on training-serving skew incidents