Point-in-Time Feature Store Playbook
Preventing Training-Serving Skew in Real Systems
Date: 2026-02-27
Category: software / ml-systems
Use case fit: Quant signals, ranking/recommendation, risk scoring, anomaly detection
1) Why this matters
Most model failures in production are not model-architecture failures. They are data-timing failures:
- training rows saw information that would not have existed at prediction time (leakage)
- online features are computed with a different definition than offline features
- late-arriving events silently change historical feature values
- backfills rewrite “truth” and make evaluation look better than reality
If your model says “90% confidence” but your feature pipeline is time-inconsistent, you have expensive fiction.
2) Core principle: “As-of” truth
Every feature value must answer one strict question:
“What would this value have been as of prediction timestamp T, given only data available by T?”
This implies two time axes per event:
- event_time: when the real-world event happened
- ingest_time (or publish_time): when your system could first see/use it
For point-in-time correctness, feature retrieval for label timestamp T must use:
- event_time <= T
- ingest_time <= T (or an allowed cutoff window)
Using only event_time is a common leak in delayed-data systems.
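A minimal sketch of the two-axis visibility rule (function name and timestamps are illustrative, not from any library):

```python
from datetime import datetime

def visible_as_of(event_time, ingest_time, T):
    """A row is usable for a prediction at time T only if it both
    happened by T and was ingested by T."""
    return event_time <= T and ingest_time <= T

T = datetime(2026, 1, 20)
# Happened before T but arrived after T: an event_time-only filter leaks it.
event_time = datetime(2026, 1, 18)
ingest_time = datetime(2026, 1, 22)

assert event_time <= T                                # passes the naive check
assert not visible_as_of(event_time, ingest_time, T)  # correctly excluded
```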
3) Minimum data contract for a feature store
For each feature table/entity key, store at least:
- entity_id
- feature_name (or column)
- feature_value
- event_time
- ingest_time
- feature_version (logic version)
- source_version (schema/source lineage)
Without feature_version, you cannot explain drift caused by code changes.
Without ingest_time, you cannot defend against delayed-data leakage.
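The contract above can be pinned down as a record type (a sketch; field names follow this playbook, not any particular feature store product):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class FeatureRow:
    entity_id: str
    feature_name: str
    feature_value: float
    event_time: datetime    # when the real-world event happened
    ingest_time: datetime   # when the system could first see/use it
    feature_version: str    # logic version (code that computed it)
    source_version: str     # schema/source lineage

row = FeatureRow("u1", "spend_7d", 42.0,
                 datetime(2026, 1, 1), datetime(2026, 1, 2),
                 "v3", "src-v1")
```

Keeping the row frozen (immutable) matches the backfill policy in section 8: corrections become new versioned rows, not in-place edits.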
4) Offline training set construction (the safe pattern)
Given labels (entity_id, label_time, y):
- Build a spine of prediction events (entity_id, label_time)
- For each feature group, do an as-of join against history
- Pick the latest row where:
  - same entity_id
  - event_time <= label_time
  - ingest_time <= label_time
- Apply deterministic fallback (null/default/last known within TTL)
- Persist the materialized training set with:
- feature logic commit hash
- data snapshot id
- generation timestamp
SQL sketch (warehouse style)
WITH candidates AS (
SELECT
s.entity_id,
s.label_time,
f.feature_value,
f.event_time,
f.ingest_time,
ROW_NUMBER() OVER (
PARTITION BY s.entity_id, s.label_time
ORDER BY f.event_time DESC, f.ingest_time DESC
) AS rn
FROM spine s
LEFT JOIN feature_history f  -- LEFT JOIN keeps spine rows with no visible feature (fallback applies)
ON f.entity_id = s.entity_id
AND f.event_time <= s.label_time
AND f.ingest_time <= s.label_time
)
SELECT *
FROM candidates
WHERE rn = 1;
Do not “simplify” to a latest snapshot join. That usually leaks future state.
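The same as-of join can be sketched in pandas on toy data (column names follow the contract in section 3). The late-arriving row shows why the ingest_time predicate matters:

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": [1, 1],
    "label_time": pd.to_datetime(["2026-01-10", "2026-01-20"]),
})
feature_history = pd.DataFrame({
    "entity_id": [1, 1, 1],
    "feature_value": [10.0, 12.0, 99.0],
    "event_time": pd.to_datetime(["2026-01-05", "2026-01-15", "2026-01-18"]),
    # The 2026-01-18 event arrived late: ingested after the 01-20 label.
    "ingest_time": pd.to_datetime(["2026-01-06", "2026-01-16", "2026-01-22"]),
})

# Candidate rows: both time predicates against each label_time
cand = spine.merge(feature_history, on="entity_id")
cand = cand[(cand["event_time"] <= cand["label_time"])
            & (cand["ingest_time"] <= cand["label_time"])]

# Latest visible row per (entity_id, label_time)
cand = cand.sort_values(["event_time", "ingest_time"])
training_set = cand.groupby(["entity_id", "label_time"], as_index=False).last()
```

For the 2026-01-20 label this yields 12.0, not 99.0: the fresher value existed in the real world but was not ingestable yet. An event_time-only join would have leaked it.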
5) Online serving parity
Training-serving parity is impossible if offline and online features follow different logic paths.
Preferred architecture
- One feature definition spec (declarative)
- Two executors:
- offline materializer
- online low-latency serving path
- Shared test vectors for both paths
Practical guardrails
- Keep online transforms minimal and deterministic
- Push heavy logic upstream (stream/batch precompute)
- Enforce feature TTLs (stale features should fail closed or degrade explicitly)
- Log served feature values + versions for replay/audit
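A sketch of the shared-test-vector guardrail: one feature definition, exercised by the same fixtures in both the offline materializer's CI and the online service's CI (function name and vectors are hypothetical):

```python
def rolling_spend_7d(amounts):
    """Single feature definition: sum of the last 7 daily amounts.
    This is the only place the logic lives; both executors call it."""
    return float(sum(amounts[-7:]))

# Shared fixtures, including the empty/cold-start edge case,
# asserted against both the offline and online code paths.
TEST_VECTORS = [
    ([1.0] * 10, 7.0),    # window truncates older days
    ([], 0.0),            # cold entity: explicit default
    ([5.0, 5.0], 10.0),   # shorter history than window
]

for inputs, expected in TEST_VECTORS:
    assert rolling_spend_7d(inputs) == expected
```

The design point is that "mathematically equivalent" reimplementations (see the anti-patterns in section 11) are replaced by a single definition plus fixtures that both paths must pass.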
6) Training-serving skew taxonomy
A. Definition skew
Different formulas in offline vs online code.
Signal: distribution shift immediately after deploy, even with stable traffic.
B. Time skew
Online uses fresher (or staler) data than what training assumed.
Signal: strong hourly/sessional error patterns.
C. Null/default skew
Offline imputed mean; online uses zero/null; or opposite.
Signal: error spikes on cold entities / sparse cohorts.
D. Entity-resolution skew
Different key mapping logic (user merge, symbol rename, account hierarchy).
Signal: cohort-specific degradation with high join-miss rates.
E. Version skew
Model expects feature v3; service emits v2 after partial rollback.
Signal: sudden quality cliff with no model change.
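Null/default skew (item C) is easy to reproduce in miniature. A sketch, assuming offline mean-imputation versus an online zero default:

```python
import statistics

train_values = [10.0, 12.0, None, 14.0]

# Offline: impute missing values with the training mean
mean = statistics.mean(v for v in train_values if v is not None)
offline = [v if v is not None else mean for v in train_values]

# Online: a different default (0.0) for the same missing value
online = [v if v is not None else 0.0 for v in train_values]

# Same entity, same moment, different feature value at serve time.
assert offline[2] == 12.0 and online[2] == 0.0
```

The error concentrates exactly where the signal predicted: cold entities and sparse cohorts, where the default path fires most often.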
7) Monitoring that actually catches skew
Track this per feature and per important cohort:
- Freshness lag: now - feature_event_time
- Availability rate: non-null ratio in serving
- Join hit rate (offline build + online requests)
- Population Stability Index (PSI) or JS divergence between train and serve
- Definition parity checks on sampled requests (recompute offline)
- Version mismatch count (model_feature_spec_version != served_feature_version)
Set page-worthy alerts for:
- high-impact feature missing rate
- version mismatches
- sudden freshness lag jump
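A minimal PSI implementation for the train-vs-serve divergence check, with bins fixed on the training sample (clipping constant and bin count are conventional choices, not from a standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) sample
    and a serving (actual) sample, binned on training quantiles."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range serving values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e = np.clip(e, 1e-6, None)              # avoid log(0) on empty bins
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

Commonly used rules of thumb treat PSI below about 0.1 as stable and above about 0.2 as worth investigating; tune thresholds per feature and cohort rather than globally.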
8) Backfill and replay policy (often ignored)
Backfills are where many teams quietly corrupt evaluation.
Rules:
- Backfill writes must be versioned, never silent overwrite
- Keep old and new derivations side-by-side during validation
- Recompute benchmark windows before promoting backfill output
- Tag experiments with feature snapshot ids
If you cannot reproduce the exact features used by a model last month, you do not have governance.
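The "versioned, never silent overwrite" rule can be enforced mechanically. A toy sketch (store layout and names are hypothetical):

```python
# Versioned store keyed by (feature, derivation_version)
store = {
    ("f_spend", "v1"): {"2026-01": 100.0},
}

def backfill(store, feature, new_version, values):
    """Write a backfill as a new version; refuse to overwrite history."""
    key = (feature, new_version)
    assert key not in store, "backfills must not silently overwrite"
    store[key] = values

backfill(store, "f_spend", "v2", {"2026-01": 98.0})

# Old and new derivations stay side-by-side during validation;
# promotion is a pointer flip after benchmark windows are recomputed.
assert ("f_spend", "v1") in store and ("f_spend", "v2") in store
```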
9) Quant-specific notes (execution/risk models)
For trading/execution contexts:
- distinguish exchange timestamp, gateway receive timestamp, strategy decision timestamp
- features near market events (auction, open/close, halts) need stricter ingest-time constraints
- avoid retrospective data vendor “clean-up” values in training unless those values were truly available live
- maintain “live-available” and “revised” datasets separately
A model trained on revised bars/ticks can look brilliant and fail at the open.
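The three-timestamp distinction in practice, as a sketch (values are hypothetical): the exchange timestamp alone would make a tick look usable that the strategy could not actually have seen by decision time.

```python
from datetime import datetime

# Three distinct timestamps for one tick
exchange_ts = datetime(2026, 2, 2, 9, 30, 0, 100_000)  # matched at the venue
gateway_ts  = datetime(2026, 2, 2, 9, 30, 0, 350_000)  # received by our gateway
decision_ts = datetime(2026, 2, 2, 9, 30, 0, 200_000)  # strategy decision time

# Live, the strategy can only use ticks received by decision time.
usable_live = gateway_ts <= decision_ts

assert exchange_ts <= decision_ts  # exchange_ts alone would wrongly include it
assert not usable_live             # gateway_ts correctly excludes it
```

The 250-microsecond exchange-to-gateway gap here is invented, but near opens, closes, auctions, and halts, gaps of this kind are exactly where event-time-only features go wrong.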
10) Rollout checklist
Before shipping a new feature or model:
- point-in-time unit tests include late-arrival scenarios
- offline/online parity tests pass on shared fixtures
- feature/version lineage visible in prediction logs
- null/default behavior is explicitly documented
- skew monitors and alerts are active
- rollback plan includes feature version pinning
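The first checklist item as a concrete unit test, in the style of the earlier sketches (lookup function and fixture names are hypothetical):

```python
from datetime import datetime

def latest_visible(history, T):
    """Point-in-time lookup: latest value whose event_time and
    ingest_time are both at or before T."""
    visible = [h for h in history
               if h["event_time"] <= T and h["ingest_time"] <= T]
    visible.sort(key=lambda h: (h["event_time"], h["ingest_time"]))
    return visible[-1]["value"] if visible else None

def test_late_arrival_is_excluded():
    history = [
        {"value": 1.0, "event_time": datetime(2026, 1, 1),
         "ingest_time": datetime(2026, 1, 2)},
        # Happened before T but ingested after T
        {"value": 9.0, "event_time": datetime(2026, 1, 5),
         "ingest_time": datetime(2026, 1, 9)},
    ]
    assert latest_visible(history, datetime(2026, 1, 6)) == 1.0

test_late_arrival_is_excluded()
```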
11) Anti-patterns to avoid
- “Latest snapshot join is close enough.”
- “We’ll add ingest_time later.”
- “Backfill replaced history, but metrics improved, so it’s fine.”
- “Online code path is different but mathematically equivalent.” (Usually false in edge cases.)
12) Bottom line
A feature store is not mainly a storage product. It is a time-consistency and reproducibility system.
If you protect point-in-time correctness and definition parity, model quality becomes a real signal. If you don’t, your AUC/Sharpe/precision can be mostly a pipeline artifact.
Suggested references
- Feast architecture/docs (offline-online feature consistency patterns)
- Uber Michelangelo feature platform papers/talks
- Tecton point-in-time correctness materials
- “Leakage in Data Mining” literature (classic framing)
- Industry postmortems on training-serving skew incidents