Purged + Embargoed Cross-Validation for Trading Models
Date: 2026-04-11
Category: finance
Purpose: A practical playbook for choosing between walk-forward, purged CV, embargoed CV, and combinatorial purged cross-validation (CPCV) when validating trading or execution models.
Why this matters
A lot of trading research dies the same embarrassing death: the model did not find alpha, it found temporal leakage.
In finance, samples are rarely IID:
- labels often depend on a future path,
- features often use rolling windows,
- fills and returns have serial dependence,
- regime changes make a single historical path a shaky judge.
That means ordinary k-fold CV is usually too optimistic, and even naive walk-forward can still be too flattering if you do not handle overlapping labels or feature lookbacks correctly.
This note is the practical answer to four questions:
- When is plain walk-forward enough?
- When do you need purging?
- When do you add embargo?
- When is CPCV worth the compute bill?
1) The core leakage problem
In trading datasets, a sample is often not a single timestamped row with independent meaning.
A label often spans an interval from trade_time (when the signal/order decision starts) to event_end_time (when the label is finally determined). The end point can be a:
- fixed horizon end
- stop-loss / take-profit hit
- barrier touch
- exit condition reached
If a train sample's event window overlaps a test sample's information window, the model can indirectly learn from the future.
Typical failure modes
- Overlapping labels
- Example: a 30-minute forward return label overlaps the next several samples.
- Feature lookback bleed
- Example: realized vol, imbalance, or EWMA features at the start of a post-test train fold still depend on prices from the test fold.
- Serial correlation bleed
- Immediately adjacent post-test samples are not independent, even if labels do not literally overlap.
- Single-path false confidence
- A single walk-forward history can make a model look robust when it only matched one lucky path.
2) The four validation modes
2.1 Plain chronological walk-forward
Use sequential train→test blocks:
- train on past
- test on the next block
- roll forward
Good for
- operational realism
- refit cadence design
- production-like simulation
- final “can I deploy this tomorrow?” checks
Weakness
Walk-forward preserves time order, but it does not automatically remove overlap leakage. If labels or features straddle fold boundaries, you can still cheat without noticing.
Default verdict
Use walk-forward as the baseline, not the whole validation story.
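As a baseline, the train→test→roll-forward loop above can be sketched as a tiny split generator. This is a minimal illustration, not a library API; the function name and the equal-width blocking are my own choices.

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for chronological walk-forward.

    The timeline is cut into n_folds equal blocks; each split trains on
    everything before a block and tests on that block. The first block is
    never a test set, so this yields n_folds - 1 splits.
    """
    bounds = np.linspace(0, n_samples, n_folds + 1, dtype=int)
    for i in range(1, n_folds):
        train = np.arange(0, bounds[i])
        test = np.arange(bounds[i], bounds[i + 1])
        yield train, test
```

Note that this deliberately does nothing about label overlap at the block boundaries; that is exactly the gap purging and embargo fill.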
2.2 Purged cross-validation
Purging removes from the training set any observation whose label horizon overlaps the test set.
Intuition
If a train label still depends on future prices that occur inside the test period, that train row is contaminated and must go.
Best for
- path-dependent labels
- triple-barrier style labeling
- forward-return classification/regression
- execution outcomes measured over a future window
Rule of thumb
If your label depends on anything after trade_time, ask:
“Could this train label still be partially determined by price action inside the test fold?”
If yes, purge it.
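The rule of thumb reduces to a standard interval-intersection test. A minimal sketch (the helper name is mine, and the intervals are assumed to be inclusive on both ends):

```python
def must_purge(train_start, train_end, test_start, test_end):
    """True if a train label interval [train_start, train_end] overlaps
    the test window [test_start, test_end], i.e. the train row is still
    partially determined by price action inside the test fold."""
    return train_start <= test_end and train_end >= test_start
```

A train row whose label ends before the test window starts, or begins after it ends, survives; everything else is contaminated and goes.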
2.3 Embargoed cross-validation
Embargo adds a buffer after the test set by removing the earliest post-test training observations.
Why purging alone is not enough
Even without direct label overlap, the first rows after a test fold can still inherit test-fold information via:
- long lookback indicators
- exponentially weighted features
- volatility estimators
- lagged microstructure state
- serially correlated flow / queue state
Best for
- rolling features with nontrivial lookback
- autocorrelated returns/order flow
- high-frequency microstructure data
- event-driven labels with lingering post-event dependence
Default verdict
Embargo is the cheap extra safety layer that saves you from “technically non-overlapping, practically still contaminated” samples.
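Mechanically, embargo is just dropping the earliest post-test rows from an already-purged train index. A minimal sketch, assuming integer sample indices and an embargo expressed in the same units:

```python
import numpy as np

def apply_embargo(train_idx, test_end_idx, embargo_size):
    """Drop the first `embargo_size` training rows that follow the test
    block, so lookback features there can no longer inherit test-period
    information."""
    train_idx = np.asarray(train_idx)
    embargoed = (train_idx > test_end_idx) & (train_idx <= test_end_idx + embargo_size)
    return train_idx[~embargoed]
```

Pre-test rows are untouched: embargo only guards the forward direction, where feature memory leaks out of the test fold.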
2.4 Combinatorial Purged Cross-Validation (CPCV)
CPCV splits the timeline into N ordered groups and chooses k groups at a time as test folds.
For each combination, training uses the remaining groups, with purging + embargo applied.
Number of split combinations:
C(N, k)
Approximate number of reconstructed test paths:
C(N, k) * k / N
Example: N=6, k=2 → 15 combinations and 5 test paths.
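As a sanity check on these counts before committing compute, they are one line of combinatorics (helper name is mine):

```python
from math import comb

def cpcv_counts(n_groups, n_test_groups):
    """Return (number of train/test splits, number of reconstructed
    test paths) for CPCV with N ordered groups and k test groups."""
    n_splits = comb(n_groups, n_test_groups)
    n_paths = n_splits * n_test_groups // n_groups
    return n_splits, n_paths
```

This is worth running early: the split count grows fast in N and k, and it is the number of full model fits you are signing up for.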
Why people use CPCV
Walk-forward gives you one historical path. CPCV gives you a distribution of out-of-sample outcomes across many recombined paths.
That makes it better for questions like:
- “Is this Sharpe robust or just one lucky ordering?”
- “How wide is my model-performance distribution?”
- “Does this model win only on one specific path?”
Best for
- model-family selection
- robustness ranking
- backtest-overfitting control
- strategy comparison before promotion
Weakness
- compute cost grows fast
- more moving parts to explain
- easy to misuse as a fancy badge without fixing the underlying data model
Default verdict
Use CPCV when promotion confidence matters more than raw simplicity.
3) When to use what
Use plain walk-forward when
- labels do not overlap much
- features have short memory
- you mainly care about production realism
- you are testing refit cadence or live handoff rules
Add purging when
- labels span future horizons
- barrier/event-based labeling is used
- execution-quality labels depend on post-decision path
- neighboring rows share the same future realization
Add embargo when
- feature lookbacks cross fold boundaries
- rolling/EWMA statistics are important
- adjacent post-test samples are still “warm” with test information
- your data is microstructure-heavy and serially dependent
Use CPCV when
- you are comparing model families, not just one candidate
- you need a distribution of OOS outcomes, not one number
- you want a stronger defense against path luck / backtest overfitting
- compute budget is available
4) Sizing purge and embargo windows
This is where teams often bluff. Do not pick these values because they “feel conservative.” Tie them to the actual data-generating process.
4.1 Purge size
Purge size should cover the maximum label overlap risk.
Practical rule
Purge at least the maximum future horizon used by the label.
Examples:
- 5-minute forward return label → purge at least 5 minutes around overlap risk
- triple-barrier label with max holding period 30 minutes → purge at least 30 minutes worth of overlapping candidates
- parent-order execution label ending at fill/timeout → purge by the max event horizon used for label finalization
If event end times are irregular, the correct implementation is not “drop a fixed count everywhere” but:
- compute trade_time and event_end_time per row
- remove any training row whose [trade_time, event_end_time] interval intersects a test-label interval
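With per-row clocks in hand, the irregular-horizon case is a vectorized version of the same intersection test. A minimal sketch, assuming times are comparable scalars (e.g. epoch seconds or bar indices) and inclusive intervals:

```python
import numpy as np

def purge_overlapping(trade_time, event_end_time, test_start, test_end, train_idx):
    """Keep only training rows whose label interval
    [trade_time, event_end_time] does not intersect the test window
    [test_start, test_end]."""
    t0 = np.asarray(trade_time)[train_idx]
    t1 = np.asarray(event_end_time)[train_idx]
    overlaps = (t0 <= test_end) & (t1 >= test_start)
    return np.asarray(train_idx)[~overlaps]
```

Because the test works on actual event end times, rows with short labels near the boundary survive while long-horizon neighbors are dropped, which a fixed-count purge cannot do.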
4.2 Embargo size
Embargo size should cover the largest post-test dependency tail that is not eliminated by purging.
Practical rule
Embargo by the larger of:
- effective feature lookback tail
- serial dependence decay window you care about
Examples:
- 63-day realized vol feature → embargo roughly 63 trading days if used naively across folds
- EWMA volatility with long half-life → embargo until the residual influence of test-period data is negligible
- short-horizon order-flow models → embargo by several bars / seconds / micro-batches if adjacent dependence is strong
Important nuance
Embargo is not just about literal lookback length. For exponentially weighted or stateful features, the relevant quantity is influence decay, not nominal window size.
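For exponentially weighted features, the influence-decay idea can be made concrete: a weight k bars back decays as 0.5^(k / half_life), so the embargo needed to push residual influence below a tolerance is half_life * log2(1 / tol). A minimal sketch (the helper and the 1% default tolerance are my own choices):

```python
import math

def ewma_embargo_size(half_life, tol=0.01):
    """Bars needed before an EWMA feature's residual weight on
    test-period data decays below `tol` of its initial influence."""
    return math.ceil(half_life * math.log2(1.0 / tol))
```

For example, a half-life of 20 bars at 1% tolerance implies an embargo of well over six half-lives, far more than the "nominal window" intuition suggests.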
5) A simple decision framework
Use this in order:
Step 1 — Define clocks
For every sample, store:
- feature_time
- trade_time
- event_end_time
- decision_time, if different from feature_time
If you cannot define these cleanly, your CV discussion is premature.
Step 2 — Ask label-overlap question
If a train row's label horizon can intersect test time, use purging.
Step 3 — Ask feature-memory question
If post-test train rows can still depend on test data via lookbacks or serial state, add embargo.
Step 4 — Ask robustness question
If a single walk-forward path is too fragile for model selection, upgrade to CPCV.
6) Recommended policy for quant research stacks
For day-to-day idea triage
Use:
- chronological walk-forward
- plus purging if labels overlap
- plus embargo if lookbacks are long
Goal: fast rejection of bad ideas without obvious leakage.
For model-family ranking / promotion
Use:
- purged + embargoed CV as the minimum
- CPCV when selecting between serious contenders
Goal: estimate robustness distribution, not just mean score.
For production readiness
Always end with:
- a final walk-forward / rolling refit simulation
- realistic costs, capacity, and latency assumptions
- no parameter retuning after seeing final OOS results
Goal: operational realism beats statistical elegance here.
7) What to report instead of one heroic Sharpe
For each candidate, report:
- median OOS Sharpe / IR
- 10th percentile OOS Sharpe
- hit rate across folds or reconstructed paths
- turnover / cost sensitivity by fold
- feature-drift or calibration drift by fold
- worst-fold drawdown / tail loss
- number of folds/paths where performance is negative
For CPCV specifically, summarize the distribution, not just the best path.
If the median is mediocre and only one path looks amazing, the amazing path is the liar.
8) Anti-patterns to ban
- Using random k-fold on time series
- Using walk-forward but ignoring overlapping labels
- Using embargo as a cosmetic fixed percentage with no link to features
- Tuning purge/embargo after inspecting OOS results until metrics look pretty
- Reporting only the best CPCV path
- Treating CPCV as a replacement for realistic walk-forward deployment tests
CPCV is not a magic spell. It is just a better microscope.
9) Minimal implementation sketch
1. Split timeline into ordered folds/groups
2. Choose test fold(s)
3. Build raw train set from remaining folds
4. Purge train rows whose label intervals overlap test intervals
5. Embargo earliest post-test rows still exposed to test-period influence
6. Fit model on cleaned train set
7. Evaluate on untouched test fold(s)
8. Aggregate metrics across folds / paths
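Steps 3 to 5 above can be combined into one function that builds a clean train index for a single test fold. A minimal sketch, assuming per-row trade_time and event_end_time arrays, a boolean test mask, and an embargo expressed in the same time units:

```python
import numpy as np

def purged_embargoed_split(trade_time, event_end_time, test_mask, embargo):
    """Build a purged + embargoed train index for one test fold."""
    trade_time = np.asarray(trade_time)
    event_end_time = np.asarray(event_end_time)
    test_mask = np.asarray(test_mask, dtype=bool)
    test_start = trade_time[test_mask].min()
    test_end = event_end_time[test_mask].max()
    train = np.arange(len(trade_time))[~test_mask]
    # purge: drop train rows whose label interval touches the test window
    keep = (event_end_time[train] < test_start) | (trade_time[train] > test_end)
    train = train[keep]
    # embargo: drop the earliest post-test rows still warm with test info
    train = train[~((trade_time[train] > test_end) &
                    (trade_time[train] <= test_end + embargo))]
    return train
```

Fitting and evaluation (steps 6 to 8) then run unchanged on the returned index; the test fold itself is never touched by purging or embargo.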
For CPCV:
- partition timeline into N ordered groups
- choose every combination of k test groups (or a sampled subset if compute is large)
- for each split, apply purge + embargo
- recombine test results into multiple pseudo-paths
- study the full OOS metric distribution
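The CPCV enumeration step is a direct use of combinations over group indices. A minimal sketch (the generator name is mine; purge/embargo would be applied per split as above):

```python
from itertools import combinations

def cpcv_splits(n_groups, n_test_groups):
    """Enumerate every (train_groups, test_groups) split for CPCV over
    N ordered groups with k test groups per split."""
    groups = range(n_groups)
    for test in combinations(groups, n_test_groups):
        train = tuple(g for g in groups if g not in test)
        yield train, test
```

Each group appears in the test set of C(N-1, k-1) splits, which is what lets the per-group test predictions be stitched back into multiple pseudo-paths.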
10) A practical starting template
For many trading ML problems, a sane first pass is:
- Walk-forward for final operational evaluation
- Purged CV whenever labels use future horizons
- Embargo based on longest meaningful feature-memory tail
- CPCV only for serious model-selection / promotion decisions
In other words:
- do not overcomplicate early research,
- do not under-defend promotion decisions.
11) Vellab-style practical suggestion
For a research platform like Vellab, a strong default policy would be:
- Research mode
- purged walk-forward with explicit trade_time / event_end_time
- Promotion mode
- CPCV on shortlisted models only
- Live-go mode
- final rolling refit simulation with realistic slippage/capacity controls
That keeps the expensive validation where it matters, instead of spending CPCV compute on every disposable idea.
Checklist
[ ] Defined trade_time and event_end_time for every label
[ ] Mapped feature lookback / memory tails explicitly
[ ] Added purging wherever labels overlap future windows
[ ] Added embargo wherever post-test dependence remains
[ ] Used walk-forward for final production realism
[ ] Used CPCV only where robustness distribution matters
[ ] Reported distributions, not just best-path performance
References
- López de Prado, M. (2018), Advances in Financial Machine Learning.
- López de Prado, M. (2020), Machine Learning for Asset Managers.
- Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014), The Probability of Backtest Overfitting.
- Joubert, J., Sestovic, D., Barziy, I., Distaso, W., & López de Prado, M. (2024), Enhanced Backtesting for Practitioners.
- skfolio documentation: CombinatorialPurgedCV API notes on n_folds, n_test_folds, purged_size, and embargo_size.
TL;DR
Walk-forward answers “would this have worked in order?” Purging answers “did my train labels overlap my test future?” Embargo answers “did post-test train rows still inherit test information?” CPCV answers “is this robust across many plausible paths, or just one lucky historical sequence?”
If you validate trading models without asking all four questions, your Sharpe is probably better at storytelling than trading.