Kwiz Computing Technologies

Backtesting Without Lookahead Bias: Combinatorial Purged Cross-Validation

Quantitative Finance
R
Machine Learning
Backtesting
Why standard cross-validation fails on financial time series and how Combinatorial Purged Cross-Validation (CPCV) provides reliable backtest estimates.
Author

Kwizera Jean

Published

March 1, 2026

Why Standard Cross-Validation Fails in Finance

Cross-validation is the gold standard for model evaluation in machine learning. Split your data into folds, train on some, test on others, and you get an unbiased estimate of out-of-sample performance. It works beautifully for i.i.d. data — images, text, tabular datasets where observations are independent.

Financial time series violate this assumption fundamentally. Stock prices, forex rates, and other market data exhibit:

  • Serial correlation — today’s price depends on yesterday’s
  • Regime changes — the statistical properties of returns shift over time
  • Label leakage — if your target variable is a forward return, adjacent observations share information

When you apply standard k-fold cross-validation to financial data, training folds contain information about test folds. The model “sees” the future through correlated observations near the fold boundaries. The result: backtest performance that looks better than what you’ll achieve in live trading.
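A quick simulation makes the leakage concrete. The sketch below uses pure noise rather than real market data: adjacent 5-day forward-return labels come out strongly correlated simply because they share four of their five daily returns.

```r
# Simulated data: daily returns are pure i.i.d. noise
set.seed(42)
daily_returns <- rnorm(1000)
horizon <- 5

# label[t] = sum of returns over days t+1 .. t+horizon (a 5-day forward return)
labels <- vapply(seq_len(1000 - horizon), function(t) {
  sum(daily_returns[(t + 1):(t + horizon)])
}, numeric(1))

# Correlation between each label and its immediate neighbour:
# close to 4/5 by construction, despite the returns being independent
cor(labels[-1], labels[-length(labels)])
```

If observations on either side of a fold boundary carry labels this correlated, a model trained on one side has effectively seen the other.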

The Purging and Embargo Solution

Marcos Lopez de Prado’s Combinatorial Purged Cross-Validation (CPCV) addresses these issues through two mechanisms:

Purging

Purging removes observations from the training set that overlap temporally with the test set’s label window. If your strategy predicts 5-day returns, then observations within 5 days of any test-set boundary are excluded from training.

Timeline:  |---Train---|xxxPURGEDxxx|---Test---|xxxPURGEDxxx|---Train---|

This eliminates the most direct form of information leakage: training on data whose label period overlaps with test observations.
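In code, the purge zone for a single test block is just the horizon-width band on each side of it. The test boundaries below are hypothetical, chosen only for illustration.

```r
# Purge zone for a hypothetical test block at indices 401:500,
# with a 5-day label horizon
horizon <- 5
test_start <- 401
test_end <- 500

purge_zone <- c(seq(test_start - horizon, test_start - 1),
                seq(test_end + 1, test_end + horizon))
purge_zone
#> [1] 396 397 398 399 400 501 502 503 504 505
```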

Embargo

An embargo period extends the purge beyond the strict label window. Even after purging label overlap, serial correlation means that observations just outside the purge zone still carry information about the test period. The embargo adds a buffer (typically 1-2% of the dataset length) to ensure genuine independence.

Timeline:  |---Train---|xxPURGExx|--EMBARGO--|---Test---|--EMBARGO--|xxPURGExx|---Train---|
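The embargo length itself is a one-liner; the dataset size below is an assumption for illustration.

```r
# Embargo length for roughly ten years of daily bars at a 1% embargo
n_obs <- 2520
embargo_pct <- 0.01

embargo_length <- ceiling(n_obs * embargo_pct)
embargo_length
#> [1] 26
```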

The Combinatorial Approach

Standard walk-forward testing uses the data once: train on the first 80%, test on the last 20%. This is wasteful — you get a single estimate of performance from one specific market regime.

CPCV splits the data into N contiguous groups and evaluates every combination of k groups as the test set, subject to purging and embargo constraints. This produces \(\binom{N}{k}\) unique train/test splits.

This gives you:

  1. Multiple independent performance estimates rather than a single point estimate
  2. A distribution of backtest results that reveals strategy robustness
  3. More efficient use of limited data — every observation appears in test sets
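The combinatorial count is easy to verify in base R; the sketch below assumes the N = 6, k = 2 configuration used later in this post.

```r
# Number of unique train/test splits for N = 6 groups with k = 2 test groups
choose(6, 2)
#> [1] 15

# Each group appears as a test group in choose(N - 1, k - 1) of the splits,
# which is why every observation ends up in some test set
choose(6 - 1, 2 - 1)
#> [1] 5
```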

R Implementation

Here is a simplified implementation of the CPCV framework:

#' Generate CPCV train/test splits with purging and embargo
#'
#' @param n_obs Number of observations
#' @param n_groups Number of groups to split into
#' @param n_test Number of groups to use as test in each split
#' @param purge_length Number of observations to purge at boundaries
#' @param embargo_pct Embargo as a fraction of dataset length
#' @return List of train/test index pairs
generate_cpcv_splits <- function(n_obs, n_groups = 6, n_test = 2,
                                  purge_length = 5, embargo_pct = 0.01) {

  embargo_length <- ceiling(n_obs * embargo_pct)
  group_size <- floor(n_obs / n_groups)

  # Generate group boundaries (the last group absorbs any remainder,
  # so every observation belongs to some group)
  groups <- lapply(seq_len(n_groups), function(g) {
    start <- (g - 1) * group_size + 1
    end <- if (g == n_groups) n_obs else g * group_size
    start:end
  })

  # Generate all combinations of test groups
  test_combos <- combn(n_groups, n_test, simplify = FALSE)

  splits <- lapply(test_combos, function(test_groups) {
    test_idx <- unlist(groups[test_groups])
    train_groups <- setdiff(seq_len(n_groups), test_groups)
    train_idx <- unlist(groups[train_groups])

    # Apply purging and embargo around each test group separately:
    # test groups need not be contiguous, so taking range(test_idx)
    # would miss the inner boundaries between non-adjacent test groups
    exclusion_zone <- unlist(lapply(groups[test_groups], function(g_idx) {
      g_start <- g_idx[1]
      g_end <- g_idx[length(g_idx)]
      buffer <- purge_length + embargo_length
      c(seq(g_start - buffer, g_start - 1),
        seq(g_end + 1, g_end + buffer))
    }))
    exclusion_zone <- unique(
      exclusion_zone[exclusion_zone > 0 & exclusion_zone <= n_obs]
    )

    train_idx <- setdiff(train_idx, exclusion_zone)

    list(train = train_idx, test = test_idx)
  })

  splits
}
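One detail worth calling out is the boundary clipping: when a test block sits at the very start (or end) of the series, part of its purge zone falls outside the valid index range and is simply dropped. The indices below are hypothetical.

```r
# Boundary clipping for a test block at the very start of the series
n_obs <- 600
purge_length <- 5
test_range <- c(1, 100)   # hypothetical first-group test block

zone <- c(seq(test_range[1] - purge_length, test_range[1] - 1),
          seq(test_range[2] + 1, test_range[2] + purge_length))

# Indices -4..0 on the left are out of range and get filtered away;
# only the right-hand purge band survives
zone[zone > 0 & zone <= n_obs]
#> [1] 101 102 103 104 105
```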

Applying CPCV to a Strategy

library(dplyr)
library(purrr)

# Generate splits
splits <- generate_cpcv_splits(
  n_obs = nrow(market_data),
  n_groups = 6,
  n_test = 2,
  purge_length = 10,
  embargo_pct = 0.02
)

# Evaluate strategy on each split
results <- map_dfr(splits, function(split) {
  train_data <- market_data[split$train, ]
  test_data <- market_data[split$test, ]

  # Fit strategy on training data
  model <- fit_strategy(train_data)

  # Evaluate on test data
  signals <- predict_signals(model, test_data)
  returns <- compute_strategy_returns(signals, test_data)

  tibble(
    sharpe = mean(returns) / sd(returns) * sqrt(252),
    max_drawdown = max_drawdown(cumsum(returns)),
    n_trades = sum(abs(diff(signals)) > 0)
  )
})

# Summary: distribution of out-of-sample performance
summary(results$sharpe)

CPCV vs Other Methods

Method               Lookahead Bias   Data Efficiency   Multiple Estimates
------------------   --------------   ---------------   ------------------
Walk-Forward         Low              Low               No
Standard k-Fold CV   High             High              Yes
Time-Series Split    Low              Low               Limited
CPCV                 None             High              Yes

Walk-forward testing avoids lookahead bias but gives you a single estimate from one market regime. Standard CV is efficient but leaks information. CPCV achieves both: no lookahead bias and multiple independent estimates.

Practical Considerations

Choosing Parameters

  • n_groups: More groups = more combinations but smaller test sets. 5-8 groups is typical for multi-year datasets.
  • n_test: 2 test groups is the most common choice, providing a good balance between the number of combinations and test set size.
  • purge_length: Should match or exceed your strategy’s maximum lookahead window (e.g., if you predict 5-day returns, purge at least 5 observations).
  • embargo_pct: 1-2% is typical. Higher for strategies that are more sensitive to serial correlation.
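Putting those rules of thumb together, a starting configuration for roughly ten years of daily bars with a 5-day label horizon might look like this (all numbers are illustrative, not prescriptive):

```r
n_obs        <- 2520   # ~10 years of daily observations
n_groups     <- 6      # 5-8 typical for multi-year datasets
n_test       <- 2      # gives choose(6, 2) = 15 splits
purge_length <- 5      # should be >= the 5-day label horizon
embargo_pct  <- 0.01   # 1-2% of dataset length
```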

Interpreting Results

The distribution of Sharpe Ratios across CPCV splits tells you more than any single backtest number:

  • Consistently positive across splits → Robust strategy with genuine edge
  • High variance across splits → Strategy is regime-dependent; proceed with caution
  • Negative in any splits → Strategy may not generalise; investigate which market conditions cause failure
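As a concrete reading of these rules, here is how the checks might look on an illustrative vector of per-split Sharpe ratios (the numbers are made up for demonstration):

```r
# Illustrative per-split Sharpe ratios -- one per choose(6, 2) = 15 splits
sharpes <- c(1.2, 0.9, 1.5, 0.4, 1.1, 0.8, 1.3, 0.6, 1.0, 0.7,
             1.4, 0.5, 1.2, 0.9, 1.1)

mean(sharpes > 0)             # fraction of splits with a positive Sharpe
#> [1] 1
sd(sharpes) / mean(sharpes)   # relative dispersion across splits
```

A strategy with every split positive and low relative dispersion is the pattern you want to see before trusting any headline backtest number.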

Integration in the Kwiz Quants Pipeline

CPCV is the second validation gate in our pipeline, applied after the Deflated Sharpe Ratio screening. Strategies that pass DSR are subjected to CPCV to verify that their performance generalises across different time periods — not just the specific window that happened to produce the best-looking backtest.

Only strategies that show consistently positive risk-adjusted returns across all CPCV splits proceed to MT5 online backtesting. This ensures that when we deploy a strategy to live trading, we have evidence of robustness, not just a single favourable backtest.

© 2026 Kwiz Computing Technologies. All rights reserved.
Data Science & Technology | Environmental Analytics | Quantitative Finance

 

Built with Quarto