Biases in Backtesting | Trading Glass

Your backtest results might be lying to you — here’s how to spot it and fix it before going live.

Introduction

Backtesting is critical.

But bad backtesting?

It’s worse than no testing at all — because it gives you false confidence.

Most traders don’t lose because they’re lazy. They lose because they over-trusted a strategy that looked great in hindsight — but was built on invisible flaws.

Three forces conspire to inflate every backtest: (1) you only ever publish the strategies that worked on history (selection), (2) you tune until they work (overfit), (3) your cost model understates real execution friction. The result is a Sharpe distribution centered well above your live distribution. Understanding why is what separates testing from theatre.

Prereqs: comfort with in-sample vs out-of-sample, basic Monte Carlo, Sharpe ratio. Module path: this lesson covers the structural errors that make a backtest lie. The next lesson, Edge Degradation, covers what happens to honest edges over time. Outliers covers the third lie: a single fat tail masquerading as skill.

The 4 Most Common Backtest Biases

1. What is lookahead bias?

Using future information that wasn’t actually available at the time of trade.

Examples:

Entering based on the close of a candle — before it’s actually closed
Calculating moving average crossovers using the current, unclosed bar's price — the cross only becomes valid after the bar closes; using its in-progress value is lookahead
Using signals from a candle that hasn’t fully formed

Why it's dangerous: Your entries and exits appear “accurate,” but they’re unrealistically perfect — because you’re cheating time.

How to fix it:

Only act on closed candles (use bar replay or time-based logic)
Avoid functions that reference “future bars” in code
Simulate entries realistically (e.g. next bar open, bid/ask spreads)
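The "closed candle, next bar open" rule can be sketched in a few lines of plain Python. This is a minimal illustration, not a full backtester: the function names, the 3/5 window lengths, and the toy price series are all assumptions made for the example. The point is only that the cross is evaluated on bar i using closes up to and including i, and the fill is booked at the open of bar i + 1.

```python
from statistics import mean

def sma(values, window, end):
    """Simple moving average over the `window` values ending at index `end`."""
    if end + 1 < window:
        return None  # not enough history yet
    return mean(values[end - window + 1 : end + 1])

def crossover_entries(opens, closes, fast=3, slow=5):
    """Return (fill_bar_index, fill_price) for bullish fast/slow SMA crosses.

    The signal on bar i uses only closes[0..i], i.e. fully closed bars.
    The fill is taken at opens[i + 1], the first price available afterwards.
    """
    entries = []
    for i in range(1, len(closes) - 1):  # stop one bar early: bar i + 1 must exist
        prev_fast, prev_slow = sma(closes, fast, i - 1), sma(closes, slow, i - 1)
        cur_fast, cur_slow = sma(closes, fast, i), sma(closes, slow, i)
        if None in (prev_fast, prev_slow, cur_fast, cur_slow):
            continue
        if prev_fast <= prev_slow and cur_fast > cur_slow:
            entries.append((i + 1, opens[i + 1]))
    return entries

# Toy data (hypothetical): flat, then a rally. Each open equals the prior close.
closes = [10, 10, 10, 10, 10, 10, 12, 13, 14, 15]
opens = [10] + closes[:-1]
print(crossover_entries(opens, closes))
```

A lookahead version of the same logic would book the fill at closes[i] or, worse, at opens[i]; comparing the two equity curves is a quick way to see how much of a backtest's "accuracy" comes from cheating time.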

2. What is overfitting in trading strategies?

Creating a strategy that performs well only on past data — but fails in real-time.

Symptoms:

Too many filters (volume spike + RSI + MA + pattern + moon phase)
Perfect equity curve in one market — but breaks in others
Strategy only works on one pair, one timeframe, one year

Why it's dangerous: You're not discovering an edge; you're memorizing noise. The arithmetic is unforgiving: with as few as 7 independent trials at the standard 5% level, the probability of at least one false positive is 1 − 0.95^7 ≈ 30%. The Deflated Sharpe Ratio of Bailey & López de Prado (2014) adjusts your reported Sharpe for the number of trials run.
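The 7-trial figure can be checked directly. The sketch below assumes the trials are independent and that none of the strategies has real edge; the function name is made up for the example.

```python
def prob_any_false_positive(n_trials, alpha=0.05):
    """P(at least one of n independent no-edge tests passes at level alpha)."""
    return 1.0 - (1.0 - alpha) ** n_trials

for n in (1, 7, 20, 100):
    print(f"{n:>3} trials -> {prob_any_false_positive(n):.1%}")
```

Real parameter sweeps are correlated rather than independent, so treat this as an intuition pump, not an exact error rate.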

How to fix it:

Test across multiple instruments & time periods
Keep your rules simple and robust
Apply Combinatorially Symmetric Cross-Validation (CSCV) and report Probability of Backtest Overfitting (PBO). At PBO > 0.5 your "best" strategy is more likely overfit than not
Walk-forward (anchored): fit on [t0, t0+12m], test on [t0+12m, t0+15m]; then extend the fit window to [t0, t0+15m], test on the next 3 months, re-fit, and repeat. Report only the concatenated test equity. A slice may join the fit window only after its test result has been recorded
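The walk-forward schedule can be generated mechanically. The sketch below assumes "anchored" means every fit window starts at t0 and grows, and it uses month indices instead of dates to stay short; the function name and defaults are illustrative.

```python
def anchored_walk_forward(n_months, initial_fit=12, test_len=3):
    """Yield (fit_range, test_range) as half-open month-index ranges.

    Every fit window starts at month 0 and ends where its test window
    begins, so a test slice is never part of the fit that precedes it.
    """
    fit_end = initial_fit
    while fit_end + test_len <= n_months:
        yield (0, fit_end), (fit_end, fit_end + test_len)
        fit_end += test_len

for fit, test in anchored_walk_forward(24):
    print("fit", fit, "test", test)
```

The reported equity curve is then the concatenation of the test ranges only, here months 12 through 24.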

Equity Curve Simulator (interactive): win rate 55%, payoff 1.5:1; sample run final equity $34,281 (+242.8%)
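A simulator of this kind is easy to approximate offline. The sketch below takes the 55% win rate and 1.5:1 payoff from the widget; the 1% risk per trade, the $10,000 starting equity, and the 200-trade horizon are assumptions, so it will not reproduce the widget's exact figure. What it does show is how widely identical parameters can scatter across runs.

```python
import random

def simulate_equity(n_trades, win_rate=0.55, payoff=1.5, risk_frac=0.01,
                    start_equity=10_000.0, seed=None):
    """Return the equity path of a fixed-fractional strategy.

    Each trade risks risk_frac of current equity: a win earns
    payoff * risk, a loss costs the full risk.
    """
    rng = random.Random(seed)
    equity = start_equity
    path = [equity]
    for _ in range(n_trades):
        risk = equity * risk_frac
        equity += risk * payoff if rng.random() < win_rate else -risk
        path.append(equity)
    return path

# Same parameters, 1000 different random paths.
finals = sorted(simulate_equity(200, seed=s)[-1] for s in range(1000))
print(f"5th pct: {finals[50]:.0f}  median: {finals[500]:.0f}  95th pct: {finals[950]:.0f}")
```

If a single backtest equity curve sits near the top of that spread, part of what you are looking at is luck rather than edge.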

3. What is survivorship bias?

Only testing systems or assets that still exist — ignoring those that failed or changed drastically.

Examples:

Backtesting an index-style universe using today's constituents (e.g. current S&P 500 members) and projecting their history backward — winners are over-represented because losers were delisted
Note that cost mismodeling (slippage, spread, fees, black-swan blowups) is not survivorship bias: it is a distinct bias and needs its own fix

Why it's dangerous: Your universe contains only the winners, so historical returns are inflated, and you're assuming the conditions that created your edge will always exist.

How to fix it:

Use complete historical datasets, not just what exists now
Include “dead” assets in portfolio-level testing
Simulate volatility regimes, liquidity drops, and spreads increasing
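Including "dead" assets comes down to asking, for every simulated date, which symbols were actually tradable on that date. A minimal point-in-time universe filter might look like the sketch below; the Listing record, its fields, and the tickers are hypothetical, and a real dataset would also need index membership dates and delisting returns.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Listing:
    symbol: str
    listed: date
    delisted: Optional[date] = None  # None means still trading today

def universe_on(listings, as_of):
    """Symbols tradable on as_of, whether or not they still exist today."""
    return sorted(
        item.symbol for item in listings
        if item.listed <= as_of and (item.delisted is None or as_of < item.delisted)
    )

# Hypothetical tickers, for illustration only.
listings = [
    Listing("AAA", date(2005, 1, 3)),
    Listing("DEAD", date(2006, 6, 1), delisted=date(2012, 3, 15)),
    Listing("NEW", date(2018, 9, 4)),
]
print(universe_on(listings, date(2010, 1, 4)))
print(universe_on(listings, date(2020, 1, 2)))
```

A backtest that builds its 2010 universe from the 2020 list would silently drop DEAD and trade NEW eight years before it listed.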

4. What is data-snooping (multiple-testing) bias?

If you test 100 strategies with no real edge at the 5% significance level, about 5 will look "good" by pure chance.

This is the bias that makes most public backtests garbage. Every parameter you sweep, every variation you tweak, every chart you eyeball is another silent trial — and the more trials you run, the higher the probability that something looks like edge purely from noise.
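You can watch this happen with strategies that have no edge by construction. The sketch below draws daily returns from a zero-mean Gaussian (an assumption; real returns are fatter-tailed) and reports the best Sharpe found as the number of "strategies" grows. All names and the 1% daily volatility are illustrative.

```python
import random
from statistics import mean, stdev

def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample stdev), not annualized."""
    return mean(returns) / stdev(returns)

def best_of_noise(n_strategies, n_periods=252, seed=0):
    """Best in-sample Sharpe among strategies whose true Sharpe is zero."""
    rng = random.Random(seed)
    return max(
        sharpe([rng.gauss(0.0, 0.01) for _ in range(n_periods)])
        for _ in range(n_strategies)
    )

for n in (1, 10, 100):
    # Multiply by sqrt(252) to annualize for readability.
    print(n, round(best_of_noise(n) * 252 ** 0.5, 2))
```

Every number printed is pure noise; the only thing that changes between rows is how many times you looked.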

How to fix it:

Track the number of trials honestly (every parameter combination, every variant counts)
Apply the Deflated Sharpe Ratio (Bailey & López de Prado, 2014) — it adjusts your reported Sharpe for the number of trials
Use CSCV (Combinatorially Symmetric Cross-Validation) to estimate the Probability of Backtest Overfitting (PBO). PBO > 0.5 → your "best" strategy is more likely overfit than not
Tools: the mlfinlab library (Hudson & Thames, implementing López de Prado's methods), or hand-roll CSCV in ~50 lines of Python
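A hand-rolled CSCV can be sketched as follows. This is a simplified reading of the procedure, under stated assumptions: performance is measured by per-period Sharpe, n_blocks is even, leftover rows that do not fill a block are dropped, and ties in rank are not treated specially. The function and argument names are made up for the example.

```python
from itertools import combinations
from math import log
from statistics import mean, stdev

def sharpe(returns):
    return mean(returns) / stdev(returns)

def pbo_cscv(returns_by_strategy, n_blocks=4):
    """Estimate PBO from equal-length return series, one list per strategy."""
    n_obs = len(returns_by_strategy[0])
    size = n_obs // n_blocks  # remainder rows are dropped
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]
    n_strats = len(returns_by_strategy)
    logits = []
    # Every way of choosing half the blocks as in-sample; the rest is out-of-sample.
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_perf = [sharpe([r[t] for t in is_rows]) for r in returns_by_strategy]
        oos_perf = [sharpe([r[t] for t in oos_rows]) for r in returns_by_strategy]
        best = max(range(n_strats), key=is_perf.__getitem__)
        # OOS rank of the in-sample winner: 1 = worst, n_strats = best.
        rank = sum(p <= oos_perf[best] for p in oos_perf)
        omega = rank / (n_strats + 1)
        logits.append(log(omega / (1 - omega)))
    # Share of splits where the in-sample winner lands at or below the OOS median.
    return sum(lam <= 0 for lam in logits) / len(logits)
```

Feed it the return series of every variant you tried, not just the finalists; dropping the losers before running CSCV reintroduces the very selection it is meant to measure.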

Other Biases to Watch For

Bias Type	Description	Fix
Selection bias	Only testing your “favorite” trades	Include every trade in your data sample
Cherry picking	Manually excluding ugly outcomes	Log every result, good or bad
Optimism bias	Assuming you’ll always get filled at ideal prices	Simulate slippage and order book depth realistically
Anchoring bias	Refusing to retest or abandon old systems	Let data guide decisions, not nostalgia

Best Practices for Honest Backtesting

1. Use realistic assumptions

Apply a cost floor before judging edge: crypto perp ≈ 5 bps fee + 2–5 bps slippage round-trip; equities ≈ half-spread + 0.1·σ·√(size/ADV). If your edge dies under realistic costs, it was never edge
Account for execution delay (e.g. not entering at candle close)
Simulate partial fills for large size

Cost-floor models by asset class. Apply these before judging edge.

Asset class	Fee or spread	Slippage or impact	Total round-trip floor
Crypto perp	~5 bps	2 to 5 bps	7 to 10 bps
Equities	~half-spread	0.1 * sigma * sqrt(size/ADV)	Spread plus impact term
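The two cost floors translate directly into code. In the sketch below the function names and example inputs are assumptions, sigma is taken to be daily volatility expressed in bps, and the equities formula is applied exactly as written in the table (the text does not say whether it is per side or round-trip, so check that against your own execution data).

```python
from math import sqrt

def crypto_perp_cost_bps(fee_bps=5.0, slippage_bps=3.5):
    """Round-trip cost floor in bps: fee plus slippage (2 to 5 bps in the table)."""
    return fee_bps + slippage_bps

def equity_cost_bps(spread_bps, daily_vol_bps, size, adv, impact_coef=0.1):
    """Half-spread plus square-root impact, in bps.

    impact = impact_coef * sigma * sqrt(size / ADV), where size / ADV is
    the order's share of average daily volume.
    """
    return spread_bps / 2.0 + impact_coef * daily_vol_bps * sqrt(size / adv)

def net_edge_bps(gross_edge_bps, cost_bps):
    """Edge left per trade after costs; zero or below means nothing tradable."""
    return gross_edge_bps - cost_bps

# Hypothetical order: 4 bps spread, 2% daily vol, 1% of ADV.
cost = equity_cost_bps(spread_bps=4, daily_vol_bps=200, size=10_000, adv=1_000_000)
print(round(cost, 2), round(net_edge_bps(6.0, cost), 2))
```

Because impact grows with the square root of size/ADV, a strategy that clears the floor at small size can fall below it as you scale up.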

2. Separate in-sample and out-of-sample periods

Train your strategy on one period
Validate it on a completely different one. If performance holds across both, the strategy is more robust
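For time series the split has to be chronological, never shuffled, or future data leaks into the training period. A minimal sketch, with a hypothetical function name and an assumed 30% out-of-sample share:

```python
def chronological_split(rows, oos_fraction=0.3):
    """Split time-ordered rows into (in_sample, out_of_sample) without shuffling."""
    if not 0.0 < oos_fraction < 1.0:
        raise ValueError("oos_fraction must be strictly between 0 and 1")
    cut = len(rows) - round(len(rows) * oos_fraction)
    return rows[:cut], rows[cut:]

in_sample, out_of_sample = chronological_split(list(range(8)), oos_fraction=0.25)
print(in_sample, out_of_sample)
```

Tune only on the first part, and evaluate on the second part once.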

3. Keep your strategy as simple as possible

“A system is only as good as its worst assumption.”

Fewer moving parts = less overfitting risk. (See Outliers and Their Impact on Metrics for how a single bar can create the illusion of a fitted edge, and Sharpe Ratio & Sortino Ratio for the metric most degraded by these biases.) The simpler it is, the easier it is to test, improve, and trust

FAQ

What is lookahead bias in backtesting?

Lookahead bias is using future information that wasn't actually available at the time of the simulated trade — for example, acting on a candle's close price before that candle has fully closed. It produces unrealistically perfect entries that vanish in live trading.

What is overfitting in trading strategies?

Overfitting is creating a strategy that performs well only on past data because it has memorized noise rather than discovered structure. With as few as 7 independent trials at the standard 5% level, the probability of at least one false positive already exceeds 30% (1 − 0.95^7).

How much should I discount my backtest Sharpe ratio?

Working heuristic: discount your backtest Sharpe by 30–50% before believing it. Even with rigorous IS/OOS, regime change and execution friction take roughly that bite. If your strategy is unprofitable after the haircut, it has no edge — only fitting.

Final Thought

Most failed traders didn’t skip testing. They trusted flawed testing.

Your system's performance is only as reliable as the integrity of your backtest.

Backtest Sharpe haircut: 30 to 50%. Even with rigorous IS/OOS, regime change and execution friction take roughly this bite out of reported Sharpe. Apply the haircut before you judge edge.

Pre-trust checklist (5 items). Before you bet a dollar on a backtest, run this:

IS/OOS split with no peeking (re-running OOS after a poor result silently turns it into IS)
Costs floored at realistic exchange numbers (≈ 5–10 bps round-trip for crypto perps)
Tested across >1 instrument and >1 regime
Parameter count << degrees of freedom in the data
Sharpe deflated for trial count (DSR), and PBO < 0.5 from CSCV

Fail any one → assume your edge is artifact.
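The last checklist item, deflating Sharpe for trial count, can be sketched with the standard library alone. The formulas below follow the Deflated Sharpe Ratio as it is commonly stated (an expected-maximum benchmark under the null, then a probabilistic Sharpe test against it); verify them against Bailey & López de Prado (2014) before relying on the numbers. Function names are assumptions, Sharpe inputs are per-period rather than annualized, and n_trials must be at least 2.

```python
from math import e, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, sharpe_std):
    """Expected best Sharpe among n_trials strategies with zero true Sharpe.

    sharpe_std is the standard deviation of Sharpe estimates across trials.
    """
    return sharpe_std * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe(sr, n_obs, n_trials, sharpe_std, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the multiple-testing benchmark.

    n_obs is the number of return observations; skew and kurt describe the
    strategy's returns (0 and 3 correspond to a normal distribution).
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - benchmark) * sqrt(n_obs - 1) / denom)

# Same observed Sharpe, judged against 10 trials and then 1000 trials.
print(round(deflated_sharpe(0.1, 500, 10, 0.05), 3))
print(round(deflated_sharpe(0.1, 500, 1000, 0.05), 3))
```

The more variants you tried, the higher the benchmark the winner has to clear, so the honest trial count matters as much as the Sharpe itself.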

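Further reading: Bailey, Borwein, López de Prado, Zhu (2014), Pseudo-Mathematics and Financial Charlatanism, on backtest overfitting and out-of-sample performance. Bailey & López de Prado (2014), The Deflated Sharpe Ratio, the DSR paper. Bessembinder (2018), Do Stocks Outperform Treasury Bills?, often cited on survivorship: most individual stocks underperform T-bills over their lifetimes. López de Prado (2018), Advances in Financial Machine Learning, ch. 11–14.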
