Biases in Backtesting
9 min read
Avoid survivorship bias, overfitting, and lookahead bias that make backtest results lie to you before you go live.
Your backtest results might be lying to you — here’s how to spot it and fix it before going live.
Backtesting is critical.
But bad backtesting?
It’s worse than no testing at all — because it gives you false confidence.
Most traders don’t lose because they’re lazy. They lose because they over-trusted a strategy that looked great in hindsight — but was built on invisible flaws.
Three forces conspire to inflate every backtest: (1) you only ever publish the strategies that worked on history (selection), (2) you tune until they work (overfit), (3) your cost model understates real execution friction. The result is a Sharpe distribution centered well above your live distribution. Understanding why is what separates testing from theatre.
Prereqs: comfort with in-sample vs out-of-sample, basic Monte Carlo, Sharpe ratio. Module path: this lesson covers the structural errors that make a backtest lie. The next lesson, Edge Degradation, covers what happens to honest edges over time. Outliers covers the third lie: a single fat tail masquerading as skill.
Using future information that wasn’t actually available at the time of trade.
Examples:
Why it's dangerous: Your entries and exits appear “accurate,” but they’re unrealistically perfect — because you’re cheating time.
How to fix it:
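One common guard against lookahead, sketched here under stated assumptions rather than as a definitive method, is to lag every signal by at least one bar so that a decision made on a bar's close can only earn the next bar's return. The helper names (`bar_returns`, `backtest`) and the toy "sign of the bar's own return" signal are hypothetical, chosen only to make the leak obvious.

```python
import random


def bar_returns(closes):
    """Close-to-close simple returns; element i is the return of bar i+1."""
    return [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def sign(x):
    return (x > 0) - (x < 0)


def backtest(closes, lag):
    """Sum of signal * return for a toy signal: the sign of a bar's own return.

    lag=0 trades bar t using bar t's own close (lookahead).
    lag=1 only acts on information from the previous, fully closed bar.
    """
    rets = bar_returns(closes)
    signals = [sign(r) for r in rets]
    total = 0.0
    for t in range(lag, len(rets)):
        total += signals[t - lag] * rets[t]
    return total


# Assumption: a driftless random walk, so any honest strategy has no edge.
random.seed(0)
closes = [100.0]
for _ in range(2000):
    closes.append(closes[-1] * (1.0 + random.gauss(0.0, 0.01)))

cheating = backtest(closes, lag=0)  # captures every bar's absolute move
honest = backtest(closes, lag=1)    # same signal, one bar late

print(f"lookahead total return: {cheating:.2f}")
print(f"lagged total return:    {honest:.2f}")
```

With `lag=0` the total equals the sum of absolute bar returns, which no live trader could capture; with `lag=1` the same signal on the same data has no built-in advantage.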
Creating a strategy that performs well only on past data — but fails in real-time.
Symptoms:
Why it's dangerous: You’re not discovering an edge — you’re memorizing noise. Bailey & López de Prado (2014) showed that with as few as 7 trials at the standard 5% level, the probability of a false positive exceeds 30%. Their Deflated Sharpe Ratio adjusts your reported Sharpe for the number of trials run.
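The 7-trial figure can be reproduced with a few lines of arithmetic. This sketch assumes the trials are independent, which real parameter sweeps rarely are, so treat it as an illustration of the scaling rather than an exact rate; the function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_trials, alpha=0.05):
    """Chance that at least one of n independent no-edge trials passes at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_trials


for n in (1, 7, 20, 100):
    p = prob_at_least_one_false_positive(n)
    print(f"{n:>3} trials -> {p:.1%} chance of a spurious 'winner'")

# 7 trials at alpha = 0.05 gives roughly 0.302, i.e. just over 30%.
p_seven = prob_at_least_one_false_positive(7)
```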
How to fix it:
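The walk-forward schedule this fix describes (fit on 12 months, test on the next 3, roll forward 3, re-fit) can be sketched as a generator of month offsets. This is a minimal illustration that assumes a fixed-length rolling fit window; the source does not say whether the fit window rolls or expands, and the function name and tuple layout are hypothetical.

```python
def walk_forward_windows(n_months, fit_months=12, test_months=3, step_months=3):
    """Return (fit_start, fit_end, test_start, test_end) month offsets.

    All ends are exclusive. Each test window starts exactly where its own
    fit window ends, so a model is never scored on data it was fitted on.
    """
    windows = []
    start = 0
    while start + fit_months + test_months <= n_months:
        fit_end = start + fit_months
        windows.append((start, fit_end, fit_end, fit_end + test_months))
        start += step_months
    return windows


windows = walk_forward_windows(24)
for fit_start, fit_end, test_start, test_end in windows:
    print(f"fit [{fit_start:>2}, {fit_end:>2})  test [{test_start:>2}, {test_end:>2})")
```

With 24 months of data this yields four folds whose test slices tile months 12 to 24 without overlap; only those test slices would be concatenated into the reported equity curve.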
Fit on [t0, t0+12m], test on [t0+12m, t0+15m], roll the test window forward 3 months, re-fit, repeat. Report only the concatenated test equity. Never reuse a test slice in fitting.
Only testing systems or assets that still exist, ignoring those that failed or changed drastically.
Examples:
Why it’s dangerous: You’re assuming the conditions that created your edge will always exist.
How to fix it:
If you test 100 strategies at the 5% significance level, ~5 will look "good" by pure chance.
This is the bias that makes most public backtests garbage. Every parameter you sweep, every variation you tweak, every chart you eyeball is another silent trial — and the more trials you run, the higher the probability that something looks like edge purely from noise.
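The "about 5 in 100" claim is easy to check by simulation. The sketch below generates strategies with no edge at all and counts how many clear a one-sided 5% significance test anyway. It assumes i.i.d. Gaussian returns and a normal approximation to the t-statistic; the function name and default sizes are hypothetical.

```python
import math
import random
from statistics import NormalDist


def count_lucky_strategies(n_strategies=100, n_obs=500, alpha=0.05, seed=1):
    """Count zero-edge strategies whose mean return looks significant at level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
    lucky = 0
    for _ in range(n_strategies):
        rets = [rng.gauss(0.0, 0.01) for _ in range(n_obs)]  # true mean is zero
        mu = sum(rets) / n_obs
        var = sum((r - mu) ** 2 for r in rets) / (n_obs - 1)
        t_stat = mu / math.sqrt(var / n_obs)
        if t_stat > z_crit:
            lucky += 1
    return lucky


lucky = count_lucky_strategies()
print(f"{lucky} of 100 no-edge strategies look significant at the 5% level")
```

The exact count varies with the seed, but it clusters around 5. Picking the best of those "winners" and reporting only that one is the silent-trials problem described above.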
How to fix it:
mlfinlab library, or hand-roll CSCV in ~50 lines of Python
| Bias Type | Description | Fix |
|---|---|---|
| Selection bias | Only testing your “favorite” trades | Include every trade in your data sample |
| Cherry picking | Manually excluding ugly outcomes | Log every result, good or bad |
| Optimism bias | Assuming you’ll always get filled at ideal prices | Simulate slippage and order book depth realistically |
| Anchoring bias | Refusing to retest or abandon old systems | Let data guide decisions, not nostalgia |
Cost-floor models by asset class. Apply these before judging edge.
| Asset class | Fee | Slippage | Total round-trip floor |
|---|---|---|---|
| Crypto perp | ~5 bps | 2 to 5 bps | 7 to 10 bps |
| Equities | ~half-spread | 0.1 * sigma * sqrt(size/ADV) | Spread plus impact term |
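As a rough illustration of applying the cost floors in the table, the sketch below computes the equity impact term and subtracts a round-trip cost from a gross per-trade edge. It assumes all quantities are decimal fractions of price, that sigma is daily return volatility (the table does not state its units), and that the round trip costs the full spread plus one impact term, as the table's last column reads. Function names and the example numbers are hypothetical.

```python
import math


def equity_round_trip_cost(spread, sigma, size, adv, impact_coeff=0.1):
    """Spread plus the square-root impact term 0.1 * sigma * sqrt(size / ADV)."""
    impact = impact_coeff * sigma * math.sqrt(size / adv)
    return spread + impact


def net_edge(gross_edge, round_trip_cost):
    """Per-trade edge left after paying the round-trip cost floor."""
    return gross_edge - round_trip_cost


# Hypothetical equity trade: 5 bps spread, 2% daily vol, order is 1% of ADV.
equity_cost = equity_round_trip_cost(spread=0.0005, sigma=0.02, size=10_000, adv=1_000_000)
print(f"equity round-trip floor: {equity_cost * 1e4:.1f} bps")  # about 7 bps

# Hypothetical crypto perp strategy with an 8 bps gross edge per trade,
# checked against both ends of the 7 to 10 bps floor from the table.
for floor_bps in (7, 10):
    left = net_edge(0.0008, floor_bps / 1e4)
    print(f"floor {floor_bps:>2} bps -> net edge {left * 1e4:+.1f} bps")
```

An 8 bps gross edge survives the low end of the crypto floor by a single basis point and is negative at the high end, which is why the floor is applied before judging edge.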
“A system is only as good as its worst assumption.”
Fewer moving parts = less overfitting risk. (See Outliers and Their Impact on Metrics for how a single bar can create the illusion of a fitted edge, and Sharpe Ratio & Sortino Ratio for the metric most degraded by these biases.) The simpler it is, the easier it is to test, improve, and trust.
Lookahead bias is using future information that wasn't actually available at the time of the simulated trade — for example, acting on a candle's close price before that candle has fully closed. It produces unrealistically perfect entries that vanish in live trading.
Overfitting is creating a strategy that performs well only on past data because it has memorized noise rather than discovered structure. Bailey & López de Prado (2014) showed that with as few as 7 trials at the standard 5% level, the probability of a false positive exceeds 30%.
Working heuristic: discount your backtest Sharpe by 30–50% before believing it. Even with rigorous IS/OOS, regime change and execution friction take roughly that bite. If your strategy is unprofitable after the haircut, it has no edge — only fitting.
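The haircut is simple arithmetic, shown here as a small helper. The function name and the example Sharpe of 1.8 are hypothetical; the 30% and 50% bounds come from the heuristic above.

```python
def haircut_sharpe(backtest_sharpe, haircut):
    """Scale a backtest Sharpe down by a fractional haircut in [0, 1)."""
    if not 0.0 <= haircut < 1.0:
        raise ValueError("haircut must be in [0, 1)")
    return backtest_sharpe * (1.0 - haircut)


backtest_sharpe = 1.8  # hypothetical reported figure
pessimistic = haircut_sharpe(backtest_sharpe, 0.50)
optimistic = haircut_sharpe(backtest_sharpe, 0.30)
print(f"believable Sharpe range: {pessimistic:.2f} to {optimistic:.2f}")
```

A reported 1.8 becomes a planning range of roughly 0.9 to 1.26; size positions and set expectations from that range, not from the headline number.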
Most failed traders didn’t skip testing. They trusted flawed testing.
Your system’s performance is only as reliable as the integrity of your backtest. Apply the haircut before you judge edge.
Pre-trust checklist (5 items). Before you bet a dollar on a backtest, run this:
… << degrees of freedom in the data
… < 0.5 from CSCV
Fail any one → assume your edge is artifact.
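The checklist refers to CSCV (combinatorially symmetric cross-validation), which is typically used to estimate the probability of backtest overfitting (PBO) discussed in the Bailey et al. paper under further reading. Below is a simplified, hand-rolled sketch of that idea, not the mlfinlab API and not a reference implementation. It assumes each strategy variant is a list of per-period returns of equal length; `sharpe` and `pbo_cscv` are hypothetical names.

```python
import math
import random
from itertools import combinations


def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    n = len(returns)
    mu = sum(returns) / n
    var = sum((r - mu) ** 2 for r in returns) / (n - 1)
    return mu / math.sqrt(var) if var > 0 else 0.0


def pbo_cscv(returns_by_strategy, n_blocks=8):
    """Share of in-sample/out-of-sample splits where the in-sample winner
    lands at or below the out-of-sample median (logit <= 0)."""
    n_obs = len(returns_by_strategy[0])
    block_len = n_obs // n_blocks  # trailing remainder is dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    n_strats = len(returns_by_strategy)
    logits = []
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [i for b in is_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_blocks) if b not in is_blocks for i in blocks[b]]
        is_sr = [sharpe([r[i] for i in is_idx]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[i] for i in oos_idx]) for r in returns_by_strategy]
        best = max(range(n_strats), key=is_sr.__getitem__)
        # Out-of-sample rank of the in-sample winner: 1 = worst, n_strats = best.
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    return sum(1 for x in logits if x <= 0) / len(logits)


rng = random.Random(42)
# Twenty variants with no edge at all (assumption: i.i.d. Gaussian returns).
noise = [[rng.gauss(0.0, 0.01) for _ in range(400)] for _ in range(20)]
# Same set, but the last variant is replaced by one with a large, real edge.
with_edge = noise[:-1] + [[rng.gauss(0.01, 0.01) for _ in range(400)]]

pbo_noise = pbo_cscv(noise)
pbo_edge = pbo_cscv(with_edge)
print(f"PBO, pure noise:    {pbo_noise:.2f}")
print(f"PBO, one real edge: {pbo_edge:.2f}")
```

When one variant has a genuinely dominant edge it wins both in and out of sample on every split, so the estimate drops to zero; when every variant is noise, the in-sample winner has no reason to stay above the median out of sample.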
Further reading: Bailey, Borwein, López de Prado, Zhu (2014) Pseudo-Mathematics and Financial Charlatanism — the PBO/DSR paper. Bessembinder (2018) Do Stocks Outperform Treasury Bills? — the canonical survivorship-bias study. López de Prado (2018) Advances in Financial Machine Learning, ch. 11–14.