Why Most Backtests Are Lying to You (And How to Fix It)
Survivorship bias, look-ahead bias, overfitting, unrealistic fills. Your backtest probably has at least two of these problems. Here is how I know: mine did too.
Let me tell you something that took me two years to fully understand. Most backtests are completely worthless. And I mean MOST. Not just amateur ones. Even published academic papers have backtesting issues that make their results unreliable.
I know because my early backtests were just as bad.
The Four Horsemen of Bad Backtests
1. Survivorship Bias
This one is sneaky. If you are backtesting on today's S&P 500 constituents going back 10 years, you are only testing on companies that survived. All the companies that went bankrupt or got delisted are not in your dataset. This makes everything look better than it actually was.
I dealt with this by using point-in-time data for my equity cluster. For forex and metals it is less of an issue, since currency pairs don't (usually) get delisted, but for equity indices you absolutely need to account for it.
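If you are curious what that looks like in practice, here is a minimal sketch of a point-in-time universe filter. It assumes you have a membership table with entry and exit dates per ticker; the column names are illustrative, not from any particular data vendor:

```python
import pandas as pd

def point_in_time_universe(membership: pd.DataFrame, as_of: pd.Timestamp) -> list[str]:
    """Return tickers that were actually index members on `as_of`.

    `membership` is assumed to have one row per membership spell with
    columns: ticker, entry_date, exit_date (NaT if still a member).
    """
    active = membership[
        (membership["entry_date"] <= as_of)
        & (membership["exit_date"].isna() | (membership["exit_date"] > as_of))
    ]
    return active["ticker"].tolist()

# At each rebalance date, only trade names that were in the index *then*,
# including the ones that later went bankrupt or were delisted.
```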
2. Look-Ahead Bias
Using information that would not have been available at the time of the trade. Classic example: using adjusted prices that include future corporate actions. Or calculating a 14-day RSI at bar 10 of your dataset (you do not have 14 bars yet).
My feature engineering pipeline has strict checks for this. Every feature is calculated using only data available at the time of signal generation. Sounds obvious but you would be surprised how many people mess this up.
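This is not my actual pipeline, but a rough sketch of the two checks that matter: require a full warm-up window before the feature exists at all, and shift by one bar so the signal never sees its own bar.

```python
import pandas as pd

def lagged_feature(close: pd.Series, window: int = 14) -> pd.Series:
    """Rolling feature that only uses information available at signal time.

    min_periods=window forces NaN until a full window of history exists
    (no 14-bar indicator at bar 10), and shift(1) ensures the bar that
    triggers the signal is never part of its own feature.
    """
    feature = close.pct_change().rolling(window, min_periods=window).mean()
    return feature.shift(1)  # strict lag: only data known before this bar
```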
3. Overfitting
I already talked about this in another post, but it deserves repeating. If your trade count is small relative to the number of free parameters, you are probably overfitting. Rule of thumb: you want at least 100 trades per free parameter.
V7 has about 15 meaningful parameters across the L1/L2/L3 pipeline and 4,505 trades in the backtest. That is roughly 300 trades per parameter. Comfortable.
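The arithmetic is trivial, but it is worth making it an explicit check rather than a mental note. A sketch:

```python
def trades_per_parameter(n_trades: int, n_params: int, minimum: int = 100) -> float:
    """Back-of-envelope overfitting check: trades per free parameter."""
    ratio = n_trades / n_params
    if ratio < minimum:
        print(f"Warning: only {ratio:.0f} trades per parameter (< {minimum})")
    return ratio

trades_per_parameter(4505, 15)  # ~300, comfortably above the 100 rule of thumb
```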
PBO (Probability of Backtest Overfitting) is the gold standard here. My PBO is 0.112, well below the 0.50 threshold that indicates overfitting.
4. Unrealistic Fills
Assuming you get filled at the exact price you see. In reality there is spread, slippage, and sometimes your order just does not get filled at all because there is no liquidity.
I model average spread per instrument plus a slippage estimate based on the bar's ATR. It is conservative but I would rather underestimate my results than overestimate them.
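A rough sketch of what such a fill model can look like. The ATR fraction here is a placeholder assumption; calibrate it per instrument against your own live fills:

```python
def fill_price(mid: float, side: str, avg_spread: float, atr: float,
               slippage_frac: float = 0.1) -> float:
    """Conservative fill model: pay half the average spread plus a slice of ATR.

    slippage_frac is an assumed fraction of the bar's ATR lost to slippage.
    """
    cost = avg_spread / 2 + slippage_frac * atr
    return mid + cost if side == "buy" else mid - cost
```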
What Good Validation Looks Like
After getting burned by bad backtests, I built a three-layer validation framework:
Walk-Forward Analysis (S31): Split data into chunks. Train on chunks 1-3, test on chunk 4. Then train on chunks 2-4, test on chunk 5. And so on. If your system only works on in-sample data, walk-forward will expose it immediately.
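A minimal sketch of the splitting logic; the chunk counts are illustrative, not my actual settings:

```python
def walk_forward_splits(n_bars: int, n_chunks: int = 10, train_chunks: int = 3):
    """Yield (train_indices, test_indices) for a rolling walk-forward scheme.

    Chunk the data, train on `train_chunks` consecutive chunks, test on the
    next one, then slide the window forward by one chunk.
    """
    edges = [i * n_bars // n_chunks for i in range(n_chunks + 1)]
    for start in range(n_chunks - train_chunks):
        train = range(edges[start], edges[start + train_chunks])
        test = range(edges[start + train_chunks], edges[start + train_chunks + 1])
        yield list(train), list(test)

for train_idx, test_idx in walk_forward_splits(5000):
    pass  # fit on train_idx, evaluate on test_idx, record out-of-sample stats
```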
PBO Analysis (S21): Combinatorially symmetric cross-validation (CSCV) to estimate the probability that your backtest performance is due to overfitting rather than genuine alpha. You want PBO < 0.50. I got 0.112.
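For the curious, here is a stripped-down sketch of the CSCV procedure from Bailey et al. It takes a matrix of per-period returns for every parameter variant you tried and estimates how often the in-sample winner lands in the bottom half out-of-sample. The block count and the Sharpe metric are illustrative choices, not necessarily what S21 uses:

```python
import numpy as np
from itertools import combinations

def sharpe(x: np.ndarray) -> np.ndarray:
    """Per-variant Sharpe proxy over the given periods (no annualisation)."""
    return x.mean(axis=0) / (x.std(axis=0) + 1e-12)

def pbo(returns: np.ndarray, n_blocks: int = 8) -> float:
    """Estimate the Probability of Backtest Overfitting with CSCV.

    `returns` is a (n_periods, n_variants) matrix: one column of per-period
    returns for every parameter combination that was tried.
    """
    n_periods, _ = returns.shape
    blocks = np.array_split(np.arange(n_periods), n_blocks)
    combos = list(combinations(range(n_blocks), n_blocks // 2))
    below_median = 0
    for combo in combos:
        is_idx = np.concatenate([blocks[i] for i in combo])
        oos_idx = np.concatenate([blocks[i] for i in range(n_blocks) if i not in combo])
        best = int(np.argmax(sharpe(returns[is_idx])))   # in-sample winner
        oos = sharpe(returns[oos_idx])
        rel_rank = (oos < oos[best]).mean()              # its out-of-sample rank
        if rel_rank < 0.5:                               # bottom half out-of-sample
            below_median += 1
    return below_median / len(combos)
```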
Monte Carlo Simulation: Take your actual trade results and randomize the order. Run 5,000+ simulations. What is the worst-case drawdown? What is the probability of hitting a 10% drawdown? My 95th percentile worst case is 6.79% and breach probability is 0.08%.
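A minimal version of that simulation, assuming per-trade returns expressed as fractions of equity (adapt to R-multiples or currency PnL as needed):

```python
import numpy as np

def monte_carlo_drawdowns(trade_r: np.ndarray, n_sims: int = 5000,
                          breach_level: float = 0.10, seed: int = 0) -> dict:
    """Shuffle trade order, rebuild the equity curve, and collect max drawdowns."""
    rng = np.random.default_rng(seed)
    max_dds = np.empty(n_sims)
    for i in range(n_sims):
        shuffled = rng.permutation(trade_r)
        equity = np.cumprod(1 + shuffled)
        peak = np.maximum.accumulate(equity)
        max_dds[i] = (1 - equity / peak).max()
    return {
        "p95_max_drawdown": float(np.percentile(max_dds, 95)),
        "breach_probability": float((max_dds >= breach_level).mean()),
    }
```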
The Honest Numbers
After implementing all three validation layers, my backtest results dropped. Obviously. When you remove all the biases, real performance is always lower than the inflated version.
But here is the thing. The numbers that survived validation are REAL. 59.2% win rate, +533.9R over 7.5 years, 1.49% max drawdown. These are not hypothetical. They passed walk-forward, PBO, and Monte Carlo.
I would rather have an honest 59% than an inflated 75% any day.
Practical Advice
If you are building a trading system:
- Start with validation, not with strategy design. Build the testing framework first.
- Use point-in-time data for equities. No survivorship bias.
- Strict feature lag enforcement. No data from the future.
- Model ALL costs. Spread, slippage, commission, financing.
- PBO < 0.50 or go home. If your system cannot pass this, it is overfit.
- Monte Carlo everything. If your system cannot handle randomized trade ordering, it is fragile.
- Walk-forward test always. In-sample performance means nothing.
Building a rigorous validation framework takes months. But it saves you from deploying a system that blows up in production. Ask me how I know.
The Uncomfortable Part
The hardest part of validation is not the math. It is the willingness to let your strategy fail. Most people skip rigorous testing because they do not actually want to know if their system is overfit. They have already decided it works and the backtest is just there to confirm their belief. But genuinely wanting to find the flaws is how you avoid finding them in production, where they cost real money. I have killed more strategies than I have deployed. That is not failure. That is the process working.