The promise of AI-generated trading strategies has attracted enormous attention. But what happens when these strategies face rigorous backtesting? The results are more nuanced than either enthusiasts or skeptics suggest.
The Experiment
We generated 500 trading strategies using various LLM providers (Claude, GPT-4, Gemini) and backtested each against 10 years of historical data across multiple asset classes. Strategies ranged from simple moving average crossovers to complex multi-factor models.
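To make the setup concrete, here is a minimal sketch of the kind of backtest harness involved, using the simplest strategy family mentioned (a moving-average crossover). The function name, the long/flat convention, and the 5 bps per-trade cost are illustrative assumptions, not the study's actual code.

```python
import numpy as np

def backtest_ma_crossover(prices, fast=20, slow=50, cost_bps=5):
    """Backtest a long/flat moving-average crossover strategy.

    Goes long when the fast MA is above the slow MA, flat otherwise.
    Deducts a per-trade transaction cost in basis points.
    Returns the strategy's daily return series (net of costs).
    """
    prices = np.asarray(prices, dtype=float)
    # Trailing moving averages; "valid" mode drops the warm-up window
    fast_ma = np.convolve(prices, np.ones(fast) / fast, mode="valid")
    slow_ma = np.convolve(prices, np.ones(slow) / slow, mode="valid")
    # Align both series on the same end dates
    fast_ma = fast_ma[len(fast_ma) - len(slow_ma):]
    position = (fast_ma > slow_ma).astype(float)          # 1 = long, 0 = flat
    daily_ret = np.diff(prices[slow - 1:]) / prices[slow - 1:-1]
    # Trade on the bar after the signal to avoid look-ahead bias
    strat_ret = position[:-1] * daily_ret
    trades = np.abs(np.diff(position, prepend=0.0))[:-1]
    strat_ret -= trades * cost_bps / 1e4                  # cost per position change
    return strat_ret
```

Running 500 such strategies is then just a loop over generated parameterizations, which is what makes the two-hour turnaround plausible.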
Key Findings
About 15% of LLM-generated strategies showed statistically significant alpha after transaction costs — comparable to the hit rate of human-generated strategy ideas. Where LLMs clearly won was iteration speed: generating and testing all 500 strategies took about two hours, versus weeks for manual development.
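A simple version of the significance screen implied here is a one-sample t-test on a strategy's net daily returns. This is a sketch under assumptions the article doesn't spell out (daily returns, a large-sample 5% threshold of |t| > 1.96), not the study's actual methodology.

```python
import numpy as np

def alpha_t_stat(net_returns):
    """One-sample t-statistic for the hypothesis that the mean daily
    net return (after costs) is zero. For large samples, |t| > ~1.96
    indicates significance at the 5% level."""
    r = np.asarray(net_returns, dtype=float)
    return r.mean() / (r.std(ddof=1) / np.sqrt(len(r)))
```

Note that screening 500 strategies this way invites multiple-testing bias: at a 5% threshold, roughly 25 strategies would pass by chance alone, so a 15% hit rate should be read against that baseline.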
Where LLMs Add Value
The real benefit isn't replacing human traders but accelerating the ideation phase. LLMs are particularly good at combining known factors in novel ways, adapting strategies across asset classes, and generating parameter sweep ranges that cover non-obvious configurations.
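The parameter-sweep point can be made concrete: once an LLM proposes ranges, enumerating every configuration is mechanical. A minimal sketch, with hypothetical parameter names:

```python
from itertools import product

def parameter_grid(**param_ranges):
    """Yield every combination of the supplied parameter ranges as a
    dict, suitable for feeding a backtest runner one config at a time."""
    keys = list(param_ranges)
    for values in product(*(param_ranges[k] for k in keys)):
        yield dict(zip(keys, values))

# Example sweep: lookback windows and rebalance frequencies
grid = list(parameter_grid(fast=[10, 20, 40],
                           slow=[50, 100, 200],
                           rebalance_days=[1, 5, 21]))
```

The interesting contribution of the LLM is choosing which ranges to sweep, including non-obvious ones; the enumeration itself is trivial.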
Pitfalls
LLM-generated strategies tend to recycle well-known patterns from the models' training data, and those patterns are already arbitraged away. Strategies based on textbook examples (MACD crossover, RSI divergence) showed the worst out-of-sample performance. The most successful strategies came from prompts that specified unusual constraints or novel market microstructure assumptions.
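A standard way to surface this kind of overfitting is to compare in-sample and out-of-sample Sharpe ratios on a held-out tail of the data. A minimal sketch, assuming daily returns and a 70/30 chronological split (both assumptions, not details from the study):

```python
import numpy as np

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a periodic return series (zero risk-free rate)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def oos_degradation(strategy_returns, split=0.7):
    """Split a return series chronologically and return the in-sample and
    out-of-sample Sharpe ratios. A large drop between the two is a
    classic overfitting signature."""
    r = np.asarray(strategy_returns, dtype=float)
    cut = int(len(r) * split)
    return sharpe(r[:cut]), sharpe(r[cut:])
```

A textbook-pattern strategy in this framing would show a healthy in-sample Sharpe collapsing toward zero out of sample.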

