Your LLM’s Alpha Might Be Mere Memorization
A new benchmark, Look-Ahead-Bench, shows that LLM-based investment strategies can generate spectacular in-sample alpha that collapses once the test window moves past the model's training cutoff. For example, DeepSeek 3.2 went from +20.73% annualized alpha to -1.04%. The cause is look-ahead bias: the models have already "seen" the future in their training corpus. The fix requires point-in-time discipline for the entire AI pipeline — not just the data, but the models themselves. Without a verifiable, independently auditable history, performance claims from strategies built on commercial LLMs aren't credible.
Large language models are turning up everywhere in quantitative finance: generating trading signals, running sentiment analysis, even managing entire portfolios through agentic workflows. The results in backtests often look impressive. But a new benchmark paper raises an uncomfortable question: how much of this performance is real?
Look-Ahead-Bench (Benhenda, 2026) applied commercial LLMs to stock selection across two carefully matched six-month periods. Period 1 (in-sample, Apr–Sep 2021) falls within the models' training data; Period 2 (out-of-sample, Jul–Dec 2024) falls after their training cutoffs. Both periods had similar buy-and-hold returns (~25%), so any performance difference points to bias rather than a change in market regime.
The results were striking.
Alpha Decay: Standard LLMs vs. Point-in-time LLMs
Annualized alpha (in percentage points) across two matched six-month periods. In-sample falls within each commercial LLM’s training window; out-of-sample falls after it.
Standard LLMs
Trained on public text through 2024 · alpha collapses out-of-sample
| Model | In-Sample Alpha (pp) | Out-of-Sample Alpha (pp) | Alpha Decay (pp) |
|---|---|---|---|
| Llama 3.1 8B (Meta, open-source) | +13.81 | −3.42 | −17.23 |
| Llama 3.1 70B (Meta, open-source) | +19.27 | +4.02 | −15.25 |
| DeepSeek 3.2 671B (DeepSeek, open-source) | +20.73 | −1.04 | −21.77 |
Point-in-Time LLMs
No access to future information by design · alpha is stable or improves
| Model | In-Sample Alpha (pp) | Out-of-Sample Alpha (pp) | Alpha Decay (pp) |
|---|---|---|---|
| Pitinf-Small (PiT-Inference) | −0.25 | +0.06 | +0.31 |
| Pitinf-Medium (PiT-Inference) | +2.44 | +3.29 | +0.85 |
| Pitinf-Large (PiT-Inference, frontier) | +6.02 | +7.32 | +1.30 |
Source: Benhenda (2026), Look-Ahead-Bench. Alpha shown in percentage points (pp), annualized. Negative alpha decay indicates performance collapse out-of-sample; positive decay indicates stable or improving alpha.
How Much Alpha Disappears Out-of-Sample?
The strategy using DeepSeek 3.2 showed +20.73% annualized alpha in Period 1 (in-sample), then swung to −1.04% in Period 2 (out-of-sample): a decay of 21.77 percentage points. The strategy using Llama 3.1 8B dropped from +13.81% to −3.42%. The pattern held across all three standard LLMs tested.
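The decay column in the table is simply the out-of-sample alpha minus the in-sample alpha, in percentage points. A minimal sketch reproducing the figures from the standard-LLM table above:

```python
# Alpha decay = out-of-sample alpha minus in-sample alpha, in percentage points (pp).
# (in_sample, out_of_sample) figures are from the Look-Ahead-Bench results above.
results = {
    "Llama 3.1 8B":      (13.81, -3.42),
    "Llama 3.1 70B":     (19.27,  4.02),
    "DeepSeek 3.2 671B": (20.73, -1.04),
}

for model, (in_sample, out_of_sample) in results.items():
    decay = round(out_of_sample - in_sample, 2)
    print(f"{model}: {decay:+.2f} pp")  # e.g. DeepSeek 3.2 671B: -21.77 pp
```

A negative decay means the backtest alpha evaporated once the window moved past the training cutoff.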
More surprising was the "scaling paradox" in the results. The largest model, DeepSeek 3.2 with 671B parameters and the greatest memorization capacity, exhibited the worst alpha decay, larger than that of the smaller Llama models. Bigger models develop stronger priors from their training data, and when those priors meet new market conditions, they appear to become a liability rather than an asset.
By contrast, the purpose-built point-in-time (PiT) models, which cannot access future information by design, delivered stable alpha across both periods. The strategy built with Pitinf-Large went from +6.02% to +7.32%, actually improving out-of-sample. The PiT models also showed a positive scaling law: larger models performed better, because scale added reasoning rather than memorization.
Why Look-Ahead Bias in LLMs Goes Beyond Training Data
The quant industry has long understood that financial data must be point-in-time. Backfilled fundamentals and restated earnings leaking into training sets can ruin a backtest. Survivorship bias and bitemporal data management are well-studied problems with known solutions.
Look-Ahead-Bench shows that the models themselves are a source of look-ahead bias. An LLM trained on text through 2024 has already “seen” every earnings surprise, every Fed decision, every market crash in that window. Prompt it with a date in January 2022, and it already knows what happened next. It recalls the future rather than reasons about it.
This distinction between recall and reasoning is the key insight: point-in-time discipline must extend beyond data to the entire workflow, including the models. A backtest is only as clean as its dirtiest input, and in LLM-based strategies, the model itself may be the dirtiest input of all.
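One concrete way to extend point-in-time discipline to the models is to treat each model's training cutoff like any other point-in-time attribute and refuse to score dates the model could have seen. A minimal sketch, with a hypothetical model registry (the cutoff dates are illustrative, not from the paper):

```python
from datetime import date

# Hypothetical registry: model name -> training-data cutoff date.
# A backtest date on or before the cutoff may be memorized, not predicted.
MODEL_CUTOFFS = {
    "llama-3.1-70b": date(2023, 12, 1),  # illustrative cutoff, not the real one
    "pitinf-large": None,                # point-in-time by design: no cutoff issue
}

def is_out_of_sample(model: str, backtest_date: date) -> bool:
    """True only if the model cannot have seen this date in its training corpus."""
    cutoff = MODEL_CUTOFFS[model]
    return cutoff is None or backtest_date > cutoff

assert not is_out_of_sample("llama-3.1-70b", date(2021, 6, 1))  # inside training window
assert is_out_of_sample("llama-3.1-70b", date(2024, 9, 1))      # past the cutoff
assert is_out_of_sample("pitinf-large", date(2021, 6, 1))       # PiT: clean by design
```

In practice the cutoff would come from verified model metadata rather than a hand-written dictionary, but the gate itself is this simple: no score is accepted for a date the model may have memorized.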
How Verifiable History Solves the Point-in-Time Problem
Claims of point-in-time compliance are easy to make and nearly impossible to verify after the fact. A data producer can assert that its signal was generated before an event, but without independent proof there is no way to distinguish genuine predictive power from memorization, subtle data-hygiene lapses, or causality errors.
This is where auditably point-in-time data and models become essential: proving genuine predictive power in high-stakes domains such as quantitative investing. When the history of data and models is independently verifiable by their consumers, those consumers can trust the simulations and backtests built on them.
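A standard primitive for making such claims auditable is commit-and-reveal: publish a cryptographic hash of a prediction before the event (for example, to a timestamped public log), then reveal the prediction afterward so anyone can check it against the digest. A minimal sketch using Python's standard library (the record fields are illustrative):

```python
import hashlib
import json

def commit(prediction: dict) -> str:
    """Digest to publish before the event; canonical JSON keeps it deterministic."""
    payload = json.dumps(prediction, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify(prediction: dict, published_digest: str) -> bool:
    """Anyone can later check the revealed prediction against the digest."""
    return commit(prediction) == published_digest

# Illustrative signal record.
signal = {"ticker": "XYZ", "as_of": "2024-07-01", "score": 0.42}
digest = commit(signal)

assert verify(signal, digest)                          # honest reveal checks out
assert not verify({**signal, "score": 0.99}, digest)   # tampering is detectable
```

The hash alone does not prove *when* the commitment was made; that is what an independent, timestamped log provides. Together they turn "trust us, it was point-in-time" into a claim anyone can check.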
Any backtest built on a commercial LLM is suspect until proven otherwise: its alpha may look real in-sample simply because the model has memorized the answers. And bigger models can mean more bias, not less, because memorization capacity scales with parameter count.
Conclusion
Look-Ahead-Bench provides the clearest evidence yet that commercial LLMs carry severe look-ahead bias in financial applications. The alpha they generate in backtests may be an artifact of memorization that vanishes on new data. For practitioners, the lesson is that point-in-time integrity must extend across data, models, and workflows for the resulting backtests and simulations to be credible. Likewise, datasets built using commercial LLMs are only credible when they are verifiably point-in-time. This is the problem that validityBase is built to solve: auditable point-in-time data and model infrastructure for quantitative investors.
Greg Kapoustin is the co-founder and CTO of validityBase, where he builds data reliability and evaluation infrastructure for systematic investing.