Your LLM’s alpha might be mere memorization

A new benchmark shows commercial LLMs lose most of their backtest alpha out-of-sample. What looks like edge is mostly memorization.

[Figure: bar chart showing annualized alpha decay for commercial LLMs versus point-in-time models in the Look-Ahead-Bench benchmark]

Large language models are turning up everywhere in quantitative finance: generating trading signals, running sentiment analysis, even managing entire portfolios through agentic workflows. The results in backtests often look impressive. But a new benchmark paper raises an uncomfortable question: how much of this performance is real?

Look-Ahead-Bench (Benhenda, 2026) applied commercial LLMs to stock selection across two carefully matched six-month periods. Period 1, in-sample (Apr–Sep 2021), falls within the models’ training data. Period 2, out-of-sample (Jul–Dec 2024), falls after their training cutoff. Both periods had similar buy-and-hold returns (~25%), so a performance gap between them points to look-ahead bias rather than a change in market regime.

The results were striking.

Alpha decay: standard LLMs vs. point-in-time LLMs

Annualized alpha (in percentage points) across two matched six-month periods. In-sample falls within each commercial LLM’s training window; out-of-sample falls after it.

In-sample: Apr–Sep 2021 · Out-of-sample: Jul–Dec 2024

Standard LLMs

Trained on public text through 2024 · alpha collapses out-of-sample

| Model | Provider | In-sample alpha (pp) | Out-of-sample alpha (pp) | Alpha decay (pp) |
|---|---|---|---|---|
| Llama 3.1 8B | Meta · open-source | +13.81 | −3.42 | −17.23 |
| Llama 3.1 70B | Meta · open-source | +19.27 | +4.02 | −15.25 |
| DeepSeek 3.2 671B | DeepSeek · open-source | +20.73 | −1.04 | −21.77 |

All three standard LLMs show large negative alpha decay: impressive in-sample alpha vanishes (or flips negative) once the test window moves past their training cutoff.

Point-in-time LLMs

No access to future information by design · alpha is stable or improves

| Model | Provider | In-sample alpha (pp) | Out-of-sample alpha (pp) | Alpha decay (pp) |
|---|---|---|---|---|
| Pitinf-Small | PiT-Inference | −0.25 | +0.06 | +0.31 |
| Pitinf-Medium | PiT-Inference | +2.44 | +3.29 | +0.85 |
| Pitinf-Large | PiT-Inference · frontier | +6.02 | +7.32 | +1.30 |

PiT models show positive alpha decay across the board: out-of-sample alpha matches or exceeds in-sample. Scale also helps — the bigger the PiT model, the stronger the out-of-sample alpha.

Source: Benhenda (2026), Look-Ahead-Bench. Alpha shown in percentage points (pp), annualized. Negative alpha decay indicates performance collapse out-of-sample; positive decay indicates stable or improving alpha.

How much alpha disappears out-of-sample?

The strategy using DeepSeek 3.2 showed +20.73% annualized alpha in Period 1 (in-sample), then swung to −1.04% in Period 2 (out-of-sample): a decay of −21.77 percentage points. The strategy using Llama 3.1 8B dropped from +13.81% to −3.42%, a decay of −17.23 percentage points. The pattern held across all three standard LLMs tested.
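The decay figures are simple arithmetic: out-of-sample alpha minus in-sample alpha, in percentage points. As a quick check, the short sketch below recomputes the decay column from the alpha values reported in the tables above. The dictionary and function names are illustrative, not taken from the paper.

```python
# Annualized alpha in percentage points: (in-sample, out-of-sample),
# copied from the Look-Ahead-Bench figures quoted in this article.
ALPHA_PP = {
    "Llama 3.1 8B": (13.81, -3.42),
    "Llama 3.1 70B": (19.27, 4.02),
    "DeepSeek 3.2": (20.73, -1.04),
    "Pitinf-Small": (-0.25, 0.06),
    "Pitinf-Medium": (2.44, 3.29),
    "Pitinf-Large": (6.02, 7.32),
}


def alpha_decay(in_sample: float, out_of_sample: float) -> float:
    """Out-of-sample alpha minus in-sample alpha, in percentage points."""
    return round(out_of_sample - in_sample, 2)


decays = {model: alpha_decay(*alphas) for model, alphas in ALPHA_PP.items()}

for model, decay in decays.items():
    print(f"{model:<14} {decay:+.2f} pp")
```

A negative value means alpha shrank once the test window moved past the training cutoff; a positive value means it held or improved.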

More surprising was the “Scaling Paradox” in the results. DeepSeek 3.2, the largest model at 671B parameters and with the greatest memorization capacity, showed worse alpha decay (−21.77 pp) than either of the smaller Llama models (−17.23 pp and −15.25 pp). Bigger models develop stronger priors from training data. When those priors meet new market conditions, they appear to become a liability rather than an asset.


By contrast, purpose-built Point-in-Time (PiT) models, which cannot access future information by design, delivered stable alpha across both periods. The strategy built with Pitinf-Large went from +6.02% to +7.32%, actually improving out-of-sample. The PiT models also showed a positive scaling law: larger models performed better because they scaled reasoning rather than memorization.

Why look-ahead bias in LLMs goes beyond training data

The quant industry has long understood that financial data must be point-in-time. Backfilled fundamentals and restated earnings leaking into training sets can ruin a backtest. Survivorship bias and bitemporal data management are well-studied problems with known solutions.
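For readers less familiar with bitemporal data, here is a minimal sketch of an as-of lookup. It assumes each figure is stored with two dates: the period it describes and the date that version became known. The `Observation` type, the `as_of` helper, and the EPS values are all hypothetical, chosen only to illustrate how a restatement stays invisible to a backtest until its publication date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    period_end: date  # the fiscal period the figure describes
    known_on: date    # the date this version of the figure became public
    eps: float


# Hypothetical history: one quarter, first reported and later restated.
HISTORY = [
    Observation(date(2021, 3, 31), date(2021, 4, 28), 1.40),
    Observation(date(2021, 3, 31), date(2021, 11, 5), 1.25),  # restatement
]


def as_of(
    history: List[Observation], period_end: date, decision_date: date
) -> Optional[Observation]:
    """Return the latest version of a figure that was known on decision_date."""
    visible = [
        o for o in history
        if o.period_end == period_end and o.known_on <= decision_date
    ]
    return max(visible, key=lambda o: o.known_on, default=None)


# A backtest trading in June 2021 must see the original figure, not the restatement.
june_view = as_of(HISTORY, date(2021, 3, 31), date(2021, 6, 1))
december_view = as_of(HISTORY, date(2021, 3, 31), date(2021, 12, 1))
print(june_view.eps, december_view.eps)
```

Querying by `known_on` rather than by the latest stored value is what keeps restated numbers out of earlier simulated decisions.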

Look-Ahead-Bench shows that the models themselves are a source of look-ahead bias. An LLM trained on text through 2024 has already “seen” every earnings surprise, every Fed decision, every market crash in that window. Prompt it with a date in January 2022, and it already knows what happened next. It recalls the future rather than reasons about it.
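One practical consequence is that a backtest harness can treat the model's training cutoff like any other data timestamp. The sketch below is a minimal guard under stated assumptions: `ASSUMED_CUTOFF` is a hypothetical date used purely for illustration, and the helper name is not from the paper. Published cutoffs can be imprecise, so a real check would likely add a safety margin.

```python
from datetime import date


def is_out_of_sample(window_start: date, training_cutoff: date) -> bool:
    """True only if the backtest window starts after the model's training cutoff."""
    return window_start > training_cutoff


# Hypothetical cutoff, for illustration only; use the cutoff documented for your model.
ASSUMED_CUTOFF = date(2023, 12, 31)

# Start dates of the two benchmark periods described in this article.
periods = {
    "Period 1 (Apr-Sep 2021)": date(2021, 4, 1),
    "Period 2 (Jul-Dec 2024)": date(2024, 7, 1),
}

labels = {
    name: "out-of-sample" if is_out_of_sample(start, ASSUMED_CUTOFF)
    else "exposed to look-ahead"
    for name, start in periods.items()
}

for name, label in labels.items():
    print(f"{name}: {label}")
```

Any window flagged as exposed should be read as a memory test of the model, not as evidence of predictive skill.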

This distinction between recall and reasoning is the key insight: point-in-time discipline must extend beyond data to the entire workflow, including the models. A backtest is only as clean as its dirtiest input, and in LLM-based strategies, the model itself may be the dirtiest input of all.

How verifiable history solves the point-in-time problem

Claims of point-in-time compliance are easy to make and generally impossible to verify after the fact. A data producer can assert that its signal was generated before an event, but without independent proof, there is no way to distinguish genuine predictive power from memorization, subtle data-hygiene problems, or causality errors.

Proving genuine predictive power in high-stakes domains such as quantitative investing requires auditable point-in-time data and models. When consumers can independently verify the history of the data and models they use, they can be confident in the accuracy of the resulting simulations and backtests.
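One generic way to make a history checkable is a hash chain, in which each record's hash depends on every record before it. The sketch below illustrates that general technique only; it is an assumption for illustration and not a description of any particular vendor's implementation. The signal records and function names are hypothetical. Note also that a chain by itself shows only that history was not rewritten: proving when a record existed additionally requires sharing the latest hash with an independent party at the time.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash a record together with the hash of the record before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link records so that each hash commits to the entire prior history."""
    chain, prev = [], GENESIS
    for payload in payloads:
        prev = record_hash(prev, payload)
        chain.append({"payload": payload, "hash": prev})
    return chain


def verify_chain(chain) -> bool:
    """Recompute every hash; any edited record breaks the chain from that point on."""
    prev = GENESIS
    for entry in chain:
        if record_hash(prev, entry["payload"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical signal records, for illustration only.
signals = [
    {"as_of": "2024-07-01", "ticker": "AAA", "signal": 0.8},
    {"as_of": "2024-07-02", "ticker": "BBB", "signal": -0.3},
]
chain = build_chain(signals)

# Copy the chain, then quietly rewrite the first signal after the fact.
tampered = [dict(entry, payload=dict(entry["payload"])) for entry in chain]
tampered[0]["payload"]["signal"] = 0.1

print(verify_chain(chain), verify_chain(tampered))
```

A consumer who recorded the final hash when the signals were delivered can later rerun `verify_chain` and detect any retroactive edit.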

Any backtest built on LLMs is suspect until proven otherwise. The alpha may look real in-sample because the model has memorized the answers. Bigger models can mean more bias, not less, because memorization scales with the number of parameters.


Conclusion

Look-Ahead-Bench provides the clearest evidence yet that commercial LLMs carry severe look-ahead bias in financial applications. The alpha they generate in backtests may be an artifact of memorization that vanishes on new data. For practitioners, the lesson is that point-in-time integrity must extend across data, models, and workflows for the resulting backtests and simulations to be credible. Likewise, datasets built using commercial LLMs are only credible when they are verifiably point-in-time. This is the problem that validityBase is built to solve: auditable point-in-time data and model infrastructure for quantitative investors.

Greg Kapoustin
About the author
CTO and co-founder of validityBase

Greg has spent over 20 years building data and analytics infrastructure for predictive modeling and quantitative investment research. He previously co-founded ABW, whose analytics solutions supported more than $30 billion in AUM. At validityBase, he builds the systems that make validation and historical reproducibility scalable.