Your LLM’s Alpha Might Be Mere Memorization
A new benchmark, Look-Ahead-Bench, shows that LLM-based investment strategies can generate spectacular in-sample alpha that collapses once the test window moves past the model's training cutoff. For example, DeepSeek 3.2 went from +20.73% annualized alpha to -1.04%. The cause is look-ahead bias: the models have already "seen" the future in their training corpus. The fix requires point-in-time discipline for the entire AI pipeline — not just the data, but the models themselves. Without a verifiable, independently auditable history, performance claims from strategies built on commercial LLMs aren't credible.
Large language models are turning up everywhere in quantitative finance: generating trading signals, running sentiment analysis, even managing entire portfolios through agentic workflows. The results in backtests often look impressive. But a new benchmark paper raises an uncomfortable question: how much of this performance is real?
Look-Ahead-Bench (Benhenda, 2026) applied commercial LLMs to stock selection across two carefully matched six-month periods. Period 1 (in-sample, Apr–Sep 2021) falls within the models' training data; Period 2 (out-of-sample, Jul–Dec 2024) falls after their training cutoffs. Both periods had similar buy-and-hold returns (~25%), so any performance difference points to bias rather than a change in market regime.
The results were striking.
Alpha Decay: Standard LLMs vs. Point-in-time LLMs
Annualized alpha (in percentage points) across two matched six-month periods. In-sample falls within each commercial LLM’s training window; out-of-sample falls after it.
Standard LLMs
Trained on public text through 2024 · alpha collapses out-of-sample
| Model | In-Sample Alpha (pp) | Out-of-Sample Alpha (pp) | Alpha Decay (pp) |
|---|---|---|---|
| Llama 3.1 8B (Meta, open-source) | +13.81 | −3.42 | −17.23 |
| Llama 3.1 70B (Meta, open-source) | +19.27 | +4.02 | −15.25 |
| DeepSeek 3.2 671B (DeepSeek, open-source) | +20.73 | −1.04 | −21.77 |
Point-in-Time LLMs
No access to future information by design · alpha is stable or improves
| Model | In-Sample Alpha (pp) | Out-of-Sample Alpha (pp) | Alpha Decay (pp) |
|---|---|---|---|
| Pitinf-Small (PiT-Inference) | −0.25 | +0.06 | +0.31 |
| Pitinf-Medium (PiT-Inference) | +2.44 | +3.29 | +0.85 |
| Pitinf-Large (PiT-Inference, frontier) | +6.02 | +7.32 | +1.30 |
Source: Benhenda (2026), Look-Ahead-Bench. Alpha shown in percentage points (pp), annualized. Negative alpha decay indicates performance collapse out-of-sample; positive decay indicates stable or improving alpha.
How Much Alpha Disappears Out-of-Sample?
The strategy using DeepSeek 3.2 showed +20.73% annualized alpha in Period 1 (in-sample), then swung to −1.04% in Period 2 (out-of-sample): a decay of 21.77 percentage points. The strategy using Llama 3.1 8B dropped from +13.81% to −3.42%. The pattern held across all three standard LLMs tested.
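The decay column in the table is simply the out-of-sample alpha minus the in-sample alpha, in percentage points. A minimal sketch reproducing the figures from the standard-LLM table above:

```python
# Alpha decay = out-of-sample alpha minus in-sample alpha, in percentage points (pp).
# (in_sample, out_of_sample) figures are from the Look-Ahead-Bench results above.
results = {
    "Llama 3.1 8B":      (13.81, -3.42),
    "Llama 3.1 70B":     (19.27,  4.02),
    "DeepSeek 3.2 671B": (20.73, -1.04),
}

for model, (in_sample, out_of_sample) in results.items():
    decay = round(out_of_sample - in_sample, 2)
    print(f"{model}: {decay:+.2f} pp")  # e.g. DeepSeek 3.2 671B: -21.77 pp
```

A negative decay means the backtest alpha evaporated once the window moved past the training cutoff.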
More surprising was the "scaling paradox" in the results. The largest model, DeepSeek 3.2 with 671B parameters and the greatest memorization capacity, exhibited the worst alpha decay, larger than that of the smaller Llama models. Bigger models develop stronger priors from their training data, and when those priors meet new market conditions, they appear to become a liability rather than an asset.
By contrast, the purpose-built point-in-time (PiT) models, which cannot access future information by design, delivered stable alpha across both periods. The strategy built with Pitinf-Large went from +6.02% to +7.32%, actually improving out-of-sample. The PiT models also showed a positive scaling law: larger models performed better, because scale added reasoning rather than memorization.
Why Look-Ahead Bias in LLMs Goes Beyond Training Data
The quant industry has long understood that financial data must be point-in-time. Backfilled fundamentals and restated earnings leaking into training sets can ruin a backtest. Survivorship bias and bitemporal data management are well-studied problems with known solutions.
Look-Ahead-Bench shows that the models themselves are a source of look-ahead bias. An LLM trained on text through 2024 has already “seen” every earnings surprise, every Fed decision, every market crash in that window. Prompt it with a date in January 2022, and it already knows what happened next. It recalls the future rather than reasons about it.
This distinction between recall and reasoning is the key insight: point-in-time discipline must extend beyond data to the entire workflow, including the models. A backtest is only as clean as its dirtiest input, and in LLM-based strategies, the model itself may be the dirtiest input of all.
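One concrete way to extend point-in-time discipline to the models is to treat each model's training cutoff like any other point-in-time attribute and refuse to score dates the model could have seen. A minimal sketch, with a hypothetical model registry (the cutoff dates are illustrative, not from the paper):

```python
from datetime import date

# Hypothetical registry: model name -> training-data cutoff date.
# A backtest date on or before the cutoff may be memorized, not predicted.
MODEL_CUTOFFS = {
    "llama-3.1-70b": date(2023, 12, 1),  # illustrative cutoff, not the real one
    "pitinf-large": None,                # point-in-time by design: no cutoff issue
}

def is_out_of_sample(model: str, backtest_date: date) -> bool:
    """True only if the model cannot have seen this date in its training corpus."""
    cutoff = MODEL_CUTOFFS[model]
    return cutoff is None or backtest_date > cutoff

assert not is_out_of_sample("llama-3.1-70b", date(2021, 6, 1))  # inside training window
assert is_out_of_sample("llama-3.1-70b", date(2024, 9, 1))      # past the cutoff
assert is_out_of_sample("pitinf-large", date(2021, 6, 1))       # PiT: clean by design
```

In practice the cutoff would come from verified model metadata rather than a hand-written dictionary, but the gate itself is this simple: no score is accepted for a date the model may have memorized.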
How Verifiable History Solves the Point-in-Time Problem
Claims of point-in-time compliance are easy to make and nearly impossible to verify after the fact. A data producer can assert that its signal was generated before an event, but without independent proof there is no way to distinguish genuine predictive power from memorization, subtle data-hygiene lapses, or causality errors.
This is where auditably point-in-time data and models become essential: proving genuine predictive power in high-stakes domains such as quantitative investing. When the history of data and models is independently verifiable by their consumers, those consumers can trust the simulations and backtests built on them.
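A standard primitive for making such claims auditable is commit-and-reveal: publish a cryptographic hash of a prediction before the event (for example, to a timestamped public log), then reveal the prediction afterward so anyone can check it against the digest. A minimal sketch using Python's standard library (the record fields are illustrative):

```python
import hashlib
import json

def commit(prediction: dict) -> str:
    """Digest to publish before the event; canonical JSON keeps it deterministic."""
    payload = json.dumps(prediction, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify(prediction: dict, published_digest: str) -> bool:
    """Anyone can later check the revealed prediction against the digest."""
    return commit(prediction) == published_digest

# Illustrative signal record.
signal = {"ticker": "XYZ", "as_of": "2024-07-01", "score": 0.42}
digest = commit(signal)

assert verify(signal, digest)                          # honest reveal checks out
assert not verify({**signal, "score": 0.99}, digest)   # tampering is detectable
```

The hash alone does not prove *when* the commitment was made; that is what an independent, timestamped log provides. Together they turn "trust us, it was point-in-time" into a claim anyone can check.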
Any backtest built on a commercial LLM is suspect until proven otherwise: its alpha may look real in-sample simply because the model has memorized the answers. And bigger models can mean more bias, not less, because memorization capacity scales with parameter count.
Conclusion
Look-Ahead-Bench provides the clearest evidence yet that commercial LLMs carry severe look-ahead bias in financial applications. The alpha they generate in backtests may be an artifact of memorization that vanishes on new data. For practitioners, the lesson is that point-in-time integrity must extend across data, models, and workflows for the resulting backtests and simulations to be credible. Likewise, datasets built using commercial LLMs are only credible when they are verifiably point-in-time. This is the problem that validityBase is built to solve: auditable point-in-time data and model infrastructure for quantitative investors.
Greg Kapoustin is the co-founder and CTO of validityBase, where he builds data reliability and evaluation infrastructure for systematic investing.