2026-05-21 · scaling · data

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Nick Merrill, Jaeho Lee, Ezra Karger

Key claim

More capable models produce worse upper-tail forecasts on tasks with superlinear growth and tail risk.

The paper reports that more capable language models perform worse on forecasting tasks that involve superlinear growth and tail risk, with the degradation concentrated in the upper tail of the forecast distribution. This inverse scaling effect suggests that more capable models may misestimate extreme outcomes while remaining accurate in the lower tail. The authors recommend evaluating LLM forecasts with continuous accuracy measures rather than threshold-based ones.
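One common way to score the two tails separately is the quantile (pinball) loss. The sketch below is illustrative only: the summary does not say which tail metric the paper uses, and the forecasters, quantile levels, and numbers are hypothetical. It shows how two forecasts can be equally accurate in the lower tail while differing sharply in the upper tail.

```python
def pinball_loss(quantile_level, predicted_quantile, outcome):
    """Quantile (pinball) loss for one predicted quantile; lower is better."""
    diff = outcome - predicted_quantile
    if diff >= 0:
        # Outcome at or above the predicted quantile: weight by the level.
        return quantile_level * diff
    # Outcome below the predicted quantile: weight by (1 - level).
    return (quantile_level - 1.0) * diff


# Hypothetical forecasts (not from the paper): both agree on the 5% quantile,
# but only forecaster_b places its 95% quantile near the extreme outcome.
forecaster_a = {0.05: 90.0, 0.95: 130.0}
forecaster_b = {0.05: 90.0, 0.95: 190.0}
outcome = 200.0

for name, forecast in (("a", forecaster_a), ("b", forecaster_b)):
    lower = pinball_loss(0.05, forecast[0.05], outcome)
    upper = pinball_loss(0.95, forecast[0.95], outcome)
    print(f"forecaster_{name}: lower-tail loss={lower:.2f}, upper-tail loss={upper:.2f}")
```

Under these assumed numbers the lower-tail losses match, so any difference in overall quality comes entirely from the upper tail.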

Novelty

8.0/10

The paper identifies a new inverse scaling phenomenon in LLM forecasting.

Reliability

7.0/10

The study uses both simulated and real-world datasets, providing a solid methodological foundation.

Deep reliability assessment

The methodology supports the claim that more capable language models can produce worse forecasts on time series with superlinear growth and regime change, but the generalization to all forecasting tasks may be overclaimed.

Reproducibility

Yes. The paper states that the generation pipeline, evaluation harness, and scoring code are included in the supplementary material.

Discussion questions

How does the assumption of superlinear growth and regime change apply to domains outside finance and epidemiology?
What are the practical implications for deploying LLMs in real-world forecasting tasks where tail risks are significant?
What specific evidence or results would falsify the claim that more capable models perform worse on these types of forecasting tasks?

Key figure

Figure 1 shows that more capable models produce worse forecasts on a continuous proper scoring rule but not on a threshold-based rule.
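To see why a continuous proper scoring rule and a threshold-based rule can disagree, consider the minimal sketch below. It assumes the continuous ranked probability score (CRPS) as the continuous rule and a Brier score on a threshold exceedance as the threshold-based rule. The summary does not name the paper's actual scoring rules, and the sample forecasts, outcome, and threshold are hypothetical.

```python
def crps_from_samples(samples, outcome):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|; lower is better."""
    n = len(samples)
    spread_to_outcome = sum(abs(x - outcome) for x in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_outcome - 0.5 * spread_within


def threshold_brier(samples, outcome, threshold):
    """Brier score for the binary event 'value exceeds threshold'; lower is better."""
    prob_exceed = sum(x > threshold for x in samples) / len(samples)
    event = 1.0 if outcome > threshold else 0.0
    return (prob_exceed - event) ** 2


# Hypothetical forecast samples (not from the paper). Both put the same share
# of mass above the threshold, but only heavy_tail reaches the extreme outcome.
heavy_tail = [90, 100, 110, 120, 200]
thin_tail = [90, 100, 110, 120, 130]
outcome, threshold = 200, 105

for name, samples in (("heavy_tail", heavy_tail), ("thin_tail", thin_tail)):
    crps = crps_from_samples(samples, outcome)
    brier = threshold_brier(samples, outcome, threshold)
    print(f"{name}: CRPS={crps:.2f}, threshold Brier={brier:.2f}")
```

In this toy setup the threshold-based score cannot tell the two forecasts apart, because it only sees the probability of crossing the threshold, while the CRPS penalizes the forecast that misses the extreme outcome.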
