2026-05-25 · agents · reasoning · code

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, Carolina Cuesta-Lázaro

Key claim

Even the best LLMs struggle to discover physical laws when the task requires hypothesis refinement and experimental design.

DiscoverPhysics is a new benchmark that challenges LLMs to discover physics laws in simulated worlds with unique rules. The study reveals that even the best models struggle with complex tasks requiring hypothesis refinement and experimental design. This highlights the gap between predictive accuracy and conceptual understanding in LLMs.

Novelty
8.0/10

The introduction of DiscoverPhysics as an interactive benchmark for LLMs represents a significant advancement in evaluating reasoning capabilities in physics.

Reliability
7.5/10

The evaluation methodology includes multiple axes and comparisons across frontier models, providing solid evidence for the claims made.

Deep reliability assessment

The methodology supports the evaluation of LLMs in discovering physical laws through interactive experimentation, but it may overclaim the generalizability of results to broader scientific discovery tasks due to the specific nature of the simulated worlds used.

Reproducibility

Yes. The simulator, public world definitions, and evaluation framework are released for community use.

Discussion questions

  1. How might the findings change if the worlds were procedurally generated rather than curated?
  2. What are the implications of the performance gap between proprietary and open-source models for future AI development?
  3. What experimental designs would lead to a failure of the LLMs in discovering the underlying laws?

Key figure

Figure 1 illustrates the DiscoverPhysics benchmarking pipeline, in which the LLM agent interacts with an N-body simulator to discover physical laws.
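
To make the agent-simulator interaction concrete, here is a minimal toy sketch in Python. It is not the benchmark's simulator or API: `ToyWorld`, `measure_force`, and `estimate_exponent` are hypothetical names, and the hidden rule (a two-body force law with an unknown exponent) is an assumption chosen only for illustration. The "agent" here is a fixed procedure that picks separations, observes forces, and fits the exponent from a log-log slope.

```python
import math


class ToyWorld:
    """Hypothetical stand-in for a simulated world with a non-standard rule.

    Two bodies attract with force F = G * m1 * m2 / r**p, where the exponent p
    is hidden from the agent.
    """

    def __init__(self, hidden_exponent, G=1.0):
        self._p = hidden_exponent
        self._G = G

    def measure_force(self, m1, m2, r):
        # One "experiment": the agent chooses masses and separation, observes F.
        return self._G * m1 * m2 / r ** self._p


def estimate_exponent(world, separations):
    """Fit log F = a - p * log r by ordinary least squares and return p."""
    xs = [math.log(r) for r in separations]
    ys = [math.log(world.measure_force(1.0, 1.0, r)) for r in separations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # slope is -p, so negate it


world = ToyWorld(hidden_exponent=3.0)          # an inverse-cube world
p_hat = estimate_exponent(world, [0.5, 1.0, 2.0, 4.0])
print(f"estimated exponent: {p_hat:.3f}")
```

In this toy setting the design of the experiment matters: at least two distinct separations are needed for the slope to be defined, which is a small instance of the experimental-design skill the benchmark is described as testing.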

Benchmark results

DiscoverPhysics · pass@5: 50 · vs gpt-5.5: +10.0% · SOTA
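
For readers unfamiliar with the metric, the sketch below shows one common way to compute pass@k: the unbiased estimator widely used in code-generation evaluation, which gives the probability that at least one of k attempts drawn from n recorded attempts succeeds. Whether DiscoverPhysics computes pass@5 this way is an assumption; this page does not say. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts, c of which succeeded.

    Returns 1 - C(n - c, k) / C(n, k): one minus the probability that a random
    size-k subset of the n attempts contains no success.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts on one world, 3 successful, scored at k = 5.
score = pass_at_k(n=10, c=3, k=5)
print(f"pass@5 = {score:.4f}")
```

A benchmark-level score would then typically be the mean of this quantity over all tasks.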
Code link
github.com/DiscoverPhysicsOfficial