DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, Carolina Cuesta-Lázaro
Key claim
LLMs struggle with complex physics reasoning tasks.
In plain English
DiscoverPhysics is a new benchmark that challenges LLMs to discover physics laws in simulated worlds with unique rules. The study reveals that even the best models struggle with complex tasks requiring hypothesis refinement and experimental design. This highlights the gap between predictive accuracy and conceptual understanding in LLMs.
The introduction of DiscoverPhysics as an interactive benchmark for LLMs represents a significant advancement in evaluating reasoning capabilities in physics.
The evaluation methodology includes multiple axes and comparisons across frontier models, providing solid evidence for the claims made.
Deep reliability assessment
The methodology supports the evaluation of LLMs in discovering physical laws through interactive experimentation, but it may overclaim the generalizability of results to broader scientific discovery tasks due to the specific nature of the simulated worlds used.
Reproducibility
Yes. The simulator, public world definitions, and evaluation framework are released for community use.
Discussion questions
- How might the findings change if the worlds were procedurally generated rather than curated?
- What are the implications of the performance gap between proprietary and open-source models for future AI development?
- What experimental designs would lead to a failure of the LLMs in discovering the underlying laws?
Key figure
Figure 1 illustrates the DiscoverPhysics benchmarking pipeline, detailing the interaction between the LLM agent and the N-body simulator for discovering physical laws.
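To make the agent-simulator loop concrete, here is a minimal toy sketch in Python. It is not the released simulator or the paper's evaluation code. The names `make_world`, `measure_acceleration`, and `infer_exponent` are hypothetical, and the "world" is reduced to a single hidden force-law exponent, assuming the acceleration of a test particle scales as mass / r**exponent. The "agent" here is a fixed log-log least-squares fit standing in for an LLM that chooses experiments and refines hypotheses.

```python
import math


def make_world(exponent, coupling=1.0):
    """Build a hypothetical world whose force law falls off as 1 / r**exponent.

    The returned function plays the role of the simulator: the agent can
    query it but cannot see `exponent` directly.
    """
    def measure_acceleration(mass, separation):
        # Acceleration of a test particle at `separation` from a body of `mass`.
        return coupling * mass / separation ** exponent

    return measure_acceleration


def infer_exponent(experiment, separations, mass=1.0):
    """Estimate the hidden exponent from experiments at several separations.

    Fits a straight line to log(acceleration) versus log(separation) by
    ordinary least squares; the slope of that line is -exponent.
    """
    xs = [math.log(r) for r in separations]
    ys = [math.log(experiment(mass, r)) for r in separations]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# The hidden rule of this toy world: gravity that is steeper than inverse-square.
hidden_exponent = 2.7
world = make_world(hidden_exponent)

# "Experimental design": probe the world at a spread of separations.
estimate = infer_exponent(world, [0.5, 1.0, 2.0, 4.0, 8.0])
print(f"estimated exponent: {estimate:.3f}")
```

In this noiseless toy the fit recovers the exponent almost exactly. The tasks described above are harder because the agent must decide which experiments to run and revise its hypothesis when the data disagree, which a fixed fitting routine does not capture.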