LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu
Key claim
LLM agents rely more on intrinsic knowledge than external evidence.
This paper investigates whether LLM-based search agents genuinely search the web or rely on their intrinsic knowledge. The key finding is that agents often depend on pre-existing knowledge and perform poorly when external evidence is removed, which highlights the limitations of static search benchmarks.
In plain English
The introduction of LiveBrowseComp as a new benchmark significantly extends the evaluation of LLM-based search agents.
The study uses multiple diagnostics and a well-defined benchmark, providing solid evidence for its claims.
Deep reliability assessment
The methodology supports the claim that, for BrowseComp-style static benchmarks and the agents evaluated, a substantial fraction of apparent search performance can come from parametric knowledge plus web verification rather than evidence-driven discovery. The claim becomes overstated if read as a universal statement about all search agents, because LiveBrowseComp is small, uses a single search backend, and relies on a 90-day cutoff that is only an approximate proxy for facts lying outside model knowledge.
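To make the distinction between parametric knowledge plus verification and evidence-driven discovery concrete, the following minimal sketch cross-tabulates per-question outcomes from a closed-book run and a search-enabled run. It is an illustration only, not the paper's diagnostic: the `Outcome` record, the `dependence_summary` function, and the metric names are all hypothetical, and the sketch assumes that per-question correctness labels for both settings are already available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-question record; field names are assumptions."""
    question_id: str
    closed_book_correct: bool  # correct with no tools available
    search_correct: bool       # correct with search enabled


def dependence_summary(outcomes):
    """Summarize how much search-enabled success overlaps with closed-book success."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one outcome is required")

    closed_book = sum(o.closed_book_correct for o in outcomes)
    search = sum(o.search_correct for o in outcomes)
    both = sum(o.closed_book_correct and o.search_correct for o in outcomes)

    return {
        "closed_book_accuracy": closed_book / n,
        "search_accuracy": search / n,
        # Share of search-enabled wins the model could already answer unaided.
        "known_share_of_search_wins": both / search if search else 0.0,
        # Questions solved only when search was available.
        "search_only_accuracy": (search - both) / n,
    }
```

A high `known_share_of_search_wins` would be consistent with verification of already-known answers, while a high `search_only_accuracy` alongside low `closed_book_accuracy` would point toward evidence-driven discovery. Neither number alone settles the question, since correctness labels say nothing about how the queries were formed.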
Reproducibility
Dataset: yes. LiveBrowseComp is released at https://huggingface.co/datasets/Forival/LiveBrowseComp, and archived benchmark snapshots are mentioned. A code repository is not mentioned. The prompts, tool setup, closed-book configuration, and retrieval/evidence-blocking setup are described in the paper, but full reproduction also depends on serper.dev and model APIs.
Discussion questions
1. Does low closed-book accuracy on recent long-tail facts really isolate search ability, or does it partly measure search-index coverage, source freshness, and tool-use prompt robustness?
2. For builders of AI search products, should systems be optimized to avoid hypothesis-led verification, or is memory-backed hypothesis generation actually useful if paired with stronger evidence checking?
3. What result would falsify the Intrinsic Knowledge Dependence explanation? For example, would it be enough for an agent to maintain high LiveBrowseComp performance while having near-zero closed-book accuracy and demonstrably evidence-derived queries?
Key figure
Figure 1 illustrates that static benchmark facts are gradually absorbed into model parameters over time, causing apparent benchmark difficulty to collapse, while LiveBrowseComp refreshes questions with recent facts to reduce this erosion.