2026-05-27 · agents, data, code

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu


Key claim

LLM agents rely more on intrinsic knowledge than external evidence.

This paper investigates whether LLM-based search agents genuinely search the web or rely on their intrinsic knowledge. The key finding is that agents often depend on pre-existing knowledge, performing poorly when external evidence is removed, which highlights the limitations of static search benchmarks.

Novelty
8.0/10

The introduction of LiveBrowseComp as a new benchmark significantly extends the evaluation of LLM-based search agents.

Reliability
7.5/10

The study uses multiple diagnostics and a well-defined benchmark, providing solid evidence for its claims.

Deep reliability assessment

The methodology supports the claim that, for BrowseComp-style static benchmarks and the agents evaluated, a substantial fraction of apparent search performance can come from parametric knowledge plus web verification rather than evidence-driven discovery. The claim is overstated if read as a universal statement about all search agents: LiveBrowseComp is small, uses a single search backend, and its 90-day cutoff is only an approximate proxy for facts lying outside model knowledge.
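
One way to make the "parametric knowledge plus web verification" reading concrete is to compare, per question, whether an agent is correct with search enabled and whether the same model is correct closed-book. The sketch below is a hypothetical diagnostic, not the paper's procedure: the record fields (`search_correct`, `closed_book_correct`), the function name, and the sample outcomes are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    """Per-question outcome for one model (hypothetical schema)."""
    question_id: str
    search_correct: bool        # correct with the search tool enabled
    closed_book_correct: bool   # correct with no tool access


def closed_book_overlap(results: list[QuestionResult]) -> float:
    """Fraction of search-enabled successes the model also gets right closed-book.

    A high value suggests many 'search' successes could have been answered
    from memory; it does not prove that they were.
    """
    solved_with_search = [r for r in results if r.search_correct]
    if not solved_with_search:
        return 0.0
    also_closed_book = sum(r.closed_book_correct for r in solved_with_search)
    return also_closed_book / len(solved_with_search)


# Illustrative, made-up outcomes (not data from the paper).
sample = [
    QuestionResult("q1", search_correct=True, closed_book_correct=True),
    QuestionResult("q2", search_correct=True, closed_book_correct=False),
    QuestionResult("q3", search_correct=False, closed_book_correct=False),
    QuestionResult("q4", search_correct=True, closed_book_correct=True),
]
overlap = closed_book_overlap(sample)
print(f"closed-book overlap among search successes: {overlap:.2f}")
```

A per-question overlap of this kind only bounds how much performance could come from memory; separating verification from discovery would additionally require inspecting the agent's queries.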

Reproducibility

Dataset: yes. LiveBrowseComp is released at https://huggingface.co/datasets/Forival/LiveBrowseComp, and archived benchmark snapshots are mentioned. Code repository: not mentioned. The prompts, tool setup, closed-book configuration, and retrieval/evidence-blocking setup are described in the paper, but full reproduction also depends on serper.dev and model APIs.

Discussion questions

  1. Does low closed-book accuracy on recent long-tail facts really isolate search ability, or does it partly measure search-index coverage, source freshness, and tool-use prompt robustness?
  2. For builders of AI search products, should systems be optimized to avoid hypothesis-led verification, or is memory-backed hypothesis generation actually useful if paired with stronger evidence checking?
  3. What result would falsify the Intrinsic Knowledge Dependence explanation: for example, would an agent that maintains high LiveBrowseComp performance while having near-zero closed-book accuracy and demonstrably evidence-derived queries be enough?

Key figure

Figure 1 illustrates that static benchmark facts are gradually absorbed into model parameters over time, causing apparent benchmark difficulty to collapse, while LiveBrowseComp refreshes questions with recent facts to reduce this erosion.
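
The refresh idea can be sketched as a simple date filter. This is a minimal illustration, assuming each question carries the date of the fact it depends on and that "recent" means within a fixed window before a benchmark snapshot date. The 90-day default echoes the cutoff mentioned in the reliability assessment; the function name, record layout, and dates are hypothetical and not taken from the paper.

```python
from datetime import date, timedelta


def is_fresh(fact_date: date, snapshot_date: date, window_days: int = 90) -> bool:
    """True if the fact falls within `window_days` before (or on) the snapshot date."""
    age = snapshot_date - fact_date
    return timedelta(0) <= age <= timedelta(days=window_days)


# Hypothetical question records: (question id, date of the underlying fact).
questions = [
    ("q_old", date(2024, 1, 15)),
    ("q_recent", date(2026, 4, 1)),
    ("q_future", date(2026, 6, 30)),
]
snapshot = date(2026, 5, 27)  # assumed snapshot date, for illustration only

# Keep only questions whose facts are recent relative to the snapshot.
fresh_ids = [qid for qid, fact_date in questions if is_fresh(fact_date, snapshot)]
print(fresh_ids)
```

A filter like this only approximates "outside model knowledge", since a model's effective knowledge cutoff is not a sharp date.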

Benchmark results

~ BrowseComp | pass@4 accuracy: 44.5 | vs N/A | N/A
~ BrowseComp-Plus | accuracy: 8 | vs MiniMax M2.5 closed-book baseline | -36.5 points
~ BrowseComp-Plus | accuracy: 2.3 | vs Kimi-K2.6 closed-book baseline | -23.2 points
~ LiveBrowseComp | accuracy: 2 | vs all evaluated agents | below 2%
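
The deltas in the BrowseComp-Plus rows can be sanity-checked with simple arithmetic. The sketch below assumes each delta is the reported accuracy minus the closed-book baseline; that is an interpretation of the rows, not something the page states, and the helper name is hypothetical.

```python
def implied_baseline(accuracy: float, delta: float) -> float:
    """Recover the baseline under the assumption delta = accuracy - baseline."""
    return round(accuracy - delta, 1)


# (reported accuracy, delta in points) copied from the BrowseComp-Plus rows.
rows = {
    "MiniMax M2.5": (8.0, -36.5),
    "Kimi-K2.6": (2.3, -23.2),
}
baselines = {model: implied_baseline(acc, delta) for model, (acc, delta) in rows.items()}
print(baselines)
```

Under that assumption the implied MiniMax M2.5 baseline is 44.5, which coincides with the BrowseComp pass@4 figure in the first row; whether the two numbers refer to the same measurement is not stated here.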
Code link
huggingface.co/datasets/Forival/LiveBrowseComp (Official)