2026-05-26 | agents, reasoning, scaling

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi

Key claim

BASIS reduces value-estimation MSE by 69% relative to a strong baseline, using one rollout per prompt.

BASIS is a critic-free algorithm that makes value function estimation more efficient in reinforcement learning for LLM reasoning. It reports a 69% reduction in value-estimation MSE compared to a strong baseline while using only one rollout per prompt, which the summary credits with better policy optimization in less training time.

Novelty

8.0/10

BASIS introduces a novel critic-free algorithm that significantly improves value estimation efficiency.

Reliability

8.0/10

The claims are well-supported by experiments demonstrating substantial improvements over established baselines.

Deep reliability assessment

The methodology supports improved sample efficiency and robustness in value estimation for RLVR tasks, but the claims of outperforming multi-rollout baselines with significantly less computation may be overclaimed without broader validation across diverse tasks.

Reproducibility

No open source code or dataset is mentioned in the paper.

Discussion questions

How does BASIS handle tasks with non-binary or noisy reward signals, and what assumptions are made about reward distributions?
What are the practical implications of BASIS for real-time applications where computational resources are limited?
What specific scenarios or datasets would challenge the effectiveness of BASIS and potentially falsify its claimed advantages?

Key figure

Figure 1 illustrates the BASIS method for constructing advantage estimates by sampling one completion per prompt and using batch reward information to estimate value baselines.
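
The idea of a single-rollout, batch-derived baseline can be illustrated with a deliberately simple stand-in. The sketch below is not the BASIS estimator, whose details are not given in this summary. It assumes the crudest possible form of information sharing: each prompt's baseline is the leave-one-out mean of the other prompts' rewards in the same batch. The function name `leave_one_out_advantages` and the example rewards are hypothetical.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage for each prompt given ONE sampled completion per prompt.

    Assumption (not from the paper): the baseline for prompt i is the mean
    reward of all OTHER prompts in the batch, so a completion's own reward
    never enters its own baseline.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two prompts in the batch")
    total = sum(rewards)
    advantages = []
    for r in rewards:
        baseline = (total - r) / (n - 1)  # leave-one-out batch mean
        advantages.append(r - baseline)   # reward minus estimated value
    return advantages


if __name__ == "__main__":
    # One binary (verifiable) reward per prompt, one rollout each.
    batch_rewards = [1.0, 0.0, 1.0, 1.0]
    print(leave_one_out_advantages(batch_rewards))
```

A plain batch mean ignores differences in prompt difficulty, which is presumably where a method like BASIS would share information more selectively across prompts. The sketch only shows where a batch-level baseline plugs into the advantage computation.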

Benchmark results

MATH accuracy: 0.892 (vs. GRPO: +1.2%; SOTA)
