BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi
Key claim
BASIS reduces value-estimation MSE by 69% relative to a strong baseline while using one rollout per prompt.
BASIS is a new algorithm that improves the efficiency of value function estimation in reinforcement learning. It achieves a 69% reduction in value-estimation MSE compared to a strong baseline while using only one rollout per prompt, which leads to better policy optimization with less training time.
BASIS introduces a novel critic-free algorithm that significantly improves value estimation efficiency.
The claims are well-supported by experiments demonstrating substantial improvements over established baselines.
Deep reliability assessment
The methodology supports the claims of improved sample efficiency and robustness in value estimation for RLVR tasks. However, the claim of outperforming multi-rollout baselines with significantly less computation may be overstated until it is validated across a broader range of tasks.
Reproducibility
The paper does not mention open-source code or a released dataset.
Discussion questions
- How does BASIS handle tasks with non-binary or noisy reward signals, and what assumptions are made about reward distributions?
- What are the practical implications of BASIS for real-time applications where computational resources are limited?
- What specific scenarios or datasets would challenge the effectiveness of BASIS and potentially falsify its claimed advantages?
Key figure
Figure 1 illustrates the BASIS method for constructing advantage estimates by sampling one completion per prompt and using batch reward information to estimate value baselines.
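To make the idea in the figure concrete, here is a minimal sketch of a batch-derived baseline with one completion per prompt. This summary does not say how BASIS combines batch reward information, so the sketch substitutes a plain leave-one-out batch mean as a simpler stand-in. It is not the paper's estimator. The function name `leave_one_out_advantages` and the example rewards are hypothetical.

```python
def leave_one_out_advantages(rewards):
    """Return one advantage per prompt: the prompt's reward minus the mean
    reward of all *other* prompts in the batch.

    Assumption: this leave-one-out mean is an illustrative stand-in for a
    batch-derived value baseline, not the BASIS estimator itself.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two prompts in the batch")
    total = sum(rewards)
    # Excluding each sample's own reward keeps its baseline independent of it.
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical batch: one sampled completion per prompt, binary rewards.
batch_rewards = [1.0, 0.0, 1.0, 1.0]
advantages = leave_one_out_advantages(batch_rewards)
print(advantages)
```

A uniform batch mean ignores differences in prompt difficulty. A method that shares information across prompts would presumably weight other prompts' rewards by relevance, but this sketch does not attempt that.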