DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models
Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang
Key claim
Structured supervision improves the dense-scene reasoning of lightweight vision-language models.
This paper presents DRBench, a new benchmark for evaluating dense-scene reasoning in vision-language models, and DRScaffold, a fine-tuning framework that enhances grounded reasoning. The key result shows that a smaller model trained with structured supervision can outperform a larger frozen model, highlighting the effectiveness of the proposed approach.
In plain English
Together, DRBench and DRScaffold advance grounded reasoning in vision-language models.
The experiments show substantial gains against solid baselines, and the release of code and models supports reproducibility.