2026-05-25 | reasoning, vision, multimodal

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

Key claim

Structured supervision improves the dense-scene reasoning performance of lightweight vision-language models.

This paper presents DRBench, a new benchmark for evaluating dense-scene reasoning in vision-language models, and DRScaffold, a fine-tuning framework that enhances grounded reasoning. The key result is that a smaller model trained with structured supervision can outperform a larger frozen model.

Novelty

8.0/10

The introduction of DRBench and DRScaffold represents a significant advance in grounded reasoning for vision-language models.

Reliability

8.0/10

The experiments demonstrate substantial gains against solid baselines, and the release of code and models supports reproducibility.