2026-05-22 | vision, multimodal, data

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Rim Assouel, Amir Bar, Michal Drozdzal, Adriana Romero-Soriano

Key claim

PGT significantly improves fine-grained visual understanding in MLLMs.

This paper introduces Procedurally Generated Tasks (PGT) to enhance fine-grained visual understanding in Multimodal Large Language Models (MLLMs). Its key result is that instruction tuning with PGT data improves performance by up to 20% on the What'sUp benchmark, suggesting that better supervision can address spatial reasoning deficits.

Novelty

8.0/10

The introduction of Procedurally Generated Tasks (PGT) offers a novel approach to enhancing fine-grained visual understanding in MLLMs.

Reliability

7.5/10

The experiments conducted on multiple benchmarks and the clear improvements in performance suggest a solid methodology.

Deep reliability assessment

The methodology supports the claim that procedurally generated tasks (PGT) can improve fine-grained visual grounding in MLLMs by providing additional supervision signals. However, the claim that these improvements generalize across different datasets and real-world scenarios may be overstated without further validation.

Reproducibility

No open source code or dataset is explicitly mentioned in the paper, which may limit reproducibility. The paper describes the procedural generation of tasks, but without access to the code or specific datasets, reproducing the exact results could be challenging.

Discussion questions

How do the authors ensure that the improvements from PGT are not just due to overfitting to the specific tasks introduced?
What are the practical implications of using PGT in real-world applications, especially in terms of computational overhead and integration with existing systems?
What specific experimental results or scenarios would falsify the claim that PGT can effectively improve fine-grained visual grounding in MLLMs?

Key figure

Figure 1 provides an overview of the PGT framework, illustrating how abstract geometric primitives are overlaid onto training data to enhance instruction tuning and improve performance on relational, quantitative, and 3D/depth understanding tasks.
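
To make the overlay idea concrete, here is a minimal sketch of how a procedurally generated relational task could be built. It is not the authors' implementation: the function names (`draw_square`, `make_pgt_example`), the choice of squares as primitives, the colors, and the question wording are all assumptions made for illustration. The image is a plain nested list of RGB tuples so the example needs nothing beyond the standard library.

```python
import random

RED = (255, 0, 0)
BLUE = (0, 0, 255)


def draw_square(image, cx, cy, half, color):
    """Paint a filled square centred at (cx, cy) onto image, in place."""
    height, width = len(image), len(image[0])
    for y in range(max(0, cy - half), min(height, cy + half + 1)):
        for x in range(max(0, cx - half), min(width, cx + half + 1)):
            image[y][x] = color


def make_pgt_example(image, rng, half=4):
    """Overlay one red and one blue square and return (image, question, answer).

    Hypothetical task: the answer is known by construction, so no human
    annotation is needed. The input image is left unmodified.
    """
    height, width = len(image), len(image[0])
    if width < 4 * half + 4 or height < 2 * half + 1:
        raise ValueError("image too small for two non-overlapping squares")

    out = [row[:] for row in image]  # copy so the source image is untouched
    mid = width // 2

    # One centre in each half of the image, so the squares never overlap
    # and the left/right relation is unambiguous.
    left_x = rng.randint(half, mid - half - 1)
    right_x = rng.randint(mid + half + 1, width - half - 1)
    red_is_left = rng.random() < 0.5
    red_x, blue_x = (left_x, right_x) if red_is_left else (right_x, left_x)

    draw_square(out, red_x, rng.randint(half, height - half - 1), half, RED)
    draw_square(out, blue_x, rng.randint(half, height - half - 1), half, BLUE)

    question = "Is the red square to the left or to the right of the blue square?"
    answer = "left" if red_is_left else "right"
    return out, question, answer


# Usage: a flat grey 64x48 image stands in for a real training image.
rng = random.Random(0)
background = [[(128, 128, 128)] * 64 for _ in range(48)]
augmented, question, answer = make_pgt_example(background, rng)
```

Because the ground-truth answer comes from the generator itself, triples like this could in principle be mixed into an instruction-tuning set at arbitrary scale. Whether the paper's relational, quantitative, and 3D/depth tasks are constructed this way is not stated in this summary.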
