The Abstraction Gap in Vision-Language Causal Reasoning
Chinh Hoang, Mohammad Rashedul Hasan
Key claim
One model achieves near-zero Abstraction Gap in causal reasoning.
In plain English
This paper presents a methodology for evaluating vision-language models (VLMs) that separates linguistic plausibility from causal reasoning. The key finding is that many models score well on linguistic quality yet struggle to generate explicit causal chains. One model, however, achieves a near-zero Abstraction Gap, which indicates that improved causal reasoning in VLMs is possible.
The dual-probe methodology and the CAGE benchmark give VLM evaluation a way to test causal reasoning separately from linguistic plausibility.
The study evaluates multiple models on a large dataset, which supports its claims.
Deep reliability assessment
The methodology supports a useful diagnostic: many VLMs can produce plausible causal answers but perform much worse when asked to explicitly generate simple linear causal chains first. It overclaims if interpreted as proving absence of causal reasoning or internal unfaithfulness, because failures may reflect output-format brittleness, instruction following, or LLM-judge bias rather than a pure causal-reasoning deficit.
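To make the diagnostic concrete, here is a minimal sketch of how a gap between the two probes could be computed. The supplied text does not give the paper's exact formula, so the definition below (mean direct-answer score minus mean chain-first score) is an assumption, and the names `ProbeResult` and `abstraction_gap` are hypothetical. The scores are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProbeResult:
    """One benchmark item scored under both probes (hypothetical schema)."""
    direct_score: float  # judge score for the directly given answer, in [0, 1]
    chain_score: float   # judge score when an explicit causal chain must come first, in [0, 1]


def abstraction_gap(results: list[ProbeResult]) -> float:
    """Assumed definition: mean direct score minus mean chain-first score."""
    if not results:
        raise ValueError("need at least one scored item")
    direct = mean(r.direct_score for r in results)
    chain = mean(r.chain_score for r in results)
    return direct - chain


# Invented scores for illustration only; not taken from the paper.
toy = [ProbeResult(0.9, 0.4), ProbeResult(0.8, 0.3), ProbeResult(0.7, 0.5)]
print(f"gap = {abstraction_gap(toy):.2f}")
```

Under this assumed definition, a value near zero means the model scores about the same whether or not it must write out the chain first. It says nothing about why a nonzero gap arises, so format brittleness and judge bias remain possible explanations.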
Reproducibility
Dataset: yes, the paper states that the CAGE benchmark is released, but no dataset URL is provided in the supplied text. Code: no public code repository is mentioned in the supplied abstract, introduction, discussion, conclusion, or footnotes.
Discussion questions
1. Does requiring a model to emit an explicit linear causal chain actually test causal understanding, or does it mainly test compliance with a particular symbolic output format?
2. For builders deploying VLMs in medicine, robotics, or inspections, should models be rejected if they have a high Abstraction Gap even when their final answers are useful and empirically accurate?
3. What experiment would falsify the paper's conclusion: for example, would strong performance across multiple causal representations such as graphs, natural-language rationales, and interventions eliminate the claimed gap?
Key figure
Figure 1 uses a beach-scene example to show Pearl Level 1 association questions answered directly, while Level 2 intervention and Level 3 counterfactual questions require an explicit causal chain before the final textual answer.
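The three Pearl levels named in the figure can be told apart with a toy structural causal model. The sketch below is an illustration only: the variables (`sun`, `umbrella`, `sunburn`) and their equations are hypothetical and are not taken from the paper's figure. It enumerates the exogenous variable exactly instead of sampling, so the three answers are deterministic.

```python
from fractions import Fraction

# Hypothetical toy structural causal model (not taken from the paper's figure):
#   sun      := u_sun                 exogenous, P(u_sun = 1) = 1/2
#   umbrella := sun                   the umbrella goes up whenever it is sunny
#   sunburn  := sun and not umbrella
P_U = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def run(u_sun, do_umbrella=None):
    """Evaluate the model for one exogenous setting, optionally forcing umbrella."""
    sun = u_sun
    umbrella = sun if do_umbrella is None else do_umbrella
    sunburn = int(sun == 1 and umbrella == 0)
    return {"sun": sun, "umbrella": umbrella, "sunburn": sunburn}


def association(umbrella_value):
    """Level 1: P(sunburn = 1 | umbrella = value) in the observational world."""
    worlds = [(p, run(u)) for u, p in P_U.items()]
    matching = [(p, w) for p, w in worlds if w["umbrella"] == umbrella_value]
    total = sum(p for p, _ in matching)
    return sum(p for p, w in matching if w["sunburn"] == 1) / total


def intervention(umbrella_value):
    """Level 2: P(sunburn = 1 | do(umbrella = value))."""
    return sum(p for u, p in P_U.items() if run(u, umbrella_value)["sunburn"] == 1)


def counterfactual(observed, umbrella_value):
    """Level 3: abduction (infer u_sun), action (force umbrella), prediction."""
    consistent = [u for u in P_U if run(u) == observed]
    assert len(consistent) == 1, "this toy model pins down u_sun exactly"
    return run(consistent[0], umbrella_value)["sunburn"]


seen = {"sun": 1, "umbrella": 1, "sunburn": 0}
print(association(0))           # 0: with no umbrella observed, it was never sunny
print(intervention(0))          # 1/2: forcing the umbrella down exposes the sunny half
print(counterfactual(seen, 0))  # 1: on this sunny day, no umbrella would have meant sunburn
```

The three queries give different answers on the same model, which is the reason a Level 2 or Level 3 question cannot be settled by association alone and plausibly why the benchmark asks for an explicit chain at those levels.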
