2026-05-25 · reasoning, vision, multimodal, code

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao

Key claim

STORM improves video reasoning accuracy and reduces latency.

In plain English

The STORM framework internalizes video reasoning as latent trajectories rather than relying on external tools or textual reasoning chains. According to the reported benchmarks, this improves accuracy over the Qwen2.5-VL-7B-SFT baseline on VideoMME, MVBench, and TempCompass while reducing inference time.
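
As a rough illustration of what reasoning through latent trajectories can mean, the toy sketch below feeds each hidden state back in as the next input instead of decoding intermediate text. Everything in it (the `toy_step` update, the weights, the number of steps) is a hypothetical stand-in chosen for illustration; it is not STORM's architecture, which this summary does not specify.

```python
import math


def toy_step(state, weights):
    """One hypothetical recurrent update: tanh of a weighted sum of the state."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]


def latent_rollout(initial_state, weights, num_latent_steps):
    """Collect the hidden states visited when each state is fed back as input.

    No text token is decoded between steps; the list of states is the
    "latent trajectory" in this toy setting.
    """
    trajectory = []
    state = initial_state
    for _ in range(num_latent_steps):
        state = toy_step(state, weights)
        trajectory.append(state)
    return trajectory


# Assumed toy values: a 2-dimensional state and a fixed 2x2 weight matrix.
weights = [[0.5, -0.2], [0.1, 0.3]]
trajectory = latent_rollout([1.0, 0.0], weights, num_latent_steps=4)
```

In a text-chain approach every intermediate step would be decoded into tokens and re-encoded; here the intermediate steps stay in the state vector, which is the intuition behind the reduced inference overhead the summary describes.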

Novelty
8.0/10

The proposed STORM framework introduces a new approach to visual reasoning that reduces reliance on explicit textual representations.

Reliability
7.5/10

The experiments demonstrate improved accuracy and reduced inference overhead across multiple datasets, supporting the claims made.

Deep reliability assessment

The methodology supports the claim that STORM can improve video reasoning accuracy while reducing inference overhead by internalizing reasoning into latent states. However, because training relies on generated thought-video supervision, results may vary with the quality of the generated content.

Reproducibility

Yes, the paper mentions that the code is available at https://github.com/aiming-lab/storm.

Discussion questions

  1. How does the quality of the generated thought videos impact the effectiveness of the latent reasoning process?
  2. What are the practical implications of reducing inference-time complexity for real-time video processing applications?
  3. What specific scenarios or datasets could demonstrate the limitations of STORM's internalized reasoning approach?

Key figure

Figure 1 illustrates the STORM training sequence, showing how generated thought videos provide dynamic supervision for latent tokens, which encode temporal evidence before the model generates the final answer.
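
One way to picture the training sequence described for Figure 1 is as a run of segments with per-position loss masks. The sketch below is only an assumed layout: the segment names, their lengths, and the video/question ordering are hypothetical. The only detail taken from the summary is that latent tokens receive supervision and come before the answer.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str    # hypothetical segment label
    length: int  # number of token positions in the segment
    loss: str    # "none", "latent", or "answer"


def build_loss_masks(segments):
    """Return two boolean masks, one entry per sequence position.

    latent_mask marks positions that would be supervised by thought-video
    features; answer_mask marks positions that would be supervised by the
    answer text.
    """
    latent_mask, answer_mask = [], []
    for seg in segments:
        latent_mask += [seg.loss == "latent"] * seg.length
        answer_mask += [seg.loss == "answer"] * seg.length
    return latent_mask, answer_mask


# Assumed toy layout: input video and question carry no loss.
segments = [
    Segment("video", 8, "none"),
    Segment("question", 4, "none"),
    Segment("latent", 3, "latent"),
    Segment("answer", 2, "answer"),
]
latent_mask, answer_mask = build_loss_masks(segments)
```

Keeping the two masks separate makes it possible to weight the latent supervision and the answer loss independently, although how STORM actually combines them is not stated in this summary.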

Benchmark results

VideoMME: accuracy 61 (+5.6% vs Qwen2.5-VL-7B-SFT), SOTA
MVBench: accuracy 61.1 (+0.6% vs Qwen2.5-VL-7B-SFT), SOTA
TempCompass: accuracy 74.3 (+4.4% vs Qwen2.5-VL-7B-SFT), SOTA
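
The summary does not say whether the gains are absolute accuracy points or relative improvements. The sketch below works out the baseline score implied by each reading, so the difference is visible. The derived baselines are arithmetic consequences of the listed numbers under each assumption, not figures reported by the paper, and `implied_baseline` is a hypothetical helper.

```python
# (score, gain) pairs copied from the benchmark lines above.
results = {
    "VideoMME": (61.0, 5.6),
    "MVBench": (61.1, 0.6),
    "TempCompass": (74.3, 4.4),
}


def implied_baseline(score, gain, relative=False):
    """Baseline implied by a reported score and gain.

    relative=False treats the gain as absolute accuracy points.
    relative=True treats the gain as a percentage of the baseline.
    """
    if relative:
        return round(score / (1 + gain / 100), 1)
    return round(score - gain, 1)


for name, (score, gain) in results.items():
    absolute = implied_baseline(score, gain)
    relative = implied_baseline(score, gain, relative=True)
    print(f"{name}: baseline {absolute} if absolute, {relative} if relative")
```

The two readings diverge most where the gain is largest, so the paper's own tables are the place to confirm which one applies.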
GitHub: 1 repo
aiming-lab/storm (Official)