Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem
Key claim
ToolMerge outperforms existing methods in caption retrieval.
ToolMerge is a keyframe retrieval method that uses LLMs to improve frame selection for long-video question answering. It decomposes a query into tool calls and merges their results, and the authors report a 5% improvement in caption retrieval over existing methods. The retrieved keyframes serve as verifiable visual evidence for a range of query types.
The novelty lies in using LLMs both to decompose queries and to merge the resulting tool rankings.
The methodology is solid, and the authors provide a new benchmark for evaluation.
Deep reliability assessment
The methodology supports the claim that ToolMerge can effectively decompose queries into tool calls for keyframe retrieval. A possible overclaim is its general applicability across all types of video content, which does not account for domain-specific challenges.
Reproducibility
Yes. The paper provides open-source code and data at https://github.com/michalsr/ToolMerge
Discussion questions
- How does the decomposition approach handle queries that require understanding of complex temporal dynamics?
- What are the practical implications of using ToolMerge for real-time video analysis systems?
- What specific scenarios or datasets would demonstrate the limitations of ToolMerge's retrieval accuracy?
Key figure
Figure 1 illustrates the ToolMerge method, showing how a planner decomposes a query into tool calls and merges their rankings using boolean operators.
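The merge step described for Figure 1 can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes each tool returns a per-frame relevance score in [0, 1], that the planner's output can be represented as a small tree of AND/OR nodes, and that AND and OR are approximated by min and max over scores (a common fuzzy-logic stand-in, not something this summary confirms). The names ToolCall, And, Or, evaluate, top_k, and the tool names "object_detector" and "ocr" are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Frame index -> relevance score in [0, 1] (assumed tool output format).
Scores = Dict[int, float]


@dataclass
class ToolCall:
    """Leaf of the plan: one tool invoked with one argument (hypothetical)."""
    tool: str
    argument: str


@dataclass
class And:
    left: "Plan"
    right: "Plan"


@dataclass
class Or:
    left: "Plan"
    right: "Plan"


Plan = Union[ToolCall, And, Or]


def evaluate(plan: Plan, tool_outputs: Dict[Tuple[str, str], Scores]) -> Scores:
    """Recursively merge per-frame scores according to the boolean plan."""
    if isinstance(plan, ToolCall):
        return tool_outputs[(plan.tool, plan.argument)]
    left = evaluate(plan.left, tool_outputs)
    right = evaluate(plan.right, tool_outputs)
    # Assumption: AND -> min (both conditions must hold), OR -> max (either suffices).
    combine = min if isinstance(plan, And) else max
    frames = left.keys() | right.keys()
    # Frames a tool did not score are treated as score 0.0.
    return {f: combine(left.get(f, 0.0), right.get(f, 0.0)) for f in frames}


def top_k(scores: Scores, k: int) -> List[int]:
    """Return the k highest-scoring frame indices; ties go to the earlier frame."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Made-up scores for three frames from two hypothetical tools.
tool_outputs = {
    ("object_detector", "dog"): {0: 0.9, 1: 0.2, 2: 0.8},
    ("ocr", "EXIT"): {0: 0.1, 1: 0.7, 2: 0.6},
}

# A query such as "the dog next to the EXIT sign" might decompose into an AND.
plan = And(ToolCall("object_detector", "dog"), ToolCall("ocr", "EXIT"))
keyframes = top_k(evaluate(plan, tool_outputs), k=2)
```

Under these assumptions, the AND plan favors frame 2, the only frame where both tools score highly, while swapping And for Or would favor frame 0, where a single tool is very confident. The choice of operator therefore changes which frames surface as evidence.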