Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
CVPR 2025
Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/
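The grounding-then-reasoning loop described in the abstract — predict a query-relevant region, re-engage that region of interest, then answer — can be sketched roughly as follows. This is an illustrative Python sketch only, not the authors' implementation; the `model` interface (`predict_box`, `answer`) and the `Box` helper are hypothetical placeholders assumed for clarity.

```python
# Illustrative sketch of a grounded visual chain-of-thought loop:
# (1) the MLLM predicts a bounding box for the query-relevant region,
# (2) the region of interest (RoI) is re-sampled from the image,
# (3) the answer is conditioned on both the global view and the RoI crop.
# All names here are hypothetical placeholders, not the Argus API.

from dataclasses import dataclass
from typing import Tuple

from PIL import Image


@dataclass
class Box:
    """Normalized bounding box (x0, y0, x1, y1) with coordinates in [0, 1]."""
    x0: float
    y0: float
    x1: float
    y1: float

    def to_pixels(self, width: int, height: int) -> Tuple[int, int, int, int]:
        # Convert normalized coordinates to integer pixel coordinates.
        return (int(self.x0 * width), int(self.y0 * height),
                int(self.x1 * width), int(self.y1 * height))


def grounded_cot_answer(model, image: Image.Image, question: str) -> str:
    """Answer a question with an explicit region-of-interest step.

    `model` is assumed to expose two calls:
      - predict_box(image, question) -> Box        (grounding step)
      - answer(image, roi_crop, question) -> str   (reasoning step)
    """
    # Step 1: grounding serves as the visual chain-of-thought signal.
    box = model.predict_box(image, question)

    # Step 2: re-sample the RoI so subsequent attention is goal-conditioned
    # on the query-relevant region at full resolution.
    roi_crop = image.crop(box.to_pixels(*image.size))

    # Step 3: answer conditioned on both the global image and the RoI crop.
    return model.answer(image, roi_crop, question)
```

The key design choice this sketch highlights is that the bounding box is an explicit, language-guided intermediate output rather than an implicit attention map, which is what lets the model re-encode the selected region before reasoning.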
Keywords
Benchmark, Visual Attention, Language Model, Reasoning Tasks, Multimodal Model, Multimodal Tasks, Contextual Information, Visual Perception, Visual Task, Bounding Box, Architectural Design, Perceptual Task, Visual Search, Text Format, Localization Task, Understanding Tasks, Foundation Model, Visual Context, Predicted Bounding Box, Visual Encoding, Visual Reasoning, Resampling Strategy, Visual Understanding, Visual Attention Mechanism, Involuntary Attention, Drawing Insights, Attention Mechanism