The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards.
This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: massive linguistic cold-start fine-tuning,
followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, a scale that surpasses all previous open-source efforts.
This pioneering work reveals three fundamental insights:
1) Behavior transfer emerges surprisingly early during the cold start, driven by linguistic mental imagery.
2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns.
3) Transfer strategically favors high-utility behaviors such as visual reflection.
Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.
Linguistic cognitive behaviors are widely assumed to be essential for reasoning in large language models (LLMs). We define their visual extensions: visual reflection, divide-and-conquer, visual verification, and goal-driven visual tracing. Their formal definitions, examples, and corresponding linguistic counterparts are provided in the table below, while the figure presents a multimodal example that illustrates both linguistic and visual cognitive behaviors.
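As a rough illustration of how the prevalence of these behaviors can be tracked during training, the sketch below tallies cue phrases in model responses. The cue lists and the `count_behaviors` helper are simplified placeholders for illustration, not the detection protocol used in this work.

```python
# Hypothetical sketch: tally cognitive-behavior cue phrases in model responses.
# The cue lists below are illustrative placeholders, not this work's protocol.
from collections import Counter

BEHAVIOR_CUES = {
    "visual_reflection":   ["looking at the image again", "re-examining the figure"],
    "divide_and_conquer":  ["break the problem into", "first consider the left part"],
    "visual_verification": ["check against the image", "the diagram confirms"],
    "goal_driven_tracing": ["trace the path", "follow the arrow toward"],
}

def count_behaviors(responses: list[str]) -> Counter:
    """Count how many responses exhibit each behavior at least once."""
    counts = Counter()
    for text in responses:
        lowered = text.lower()
        for behavior, cues in BEHAVIOR_CUES.items():
            if any(cue in lowered for cue in cues):
                counts[behavior] += 1
    return counts

if __name__ == "__main__":
    demo = ["Looking at the image again, the angle is 40 degrees, so x = 140."]
    print(count_behaviors(demo))  # Counter({'visual_reflection': 1})
```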
In the following sections, we present a simple yet effective MLLM training pipeline, comprising a linguistic cold start followed by multimodal reinforcement learning, and systematically analyze the transfer and scaling of these visual cognitive behaviors.
To facilitate efficient cognitive development and cross-modal generalization, we employ the popular "RL with a cold start" paradigm with two sequential training stages:
1) Linguistic cold start: supervised fine-tuning on curated, high-quality text reasoning data to instill the target cognitive behaviors.
2) Multimodal RL: reinforcement learning with verifiable rewards over nearly 1,000 steps, which discerns and scales up the behaviors that transfer to visual reasoning.
A high-level sketch of this recipe is shown below.
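The sketch uses stub helpers standing in for the actual SFT and RL training stack; the function names, data file names, and the sequence-length schedule are illustrative assumptions rather than our released configuration.

```python
# Hypothetical outline of the two-stage recipe. The helper functions below are
# stubs standing in for a real SFT / RL training stack; names, data paths, and
# the length schedule are assumptions, not the released configuration.

def supervised_finetune(model: str, dataset: str) -> str:
    """Stage-1 stand-in: cold-start SFT on curated linguistic reasoning traces."""
    print(f"[cold start] fine-tuning {model} on {dataset}")
    return model + "+cold_start"

def rl_finetune(model: str, prompts: str, max_response_length: int) -> str:
    """Stage-2 stand-in: multimodal RL with a verifiable answer reward."""
    print(f"[RL] training {model} on {prompts} (max length {max_response_length})")
    return model + f"+rl@{max_response_length}"

def train_open_vision_reasoner(base_model: str = "Qwen2.5-VL-7B") -> str:
    # Stage 1: linguistic cold start on text-only reasoning data.
    model = supervised_finetune(base_model, "curated_linguistic_cot.jsonl")
    # Stage 2: ~1,000 RL steps on multimodal prompts; the maximum response
    # length is expanded in stages as reasoning traces grow longer.
    for max_len in (8_192, 16_384, 32_768):  # illustrative schedule
        model = rl_finetune(model, "multimodal_reasoning_prompts.jsonl", max_len)
    return model

if __name__ == "__main__":
    print(train_open_vision_reasoner())
```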
Our model demonstrates exceptional language reasoning capabilities. On the challenging AIME 2024 and 2025 benchmarks, it dramatically surpasses other 7B open-source models by an average of over 10%, achieving performance comparable to leading 32B models. This superiority extends to general reasoning tasks, with significant gains of +4.6% on MMLU and +10.4% on MMLU-Pro over parameter-matched competitors. These results highlight the effectiveness of our curated, high-quality cold-start training data.
OVR represents a significant breakthrough for 7B-scale models in visual reasoning. It is the first post-trained Qwen2.5-VL-7B model to surpass the 50% threshold on MathVision, while also achieving state-of-the-art performance among 7B models on DynaMath and MathVerse. This strong overall performance is further underscored by a substantial gain on MMMU-Pro (+7.2%) over previous methods. These results demonstrate that reasoning capabilities acquired through language training can effectively transfer to multimodal tasks, resulting in notable improvements in visual reasoning performance.
Training dynamics: (1) the cold-start stage shows a step-wise loss decrease; (2) in the RL stage, reward (purple, left axis) and average response length (orange, right axis) grow steadily, with sharp surges after each sequence-length expansion.
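The reward driving the RL stage is verifiable: a response is scored by checking its final answer against the reference. Below is a minimal sketch of such an answer-matching reward; the boxed-answer convention, the `extract_boxed` and `answers_match` helpers, and their normalization rules are simplified assumptions, not the exact reward implementation used in training.

```python
# Minimal sketch of a verifiable reward for math answers. The boxed-answer
# convention and the normalization rules are simplifying assumptions.
import re

def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def answers_match(pred: str, ref: str) -> bool:
    """Exact match after light cleanup; numeric comparison when both parse."""
    pred, ref = (s.strip().replace(" ", "").replace("$", "") for s in (pred, ref))
    try:
        return abs(float(pred) - float(ref)) < 1e-6
    except ValueError:
        return pred == ref

def verifiable_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    pred = extract_boxed(response)
    return 1.0 if pred is not None and answers_match(pred, reference) else 0.0

if __name__ == "__main__":
    print(verifiable_reward(r"The angle sums to 180, so x = \boxed{140}.", "140"))  # 1.0
```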
@misc{wei2025openvisionreasonertransferring,
title={Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning},
author={Yana Wei and Liang Zhao and Jianjian Sun and Kangheng Lin and Jisheng Yin and
Jingcheng Hu and Yinmin Zhang and En Yu and Haoran Lv and Zejia Weng and Jia Wang and
Chunrui Han and Yuang Peng and Qi Han and Zheng Ge and Xiangyu Zhang and Daxin Jiang and
Vishal M. Patel},
year={2025},
eprint={2507.05255},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.05255},
}
Thank you for reading and for your support! If you have any questions or suggestions, please feel free to contact us via email:
ywei66@jh.edu, zhaoliang02@stepfun.com, vpatel36@jhu.edu