Open Vision Reasoner
Transferring Linguistic Cognitive Behavior for Visual Reasoning

Johns Hopkins University · StepAI · BUPT · UCAS · THU · HUST

Overview

The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement learning with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a large-scale linguistic cold-start fine-tuning stage, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, a scale surpassing all previous open-source efforts. This pioneering work reveals three fundamental insights:

1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery.
2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns.
3) Transfer strategically favors high-utility behaviors such as visual reflection.

Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

Fig 1: Performance Evolution on Reasoning Benchmarks. OVR demonstrates sustained and convergent growth across both linguistic and multi-modal benchmarks throughout the cold start and RL training.
Fig 2: Performance comparison with state-of-the-art models on both textual (AIME 2024, AIME 2025, MATH500) and multimodal (MathVista, MathVision, MathVerse) math reasoning benchmarks.

Cognitive Behavior Preliminaries

Linguistic cognitive behaviors such as backtracking, subgoal decomposition, and verification are hypothesized to be essential for reasoning in large language models (LLMs). We define four visual extensions of these behaviors: visual reflection, divide-and-conquer, visual verification, and goal-driven visual tracing. Their formal definitions, examples, and corresponding linguistic counterparts are provided in the table below, while the figure presents a multimodal example that illustrates both linguistic and visual cognitive behaviors.

In the following sections, we present a simple yet effective MLLM training pipeline, comprising a linguistic cold start followed by multimodal reinforcement learning, and systematically analyze the transfer and scaling of these visual cognitive behaviors.

Figure: A multimodal example illustrating both linguistic and visual cognitive behaviors.

Training Paradigm

To facilitate efficient cognitive development and cross-modal generalization, we employ the popular "RL with a cold start" paradigm with two sequential training stages:

  • Stage 1: Linguistic Cold Start. The LLM module undergoes supervised fine-tuning on language-only reasoning data distilled from DeepSeek-R1, establishing core cognitive behaviors such as backtracking and subgoal decomposition in a purely linguistic setting.
  • Stage 2: Multimodal RL. We apply reinforcement learning in the Open-Reasoner-Zero setting on both text and multimodal tasks using verifiable match rewards (a minimal sketch of such a reward follows this list). This promotes reasoning generalization and aligns previously learned cognitive patterns with visual contexts, enabling effective cross-modal transfer.
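
To make the reward design concrete, here is a minimal sketch of a rule-based verifiable match reward of the kind used in such RL setups. It assumes final answers are wrapped in \boxed{...} and applies only light normalization; the helper names and parsing rules are illustrative assumptions, not the released OVR implementation.

import re

def extract_final_answer(response: str) -> str | None:
    """Pull the final answer out of a \\boxed{...} span, if present.

    The extraction rule is an assumption for illustration; OVR's actual answer
    parser is not reproduced in this write-up.
    """
    match = re.search(r"\\boxed\{([^{}]+)\}", response)
    return match.group(1).strip() if match else None

def normalize(answer: str) -> str:
    """Light normalization so trivially different surface forms still match."""
    return answer.replace(" ", "").replace("$", "").lower()

def verifiable_match_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 on an exact normalized answer match, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0

# Example: only the response whose boxed answer matches earns the reward.
assert verifiable_match_reward("... so the area is \\boxed{12}.", "12") == 1.0
assert verifiable_match_reward("I think the answer is 13.", "12") == 0.0

In an Open-Reasoner-Zero-style setup, this scalar is the only learning signal per rollout and is applied identically to text-only and multimodal prompts.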

Model Performance

Language Reasoning

Our model demonstrates exceptional language reasoning capabilities. On the challenging AIME 2024 and 2025 benchmarks, it dramatically surpasses other 7B open-source models by an average of over 10%, achieving performance comparable to leading 32B models. This superiority extends to general reasoning tasks, with significant gains of +4.6% on MMLU and +10.4% on MMLU-Pro over parameter-matched competitors. These results highlight the effectiveness of our curated, high-quality cold-start training data.

Figure: Language reasoning benchmark results.

Visual Reasoning

OVR represents a significant breakthrough for 7B-scale models in visual reasoning. It is the first post-trained Qwen2.5-VL-7B model to surpass the 50% threshold on MathVision, while also achieving state-of-the-art performance among 7B models on DynaMath and MathVerse. This strong overall performance is further underscored by a substantial gain on MMMU-Pro (+7.2%) over previous methods. These results demonstrate that reasoning capabilities acquired through language training can effectively transfer to multimodal tasks, resulting in notable improvements in visual reasoning performance.

Figure: Visual reasoning benchmark results.

Training Dynamics

(1) The cold-start stage shows a step-wise loss decrease.

(2) In the RL stage, reward (purple, left axis) and average response length (orange, right axis) grow steadily, with sharp surges after each sequence length expansion.

Figure: (left) Cold-start training loss; (right) RL reward and average response length across training steps.

In-depth Behavior Analysis

Visual behaviors emerge remarkably early from cold start.
As depicted in the left figure, these vision-specific behaviors emerge in significant quantities from the very beginning of the cold-start phase and fluctuate throughout subsequent training steps. Strikingly, we observed that even on linguistic problems, DeepSeek-R1’s responses frequently exhibited signs of mental imagery. The model appeared to construct internal visualizations to support mathematical reasoning, often articulated through phrases such as “let me visualize...” or “let me see the image.”
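
As a rough illustration of how such cues can be surfaced at scale, the sketch below counts simple mental-imagery phrases in a batch of responses. The cue list and the substring heuristic are our own assumptions; the paper's behavior analysis may rely on a stronger classifier.

# Illustrative heuristic for spotting mental-imagery cues in model responses.
# The cue list is an assumption; it is not the detector used in the paper.
MENTAL_IMAGERY_CUES = (
    "let me visualize",
    "let me see the image",
    "picture the figure",
    "imagine the diagram",
)

def imagery_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one mental-imagery cue phrase."""
    if not responses:
        return 0.0
    hits = sum(
        any(cue in response.lower() for cue in MENTAL_IMAGERY_CUES)
        for response in responses
    )
    return hits / len(responses)

def cue_counts(responses: list[str]) -> dict[str, int]:
    """Per-cue occurrence counts, useful for inspecting which phrasings dominate."""
    lowered = [response.lower() for response in responses]
    return {cue: sum(cue in text for text in lowered) for cue in MENTAL_IMAGERY_CUES}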

Cold-start learns broadly, large-scale RL discerns critically.
As shown in the left figure, after an initial, rapid instillation of patterns during the aggressive cold-start phase, their prevalence is first suppressed and then amplified to unprecedented levels during multimodal RL. This counter-intuitive dynamic suggests a clear division of labor: the cold-start phase learns broadly, indiscriminately memorizing all available patterns, while RL discerns critically, acting as a strategic filter over the crucial tokens and scaling up pivotal behaviors. This process, in which RL discards the dross to select the essence, is key to achieving superior generalization.

Visual transfer of cognitive behaviors is strategic.
As shown in the right figure, we track the visual transfer of backtracking and verification across training stages. The transfer rate of backtracking grows consistently, from 2.5% to 17.3%, while verification exhibits near-zero transfer throughout both the cold-start and RL phases. This indicates that transfer is a strategic process, for which we posit two potential explanations: (1) backtracking transfers more readily because it builds on DeepSeek-R1’s inherent “mental imagery” capabilities, whereas verification lacks such a direct linguistic precursor and is harder for the MLLM to internalize; (2) mirroring how humans naturally and instinctively process visual information, backtracking is a more fundamental component of complex visual reasoning, making its amplification a higher priority during the strategic RL phase.
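
For concreteness, the transfer rate here can be read as the fraction of sampled responses that exhibit a given visual behavior. A minimal way to track that quantity across checkpoints might look like the sketch below; the behavior detector is left as a user-supplied predicate (keyword heuristic, LLM judge, or otherwise), since the exact detector is not reproduced in this write-up.

from typing import Callable, Iterable

def behavior_transfer_rate(
    responses: Iterable[str],
    has_behavior: Callable[[str], bool],
) -> float:
    """Fraction of sampled responses exhibiting a given visual cognitive behavior."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(has_behavior(response) for response in responses) / len(responses)

def transfer_curve(
    responses_by_checkpoint: dict[str, list[str]],
    has_behavior: Callable[[str], bool],
) -> dict[str, float]:
    """Transfer rate per checkpoint, e.g. to chart a 2.5% -> 17.3% style trajectory."""
    return {
        checkpoint: behavior_transfer_rate(responses, has_behavior)
        for checkpoint, responses in responses_by_checkpoint.items()
    }

def is_visual_backtracking(text: str) -> bool:
    """Toy predicate (assumption): flags explicit re-inspection of the image."""
    return "looking at the image again" in text.lower()

# Hypothetical usage with made-up responses, not real OVR samples.
rates = transfer_curve(
    {"cold_start": ["Looking at the image again, the angle is 30 degrees."],
     "rl_step_900": ["The answer is 4.", "Wait, looking at the image again, it is 5."]},
    is_visual_backtracking,
)
print(rates)  # e.g. {'cold_start': 1.0, 'rl_step_900': 0.5}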

Figure: (left) Frequency of visual cognitive behaviors across cold start and RL; (right) Transfer rates of backtracking and verification across training stages.

Visual Perception Analysis and Future Work

Cold start impairs perception, while RL enhances it.
We evaluated both stages of OVR, along with the base model Qwen2.5-VL-7B, on a comprehensive set of multimodal benchmarks targeting visual perception and recognition. As shown in the table, performance steadily improves on tasks such as MMBench and PhyX, underscoring the effectiveness of our training paradigm. The cold-start model, however, shows declines on several tasks, notably increased hallucination, likely due to token distribution shifts introduced by the large-scale linguistic data. The regained performance on benchmarks such as MMBench and BLINK demonstrates that long-term multimodal RL can effectively mitigate these issues by discerning the perceptual capabilities that are critical for multimodal tasks.
Table: Perception and recognition benchmark results for Qwen2.5-VL-7B, the cold-start model, and OVR.

The current unscalability of RL for perception policy.
Throughout multimodal RL, we observed a strong correlation between the reward and the average response length (see the training dynamics figure above), a finding consistent with prior practice. This reinforces response length as an effective reward proxy, indicative of a scaling property tied to reasoning depth and compute. However, when focusing on specific discriminative perceptual tasks such as OCR and counting, we observe a clear divergence. As shown in the figure below, while the reward can be effectively increased, the average response length remains largely stagnant. This unscalable training dynamic on such challenging tasks hints at a more fundamental issue: the absence of certain core visual cognitive behaviors.
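
One simple way to quantify this coupling (and its breakdown on perceptual tasks) is to correlate per-step reward with average response length from the training logs. The sketch below uses a plain Pearson correlation; the log values are made up for illustration and are not real OVR numbers.

import math

def pearson_correlation(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length series of training statistics."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / denom if denom > 0 else 0.0

# Hypothetical per-step logs. On broad reasoning data one expects a high
# correlation; on narrow perceptual tasks (OCR, counting) the length series
# may stay flat even as the reward climbs.
rewards = [0.21, 0.25, 0.31, 0.38, 0.44]
avg_response_lengths = [820.0, 900.0, 1010.0, 1150.0, 1290.0]
print(f"reward-length correlation: {pearson_correlation(rewards, avg_response_lengths):.3f}")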

Figure: Reward and average response length on discriminative perceptual tasks (e.g., OCR, counting).

BibTeX

If you find this work useful, please consider citing it:


@misc{wei2025openvisionreasonertransferring,
  title={Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning},
  author={Yana Wei and Liang Zhao and Jianjian Sun and Kangheng Lin and Jisheng Yin and
          Jingcheng Hu and Yinmin Zhang and En Yu and Haoran Lv and Zejia Weng and Jia Wang and
          Chunrui Han and Yuang Peng and Qi Han and Zheng Ge and Xiangyu Zhang and Daxin Jiang and
          Vishal M. Patel},
  year={2025},
  eprint={2507.05255},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.05255}
}
Thank you for reading and for your support! If you have any questions or suggestions, please feel free to contact us via email: ywei66@jh.edu, zhaoliang02@stepfun.com, vpatel36@jhu.edu