Abstract

We present a perception-in-reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected to achieve perfect perception in a single pass yet often fail to do so. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer’s quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.


Introduction

🧠 Motivation

Although existing works assume that well-trained LVLMs can achieve sufficiently accurate initial perception, hallucinations and misperceptions are commonly observed. This work is inspired by the fact that humans establish cognition through iterative observation.

A reasonable perception paradigm for LVLMs should therefore be iterative rather than single-pass.

💡 Key Ideas

Reflective Perceptual Learning (RPL): (1) Builds a perception-feedback loop through a curated visual reflection dataset. (2) Utilizes Reflective Unlikelihood Training to capture preferences and prevent behavioral collapse.
Reflective Perception (RePer): Employs a policy-critic inference architecture that allows LVLMs to perform perception and reflection separately via multi-turn dialogues.

🙌 Highlights

1. RePer progressively shifts image attention toward human-aligned regions through iterative reflection, resulting in perceptual patterns that more closely mirror human focus.
2. Reflective Perceptual Learning serves as a free-form preference optimization that unifies preference learning paradigms such as DPO and LiPO, while enabling fine-grained supervision through explicit feedback signals.
3. RePer is an accurate and informative image captioner that significantly reduces hallucinations.


📊 Visual Reflection Dataset

Visual Reflection Dataset Construction Pipeline and one data sample

Dataset Construction Pipeline
Dataset Examples
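
For illustration only, a single reflection sample could be organized roughly as below. Every field name here (`image`, `question`, `rounds`, `perception`, `feedback`, `reward`) is a hypothetical placeholder for explanation, not the released dataset schema.

```python
# Hypothetical layout of one visual-reflection sample (illustrative only;
# the released dataset may use different fields and formats).
sample = {
    "image": "images/000123.jpg",          # path to the input image
    "question": "Describe the image in detail.",
    "rounds": [
        {   # round 1: initial perception plus critic feedback
            "perception": "A man rides a red bicycle past a bakery...",
            "feedback": "The bicycle is blue, and the shop is a bookstore.",
            "reward": 0.25,                 # low reward for an early, flawed answer
        },
        {   # final round: refined perception accepted by the critic
            "perception": "A man rides a blue bicycle past a bookstore...",
            "feedback": "No remaining errors.",
            "reward": 1.0,
        },
    ],
}
```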

🔍 Training and Evaluation Framework

Reflective Unlikelihood Training


1. Unlikelihood: Prevents the model from overfitting to multi-turn responses and avoids behavioral collapse.
2. Normalized rewards: Balance the likelihood and unlikelihood terms (see the sketch after this list).
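
A minimal PyTorch-style sketch of one way such a reward-weighted likelihood/unlikelihood objective could be written; the tensor shapes, the `rewards` input, and the exact weighting are assumptions, and the precise objective is defined in the paper.

```python
import torch
import torch.nn.functional as F

def reflective_unlikelihood_loss(logits, targets, rewards, pad_id=-100):
    """Reward-weighted likelihood/unlikelihood objective (illustrative sketch).

    logits:  (B, L, V) token logits from the policy model
    targets: (B, L)    target token ids; positions equal to pad_id are ignored
    rewards: (B,)      normalized per-response rewards in [0, 1]
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # (B, L, V)
    mask = (targets != pad_id).float()
    tok_logp = log_probs.gather(
        -1, targets.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)                                                  # (B, L)
    lengths = mask.sum(-1).clamp(min=1)

    # Likelihood term: raise the probability of tokens from good responses.
    nll = -(tok_logp * mask).sum(-1) / lengths

    # Unlikelihood term: lower the probability of tokens from poor responses,
    # i.e. -log(1 - p(token)), discouraging overfitting to bad rounds.
    p = tok_logp.exp().clamp(max=1.0 - 1e-6)
    unl = -(torch.log1p(-p) * mask).sum(-1) / lengths

    # Normalized rewards balance the two terms per response:
    # high reward -> mostly likelihood, low reward -> mostly unlikelihood.
    r = rewards.clamp(0.0, 1.0)
    return (r * nll + (1.0 - r) * unl).mean()
```

Here the per-response normalized reward plays the balancing role described in item 2 above.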

Reflective Perception

At inference time, RePer runs as a collaborative interaction between the well-trained policy and critic agents.
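
Below is a minimal sketch of that loop, assuming hypothetical chat-style `policy.generate` and `critic.critique` interfaces; the actual prompts, stopping rule, and model details are described in the paper.

```python
def reflective_perception(image, question, policy, critic, max_rounds=3):
    """Policy-critic loop sketch: perceive, reflect, refine (illustrative only)."""
    dialogue = [{"role": "user", "content": question}]
    answer = None
    for _ in range(max_rounds):
        # Policy model produces (or refines) its perception of the image.
        answer = policy.generate(image, dialogue)
        dialogue.append({"role": "assistant", "content": answer})

        # Critic model reflects on the current answer and returns feedback.
        feedback = critic.critique(image, question, answer)
        if feedback.get("satisfactory", False):
            break  # stop once the critic no longer finds errors

        # Feed the reflection back as the next user turn for another round.
        dialogue.append({"role": "user", "content": feedback["comments"]})
    return answer
```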


✨ Key Insights

RePer Progressively Aligns with Human Attention
The model accurately shifts its attention emphasis towards the correct image tokens, rather than over-focusing on a few insignificant ones.
RPL is a Free-Form Preference Optimization
Rewards increase linearly from round 1 to round T. The model mainly penalizes poor samples (more unlikelihood) in the initial rounds and gradually shifts to rewarding good samples (more likelihood) in later rounds. This prevents the model from overfitting to poor initial samples and, importantly, lets it progressively learn to distinguish good from bad samples.
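
As a rough numerical illustration, assuming rewards are normalized to [0, 1] and follow a simple t/T schedule (an assumption, not necessarily the paper's exact formula):

```python
def round_rewards(num_rounds: int) -> list[float]:
    """Hypothetical linear schedule: the reward grows with the round index,
    so early rounds are mostly penalized and the final round fully rewarded."""
    return [t / num_rounds for t in range(1, num_rounds + 1)]

# With T = 4 rounds: [0.25, 0.5, 0.75, 1.0]. Plugged into a reward-weighted
# objective like the sketch above, round 1 is dominated by the unlikelihood
# (penalty) term and round T by the likelihood (reward) term.
```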

👉 For performance, ablations, in-depth analysis, and more details, please refer to the paper and the GitHub repo.

🏠 Project page under construction… Stay tuned!

💪 The dataset, model, and code will be released soon!

BibTeX

@misc{wei2025perceptionreflection,
      title={Perception in Reflection}, 
      author={Yana Wei and Liang Zhao and Kangheng Lin and En Yu and Yuang Peng and Runpei Dong and Jianjian Sun and Haoran Wei and Zheng Ge and Xiangyu Zhang and Vishal M. Patel},
      year={2025},
      eprint={2504.07165},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07165}, 
}