PerceptionRubrics
Calibrating Multimodal Evaluation to Human Perception

ICML 2026
Johns Hopkins University · StepFun · Tsinghua University · Independent
*Core Contribution
Corresponding Author

Overview

Current perception benchmarks suffer from an evaluation paradox: leaderboards are increasingly saturated, yet models remain perceptually brittle in real-world use. We introduce PerceptionRubrics, a rubric-based evaluation framework that shifts assessment from holistic semantic matching to rigorous atomic auditing. It pairs 1,038 information-dense images with over 12,000 instance-specific rubrics, derived from Golden Captions constructed through a novel Circular Peer-Review consensus pipeline and distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics.

Crucially, PerceptionRubrics implements a Gated Scoring mechanism: instead of averaging rubric outcomes linearly, it treats failure on any mandatory visual fact as a sharp binary penalty. Extensive evaluation yields three insights:

(1) The Reliability Gap. Models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains such as GUIs.
(2) Open–Closed Stratification. Although open-source and proprietary models are converging on reasoning tasks, a persistent ~8% perception deficit separates the open-source frontier from proprietary leaders.
(3) Human-Aligned Rigor. Our gated metrics align substantially better with human preference than conventional benchmarks do, supporting the view that strict perceptual fidelity is a prerequisite for reliable generation.

Fig 1: Motivation. Existing benchmarks can favor responses with key omissions, whereas humans prefer responses that capture more perceptually important details. PerceptionRubrics more clearly distinguishes model capabilities than DetailCaps and DOCCI.

Benchmark Construction

To bypass the visual-grounding gap of direct image-to-rubric generation, we adopt a caption-centric pipeline. We first transcribe each image into a comprehensive Golden Caption, then distill rubrics from it.

  • Step 1 — Circular Peer-Review. Three top-tier MLLMs act as a “jury-and-generator” ensemble: they generate independent descriptions, then iteratively compare, rank, and rewrite them against the visual evidence to synthesize a superior consensus caption.
  • Step 2 — Strict Consensus Filtering. Human experts act as final verifiers under a discard-on-divergence protocol—only high-consensus samples are lightly verified and kept, focusing human effort on high-confidence data.
  • Dual-Stream Rubrics. From each Golden Caption we extract Must-Right rubrics (a priori essential facts) and Easy-Wrong rubrics (a posteriori pitfalls mined from a pool of model error patterns) using domain-specific adaptive prompts.
Fig 2: Construction pipeline. Golden Captions are synthesized via circular peer-review (top), then serve as anchors to generate Must-Right and Easy-Wrong rubrics via domain-specific prompting (bottom).
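
The generate, compare, rank, and rewrite loop of Step 1 can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `Juror` wrapper, its three callables, the Borda-style vote in `borda_winner`, and the fixed `rounds` count are all hypothetical stand-ins for details the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Juror:
    """One MLLM in the ensemble, wrapped as three hypothetical callables."""
    describe: Callable[[str], str]                   # image -> caption
    rank: Callable[[str, Sequence[str]], List[int]]  # image, captions -> indices, best first
    rewrite: Callable[[str, str, str], str]          # image, own caption, best caption -> new caption


def borda_winner(rankings: Sequence[Sequence[int]], n: int) -> int:
    """Return the caption index with the lowest summed rank position (assumed voting rule)."""
    totals = [0] * n
    for ranking in rankings:
        for position, idx in enumerate(ranking):
            totals[idx] += position
    return min(range(n), key=totals.__getitem__)


def circular_peer_review(image: str, jurors: Sequence[Juror], rounds: int = 2) -> str:
    """Every juror generates, then all jurors rank and rewrite for a fixed number of rounds."""
    captions = [j.describe(image) for j in jurors]
    for _ in range(rounds):
        rankings = [j.rank(image, captions) for j in jurors]
        best = captions[borda_winner(rankings, len(captions))]
        # Each juror revises its own caption with the current best as a reference.
        captions = [j.rewrite(image, own, best) for j, own in zip(jurors, captions)]
    final_rankings = [j.rank(image, captions) for j in jurors]
    return captions[borda_winner(final_rankings, len(captions))]
```

In practice each callable would wrap a prompt to a multimodal model with the image attached; the sketch only fixes the control flow, in which every model serves as both generator and juror.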

The resulting benchmark contains 1,038 images and 12,004 rubrics (4,232 Must-Right + 7,772 Easy-Wrong; ~11.6 per image) spanning seven domains: Natural Scene, Document & OCR, Digital UI/UX, Structured Data, STEM & Expert, Logic & Puzzle, and Creative & Cultural. Golden Captions average 770 words—5–6× denser than prior detailed-captioning benchmarks.

Gated Scoring

We use an LLM-as-a-Judge to verify each rubric as a boolean (True/False), then apply a non-linear, gated aggregation that mirrors human error sensitivity:

  • Must-Right as the Gate. If a response fails any Must-Right criterion, the gate closes (G = 0) and the score collapses to zero—a single fatal hallucination is a binary failure, not a minor fluctuation.
  • Easy-Wrong for Differentiation. For responses that pass the gate (G = 1), the final score is the pass rate over the Easy-Wrong rubrics, rewarding robustness against subtle, density-rich cognitive traps.

This ensures a high score reflects not coarse semantic proximity but genuine perceptual reliability, distinguishing acceptable approximations from catastrophic failures.
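
The two rules above can be written as a short sketch, assuming the judge's verdicts are already available as lists of booleans. The function names are hypothetical, and returning 0.0 when a sample has no Easy-Wrong rubrics is an assumption the text does not address. A plain linear average is included only for contrast.

```python
from typing import Sequence


def gated_score(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """Gate on the Must-Right verdicts, then score by the Easy-Wrong pass rate."""
    gate = 1 if all(must_right) else 0  # G = 0 as soon as one Must-Right rubric fails
    if gate == 0:
        return 0.0
    if not easy_wrong:
        return 0.0  # assumption: no Easy-Wrong rubrics means nothing to score
    return sum(easy_wrong) / len(easy_wrong)


def linear_average(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """Ungated baseline for contrast: mean pass rate over all rubrics."""
    verdicts = list(must_right) + list(easy_wrong)
    return sum(verdicts) / len(verdicts)


# One failed Must-Right rubric: the linear average stays high, the gated score is zero.
failed_gate = gated_score([True, True, False], [True] * 8)
baseline = linear_average([True, True, False], [True] * 8)

# Gate passed: the score is the Easy-Wrong pass rate (3 of 4 here).
passed_gate = gated_score([True, True], [True, True, True, False])
```

In the first case the response passes 10 of 11 rubrics, so a linear average would report roughly 0.91 while the gated score is 0.0, which is the contrast the gate is designed to expose.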

Main Results

We evaluate 25 leading MLLMs. PerceptionRubrics reveals a pronounced performance stratification obscured by traditional holistic benchmarks. The strongest model reaches only 70.07%, while a widely used proprietary model (GPT-4o) scores just 12.59%. Performance is highest on natural scenes and lowest in the GUI domain, and the best open-source model still trails the proprietary state-of-the-art by over 8%—a gap that persists despite convergence on reasoning tasks.

# Model Params Doc Logic Creative GUI Natural STEM Structured Overall

Table 1: Fine-grained performance across seven domains. All values are percentages (%).

Analysis

The Reliability Gap & Failure Modes

Comparing Atomic Accuracy (mean pass rate over individual rubrics) with the stricter Must-Right Pass Rate (all constraints satisfied) exposes a systematic Reliability Gap: models pass most atomic checks yet fail their strict conjunction. The gap narrows as capability increases. The GUI domain dominates perceptual failures across models.

Fig 3: Failure analysis. (Left) Distribution of error sources across models. (Right) Atomic Accuracy vs. the stricter Must-Right Pass Rate.
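
A minimal sketch of the two quantities being compared, assuming each sample is represented by its list of boolean Must-Right verdicts. The function names and the toy verdicts are hypothetical, and computing Atomic Accuracy over the Must-Right rubrics only is a simplifying assumption.

```python
from typing import Sequence


def atomic_accuracy(samples: Sequence[Sequence[bool]]) -> float:
    """Mean pass rate over every individual rubric verdict, pooled across samples."""
    verdicts = [v for sample in samples for v in sample]
    return sum(verdicts) / len(verdicts)


def must_right_pass_rate(samples: Sequence[Sequence[bool]]) -> float:
    """Fraction of samples whose verdicts are all True (strict conjunction)."""
    return sum(all(sample) for sample in samples) / len(samples)


# Toy verdicts (not benchmark data): two of four samples miss exactly one rubric.
samples = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, True, True],
    [True, True, True, True],
]
gap = atomic_accuracy(samples) - must_right_pass_rate(samples)
```

Here 14 of 16 atomic checks pass (0.875), yet only 2 of 4 samples satisfy every constraint (0.5). The same effect follows from simple arithmetic: if each of four checks passed independently with probability 0.9, all four would pass together with probability 0.9^4 ≈ 0.656.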

Consistency of Perceptual Capabilities

We find a near-perfect linear correlation (R² ≈ 0.98) between Must-Right Pass Rate and Easy-Wrong accuracy: models that fail to ground essential facts inevitably struggle with subtle details and hallucination. Robust fine-grained understanding critically depends on foundational perception.

Fig 4: Correlation between basic perceptual reliability and hallucination resistance.

Alignment with Human Preference

Against the Vision Arena human-preference leaderboard, PerceptionRubrics shows the strongest alignment among the compared captioning benchmarks, with a Pearson correlation of 0.916 and a perfect Spearman rank correlation of 1.000. In contrast, DOCCI and DetailCaps assign nearly indistinguishable (or even anti-correlated) scores to models with markedly different human ratings.

Fig 5: Human alignment. Benchmark score vs. Vision Arena Elo across overlapping models. PerceptionRubrics correlates most strongly with human preference.
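
For readers less familiar with the two statistics, the sketch below computes both in plain Python and shows how a Spearman correlation can be perfect while the Pearson correlation stays below 1. The Elo values and scores are synthetic placeholders chosen for illustration, not numbers from the paper or from Vision Arena.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


# Synthetic example: the scores preserve the Elo ordering but not the spacing.
elo = [1100.0, 1150.0, 1200.0, 1300.0]
score = [10.0, 20.0, 25.0, 70.0]
rho = spearman(elo, score)  # ordering is identical, so this is 1 up to rounding
r = pearson(elo, score)     # below 1 because the relationship is not exactly linear
```

A perfect Spearman value therefore says the benchmark orders the overlapping models exactly as human raters do, while the Pearson value additionally reflects how linearly the score tracks Elo.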

Resistance to Length Bias

Caption length exhibits a negligible relationship with PerceptionRubrics scores, indicating the metric rewards precise, verifiable perception rather than verbosity.

Fig 6: Length bias. Score vs. response word count for representative models (panels: Gemini, Kimi).

Evaluation Robustness & Rubric Coverage

Swapping the judge model preserves the ranking order, and evaluation stability improves monotonically as rubric coverage increases—confirming the robustness of both the rubric-generation pipeline and the resulting metric to judge choice and sampling variability.

Fig 7: (Left) Consistent rankings under different judges. (Right) Stability improves with rubric coverage.
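
One way to see why stability should improve with rubric coverage is a small subsampling experiment. This is a hedged illustration, not the protocol used in the paper: the function name, the pass rate as the tracked quantity, the number of trials, and the toy verdicts are all assumptions.

```python
import random
from statistics import pstdev
from typing import Sequence


def pass_rate_spread(verdicts: Sequence[bool], coverage: float,
                     trials: int = 200, seed: int = 0) -> float:
    """Std. dev. of the pass rate when only a fraction of the rubrics is sampled."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = max(1, round(coverage * len(verdicts)))
    pool = list(verdicts)
    rates = []
    for _ in range(trials):
        subset = rng.sample(pool, k)  # sampling without replacement
        rates.append(sum(subset) / k)
    return pstdev(rates)


# Toy verdicts (not benchmark data): 8 of 12 rubrics pass.
verdicts = [True] * 8 + [False] * 4
spread_low = pass_rate_spread(verdicts, coverage=0.25)
spread_high = pass_rate_spread(verdicts, coverage=0.75)
spread_full = pass_rate_spread(verdicts, coverage=1.0)
```

With only a quarter of the rubrics, the measured pass rate swings widely from draw to draw; the spread shrinks as coverage grows and vanishes at full coverage, where every draw contains the same rubrics.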