VisualFLIP.
Paired visual reasoning benchmarkPaired visual reasoning benchmark

VisualFLIP Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Imperial College London

VisualFLIP is a paired benchmark for testing whether multimodal predictions update when task-critical image evidence changes. Each item keeps the question fixed and changes the image evidence so the gold answer deterministically flips. VisualFLIP 是一个配对基准,用于测试多模态模型的预测是否会随任务关键图像证据而更新。每个样本保持问题不变,只改变图像证据,使标准答案确定性翻转。

Motivating example启发性示例 Gold B→E; model B→B标准答案 B→E;模型 B→B
A VisualFLIP paired example where the model keeps the same answer after a task-critical edit
paired example配对样例
Gold answer标准答案 BE task-critical edit flips the label任务关键编辑使标签翻转
Model answer模型答案 BB prediction repeats instead of updating预测重复,而非更新
Outcome判定 Collapse坍塌

The paired edit changes the correct option from B to E. The model's prediction remains B, so the pair is counted as an answer-update failure. 该配对编辑使正确选项从 B 变为 E;模型预测仍为 B,因此这一样本被计为答案更新失败。

A correct answer is not always a grounded one.答对,未必意味着有据。

Modern multimodal LLMs answer many visual-reasoning questions correctly. But single-image accuracy can hide answers that are not driven by the visual evidence — driven instead by language priors, memorized regularities, or whatever a confident reasoning chain happens to settle on. VisualFLIP turns this into a behavioral test. 现代多模态大模型能答对很多视觉推理题。但单图准确率会掩盖那些并非依赖视觉证据的答案——它们可能源自语言先验、记忆的规律,或一段听上去自洽的推理链恰好落到的位置。VisualFLIP 把这件事变成一个行为测试。

One pair = one question, two images. The second image makes a minimal task-critical edit that deterministically flips the gold answer. A prediction sensitive to the evidence should change between the two; a prediction that repeats the same answer despite the gold flipping collapses. The benchmark is behavioural — it measures whether predictions move with the evidence, not the internal mechanism behind any individual answer. 一对样本 = 一个问题,两张图。第二张图做一个对任务关键的最小化编辑,使标准答案确定性翻转。对证据敏感的预测应当随之改变;若两张图给出相同答案而标准答案已翻转,则发生坍塌。这是一个行为基准——它衡量预测是否随证据更新,并不主张任何关于内部机制的结论。

Unit样本单元 same question, paired images同一问题,配对图像
Scale规模 687 pairs, 14 templates687 对样本,14 个模板
Protocol评测协议 independent and sequential modes独立与序列两种模式
Metrics指标 Accp and CRAccp 与 CR
VisualFLIP composition and evaluation modes
Figure 1. Composition and evaluation modes. Left: sample distribution across four perturbation categories and nine task types. Right: pair accuracy for representative MLLMs under independent (light) and sequential (dark) evaluation — prior-answer exposure reduces pair accuracy for several capable systems. 图 1. 构成与评测模式。左:四类扰动 × 九个任务类型的样本分布。右:代表性 MLLM 在独立(浅色)与序列(深色)模式下的配对准确率——序列设置下"先前答案暴露"使多个能力较强系统的 Accp 下降。

687 pairs across four categories and fourteen templates.687 对样本,4 个类别,14 个模板。

Perturbation types in VisualFLIP
Figure 2. Perturbation types. VisualFLIP groups perturbations into four categories: Cardinality Shift alters object counts; Spatial Transformation changes positions or orientations; Attribute Mutation modifies visual properties such as color, shape, or value; Logic Re-mapping inverts logical relationships. 图 2. 扰动类型。VisualFLIP 将扰动分为四类:计数变化改变物体数量;空间变换改变位置或朝向;属性突变修改颜色、形状或数值等视觉属性;逻辑重映射反转逻辑关系。
Category类别 Pairs样本数 Templates模板
Cardinality146hard_dense_count · hard_dense_5panel · stem_count_match · stem_sum_match
Attribute 273color_connectivity · attr_dense_5panel · attr_dense_color_count
Spatial 150layer_order · nested_containment · maze_path
Logic 118logic_set_count · narrative_multi · logic_arrow_path
Total68714 task templates

140 of the 687 pairs additionally carry an irrelevant_image control arm to test answer stability under non-task-critical edits. 其中 140 对样本额外带 irrelevant_image 对照臂,用于检验答案在非任务关键编辑下的稳定性。

Main results across 24 MLLMs.24 个 MLLM 的主结果。

Table 1 uses the independent protocol (each image queried separately); Table 2 uses the sequential two-turn protocol (original then edited in one conversation; SeqCR measures original-answer persistence). Within each block, rows are ranked by Accp; bold marks the best value in a column. 表 1 为独立协议(两图分别独立询问);表 2 为序列两轮协议(同一对话先原图后编辑图,SeqCR 衡量原答案的惯性保持)。每个分块内按 Accp 排名;加粗为该列最优。

Table 1 — Independent evaluation表 1 — 独立评测

# Model模型 Year年份 Overall总体 Cardinality Attribute Spatial Logic
AccpCR ↓ AccpCR AccpCR AccpCR AccpCR
Closed-Source
1Gemini 3.5 Flash202681.27.384.97.190.54.570.711.468.69.5
2Qwen3.6-Plus202680.26.883.68.390.53.469.312.866.15.3
3GPT-5.5202678.65.885.63.786.45.768.07.865.36.6
4Gemini 3.1 Pro202677.110.179.58.688.35.968.716.059.315.5
5Qwen3.5-Flash202672.17.874.06.487.23.852.717.359.37.9
6Seed 2.0 Mini202668.611.861.710.186.28.867.418.440.214.5
7GLM-5V-Turbo202659.710.241.113.378.75.553.617.346.611.1
8Claude Opus 4.7202657.213.143.216.074.08.348.720.346.612.2
9Grok 4.3202652.518.433.622.973.612.040.028.843.217.2
10GPT-5-mini202545.327.628.127.353.829.047.330.444.120.2
11Claude Opus 4.6202628.423.219.225.038.117.330.729.114.432.7
12GPT-4o202423.350.613.057.120.961.740.034.720.334.9
Open-Source
13MiMo-v2.5-310B202651.09.647.38.859.07.646.215.743.28.6
14Kimi K2.6-1T202641.62.948.62.939.91.741.62.435.65.6
15GLM-4.6V-106B202538.726.637.014.042.932.544.732.323.714.3
16Qwen3-VL-235B202526.547.726.747.219.060.138.739.728.024.2
17Qwen3-VL-32B202525.051.427.450.516.867.442.036.619.537.5
18Qwen3-VL-8B202517.652.715.855.111.761.528.043.220.348.5
19Qwen2.5-VL-7B20259.253.03.445.212.861.26.753.611.035.2
Open-Source · Tool-Augmented
20CoF-7B202510.647.47.533.313.254.714.742.03.449.0
21DeepEyesV2-7B20269.242.67.534.48.450.711.341.510.232.8
22PixelReasoner-8B20257.951.64.838.112.155.15.356.25.150.0
23Mini-o3-7B20257.344.74.838.310.649.65.342.15.142.0
24DeepEyes-7B20257.153.04.142.18.164.68.045.67.644.3

Table 2 — Sequential evaluation表 2 — 序列评测

# Model模型 Cardinality Attribute Spatial Logic Avg
AccpSeqCR AccpSeqCR AccpSeqCR AccpSeqCR AccpSeqCR ↓
1Gemini 3.1 Pro76.710.183.210.570.714.568.611.376.611.4
2Gemini 3.5 Flash80.88.166.322.865.39.066.16.869.114.2
3Claude Opus 4.734.922.564.523.450.018.647.519.852.121.6
4Qwen3.6-Plus43.251.153.141.349.332.548.322.149.339.2
5GPT-4o16.425.038.18.545.311.614.415.631.013.2
6GPT-5-mini11.664.831.947.440.737.121.251.527.748.4
7Grok 4.315.141.426.743.840.727.126.325.027.236.5
8GLM-5V-Turbo18.779.419.138.328.041.921.033.321.347.1
9Qwen3-VL-235B11.672.913.668.336.039.424.638.519.956.7
10GLM-4.6V14.557.113.863.421.355.110.447.115.057.5

BibTeX

@article{zhu2026visualflip,
  title   = {VisualFLIP: Do Predictions Depend on Task-Critical
             Visual Evidence in Multimodal Reasoning?},
  author  = {Zhu, Didi and Chen, Changrui and Zafeiriou, Stefanos and Deng, Jiankang},
  year    = {2026},
  journal = {arXiv preprint}
}

If you use the real-image pairs (source == "real_mathvision"), please also cite the MathVision benchmark (Wang et al., 2024). 如果使用了真实图像样本(source == "real_mathvision"),请同时引用 MathVision 基准(Wang et al., 2024)。