Paired visual reasoning benchmarkPaired visual reasoning benchmark

VisualFLIP Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng

Imperial College London

VisualFLIP is a paired benchmark for testing whether multimodal predictions update when task-critical image evidence changes. Each item keeps the question fixed and changes the image evidence so the gold answer deterministically flips. VisualFLIP 是一个配对基准，用于测试多模态模型的预测是否会随任务关键图像证据而更新。每个样本保持问题不变，只改变图像证据，使标准答案确定性翻转。

Code代码 Dataset数据集

Motivating example启发性示例 Gold B→E; model B→B标准答案 B→E；模型 B→B

A VisualFLIP paired example where the model keeps the same answer after a task-critical edit

paired example配对样例

Gold answer标准答案 B→E task-critical edit flips the label任务关键编辑使标签翻转

Model answer模型答案 B→B prediction repeats instead of updating预测重复，而非更新

Outcome判定 Collapse坍塌

The paired edit changes the correct option from B to E. The model's prediction remains B, so the pair is counted as an answer-update failure. 该配对编辑使正确选项从 B 变为 E；模型预测仍为 B，因此这一样本被计为答案更新失败。

01 / Overview概览

A correct answer is not always a grounded one.答对，未必意味着有据。

Modern multimodal LLMs answer many visual-reasoning questions correctly. But single-image accuracy can hide answers that are not driven by the visual evidence — driven instead by language priors, memorized regularities, or whatever a confident reasoning chain happens to settle on. VisualFLIP turns this into a behavioral test. 现代多模态大模型能答对很多视觉推理题。但单图准确率会掩盖那些并非依赖视觉证据的答案——它们可能源自语言先验、记忆的规律，或一段听上去自洽的推理链恰好落到的位置。VisualFLIP 把这件事变成一个行为测试。

One pair = one question, two images. The second image makes a minimal task-critical edit that deterministically flips the gold answer. A prediction sensitive to the evidence should change between the two; a prediction that repeats the same answer despite the gold flipping collapses. The benchmark is behavioural — it measures whether predictions move with the evidence, not the internal mechanism behind any individual answer. 一对样本 = 一个问题，两张图。第二张图做一个对任务关键的最小化编辑，使标准答案确定性翻转。对证据敏感的预测应当随之改变；若两张图给出相同答案而标准答案已翻转，则发生坍塌。这是一个行为基准——它衡量预测是否随证据更新，并不主张任何关于内部机制的结论。

Unit样本单元 same question, paired images同一问题，配对图像

Scale规模 687 pairs, 14 templates687 对样本，14 个模板

Protocol评测协议 independent and sequential modes独立与序列两种模式

Metrics指标 Acc_p and CRAcc_p 与 CR

VisualFLIP composition and evaluation modes — **Figure 1. Composition and evaluation modes.** *Left:* sample distribution across four perturbation categories and nine task types. *Right:* pair accuracy for representative MLLMs under independent (light) and sequential (dark) evaluation — prior-answer exposure reduces pair accuracy for several capable systems. **图 1. 构成与评测模式。**左：四类扰动 × 九个任务类型的样本分布。右：代表性 MLLM 在独立（浅色）与序列（深色）模式下的配对准确率——序列设置下"先前答案暴露"使多个能力较强系统的 Acc_p 下降。

02 / Composition数据集构成

687 pairs across four categories and fourteen templates.687 对样本，4 个类别，14 个模板。

Perturbation types in VisualFLIP — **Figure 2. Perturbation types.** VisualFLIP groups perturbations into four categories: *Cardinality Shift* alters object counts; *Spatial Transformation* changes positions or orientations; *Attribute Mutation* modifies visual properties such as color, shape, or value; *Logic Re-mapping* inverts logical relationships. **图 2. 扰动类型。**VisualFLIP 将扰动分为四类：*计数变化*改变物体数量；*空间变换*改变位置或朝向；*属性突变*修改颜色、形状或数值等视觉属性；*逻辑重映射*反转逻辑关系。

Category类别	Pairs样本数	Templates模板
Cardinality	146	`hard_dense_count` · `hard_dense_5panel` · `stem_count_match` · `stem_sum_match`
Attribute	273	`color_connectivity` · `attr_dense_5panel` · `attr_dense_color_count`
Spatial	150	`layer_order` · `nested_containment` · `maze_path`
Logic	118	`logic_set_count` · `narrative_multi` · `logic_arrow_path`
Total	687	14 task templates

140 of the 687 pairs additionally carry an irrelevant_image control arm to test answer stability under non-task-critical edits. 其中 140 对样本额外带 irrelevant_image 对照臂，用于检验答案在非任务关键编辑下的稳定性。

03 / Results实验结果

Main results across 24 MLLMs.24 个 MLLM 的主结果。

Table 1 uses the independent protocol (each image queried separately); Table 2 uses the sequential two-turn protocol (original then edited in one conversation; SeqCR measures original-answer persistence). Within each block, rows are ranked by Acc_p; bold marks the best value in a column. 表 1 为独立协议(两图分别独立询问);表 2 为序列两轮协议(同一对话先原图后编辑图,SeqCR 衡量原答案的惯性保持)。每个分块内按 Acc_p 排名;加粗为该列最优。

Table 1 — Independent evaluation表 1 — 独立评测

#	Model模型	Year年份	Overall总体		Cardinality		Attribute		Spatial		Logic
#	Model模型	Year年份	Acc_p ↑	CR ↓	Acc_p	CR	Acc_p	CR	Acc_p	CR	Acc_p	CR
Closed-Source
1	Gemini 3.5 Flash	2026	81.2	7.3	84.9	7.1	90.5	4.5	70.7	11.4	68.6	9.5
2	Qwen3.6-Plus	2026	80.2	6.8	83.6	8.3	90.5	3.4	69.3	12.8	66.1	5.3
3	GPT-5.5	2026	78.6	5.8	85.6	3.7	86.4	5.7	68.0	7.8	65.3	6.6
4	Gemini 3.1 Pro	2026	77.1	10.1	79.5	8.6	88.3	5.9	68.7	16.0	59.3	15.5
5	Qwen3.5-Flash	2026	72.1	7.8	74.0	6.4	87.2	3.8	52.7	17.3	59.3	7.9
6	Seed 2.0 Mini	2026	68.6	11.8	61.7	10.1	86.2	8.8	67.4	18.4	40.2	14.5
7	GLM-5V-Turbo	2026	59.7	10.2	41.1	13.3	78.7	5.5	53.6	17.3	46.6	11.1
8	Claude Opus 4.7	2026	57.2	13.1	43.2	16.0	74.0	8.3	48.7	20.3	46.6	12.2
9	Grok 4.3	2026	52.5	18.4	33.6	22.9	73.6	12.0	40.0	28.8	43.2	17.2
10	GPT-5-mini	2025	45.3	27.6	28.1	27.3	53.8	29.0	47.3	30.4	44.1	20.2
11	Claude Opus 4.6	2026	28.4	23.2	19.2	25.0	38.1	17.3	30.7	29.1	14.4	32.7
12	GPT-4o	2024	23.3	50.6	13.0	57.1	20.9	61.7	40.0	34.7	20.3	34.9
Open-Source
13	MiMo-v2.5-310B	2026	51.0	9.6	47.3	8.8	59.0	7.6	46.2	15.7	43.2	8.6
14	Kimi K2.6-1T	2026	41.6	2.9	48.6	2.9	39.9	1.7	41.6	2.4	35.6	5.6
15	GLM-4.6V-106B	2025	38.7	26.6	37.0	14.0	42.9	32.5	44.7	32.3	23.7	14.3
16	Qwen3-VL-235B	2025	26.5	47.7	26.7	47.2	19.0	60.1	38.7	39.7	28.0	24.2
17	Qwen3-VL-32B	2025	25.0	51.4	27.4	50.5	16.8	67.4	42.0	36.6	19.5	37.5
18	Qwen3-VL-8B	2025	17.6	52.7	15.8	55.1	11.7	61.5	28.0	43.2	20.3	48.5
19	Qwen2.5-VL-7B	2025	9.2	53.0	3.4	45.2	12.8	61.2	6.7	53.6	11.0	35.2
Open-Source · Tool-Augmented
20	CoF-7B	2025	10.6	47.4	7.5	33.3	13.2	54.7	14.7	42.0	3.4	49.0
21	DeepEyesV2-7B	2026	9.2	42.6	7.5	34.4	8.4	50.7	11.3	41.5	10.2	32.8
22	PixelReasoner-8B	2025	7.9	51.6	4.8	38.1	12.1	55.1	5.3	56.2	5.1	50.0
23	Mini-o3-7B	2025	7.3	44.7	4.8	38.3	10.6	49.6	5.3	42.1	5.1	42.0
24	DeepEyes-7B	2025	7.1	53.0	4.1	42.1	8.1	64.6	8.0	45.6	7.6	44.3

Table 2 — Sequential evaluation表 2 — 序列评测

#	Model模型	Cardinality		Attribute		Spatial		Logic		Avg
#	Model模型	Acc_p	SeqCR	Acc_p	SeqCR	Acc_p	SeqCR	Acc_p	SeqCR	Acc_p ↑	SeqCR ↓
1	Gemini 3.1 Pro	76.7	10.1	83.2	10.5	70.7	14.5	68.6	11.3	76.6	11.4
2	Gemini 3.5 Flash	80.8	8.1	66.3	22.8	65.3	9.0	66.1	6.8	69.1	14.2
3	Claude Opus 4.7	34.9	22.5	64.5	23.4	50.0	18.6	47.5	19.8	52.1	21.6
4	Qwen3.6-Plus	43.2	51.1	53.1	41.3	49.3	32.5	48.3	22.1	49.3	39.2
5	GPT-4o	16.4	25.0	38.1	8.5	45.3	11.6	14.4	15.6	31.0	13.2
6	GPT-5-mini	11.6	64.8	31.9	47.4	40.7	37.1	21.2	51.5	27.7	48.4
7	Grok 4.3	15.1	41.4	26.7	43.8	40.7	27.1	26.3	25.0	27.2	36.5
8	GLM-5V-Turbo	18.7	79.4	19.1	38.3	28.0	41.9	21.0	33.3	21.3	47.1
9	Qwen3-VL-235B	11.6	72.9	13.6	68.3	36.0	39.4	24.6	38.5	19.9	56.7
10	GLM-4.6V	14.5	57.1	13.8	63.4	21.3	55.1	10.4	47.1	15.0	57.5

04 / Citation引用

BibTeX

@article{zhu2026visualflip,
  title   = {VisualFLIP: Do Predictions Depend on Task-Critical
             Visual Evidence in Multimodal Reasoning?},
  author  = {Zhu, Didi and Chen, Changrui and Zafeiriou, Stefanos and Deng, Jiankang},
  year    = {2026},
  journal = {arXiv preprint}
}

If you use the real-image pairs (source == "real_mathvision"), please also cite the MathVision benchmark (Wang et al., 2024). 如果使用了真实图像样本（source == "real_mathvision"），请同时引用 MathVision 基准（Wang et al., 2024）。