ACL Findings 2026

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Sunghwan Steve Cho¹ Yunseok Han² Jaeyoung Do¹,²†

AIDAS Laboratory · ¹ECE & ²IPAI, Seoul National University
† Corresponding author · {steve97, qicher, jaeyoung.do}@snu.ac.kr


Abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context.

MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization. Across 14 state-of-the-art VLMs, overall accuracy is low (29.3%), only modestly above the 20% expected from random guessing. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning.

Three Reasoning Capabilities

MI-CXR decomposes longitudinal CXR interpretation into three core capabilities that naturally arise in clinical workflows, jointly stressing different aspects of temporal reasoning.

[Figure: MI-CXR overview]

Overview of longitudinal medical VQA and MI-CXR. Three core reasoning capabilities (TEL, ICR, GTS) are evaluated over multi-visit CXR sequences using a diagnostic stage-wise decomposition.

TEL

Temporal Event Localization

Identify when clinically meaningful events—such as abnormality emergence, resolution, or recurrence—occur along the timeline.

ICR

Interval-wise Change Reasoning

Interpret visual changes between consecutive visits. The relevant interval is not pre-specified, requiring temporal localization before change interpretation.

GTS

Global Trajectory Summarization

Integrate evidence from all visits to characterize the overall disease course.
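To make the shared item format concrete, here is a minimal sketch of how one benchmark question could be represented. The class name, field names, and the one-image-per-visit layout are assumptions for illustration only; the source specifies just that each item is a five-way multiple-choice question over a five-visit timeline and belongs to one of the three task families.

```python
from dataclasses import dataclass
from typing import Literal

TaskFamily = Literal["TEL", "ICR", "GTS"]


@dataclass(frozen=True)
class TimelineQuestion:
    """One multiple-choice item over a five-visit timeline (hypothetical schema)."""

    task: TaskFamily
    image_paths: tuple[str, ...]  # assumed: one CXR per visit, oldest first
    question: str
    options: tuple[str, ...]      # the five answer choices
    answer_index: int             # position of the correct option in `options`

    def __post_init__(self) -> None:
        # Enforce the five-visit, five-option shape described above.
        if len(self.image_paths) != 5:
            raise ValueError("expected exactly five visits")
        if len(self.options) != 5:
            raise ValueError("expected exactly five options")
        if not 0 <= self.answer_index < 5:
            raise ValueError("answer_index must be in 0..4")


# Illustrative TEL item; the paths and wording are made up.
item = TimelineQuestion(
    task="TEL",
    image_paths=tuple(f"visit_{i}.jpg" for i in range(1, 6)),
    question="At which visit does the abnormality first appear?",
    options=("Visit 1", "Visit 2", "Visit 3", "Visit 4", "Visit 5"),
    answer_index=2,
)
```

Keeping the task family on the item lets TEL, ICR, and GTS questions share one loader while still being scored separately.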

Benchmark Construction

MI-CXR is built on MIMIC-CXR-JPG and MIMIC-Ext-CXR-QBA, retaining patients with at least five temporally ordered visits after rigorous annotation quality filtering. Task-specific QA pairs are instantiated from the same longitudinal timelines.
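The visit-count filter can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the record layout `(subject_id, study_id, study_date)` and the function name are hypothetical, annotation quality filtering is not shown, and taking the earliest five visits of a longer history is a placeholder choice, since the source does not say how five-visit timelines are selected from patients with more visits.

```python
from collections import defaultdict


def build_timelines(studies, min_visits=5):
    """Group study records by patient and keep patients with enough visits.

    `studies` is an iterable of (subject_id, study_id, study_date) tuples,
    where study_date is a sortable string such as "YYYYMMDD".
    Returns {subject_id: [study_id, ...]} in chronological order.
    """
    by_patient = defaultdict(list)
    for subject_id, study_id, study_date in studies:
        by_patient[subject_id].append((study_date, study_id))

    timelines = {}
    for subject_id, visits in by_patient.items():
        if len(visits) < min_visits:
            continue  # too few visits for a five-visit timeline
        visits.sort()  # order by date (ties broken by study_id)
        # Assumption: keep the earliest `min_visits` visits.
        timelines[subject_id] = [study_id for _, study_id in visits[:min_visits]]
    return timelines


# Toy records: patient "p1" has six visits, patient "p2" has only two.
records = [("p1", f"s{i}", f"2020010{i}") for i in (3, 1, 6, 2, 5, 4)]
records += [("p2", "t1", "20200101"), ("p2", "t2", "20200102")]
timelines = build_timelines(records)
```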

Overview of MI-CXR construction. Structured metadata from MIMIC-Ext-CXR-QBA and CXR images from MIMIC-CXR-JPG are combined to construct patient-level longitudinal timelines. TEL, ICR, and GTS questions are all instantiated from the same timelines.

Main Results

We evaluate 14 state-of-the-art VLMs under zero-shot prompting. Even the strongest models achieve well below 50% accuracy on this five-way task, revealing a fundamental gap in longitudinal temporal reasoning that persists across model categories and scales.
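For readers who want to reproduce this style of scoring, the sketch below shows one way to turn free-text model responses into per-subtask accuracy. The letter-extraction heuristic, the function names, and the rule that unparseable responses count as incorrect are all assumptions; the source does not describe the answer-parsing procedure or how the overall score is aggregated.

```python
import re
from collections import defaultdict


def extract_choice(response):
    """Return the first standalone capital letter A-E in a response, or None.

    Heuristic only: a sentence starting with the article "A" would be
    misread as choice A, so real pipelines usually constrain the output format.
    """
    match = re.search(r"\b([A-E])\b", response)
    return match.group(1) if match else None


def accuracy_by_subtask(records):
    """Compute accuracy per subtask.

    `records` is an iterable of (subtask, gold_letter, model_response).
    Responses with no parseable letter are counted as incorrect (assumption).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subtask, gold, response in records:
        total[subtask] += 1
        if extract_choice(response) == gold:
            correct[subtask] += 1
    return {name: correct[name] / total[name] for name in total}


# Toy responses, made up for illustration.
toy = [
    ("ICR", "B", "Answer: B"),
    ("ICR", "C", "(A)"),
    ("GTS (Single)", "E", "I cannot tell from these images."),
]
scores = accuracy_by_subtask(toy)
```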

Category	Model	TEL (Single)	TEL (Multi)	TEL (E→R)	ICR	GTS (Single)	GTS (Multi)	Overall
Closed-source
	Claude Sonnet 4.5	0.226	0.222	0.243	0.442	0.292	0.389	0.315
	Gemini 3.0 Pro	0.246	0.325	0.290	0.457	0.407	0.556	0.387
	GPT-5.2	0.334	0.371	0.358	0.438	0.390	0.558	0.411
Open-source General
	InternVL3.5-8B	0.239	0.295	0.193	0.552	0.371	0.389	0.358
	InternVL3.5-38B	0.298	0.306	0.224	0.571	0.515	0.510	0.418
	QwenVL3-32B	0.258	0.246	0.240	0.224	0.325	0.363	0.272
	DeepSeek-VL-16B	0.223	0.124	0.200	0.186	0.187	0.160	0.181
	IDEFICS2-8B	0.165	0.308	0.291	0.246	0.178	0.281	0.245
Medical-specialized
	Lingshu-7B	0.230	0.260	0.165	0.189	0.194	0.324	0.223
	Lingshu-32B	0.221	0.247	0.214	0.167	0.290	0.388	0.247
	MedGemma-4B	0.174	0.196	0.301	0.281	0.183	0.259	0.237
	MedGemma-27B	0.215	0.351	0.254	0.429	0.214	0.255	0.299
	Random Baseline	chance level across all subtasks						0.200

Five-way multiple-choice; random guessing = 20%. Stage-wise prompting yields +15–27% relative gains across model families.
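The two quantities in this note, distance from chance and relative gain, are simple arithmetic; the sketch below spells them out. The overall accuracies are copied from the table above, while the helper names and the 0.300 to 0.345 example are hypothetical and only illustrate what a 15% relative gain looks like.

```python
CHANCE = 1 / 5  # five answer options per question


def margin_over_chance(accuracy, chance=CHANCE):
    """Absolute accuracy above the random-guessing rate."""
    return accuracy - chance


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (improved - baseline) / baseline


# Overall accuracies copied from the results table.
overall = {"InternVL3.5-38B": 0.418, "GPT-5.2": 0.411, "DeepSeek-VL-16B": 0.181}
margins = {name: round(margin_over_chance(acc), 3) for name, acc in overall.items()}

# Hypothetical numbers, not from the paper: 0.300 rising to 0.345.
hypothetical_gain = relative_gain(0.300, 0.345)
```

A relative gain is measured against the model's own baseline accuracy, so a 15% relative gain on a 0.300 baseline corresponds to 4.5 absolute points.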

BibTeX

@misc{cho2026micxrbenchmarklongitudinalreasoning,
  title={MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays},
  author={Sunghwan Steve Cho and Yunseok Han and Jaeyoung Do},
  year={2026},
  eprint={2605.15574},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.15574},
}