ACL Findings 2026

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Sunghwan Steve Cho¹, Yunseok Han², Jaeyoung Do¹,²

AIDAS Laboratory  ·  ¹ECE & ²IPAI, Seoul National University
Corresponding author  ·  {steve97, qicher, jaeyoung.do}@snu.ac.kr


Abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context.

MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance (29.3% accuracy), only modestly above the 20% expected from random guessing. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning.
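To make the question format concrete, here is a minimal sketch of how one five-way item over a five-visit timeline could be represented and scored. The class name, field names, and option letters are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import Sequence

CHOICES = ("A", "B", "C", "D", "E")   # five answer options per question
NUM_VISITS = 5                        # five-visit patient timeline
CHANCE_ACCURACY = 1 / len(CHOICES)    # 0.2 under uniform random guessing


@dataclass(frozen=True)
class TimelineQuestion:
    """One hypothetical benchmark item (field names are assumptions)."""
    image_paths: Sequence[str]  # one image per visit, oldest first
    task: str                   # "TEL", "ICR", or "GTS"
    question: str
    options: Sequence[str]      # text for options A to E
    answer: str                 # gold option letter

    def __post_init__(self) -> None:
        if len(self.image_paths) != NUM_VISITS:
            raise ValueError("expected exactly five visits")
        if len(self.options) != len(CHOICES):
            raise ValueError("expected exactly five options")
        if self.answer not in CHOICES:
            raise ValueError("answer must be one of A-E")


def accuracy(items: Sequence[TimelineQuestion],
             predictions: Sequence[str]) -> float:
    """Fraction of items whose predicted letter matches the gold letter."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions differ in length")
    if not items:
        raise ValueError("no items to score")
    correct = sum(pred == item.answer
                  for item, pred in zip(items, predictions))
    return correct / len(items)
```

Scoring by exact match on the option letter is what lets the benchmark avoid free-form report generation: a model's output only needs to be mapped to one of five letters.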

Three Reasoning Capabilities

MI-CXR decomposes longitudinal CXR interpretation into three core capabilities that naturally arise in clinical workflows, jointly stressing different aspects of temporal reasoning.

Figure (MI-CXR overview): Overview of longitudinal medical VQA and MI-CXR. Three core reasoning capabilities (TEL, ICR, GTS) are evaluated over multi-visit CXR sequences using a diagnostic stage-wise decomposition.

TEL

Temporal Event Localization

Identify when clinically meaningful events—such as abnormality emergence, resolution, or recurrence—occur along the timeline.

ICR

Interval-wise Change Reasoning

Interpret visual changes between consecutive visits. The relevant interval is not pre-specified, requiring temporal localization before change interpretation.

GTS

Global Trajectory Summarization

Integrate evidence across all visits to characterize the overall disease course across the full timeline.
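The event types named under TEL (emergence, resolution, recurrence) can be illustrated with a small sketch. It assumes, purely for illustration, that a finding is summarized as one present/absent flag per visit; the function name and the rule that a reappearance after any earlier presence counts as recurrence are assumptions, not the benchmark's labeling procedure.

```python
from typing import List, Sequence, Tuple


def localize_events(present: Sequence[bool]) -> List[Tuple[int, str]]:
    """Return (visit_number, event) pairs; visits are numbered from 1.

    An event is attached to the later visit of the interval in which
    the finding's status changes.
    """
    events: List[Tuple[int, str]] = []
    seen_before = False  # was the finding present at any earlier visit?
    for i in range(1, len(present)):
        prev, curr = present[i - 1], present[i]
        if prev:
            seen_before = True
        if curr and not prev:
            # Absent -> present: first appearance or a reappearance.
            kind = "recurrence" if seen_before else "emergence"
            events.append((i + 1, kind))
        elif prev and not curr:
            # Present -> absent.
            events.append((i + 1, "resolution"))
    return events


# Five-visit timeline: absent, present, present, absent, present.
timeline = [False, True, True, False, True]
print(localize_events(timeline))
# [(2, 'emergence'), (4, 'resolution'), (5, 'recurrence')]
```

The same per-interval comparison also hints at why ICR is harder than a fixed image-pair task: the interval of interest has to be found among the four consecutive intervals before the change can be interpreted.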

Benchmark Construction

MI-CXR is built on MIMIC-CXR-JPG and MIMIC-Ext-CXR-QBA, retaining patients with at least five temporally ordered visits after rigorous annotation quality filtering. Task-specific QA pairs are instantiated from the same longitudinal timelines.

Figure (MI-CXR construction pipeline): Overview of MI-CXR construction. Structured metadata from MIMIC-Ext-CXR-QBA and CXR images from MIMIC-CXR-JPG are combined to construct patient-level longitudinal timelines. TEL, ICR, and GTS questions are all instantiated from the same timelines.
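The patient-retention step can be sketched as follows. This is a simplified illustration under stated assumptions: the `(patient_id, study_date, image_path)` record layout and the function name are hypothetical rather than the MIMIC schema, and the annotation quality filtering mentioned above is not modeled.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

MIN_VISITS = 5  # patients need at least five temporally ordered visits

Study = Tuple[str, date, str]  # (patient_id, study_date, image_path)


def build_timelines(studies: Iterable[Study]) -> Dict[str, List[Tuple[date, str]]]:
    """Group studies by patient, keep patients with enough visits,
    and order each retained patient's visits chronologically."""
    by_patient: Dict[str, List[Tuple[date, str]]] = defaultdict(list)
    for patient_id, study_date, image_path in studies:
        by_patient[patient_id].append((study_date, image_path))

    timelines: Dict[str, List[Tuple[date, str]]] = {}
    for patient_id, visits in by_patient.items():
        if len(visits) >= MIN_VISITS:
            # Tuples sort by date first, so this yields oldest-first order.
            timelines[patient_id] = sorted(visits)
    return timelines
```

How five visits are selected from patients with longer histories is not specified here, so the sketch stops at ordered timelines and leaves that choice open.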

Main Results

We evaluate 14 state-of-the-art VLMs under zero-shot prompting. Even the strongest models achieve well below 50% accuracy on this five-way task, revealing a fundamental gap in longitudinal temporal reasoning that persists across model categories and scales.

Category / Model        TEL (Single)  TEL (Multi)  TEL (E→R)  ICR    GTS (Single)  GTS (Multi)  Overall
Closed-source
  Claude Sonnet 4.5     0.226         0.222        0.243      0.442  0.292         0.389        0.315
  Gemini 3.0 Pro        0.246         0.325        0.290      0.457  0.407         0.556        0.387
  GPT-5.2               0.334         0.371        0.358      0.438  0.390         0.558        0.411
Open-source General
  InternVL3.5-8B        0.239         0.295        0.193      0.552  0.371         0.389        0.358
  InternVL3.5-38B       0.298         0.306        0.224      0.571  0.515         0.510        0.418
  QwenVL3-32B           0.258         0.246        0.240      0.224  0.325         0.363        0.272
  DeepSeek-VL-16B       0.223         0.124        0.200      0.186  0.187         0.160        0.181
  IDEFICS2-8B           0.165         0.308        0.291      0.246  0.178         0.281        0.245
Medical-specialized
  Lingshu-7B            0.230         0.260        0.165      0.189  0.194         0.324        0.223
  Lingshu-32B           0.221         0.247        0.214      0.167  0.290         0.388        0.247
  MedGemma-4B           0.174         0.196        0.301      0.281  0.183         0.259        0.237
  MedGemma-27B          0.215         0.351        0.254      0.429  0.214         0.255        0.299
Random Baseline         0.200 (chance level across all subtasks)

Five-way multiple-choice; random guessing = 20%. Stage-wise prompting yields +15–27% relative gains across model families.
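The arithmetic behind "above chance" and "relative gain" is easy to mix up, so a short sketch may help. The 0.300 baseline below is a made-up illustrative value, not a reported result; only the 20% chance level, the 29.3% overall accuracy, and the 15% to 27% relative range come from the text.

```python
CHANCE = 1 / 5  # five-way multiple choice, uniform guessing


def points_above_chance(acc: float) -> float:
    """Absolute margin over chance, as a fraction (0.093 = 9.3 points)."""
    return acc - CHANCE


def apply_relative_gain(acc: float, rel_gain: float) -> float:
    """Accuracy after a relative gain, e.g. rel_gain=0.15 for +15%."""
    return acc * (1 + rel_gain)


# Reported overall accuracy of 29.3% sits 9.3 points above chance.
print(round(points_above_chance(0.293), 3))        # 0.093

# Hypothetical baseline of 0.300 under the reported relative range.
print(round(apply_relative_gain(0.300, 0.15), 3))  # 0.345
print(round(apply_relative_gain(0.300, 0.27), 3))  # 0.381
```

A relative gain scales the baseline accuracy, so a +15% relative gain on a baseline near 0.30 adds roughly 4.5 accuracy points rather than 15.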

BibTeX

@misc{cho2026micxrbenchmarklongitudinalreasoning,
  title={MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays},
  author={Sunghwan Steve Cho and Yunseok Han and Jaeyoung Do},
  year={2026},
  eprint={2605.15574},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.15574},
}