AIDAS Laboratory · ¹ECE & ²IPAI, Seoul National University
† Corresponding author · {steve97, qicher, jaeyoung.do}@snu.ac.kr
Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context.
MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization. Evaluating 14 state-of-the-art VLMs shows low overall performance (29.3% accuracy), only modestly above random guessing. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning.
MI-CXR decomposes longitudinal CXR interpretation into three core capabilities that naturally arise in clinical workflows, jointly stressing different aspects of temporal reasoning.
Temporal Event Localization (TEL): Identify when clinically meaningful events (such as abnormality emergence, resolution, or recurrence) occur along the timeline.
Interval-wise Change Reasoning (ICR): Interpret visual changes between consecutive visits. The relevant interval is not pre-specified, so the model must localize it in time before interpreting the change.
Global Trajectory Summarization (GTS): Integrate evidence from all visits to characterize the overall disease course.
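As a rough illustration of what one benchmark item might look like, the sketch below models a five-way multiple-choice question over a five-visit timeline and tags it with one of the three task families. The class and field names (`TaskFamily`, `MCQItem`, `image_paths`, `answer_index`) are hypothetical and chosen for this example; the released data format may differ.

```python
from dataclasses import dataclass
from enum import Enum

NUM_VISITS = 5   # five-visit patient timelines
NUM_OPTIONS = 5  # five-way multiple choice


class TaskFamily(Enum):
    TEL = "Temporal Event Localization"
    ICR = "Interval-wise Change Reasoning"
    GTS = "Global Trajectory Summarization"


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical container for one question; not the official schema."""
    task: TaskFamily
    image_paths: tuple  # one CXR per visit, in temporal order
    question: str
    options: tuple      # the five answer choices
    answer_index: int   # index of the correct choice in `options`

    def __post_init__(self):
        # Enforce the fixed shape described in the text.
        if len(self.image_paths) != NUM_VISITS:
            raise ValueError(f"expected {NUM_VISITS} visits")
        if len(self.options) != NUM_OPTIONS:
            raise ValueError(f"expected {NUM_OPTIONS} options")
        if not 0 <= self.answer_index < NUM_OPTIONS:
            raise ValueError("answer_index out of range")


# Illustrative item with made-up paths and wording.
example = MCQItem(
    task=TaskFamily.TEL,
    image_paths=tuple(f"visit_{i}.jpg" for i in range(1, 6)),
    question="At which visit does the finding first appear?",
    options=("Visit 1", "Visit 2", "Visit 3", "Visit 4", "Visit 5"),
    answer_index=2,
)
```

Keeping the shape checks in `__post_init__` means a malformed item fails at construction time rather than during evaluation.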
MI-CXR is built on MIMIC-CXR-JPG and MIMIC-Ext-CXR-QBA, retaining patients with at least five temporally ordered visits after rigorous annotation quality filtering. Task-specific QA pairs are instantiated from the same longitudinal timelines.
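The visit-count filter can be sketched as follows. This is a minimal illustration only: the record layout `(patient_id, study_id, study_time)` and the function name `build_timelines` are assumptions, the annotation quality filtering is omitted, and how a five-visit window is chosen from longer histories is not specified here, so the sketch simply keeps each patient's whole ordered history.

```python
from collections import defaultdict
from datetime import datetime


def build_timelines(records, min_visits=5):
    """Group studies by patient, order them in time, and keep patients
    with at least `min_visits` studies.

    `records` is an iterable of (patient_id, study_id, study_time) tuples;
    this layout is a hypothetical stand-in for the real metadata tables.
    """
    by_patient = defaultdict(list)
    for patient_id, study_id, study_time in records:
        by_patient[patient_id].append((study_time, study_id))

    timelines = {}
    for patient_id, visits in by_patient.items():
        if len(visits) < min_visits:
            continue  # too few visits for a longitudinal timeline
        visits.sort()  # chronological order by study_time
        timelines[patient_id] = [study_id for _, study_id in visits]
    return timelines


# Made-up records: p1 has five visits (given out of order), p2 has three.
records = [
    ("p1", "s3", datetime(2020, 3, 1)),
    ("p1", "s1", datetime(2020, 1, 1)),
    ("p1", "s5", datetime(2020, 5, 1)),
    ("p1", "s2", datetime(2020, 2, 1)),
    ("p1", "s4", datetime(2020, 4, 1)),
    ("p2", "t1", datetime(2021, 1, 1)),
    ("p2", "t2", datetime(2021, 2, 1)),
    ("p2", "t3", datetime(2021, 3, 1)),
]
timelines = build_timelines(records)
```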
We evaluate 14 state-of-the-art VLMs under zero-shot prompting. Even the strongest models stay well below 50% overall accuracy on this five-way task (chance is 20%), revealing a fundamental gap in longitudinal temporal reasoning that persists across model categories and scales.
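For readers who want to reproduce a zero-shot setup of this kind, the sketch below shows one plausible way to render a five-way question as text and to read a choice letter back from a model reply. The prompt wording and the helper names `format_prompt` and `parse_choice` are assumptions made for illustration; the prompts actually used in the evaluation are not given here.

```python
import re

LETTERS = "ABCDE"

# Accept replies that begin with a bare or parenthesized choice letter,
# e.g. "C", "C.", "(C) ...". Anything else is treated as unparseable.
_CHOICE_RE = re.compile(r"^\(?([A-E])\)?(?:[\s.:)]|$)")


def format_prompt(question, options, num_visits=5):
    """Build a text prompt; the images themselves are passed separately."""
    if len(options) != len(LETTERS):
        raise ValueError("expected five options")
    lines = [
        f"You are given {num_visits} chest X-rays from the same patient, "
        "ordered from earliest (Visit 1) to latest.",
        question,
    ]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer with a single letter (A-E).")
    return "\n".join(lines)


def parse_choice(reply):
    """Return the chosen letter, or None if the reply does not start with one."""
    match = _CHOICE_RE.match(reply.strip())
    return match.group(1) if match else None


prompt = format_prompt(
    "Between which visits does the finding resolve?",
    ["1 and 2", "2 and 3", "3 and 4", "4 and 5", "It never resolves"],
)
```

Treating unparseable replies as `None` (and scoring them as incorrect) is a conservative choice; a more permissive parser would raise measured accuracy for verbose models.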
| Category | Model | TEL (Single) | TEL (Multi) | TEL (E→R) | ICR | GTS (Single) | GTS (Multi) | Overall |
|---|---|---|---|---|---|---|---|---|
| Closed-source | Claude Sonnet 4.5 | 0.226 | 0.222 | 0.243 | 0.442 | 0.292 | 0.389 | 0.315 |
| Closed-source | Gemini 3.0 Pro | 0.246 | 0.325 | 0.290 | 0.457 | 0.407 | 0.556 | 0.387 |
| Closed-source | GPT-5.2 | 0.334 | 0.371 | 0.358 | 0.438 | 0.390 | 0.558 | 0.411 |
| Open-source General | InternVL3.5-8B | 0.239 | 0.295 | 0.193 | 0.552 | 0.371 | 0.389 | 0.358 |
| Open-source General | InternVL3.5-38B | 0.298 | 0.306 | 0.224 | 0.571 | 0.515 | 0.510 | 0.418 |
| Open-source General | QwenVL3-32B | 0.258 | 0.246 | 0.240 | 0.224 | 0.325 | 0.363 | 0.272 |
| Open-source General | DeepSeek-VL-16B | 0.223 | 0.124 | 0.200 | 0.186 | 0.187 | 0.160 | 0.181 |
| Open-source General | IDEFICS2-8B | 0.165 | 0.308 | 0.291 | 0.246 | 0.178 | 0.281 | 0.245 |
| Medical-specialized | Lingshu-7B | 0.230 | 0.260 | 0.165 | 0.189 | 0.194 | 0.324 | 0.223 |
| Medical-specialized | Lingshu-32B | 0.221 | 0.247 | 0.214 | 0.167 | 0.290 | 0.388 | 0.247 |
| Medical-specialized | MedGemma-4B | 0.174 | 0.196 | 0.301 | 0.281 | 0.183 | 0.259 | 0.237 |
| Medical-specialized | MedGemma-27B | 0.215 | 0.351 | 0.254 | 0.429 | 0.214 | 0.255 | 0.299 |
| Baseline | Random (chance level on every subtask) | 0.200 | 0.200 | 0.200 | 0.200 | 0.200 | 0.200 | 0.200 |
Five-way multiple-choice; random guessing = 20%. Stage-wise prompting yields +15–27% relative gains across model families.
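Because the gains are reported in relative terms, it helps to separate them from absolute percentage points. The sketch below computes plain accuracy and relative gain; the function names and the example numbers (0.300 and 0.345) are illustrative assumptions, not results from the benchmark.

```python
CHANCE = 0.20  # five-way multiple choice


def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def relative_gain(baseline, improved):
    """Improvement expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


# Illustrative numbers only: a +15% relative gain on a 0.300 baseline
# lands at 0.345, i.e. 4.5 percentage points, not 15.
gain = relative_gain(0.300, 0.345)
absolute_points = (0.345 - 0.300) * 100
```

Read this way, a relative gain in the reported range moves a model that scores near 0.30 by only a few percentage points in absolute terms.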