We introduce MMLongBench, the first benchmark featuring a diverse set of long-context vision-language tasks for comprehensive evaluation of LCVLMs.
MMLongBench comprises 13,331 examples spanning five task categories, including Visual RAG and Many-Shot ICL, and covers a wide range of natural and synthetic image types.
Each example is provided at five standardized input lengths (8K, 16K, 32K, 64K, and 128K tokens) using a cross-modal tokenization scheme that counts vision patches and text tokens under a single budget, enabling rigorous evaluation across varying context lengths.
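As a rough illustration of how such a unified length budget can be computed, the sketch below counts text tokens with an off-the-shelf tokenizer and approximates each image's cost by its number of vision patches. The patch size, tokenizer choice, and helper names are assumptions for illustration only, not MMLongBench's actual implementation.

```python
# Illustrative cross-modal length counting (hypothetical; MMLongBench's real
# tokenizer, patch size, and truncation policy may differ).
from PIL import Image
from transformers import AutoTokenizer

PATCH_SIZE = 28          # assumed pixels per vision-patch side (illustrative)
TARGET_LENGTHS = [8_000, 16_000, 32_000, 64_000, 128_000]

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any text tokenizer works here


def image_token_count(image: Image.Image) -> int:
    """Approximate an image's token cost as its number of vision patches."""
    w, h = image.size
    return (w // PATCH_SIZE) * (h // PATCH_SIZE)


def combined_length(texts: list[str], images: list[Image.Image]) -> int:
    """Unified length: text tokens plus vision-patch tokens."""
    text_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    vision_tokens = sum(image_token_count(img) for img in images)
    return text_tokens + vision_tokens


def fits(texts: list[str], images: list[Image.Image], budget: int) -> bool:
    """Check whether an interleaved example fits a standardized length budget."""
    return combined_length(texts, images) <= budget
```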
We benchmark 46 closed-source and open-source LCVLMs and provide a comprehensive analysis of their long-context vision-language abilities.
We find that:
- performance on a single task is a weak proxy for overall long-context capability;
- both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement;
- models with stronger reasoning ability tend to exhibit better long-context performance.
By offering broad task coverage, diverse image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.