MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

1CSE Department, HKUST 2Tencent AI Seattle Lab 3University of Edinburgh 4Miniml.AI 5NVIDIA AI Technology Center (NVAITC)

Abstract

We introduce MMLongBench, the first benchmark featuring a diverse set of long-context vision-language tasks for the comprehensive evaluation of long-context vision-language models (LCVLMs). MMLongBench contains 13,331 examples across five task categories, such as Visual RAG and Many-Shot ICL, and covers a wide range of natural and synthetic image types. Each example is provided at five standardized input lengths (8K, 16K, 32K, 64K, and 128K tokens) using a cross-modal tokenization scheme that unifies vision patches and text tokens, enabling robust evaluation across varying context lengths. We benchmark 46 closed-source and open-source LCVLMs and provide a comprehensive analysis of the current models' vision-language long-context ability. We find that:

  • performance on a single task is a weak proxy for overall long-context capability;
  • both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement;
  • models with stronger reasoning ability tend to exhibit better long-context performance.
By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
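For intuition, here is a minimal sketch of the kind of unified cross-modal length counting described above: text is tokenized with a standard tokenizer, and each image contributes visual tokens proportional to its patch grid. The patch size, merge factor, and model name below are illustrative assumptions, not MMLongBench's exact scheme.

```python
from transformers import AutoTokenizer
from PIL import Image

PATCH_SIZE = 14   # assumed vision patch size
MERGE = 2         # assumed 2x2 patch merging before the language model

def image_tokens(img: Image.Image) -> int:
    """Approximate visual token count from the image resolution (assumption)."""
    w, h = img.size
    return max(1, w // (PATCH_SIZE * MERGE)) * max(1, h // (PATCH_SIZE * MERGE))

def context_length(text: str, images: list[Image.Image], tokenizer) -> int:
    """Unified length: text tokens plus visual tokens from every image."""
    return len(tokenizer(text)["input_ids"]) + sum(image_tokens(im) for im in images)

# Example usage (model name is only an example):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# n = context_length(prompt, page_images, tok)
```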

📚 Comprehensive Evaluation

Recent advances in long-context vision-language models (LCVLMs) have unlocked a wide array of new capabilities for large vision-language models (LVLMs), such as document-level visual question answering and multi-hop reasoning across web pages. While researchers have proposed various techniques to extend context windows, the development of effective evaluation benchmarks lags behind. It remains unclear how well current LCVLMs perform in long-context settings, what types of tasks they struggle with, and how robust they are to variation in input length. We have released the full set of 13,331 visual long-context examples. MMLongBench was created to comprehensively evaluate the long-context ability of LCVLMs with diverse tasks across five categories:

  • Visual Retrieval-Augmented Generation
  • Needle-In-A-Haystack
  • Many-Shot In-Context Learning
  • Summarization
  • Long-Document VQA

Here is a comparison with previous datasets. We find that existing datasets:
  • have limited coverage of downstream tasks, covering only NIAH or Long-Document VQA;
  • have insufficient coverage of image types, including only natural images (e.g., photographs) or synthetic images (e.g., scanned PDFs), but not both;
  • lack a consensus on cross-modal length control, especially for image tokens; most works simply use the number of images as the context length (see the sketch after this list);
  • provide each example with only a single context of arbitrary length.
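As a rough illustration (not the authors' exact procedure), each standardized context length can be enforced by appending distractor items, e.g., retrieved passages, haystack images, or ICL demonstrations, until a unified token budget such as 8K or 128K is reached. The `length_fn` argument below stands in for a cross-modal counter like the one sketched earlier.

```python
def pad_to_budget(required_items, distractors, budget, length_fn):
    """Append distractors to the required items until one more would exceed
    the target token budget (e.g., 8K, 16K, 32K, 64K, or 128K)."""
    context = list(required_items)
    for item in distractors:
        if length_fn(context + [item]) > budget:
            break
        context.append(item)
    return context

# Example: context_64k = pad_to_budget(gold_pages, negative_pages, 64_000, length_fn)
```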
Comparison of MMLongBench and previous datasets.
An overview of our benchmark.

๐Ÿ† Leader Board

We broadly evaluate 46 LCVLMs, including both closed-source and open-source models (from 1B to 72B parameters).
We will post an online leaderboard soon.
The results of 46 LCVLMs on our benchmark.
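To give a flavor of how the benchmark might be consumed, below is a hypothetical scoring loop. The dataset identifier, field names, and the substring-match metric are illustrative assumptions only; MMLongBench scores each task category with its own metric.

```python
from datasets import load_dataset

def substring_exact_match(prediction: str, answers: list[str]) -> bool:
    """True if any gold answer appears in the model prediction (case-insensitive)."""
    pred = prediction.lower()
    return any(ans.lower() in pred for ans in answers)

def evaluate(model_fn, dataset):
    """model_fn takes (images, question) and returns a text prediction.
    The field names below ("images", "question", "answers") are assumptions."""
    correct = sum(
        substring_exact_match(model_fn(ex["images"], ex["question"]), ex["answers"])
        for ex in dataset
    )
    return correct / len(dataset)

# ds = load_dataset("<mmlongbench-dataset-id>", split="test")  # placeholder id
# score = evaluate(my_model_fn, ds)
```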

Concrete Examples

📌 Example 1: Visual Retrieval-Augmented Generation (InfoSeek)
Example of the InfoSeek dataset in the visual retrieval-augmented generation category.
📌 Example 2: Needle-In-A-Haystack (Visual Haystack)
Example of the Visual Haystack-Single dataset in the NIAH category. Note: the input image list is shown in two columns for display clarity; in the actual input, the images are arranged in a single sequence.
📌 Example 3: Needle-In-A-Haystack (MM-NIAH)
Example of the MM-NIAH-Ret dataset in the NIAH category.
📌 Example 4: Many-Shot In-Context Learning (Stanford Cars)
Example of the Stanford Cars dataset in the ICL category. Note: the input image list is shown in three columns for display clarity; in the actual input, the images are arranged in a single sequence.
📌 Example 5: Summarization (GovReport)
Example of the GovReport dataset in the summarization category. Only two pages are shown due to limited space.
📌 Example 6: Long-Document VQA (LongDocURL)
Example of the LongDocURL dataset in the DocVQA category. Only three pages are shown due to limited space.

BibTeX

@misc{wang2025mmlongbenchbenchmarkinglongcontextvisionlanguage,
title={MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly}, 
author={Zhaowei Wang and Wenhao Yu and Xiyu Ren and Jipeng Zhang and Yu Zhao and Rohit Saxena and Liang Cheng and Ginny Wong and Simon See and Pasquale Minervini and Yangqiu Song and Mark Steedman},
year={2025},
eprint={2505.10610},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.10610}, 
}