MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

1CSE Department, HKUST 2Tencent AI Seattle Lab 3University of Edinburgh 4Miniml.AI 5NVIDIA AI Technology Center (NVAITC)

Abstract

We introduce MMLongBench, the first benchmark featuring a diverse set of long-context vision-language tasks for the comprehensive evaluation of long-context vision-language models (LCVLMs). MMLongBench contains 13,331 examples across five task categories, such as Visual RAG and Many-Shot ICL, and covers a wide range of natural and synthetic image types. Each example is provided at five standardized input lengths (8K, 16K, 32K, 64K, and 128K tokens) using a cross-modal tokenization scheme that unifies vision patches and text tokens, enabling robust evaluation across varying context lengths. We benchmark 46 closed-source and open-source LCVLMs and provide a comprehensive analysis of the current models' vision-language long-context ability. We find that:

  • performance on a single task is a weak proxy for overall long-context capability;
  • both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement;
  • models with stronger reasoning ability tend to exhibit better long-context performance.
By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
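For intuition, here is a minimal sketch of the kind of unified cross-modal length counting described above: text is tokenized with a standard tokenizer, and each image contributes visual tokens proportional to its patch grid. The patch size, merge factor, and model name below are illustrative assumptions, not MMLongBench's exact scheme.

```python
from transformers import AutoTokenizer
from PIL import Image

PATCH_SIZE = 14   # assumed vision patch size
MERGE = 2         # assumed 2x2 patch merging before the language model

def image_tokens(img: Image.Image) -> int:
    """Approximate visual token count from the image resolution (assumption)."""
    w, h = img.size
    return max(1, w // (PATCH_SIZE * MERGE)) * max(1, h // (PATCH_SIZE * MERGE))

def context_length(text: str, images: list[Image.Image], tokenizer) -> int:
    """Unified length: text tokens plus visual tokens from every image."""
    return len(tokenizer(text)["input_ids"]) + sum(image_tokens(im) for im in images)

# Example usage (model name is only an example):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# n = context_length(prompt, page_images, tok)
```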

📚 Comprehensive Evaluation

Recent advances in long-context vision-language models (LCVLMs) have unlocked a wide array of new capabilities for large vision-language models (LVLMs), such as document-level visual question answering and multi-hop reasoning across web pages. While researchers have proposed various techniques to extend context windows, the development of effective evaluation benchmarks lags behind. It remains unclear how well current LCVLMs perform in long-context settings, what types of tasks they struggle with, and how robust they are to variation in input length. We have released the full set of 13,331 visual long-context examples. MMLongBench was created to comprehensively evaluate the long-context ability of LCVLMs with diverse tasks across five categories:

  • Visual Retrieval-Augmented Generation
  • Needle-In-A-Haystack
  • Many-Shot In-Context Learning
  • Summarization
  • Long-Document VQA

Here is a comparison with previous datasets. We find that existing datasets:
  • have limited coverage of downstream tasks, covering only NIAH or Long-Document VQA;
  • have insufficient coverage of image types, including only natural images (e.g., photographs) or synthetic images (e.g., scanned PDFs), but not both;
  • lack a consensus on cross-modal length control, especially for image tokens; most works simply use the number of images as the context length (see the sketch after this list);
  • provide each example with only a single context of arbitrary length.
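As a rough illustration (not the authors' exact procedure), each standardized context length can be enforced by appending distractor items, e.g., retrieved passages, haystack images, or ICL demonstrations, until a unified token budget such as 8K or 128K is reached. The `length_fn` argument below stands in for a cross-modal counter like the one sketched earlier.

```python
def pad_to_budget(required_items, distractors, budget, length_fn):
    """Append distractors to the required items until one more would exceed
    the target token budget (e.g., 8K, 16K, 32K, 64K, or 128K)."""
    context = list(required_items)
    for item in distractors:
        if length_fn(context + [item]) > budget:
            break
        context.append(item)
    return context

# Example: context_64k = pad_to_budget(gold_pages, negative_pages, 64_000, length_fn)
```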
Comparison of MMLongBench and previous datasets.
An overview of our benchmark.

๐Ÿ† Leader Board

We broadly evaluate 46 LCVLMs, including both closed-source and open-source models (from 1B to 72B parameters).
We will post an online leaderboard soon.
The results of 46 LCVLMs on our benchmark.
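To give a flavor of how the benchmark might be consumed, below is a hypothetical scoring loop. The dataset identifier, field names, and the substring-match metric are illustrative assumptions only; MMLongBench scores each task category with its own metric.

```python
from datasets import load_dataset

def substring_exact_match(prediction: str, answers: list[str]) -> bool:
    """True if any gold answer appears in the model prediction (case-insensitive)."""
    pred = prediction.lower()
    return any(ans.lower() in pred for ans in answers)

def evaluate(model_fn, dataset):
    """model_fn takes (images, question) and returns a text prediction.
    The field names below ("images", "question", "answers") are assumptions."""
    correct = sum(
        substring_exact_match(model_fn(ex["images"], ex["question"]), ex["answers"])
        for ex in dataset
    )
    return correct / len(dataset)

# ds = load_dataset("<mmlongbench-dataset-id>", split="test")  # placeholder id
# score = evaluate(my_model_fn, ds)
```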

Concrete Examples

📌 Example 1: Visual Retrieval-Augmented Generation (InfoSeek)
Example of the InfoSeek dataset in the visual retrieval-augmented generation category.
📌 Example 2: Needle-In-A-Haystack (Visual Haystack)
Example of the Visual Haystack-Single dataset in the NIAH category. Note: the input image list is shown in two columns for display clarity; in the actual input, the images are arranged in a single sequence.
📌 Example 3: Needle-In-A-Haystack (MM-NIAH)
Example of the MM-NIAH-Ret dataset in the NIAH category.
📌 Example 4: Many-Shot In-Context Learning (Stanford Cars)
Example of the Stanford Cars dataset in the ICL category. Note: the input image list is shown in three columns for display clarity; in the actual input, the images are arranged in a single sequence.
📌 Example 5: Summarization (GovReport)
Example of the GovReport dataset in the summarization category. Only two pages are shown due to limited space.
📌 Example 6: Long-Document VQA (LongDocURL)
Example of the LongDocURL dataset in the DocVQA category. Only three pages are shown due to limited space.

BibTeX

@misc{wang2025mmlongbenchbenchmarkinglongcontextvisionlanguage,
title={MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly}, 
author={Zhaowei Wang and Wenhao Yu and Xiyu Ren and Jipeng Zhang and Yu Zhao and Rohit Saxena and Liang Cheng and Ginny Wong and Simon See and Pasquale Minervini and Yangqiu Song and Mark Steedman},
year={2025},
eprint={2505.10610},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.10610}, 
}