How to Train Your Long-Context Visual Document Model

Daily Information Dashboard · 2026-02-16
Category: Research/Papers
Source: arxiv_search
Score: 92
Published: 2026-02-16T23:26:51Z

AI Summary

The paper presents a systematic study of the training pipeline for long-document vision-language models with contexts up to 344K, achieving SOTA on MMLongBenchDoc. Its significance lies in providing a reproducible recipe and in showing that visual long-context training transfers to improved long-text capability.
#arXiv #paper #Research/Papers #MMLongBenchDoc

Content Excerpt

We present the first comprehensive, large-scale study of training long-context vision-language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Beyond this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long-context transfer to the reverse, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces erroneous and low-quality examples in the benchmark.
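Finding (ii), page indexing, is simple enough to picture concretely. Below is a minimal Python sketch of the idea, assuming a "[Page k]" text marker is prepended to each page before the pages are concatenated into the prompt; the marker format, the function name, and the plain-text pages are all illustrative assumptions, not the paper's actual implementation. In the paper's VLM setting each page would be an image, with the index marker interleaved as text next to the image tokens.

```python
# Hedged sketch of finding (ii): tag each page with an explicit index before
# concatenating into a long-document prompt. The "[Page k]" marker and the
# helper name are hypothetical; real inputs would be page images, not strings.

def build_prompt_with_page_indices(pages: list[str], question: str) -> str:
    """Concatenate document pages, prefixing each with its 1-based page index."""
    indexed = [f"[Page {i}]\n{page}" for i, page in enumerate(pages, start=1)]
    document = "\n\n".join(indexed)
    return f"{document}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    pages = [
        "Q3 revenue grew 12% year over year.",
        "Headcount reached 4,200 by the end of the quarter.",
    ]
    print(build_prompt_with_page_indices(pages, "How much did Q3 revenue grow?"))
```

The plausible intuition, consistent with the abstract's claim, is that explicit indices give the model stable positional anchors for locating evidence inside a document hundreds of thousands of tokens long.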