How to Train Your Long-Context Visual Document Model

Daily Information Dashboard · 2026-02-16
Category: Research/Papers
Source: arxiv_search
Score: 92
Published: 2026-02-16T23:26:51Z

AI Summary

The paper presents a systematic study of the training pipeline for long-document vision-language models with contexts up to 344K, achieving SOTA on MMLongBenchDoc. Its significance lies in providing a reproducible recipe and in showing that visual long-context training transfers to improved long-text capability.
#arXiv #paper #Research/Papers #MMLongBenchDoc

Content Excerpt

We present the first comprehensive, large-scale study of training long-context vision-language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. To bridge this gap, we systematically study continued pretraining, supervised finetuning, and preference optimization for 24B- and 32B-parameter models, backed by extensive long-context evaluations and ablations, and achieve state-of-the-art performance on MMLongBenchDoc at both parameter scales. Beyond this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long-context transfer to the reverse, showing that visual long-context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc that reduces erroneous and low-quality examples in the benchmark.
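Finding (ii), page indexing, is simple enough to picture concretely. Below is a minimal Python sketch of the idea, assuming a "[Page k]" text marker is prepended to each page before the pages are concatenated into the prompt; the marker format, the function name, and the plain-text pages are all illustrative assumptions, not the paper's actual implementation. In the paper's VLM setting each page would be an image, with the index marker interleaved as text next to the image tokens.

```python
# Hedged sketch of finding (ii): tag each page with an explicit index before
# concatenating into a long-document prompt. The "[Page k]" marker and the
# helper name are hypothetical; real inputs would be page images, not strings.

def build_prompt_with_page_indices(pages: list[str], question: str) -> str:
    """Concatenate document pages, prefixing each with its 1-based page index."""
    indexed = [f"[Page {i}]\n{page}" for i, page in enumerate(pages, start=1)]
    document = "\n\n".join(indexed)
    return f"{document}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    pages = [
        "Q3 revenue grew 12% year over year.",
        "Headcount reached 4,200 by the end of the quarter.",
    ]
    print(build_prompt_with_page_indices(pages, "How much did Q3 revenue grow?"))
```

The plausible intuition, consistent with the abstract's claim, is that explicit indices give the model stable positional anchors for locating evidence inside a document hundreds of thousands of tokens long.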