Fast and Effective On-policy Distillation from Reasoning Prefixes

Daily Info Dashboard · 2026-02-16
Category: Research/Papers
Source: arxiv_search
Score: 98
Published: 2026-02-16T23:28:54Z

AI Summary

The paper proposes applying on-policy distillation only to the prefix of each student output and terminating sampling early. On math and out-of-domain benchmarks this matches the performance of full OPD while cutting training compute by 2x to 47x, improving the efficiency and scalability of training long-reasoning models.
#arXiv #paper #Research/Papers #On-policy Distillation #AI-for-Math

Excerpt

On-policy distillation (OPD), which samples trajectories from the student model and supervises them with a teacher at the token level, avoids relying solely on verifiable terminal rewards and can yield better generalization than off-policy distillation. However, OPD requires expensive on-the-fly sampling from the student policy during training, which substantially increases training cost, especially for long responses. Our initial analysis shows that, during OPD, training signals are often concentrated in the prefix of each output, and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Motivated by these observations, we propose a simple yet effective modification of OPD: we apply the distillation objective only to prefixes of student-generated outputs and terminate each rollout early during distillation. Experiments on a suite of AI-for-Math and out-of-domain benchmarks show that on-policy prefix distillation matches the performance of full OPD while reducing training FLOPs by 2x-47x.
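
To make the two ingredients concrete (early-terminated on-policy sampling, plus a token-level teacher loss applied only to the prefix), here is a minimal sketch assuming Hugging Face-style causal LMs for `student` and `teacher`. The prefix length `K`, the forward-KL token loss, and all function and variable names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def prefix_distillation_step(student, teacher, prompt_ids, K=256):
    """One on-policy prefix distillation step (sketch, not the paper's code).

    prompt_ids: LongTensor [batch, prompt_len] of input token ids.
    K: maximum number of student-generated tokens to sample and supervise.
    """
    prompt_len = prompt_ids.shape[1]

    # 1) On-policy sampling, terminated early: draw at most K tokens from
    #    the *student* instead of rolling out a full-length response.
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=K, do_sample=True)
    prefix = rollout[:, : prompt_len + K]

    # 2) Token-level teacher supervision, applied only to the
    #    student-generated prefix positions.
    with torch.no_grad():
        teacher_logits = teacher(prefix).logits
    student_logits = student(prefix).logits  # recomputed with grad enabled

    # Logits at position i predict token i+1, so the positions predicting
    # the generated tokens run from prompt_len-1 to prefix_len-2.
    gen = slice(prompt_len - 1, prefix.shape[1] - 1)
    t_logp = F.log_softmax(teacher_logits[:, gen], dim=-1)
    s_logp = F.log_softmax(student_logits[:, gen], dim=-1)

    # Per-token KL(teacher || student) over the prefix; reverse KL is a
    # common alternative objective in on-policy distillation.
    loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    return loss
```

Note that `max_new_tokens=K` is where the claimed FLOPs savings come from: for long-reasoning models, truncating each rollout to a short prefix removes most of the sampling cost, while the loss is computed exactly as in full OPD, just over fewer positions.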