A Multi-Dimensional Quality Scoring Framework for Decentralized LLM Inference with Proof of Quality

Daily Information Dashboard · 2026-03-05

Research/Papers

#arXiv #paper #Research/Papers

Content excerpt

Decentralized large language model (LLM) inference networks can pool heterogeneous compute to scale serving, but they require lightweight and incentive-compatible mechanisms to assess output quality. Prior work introduced cost-aware Proof of Quality (PoQ) and adaptive robust PoQ to allocate rewards under evaluator heterogeneity and adversarial behavior. In this paper, we focus on the quality signal itself and propose a multi-dimensional quality scoring framework that decomposes output quality into modular dimensions, including model and cost priors, structure quality, semantic quality, query-output alignment, and agreement/uncertainty. Using logged outputs from QA and summarization tasks, we systematically audit dimension reliability and show that seemingly reasonable dimensions can be task-dependent and even negatively correlated with reference quality without calibration. While the default composite underperforms a strong single semantic evaluator, ablations reveal that removing unreliable dimensions and re-normalizing weights yields a calibrated composite that matches or exceeds the best single-evaluator and consensus baselines. Finally, we integrate the composite score as a drop-in quality signal in PoQ and demonstrate complementary benefits with robust aggregation and adaptive trust weighting under adversarial evaluator attacks.
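
The calibration step described in the excerpt (audit each dimension against reference quality, drop the unreliable ones, re-normalize the remaining weights) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the dimension names, the toy scores and weights, the use of Pearson correlation as the reliability measure, and the zero threshold for dropping a dimension.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant dimension carries no signal
    return cov / (sx * sy)


def audit_dimensions(dimension_scores, reference):
    """Correlate each dimension's per-output scores with reference quality."""
    return {dim: pearson(vals, reference) for dim, vals in dimension_scores.items()}


def calibrate_weights(weights, correlations, min_corr=0.0):
    """Drop dimensions with correlation <= min_corr (assumed threshold),
    then re-normalize the surviving weights so they sum to 1."""
    kept = {d: w for d, w in weights.items() if correlations[d] > min_corr}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no reliable dimensions left after calibration")
    return {d: w / total for d, w in kept.items()}


def composite_score(output_scores, weights):
    """Weighted sum over the dimensions that survived calibration."""
    return sum(w * output_scores[d] for d, w in weights.items())


# Toy data (hypothetical): three logged outputs with reference quality labels.
reference = [0.2, 0.5, 0.9]
dimension_scores = {
    "semantic": [0.3, 0.5, 0.8],   # tracks reference quality
    "structure": [0.9, 0.6, 0.1],  # negatively correlated on this task
    "alignment": [0.1, 0.4, 0.7],  # tracks reference quality
}
weights = {"semantic": 0.5, "structure": 0.25, "alignment": 0.25}

correlations = audit_dimensions(dimension_scores, reference)
calibrated = calibrate_weights(weights, correlations)
first_output = {dim: vals[0] for dim, vals in dimension_scores.items()}
print(calibrated)
print(composite_score(first_output, calibrated))
```

In this toy setup the "structure" dimension is dropped because it moves against the reference labels, and the remaining weights are rescaled to sum to 1. A real audit would presumably run per task, since the excerpt notes that dimension reliability is task-dependent.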