2026-03-07 | alignment, reasoning, code

Entropy-Aware On-Policy Distillation of Language Models

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee

Key claim

Balancing reverse and forward KL divergence improves knowledge transfer between language models.

The paper presents Entropy-Aware On-Policy Distillation (EOPD), which improves knowledge transfer between language models by balancing precision and diversity. The key result is a significant accuracy gain across multiple benchmarks, indicating that accounting for teacher uncertainty improves student-teacher alignment.

Novelty

8.0/10

Introduces a new method that combines reverse and forward KL divergence for better knowledge transfer.

Reliability

7.5/10

The methodology is solid and is backed by experimental results across multiple benchmarks.

Deep reliability assessment

The methodology supports the claim that selectively applying forward KL in high-entropy regions can improve student model performance by preserving teacher uncertainty. However, the paper may overclaim how generally the method applies across model architectures, since that has not been tested extensively.
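To make the idea of selectively applying forward KL concrete, here is a minimal sketch of an entropy-gated per-token loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the additive combination rule (reverse KL everywhere, plus forward KL where teacher entropy exceeds a threshold), and the threshold value are all hypothetical. The official repository should be consulted for the actual objective.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def entropy_gated_loss(teacher, student, threshold=1.0):
    """Mean per-token loss over a sequence of next-token distributions.

    ASSUMPTION (not confirmed by the summary): reverse KL is applied at
    every position, and a forward KL term is added only at positions
    where the teacher's entropy exceeds `threshold`.
    """
    total = 0.0
    for t, s in zip(teacher, student):
        total += kl(s, t)  # reverse KL(student || teacher): mode-seeking
        if entropy(t) > threshold:
            total += kl(t, s)  # forward KL(teacher || student): mode-covering
    return total / len(teacher)

# Two positions: a confident teacher, then a uniform (high-entropy) teacher.
teacher = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
student = [[0.70, 0.10, 0.10, 0.10], [0.70, 0.10, 0.10, 0.10]]
gated = entropy_gated_loss(teacher, student, threshold=1.0)
reverse_only = entropy_gated_loss(teacher, student, threshold=10.0)
print(gated > reverse_only)
```

In this toy input only the second position passes the entropy gate, so the gated loss exceeds the reverse-KL-only loss by the forward KL penalty for ignoring the teacher's spread at that position.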

Reproducibility

Yes, the authors have provided open-source code available at https://github.com/WLS04/EOPD, which should aid in reproducing the experiments and results presented in the paper.

Discussion questions

How does the assumption that high-entropy regions are critical for preserving diversity hold across different types of language models and tasks?
What are the practical implications of implementing EOPD in real-world applications, such as in resource-constrained environments?
What specific scenarios or datasets could potentially falsify the results claimed by EOPD, particularly in terms of its ability to maintain diversity and improve performance?

Key figure

Figure 1 illustrates the top-10 change rate for scenarios with low and high teacher entropy, showing that high entropy leads to persistent instability in student learning.
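The summary does not define "top-10 change rate". One plausible reading, sketched below purely as an assumption, is the fraction of a distribution's top-k tokens that drop out of the top-k set between two training checkpoints. The function names and this definition are hypothetical and may differ from the metric used in the paper.

```python
def top_k_ids(probs, k=10):
    """Indices of the k highest-probability tokens (fewer if the vocab is smaller)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_k_change_rate(before, after, k=10):
    """Fraction of the earlier top-k tokens that are absent from the later top-k.

    ASSUMED definition for illustration only; the paper may measure this differently.
    """
    old, new = top_k_ids(before, k), top_k_ids(after, k)
    return len(old - new) / len(old)

# Toy 4-token vocabulary with k=2: token 0 drops out of the top-2, token 1 stays.
before = [0.4, 0.3, 0.2, 0.1]
after = [0.1, 0.3, 0.2, 0.4]
rate = top_k_change_rate(before, after, k=2)
print(rate)
```

Under this reading, a rate that stays high late in training would correspond to the persistent instability the figure attributes to high teacher entropy.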

GitHub: 1 repo

WLS04/EOPD (Official)
