2026-05-27 | agents, reasoning, code

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

Key claim

SGSD improves LLM reasoning by using a skill bank to supervise self-distillation.

The paper presents Skill-Conditioned Gated Self-Distillation (SGSD), which improves reasoning in large language models by using a skill bank as a source of supervision. SGSD outperforms existing methods such as GRPO and OPSD on multiple benchmarks, with a reported average improvement of 6.2%. The approach is intended to make teacher-student self-distillation more effective during training.

In plain English

Novelty

8.0/10

Skill-Conditioned Gated Self-Distillation conditions self-distillation on a skill bank, which is a significant extension of existing self-distillation methods.

Reliability

7.5/10

The experiments demonstrate consistent improvements over strong baselines, indicating solid support for the claims made.

Deep reliability assessment

The methodology supports the claim that verifier-validated, skill-conditioned self-distillation can improve math-reasoning post-training over GRPO under the reported benchmark setup. It overclaims if interpreted as generally reliable skill reuse for open-ended reasoning, since the approach depends heavily on automatic verifiers, retrieved skill quality, and mostly math-domain evaluation.

Reproducibility

Yes for code: the paper states code is available at the GitHub repository. Datasets appear to be public mathematical reasoning benchmarks such as AIME24, AIME25, and HMMT25, but the provided text does not include full dataset construction or split details.

Discussion questions

1. Does the verifier-based polarity rule actually validate the usefulness of a retrieved skill, or does it mostly reinforce correlations between teacher logits and final-answer success?
2. For builders, is maintaining and retrieving from a skill-mistake bank cheaper and more robust than simply using stronger answer-conditioned supervision or more RLVR sampling?
3. What result would falsify SGSD: degraded performance when skills are partially corrupted, no gain over random skill retrieval, or failure to transfer the skill bank across related math benchmarks?

Key figure

The key architecture retrieves skill-mistake pairs from an experience bank, conditions multiple self-teachers on them to score the same plain-prompt student rollout, uses the verifier outcome to assign each teacher signal a helpful or harmful polarity, and applies a gated distillation loss.
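The scoring and gating steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `TeacherScore`, `mean_gap`, `polarity`, and `gated_distill_loss` are hypothetical, the polarity rule (a teacher is helpful when its preference for the rollout agrees with the verifier) is an assumed reading of the summary, and the squared log-probability gap is a simple stand-in for whatever distillation divergence SGSD actually uses. Skill retrieval and real model scoring are omitted; log-probabilities are passed in as plain lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherScore:
    """Hypothetical container: one skill-conditioned self-teacher's scores."""
    skill_id: str
    token_logps: List[float]  # log-probs of the student's rollout tokens under this teacher


def mean_gap(teacher_logps: List[float], student_logps: List[float]) -> float:
    """Average per-token (teacher - student) log-prob gap on the same rollout."""
    if len(teacher_logps) != len(student_logps) or not student_logps:
        raise ValueError("teacher and student must score the same non-empty rollout")
    return sum(t - s for t, s in zip(teacher_logps, student_logps)) / len(student_logps)


def polarity(gap: float, rollout_correct: bool) -> int:
    """Assumed rule: +1 (helpful) if the teacher favors a verified-correct rollout
    or disfavors an incorrect one, -1 (harmful) otherwise, 0 if it is indifferent."""
    if gap == 0:
        return 0
    return 1 if (gap > 0) == rollout_correct else -1


def gated_distill_loss(student_logps: List[float],
                       teachers: List[TeacherScore],
                       rollout_correct: bool) -> float:
    """Average squared log-prob gap over teachers whose gate is open (polarity +1).
    The squared gap is an illustrative surrogate, not the paper's loss."""
    per_teacher = []
    for teacher in teachers:
        gap = mean_gap(teacher.token_logps, student_logps)
        if polarity(gap, rollout_correct) <= 0:
            continue  # gate closed: harmful or neutral teacher contributes nothing
        sq = [(t - s) ** 2 for t, s in zip(teacher.token_logps, student_logps)]
        per_teacher.append(sum(sq) / len(sq))
    return sum(per_teacher) / len(per_teacher) if per_teacher else 0.0


# Toy usage: one rollout scored by the plain-prompt student and two teachers.
student = [-1.0, -2.0]
teachers = [
    TeacherScore("skill_a", [-0.5, -1.0]),  # rates the rollout higher than the student
    TeacherScore("skill_b", [-2.0, -3.0]),  # rates the rollout lower than the student
]
loss_if_correct = gated_distill_loss(student, teachers, rollout_correct=True)
loss_if_wrong = gated_distill_loss(student, teachers, rollout_correct=False)
```

In this toy setup the verifier outcome decides which teacher passes the gate: when the rollout is verified correct only `skill_a` contributes, and when it is incorrect only `skill_b` does. A hard 0/1 gate is the simplest choice here; a soft, weighted gate would fit the same structure.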

GitHub: 1 repo

walawalagoose/SGSD (Official)