
Add KL-penalized advantage adjustment #562

Open
arcticfly wants to merge 2 commits into main from kl-advantage

Conversation

@arcticfly
Collaborator

Summary

  • Adds a new mechanism that adjusts per-token advantages based on KL divergence from a reference model: tokens where the policy has drifted more get reduced advantages, and tokens that drifted less get increased advantages. The adjustment is zero-mean (centered) across tokens; a minimal sketch follows this list.
  • New LocalBackend.train() parameters: kl_penalty_coef, kl_penalty_reference_step, and kl_ref_adapter_path
  • Fixes a pre-existing bug in preprocessing/inputs.py where warmup config used incorrect field names (lr → learning_rate, kl_coef → kl_penalty_coef)
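
For concreteness, here is a minimal sketch of the adjustment described above. It assumes a simple per-token KL estimate (the log-prob gap between the current policy and the reference) and a plain centered linear shift; the exact estimator and scaling used in this PR may differ.

```python
import torch

def kl_adjusted_advantages(
    advantages: torch.Tensor,       # per-token advantages, shape (num_tokens,)
    policy_logprobs: torch.Tensor,  # log-probs of sampled tokens under the current policy
    ref_logprobs: torch.Tensor,     # log-probs of the same tokens under the reference model
    kl_penalty_coef: float,         # 0.0 disables the adjustment
) -> torch.Tensor:
    """Shift per-token advantages by a zero-mean, KL-based adjustment."""
    if kl_penalty_coef == 0.0:
        return advantages
    # Per-token KL estimate (assumption): log-prob gap between policy and
    # reference. Larger values indicate more drift from the reference.
    per_token_kl = policy_logprobs - ref_logprobs
    # Center so the adjustment sums to zero across tokens: high-drift tokens
    # are penalized, low-drift tokens are boosted by the same total amount.
    centered_kl = per_token_kl - per_token_kl.mean()
    return advantages - kl_penalty_coef * centered_kl
```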

Test plan

  • All linting/formatting checks pass (uv run prek run --all-files)
  • 5 unit tests for the advantage adjustment formula pass
  • Remote sweep of 9 kl_penalty_coef values (0.0001–1.0001) with kl_penalty_reference_step=0 completed successfully on Kubernetes H200 GPUs; each run finished all 20 steps

🤖 Generated with Claude Code

arcticfly and others added 2 commits February 12, 2026 16:36
Introduces a new mechanism that adjusts per-token advantages based on KL
divergence from a reference model. Tokens where the policy has drifted more
get reduced advantages, while tokens that drifted less get increased
advantages. The adjustment is zero-mean (centered) across tokens.

New parameters on LocalBackend.train() (a usage sketch follows this list):
- kl_penalty_coef: coefficient for the adjustment (0.0 = disabled)
- kl_penalty_reference_step: use a specific checkpoint step as reference
- kl_ref_adapter_path: use an arbitrary LoRA adapter path as reference
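
As a hedged illustration of the call shape, only the three KL-related keyword arguments below come from this PR; the backend construction and any other train() arguments are assumptions:

```python
# Hypothetical usage; LocalBackend construction details are omitted.
backend = LocalBackend()
backend.train(
    kl_penalty_coef=0.01,         # strength of the advantage adjustment (0.0 = off)
    kl_penalty_reference_step=0,  # compare against the step-0 checkpoint as reference...
    # kl_ref_adapter_path="adapters/ref-lora",  # ...or point at an arbitrary LoRA adapter
)
```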

Also fixes a pre-existing bug in preprocessing/inputs.py where warmup
config used incorrect field names (lr → learning_rate, kl_coef → kl_penalty_coef).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>