Add KL-penalized advantage adjustment by arcticfly · Pull Request #562 · OpenPipe/ART

arcticfly · 2026-02-15T08:24:13Z

Summary

Adds a new mechanism that adjusts per-token advantages based on KL divergence from a reference model — tokens where the policy has drifted more get reduced advantages, tokens that drifted less get increased advantages. The adjustment is zero-mean (centered) across tokens.
New LocalBackend.train() parameters: kl_penalty_coef, kl_penalty_reference_step, and kl_ref_adapter_path
Fixes a pre-existing bug in preprocessing/inputs.py where warmup config used incorrect field names (lr → learning_rate, kl_coef → kl_penalty_coef)
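
As a rough illustration of the adjustment described above, here is a minimal sketch, not the code from this PR. It assumes the per-token drift is estimated as the difference between policy and reference log-probabilities, and that centering is done over the trainable (masked-in) tokens of a single sequence. The function name kl_adjusted_advantages and its arguments other than kl_penalty_coef are hypothetical.

```python
from typing import List, Sequence


def kl_adjusted_advantages(
    advantages: Sequence[float],
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    mask: Sequence[bool],
    kl_penalty_coef: float,
) -> List[float]:
    """Shift per-token advantages by a centered KL penalty (illustrative only)."""
    # Assumed per-token drift estimate: log pi(token) - log pi_ref(token).
    kl = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]

    # Only trainable tokens take part in the centering.
    active = [k for k, m in zip(kl, mask) if m]
    if kl_penalty_coef == 0.0 or not active:
        # A coefficient of 0.0 disables the adjustment.
        return list(advantages)

    mean_kl = sum(active) / len(active)

    # Above-average drift lowers the advantage, below-average drift raises it.
    # The subtracted term sums to zero over the active tokens.
    return [
        a - kl_penalty_coef * (k - mean_kl) if m else a
        for a, k, m in zip(advantages, kl, mask)
    ]


adjusted = kl_adjusted_advantages(
    advantages=[1.0, 1.0, 1.0, 1.0],
    policy_logprobs=[-1.0, -0.5, -2.0, -1.0],
    ref_logprobs=[-1.0, -1.5, -1.0, -3.0],
    mask=[True, True, True, False],
    kl_penalty_coef=0.5,
)
```

Because the penalty is centered, the mean advantage over the trainable tokens is unchanged; only its distribution across tokens moves.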

Test plan

All linting/formatting checks pass (uv run prek run --all-files)
5 unit tests for the advantage adjustment formula pass
Remote sweep of 9 kl_penalty_coef values (0.0001–1.0001) with kl_penalty_reference_step=0 completed successfully on Kubernetes H200 GPUs, all 20 steps each

🤖 Generated with Claude Code

Introduces a new mechanism that adjusts per-token advantages based on KL divergence from a reference model. Tokens where the policy has drifted more get reduced advantages, while tokens that drifted less get increased advantages. The adjustment is zero-mean (centered) across tokens.

New parameters on LocalBackend.train():
- kl_penalty_coef: coefficient for the adjustment (0.0 = disabled)
- kl_penalty_reference_step: use a specific checkpoint step as reference
- kl_ref_adapter_path: use an arbitrary LoRA adapter path as reference

Also fixes a pre-existing bug in preprocessing/inputs.py where warmup config used incorrect field names (lr → learning_rate, kl_coef → kl_penalty_coef).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

arcticfly and others added 2 commits February 12, 2026 16:36

Fix import sorting and formatting

99ee7a4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

arcticfly wants to merge 2 commits into main from kl-advantage