
Cleanup codebase to make it readable and usable#73

Open
adityasoni9998 wants to merge 2 commits into major-update from soni/main

Conversation


adityasoni9998 (Collaborator) commented Feb 10, 2026

Fixes #71 (mostly, but not fully)

Summary of changes made:

  • Edit the code-search generator to fix masking bugs, remove length-based loss masking in the trainer, and fix the masking logic across all LLMs (see the sketches after this list)
  • Clean up dead code files, stale comments, unused datasets, submodules, and unused packages such as prime-rl
  • agent-sdk is now installed via pyproject.toml
  • Fix bugs in the Wandb metrics for total tokens in each rollout (see the second sketch below)
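To make the first bullet concrete, here is a minimal sketch (not this repo's actual implementation) of completion-only loss masking with no dependence on sequence length; the function name, tensor layout, and pad-token handling are assumptions for illustration:

```python
import torch

def build_loss_mask(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    """Train only on completion tokens: mask out the prompt and padding,
    with no length-based masking of the loss."""
    mask = torch.ones_like(input_ids, dtype=torch.bool)
    mask[:, :prompt_len] = False          # never compute loss on prompt tokens
    mask &= input_ids != pad_token_id     # never compute loss on padding
    return mask

# Toy usage: average the per-token loss over unmasked positions only.
input_ids = torch.tensor([[11, 12, 13, 14, 15, 0, 0]])   # 0 = pad
mask = build_loss_mask(input_ids, prompt_len=2, pad_token_id=0)
per_token_loss = torch.randn(input_ids.shape)             # stand-in for -logprob * advantage
loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```

And a hedged sketch of the kind of per-rollout token metric the last bullet refers to; the `rollouts` structure, field name, and metric keys are hypothetical:

```python
import wandb

def log_rollout_token_stats(rollouts: list[dict], step: int) -> None:
    """Log total and mean generated tokens for a batch of rollouts to Wandb.
    Assumes each rollout dict carries a 'completion_ids' list (hypothetical field)."""
    total = sum(len(r["completion_ids"]) for r in rollouts)
    wandb.log(
        {
            "rollout/total_tokens": total,
            "rollout/mean_tokens": total / max(len(rollouts), 1),
        },
        step=step,
    )
```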

This code was tested by training the 1.7B RL'ed LLM checkpoint for a few more steps -- it runs out of the box, and the reward stays close to where it had plateaued. Full end-to-end validation (training a base LLM and verifying that rewards go up) is still a TODO.

(screenshot: reward curve from the resumed 1.7B run)

TODOs/Questions:

  • configs and src/prompts/templates contain many unused reward configs and prompt templates. Do we want to remove them?
  • Do we want to retain all the reward function implementations? We only end up using the simplest rewards in our work.
  • Train 4B Instruct with this code (with the exact same config as the 14B model?) and verify that it works (<12 hours of work on an 8xH100 machine).


yucc-leon commented Feb 10, 2026

Hi Aditya, thanks for the cleanup PR — this seems super helpful.

Some of the issues you fixed (especially the masking logic and the length-based loss masking) might explain why my earlier ablations on the major-update branch didn’t behave well (runs here: https://wandb.ai/leon_at_work/ablation_v2_4b). I likely ran into those bugs before they were cleaned up, or I used different configs or settings in the training scripts.

I’m happy to help with the TODO you mentioned — e.g. doing an end-to-end validation by training a 4B Instruct model using this branch and checking whether rewards improve as expected. Let me know if the cleanup branch reflects the intended setup for that.
