feat: Megatron backend improvements for Docker and GPU memory #558
saurabhbikram wants to merge 1 commit into OpenPipe:main from
Conversation
Three improvements to the Megatron training backend:

1. Load identity LoRA on CPU instead of GPU (`device_map="cpu"`). This runs before vLLM starts and only needs to generate adapter config/weights on disk, avoiding unnecessary GPU memory claims.
2. Support bare `torchrun` via the `RL_DOCKER` env var for Docker environments where `uv` is not needed.
3. Add a `MEGATRON_EXIT_AFTER_JOB` mode that fully exits the training process after each job, releasing all GPU memory (CUDA context + NCCL buffers) before vLLM reclaims the GPUs. The service restarts Megatron for the next training step.

https://claude.ai/code/session_017Y9KNNQX2RyVWnqpj3A4hh
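As a rough sketch of the first change: the identity-LoRA step only needs to write adapter artifacts to disk, so nothing has to touch the GPU. The function below is an illustrative stand-in, not the actual OpenPipe code (the real `_create_identity_lora()` in `service.py` loads the base model via transformers with `device_map="cpu"`; the config fields here are simplified):

```python
import gc
import json
import os

def create_identity_lora(adapter_dir: str, rank: int = 8) -> dict:
    """Sketch of the identity-LoRA step: generate adapter config on disk.

    The real _create_identity_lora() loads the full base model with
    device_map="cpu" (previously "auto"); this stand-in only models the
    on-disk artifact, since no GPU computation is needed either way.
    """
    os.makedirs(adapter_dir, exist_ok=True)
    config = {
        "peft_type": "LORA",
        "r": rank,
        "lora_alpha": 2 * rank,
        # Zero-initialized LoRA B matrices make the adapter a no-op
        # (identity) transformation of the base model.
        "init_lora_weights": True,
    }
    with open(os.path.join(adapter_dir, "adapter_config.json"), "w") as f:
        json.dump(config, f)
    # CPU-only path: plain garbage collection replaces the old
    # torch.cuda.synchronize() + torch.cuda.empty_cache() cleanup.
    gc.collect()
    return config
```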
Summary

Three improvements to the Megatron training backend:

1. Load identity LoRA on CPU (`service.py`)

   `_create_identity_lora()` previously loaded the full base model with `device_map="auto"`, claiming GPU memory before vLLM starts. Changed to `device_map="cpu"`, since this step only generates adapter config/weights on disk — no GPU computation is needed. Cleanup changed from `cuda.synchronize()` + `empty_cache()` to `gc.collect()` accordingly.

2. Docker-compatible torchrun (`service.py`)

   When `RL_DOCKER=1` is set, uses bare `torchrun` instead of `uv run torchrun`. Inside a Docker container, dependencies are installed system-wide, so `uv` is unnecessary.

3. Exit-after-job mode (`service.py` + `train.py`)

   A new `MEGATRON_EXIT_AFTER_JOB=1` env var causes the Megatron training process to fully exit after each training step. This releases all GPU memory, including the CUDA context and NCCL communication buffers — memory that `torch.cuda.empty_cache()` cannot reclaim. The service waits for the process to exit before waking vLLM, and restarts Megatron for the next step. The existing CPU-offload wake-lock behavior is preserved as the default when this env var is not set.

Files changed

- `src/art/megatron/service.py` — CPU loading, Docker runner, exit-after-job wake logic
- `src/art/megatron/train.py` — exit-after-job barrier + process group teardown
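The `RL_DOCKER` switch from item 2 can be sketched as a small command builder. The function name and argument flags below are illustrative, not the actual `service.py` code:

```python
import os

def torchrun_command(nproc: int, script: str) -> list:
    """Sketch: bare torchrun inside Docker, uv-wrapped torchrun otherwise."""
    base = ["torchrun", f"--nproc-per-node={nproc}", script]
    # Inside a Docker container, dependencies are installed system-wide,
    # so the `uv run` wrapper is unnecessary.
    if os.environ.get("RL_DOCKER") == "1":
        return base
    return ["uv", "run", *base]
```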
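Item 3's service-side flow is: launch the training process, wait for it to exit completely, then wake vLLM and relaunch for the next step. A minimal stand-in, with a hypothetical helper name (the real wake and restart logic lives in `service.py`):

```python
import subprocess
import sys

def run_job_then_exit(job_code: str) -> int:
    """Sketch of the MEGATRON_EXIT_AFTER_JOB=1 service-side flow.

    The training process runs one job and exits; only once it is fully
    gone are its CUDA context and NCCL buffers released, which is memory
    that torch.cuda.empty_cache() alone cannot reclaim.
    """
    proc = subprocess.Popen([sys.executable, "-c", job_code])
    # Block until full process exit before waking vLLM, so the GPUs it
    # reclaims carry no residual allocations from training.
    return proc.wait()
```

On the `train.py` side, the PR pairs this with a barrier and process group teardown before exiting, so all ranks leave NCCL cleanly; the service then restarts Megatron for the next training step.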