
feat: Megatron backend improvements for Docker and GPU memory #558

Open

saurabhbikram wants to merge 1 commit into OpenPipe:main from nansen-ai:megatron-backend-improvements

Conversation

@saurabhbikram

Summary

Three improvements to the Megatron training backend:

1. Load identity LoRA on CPU (service.py)

`_create_identity_lora()` previously loaded the full base model with `device_map="auto"`, claiming GPU memory before vLLM starts. It now uses `device_map="cpu"`, since this step only generates adapter config/weights on disk and needs no GPU computation. Cleanup accordingly changes from `torch.cuda.synchronize()` + `torch.cuda.empty_cache()` to `gc.collect()`.
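
A minimal sketch of the idea (the function name, arguments, and `target_modules` below are illustrative, not the actual `_create_identity_lora()` signature):

```python
import gc

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def create_identity_lora(base_model_id: str, adapter_dir: str) -> None:
    # Load on CPU: this step only writes adapter config/weights to disk,
    # so there is no need to claim GPU memory before vLLM starts.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="cpu")

    # A freshly initialized LoRA is an "identity" adapter: lora_B starts at
    # zero, so the adapted model behaves exactly like the base model.
    config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    peft_model = get_peft_model(base, config)
    peft_model.save_pretrained(adapter_dir)

    # CPU-only path: plain garbage collection instead of
    # torch.cuda.synchronize() + torch.cuda.empty_cache().
    del peft_model, base
    gc.collect()
```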

2. Docker-compatible torchrun (service.py)

When `RL_DOCKER=1` is set, the service launches training with bare `torchrun` instead of `uv run torchrun`. Inside a Docker container, dependencies are installed system-wide, so going through `uv` is unnecessary.
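
Roughly (a sketch; the helper name and the trailing torchrun arguments are illustrative):

```python
import os

def launcher_command(nproc_per_node: int, script: str) -> list[str]:
    # Inside the Docker image, dependencies are installed system-wide,
    # so torchrun can be invoked directly; otherwise go through uv.
    if os.environ.get("RL_DOCKER") == "1":
        launcher = ["torchrun"]
    else:
        launcher = ["uv", "run", "torchrun"]
    return launcher + [f"--nproc-per-node={nproc_per_node}", script]
```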

3. Exit-after-job mode (service.py + train.py)

A new `MEGATRON_EXIT_AFTER_JOB=1` env var causes the Megatron training process to fully exit after each training step. This releases all GPU memory, including the CUDA context and NCCL communication buffers, which `torch.cuda.empty_cache()` cannot reclaim. The service waits for the process to exit before waking vLLM, then restarts Megatron for the next step. The existing CPU-offload wake-lock behavior remains the default when the env var is not set.
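
On the `train.py` side this roughly amounts to the following (a sketch, assuming the distributed process group is still initialized at this point; the function name is illustrative):

```python
import os
import sys

import torch.distributed as dist

def maybe_exit_after_job() -> None:
    if os.environ.get("MEGATRON_EXIT_AFTER_JOB") != "1":
        return  # default: stay alive and rely on CPU offload + wake lock

    # Wait for every rank to finish writing its results before any rank
    # starts tearing down NCCL.
    dist.barrier()
    dist.destroy_process_group()

    # Exiting the process releases the CUDA context and NCCL buffers,
    # which torch.cuda.empty_cache() alone cannot reclaim.
    sys.exit(0)
```

The service side then waits for the torchrun subprocess to exit before waking vLLM, and relaunches Megatron when the next training step arrives.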

Files changed

  • src/art/megatron/service.py — CPU loading, Docker runner, exit-after-job wake logic
  • src/art/megatron/train.py — exit-after-job barrier + process group teardown


https://claude.ai/code/session_017Y9KNNQX2RyVWnqpj3A4hh
saurabhbikram marked this pull request as ready for review February 12, 2026 08:48
