
feat: Megatron backend improvements for Docker and GPU memory #558

Open

saurabhbikram wants to merge 1 commit into OpenPipe:main from nansen-ai:megatron-backend-improvements

Conversation

@saurabhbikram

Summary

Three improvements to the Megatron training backend:

1. Load identity LoRA on CPU (service.py)

`_create_identity_lora()` previously loaded the full base model with `device_map="auto"`, claiming GPU memory before vLLM starts. It now uses `device_map="cpu"`, since this step only generates adapter config/weights on disk and needs no GPU computation. Cleanup accordingly changes from `torch.cuda.synchronize()` + `torch.cuda.empty_cache()` to `gc.collect()`.
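
A minimal sketch of the idea (the function name, arguments, and `target_modules` below are illustrative, not the actual `_create_identity_lora()` signature):

```python
import gc

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

def create_identity_lora(base_model_id: str, adapter_dir: str) -> None:
    # Load on CPU: this step only writes adapter config/weights to disk,
    # so there is no need to claim GPU memory before vLLM starts.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="cpu")

    # A freshly initialized LoRA is an "identity" adapter: lora_B starts at
    # zero, so the adapted model behaves exactly like the base model.
    config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    peft_model = get_peft_model(base, config)
    peft_model.save_pretrained(adapter_dir)

    # CPU-only path: plain garbage collection instead of
    # torch.cuda.synchronize() + torch.cuda.empty_cache().
    del peft_model, base
    gc.collect()
```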

2. Docker-compatible torchrun (service.py)

When `RL_DOCKER=1` is set, the service launches training with bare `torchrun` instead of `uv run torchrun`. Inside a Docker container, dependencies are installed system-wide, so going through `uv` is unnecessary.
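
Roughly (a sketch; the helper name and the trailing torchrun arguments are illustrative):

```python
import os

def launcher_command(nproc_per_node: int, script: str) -> list[str]:
    # Inside the Docker image, dependencies are installed system-wide,
    # so torchrun can be invoked directly; otherwise go through uv.
    if os.environ.get("RL_DOCKER") == "1":
        launcher = ["torchrun"]
    else:
        launcher = ["uv", "run", "torchrun"]
    return launcher + [f"--nproc-per-node={nproc_per_node}", script]
```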

3. Exit-after-job mode (service.py + train.py)

A new `MEGATRON_EXIT_AFTER_JOB=1` env var causes the Megatron training process to fully exit after each training step. This releases all GPU memory, including the CUDA context and NCCL communication buffers, which `torch.cuda.empty_cache()` cannot reclaim. The service waits for the process to exit before waking vLLM, then restarts Megatron for the next step. The existing CPU-offload wake-lock behavior remains the default when the env var is not set.
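
On the `train.py` side this roughly amounts to the following (a sketch, assuming the distributed process group is still initialized at this point; the function name is illustrative):

```python
import os
import sys

import torch.distributed as dist

def maybe_exit_after_job() -> None:
    if os.environ.get("MEGATRON_EXIT_AFTER_JOB") != "1":
        return  # default: stay alive and rely on CPU offload + wake lock

    # Wait for every rank to finish writing its results before any rank
    # starts tearing down NCCL.
    dist.barrier()
    dist.destroy_process_group()

    # Exiting the process releases the CUDA context and NCCL buffers,
    # which torch.cuda.empty_cache() alone cannot reclaim.
    sys.exit(0)
```

The service side then waits for the torchrun subprocess to exit before waking vLLM, and relaunches Megatron when the next training step arrives.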

Files changed

  • src/art/megatron/service.py — CPU loading, Docker runner, exit-after-job wake logic
  • src/art/megatron/train.py — exit-after-job barrier + process group teardown


https://claude.ai/code/session_017Y9KNNQX2RyVWnqpj3A4hh
saurabhbikram marked this pull request as ready for review February 12, 2026 08:48
