
SGLang + Megatron: verl-style hybrid engine for ART RL training pipeline#552

Open
pmukeshreddy wants to merge 3 commits into OpenPipe:main from pmukeshreddy:sglang-megatron-benchmark-v2


@pmukeshreddy pmukeshreddy commented Feb 11, 2026

Summary

  • Adds an alternative SGLang inference backend for ART's RL training pipeline that replaces vLLM with a verl-style separate-process architecture (HTTP-based sleep/wake + LoRA hot-reload)
  • Includes a full benchmarking suite (benchmarks/sglang_vs_vllm/) that compares SGLang + Megatron vs vLLM + Megatron under identical conditions (same model, same prompts, same Megatron training loop)
  • Result on Qwen3-30B-A3B (MoE), 4×A100, TP=2, 10 RL steps: SGLang delivers 3.9× throughput, 2.3× faster ITL, 52% less tail latency, 29% less peak GPU memory, and 3.4× faster startup — zero errors on both sides

Motivation

ART's current vLLM backend runs in-process, sharing a CUDA context with Megatron. After the first sleep/wake cycle, vLLM permanently loses ~53 GB of GPU memory because Megatron's subprocess stays alive during wake (vLLM RFC #15254). This causes a 29% throughput degradation from step 1 → step 2 onward.

The SGLang backend avoids this by running as a separate process with its own CUDA context. Memory release via HTTP /release_memory_occupation is a clean OS-level free, giving Megatron the full GPU during training and SGLang full recovery on wake.

Architecture

| Aspect | ART's vLLM | SGLang backend |
| --- | --- | --- |
| Process model | In-process (shared CUDA context) | Separate process (independent CUDA context) |
| Sleep/wake | `do_sleep(level=2)` / `do_wake_up()` | HTTP `/release_memory_occupation` / `/resume_memory_occupation` |
| Memory recovery | ~53 GB lost permanently after step 1 | Full recovery every step |
| Weight sync | In-process `add_lora()` | HTTP `/load_lora_adapter` (<2 s) |
| KV cache | Standard prefix cache | RadixAttention (auto-dedup of shared prefixes) |
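The table above implies a fixed per-step ordering. The sketch below spells it out; only the ordering and endpoint names are taken from the PR, while the function name and the Megatron/LoRA placeholders are hypothetical.

```python
"""Minimal sketch of the per-RL-step lifecycle implied by the table:
sleep -> train -> wake -> LoRA hot-reload. The phase strings are
illustrative placeholders, not real client calls."""


def rl_step_phases(step: int, lora_path: str) -> list[str]:
    # The order matters: SGLang must release GPU memory *before* Megatron
    # trains, and the fresh LoRA adapter is loaded only *after* wake.
    return [
        "POST /release_memory_occupation",       # SGLang frees weights + KV cache
        f"megatron_train_step({step})",          # Megatron gets the full GPU
        "POST /resume_memory_occupation",        # SGLang reclaims its memory
        f"POST /load_lora_adapter {lora_path}",  # hot-reload new weights (<2 s)
    ]
```

Since each phase is an HTTP round-trip to an independent process, a failure in any phase leaves the other process's CUDA context intact.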

What's Added

```
benchmarks/
├── __init__.py
└── sglang_vs_vllm/
    ├── __init__.py
    ├── config.py                    # Benchmark configuration (model, inference, training params)
    ├── sglang_server.py             # SGLang server lifecycle (start, sleep, wake, LoRA reload)
    ├── sglang_megatron_service.py   # Sleep → train → wake → load_lora lifecycle
    ├── sglang_megatron_backend.py   # Backend class extending ART's LocalBackend
    ├── metrics_collector.py         # Metrics collection and comparison reporting
    ├── run_benchmark.py             # Main benchmark orchestrator (CLI entry point)
    ├── setup_environments.sh        # Environment setup (separate SGLang venv)
    └── README.md                    # Detailed usage docs
```

No existing ART source files were modified — this is purely additive.

Benchmark Results

Setup: Qwen3-30B-A3B-Instruct-2507 | GSM8K | TP=2 | 4×A100 | 10 RL steps

| Metric | vLLM | SGLang | Delta |
| --- | --- | --- | --- |
| Avg throughput | 582 tok/s | 2,271 tok/s | 3.9× faster |
| Avg ITL | 31.9 ms | 13.9 ms | 2.3× faster |
| Avg p99 latency | 29.5 s | 14.1 s | −52% |
| Peak GPU memory | 190.4 GB | 135.2 GB | −29% |
| Server startup | 182 s | 53 s | 3.4× faster |
| Total wall time | 1,553 s | 1,210 s | −22% |
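For reference, the sketch below shows how the two headline metrics are conventionally defined; these are assumed definitions matching common benchmark usage, and the PR's `metrics_collector.py` may compute them differently.

```python
"""Assumed metric definitions (common usage, not necessarily the PR's):
throughput = generated tokens per wall-clock second; ITL (inter-token
latency) = mean gap between consecutive token emission timestamps."""


def throughput_tok_per_s(total_tokens: int, wall_seconds: float) -> float:
    # e.g. 2,271 tok/s vs 582 tok/s gives the ~3.9x ratio in the table.
    return total_tokens / wall_seconds


def mean_itl_ms(token_timestamps: list[float]) -> float:
    # Average gap between consecutive token timestamps, reported in ms.
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return 1000.0 * sum(gaps) / len(gaps)
```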

How to Reproduce

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# 2. Clone and set up
git clone -b sglang-megatron-benchmark https://github.com/pmukeshreddy/ART.git
cd ART
uv sync --extra backend

# 3. Set up the SGLang environment
bash benchmarks/sglang_vs_vllm/setup_environments.sh

# 4. Run the SGLang benchmark (GSM8K)
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
    --sglang-python ~/.venvs/sglang-bench/bin/python \
    --tp 2 --num-steps 5 --num-rollouts 32 --backends sglang --dataset gsm8k

# 5. Run the vLLM benchmark (GSM8K)
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
    --sglang-python ~/.venvs/sglang-bench/bin/python \
    --tp 2 --num-steps 5 --num-rollouts 32 --backends vllm --dataset gsm8k

# 6. Check results
cat benchmark_results/benchmark_report.txt
cat benchmark_results/sglang_metrics.json
cat benchmark_results/vllm_metrics.json
```
