
SGLang + Megatron: verl-style hybrid engine for ART RL training pipeline#552

Open
pmukeshreddy wants to merge 3 commits into OpenPipe:main from pmukeshreddy:sglang-megatron-benchmark-v2


@pmukeshreddy pmukeshreddy commented Feb 11, 2026

Summary

  • Adds an alternative SGLang inference backend for ART's RL training pipeline that replaces vLLM with a verl-style separate-process architecture (HTTP-based sleep/wake + LoRA hot-reload)
  • Includes a full benchmarking suite (benchmarks/sglang_vs_vllm/) that compares SGLang + Megatron vs vLLM + Megatron under identical conditions (same model, same prompts, same Megatron training loop)
  • Result on Qwen3-30B-A3B (MoE), 4×A100, TP=2, 10 RL steps: SGLang delivers 3.9× throughput, 2.3× faster ITL, 52% less tail latency, 29% less peak GPU memory, and 3.4× faster startup — zero errors on both sides

Motivation

ART's current vLLM backend runs in-process, sharing a CUDA context with Megatron. After the first sleep/wake cycle, vLLM permanently loses ~53 GB of GPU memory because Megatron's subprocess stays alive during wake (vLLM RFC #15254). This causes a 29% throughput degradation from step 1 → step 2 onward.

The SGLang backend avoids this by running as a separate process with its own CUDA context. Memory release via HTTP /release_memory_occupation is a clean OS-level free, giving Megatron the full GPU during training and SGLang full recovery on wake.

Architecture

| Aspect | ART's vLLM | SGLang backend |
| --- | --- | --- |
| Process model | In-process (shared CUDA context) | Separate process (independent CUDA context) |
| Sleep/wake | `do_sleep(level=2)` / `do_wake_up()` | HTTP `/release_memory_occupation` / `/resume_memory_occupation` |
| Memory recovery | ~53 GB lost permanently after step 1 | Full recovery every step |
| Weight sync | In-process `add_lora()` | HTTP `/load_lora_adapter` (<2 s) |
| KV cache | Standard prefix cache | RadixAttention (auto-dedup of shared prefixes) |
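The table above implies a fixed per-step ordering. The sketch below spells it out; only the ordering and endpoint names are taken from the PR, while the function name and the Megatron/LoRA placeholders are hypothetical.

```python
"""Minimal sketch of the per-RL-step lifecycle implied by the table:
sleep -> train -> wake -> LoRA hot-reload. The phase strings are
illustrative placeholders, not real client calls."""


def rl_step_phases(step: int, lora_path: str) -> list[str]:
    # The order matters: SGLang must release GPU memory *before* Megatron
    # trains, and the fresh LoRA adapter is loaded only *after* wake.
    return [
        "POST /release_memory_occupation",       # SGLang frees weights + KV cache
        f"megatron_train_step({step})",          # Megatron gets the full GPU
        "POST /resume_memory_occupation",        # SGLang reclaims its memory
        f"POST /load_lora_adapter {lora_path}",  # hot-reload new weights (<2 s)
    ]
```

Since each phase is an HTTP round-trip to an independent process, a failure in any phase leaves the other process's CUDA context intact.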

What's Added

```
benchmarks/
├── __init__.py
└── sglang_vs_vllm/
    ├── __init__.py
    ├── config.py                    # Benchmark configuration (model, inference, training params)
    ├── sglang_server.py             # SGLang server lifecycle (start, sleep, wake, LoRA reload)
    ├── sglang_megatron_service.py   # Sleep → train → wake → load_lora lifecycle
    ├── sglang_megatron_backend.py   # Backend class extending ART's LocalBackend
    ├── metrics_collector.py         # Metrics collection and comparison reporting
    ├── run_benchmark.py             # Main benchmark orchestrator (CLI entry point)
    ├── setup_environments.sh        # Environment setup (separate SGLang venv)
    └── README.md                    # Detailed usage docs
```

No existing ART source files were modified — this is purely additive.

Benchmark Results

Setup: Qwen3-30B-A3B-Instruct-2507 | GSM8K | TP=2 | 4×A100 | 10 RL steps

| Metric | vLLM | SGLang | Delta |
| --- | --- | --- | --- |
| Avg throughput | 582 tok/s | 2,271 tok/s | 3.9× faster |
| Avg ITL | 31.9 ms | 13.9 ms | 2.3× faster |
| Avg p99 latency | 29.5 s | 14.1 s | −52% |
| Peak GPU memory | 190.4 GB | 135.2 GB | −29% |
| Server startup | 182 s | 53 s | 3.4× faster |
| Total wall time | 1,553 s | 1,210 s | −22% |
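For reference, the sketch below shows how the two headline metrics are conventionally defined; these are assumed definitions matching common benchmark usage, and the PR's `metrics_collector.py` may compute them differently.

```python
"""Assumed metric definitions (common usage, not necessarily the PR's):
throughput = generated tokens per wall-clock second; ITL (inter-token
latency) = mean gap between consecutive token emission timestamps."""


def throughput_tok_per_s(total_tokens: int, wall_seconds: float) -> float:
    # e.g. 2,271 tok/s vs 582 tok/s gives the ~3.9x ratio in the table.
    return total_tokens / wall_seconds


def mean_itl_ms(token_timestamps: list[float]) -> float:
    # Average gap between consecutive token timestamps, reported in ms.
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return 1000.0 * sum(gaps) / len(gaps)
```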

How to Reproduce

```bash
# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# 2. Clone and set up
git clone -b sglang-megatron-benchmark https://github.com/pmukeshreddy/ART.git
cd ART
uv sync --extra backend

# 3. Set up the SGLang environment
bash benchmarks/sglang_vs_vllm/setup_environments.sh

# 4. Run the SGLang benchmark (GSM8K)
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
    --sglang-python ~/.venvs/sglang-bench/bin/python \
    --tp 2 --num-steps 5 --num-rollouts 32 --backends sglang --dataset gsm8k

# 5. Run the vLLM benchmark (GSM8K)
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
    --sglang-python ~/.venvs/sglang-bench/bin/python \
    --tp 2 --num-steps 5 --num-rollouts 32 --backends vllm --dataset gsm8k

# 6. Check results
cat benchmark_results/benchmark_report.txt
cat benchmark_results/sglang_metrics.json
cat benchmark_results/vllm_metrics.json
```
