Unsloth sglang by pmukeshreddy · Pull Request #564 · OpenPipe/ART

pmukeshreddy · 2026-02-17T03:02:14Z

Integrates Unsloth training with the SGLang inference backend using a dedicated GPU split architecture. SGLang runs persistently on inference GPU(s) while Unsloth/GRPO training runs on separate GPU(s), eliminating the sleep/wake overhead of shared-GPU approaches and keeping SGLang's RadixAttention prefix cache warm across training steps.

Weight synchronization happens via SGLang's LoRA hot-reload API, which updates the inference model in place without restarting the server or losing cached KV states. Single-GPU setups fall back to restarting the server and clearing the cache.
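A minimal sketch of what one hot-reload call after a training step might look like. It is illustrative only, not the code in this PR: the `/load_lora_adapter` route and the `lora_name`/`lora_path` fields are assumed from SGLang's dynamic LoRA loading interface and should be checked against the installed SGLang version. The helper names, port, and checkpoint path are hypothetical.

```python
import json
import urllib.request


def build_lora_reload_request(base_url: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking the server to load a LoRA adapter.

    ASSUMPTION: endpoint path and JSON field names follow SGLang's dynamic
    LoRA loading route; they are not taken from this PR.
    """
    payload = json.dumps({"lora_name": lora_name, "lora_path": lora_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/load_lora_adapter",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push_adapter(base_url: str, lora_name: str, lora_path: str, timeout: float = 60.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_lora_reload_request(base_url, lora_name, lora_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical usage after a training step has saved its adapter:
# push_adapter("http://localhost:30000", "step_12", "/tmp/checkpoints/step_12")
```

Keeping request construction separate from sending makes the payload easy to inspect in a unit test without a running server.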

The backend uses a two-environment architecture to resolve dependency conflicts: SGLang (torchao==0.9) runs in a venv isolated from Unsloth (torchao>=0.13), and the two communicate over HTTP only. DeviceConfig auto-detects available GPUs and computes optimal inference/training splits.
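The split logic can be pictured with a small sketch. This is not the PR's `DeviceConfig`; the `GpuSplit` type and the `split_gpus` and `visible_devices` helpers are hypothetical. They assume the layout named in the commit message (inference on GPU 0, training on GPU 1+) and a shared device when only one GPU is present.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GpuSplit:
    inference: Tuple[int, ...]
    training: Tuple[int, ...]
    shared: bool  # True when both roles have to share a single GPU


def split_gpus(gpu_ids: Iterable[int], num_inference: int = 1) -> GpuSplit:
    """Assign the first GPU(s) to inference and the remainder to training."""
    ids = tuple(gpu_ids)
    if not ids:
        raise ValueError("no GPUs available")
    if len(ids) == 1:
        # Single-GPU case: no dedicated split is possible.
        return GpuSplit(inference=ids, training=ids, shared=True)
    # Always leave at least one GPU for training.
    n = max(1, min(num_inference, len(ids) - 1))
    return GpuSplit(inference=ids[:n], training=ids[n:], shared=False)


def visible_devices(ids: Iterable[int]) -> str:
    """Format GPU ids as a CUDA_VISIBLE_DEVICES value."""
    return ",".join(str(i) for i in ids)


split = split_gpus(range(4))
print(visible_devices(split.inference))  # GPU ids for the inference process
print(visible_devices(split.training))   # GPU ids for the training process
```

Each process would then be launched from its own venv with its own `CUDA_VISIBLE_DEVICES` value, so the two sides never import each other's packages and only meet over HTTP.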

Benchmark (Qwen3-30B-A3B, GSM8K, 4×A100)

Config	tok/s	VRAM
SGLang + Megatron	1,174	133 GB
SGLang + Unsloth	724	186 GB
vLLM + Megatron	582	143 GB

The throughput difference between the Megatron and Unsloth configurations comes from GPU allocation during inference. Megatron shards training across all GPUs via tensor parallelism, so during rollout all GPUs serve inference. Unsloth currently supports DDP only (no TP for training), which requires a permanent GPU split and leaves fewer GPUs available to serve rollouts at any given time.
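To put the table's throughput numbers side by side, a short calculation of relative throughput may help. It uses only the tok/s values reported above; the `relative_throughput` helper is just for illustration.

```python
# tok/s values copied from the benchmark table above.
throughput = {
    "SGLang + Megatron": 1174,
    "SGLang + Unsloth": 724,
    "vLLM + Megatron": 582,
}


def relative_throughput(results: dict, baseline: str) -> dict:
    """Return each configuration's throughput as a multiple of the baseline's."""
    base = results[baseline]
    return {name: tok_s / base for name, tok_s in results.items()}


for name, ratio in relative_throughput(throughput, "vLLM + Megatron").items():
    print(f"{name}: {ratio:.2f}x")
```

Against the vLLM + Megatron baseline, this gives about 2.02x for SGLang + Megatron and 1.24x for SGLang + Unsloth. SGLang + Unsloth therefore reaches roughly 62% of SGLang + Megatron's throughput (724 / 1,174).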

Changes

src/art/sglang_backend/ — SGLangBackend, SGLangConfig, DeviceConfig, SGLangService
src/art/unsloth/training_utils.py — backend-agnostic training utilities extracted from service
benchmarks/sglang_benchmarks/ — end-to-end benchmark suite with DDP training, metrics collection, server lifecycle management
scripts/ — setup, e2e test, and benchmark runner scripts
docs/sglang-integration.md — architecture docs, configuration reference, troubleshooting
Two-environment setup scripts for torchao version isolation
Core fixes: ruler empty group handling, tokenizer compatibility, vLLM import patches

…ore fixes
- SGLang backend with dedicated GPU split (inference GPU 0, training GPU 1+)
- LoRA hot-reload via SGLang API preserves RadixAttention cache
- Two-environment architecture for torchao version isolation
- Benchmarks: SGLang vs vLLM comparison suite
- Training utils extracted for backend-agnostic use
- DeviceConfig with auto-detection
- Ruler fix for empty trajectory groups and exception preservation
- vLLM compatibility patches

mukesh reddy and others added 9 commits February 16, 2026 21:38

feat: Add SGLang backend integration

a0f0f8c

Add sglang optional dependencies to pyproject.toml

486365c

Add missing modified files for SGLang integration

a872571

Update sglang-integration.md

13377c0

Update sglang-integration.md

19bd069

rename benchmarks to sglang_benchmarks

28fc350

rename benchmarks/sglang_vs_vllm to benchmarks/sglang_benchmarks

e579e5b

Update README.md

b18dc9d

pmukeshreddy wants to merge 9 commits into OpenPipe:main from pmukeshreddy:unsloth-sglang-dedicated-gpu-split
