
Unsloth sglang#564

Open
pmukeshreddy wants to merge 9 commits into OpenPipe:main from pmukeshreddy:unsloth-sglang-dedicated-gpu-split

Conversation

@pmukeshreddy

Integrates Unsloth training with the SGLang inference backend using a dedicated GPU split architecture. SGLang runs persistently on inference GPU(s) while Unsloth/GRPO training runs on separate GPU(s), eliminating the sleep/wake overhead of shared-GPU approaches and keeping SGLang's RadixAttention prefix cache warm across training steps.

Weight synchronization happens via SGLang's LoRA hot-reload API, which updates the inference model in place without restarting the server or losing cached KV states. For single-GPU setups, the backend falls back to a server restart with cache clearing.
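The per-step sync can be sketched as two HTTP calls against the running server. This is a minimal illustration, assuming SGLang's dynamic LoRA endpoints `/unload_lora_adapter` and `/load_lora_adapter`; endpoint names and payload shapes follow recent SGLang server versions and may differ from what this PR actually uses.

```python
# Hedged sketch of the per-step weight sync against a running SGLang server.
# Endpoint names and payload fields are assumptions based on SGLang's
# dynamic-LoRA API, not taken from this PR's code.
import json
import urllib.request


def lora_reload_plan(lora_name: str, lora_path: str) -> list[tuple[str, dict]]:
    """Ordered (endpoint, payload) pairs that swap in fresh adapter weights."""
    return [
        # Drop the stale adapter from the previous training step.
        ("/unload_lora_adapter", {"lora_name": lora_name}),
        # Load the adapter just saved by the trainer. Base weights and the
        # RadixAttention prefix cache for them are untouched.
        ("/load_lora_adapter", {"lora_name": lora_name, "lora_path": lora_path}),
    ]


def sync_lora(base_url: str, lora_name: str, lora_path: str) -> None:
    """POST each step of the reload plan to the inference server."""
    for endpoint, payload in lora_reload_plan(lora_name, lora_path):
        req = urllib.request.Request(
            base_url + endpoint,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # raises on HTTP errors
```

Because only the adapter weights change between steps, the server never restarts and cached prefixes over the frozen base model stay valid.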

The backend uses a two-environment architecture to resolve dependency conflicts: SGLang (torchao==0.9) runs in its own venv, isolated from Unsloth (torchao>=0.13), and the two communicate only via HTTP. DeviceConfig auto-detects available GPUs and computes the inference/training split.
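The split logic can be sketched as follows. This is illustrative only; the real `DeviceConfig` lives in `src/art/sglang_backend/`, and the class, field, and parameter names here (`GpuSplit`, `split_gpus`, `inference_fraction`) are hypothetical.

```python
# Minimal sketch of an auto-detected inference/training GPU split.
# Names and the inference_fraction heuristic are illustrative assumptions,
# not the PR's actual DeviceConfig implementation.
from dataclasses import dataclass


@dataclass
class GpuSplit:
    inference_gpus: list[int]  # GPUs pinned to the SGLang server
    training_gpus: list[int]   # GPUs left for Unsloth/GRPO (DDP)


def split_gpus(num_gpus: int, inference_fraction: float = 0.25) -> GpuSplit:
    """Reserve a share of GPUs for inference and give the rest to training."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    if num_gpus == 1:
        # Single-GPU fallback: inference and training share the device, and
        # weight sync degrades to a server restart with cache clearing.
        return GpuSplit(inference_gpus=[0], training_gpus=[0])
    n_inf = max(1, round(num_gpus * inference_fraction))
    n_inf = min(n_inf, num_gpus - 1)  # always leave >= 1 training GPU
    return GpuSplit(
        inference_gpus=list(range(n_inf)),
        training_gpus=list(range(n_inf, num_gpus)),
    )
```

In practice the detected count would come from `torch.cuda.device_count()`, with each process pinned to its slice via `CUDA_VISIBLE_DEVICES`.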

Benchmark (Qwen3-30B-A3B, GSM8K, 4×A100)

| Config            | tok/s | VRAM   |
| ----------------- | ----- | ------ |
| SGLang + Megatron | 1,174 | 133 GB |
| SGLang + Unsloth  | 724   | 186 GB |
| vLLM + Megatron   | 582   | 143 GB |

The throughput difference between the Megatron and Unsloth configurations comes from GPU allocation during inference. Megatron shards training across all GPUs via tensor parallelism, so during rollout every GPU serves inference. Unsloth currently supports only DDP for training (no tensor parallelism), which requires a permanent GPU split and leaves fewer GPUs serving rollouts at any given time.

Changes

  • src/art/sglang_backend/ — SGLangBackend, SGLangConfig, DeviceConfig, SGLangService
  • src/art/unsloth/training_utils.py — backend-agnostic training utilities extracted from service
  • benchmarks/sglang_benchmarks/ — end-to-end benchmark suite with DDP training, metrics collection, server lifecycle management
  • scripts/ — setup, e2e test, and benchmark runner scripts
  • docs/sglang-integration.md — architecture docs, configuration reference, troubleshooting
  • Two-environment setup scripts for torchao version isolation
  • Core fixes: ruler empty group handling, tokenizer compatibility, vLLM import patches

mukesh reddy and others added 9 commits February 16, 2026 21:38
…ore fixes

- SGLang backend with dedicated GPU split (inference GPU 0, training GPU 1+)
- LoRA hot-reload via SGLang API preserves RadixAttention cache
- Two-environment architecture for torchao version isolation
- Benchmarks: SGLang vs vLLM comparison suite
- Training utils extracted for backend-agnostic use
- DeviceConfig with auto-detection
- Ruler fix for empty trajectory groups and exception preservation
- vLLM compatibility patches