Achieving High Performance with PD Disaggregation & Large-scale Expert Parallelism
Based on a blog post by the SGLang Team, May 05, 2025
SGLang's implementation on 12 nodes (96 H100 GPUs) nearly matches DeepSeek's official inference throughput.
Input: 52.3k tokens/s per node
Output: 22.3k tokens/s per node
(Both figures measured with 2,000-token input sequences.)
Translates to roughly $0.20 per 1M output tokens, about one-fifth the cost of the official DeepSeek Chat API (a back-of-the-envelope calculation follows below).
Optimized strategy improves output throughput by up to 5x compared to vanilla tensor parallelism on the same resources.
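To see where the $0.20 per 1M output tokens figure comes from, here is a quick back-of-the-envelope calculation. The ~$2/hour rental price per H100 is an assumption used purely for illustration.

```python
# Rough cost estimate; the $2/GPU-hour price is an assumed rental rate.
output_tokens_per_s_per_node = 22_300        # decode throughput quoted above
gpu_hourly_cost = 2.0                        # USD per H100-hour (assumption)
node_hourly_cost = 8 * gpu_hourly_cost       # 8 H100s per node
tokens_per_hour = output_tokens_per_s_per_node * 3600
cost_per_million = node_hourly_cost / (tokens_per_hour / 1e6)
print(f"~${cost_per_million:.2f} per 1M output tokens")  # ~$0.20
```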
Attention layers: utilize DP Attention (data parallelism for attention), eliminating duplicated KV cache across devices.
Dense FFNs: adopt Data Parallelism (DP) over Tensor Parallelism (TP).
Sparse FFNs (MoE layers): implement Expert Parallelism (EP), spreading experts across GPUs.
LM head: employs Data Parallelism (DP), mirroring the dense FFNs.
LLM inference has two phases: the compute-intensive prefill and the memory-bound decode. Scheduling both in a unified engine is inefficient, since prefill batches interrupt ongoing decodes and the two phases have conflicting optimization needs.
Separating them onto dedicated prefill and decode workers allows tailored optimizations for each phase, maximizing GPU utilization.
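As a toy illustration of the split (not SGLang's actual interface, and with placeholder logic throughout), a prefill worker runs the single compute-bound pass over the prompt and hands the resulting KV cache to a decode worker, which then generates tokens step by step:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list                      # token ids for the prompt

class PrefillWorker:
    """Compute-bound: one forward pass over the full prompt, producing the KV cache."""
    def run(self, req):
        kv_cache = [f"kv_block_{i}" for i in range(len(req.prompt_tokens) // 512 + 1)]
        first_token = 0                      # placeholder for the first sampled token
        return kv_cache, first_token

class DecodeWorker:
    """Memory-bound: generates one token per step, reusing the transferred KV cache."""
    def run(self, kv_cache, first_token, max_new_tokens=4):
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append(tokens[-1] + 1)    # placeholder decode step
        return tokens

req = Request(prompt_tokens=list(range(2000)))
kv_cache, tok = PrefillWorker().run(req)     # phase 1: runs on a prefill node
print(DecodeWorker().run(kv_cache, tok))     # phase 2: runs on a decode node after KV transfer
```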
DeepEP streamlines EP by efficiently routing tokens to experts across GPUs. It provides two dispatch modes:
Normal Dispatch: For prefill (long inputs, max throughput). Incompatible with CUDA Graph.
Low-Latency Dispatch: For decode (output tokens, min delay). Supports CUDA Graph.
SGLang's PD Disaggregation runs prefill and decode on separate workers, so each phase can use its appropriate dispatch mode; this is what makes DeepEP usable together with DP Attention.
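A minimal sketch of how a scheduler can pick the dispatch mode per phase; `normal_dispatch`, `low_latency_dispatch`, and `route_to_experts` are hypothetical placeholders standing in for the two DeepEP code paths, not the library's real API:

```python
def normal_dispatch(hidden_states, expert_ids):
    # Throughput-oriented all-to-all for long prefill batches;
    # dynamic shapes make it incompatible with CUDA Graph capture.
    return f"normal dispatch of {len(hidden_states)} tokens"

def low_latency_dispatch(hidden_states, expert_ids):
    # Latency-oriented path for small decode steps; CUDA-Graph-friendly.
    return f"low-latency dispatch of {len(hidden_states)} tokens"

def route_to_experts(hidden_states, expert_ids, phase):
    dispatch = normal_dispatch if phase == "prefill" else low_latency_dispatch
    return dispatch(hidden_states, expert_ids)

print(route_to_experts(list(range(4096)), None, "prefill"))
print(route_to_experts([1], None, "decode"))
```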
DeepGEMM optimizes the MoE matrix multiplications (grouped GEMMs) with two specialized kernels:
Contiguous Layout Kernel: For prefill (dynamic shapes). Used with DeepEP's Normal Dispatch (requires permutation).
Masked Layout Kernel: For decode (fixed shapes, CUDA Graph compatible). Used with DeepEP's Low-Latency Dispatch.
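The two layouts can be illustrated with plain PyTorch matmuls standing in for the FP8 kernels (this is not DeepGEMM's API, just the data-layout idea):

```python
from itertools import accumulate
import torch

num_experts, hidden, out_dim = 4, 8, 8
weights = torch.randn(num_experts, hidden, out_dim)

# Contiguous layout (prefill): tokens for all experts packed back-to-back,
# addressed by per-expert offsets; shapes change from batch to batch.
tokens_per_expert = [5, 0, 3, 2]
x = torch.randn(sum(tokens_per_expert), hidden)
offsets = [0, *accumulate(tokens_per_expert)]
y_contig = torch.cat([x[offsets[e]:offsets[e + 1]] @ weights[e]
                      for e in range(num_experts)])

# Masked layout (decode): every expert gets a fixed-capacity slot plus a count of
# valid rows, so tensor shapes stay static and the step can be CUDA-Graph-captured.
capacity = 8
x_masked = torch.zeros(num_experts, capacity, hidden)
for e, n in enumerate(tokens_per_expert):
    x_masked[e, :n] = x[offsets[e]:offsets[e] + n]
y_masked = torch.bmm(x_masked, weights)      # rows beyond each valid count are ignored
```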
Two-Batch Overlap (TBO) splits a batch into two micro-batches so that one micro-batch's computation overlaps with the other's communication, hiding all-to-all latency.
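A conceptual sketch of the overlap, using a thread pool and sleeps as stand-ins for real kernels and all-to-all calls (the exact stage ordering SGLang uses may differ):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def compute(micro_batch, stage):
    time.sleep(0.01)                         # stand-in for attention / expert GEMMs
    return f"{micro_batch}/{stage} computed"

def communicate(micro_batch, stage):
    time.sleep(0.01)                         # stand-in for all-to-all dispatch/combine
    return f"{micro_batch}/{stage} communicated"

def run_layer(pool):
    # While micro-batch A's tokens are in flight, micro-batch B computes, and
    # vice versa, so communication time is hidden behind computation.
    a_dispatch = pool.submit(communicate, "A", "dispatch")
    compute("B", "attention")
    a_dispatch.result()
    b_dispatch = pool.submit(communicate, "B", "dispatch")
    compute("A", "experts")
    b_dispatch.result()

with ThreadPoolExecutor(max_workers=1) as pool:
    run_layer(pool)
```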
The Expert Parallelism Load Balancer (EPLB) addresses the uneven expert workload distribution inherent in MoE models by computing an expert placement that spreads load evenly across GPUs.
Its effectiveness depends on how closely the statistics used for placement match the live serving workload; larger batches and periodic rebalancing help close that gap.
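In the spirit of EPLB (but not its actual algorithm), a load-balancing pass can be sketched as a greedy placement: given measured per-expert token counts, assign the heaviest experts first, always to the GPU with the least accumulated load and a free slot. Real EPLB also replicates hot experts; this toy version only reorders placement:

```python
import heapq

def rebalance(expert_loads, num_gpus, experts_per_gpu):
    """Greedy placement: heaviest experts first, onto the least-loaded GPU with a free slot."""
    heap = [(0.0, g) for g in range(num_gpus)]            # (accumulated load, gpu id)
    heapq.heapify(heap)
    free_slots = {g: experts_per_gpu for g in range(num_gpus)}
    placement = {g: [] for g in range(num_gpus)}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        free_slots[gpu] -= 1
        if free_slots[gpu] > 0:                           # full GPUs leave the heap
            heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

fake_loads = {e: float((e * 37) % 100 + 1) for e in range(32)}   # fake per-expert token counts
print(rebalance(fake_loads, num_gpus=4, experts_per_gpu=8))
```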
Prefill: on 4 nodes (32 H100s, EP32):
Up to 3.3x improvement over the TP16 baseline.
Throughput within 5.6% of DeepSeek's official profile (assuming perfect workload balance).
Example: 50,302 input tokens/s per node for 4K-token prompts.
Decode: on 9 nodes (72 H100s, EP72):
5.2x speedup over the TP16 baseline.
With simulated MTP, throughput is 6.6% below DeepSeek's profile.
Example: 22,282 output tokens/s per node for 2K-token inputs.
EPLB delivers a significant speedup by mitigating workload imbalance:
There is a strong correlation between workload balancedness and overall throughput (quantified in the sketch below).
Prefill and decode exhibit different expert distributions, which supports PD disaggregation's use of phase-specific expert placement.
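One simple way to quantify the balancedness mentioned above is the ratio of mean to maximum per-GPU token load (1.0 = perfectly even); treating this exact ratio as the metric is an assumption for illustration:

```python
def balancedness(tokens_per_gpu):
    # 1.0 means a perfectly even load; lower values mean the busiest GPU gates the step.
    return sum(tokens_per_gpu) / len(tokens_per_gpu) / max(tokens_per_gpu)

print(balancedness([1000, 1000, 1000, 1000]))  # 1.0   -> ideal
print(balancedness([400, 800, 1200, 1600]))    # 0.625 -> imbalance limits throughput
```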
By integrating Prefill-Decode Disaggregation with sophisticated Expert Parallelism techniques (DeepEP, DeepGEMM, TBO, EPLB), SGLang deploys the large DeepSeek model on H100 GPUs at performance nearly matching DeepSeek's official reports while significantly reducing cost. The open-source nature of these components empowers the community to build on these optimizations for efficient large-scale LLM serving.