Achieving High Performance with PD Disaggregation & Large-scale Expert Parallelism
Based on a blog post by the SGLang Team, May 05, 2025
SGLang's implementation on 12 nodes (96 H100 GPUs) nearly matches DeepSeek's official inference throughput.
Input: 52.3k tokens/s per node
Output: 22.3k tokens/s per node
(Both figures measured with 2,000-token input sequences.)
Translates to roughly $0.20 per 1M output tokens, about one-fifth the cost of the official DeepSeek Chat API (a back-of-the-envelope calculation follows below).
Optimized strategy improves output throughput by up to 5x compared to vanilla tensor parallelism on the same resources.
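To see where the $0.20 per 1M output tokens figure comes from, here is a quick back-of-the-envelope calculation. The ~$2/hour rental price per H100 is an assumption used purely for illustration.

```python
# Rough cost estimate; the $2/GPU-hour price is an assumed rental rate.
output_tokens_per_s_per_node = 22_300        # decode throughput quoted above
gpu_hourly_cost = 2.0                        # USD per H100-hour (assumption)
node_hourly_cost = 8 * gpu_hourly_cost       # 8 H100s per node
tokens_per_hour = output_tokens_per_s_per_node * 3600
cost_per_million = node_hourly_cost / (tokens_per_hour / 1e6)
print(f"~${cost_per_million:.2f} per 1M output tokens")  # ~$0.20
```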
Attention layers: utilize DP Attention (data parallelism for attention), eliminating duplicated KV cache across devices.
Dense FFNs: adopt Data Parallelism (DP) over Tensor Parallelism (TP).
Sparse FFNs (MoE layers): implement Expert Parallelism (EP), spreading experts across GPUs.
LM head: employs Data Parallelism (DP), mirroring the dense FFNs.
LLM inference has two phases: the compute-intensive prefill and the memory-bound decode. Scheduling both in a unified engine is inefficient, since prefill batches interrupt ongoing decodes and the two phases have conflicting optimization needs.
Separating them onto dedicated prefill and decode workers allows tailored optimizations for each phase, maximizing GPU utilization.
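As a toy illustration of the split (not SGLang's actual interface, and with placeholder logic throughout), a prefill worker runs the single compute-bound pass over the prompt and hands the resulting KV cache to a decode worker, which then generates tokens step by step:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list                      # token ids for the prompt

class PrefillWorker:
    """Compute-bound: one forward pass over the full prompt, producing the KV cache."""
    def run(self, req):
        kv_cache = [f"kv_block_{i}" for i in range(len(req.prompt_tokens) // 512 + 1)]
        first_token = 0                      # placeholder for the first sampled token
        return kv_cache, first_token

class DecodeWorker:
    """Memory-bound: generates one token per step, reusing the transferred KV cache."""
    def run(self, kv_cache, first_token, max_new_tokens=4):
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            tokens.append(tokens[-1] + 1)    # placeholder decode step
        return tokens

req = Request(prompt_tokens=list(range(2000)))
kv_cache, tok = PrefillWorker().run(req)     # phase 1: runs on a prefill node
print(DecodeWorker().run(kv_cache, tok))     # phase 2: runs on a decode node after KV transfer
```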
DeepEP streamlines EP by efficiently routing tokens to experts across GPUs. It provides two dispatch modes:
Normal Dispatch: For prefill (long inputs, max throughput). Incompatible with CUDA Graph.
Low-Latency Dispatch: For decode (output tokens, min delay). Supports CUDA Graph.
SGLang's PD Disaggregation runs prefill and decode on separate workers, so each phase can use its appropriate dispatch mode; this is what makes DeepEP usable together with DP Attention.
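A minimal sketch of how a scheduler can pick the dispatch mode per phase; `normal_dispatch`, `low_latency_dispatch`, and `route_to_experts` are hypothetical placeholders standing in for the two DeepEP code paths, not the library's real API:

```python
def normal_dispatch(hidden_states, expert_ids):
    # Throughput-oriented all-to-all for long prefill batches;
    # dynamic shapes make it incompatible with CUDA Graph capture.
    return f"normal dispatch of {len(hidden_states)} tokens"

def low_latency_dispatch(hidden_states, expert_ids):
    # Latency-oriented path for small decode steps; CUDA-Graph-friendly.
    return f"low-latency dispatch of {len(hidden_states)} tokens"

def route_to_experts(hidden_states, expert_ids, phase):
    dispatch = normal_dispatch if phase == "prefill" else low_latency_dispatch
    return dispatch(hidden_states, expert_ids)

print(route_to_experts(list(range(4096)), None, "prefill"))
print(route_to_experts([1], None, "decode"))
```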
DeepGEMM optimizes the MoE matrix multiplications (grouped GEMMs) with two specialized kernels:
Contiguous Layout Kernel: For prefill (dynamic shapes). Used with DeepEP's Normal Dispatch (requires permutation).
Masked Layout Kernel: For decode (fixed shapes, CUDA Graph compatible). Used with DeepEP's Low-Latency Dispatch.
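The two layouts can be illustrated with plain PyTorch matmuls standing in for the FP8 kernels (this is not DeepGEMM's API, just the data-layout idea):

```python
from itertools import accumulate
import torch

num_experts, hidden, out_dim = 4, 8, 8
weights = torch.randn(num_experts, hidden, out_dim)

# Contiguous layout (prefill): tokens for all experts packed back-to-back,
# addressed by per-expert offsets; shapes change from batch to batch.
tokens_per_expert = [5, 0, 3, 2]
x = torch.randn(sum(tokens_per_expert), hidden)
offsets = [0, *accumulate(tokens_per_expert)]
y_contig = torch.cat([x[offsets[e]:offsets[e + 1]] @ weights[e]
                      for e in range(num_experts)])

# Masked layout (decode): every expert gets a fixed-capacity slot plus a count of
# valid rows, so tensor shapes stay static and the step can be CUDA-Graph-captured.
capacity = 8
x_masked = torch.zeros(num_experts, capacity, hidden)
for e, n in enumerate(tokens_per_expert):
    x_masked[e, :n] = x[offsets[e]:offsets[e] + n]
y_masked = torch.bmm(x_masked, weights)      # rows beyond each valid count are ignored
```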
Two-Batch Overlap (TBO) splits a batch into two micro-batches so that one micro-batch's computation overlaps with the other's communication, hiding all-to-all latency.
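A conceptual sketch of the overlap, using a thread pool and sleeps as stand-ins for real kernels and all-to-all calls (the exact stage ordering SGLang uses may differ):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def compute(micro_batch, stage):
    time.sleep(0.01)                         # stand-in for attention / expert GEMMs
    return f"{micro_batch}/{stage} computed"

def communicate(micro_batch, stage):
    time.sleep(0.01)                         # stand-in for all-to-all dispatch/combine
    return f"{micro_batch}/{stage} communicated"

def run_layer(pool):
    # While micro-batch A's tokens are in flight, micro-batch B computes, and
    # vice versa, so communication time is hidden behind computation.
    a_dispatch = pool.submit(communicate, "A", "dispatch")
    compute("B", "attention")
    a_dispatch.result()
    b_dispatch = pool.submit(communicate, "B", "dispatch")
    compute("A", "experts")
    b_dispatch.result()

with ThreadPoolExecutor(max_workers=1) as pool:
    run_layer(pool)
```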
The Expert Parallelism Load Balancer (EPLB) addresses the uneven expert workload distribution inherent in MoE models by computing an expert placement that spreads load evenly across GPUs.
Its effectiveness depends on how closely the statistics used for placement match the live serving workload; larger batches and periodic rebalancing help close that gap.
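In the spirit of EPLB (but not its actual algorithm), a load-balancing pass can be sketched as a greedy placement: given measured per-expert token counts, assign the heaviest experts first, always to the GPU with the least accumulated load and a free slot. Real EPLB also replicates hot experts; this toy version only reorders placement:

```python
import heapq

def rebalance(expert_loads, num_gpus, experts_per_gpu):
    """Greedy placement: heaviest experts first, onto the least-loaded GPU with a free slot."""
    heap = [(0.0, g) for g in range(num_gpus)]            # (accumulated load, gpu id)
    heapq.heapify(heap)
    free_slots = {g: experts_per_gpu for g in range(num_gpus)}
    placement = {g: [] for g in range(num_gpus)}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        free_slots[gpu] -= 1
        if free_slots[gpu] > 0:                           # full GPUs leave the heap
            heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

fake_loads = {e: float((e * 37) % 100 + 1) for e in range(32)}   # fake per-expert token counts
print(rebalance(fake_loads, num_gpus=4, experts_per_gpu=8))
```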
Prefill: on 4 nodes (32 H100s, EP32):
Up to 3.3x improvement over the TP16 baseline.
Throughput within 5.6% of DeepSeek's official profile (assuming perfect workload balance).
Example: 50,302 input tokens/s per node for 4K-token prompts.
Decode: on 9 nodes (72 H100s, EP72):
5.2x speedup over the TP16 baseline.
With simulated MTP, throughput is 6.6% below DeepSeek's profile.
Example: 22,282 output tokens/s per node for 2K-token inputs.
EPLB delivers a significant speedup by mitigating workload imbalance:
There is a strong correlation between workload balancedness and overall throughput (quantified in the sketch below).
Prefill and decode exhibit different expert distributions, which supports PD disaggregation's use of phase-specific expert placement.
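One simple way to quantify the balancedness mentioned above is the ratio of mean to maximum per-GPU token load (1.0 = perfectly even); treating this exact ratio as the metric is an assumption for illustration:

```python
def balancedness(tokens_per_gpu):
    # 1.0 means a perfectly even load; lower values mean the busiest GPU gates the step.
    return sum(tokens_per_gpu) / len(tokens_per_gpu) / max(tokens_per_gpu)

print(balancedness([1000, 1000, 1000, 1000]))  # 1.0   -> ideal
print(balancedness([400, 800, 1200, 1600]))    # 0.625 -> imbalance limits throughput
```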
By integrating Prefill-Decode Disaggregation with sophisticated Expert Parallelism techniques (DeepEP, DeepGEMM, TBO, EPLB), SGLang deploys the large DeepSeek model on H100 GPUs at performance nearly matching DeepSeek's official reports while significantly reducing cost. The open-source nature of these components empowers the community to build on these optimizations for efficient large-scale LLM serving.