Overview
All features needed for benchmarking MoE models are available in the Megatron-LM main branch.
The following scripts and Dockerfiles are examples. For our tested software stack, please refer to official Megatron Core releases and NGC PyTorch images.
Resources
- Scripts: https://github.com/yanring/Megatron-MoE-ModelZoo
- Docker: https://github.com/yanring/Megatron-MoE-ModelZoo/tree/main/dockers
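As a minimal getting-started sketch for the resources above (the NGC image tag is only an example; substitute the release that matches your tested stack):
# Hedged sketch: fetch the example scripts and a recent NGC PyTorch image.
git clone https://github.com/yanring/Megatron-MoE-ModelZoo.git
cd Megatron-MoE-ModelZoo        # Dockerfiles live under dockers/

# Example tag only -- pick the NGC PyTorch release matching your tested stack.
docker pull nvcr.io/nvidia/pytorch:24.12-py3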
Usage
The full benchmarking setup consists of several scripts; the core configuration body is provided below.
Note: If running on the main branch, you need to set A2A_OVERLAP=0.
Basic Launch Command
A2A_OVERLAP=1 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=2 EP=64 NNODES=128 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
--recompute-granularity selective \
--recompute-modules mla_up_proj mlp \
--pipeline-model-parallel-layout "Et*3|(tt|)*29|L"
Recommended Configurations
32 Nodes with MTP
OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL"
32 Nodes with MTP and MoE Forced Balance
OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL" \
--moe-router-force-load-balancing
16K Sequence + 32 Nodes with MTP
OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=4 EP=32 CP=1 NNODES=32 GBS=3840 SEQ_LEN=16384 PR=fp8 \
bash sbatch_benchmarking.sh \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL"
DeepEP Configuration
Enable DeepEP
--moe-token-dispatcher-type flex
--moe-enable-deepep true
Disable DeepEP
--moe-token-dispatcher-type alltoall
--moe-enable-deepep false
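As a sketch of how these flags combine with a launch (assuming extra arguments are forwarded to the training script the same way as the recompute options above), enabling DeepEP on top of the basic launch command would look like:
A2A_OVERLAP=1 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=2 EP=64 NNODES=128 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
--recompute-granularity selective \
--recompute-modules mla_up_proj mlp \
--pipeline-model-parallel-layout "Et*3|(tt|)*29|L" \
--moe-token-dispatcher-type flex \
--moe-enable-deepep true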
Parameter Explanations
- PP: Pipeline Parallelism
- VPP: Virtual Pipeline Parallelism
- TP: Tensor Parallelism
- EP: Expert Parallelism
- CP: Context Parallelism
- NNODES: Number of Nodes
- GBS: Global Batch Size
- PR: Precision (fp8, bf16, etc.)
- SEQ_LEN: Sequence Length
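These knobs are related through the usual Megatron decomposition of the GPU grid. A minimal sanity-check sketch, assuming 8 GPUs per node (EP is carved out of the data-parallel dimension for expert layers, so it is not multiplied in again here):
# Values from the basic launch command above.
NNODES=128 TP=2 PP=8 CP=1 GBS=8192
GPUS_PER_NODE=8   # assumption; adjust for your hardware

WORLD_SIZE=$((NNODES * GPUS_PER_NODE))   # 1024 GPUs
DP=$((WORLD_SIZE / (TP * PP * CP)))      # data-parallel size = 64
echo "world size: ${WORLD_SIZE}, data parallel: ${DP}"
echo "samples per data-parallel rank per step: $((GBS / DP))"   # 128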
Notes
- Performance is environment-specific and may vary based on hardware configuration
Credit to @yanring