Benchmarking DeepEP Guide #1721

Posted by @sbhavani

Overview

All of the features needed for benchmarking MoE models are available on the Megatron-LM main branch.

The scripts and Dockerfiles below are examples only. For the tested software stack, refer to the official Megatron Core releases and the NGC PyTorch images.
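Since the tested stack is not pinned here, a minimal environment sketch may still help orient readers. The container tag, repository choices, and DeepEP install step below are assumptions for illustration, not the officially tested stack:

# Hypothetical setup sketch; pick an NGC PyTorch tag matching your tested stack.
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.10-py3 bash

# Inside the container: fetch Megatron-LM main and DeepEP.
git clone https://github.com/NVIDIA/Megatron-LM.git
git clone https://github.com/deepseek-ai/DeepEP.git

# DeepEP builds against NVSHMEM; see the DeepEP README for the exact steps.
cd DeepEP
NVSHMEM_DIR=/path/to/nvshmem python setup.py install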

Usage

The full benchmark is driven by several scripts; the core launch configuration is shown below.

Note: if you are running from the main branch, set A2A_OVERLAP=0.
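For orientation, here is a hedged sketch of how a wrapper like sbatch_benchmarking.sh might map the environment variables onto Megatron-LM arguments. The flag mapping is an assumption for illustration; the real script may differ, and the model and optimizer flags are omitted:

# Illustrative mapping only; not the actual contents of sbatch_benchmarking.sh.
MEGATRON_ARGS=(
  --tensor-model-parallel-size   "${TP:-1}"
  --pipeline-model-parallel-size "${PP:-1}"
  --expert-model-parallel-size   "${EP:-1}"
  --context-parallel-size        "${CP:-1}"
  --global-batch-size            "${GBS:-8192}"
  --seq-length                   "${SEQ_LEN:-4096}"
)
# PR picks the precision recipe; fp8 is assumed to map to Megatron's FP8 flags.
if [[ "${PR:-bf16}" == "fp8" ]]; then
  MEGATRON_ARGS+=(--bf16 --fp8-format hybrid)
else
  MEGATRON_ARGS+=(--bf16)
fi
# VPP maps to Megatron's virtual-pipeline flags (the exact flag is version-dependent).
# Extra CLI flags (recompute, layout, ...) pass straight through as "$@".
torchrun --nproc_per_node=8 pretrain_gpt.py "${MEGATRON_ARGS[@]}" "$@"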

Basic Launch Command

A2A_OVERLAP=1 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=2 EP=64 NNODES=128 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity selective \
  --recompute-modules mla_up_proj mlp \
  --pipeline-model-parallel-layout "Et*3|(tt|)*29|L"
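A hedged reading of the layout string, assuming the Megatron-LM convention where E is the embedding, t a transformer decoder layer, m an MTP layer, L the loss stage, | a stage/chunk boundary, and (...)*N repeats a pattern:

# "Et*3|(tt|)*29|L" under the assumed convention:
#   first chunk : embedding plus 3 decoder layers
#   middle      : 29 chunks of 2 decoder layers each
#   last chunk  : the loss/output stage
# Decoder layers total 3 + 29*2 = 61, which matches DeepSeek-V3.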

Recommended Configurations

32 Nodes with MTP

OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL"

32 Nodes with MTP and MoE Forced Balance

OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL" \
  --moe-router-force-load-balancing
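To benchmark both 32-node variants back to back, the optional flag can be swept in a loop; this is only a convenience wrapper around the two commands above:

# Run the 32-node MTP config with and without forced MoE load balancing.
for extra in "" "--moe-router-force-load-balancing"; do
  OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
  bash sbatch_benchmarking.sh \
    --recompute-granularity full \
    --recompute-method uniform \
    --recompute-num-layers 1 \
    --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL" \
    $extra
done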

16K Sequence + 32 Nodes with MTP

OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=4 EP=32 CP=1 NNODES=32 GBS=3840 SEQ_LEN=16384 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL"

DeepEP Configuration

Enable DeepEP (route MoE tokens through the flex dispatcher):

--moe-token-dispatcher-type flex
--moe-enable-deepep

Disable DeepEP (fall back to the default all-to-all dispatcher; --moe-enable-deepep is a boolean switch, so simply omit it):

--moe-token-dispatcher-type alltoall
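Putting the pieces together, the basic launch command with DeepEP enabled looks like this (assuming the extra flags pass through to Megatron-LM as in the recommended configurations above):

A2A_OVERLAP=1 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=2 EP=64 NNODES=128 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity selective \
  --recompute-modules mla_up_proj mlp \
  --pipeline-model-parallel-layout "Et*3|(tt|)*29|L" \
  --moe-token-dispatcher-type flex \
  --moe-enable-deepep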

Parameter Explanations

  • PP: Pipeline Parallelism
  • VPP: Virtual Pipeline Parallelism
  • TP: Tensor Parallelism
  • EP: Expert Parallelism
  • CP: Context Parallelism
  • NNODES: Number of Nodes
  • GBS: Global Batch Size
  • PR: Precision (fp8, bf16, etc.)
  • SEQ_LEN: Sequence Length
  • A2A_OVERLAP: All-to-all communication overlap (set to 0 on the main branch; see the note above)
  • OPTIMIZER_OFFLOAD: Optimizer state offloading, which reduces GPU memory pressure

Notes

  • Performance is environment-specific and may vary based on hardware configuration

Credit to @yanring
