Benchmarking DeepEP Guide #1721

Posted by @sbhavani

Overview

All of the features needed for benchmarking MoE models are available on the Megatron-LM main branch.

The scripts and Dockerfiles below are examples only. For the tested software stack, refer to the official Megatron Core releases and the NGC PyTorch images.
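Since the tested stack is not pinned here, a minimal environment sketch may still help orient readers. The container tag, repository choices, and DeepEP install step below are assumptions for illustration, not the officially tested stack:

# Hypothetical setup sketch; pick an NGC PyTorch tag matching your tested stack.
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.10-py3 bash

# Inside the container: fetch Megatron-LM main and DeepEP.
git clone https://github.com/NVIDIA/Megatron-LM.git
git clone https://github.com/deepseek-ai/DeepEP.git

# DeepEP builds against NVSHMEM; see the DeepEP README for the exact steps.
cd DeepEP
NVSHMEM_DIR=/path/to/nvshmem python setup.py install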

Usage

The full benchmark is driven by several scripts; the core launch configuration is shown below.

Note: if you are running from the main branch, set A2A_OVERLAP=0.
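For orientation, here is a hedged sketch of how a wrapper like sbatch_benchmarking.sh might map the environment variables onto Megatron-LM arguments. The flag mapping is an assumption for illustration; the real script may differ, and the model and optimizer flags are omitted:

# Illustrative mapping only; not the actual contents of sbatch_benchmarking.sh.
MEGATRON_ARGS=(
  --tensor-model-parallel-size   "${TP:-1}"
  --pipeline-model-parallel-size "${PP:-1}"
  --expert-model-parallel-size   "${EP:-1}"
  --context-parallel-size        "${CP:-1}"
  --global-batch-size            "${GBS:-8192}"
  --seq-length                   "${SEQ_LEN:-4096}"
)
# PR picks the precision recipe; fp8 is assumed to map to Megatron's FP8 flags.
if [[ "${PR:-bf16}" == "fp8" ]]; then
  MEGATRON_ARGS+=(--bf16 --fp8-format hybrid)
else
  MEGATRON_ARGS+=(--bf16)
fi
# VPP maps to Megatron's virtual-pipeline flags (the exact flag is version-dependent).
# Extra CLI flags (recompute, layout, ...) pass straight through as "$@".
torchrun --nproc_per_node=8 pretrain_gpt.py "${MEGATRON_ARGS[@]}" "$@"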

Basic Launch Command

A2A_OVERLAP=1 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=2 EP=64 NNODES=128 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity selective \
  --recompute-modules mla_up_proj mlp \
  --pipeline-model-parallel-layout "Et*3|(tt|)*29|L"
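A hedged reading of the layout string, assuming the Megatron-LM convention where E is the embedding, t a transformer decoder layer, m an MTP layer, L the loss stage, | a stage/chunk boundary, and (...)*N repeats a pattern:

# "Et*3|(tt|)*29|L" under the assumed convention:
#   first chunk : embedding plus 3 decoder layers
#   middle      : 29 chunks of 2 decoder layers each
#   last chunk  : the loss/output stage
# Decoder layers total 3 + 29*2 = 61, which matches DeepSeek-V3.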

Recommended Configurations

32 Nodes with MTP

OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL"

32 Nodes with MTP and MoE Forced Balance

OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL" \
  --moe-router-force-load-balancing
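To benchmark both 32-node variants back to back, the optional flag can be swept in a loop; this is only a convenience wrapper around the two commands above:

# Run the 32-node MTP config with and without forced MoE load balancing.
for extra in "" "--moe-router-force-load-balancing"; do
  OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=1 EP=32 CP=1 NNODES=32 GBS=8192 PR=fp8 \
  bash sbatch_benchmarking.sh \
    --recompute-granularity full \
    --recompute-method uniform \
    --recompute-num-layers 1 \
    --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL" \
    $extra
done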

16K Sequence + 32 Nodes with MTP

OPTIMIZER_OFFLOAD=1 A2A_OVERLAP=0 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=4 EP=32 CP=1 NNODES=32 GBS=3840 SEQ_LEN=16384 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity full \
  --recompute-method uniform \
  --recompute-num-layers 1 \
  --pipeline-model-parallel-layout "Et*2|(tt|)*22t|(tt|)*7mL"

DeepEP Configuration

Enable DeepEP (route MoE tokens through the flex dispatcher):

--moe-token-dispatcher-type flex
--moe-enable-deepep

Disable DeepEP (fall back to the default all-to-all dispatcher; --moe-enable-deepep is a boolean switch, so simply omit it):

--moe-token-dispatcher-type alltoall
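Putting the pieces together, the basic launch command with DeepEP enabled looks like this (assuming the extra flags pass through to Megatron-LM as in the recommended configurations above):

A2A_OVERLAP=1 MODEL=DeepSeek-V3 PP=8 VPP=4 TP=2 EP=64 NNODES=128 GBS=8192 PR=fp8 \
bash sbatch_benchmarking.sh \
  --recompute-granularity selective \
  --recompute-modules mla_up_proj mlp \
  --pipeline-model-parallel-layout "Et*3|(tt|)*29|L" \
  --moe-token-dispatcher-type flex \
  --moe-enable-deepep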

Parameter Explanations

  • PP: Pipeline Parallelism
  • VPP: Virtual Pipeline Parallelism
  • TP: Tensor Parallelism
  • EP: Expert Parallelism
  • CP: Context Parallelism
  • NNODES: Number of Nodes
  • GBS: Global Batch Size
  • PR: Precision (fp8, bf16, etc.)
  • SEQ_LEN: Sequence Length
  • A2A_OVERLAP: All-to-all communication overlap (set to 0 on the main branch; see the note above)
  • OPTIMIZER_OFFLOAD: Optimizer state offloading, which reduces GPU memory pressure

Notes

  • Performance is environment-specific and may vary based on hardware configuration

Credit to @yanring
