[QUESTION] Purpose of all2all communication for probs in MoEAlltoAllTokenDispatcher #1716

@snowpeakz

In ADLR/megatron-lm!2668 (perf(MoE): Memory efficient token permutation), I noticed that an all2all communication is now applied to the probs tensor in MoEAlltoAllTokenDispatcher. Could someone explain the motivation for this? I did not observe a drop in memory usage compared to previous versions.

Appreciate any insights or references to discussions/design documents!
