[QUESTION] Purpose of all2all communication for probs in MoEAlltoAllTokenDispatcher

In `ADLR/megatron-lm!2668 - perf(MoE): Memory efficient token permutation`, I noticed that an all2all communication is applied to the probs tensor in `MoEAlltoAllTokenDispatcher`. Could someone explain the motivation for this? I didn't observe a drop in memory usage compared to previous versions.

I'd appreciate any insights or references to discussions or design documents!