Speed up model parallel initialization #1662

alexqdh · 2025-07-02T15:42:55Z

Process groups are currently created on every rank, regardless of whether that rank is a member of the group. This is extremely time-consuming, especially in large-scale distributed training. The reason is that, by default, torch requires groups to be created in the same order on all processes.
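To make the cost concrete, here is a minimal counting sketch. It assumes groups are contiguous blocks of ranks of equal size, which is a simplification for illustration only. The helper names `contiguous_groups`, `calls_when_all_ranks_participate` and `calls_when_only_members_participate` are hypothetical and do not come from this PR.

```python
from typing import List


def contiguous_groups(world_size: int, group_size: int) -> List[List[int]]:
    """Split ranks 0..world_size-1 into contiguous blocks (illustrative layout)."""
    if group_size <= 0 or world_size % group_size != 0:
        raise ValueError("world_size must be a positive multiple of group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def calls_when_all_ranks_participate(world_size: int, group_size: int) -> int:
    """Every rank enters group creation once per group, member or not."""
    return len(contiguous_groups(world_size, group_size))


def calls_when_only_members_participate(
    rank: int, world_size: int, group_size: int
) -> int:
    """A rank enters group creation only for the groups that contain it."""
    return sum(
        1 for ranks in contiguous_groups(world_size, group_size) if rank in ranks
    )
```

Under this layout the per-rank number of creation calls grows with the number of groups when every rank participates, while it stays at one per parallel dimension when only member ranks participate.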

This PR adds use_local_synchronization=True to all create_group() calls, so that group creation synchronizes only among the ranks that belong to the group, which can further improve performance during group creation. With this change, a process group is created only when the current rank is actually part of it (if rank in ranks:).
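The guarded pattern described above can be sketched as follows. This is not the PR's actual diff: `create_groups_for_rank` and `fake_new_group` are hypothetical names, and the group factory is passed in as a parameter so the sketch runs without a distributed launch. The only library fact assumed is that `torch.distributed.new_group` accepts the `ranks` and `use_local_synchronization` keyword arguments.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def create_groups_for_rank(
    rank: int,
    rank_lists: Sequence[Sequence[int]],
    new_group: Callable[..., Any],
) -> Dict[Tuple[int, ...], Any]:
    """Create only the groups that contain `rank`.

    `new_group` is a stand-in for a group factory such as
    torch.distributed.new_group (assumption: the caller supplies it).
    """
    groups: Dict[Tuple[int, ...], Any] = {}
    for ranks in rank_lists:
        # Skip groups this rank does not belong to instead of joining
        # a creation call that every process in the job must enter.
        if rank in ranks:
            groups[tuple(ranks)] = new_group(
                ranks=list(ranks),
                use_local_synchronization=True,
            )
    return groups


def fake_new_group(ranks, use_local_synchronization=False):
    """Hypothetical stub that records its arguments; used for local testing only."""
    return ("group", tuple(ranks), use_local_synchronization)
```

In a real job one would presumably pass `torch.distributed.new_group` as the factory after `init_process_group` has run. Skipping non-member ranks is only safe together with `use_local_synchronization=True`, because otherwise every process in the job is expected to enter each group creation call.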

qinduohao and others added 2 commits July 2, 2025 23:29

Speed up model parallel initialization

7b86527

Merge branch 'main' into speed_up_mp_initialization

8a5d9ed

sbhavani added the module: distributed label Jul 22, 2025
