[QUESTION] NCCL timeout error on the second iteration #1141

I am running GPT-3 on a single machine with 4 GPUs.
The first iteration runs without any errors,
but the run gets stuck in the second iteration, at the second step, and then fails.
The errors are as follows:



[iteration] datetime: 2024-09-13 07:04:42 
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.


Has anyone run into the same problem? Thanks a lot.
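
To make the log above easier to read, here is a minimal sketch that groups the watchdog timeout lines by rank. It is only an illustration: the names `WATCHDOG_RE`, `SAMPLE_LOG`, and `summarize_timeouts` are hypothetical, and the regular expression is an assumption that matches only the line format shown in this post, not every PyTorch version's output. It does not diagnose the hang; it only shows which ranks and sequence numbers were pending when the watchdog fired.

```python
import re
from collections import defaultdict

# Assumed line format, copied from the log in this post.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)

# The watchdog lines from the post, including the repeated what() message.
SAMPLE_LOG = """\
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
"""


def summarize_timeouts(log_text):
    """Return {rank: sorted list of (seq_num, op_type, elapsed_ms)}.

    A set is used per rank so that a message repeated in the log
    (such as the what() line) is counted only once.
    """
    per_rank = defaultdict(set)
    for match in WATCHDOG_RE.finditer(log_text):
        entry = (int(match["seq"]), match["op"], int(match["elapsed"]))
        per_rank[int(match["rank"])].add(entry)
    return {rank: sorted(entries) for rank, entries in per_rank.items()}


if __name__ == "__main__":
    for rank, entries in sorted(summarize_timeouts(SAMPLE_LOG).items()):
        for seq, op, elapsed_ms in entries:
            print(f"rank {rank}: {op} SeqNum={seq} ran for {elapsed_ms} ms")
```

In this log only ranks 0 and 3 report timeouts, and every reported operation is an ALLREDUCE that ran just past the 600000 ms limit.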

