
Running ./train_unet.sh always raises ZeroDivisionError #289

Open
@napoleon516

Description


```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/data/fanxf7/LatentSync/scripts/train_unet.py", line 525, in
[rank0]:     main(config)
[rank0]:   File "/data/fanxf7/LatentSync/scripts/train_unet.py", line 235, in main
[rank0]:     num_train_epochs = math.ceil(config.run.max_train_steps / num_update_steps_per_epoch)
[rank0]: ZeroDivisionError: division by zero
[rank0]:[W623 16:45:48.055158738 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0623 16:45:50.840000 140504072832832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2200512) of binary: /home/fanxf7/miniconda3/envs/latentsync/bin/python3.10
Traceback (most recent call last):
  File "/home/fanxf7/miniconda3/envs/latentsync/bin/torchrun", line 8, in
    sys.exit(main())
  File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/fanxf7/miniconda3/envs/latentsync/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts.train_unet FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2025-06-23_16:45:50
  host      : domainagent-ai15
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2200512)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

I don't understand why I keep getting this error. Can anyone help?
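For context on what the error means: the failing line divides `config.run.max_train_steps` by `num_update_steps_per_epoch`, so the crash implies `num_update_steps_per_epoch` is 0. In training scripts of this shape that value is typically derived as `len(train_dataloader) // gradient_accumulation_steps` (an assumption here, since the traceback only shows the division), which becomes 0 when the dataloader yields no batches, i.e. the configured data paths resolved to zero usable training samples. A minimal sketch of the arithmetic, with a guard that turns the opaque ZeroDivisionError into an actionable message (`compute_num_train_epochs` is a hypothetical helper, not a function from the repo):

```python
import math


def compute_num_train_epochs(max_train_steps, dataloader_len, grad_accum_steps=1):
    # Mirrors the pattern behind train_unet.py line 235 (assumed, not verbatim):
    # updates per epoch = batches per epoch // gradient accumulation steps.
    num_update_steps_per_epoch = dataloader_len // grad_accum_steps

    # Guard: an empty dataloader (or grad_accum_steps > dataloader_len)
    # makes this 0 and would otherwise raise ZeroDivisionError below.
    if num_update_steps_per_epoch == 0:
        raise ValueError(
            "Train dataloader produced 0 update steps per epoch. "
            "Check that the data paths in the config point to valid samples."
        )

    return math.ceil(max_train_steps / num_update_steps_per_epoch)
```

If this diagnosis fits, the fix is usually in the config rather than the code: verify the dataset file list the config references actually exists and is non-empty before launching.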
