Describe the bug
When saving checkpoints with wandb enabled, the last rank adds the checkpoint tracker file to its wandb artifact, but that file is written by the first rank (rank 0). If the shared file system is slow, the last rank may try to access the tracker file before rank 0's write is visible, and the last rank exits with an error.
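To make the sequence concrete, here is a minimal sketch of the two finalize steps involved. The function names are illustrative stand-ins, not the actual Megatron-LM code (which lives in megatron/training/checkpointing.py and megatron/training/wandb_utils.py):

```python
# Illustrative sketch of the two finalize steps (hypothetical names).
import wandb


def rank0_write_tracker(tracker_filename: str, iteration: int) -> None:
    # Rank 0 records the latest checkpointed iteration in the tracker file.
    with open(tracker_filename, "w") as f:
        f.write(str(iteration))


def last_rank_log_tracker(tracker_filename: str) -> None:
    # The last rank attaches the tracker file to a wandb artifact. On a slow
    # shared file system this can run before rank 0's write above is visible,
    # and Artifact.add_file() raises "Path is not a file: ...", as in the
    # stack trace below.
    artifact = wandb.Artifact("checkpoint-tracker", type="checkpoint")
    artifact.add_file(tracker_filename)
```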
To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.
- Enable wandb and checkpoint save
- Use a slow shared file system, or artificially add a sleep() delay in rank 0's iter_finalize_fn function (see the sketch after this list)
- Run training with multiple ranks
- Observe that the last rank fails when trying to access the tracker file
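The delay injection referenced above can be sketched as follows; this is only an illustration of the kind of change to make, since the real hook is the iter_finalize_fn defined inside save_checkpoint in megatron/training/checkpointing.py:

```python
# Sketch of an artificial delay to reproduce the race deterministically;
# in practice the sleep would be added inside rank 0's iter_finalize_fn.
import time

import torch.distributed as dist


def delayed_tracker_write(tracker_filename: str, iteration: int) -> None:
    if dist.get_rank() == 0:
        time.sleep(30)  # stand-in for a slow shared file system
        with open(tracker_filename, "w") as f:
            f.write(str(iteration))
```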
Expected behavior
The training should complete without errors. The last rank should wait for rank 0 to complete writing the tracker file before attempting to access it, or there should be proper synchronization to prevent race conditions.
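As an illustration of that synchronization (not the actual change in #1654), the last rank could poll for the tracker file before handing it to wandb, or all ranks could pass a barrier after rank 0 finishes writing:

```python
# Illustrative wait-for-file helper; the fix in #1654 may differ.
import os
import time


def wait_for_tracker_file(tracker_filename: str,
                          timeout_s: float = 300.0,
                          poll_interval_s: float = 1.0) -> None:
    """Block until rank 0's tracker file is visible on the shared FS."""
    deadline = time.monotonic() + timeout_s
    while not os.path.isfile(tracker_filename):
        if time.monotonic() > deadline:
            raise TimeoutError(f"Tracker file never appeared: {tracker_filename}")
        time.sleep(poll_interval_s)


# The last rank would call this just before artifact.add_file(tracker_filename).
# Alternatively, a torch.distributed.barrier() placed after rank 0 writes and
# flushes the file also removes the race, at the cost of an extra collective
# during checkpoint save.
```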
Stack trace/logs
[rank15]:     save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
[rank15]:   File "Megatron-LM/megatron/training/checkpointing.py", line 576, in save_checkpoint
[rank15]:     wandb_finalize_fn()
[rank15]:   File "Megatron-LM/megatron/training/checkpointing.py", line 571, in wandb_finalize_fn
[rank15]:     wandb_utils.on_save_checkpoint_success(checkpoint_name, get_checkpoint_tracker_filename(save_dir), save_dir, iteration)
[rank15]:   File "Megatron-LM/megatron/training/wandb_utils.py", line 36, in on_save_checkpoint_success
[rank15]:     artifact.add_file(tracker_filename)
[rank15]:   File "/usr/local/lib/python3.12/dist-packages/wandb/sdk/artifacts/_validators.py", line 255, in wrapper
[rank15]:     return method(self, *args, **kwargs)
[rank15]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   File "/usr/local/lib/python3.12/dist-packages/wandb/sdk/artifacts/artifact.py", line 1422, in add_file
[rank15]:     raise ValueError(f"Path is not a file: {local_path!r}")
[rank15]: ValueError: Path is not a file: '/xxx/latest_checkpointed_iteration.txt'
Environment (please complete the following information):
- Megatron-LM 878d65f
- PyTorch 2.7.0a0+79aa17489c.nv25.4
- CUDA 12.9
- NCCL 2.26.3
Proposed fix
A fix is proposed in #1654; thanks for reviewing it.