Describe the bug
When saving checkpoints with wandb enabled, the last rank adds the checkpoint tracker file to its wandb artifact, but that file is written by the first rank (rank 0). If the shared file system is slow, the last rank may try to access the tracker file before rank 0's write is visible, and the last rank exits with an error.
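To make the sequence concrete, here is a minimal sketch of the two finalize steps involved. The function names are illustrative stand-ins, not the actual Megatron-LM code (which lives in megatron/training/checkpointing.py and megatron/training/wandb_utils.py):

```python
# Illustrative sketch of the two finalize steps (hypothetical names).
import wandb


def rank0_write_tracker(tracker_filename: str, iteration: int) -> None:
    # Rank 0 records the latest checkpointed iteration in the tracker file.
    with open(tracker_filename, "w") as f:
        f.write(str(iteration))


def last_rank_log_tracker(tracker_filename: str) -> None:
    # The last rank attaches the tracker file to a wandb artifact. On a slow
    # shared file system this can run before rank 0's write above is visible,
    # and Artifact.add_file() raises "Path is not a file: ...", as in the
    # stack trace below.
    artifact = wandb.Artifact("checkpoint-tracker", type="checkpoint")
    artifact.add_file(tracker_filename)
```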
To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.
- Enable wandb and checkpoint save
- Use a slow shared file system, or artificially add a sleep() delay in rank 0's iter_finalize_fn function (see the sketch after this list)
- Run training with multiple ranks
- Observe that the last rank fails when trying to access the tracker file
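The delay injection referenced above can be sketched as follows; this is only an illustration of the kind of change to make, since the real hook is the iter_finalize_fn defined inside save_checkpoint in megatron/training/checkpointing.py:

```python
# Sketch of an artificial delay to reproduce the race deterministically;
# in practice the sleep would be added inside rank 0's iter_finalize_fn.
import time

import torch.distributed as dist


def delayed_tracker_write(tracker_filename: str, iteration: int) -> None:
    if dist.get_rank() == 0:
        time.sleep(30)  # stand-in for a slow shared file system
        with open(tracker_filename, "w") as f:
            f.write(str(iteration))
```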
Expected behavior
The training should complete without errors. The last rank should wait for rank 0 to complete writing the tracker file before attempting to access it, or there should be proper synchronization to prevent race conditions.
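As an illustration of that synchronization (not the actual change in #1654), the last rank could poll for the tracker file before handing it to wandb, or all ranks could pass a barrier after rank 0 finishes writing:

```python
# Illustrative wait-for-file helper; the fix in #1654 may differ.
import os
import time


def wait_for_tracker_file(tracker_filename: str,
                          timeout_s: float = 300.0,
                          poll_interval_s: float = 1.0) -> None:
    """Block until rank 0's tracker file is visible on the shared FS."""
    deadline = time.monotonic() + timeout_s
    while not os.path.isfile(tracker_filename):
        if time.monotonic() > deadline:
            raise TimeoutError(f"Tracker file never appeared: {tracker_filename}")
        time.sleep(poll_interval_s)


# The last rank would call this just before artifact.add_file(tracker_filename).
# Alternatively, a torch.distributed.barrier() placed after rank 0 writes and
# flushes the file also removes the race, at the cost of an extra collective
# during checkpoint save.
```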
Stack trace/logs
[rank15]:     save_checkpoint(iteration, model, optimizer, opt_param_scheduler,
[rank15]:   File "Megatron-LM/megatron/training/checkpointing.py", line 576, in save_checkpoint
[rank15]:     wandb_finalize_fn()
[rank15]:   File "Megatron-LM/megatron/training/checkpointing.py", line 571, in wandb_finalize_fn
[rank15]:     wandb_utils.on_save_checkpoint_success(checkpoint_name, get_checkpoint_tracker_filename(save_dir), save_dir, iteration)
[rank15]:   File "Megatron-LM/megatron/training/wandb_utils.py", line 36, in on_save_checkpoint_success
[rank15]:     artifact.add_file(tracker_filename)
[rank15]:   File "/usr/local/lib/python3.12/dist-packages/wandb/sdk/artifacts/_validators.py", line 255, in wrapper
[rank15]:     return method(self, *args, **kwargs)
[rank15]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   File "/usr/local/lib/python3.12/dist-packages/wandb/sdk/artifacts/artifact.py", line 1422, in add_file
[rank15]:     raise ValueError(f"Path is not a file: {local_path!r}")
[rank15]: ValueError: Path is not a file: '/xxx/latest_checkpointed_iteration.txt'
Environment (please complete the following information):
- Megatron-LM 878d65f
- PyTorch 2.7.0a0+79aa17489c.nv25.4
- CUDA 12.9
- NCCL 2.26.3
Proposed fix
A fix is proposed in #1654; thanks for reviewing it.