Actions: NVIDIA/Megatron-LM

Community Bot

This workflow was disabled manually.
103 workflow runs

[QUESTION]NCCL timeout error when the second iteration
Community Bot #103: Issue #1141 reopened by sbhavani
July 31, 2025 06:02 11s
[BUG] Learning rate not overrided when set --override-opt_param-scheduler
Community Bot #101: Issue #1138 reopened by sbhavani
July 31, 2025 04:35 12s
[QUESTION] Epochs Larger Than 1 When Specified with Trained Samples
Community Bot #100: Issue #1127 reopened by sbhavani
July 31, 2025 04:35 13s
Benchmarking DeepEP Guide
Community Bot #99: Issue #1721 opened by sbhavani
July 31, 2025 03:59 11s
How to split the num_layers unevenly when using pipeline parallelism?
Community Bot #97: Issue comment #381 (comment) created by edgeinfinity1
July 30, 2025 14:54 16s
[QUESTION] save_checkpoint with expert_tensor_parallel_size
Community Bot #96: Issue #1719 edited by jeromeku
July 30, 2025 14:15 10s
[QUESTION] save_checkpoint with expert_tensor_parallel_size
Community Bot #95: Issue #1719 opened by jeromeku
July 30, 2025 14:14 12s
v0.12.2 pretrain failed, 0.14.0rc2 model conversion failed
Community Bot #94: Issue #1718 edited by justalittlenoob
July 30, 2025 09:32 12s
v0.12.2 pretrain failed, 0.14.0rc2 model conversion failed
Community Bot #93: Issue #1718 opened by justalittlenoob
July 30, 2025 08:10 10s
[QUESTION]How to calculate MFU based on the flops?
Community Bot #92: Issue #1565 closed by Lynnzake
July 30, 2025 06:44 13s
July 30, 2025 06:22 11s
July 30, 2025 02:26 9s
July 30, 2025 02:26 15s
Issue 1672 fix: initializing the current pointed with int64 to avoid …
Community Bot #88: Issue comment #1673 (comment) created by sharanmayank
July 29, 2025 23:33 11s
[BUG] Bug between asynchronous parameter saving (async-save) and resuming training from a checkpoint
Community Bot #85: Issue comment #1713 (comment) created by ananthsub
July 29, 2025 19:01 14s
[QUESTION]Missing code branch
Community Bot #84: Issue #1395 reopened by sbhavani
July 29, 2025 15:59 14s
[BUG] can't load saved fp8 checkpoint when resume training
Community Bot #83: Issue #1350 reopened by sbhavani
July 29, 2025 15:59 11s
[QUESTION] Resume training about dataset
Community Bot #82: Issue #1343 reopened by sbhavani
July 29, 2025 15:58 10s