baseline

vram usage

[screenshot]

training log
[2025-03-08 03:39:44] iteration 93/ 101 | consumed samples: 93 | elapsed time per iteration (ms): 283.3 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.598389E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:45] iteration 94/ 101 | consumed samples: 94 | elapsed time per iteration (ms): 285.1 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.579762E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:45] iteration 95/ 101 | consumed samples: 95 | elapsed time per iteration (ms): 287.5 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.523883E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:45] iteration 96/ 101 | consumed samples: 96 | elapsed time per iteration (ms): 282.4 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.402811E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:46] iteration 97/ 101 | consumed samples: 97 | elapsed time per iteration (ms): 286.0 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.356245E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:46] iteration 98/ 101 | consumed samples: 98 | elapsed time per iteration (ms): 287.5 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.263113E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:46] iteration 99/ 101 | consumed samples: 99 | elapsed time per iteration (ms): 290.5 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.263113E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:46] iteration 100/ 101 | consumed samples: 100 | elapsed time per iteration (ms): 286.2 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.197920E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:39:47] iteration 101/ 101 | consumed samples: 101 | elapsed time per iteration (ms): 285.7 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 2.151354E-07 | loss scale: 1.0 | grad norm: 0.000 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
recompute
after adding:
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 24 \
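For context, these flags turn on full activation recomputation: with the uniform method, the transformer stack is divided into chunks of `--recompute-num-layers` layers, only each chunk's input activation is saved, and everything inside the chunk is recomputed during the backward pass. A rough sketch of the mechanism in plain PyTorch (an illustration of the idea, not Megatron-LM's actual implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

def uniform_full_recompute(layers, hidden_states, chunk_size):
    """Run `layers` with full, uniform activation recomputation.

    Only the input to each chunk of `chunk_size` layers is kept;
    activations inside a chunk are recomputed during backward,
    trading an extra forward pass for activation memory.
    """
    def make_chunk(start, end):
        def run(x):
            for layer in layers[start:end]:
                x = layer(x)
            return x
        return run

    for start in range(0, len(layers), chunk_size):
        end = min(start + chunk_size, len(layers))
        hidden_states = checkpoint(make_chunk(start, end),
                                   hidden_states, use_reentrant=False)
    return hidden_states

# e.g. with --recompute-num-layers 24 on a 24-layer model, the whole
# stack is a single chunk and only the stack input is saved:
# out = uniform_full_recompute(model.layers, x, chunk_size=24)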
training log
[2025-03-08 03:37:35] iteration 93/ 101 | consumed samples: 93 | elapsed time per iteration (ms): 372.5 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.931662E+00 | loss scale: 1.0 | grad norm: 106.125 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:36] iteration 94/ 101 | consumed samples: 94 | elapsed time per iteration (ms): 378.3 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.905890E+00 | loss scale: 1.0 | grad norm: 105.842 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:36] iteration 95/ 101 | consumed samples: 95 | elapsed time per iteration (ms): 373.4 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.883155E+00 | loss scale: 1.0 | grad norm: 105.635 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:36] iteration 96/ 101 | consumed samples: 96 | elapsed time per iteration (ms): 379.1 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.860850E+00 | loss scale: 1.0 | grad norm: 105.362 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:37] iteration 97/ 101 | consumed samples: 97 | elapsed time per iteration (ms): 383.2 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.842552E+00 | loss scale: 1.0 | grad norm: 105.141 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:37] iteration 98/ 101 | consumed samples: 98 | elapsed time per iteration (ms): 376.8 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.819790E+00 | loss scale: 1.0 | grad norm: 104.850 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:37] iteration 99/ 101 | consumed samples: 99 | elapsed time per iteration (ms): 377.7 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.797119E+00 | loss scale: 1.0 | grad norm: 104.588 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:38] iteration 100/ 101 | consumed samples: 100 | elapsed time per iteration (ms): 375.4 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.776598E+00 | loss scale: 1.0 | grad norm: 104.269 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-03-08 03:37:38] iteration 101/ 101 | consumed samples: 101 | elapsed time per iteration (ms): 367.8 | learning rate: 1.000000E-06 | global batch size: 1 | lm loss: 1.756780E+00 | loss scale: 1.0 | grad norm: 104.023 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
vram usage

[screenshot]
I don't understand why the VRAM usage hasn't changed at all. Judging from the training speed, though, full recompute does appear to be enabled: the slowdown from ~286 ms to ~375 ms per iteration (~30%) is roughly what an extra forward pass would cost.
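One thing worth ruling out: nvidia-smi reports the memory reserved by PyTorch's caching allocator, which can stay flat even when the memory actually allocated for activations drops. A minimal diagnostic sketch (my own check, not from the original run) to compare the two around a training step:

```python
import torch

def report_memory(tag):
    # allocated: tensors currently live; reserved: what the caching
    # allocator holds from CUDA (this is what nvidia-smi shows).
    alloc = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated {alloc:.2f} GiB | "
          f"peak allocated {peak:.2f} GiB | reserved {reserved:.2f} GiB")

# usage around one training iteration:
# torch.cuda.reset_peak_memory_stats()
# loss = model(batch); loss.backward(); optimizer.step()
# report_memory("after step")  # compare peak with/without recompute
```

If the peak allocated memory drops with recompute enabled but the reserved memory stays the same, the difference would be invisible in an nvidia-smi screenshot.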