Memory budget strategy for activation checkpointing #297

Merged — 9 commits merged into main on Jul 8, 2025

Conversation

tyler-romero (Contributor)

See https://pytorch.org/blog/activation-checkpointing-techniques/ for more details, but essentially this is an easy way to enable selective activation checkpointing without fiddling with a bunch of separate options while trying to stay fast within your GPU memory allowance.

![image](https://github.com/user-attachments/assets/5e17af03-aa43-489e-b30e-471ee3025c7e)

> We observe a 50% memory reduction by recomputing only pointwise ops, with a steady drop-off as you recompute more and more of your matmuls. Attention is the most expensive, so you tend to want to recompute those last.

@tyler-romero (Contributor, author)

Olmo2 on 4 B100s w/ ac budget = 0.5

```
system/GPU active mem (%)=41.62
system/GPU active mem (GiB)=74.24
system/GPU reserved mem (%)=45.29
system/GPU reserved mem (GiB)=80.79
throughput/device/BPS=0.0173
throughput/device/BPS (actual avg)=0.0173
throughput/device/TPS=18,168
throughput/device/TPS (actual avg)=18,122
```

@tyler-romero (Contributor, author)

Now with ac budget = 0.2

```
system/GPU active mem (%)=29.57
system/GPU active mem (GiB)=52.74
system/GPU reserved mem (%)=34.76
system/GPU reserved mem (GiB)=61.99
throughput/device/BPS=0.0162
throughput/device/BPS (actual avg)=0.0162
throughput/device/TPS=16,974
throughput/device/TPS (actual avg)=16,974
```
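A quick back-of-envelope comparison of the two runs above (all numbers copied from the logs) shows what tightening the budget from 0.5 to 0.2 buys and costs:

```python
# budget=0.5 -> 74.24 GiB active, 18,168 TPS/device
# budget=0.2 -> 52.74 GiB active, 16,974 TPS/device
mem_05, tps_05 = 74.24, 18168
mem_02, tps_02 = 52.74, 16974

extra_mem_saved = 1 - mem_02 / mem_05  # additional active-memory reduction
tps_cost = 1 - tps_02 / tps_05         # per-device throughput given up

print(f"budget 0.5 -> 0.2: {extra_mem_saved:.1%} less active memory, "
      f"{tps_cost:.1%} lower TPS")
```

So the tighter budget trades roughly a 6–7% throughput hit for about 29% less active memory, consistent with the blog's observation that the cheap recomputation wins come first.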

@tyler-romero tyler-romero marked this pull request as ready for review July 3, 2025 19:43

@epwalsh (Member) left a comment:


Nice!

@tyler-romero tyler-romero enabled auto-merge (squash) July 8, 2025 20:06

@tyler-romero tyler-romero merged commit 992a79e into main Jul 8, 2025
15 checks passed
@tyler-romero tyler-romero deleted the tyler/budget-ac branch July 8, 2025 22:05
TianhuaTao pushed a commit that referenced this pull request Jul 10, 2025
