GPU Memory changes by TP/PP and recompute-activations when the GPU-NUM is stable #1486
Unanswered · Listen-WLS asked this question in Q&A · Replies: 0 comments
As mentioned in the title, when the number of GPUs stays the same, I change the TP/PP sizes and observe the GPU memory usage below. After reading the paper https://arxiv.org/pdf/2205.05198.pdf, I understand the GPU memory footprint to consist mainly of the following parts:
(1) Parameter memory
(2) Optimizer-state memory
(3) Gradient memory
(4) Activation memory
I used 8 GPUs to train the Llama3-8B model. With batch size = 1, the per-GPU memory usage can be expressed as the sum of these four parts.
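Below is a minimal sketch of the accounting I have in mind, assuming mixed-precision Adam (16 bytes per parameter), no distributed optimizer, and the per-layer activation formula with tensor parallelism from the paper above; the Llama3-8B shape numbers and all function/parameter names are placeholders of my own, not values read from Megatron.

```python
# Rough per-GPU memory estimate; all model-shape values below are assumptions.
def estimate_gpu_memory_gib(
    n_params=8.0e9,     # total parameters (Llama3-8B, assumed)
    n_layers=32,        # transformer layers L (assumed)
    hidden=4096,        # hidden size h (assumed)
    heads=32,           # attention heads a (assumed)
    seq_len=8192,       # sequence length s (assumed)
    micro_batch=1,      # micro-batch size b
    tp=1,               # tensor-parallel size t
    pp=1,               # pipeline-parallel size p
):
    gib = 1024 ** 3
    params_per_gpu = n_params / (tp * pp)

    # (1)-(3): fp16 params (2 B) + fp16 grads (2 B) + fp32 master params,
    # momentum and variance (4+4+4 B) = 16 B per parameter (assumed setup).
    static = params_per_gpu * (2 + 2 + 12)

    # (4): activations per transformer layer with tensor parallelism and no
    # recomputation, eq. (2) of the paper: s*b*h*(10 + 24/t + 5*a*s/(h*t)) bytes.
    act_per_layer = seq_len * micro_batch * hidden * (
        10 + 24 / tp + 5 * heads * seq_len / (hidden * tp)
    )
    layers_per_stage = n_layers / pp
    # With the 1F1B schedule the first stage keeps up to p micro-batches in
    # flight, so its activation memory is roughly (L/p) * p layers' worth.
    activations = act_per_layer * layers_per_stage * pp

    return {
        "static (params+grads+optimizer) GiB": static / gib,
        "activations GiB": activations / gib,
        "total GiB": (static + activations) / gib,
    }

if __name__ == "__main__":
    for tp, pp in [(8, 1), (4, 2), (2, 4), (1, 8)]:
        print(f"TP={tp} PP={pp}", estimate_gpu_memory_gib(tp=tp, pp=pp))
```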
I have run many sets of experiments to check this memory formula against the actual GPU memory usage, but the real usage gradually increases as PP grows. I would like to know which of the four memory components above is affected by increasing PP.
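For reference, under the simplifying assumptions in the sketch above (uniform partitioning of all parameters, 16 bytes per parameter, no distributed optimizer), the static terms work out the same for any TP×PP split of the 8 GPUs:

$$
M_{\text{static}} \approx \frac{16\,P}{t \cdot p} = \frac{16 \times 8\times 10^{9}\ \text{B}}{8} \approx 14.9\ \text{GiB}
\quad \text{for } (t,p) \in \{(8,1),(4,2),(2,4),(1,8)\},
$$

so, if that simplification held exactly, the static parts would not change with the split; I am unsure whether the growth I see comes from the activation term or from effects the simple formula ignores.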