Propagating a tensor "v0" from First Layer to Subsequent Layers in Megatron with PP and TP #1522

shiroko98 · 2025-04-09T12:46:44Z

Hi all,

I’m working on a custom model in the Megatron framework and need help with propagating a tensor, v0, computed in the first layer to all subsequent layers. The model runs with both pipeline parallelism (PP) and tensor parallelism (TP), and I’m unsure how to handle this in a distributed setting.

Goal:
Compute v0 in the first layer (e.g., when layer_number == 1).
Make v0 available to all later layers for use in their computations, while preserving gradients for backpropagation.
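
To make the goal concrete, here is a toy scalar sketch in plain Python of the data flow I have in mind. It does not use any Megatron APIs. The layer formulas (`v0 = w * h` in the first layer, `h = h + w * v0` in later layers), the two-stage split, and the function names are all placeholders I made up for illustration, not my actual model. The point is only that the payload crossing a pipeline stage boundary would have to carry `v0` next to the hidden state in the forward pass, and the gradient of `v0` next to the gradient of the hidden state in the backward pass.

```python
# Toy scalar model in plain Python; every name here is hypothetical.

def stage_forward(weights, h, v0, is_first_stage):
    """Run one pipeline stage and return the (h, v0) pair sent to the next stage."""
    for i, w in enumerate(weights):
        if is_first_stage and i == 0:    # plays the role of layer_number == 1
            v0 = w * h                   # v0 is computed exactly once
            h = h + v0
        else:
            h = h + w * v0               # later layers only read v0
    return h, v0


def stage_backward(weights, x, v0, grad_h, grad_v0, is_first_stage):
    """Backpropagate through one stage.

    Takes (grad_h, grad_v0) from the next stage and returns the weight
    gradients plus the (grad_h, grad_v0) pair for the previous stage.
    """
    grad_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        w = weights[i]
        if is_first_stage and i == 0:
            grad_v0 += grad_h            # from h = h + v0
            grad_w[i] = grad_v0 * x      # from v0 = w * x
            grad_h = grad_h + grad_v0 * w  # gradient w.r.t. the stage input x
        else:
            grad_w[i] = grad_h * v0      # from h = h + w * v0
            grad_v0 += grad_h * w        # each later layer adds to grad of v0
    return grad_w, grad_h, grad_v0


x = 2.0
stage0_w = [0.5, 0.1]   # layers 1-2, imagined on PP rank 0
stage1_w = [0.2, 0.3]   # layers 3-4, imagined on PP rank 1

# Forward: the stage boundary carries (h, v0), not just h.
h, v0 = stage_forward(stage0_w, x, None, is_first_stage=True)
y, _ = stage_forward(stage1_w, h, v0, is_first_stage=False)

# Backward: the stage boundary carries (grad_h, grad_v0) in the other direction.
gw1, gh, gv0 = stage_backward(stage1_w, None, v0, 1.0, 0.0, is_first_stage=False)
gw0, gx, _ = stage_backward(stage0_w, x, v0, gh, gv0, is_first_stage=True)
```

In this toy, the gradient of the first layer's weight (`gw0[0]`) only matches the analytic value `x * (1 + w2 + w3 + w4)` because `gv0` from the second stage is passed back to the first stage. If the boundary sent only `gh`, the contributions of layers 3 and 4 to `v0` would be silently lost, which is the failure mode I am worried about in the real PP setup.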

Questions:
What’s the recommended approach to pass v0 across all later layers in Megatron’s pipeline parallelism?
How can I efficiently share v0 across PP and TP ranks?
Any tips for ensuring gradients are correctly handled in this setup?

The following are the corresponding formulas:

I’d appreciate any high-level guidance, examples, or references to relevant parts of the Megatron codebase. Thanks!

Replies: 0 comments
