
[QUESTION] Is there a way for Megatron to recompute the whole transformer layer except for the flash-attn part? #1732

@xUhEngwAng

Description


Hi, as stated in the title, I'm wondering whether Megatron provides native functionality like `context_fn` in `torch.utils.checkpoint.checkpoint`, so that the flash-attn computation can be excluded from the recomputation of a transformer layer.
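For reference, this is roughly the behaviour I have in mind, written as a minimal sketch in plain PyTorch rather than Megatron. It assumes the selective-checkpointing helpers `create_selective_checkpoint_contexts` and `CheckpointPolicy` from recent PyTorch releases, and it uses the aten SDPA flash-attention op as a stand-in; the corresponding flash-attn custom op would be listed there instead (its registered name depends on the flash-attn version):

```python
import torch
from torch.utils.checkpoint import (
    checkpoint,
    create_selective_checkpoint_contexts,
    CheckpointPolicy,
)

# Ops whose outputs should be saved in the forward pass instead of being
# recomputed in the backward pass. PyTorch's SDPA flash kernel is used here
# as a stand-in for the flash-attn library's custom op.
_SAVE_OPS = {
    torch.ops.aten._scaled_dot_product_flash_attention.default,
}

def _policy(ctx, op, *args, **kwargs):
    # Save flash-attention outputs, recompute everything else.
    if op in _SAVE_OPS:
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

def checkpointed_layer(layer, *inputs):
    # context_fn must return a fresh (forward, recompute) context pair
    # on every call; it is only supported with use_reentrant=False.
    return checkpoint(
        layer,
        *inputs,
        use_reentrant=False,
        context_fn=lambda: create_selective_checkpoint_contexts(_policy),
    )
```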

For now, I manually modified `tensor_parallel.checkpoint` to accept such an argument. However, issues remain when I try to capture the activations saved inside flash-attn and offload them to the CPU.
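For the offloading part, what I have in mind conceptually looks like PyTorch's built-in `torch.autograd.graph.save_on_cpu` hook (a sketch, not my actual Megatron patch); the difficulty is making something like this compose with Megatron's `tensor_parallel.checkpoint` so that only the flash-attn activations take this path:

```python
import torch
from torch.autograd.graph import save_on_cpu

def forward_with_cpu_offload(layer, *inputs):
    # Tensors saved for backward inside this context are copied to pinned
    # CPU memory and moved back to the GPU on demand during backward.
    with save_on_cpu(pin_memory=True):
        return layer(*inputs)
```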
