Add Support for Packed Sequence Format in GPT Training #1696

sbhavani · 2025-07-17T20:02:04Z

Overview

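Adds support for the packed sequence format ('thd') in GPT training when using Transformer Engine's DotProductAttention. In the 'thd' layout, variable-length sequences are concatenated along a single token dimension (t) rather than padded to a common length.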

- Added --gpt-use-thd-qkv-format flag to enable the packed sequence format
- Added utility function get_cu_seqlens() to compute cumulative sequence lengths for the packed format
- Modified forward_step to pass packed sequence parameters when the flag is enabled
- Optimized attention mask generation in the dataloader
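
For readers unfamiliar with cumulative sequence lengths, here is a minimal sketch of the idea in plain Python. It is not the get_cu_seqlens() added in this PR, whose signature and inputs are not shown here. The names cu_seqlens_from_lengths and split_packed are hypothetical, and the sketch assumes the per-sequence lengths are already known. Real training code would typically hold these offsets in an integer tensor on the device rather than in a Python list.

```python
from itertools import accumulate
from typing import List, Sequence


def cu_seqlens_from_lengths(seq_lengths: Sequence[int]) -> List[int]:
    """Hypothetical helper: return offsets [0, l0, l0+l1, ...].

    Entry i is the index where sequence i starts in the packed buffer,
    and the last entry is the total number of tokens (the 't' in 'thd').
    """
    if any(length < 0 for length in seq_lengths):
        raise ValueError("sequence lengths must be non-negative")
    return [0] + list(accumulate(seq_lengths))


def split_packed(tokens: Sequence[int], cu_seqlens: Sequence[int]) -> List[List[int]]:
    """Hypothetical helper: recover individual sequences from a packed buffer."""
    return [
        list(tokens[start:end])
        for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])
    ]


# Three sequences of lengths 3, 5 and 2 packed into one buffer of 10 tokens.
lengths = [3, 5, 2]
cu_seqlens = cu_seqlens_from_lengths(lengths)
packed = list(range(sum(lengths)))
parts = split_packed(packed, cu_seqlens)
```

The offsets list has one more entry than there are sequences, so consecutive pairs of entries delimit each sequence without any padding tokens.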

John Kamalu and others added 3 commits July 23, 2024 10:16

draft first commit

6f390da

Fix dimension cardinality

4278292

Small fixes to attention mask code

1106acd

sbhavani added the module: training label Jul 22, 2025