Description
This issue outlines the current status of the gpt-oss features that need to be implemented in Megatron Core, leveraging Transformer Engine (TE).
Core functionality has been implemented and validated through convergence testing in the chcui/gpt_oss branch of Megatron-LM. Future efforts will focus on performance optimization and integration into Megatron Core.
Note: MoE (Mixture of Experts) features are already fully supported - see Megatron Core MoE Roadmap for comprehensive MoE feature support.
MoE Layer
Enabled Bias
- Status: Work in Progress
- Megatron-LM branch: https://github.com/NVIDIA/Megatron-LM/tree/chcui/gpt_oss
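
For context, "enabled bias" here presumably means bias terms on the expert MLP linear projections (and possibly the router). A minimal PyTorch sketch of a gated expert MLP with bias enabled; the module and argument names are illustrative assumptions, not Megatron Core's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertMLPWithBias(nn.Module):
    """Hypothetical gated expert MLP with bias enabled on both projections.

    Sketch only: Megatron Core's actual expert modules, parallelism, and
    config flags (e.g. whether this maps onto add_bias_linear) may differ.
    """

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        # fused gate + up projection, with bias
        self.fc1 = nn.Linear(hidden_size, 2 * ffn_hidden_size, bias=True)
        # down projection, with bias
        self.fc2 = nn.Linear(ffn_hidden_size, hidden_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(F.silu(gate) * up)
```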
Attention Mechanisms
Alternating Sliding-Window Attention Pattern
- Status: ✅ Supported - infrastructure already exists for per-layer attention patterns and sliding-window attention using TE
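
To illustrate the pattern, gpt-oss alternates sliding-window and full-attention layers. A minimal sketch of building such a per-layer pattern; the helper name, the every-other-layer ordering, and the 128-token default are assumptions for illustration, not the Megatron Core configuration interface:

```python
from typing import List, Optional

def build_window_pattern(num_layers: int, window_size: int = 128) -> List[Optional[int]]:
    """Alternate sliding-window and full (global) attention across layers.

    `None` marks a full-attention layer. Purely illustrative; the actual
    gpt-oss layer ordering and Megatron Core's per-layer config may differ.
    """
    return [window_size if layer % 2 == 0 else None for layer in range(num_layers)]

# build_window_pattern(4) -> [128, None, 128, None]
```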
Attention Sinks
- Status: Work in Progress - in Transformer Engine and cuDNN
- Reference: Streaming LLM
- Related Transformer Engine PR: TBD
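
For reference, the Streaming LLM idea is to reserve a "sink" that absorbs attention probability mass, so the scores over real tokens need not sum to one; gpt-oss realizes this as a learned per-head sink logit. A minimal, unfused sketch of that idea; the function name, shapes, and the way the sink column is appended are assumptions, and the fused TE/cuDNN kernel will differ:

```python
import torch

def attention_with_sink(q, k, v, sink_logit):
    """Softmax attention with a learnable per-head "sink" logit.

    q, k, v: [batch, heads, seq, dim]; sink_logit: [heads].
    The virtual sink position absorbs probability mass and its output
    contribution is discarded.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale          # [b, h, s, s]
    sink = sink_logit.view(1, -1, 1, 1).expand(scores.shape[:-1] + (1,))
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    probs = probs[..., :-1]                                         # drop the sink column
    return torch.matmul(probs, v)
```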
Activation Functions
Custom SwiGLU with Clamping
- Status: Work in Progress
- Megatron Core is adding a partially fused version as a “custom quick GeGLU”
- Megatron-LM branch: https://github.com/NVIDIA/Megatron-LM/tree/chcui/gpt_oss
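
To make the activation concrete: public gpt-oss descriptions use a SwiGLU variant in which the gate is a quick-GELU-style sigmoid(alpha * x) and both branches are clamped before combining. A minimal sketch follows; the alpha = 1.702 and limit = 7.0 defaults and the +1 shift on the linear branch are taken from public gpt-oss reference code and are assumptions here, and the fused Megatron Core / TE kernel may differ:

```python
import torch

def clamped_swiglu(x_glu: torch.Tensor, x_linear: torch.Tensor,
                   alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    """Illustrative clamped SwiGLU variant ("custom quick GeGLU").

    The gate uses sigmoid(alpha * x) (quick-GELU style) rather than exact
    SiLU/GELU; both branches are clamped to keep activations bounded.
    """
    x_glu = x_glu.clamp(max=limit)                     # clamp gate branch from above
    x_linear = x_linear.clamp(min=-limit, max=limit)   # clamp linear branch on both sides
    return x_glu * torch.sigmoid(alpha * x_glu) * (x_linear + 1.0)
```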
Positional Encodings
YaRN RoPE Scaling
- Megatron Core Implementation:
  - YaRN scaling to 128k context
  - Integration with existing RoPE
  - YaRN for general RoPE/GPT models
  - Convergence validation
  - Performance optimization for extended sequences
- Megatron-LM branch: https://github.com/NVIDIA/Megatron-LM/tree/chcui/gpt_oss
- Reference: arXiv:2309.00071
- Status: Work in Progress - YaRN is implemented for MLA only in Megatron Core; general RoPE/GPT support is available in the POC branch.
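
For reference, YaRN ("NTK-by-parts" interpolation, arXiv:2309.00071) keeps the high-frequency RoPE dimensions unchanged, divides the low-frequency dimensions by the context scale factor, and linearly ramps between the two regimes; it also applies an attention temperature (mscale), not shown here. A minimal sketch of the frequency blending, with parameter names and defaults that are illustrative assumptions rather than Megatron Core's API:

```python
import math
import torch

def yarn_scaled_inv_freq(dim: int, base: float = 10000.0, scale: float = 32.0,
                         orig_ctx: int = 4096, beta_fast: float = 32.0,
                         beta_slow: float = 1.0) -> torch.Tensor:
    """Blend original and interpolated RoPE inverse frequencies (YaRN style).

    Dims below the `beta_fast` cutoff keep their original frequency
    (extrapolation); dims above the `beta_slow` cutoff are divided by
    `scale` (interpolation); a linear ramp covers the range in between.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

    def correction_dim(num_rotations: float) -> float:
        # dimension index whose wavelength completes `num_rotations` over orig_ctx
        return dim * math.log(orig_ctx / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)
    ramp = torch.clamp((torch.arange(dim // 2).float() - low) / max(high - low, 1), 0, 1)

    # ramp = 0 -> keep original frequency; ramp = 1 -> fully interpolated
    return inv_freq * (1 - ramp) + (inv_freq / scale) * ramp
```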
Credits: @cuichenx