
gpt-oss implementation #1739

@sbhavani

Description


This issue outlines the current status of the gpt-oss features that need to be implemented in Megatron Core, leveraging Transformer Engine (TE).

Core functionality has been implemented and validated through convergence testing in the chcui/gpt_oss branch of Megatron-LM. Future efforts will focus on performance optimization and integration into Megatron Core.

Note: MoE (Mixture of Experts) features are already fully supported - see Megatron Core MoE Roadmap for comprehensive MoE feature support.

MoE Layer

Enabled Bias
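
gpt-oss enables bias terms on the expert MLP projections, in contrast to the bias-free configuration common in other MoE models. A minimal sketch of what "enabled bias" means at the expert level, assuming plain `nn.Linear` experts (module and parameter names below are illustrative, not the Megatron Core API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedExpertMLP(nn.Module):
    """Illustrative MoE expert with bias enabled on both projections.
    Hypothetical module, not the Megatron Core implementation."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        # bias=True is the relevant difference from the usual bias-free MoE expert
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=True)
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.up_proj(x)))
```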

Attention Mechanisms

Alternating Sliding-Window Attention Pattern

  • Status: Supported - infrastructure for per-layer attention patterns and sliding-window attention already exists via TE (see the sketch below)
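
A minimal sketch of how an alternating per-layer pattern can be expressed: even-indexed layers attend within a local causal window while odd-indexed layers attend globally. The alternation rule and the mask construction are illustrative assumptions, not the TE API:

```python
from typing import Optional
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask (True = masked out): causal attention restricted to the
    previous `window` positions."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions
    k = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (k > q) | (q - k >= window)

def layer_window_size(layer_idx: int, window: int) -> Optional[int]:
    """Alternating pattern: even-indexed layers use sliding-window attention,
    odd-indexed layers use full causal attention (None = no window)."""
    return window if layer_idx % 2 == 0 else None
```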

Attention Sinks

  • Status: Work in Progress - support is being added in Transformer Engine and cuDNN (see the conceptual sketch below)
  • Reference: Streaming LLM
  • Related Transformer Engine PR: TBD
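
Following the cited StreamingLLM formulation, attention sinks keep a handful of initial tokens attendable by every query in addition to the local window, which stabilizes sliding-window/streaming inference. A conceptual mask-level sketch, assuming a fixed number of sink tokens (the TE/cuDNN implementation tracked above will expose this differently):

```python
import torch

def sink_window_mask(seq_len: int, window: int, num_sink: int = 4) -> torch.Tensor:
    """Boolean mask (True = masked out): every query may attend to the first
    `num_sink` "sink" tokens plus its local causal window."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions
    k = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = k > q
    outside_window = (q - k) >= window
    not_sink = k >= num_sink
    return causal | (outside_window & not_sink)
```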

Activation Functions

Custom SwiGLU with Clamping
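
A minimal sketch of a clamped SwiGLU variant: the gate and linear halves are clamped before the gated product to bound activation magnitudes. The clamp limit, sigmoid scaling constant, and the +1 offset on the linear branch are illustrative assumptions, not necessarily the exact gpt-oss values:

```python
import torch

def clamped_swiglu(x: torch.Tensor, alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    """SwiGLU with clamping (sketch). `x` holds interleaved gate/linear channels;
    `alpha` and `limit` are illustrative defaults, treated here as assumptions."""
    x_gate, x_linear = x[..., ::2], x[..., 1::2]
    x_gate = x_gate.clamp(max=limit)                   # clamp the gate branch from above
    x_linear = x_linear.clamp(min=-limit, max=limit)   # clamp the linear branch symmetrically
    # sigmoid-gated (SiLU-style) product; the +1 offset is an assumed detail
    return (x_gate * torch.sigmoid(alpha * x_gate)) * (x_linear + 1)
```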

Positional Encodings

YaRN RoPE Scaling

  • Megatron Core Implementation
    • YaRN scaling to 128k context
    • Integration with existing RoPE
    • YaRN for general RoPE/GPT models
    • Convergence validation
    • Performance optimization for extended sequences
  • Megatron-LM Branch: https://github.com/NVIDIA/Megatron-LM/tree/chcui/gpt_oss
  • Reference: arXiv:2309.00071
  • Status: Work in Progress - YaRN is implemented for MLA only in Megatron Core; general RoPE/GPT support is available in the POC branch (see the sketch after this list)
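
The core of YaRN is "NTK-by-parts" interpolation of the RoPE inverse frequencies: high-frequency dimensions are kept as-is, low-frequency dimensions are interpolated by the context-extension factor, and a linear ramp blends the two. A minimal sketch following arXiv:2309.00071; the defaults (a 4k original context extended 32x toward 128k, beta_fast/beta_slow) are illustrative assumptions, not the gpt-oss configuration:

```python
import math
import torch

def yarn_inv_freq(dim: int, base: float = 10000.0, scale: float = 32.0,
                  orig_context: int = 4096,
                  beta_fast: float = 32.0, beta_slow: float = 1.0) -> torch.Tensor:
    """NTK-by-parts YaRN interpolation of RoPE inverse frequencies (sketch)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

    # Index (into the dim//2 frequency pairs) at which a pair completes
    # `num_rot` rotations over the original context window.
    def correction_dim(num_rot: float) -> float:
        return dim * math.log(orig_context / (num_rot * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)

    # Linear ramp: 0 = keep the original frequency (high-frequency dims),
    # 1 = fully interpolated by `scale` (low-frequency dims).
    ramp = ((torch.arange(dim // 2).float() - low) / max(high - low, 1)).clamp(0.0, 1.0)

    return inv_freq * (1.0 - ramp) + (inv_freq / scale) * ramp
```

YaRN also rescales attention logits by roughly 0.1 * ln(scale) + 1 (the "mscale" temperature from the paper); that part is omitted from the sketch above.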

Credits: @cuichenx
