BugFix: FP8 Communication Mismatch with --first-last-layers-bf16 in tp-comm-overlap #1703
Problem Description

When FP8 tensorwise quantization is used together with --first-last-layers-bf16, enabling tp-comm-overlap causes training failures due to a data-format mismatch in the tensor-parallel communication operations.
Issue Details:
- Environment: TE 2.4/2.3, Megatron-LM main
- Configuration: FP8 tensorwise + --first-last-layers-bf16 + tp-comm-overlap enabled
- Root Cause: communication operators still use the FP8 format even though the first/last layers are configured to run in BF16
- Symptom: training crashes with data-format mismatch errors during inter-device communication
- Workaround: disabling TP comm overlap allows training to proceed normally
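For reference, a launch configuration along these lines triggers the failure. The --first-last-layers-bf16 and --tp-comm-overlap flag names come from the report; the exact FP8 recipe flags (--fp8-format, --fp8-recipe) are an assumption and may differ across Megatron-LM versions:

```shell
# FP8 tensorwise recipe with BF16 first/last layers and TP comm overlap
# (flag names for the FP8 recipe are illustrative; check your Megatron-LM version)
--fp8-format hybrid \
--fp8-recipe tensorwise \
--first-last-layers-bf16 \
--tp-comm-overlap
```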
Root Cause Analysis
The communication logic does not detect when the first/last layers are running in BF16, so:
- the first/last layers produce BF16 tensors,
- the communication operators still expect FP8 input,
- and the format mismatch raises runtime errors during tensor-parallel communication.
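The mismatch above can be modeled with a small sketch. This is not Megatron-LM code; the function names are hypothetical and it only illustrates why overlap buffers sized once for FP8 stop matching the tensors the boundary layers actually emit:

```python
# Toy model of the mismatch (hypothetical names, not Megatron-LM code).

def layer_output_dtype(layer_idx, num_layers, first_last_bf16):
    """Dtype each layer actually produces under the mixed recipe."""
    if first_last_bf16 and layer_idx in (0, num_layers - 1):
        return "bf16"
    return "fp8"

def comm_buffer_dtype():
    """Overlap comm buffers are set up once, assuming FP8 everywhere."""
    return "fp8"

def find_mismatched_layers(num_layers=4, first_last_bf16=True):
    """Layers whose output dtype disagrees with the comm buffer dtype."""
    return [
        i for i in range(num_layers)
        if layer_output_dtype(i, num_layers, first_last_bf16) != comm_buffer_dtype()
    ]

# With first/last layers in BF16, exactly those two layers mismatch
# the FP8 comm buffers; with a pure FP8 recipe, nothing mismatches.
```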
Solution
This PR addresses the issue by disabling TP communication overlap whenever --first-last-layers-bf16 is enabled:
Current Fix:
- Disable TP communication overlap for configurations that use --first-last-layers-bf16
- This avoids the problematic communication path and thus the data-format mismatch
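The fix amounts to an argument-validation guard along these lines. This is a minimal sketch; the attribute names (`first_last_layers_bf16`, `tp_comm_overlap`) mirror the CLI flags but are assumptions, not the exact Megatron-LM argument fields:

```python
# Hypothetical validation guard: if both options are set, fall back to
# non-overlapped TP communication instead of crashing during training.
import warnings

def validate_tp_comm_overlap(args):
    if getattr(args, "first_last_layers_bf16", False) and \
       getattr(args, "tp_comm_overlap", False):
        warnings.warn(
            "--tp-comm-overlap is incompatible with --first-last-layers-bf16 "
            "under FP8 tensorwise quantization; disabling TP comm overlap."
        )
        args.tp_comm_overlap = False
    return args
```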
Future Enhancement Consideration:
The ideal long-term solution would be precision-aware communication operators that can:
- dynamically select the appropriate communication algorithm based on each layer's actual precision,
- use BF16 communication for the first/last layers when --first-last-layers-bf16 is enabled,
- use FP8 communication for the intermediate layers under FP8 tensorwise quantization,
- and enable TP overlap seamlessly regardless of the mixed-precision configuration.
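The per-layer selection described above could look like the following sketch. The function and its parameters are hypothetical, meant only to show the dispatch rule, not a proposed Megatron-LM API:

```python
# Hypothetical precision-aware dispatch: pick the comm dtype per layer
# instead of assuming FP8 globally for all overlap buffers.

def select_comm_dtype(layer_idx, num_layers, first_last_bf16, fp8_enabled):
    """Return the communication dtype matching the layer's actual precision."""
    if not fp8_enabled:
        return "bf16"          # pure BF16 run: no FP8 comm anywhere
    if first_last_bf16 and layer_idx in (0, num_layers - 1):
        return "bf16"          # boundary layers forced to BF16
    return "fp8"               # intermediate layers use FP8 tensorwise
```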