
[QUESTION] Loss Differences Between Single GPU and Tensor Parallelism #1736

@Tona8941

Your question
I want to understand whether the loss differences I'm observing between single GPU and tensor parallel execution are reasonable/expected, and how to determine whether they are acceptable systematic variation or silent failures. Are there established tolerance ranges or detection methods for validating the numerical accuracy of parallel execution?

Description
I've observed consistent differences in loss values between single GPU and tensor parallel (TP=2) training, even in the most minimal model configurations. I created a systematic set of experiments to isolate the source of these differences and would like to understand if this is expected behavior and how to properly validate numerical accuracy in tensor parallel settings.

Environment

  • Container Image: nvcr.io/nvidia/pytorch:23.12-py3
  • PyTorch version: 2.2.0a0+81ea7a4 (NGC 23.12 container version)
  • CUDA version: 12.3
  • cuDNN version: 8.9.6
  • NCCL version: 2.19.3
  • Python version: 3.10
  • Number of GPUs: 2

Steps to reproduce
The test scripts should be run from a directory at the same level as the megatron-lm folder:
workspace/
├── megatron-lm/
├── test_scripts.py
└── logs/

#!/usr/bin/env python3
"""
Minimal experiment: Verify numerical differences caused by Softmax and vocabulary parallelism in Megatron-LM
"""

import os
import torch
import numpy as np
from pathlib import Path
import json
import subprocess
import time

class MinimalSoftmaxTPTester:
    def __init__(self, megatron_path="./megatron-lm"):
        self.megatron_path = Path(megatron_path)
        self.results = {}
        
    def create_minimal_configs(self):
        """Build the set of minimal experiment configurations."""
        # Base configuration - minimal model
        base_config = {
            # Minimize model architecture
            "num_layers": 1,  # Only 1 layer to reduce accumulated errors
            "hidden_size": 64,  # Minimal hidden layer size
            "num_attention_heads": 2,  # Minimum attention heads
            "seq_length": 32,  # Short sequence
            "max_position_embeddings": 32,
            "vocab_size": 128,  
            
            # Batch configuration
            "micro_batch_size": 1,
            "global_batch_size": 1,
            "train_samples": 10,
            "save_interval": 100,  
            
            # Key parameters 
            "seed": 1234,
            "init_method_std": 0.02,
            "fp16": False,  
            "bf16": False,
            "params_dtype": "float32",
            "attention_softmax_in_fp32": True, 
            
            # Disable all optimizations that might introduce randomness
            "use_flash_attn": False,
            "gradient_accumulation_fusion": False,
            "use_cpu_initialization": True, 
            "sequence_parallel": False,
            "overlap_grad_reduce": False,
            "overlap_param_gather": False,
            
            # Set dropout to 0
            "attention_dropout": 0.0,
            "hidden_dropout": 0.0,
            
            # Set learning rate to very small value
            "lr": 1e-10,  
            "min_lr": 1e-10,
            "weight_decay": 0.0,
            "clip_grad": 100.0,
            
            # Other settings
            "log_interval": 1,
            "eval_iters": 1,
            "eval_interval": 5,
            "tokenizer_type": "GPT2BPETokenizer",
        }
        
        # Experiment 1: Simplest single-layer model (baseline)
        experiment_1 = base_config.copy()
        experiment_1["experiment_name"] = "minimal_single_layer"
        experiment_1["description"] = "Simplest single-layer Transformer to locate basic numerical differences"
        
        # Experiment 2: Disable LayerNorm
        experiment_2 = base_config.copy()
        experiment_2["experiment_name"] = "no_layernorm"
        experiment_2["layernorm_epsilon"] = 1.0  
        experiment_2["apply_layernorm_1p"] = False
        experiment_2["description"] = "Disable LayerNorm to isolate Attention and Softmax effects"
        
        # Experiment 3: Linear attention (disable Softmax)
        experiment_3 = base_config.copy()
        experiment_3["experiment_name"] = "linear_attention"
        experiment_3["attention_softmax_in_fp32"] = False
        experiment_3["description"] = "Test if Softmax causes the difference"
        
        # Experiment 4: Vocabulary projection layer only
        experiment_4 = base_config.copy()
        experiment_4["experiment_name"] = "vocab_parallel_only"
        experiment_4["num_layers"] = 0  # No Transformer layers
        experiment_4["description"] = "Test only vocabulary parallel output layer"
        
        # Experiment 5: Fixed weight initialization
        experiment_5 = base_config.copy()
        experiment_5["experiment_name"] = "fixed_weights"
        experiment_5["init_method"] = "zero"  
        experiment_5["description"] = "Use fixed weights to eliminate initialization differences"
        
        # Experiment 6: Different attention head configurations
        experiment_6 = base_config.copy()
        experiment_6["experiment_name"] = "attention_heads_test"
        experiment_6["num_attention_heads"] = 4
        experiment_6["hidden_size"] = 128  
        experiment_6["description"] = "Test the impact of attention head splitting"
        
        return [
            experiment_1, experiment_2, experiment_3,
            experiment_4, experiment_5, experiment_6
        ]
    
    def generate_test_script(self, config, tp_size=1):
        """Generate test script"""
        
        mode = "tp" if tp_size > 1 else "single"
        experiment_name = config["experiment_name"]
        script_name = f"test_{experiment_name}_{mode}.sh"
        
        # Build arguments
        args = []
        
        # Parallel configuration
        args.extend([
            f"--tensor-model-parallel-size {tp_size}",
            f"--pipeline-model-parallel-size 1",
        ])
        
        # Model architecture
        args.extend([
            f"--num-layers {config.get('num_layers', 1)}",
            f"--hidden-size {config['hidden_size']}",
            f"--num-attention-heads {config['num_attention_heads']}",
            f"--seq-length {config['seq_length']}",
            f"--max-position-embeddings {config['max_position_embeddings']}",
        ])
        
        # Batch and training 
        args.extend([
            f"--micro-batch-size {config['micro_batch_size']}",
            f"--global-batch-size {config['global_batch_size']}",
            f"--train-samples {config['train_samples']}",
            f"--lr-decay-samples {config.get('train_samples', 10)}",  
            f"--lr-warmup-samples 0", 
            f"--seed {config['seed']}",
            f"--lr {config['lr']}",
            f"--min-lr {config.get('lr', 0.0)}",  
            f"--lr-decay-style constant",  
            f"--weight-decay {config['weight_decay']}",
            f"--clip-grad {config['clip_grad']}",
            "--optimizer adam",  
            "--adam-beta1 0.9",  
            "--adam-beta2 0.95",  
            "--adam-eps 1e-8",  
        ])
        
        # Dropout
        args.extend([
            f"--attention-dropout {config['attention_dropout']}",
            f"--hidden-dropout {config['hidden_dropout']}",
        ])
        
        # Precision settings
        if config.get('fp16', False):
            args.append("--fp16")
            args.append("--initial-loss-scale 1024")
            args.append("--min-loss-scale 1")
        elif config.get('bf16', False):
            args.append("--bf16")
        else:
            # Default FP32, no special flags needed
            pass
        
        # Special parameters
        if config.get("attention_softmax_in_fp32", True):
            args.append("--attention-softmax-in-fp32")
        
        if "layernorm_epsilon" in config:
            args.append(f"--layernorm-epsilon {config['layernorm_epsilon']}")
            
        if config.get("init_method") == "zero":
            args.append("--init-method-std 0.0")
        else:
            args.append(f"--init-method-std {config.get('init_method_std', 0.02)}")
        
        if config.get("use_cpu_initialization"):
            args.append("--use-cpu-initialization")
        
        # Save and logging parameters
        args.extend([
            f"--log-interval {config['log_interval']}",
            f"--eval-iters {config['eval_iters']}",
            f"--eval-interval {config['eval_interval']}",
            f"--save-interval {config.get('save_interval', 100)}",
            "--distributed-backend nccl",
            "--data-path ./minimal_data/test_data_text_document",
            "--vocab-file ./vocab/gpt2-vocab.json",
            "--merge-file ./vocab/gpt2-merges.txt",
            "--tokenizer-type GPT2BPETokenizer",
            "--split 949,50,1",
            f"--save ./checkpoints/{experiment_name}_{mode}",
            f"--tensorboard-dir ./logs/{experiment_name}_{mode}",
            "--no-load-optim",
            "--no-load-rng",
        ])
        
        # Generate script content
        env_vars = f"""#!/bin/bash
# Minimal experiment: {config['description']}
# Mode: {mode.upper()} (TP={tp_size})

export CUDA_DEVICE_MAX_CONNECTIONS=1
export MASTER_ADDR=localhost
export MASTER_PORT=6002
export WORLD_SIZE={tp_size}
export RANK=0

# Deterministic settings
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export CUBLAS_WORKSPACE_CONFIG=:4096:8
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False

echo "Experiment: {experiment_name}"
echo "Description: {config['description']}"
echo "Tensor parallel size: {tp_size}"
"""
        
        if tp_size > 1:
            launcher = f"torchrun --nnodes=1 --nproc-per-node={tp_size} --master-port=6002"
        else:
            launcher = "python"
        
        args_str = " \\\n    ".join(args)
        
        script_content = f"""{env_vars}

# Run test
{launcher} {self.megatron_path}/pretrain_gpt.py \\
    {args_str} 2>&1 | tee ./logs/{experiment_name}_{mode}.log

# Extract key numerical results
echo ""
echo "=== Extracting numerical results ==="
grep "validation loss" ./logs/{experiment_name}_{mode}.log | tail -5
grep "lm loss" ./logs/{experiment_name}_{mode}.log | tail -5
"""
        
        # Save script
        script_path = Path("./scripts") / script_name
        script_path.parent.mkdir(exist_ok=True)
        with open(script_path, 'w') as f:
            f.write(script_content)
        os.chmod(script_path, 0o755)
        
        return str(script_path)
    
    def create_analysis_script(self):
        """Create results analysis script"""
        
        analysis_script = """#!/usr/bin/env python3
'''
Analyze numerical differences between single GPU vs tensor parallelism
Focus on the impact of Softmax and vocabulary parallelism
'''

import re
import json
from pathlib import Path

def extract_losses(log_file):
    '''Extract loss values from log'''
    losses = []
    with open(log_file, 'r') as f:
        content = f.read()
        
    # Extract validation loss
    pattern = r'validation loss.*?(\d+\.\d+(?:E[+-]\d+)?)'
    matches = re.findall(pattern, content)
    
    if not matches:
        # Try to extract lm loss
        pattern = r'lm loss:\s*(\d+\.\d+(?:E[+-]\d+)?)'
        matches = re.findall(pattern, content)
    
    return [float(m) for m in matches]

def compare_experiments():
    '''Compare single GPU vs TP results for all experiments'''
    
    experiments = [
        "minimal_single_layer",
        "no_layernorm", 
        "linear_attention",
        "vocab_parallel_only",
        "fixed_weights",
        "attention_heads_test"
    ]
    
    results = {}
    
    for exp_name in experiments:
        single_log = f"./logs/{exp_name}_single.log"
        tp_log = f"./logs/{exp_name}_tp.log"
        
        if Path(single_log).exists() and Path(tp_log).exists():
            single_losses = extract_losses(single_log)
            tp_losses = extract_losses(tp_log)
            
            if single_losses and tp_losses:
                # Calculate differences
                min_len = min(len(single_losses), len(tp_losses))
                if min_len > 0:
                    diffs = []
                    for i in range(min_len):
                        if single_losses[i] > 0:
                            rel_diff = abs(tp_losses[i] - single_losses[i]) / single_losses[i]
                            diffs.append(rel_diff)
                    
                    if diffs:
                        results[exp_name] = {
                            "single_final": single_losses[-1],
                            "tp_final": tp_losses[-1],
                            "max_rel_diff": max(diffs),
                            "mean_rel_diff": sum(diffs) / len(diffs)
                        }
    
    # Print results table
    print("\\n" + "="*75)
    print("Softmax and Vocabulary Parallelism Numerical Difference Analysis")
    print("="*80)
    print(f"{'Experiment Name':<25} {'Single GPU Loss':<15} {'TP Loss':<15} {'Max Relative Error':<15}")
    print("-"*75)
    
    for exp_name, data in results.items():
        print(f"{exp_name:<25} {data['single_final']:<15.6e} {data['tp_final']:<15.6e} "
              f"{data['max_rel_diff']:<15.4%}")
    
    # Analysis conclusions
    print("\\n" + "="*75)
    print("Conclusion Analysis:")
    print("-"*75)
    
    if "vocab_parallel_only" in results:
        vocab_diff = results["vocab_parallel_only"]["max_rel_diff"]
        print(f"Vocabulary parallel layer max relative error: {vocab_diff:.4%}")
    
    if "linear_attention" in results and "minimal_single_layer" in results:
        linear_diff = results["linear_attention"]["max_rel_diff"]
        minimal_diff = results["minimal_single_layer"]["max_rel_diff"]
        print(f"Linear attention max relative error: {linear_diff:.4%}")
        print(f"Minimal single layer max relative error: {minimal_diff:.4%}")
        if linear_diff < minimal_diff:
            print("Softmax may be a contributing factor to numerical differences")
    
    if "no_layernorm" in results and "minimal_single_layer" in results:
        no_ln_diff = results["no_layernorm"]["max_rel_diff"]
        minimal_diff = results["minimal_single_layer"]["max_rel_diff"]
        print(f"No LayerNorm max relative error: {no_ln_diff:.4%}")
        if no_ln_diff < minimal_diff:
            print("LayerNorm may increase numerical differences")
    
    return results

if __name__ == "__main__":
    results = compare_experiments()
    
    # Save results
    with open("./softmax_tp_analysis.json", 'w') as f:
        json.dump(results, f, indent=2)
    
    print("\\nResults saved to: ./softmax_tp_analysis.json")
"""
        
        script_path = Path("./analyze_softmax_tp.py")
        with open(script_path, 'w') as f:
            f.write(analysis_script)
        os.chmod(script_path, 0o755)
        
        return str(script_path)
    
    def generate_all_experiments(self):
        """Generate all experiment scripts"""
        configs = self.create_minimal_configs()
        
        print("=== Generating Minimal Softmax/Vocabulary Parallelism Tests ===\n")
        
        all_scripts = []
        
        for config in configs:
            # Generate single GPU version
            single_script = self.generate_test_script(config, tp_size=1)
            # Generate tensor parallel version (TP=2)
            tp_script = self.generate_test_script(config, tp_size=2)
            
            all_scripts.append({
                "experiment": config["experiment_name"],
                "description": config["description"],
                "single_script": single_script,
                "tp_script": tp_script
            })
            
            print(f"Experiment: {config['experiment_name']}")
            print(f"  Description: {config['description']}")
            print(f"  Single GPU script: {single_script}")
            print(f"  TP script: {tp_script}")
            print()
        
        # Generate batch run script
        batch_script = """#!/bin/bash
# Batch run all minimal experiments

echo "Starting minimal Softmax/Vocabulary parallelism tests"
echo "================================"

# Create log directory
mkdir -p ./logs

# Run all experiments
experiments=(
    "minimal_single_layer"
    "no_layernorm"
    "linear_attention"
    "vocab_parallel_only"
    "fixed_weights"
    "attention_heads_test"
)

for exp in "${experiments[@]}"; do
    echo ""
    echo "Running experiment: $exp"
    echo "-------------------"
    
    # Run single GPU version
    echo "Single GPU version..."
    bash ./scripts/test_${exp}_single.sh
    
    # Run TP version
    echo "Tensor parallel version..."
    bash ./scripts/test_${exp}_tp.sh
    
    echo "Experiment $exp completed"
done

echo ""
echo "================================"
echo "All experiments completed, starting analysis..."
python ./analyze_softmax_tp.py
"""
        
        batch_path = Path("./run_all_softmax_tests.sh")
        with open(batch_path, 'w') as f:
            f.write(batch_script)
        os.chmod(batch_path, 0o755)
        
        # Generate analysis script
        analysis_script = self.create_analysis_script()
        
        print("=== Generation Complete ===")
        print(f"\nBatch run script: {batch_path}")
        print(f"Analysis script: {analysis_script}")
        return all_scripts

if __name__ == "__main__":
    tester = MinimalSoftmaxTPTester()
    tester.generate_all_experiments()

Results
I conducted several controlled experiments to isolate the source of differences:

Experiment              Description                      Single GPU Loss  TP=2 Loss     Observation
----------------------  -------------------------------  ---------------  ------------  ---------------------------
minimal_single_layer    Baseline 1-layer transformer     1.082085e+01     1.084218e+01  Consistent difference
linear_attention        Disabled softmax computation     1.082085e+01     1.084218e+01  Same difference as baseline
fixed_weights           Zero initialization              1.082584e+01     1.082838e+01  Reduced but present
attention_heads_test    Increased attention heads (2→4)  1.092446e+01     1.088210e+01  Larger absolute difference
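
For reference, the baseline gap corresponds to a relative difference of roughly 0.2%. The arithmetic, using the loss values from the table above:

# Relative difference for the baseline row (values copied from the table above).
single_loss, tp_loss = 1.082085e+01, 1.084218e+01
rel_diff = abs(tp_loss - single_loss) / single_loss
print(f"baseline relative difference: {rel_diff:.4%}")  # ≈ 0.1971%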

Analysis
The differences appear to originate from the AllReduce operations during tensor-parallel communication (a minimal reduction-order sketch follows after this list), because:

  • They persist even without softmax operations
  • They scale with the degree of parallelism
  • They're affected by numerical magnitudes
  • They occur even with deterministic settings (fixed seed, CPU initialization)
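
To make the reduction-order suspicion concrete, here is a minimal sketch that is independent of Megatron-LM: splitting a sum into per-rank partial sums (the TP=2 analogue) and then combining them generally does not reproduce the bit pattern of a single sequential reduction in fp32.

# Minimal sketch of floating-point non-associativity in a split reduction
# (a stand-in for a tensor-parallel AllReduce; this is not Megatron-LM code).
import torch

torch.manual_seed(1234)
x = torch.randn(4096, dtype=torch.float32)

single_sum = x.sum()                               # single-GPU-style reduction
partials = [chunk.sum() for chunk in x.chunk(2)]   # per-"rank" partial sums (TP=2 analogue)
tp_like_sum = torch.stack(partials).sum()          # combine partials, as an all-reduce would

print(single_sum.item(), tp_like_sum.item())
print("bitwise equal:", torch.equal(single_sum, tp_like_sum))  # frequently False in fp32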

Questions

  1. Are these differences originating from the AllReduce communication operations?
  • My experiments show the differences persist even when softmax is disabled, suggesting they come from the communication layer rather than from specific computational kernels.
  2. Why are these numerical differences unavoidable in tensor parallelism, and what is their practical impact?
  • Is this due to floating-point non-associativity in distributed reductions?
  • Will these differences affect final model quality or training convergence?
  3. How can we distinguish between acceptable numerical variance and silent errors?
  • Are there tools or testing frameworks within Megatron-LM to validate numerical accuracy?
  • What is the acceptable error range (e.g., a relative difference threshold)? A minimal sketch of the kind of check I have in mind follows after this list.
  • Is there a recommended test suite or validation procedure to confirm that tensor parallel execution is working correctly rather than experiencing silent failures?
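
For question 3, this is the kind of check I currently run by hand. It is only a sketch: the pairing of losses and the 1% tolerance are my own assumptions, not a documented Megatron-LM threshold.

# Hedged sketch: treat runs as matching when paired losses agree within an
# assumed relative tolerance (the 1e-2 default is a guess, not an official bound).
def losses_match(single_losses, tp_losses, rel_tol=1e-2):
    """Return True if every paired loss agrees within rel_tol (relative)."""
    pairs = list(zip(single_losses, tp_losses))
    if not pairs:
        return False
    return all(abs(t - s) / max(abs(s), 1e-12) <= rel_tol for s, t in pairs)

# The final losses from the baseline experiment above pass a 1% tolerance:
print(losses_match([1.082085e+01], [1.084218e+01]))  # True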
