Your question
I want to understand whether the loss differences I'm observing between single-GPU and tensor-parallel execution are reasonable/expected, and how to determine whether they represent acceptable systematic variation or silent failures. Are there established tolerance ranges or detection methods for validating the loss accuracy of the different parallelism strategies?
Description
I've observed consistent differences in loss values between single GPU and tensor parallel (TP=2) training, even in the most minimal model configurations. I created a systematic set of experiments to isolate the source of these differences and would like to understand if this is expected behavior and how to properly validate numerical accuracy in tensor parallel settings.
Environment
- Container Image: nvcr.io/nvidia/pytorch:23.12-py3
- PyTorch version: 2.2.0a0+81ea7a4 (NGC 23.12 container version)
- CUDA version: 12.3
- cuDNN version: 8.9.6
- NCCL version: 2.19.3
- Python version: 3.10
- Number of GPUs: 2
Steps to reproduce
The generator script below (saved here as test_scripts.py) is run from a directory at the same level as the megatron-lm folder; it writes per-experiment launch scripts to ./scripts/, a batch runner run_all_softmax_tests.sh, and an analysis script analyze_softmax_tp.py. The expected layout is:
workspace/
├── megatron-lm/
├── test_scripts.py
└── logs/
#!/usr/bin/env python3
"""
Minimal experiment: Verify numerical differences caused by Softmax and vocabulary parallelism in Megatron-LM
"""
import os
import torch
import numpy as np
from pathlib import Path
import json
import subprocess
import time
class MinimalSoftmaxTPTester:
    def __init__(self, megatron_path="./megatron-lm"):
        self.megatron_path = Path(megatron_path)
        self.results = {}
    def create_minimal_configs(self):
        # Base configuration - minimal model
        base_config = {
            # Minimize model architecture
            "num_layers": 1,  # Only 1 layer to reduce accumulated errors
            "hidden_size": 64,  # Minimal hidden layer size
            "num_attention_heads": 2,  # Minimum attention heads
            "seq_length": 32,  # Short sequence
            "max_position_embeddings": 32,
            "vocab_size": 128,
            # Batch configuration
            "micro_batch_size": 1,
            "global_batch_size": 1,
            "train_samples": 10,
            "save_interval": 100,
            # Key parameters
            "seed": 1234,
            "init_method_std": 0.02,
            "fp16": False,
            "bf16": False,
            "params_dtype": "float32",
            "attention_softmax_in_fp32": True,
            # Disable all optimizations that might introduce randomness
            "use_flash_attn": False,
            "gradient_accumulation_fusion": False,
            "use_cpu_initialization": True,
            "sequence_parallel": False,
            "overlap_grad_reduce": False,
            "overlap_param_gather": False,
            # Set dropout to 0
            "attention_dropout": 0.0,
            "hidden_dropout": 0.0,
            # Set learning rate to very small value
            "lr": 1e-10,
            "min_lr": 1e-10,
            "weight_decay": 0.0,
            "clip_grad": 100.0,
            # Other settings
            "log_interval": 1,
            "eval_iters": 1,
            "eval_interval": 5,
            "tokenizer_type": "GPT2BPETokenizer",
        }
        # Experiment 1: Simplest single-layer model (baseline)
        experiment_1 = base_config.copy()
        experiment_1["experiment_name"] = "minimal_single_layer"
        experiment_1["description"] = "Simplest single-layer Transformer to locate basic numerical differences"
        # Experiment 2: Disable LayerNorm
        experiment_2 = base_config.copy()
        experiment_2["experiment_name"] = "no_layernorm"
        experiment_2["layernorm_epsilon"] = 1.0
        experiment_2["apply_layernorm_1p"] = False
        experiment_2["description"] = "Disable LayerNorm to isolate Attention and Softmax effects"
        # Experiment 3: Linear attention (disable Softmax)
        experiment_3 = base_config.copy()
        experiment_3["experiment_name"] = "linear_attention"
        experiment_3["attention_softmax_in_fp32"] = False
        experiment_3["description"] = "Test if Softmax causes the difference"
        # Experiment 4: Vocabulary projection layer only
        experiment_4 = base_config.copy()
        experiment_4["experiment_name"] = "vocab_parallel_only"
        experiment_4["num_layers"] = 0  # No Transformer layers
        experiment_4["description"] = "Test only vocabulary parallel output layer"
        # Experiment 5: Fixed weight initialization
        experiment_5 = base_config.copy()
        experiment_5["experiment_name"] = "fixed_weights"
        experiment_5["init_method"] = "zero"
        experiment_5["description"] = "Use fixed weights to eliminate initialization differences"
        # Experiment 6: Different attention head configurations
        experiment_6 = base_config.copy()
        experiment_6["experiment_name"] = "attention_heads_test"
        experiment_6["num_attention_heads"] = 4
        experiment_6["hidden_size"] = 128
        experiment_6["description"] = "Test the impact of attention head splitting"
        return [
            experiment_1, experiment_2, experiment_3,
            experiment_4, experiment_5, experiment_6
        ]
    def generate_test_script(self, config, tp_size=1):
        """Generate test script"""
        mode = "tp" if tp_size > 1 else "single"
        experiment_name = config["experiment_name"]
        script_name = f"test_{experiment_name}_{mode}.sh"
        # Build arguments
        args = []
        # Parallel configuration
        args.extend([
            f"--tensor-model-parallel-size {tp_size}",
            f"--pipeline-model-parallel-size 1",
        ])
        # Model architecture
        args.extend([
            f"--num-layers {config.get('num_layers', 1)}",
            f"--hidden-size {config['hidden_size']}",
            f"--num-attention-heads {config['num_attention_heads']}",
            f"--seq-length {config['seq_length']}",
            f"--max-position-embeddings {config['max_position_embeddings']}",
        ])
        # Batch and training
        args.extend([
            f"--micro-batch-size {config['micro_batch_size']}",
            f"--global-batch-size {config['global_batch_size']}",
            f"--train-samples {config['train_samples']}",
            f"--lr-decay-samples {config.get('train_samples', 10)}",
            f"--lr-warmup-samples 0",
            f"--seed {config['seed']}",
            f"--lr {config['lr']}",
            f"--min-lr {config.get('min_lr', 0.0)}",
            f"--lr-decay-style constant",
            f"--weight-decay {config['weight_decay']}",
            f"--clip-grad {config['clip_grad']}",
            "--optimizer adam",
            "--adam-beta1 0.9",
            "--adam-beta2 0.95",
            "--adam-eps 1e-8",
        ])
        # Dropout
        args.extend([
            f"--attention-dropout {config['attention_dropout']}",
            f"--hidden-dropout {config['hidden_dropout']}",
        ])
        # Precision settings
        if config.get('fp16', False):
            args.append("--fp16")
            args.append("--initial-loss-scale 1024")
            args.append("--min-loss-scale 1")
        elif config.get('bf16', False):
            args.append("--bf16")
        else:
            # Default FP32, no special flags needed
            pass
        # Special parameters
        if config.get("attention_softmax_in_fp32", True):
            args.append("--attention-softmax-in-fp32")
        if "layernorm_epsilon" in config:
            args.append(f"--layernorm-epsilon {config['layernorm_epsilon']}")
        if config.get("init_method") == "zero":
            args.append("--init-method-std 0.0")
        else:
            args.append(f"--init-method-std {config.get('init_method_std', 0.02)}")
        if config.get("use_cpu_initialization"):
            args.append("--use-cpu-initialization")
        # Save and logging parameters
        args.extend([
            f"--log-interval {config['log_interval']}",
            f"--eval-iters {config['eval_iters']}",
            f"--eval-interval {config['eval_interval']}",
            f"--save-interval {config.get('save_interval', 100)}",
            "--distributed-backend nccl",
            "--data-path ./minimal_data/test_data_text_document",
            "--vocab-file ./vocab/gpt2-vocab.json",
            "--merge-file ./vocab/gpt2-merges.txt",
            "--tokenizer-type GPT2BPETokenizer",
            "--split 949,50,1",
            f"--save ./checkpoints/{experiment_name}_{mode}",
            f"--tensorboard-dir ./logs/{experiment_name}_{mode}",
            "--no-load-optim",
            "--no-load-rng",
        ])
        # Generate script content
        env_vars = f"""#!/bin/bash
# Minimal experiment: {config['description']}
# Mode: {mode.upper()} (TP={tp_size})
export CUDA_DEVICE_MAX_CONNECTIONS=1
export MASTER_ADDR=localhost
export MASTER_PORT=6002
export WORLD_SIZE={tp_size}
export RANK=0
# Deterministic settings
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export CUBLAS_WORKSPACE_CONFIG=:4096:8
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
echo "Experiment: {experiment_name}"
echo "Description: {config['description']}"
echo "Tensor parallel size: {tp_size}"
"""
        if tp_size > 1:
            launcher = f"torchrun --nnodes=1 --nproc-per-node={tp_size} --master-port=6002"
        else:
            launcher = "python"
        args_str = " \\\n ".join(args)
        script_content = f"""{env_vars}
# Run test
{launcher} {self.megatron_path}/pretrain_gpt.py \\
{args_str} 2>&1 | tee ./logs/{experiment_name}_{mode}.log
# Extract key numerical results
echo ""
echo "=== Extracting numerical results ==="
grep "validation loss" ./logs/{experiment_name}_{mode}.log | tail -5
grep "lm loss" ./logs/{experiment_name}_{mode}.log | tail -5
"""
        # Save script
        script_path = Path("./scripts") / script_name
        script_path.parent.mkdir(exist_ok=True)
        with open(script_path, 'w') as f:
            f.write(script_content)
        os.chmod(script_path, 0o755)
        return str(script_path)
    def create_analysis_script(self):
        """Create results analysis script"""
        analysis_script = """#!/usr/bin/env python3
'''
Analyze numerical differences between single GPU vs tensor parallelism
Focus on the impact of Softmax and vocabulary parallelism
'''
import re
import json
from pathlib import Path


def extract_losses(log_file):
    '''Extract loss values from log'''
    losses = []
    with open(log_file, 'r') as f:
        content = f.read()
    # Extract validation loss
    pattern = r'validation loss.*?(\d+\.\d+(?:E[+-]\d+)?)'
    matches = re.findall(pattern, content)
    if not matches:
        # Try to extract lm loss
        pattern = r'lm loss:\s*(\d+\.\d+(?:E[+-]\d+)?)'
        matches = re.findall(pattern, content)
    return [float(m) for m in matches]


def compare_experiments():
    '''Compare single GPU vs TP results for all experiments'''
    experiments = [
        "minimal_single_layer",
        "no_layernorm",
        "linear_attention",
        "vocab_parallel_only",
        "fixed_weights",
        "attention_heads_test"
    ]
    results = {}
    for exp_name in experiments:
        single_log = f"./logs/{exp_name}_single.log"
        tp_log = f"./logs/{exp_name}_tp.log"
        if Path(single_log).exists() and Path(tp_log).exists():
            single_losses = extract_losses(single_log)
            tp_losses = extract_losses(tp_log)
            if single_losses and tp_losses:
                # Calculate differences
                min_len = min(len(single_losses), len(tp_losses))
                if min_len > 0:
                    diffs = []
                    for i in range(min_len):
                        if single_losses[i] > 0:
                            rel_diff = abs(tp_losses[i] - single_losses[i]) / single_losses[i]
                            diffs.append(rel_diff)
                    if diffs:
                        results[exp_name] = {
                            "single_final": single_losses[-1],
                            "tp_final": tp_losses[-1],
                            "max_rel_diff": max(diffs),
                            "mean_rel_diff": sum(diffs) / len(diffs)
                        }
    # Print results table
    print("\\n" + "="*75)
    print("Softmax and Vocabulary Parallelism Numerical Difference Analysis")
    print("="*80)
    print(f"{'Experiment Name':<25} {'Single GPU Loss':<15} {'TP Loss':<15} {'Max Relative Error':<15}")
    print("-"*75)
    for exp_name, data in results.items():
        print(f"{exp_name:<25} {data['single_final']:<15.6e} {data['tp_final']:<15.6e} "
              f"{data['max_rel_diff']:<15.4%}")
    # Analysis conclusions
    print("\\n" + "="*75)
    print("Conclusion Analysis:")
    print("-"*75)
    if "vocab_parallel_only" in results:
        vocab_diff = results["vocab_parallel_only"]["max_rel_diff"]
        print(f"Vocabulary parallel layer max relative error: {vocab_diff:.4%}")
    if "linear_attention" in results and "minimal_single_layer" in results:
        linear_diff = results["linear_attention"]["max_rel_diff"]
        minimal_diff = results["minimal_single_layer"]["max_rel_diff"]
        print(f"Linear attention max relative error: {linear_diff:.4%}")
        print(f"Minimal single layer max relative error: {minimal_diff:.4%}")
        if linear_diff < minimal_diff:
            print("Softmax may be a contributing factor to numerical differences")
    if "no_layernorm" in results and "minimal_single_layer" in results:
        no_ln_diff = results["no_layernorm"]["max_rel_diff"]
        minimal_diff = results["minimal_single_layer"]["max_rel_diff"]
        print(f"No LayerNorm max relative error: {no_ln_diff:.4%}")
        if no_ln_diff < minimal_diff:
            print("LayerNorm may increase numerical differences")
    return results


if __name__ == "__main__":
    results = compare_experiments()
    # Save results
    with open("./softmax_tp_analysis.json", 'w') as f:
        json.dump(results, f, indent=2)
    print("\\nResults saved to: ./softmax_tp_analysis.json")
"""
        script_path = Path("./analyze_softmax_tp.py")
        with open(script_path, 'w') as f:
            f.write(analysis_script)
        os.chmod(script_path, 0o755)
        return str(script_path)
    def generate_all_experiments(self):
        """Generate all experiment scripts"""
        configs = self.create_minimal_configs()
        print("=== Generating Minimal Softmax/Vocabulary Parallelism Tests ===\n")
        all_scripts = []
        for config in configs:
            # Generate single GPU version
            single_script = self.generate_test_script(config, tp_size=1)
            # Generate tensor parallel version (TP=2)
            tp_script = self.generate_test_script(config, tp_size=2)
            all_scripts.append({
                "experiment": config["experiment_name"],
                "description": config["description"],
                "single_script": single_script,
                "tp_script": tp_script
            })
            print(f"Experiment: {config['experiment_name']}")
            print(f" Description: {config['description']}")
            print(f" Single GPU script: {single_script}")
            print(f" TP script: {tp_script}")
            print()
        # Generate batch run script
        batch_script = """#!/bin/bash
# Batch run all minimal experiments
echo "Starting minimal Softmax/Vocabulary parallelism tests"
echo "================================"
# Create log directory
mkdir -p ./logs
# Run all experiments
experiments=(
    "minimal_single_layer"
    "no_layernorm"
    "linear_attention"
    "vocab_parallel_only"
    "fixed_weights"
    "attention_heads_test"
)
for exp in "${experiments[@]}"; do
    echo ""
    echo "Running experiment: $exp"
    echo "-------------------"
    # Run single GPU version
    echo "Single GPU version..."
    bash ./scripts/test_${exp}_single.sh
    # Run TP version
    echo "Tensor parallel version..."
    bash ./scripts/test_${exp}_tp.sh
    echo "Experiment $exp completed"
done
echo ""
echo "================================"
echo "All experiments completed, starting analysis..."
python ./analyze_softmax_tp.py
"""
        batch_path = Path("./run_all_softmax_tests.sh")
        with open(batch_path, 'w') as f:
            f.write(batch_script)
        os.chmod(batch_path, 0o755)
        # Generate analysis script
        analysis_script = self.create_analysis_script()
        print("=== Generation Complete ===")
        print(f"\nBatch run script: {batch_path}")
        print(f"Analysis script: {analysis_script}")
        return all_scripts
if __name__ == "__main__":
    tester = MinimalSoftmaxTPTester()
    tester.generate_all_experiments()
Results
I conducted several controlled experiments to isolate the source of the differences; the resulting loss values are summarized below (a worked relative-difference calculation follows the table):
Experiment             Description                        Single GPU loss   TP=2 loss      Observation
---------------------  ---------------------------------  ----------------  -------------  ---------------------------
minimal_single_layer   Baseline 1-layer transformer       1.082085e+01      1.084218e+01   Consistent difference
linear_attention       Disabled softmax computation       1.082085e+01      1.084218e+01   Same difference as baseline
fixed_weights          Zero initialization                1.082584e+01      1.082838e+01   Reduced but present
attention_heads_test   Increased attention heads (2→4)    1.092446e+01      1.088210e+01   Larger absolute difference
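For scale, the gap in the baseline row corresponds to a relative difference of roughly 0.2%. This is just arithmetic on the values in the table above, written as a small Python snippet for clarity:

single = 1.082085e+01
tp2 = 1.084218e+01
rel_diff = abs(tp2 - single) / single  # ~1.97e-3, i.e. about 0.20%
print(f"baseline relative difference: {rel_diff:.4%}")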
Analysis
The differences appear to originate from the AllReduce operations used for tensor-parallel communication (a standalone illustration of the reduction-order effect follows this list), because:
- They persist even when the softmax computation is disabled
- They scale with the degree of parallelism
- Their size depends on the magnitude of the values being reduced
- They occur even with deterministic settings (fixed seed, CPU initialization)
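The sketch below is a minimal, Megatron-independent illustration of the reduction-order effect I have in mind: splitting the same fp32 dot product into two shards and adding the partial sums, as a TP=2 all-reduce effectively does, changes the order of the floating-point additions. It does not reproduce the 0.2% loss gap above; it only demonstrates that bitwise equality should not be expected.

import numpy as np

rng = np.random.default_rng(1234)
x = rng.standard_normal(4096).astype(np.float32)
w = rng.standard_normal(4096).astype(np.float32)

# "Single GPU": one contiguous fp32 dot product
full = np.dot(x, w)

# "TP=2": each shard reduces locally, then the partial sums are combined,
# which changes the summation order
half = x.size // 2
sharded = np.dot(x[:half], w[:half]) + np.dot(x[half:], w[half:])

print(full, sharded, abs(full - sharded))  # usually a small but nonzero difference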
Questions
- Are these differences originating from the AllReduce communication operations?
  - My experiments show that the differences persist even when softmax is disabled, which suggests they come from the communication layer rather than from any specific computational kernel.
- Why are these numerical differences unavoidable in tensor parallelism, and what is their practical impact?
  - Is this due to floating-point non-associativity in distributed reductions?
  - Will these differences affect final model quality or training convergence?
- How can we distinguish between acceptable numerical variance and silent errors? (A sketch of the stopgap check I currently use follows this list.)
  - Are there tools or testing frameworks within Megatron-LM to validate numerical accuracy?
  - What is an acceptable error range (e.g., a relative-difference threshold)?
- Is there a recommended test suite or validation procedure to confirm that tensor-parallel execution is working correctly rather than failing silently?
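For reference, the stopgap check I currently apply to the extracted losses looks like the sketch below. The 1% relative tolerance is my own assumption, not a value taken from Megatron-LM, and it is exactly the kind of threshold I am hoping to get guidance on.

def check_loss_parity(single_losses, tp_losses, rel_tol=1e-2):
    """Compare per-iteration losses from a single-GPU run and a TP run.

    rel_tol is an assumed threshold, not an official Megatron-LM value.
    """
    worst = 0.0
    for s, t in zip(single_losses, tp_losses):
        worst = max(worst, abs(t - s) / max(abs(s), 1e-12))
    return worst <= rel_tol, worst

# Final losses from the baseline experiment above
ok, worst = check_loss_parity([1.082085e+01], [1.084218e+01])
print(ok, f"worst relative difference: {worst:.4%}")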