v0.12.2 pretrain failed, 0.14.0rc2 model conversion failed #1718

Description

@justalittlenoob

My goal is to train and run inference with Qwen2.5-1.5B-Instruct.

OS: Ubuntu 24.04; GPUs: 4 × 4090D (pp=2, tp=2)

$ pip list | grep cuda
nvidia-cuda-cupti-cu12   12.9.79
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
$ pip list | grep torch
pytorch-triton           3.4.0+gitae848267
torch                    2.9.0.dev20250716+cu129
torchaudio               2.8.0.dev20250716+cu129
torchprofile             0.0.4
torchvision              0.24.0.dev20250716+cu129
transformer_engine_torch 2.5.0
  1. First, I tried v0.14.0rc2. The model conversion command is as follows:
  python ../tools/checkpoint/convert.py \
      --loader llama_mistral \
      --saver mcore \
      --checkpoint-type hf \
      --target-tensor-parallel-size 2 \
      --target-pipeline-parallel-size 2 \
      --model-type GPT \
      --tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
      --model-size qwen2.5 \
      --saver-transformer-impl local \
      --load-dir $TARGET_LLM_MODEL_HF_DIR \
      --save-dir $TARGET_LLM_MODEL_MCORE_DIR

The error obtained is as follows:

[screenshot of the conversion error]

(With pp=1 and tp=1, the model conversion works; see the sketch below.)
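
For reference, a minimal sketch of the pp=1/tp=1 conversion that does succeed; it is the same command as above with only the target parallel sizes changed:

  python ../tools/checkpoint/convert.py \
      --loader llama_mistral \
      --saver mcore \
      --checkpoint-type hf \
      --target-tensor-parallel-size 1 \
      --target-pipeline-parallel-size 1 \
      --model-type GPT \
      --tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
      --model-size qwen2.5 \
      --saver-transformer-impl local \
      --load-dir $TARGET_LLM_MODEL_HF_DIR \
      --save-dir $TARGET_LLM_MODEL_MCORE_DIR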

  2. From Error occured while converting Mistral-7B-v0.3-hf to Mistral-7B-v0.3-mcore #1708, it appears that the issue above is a known bug, so I switched to v0.12.2.
  • The model converted successfully from HF to mcore format.
  • Pre-training failed. The command used is as follows:
#!/usr/bin/env bash

GPUS_PER_NODE=4
MASTER_ADDR=localhost
MASTER_PORT=29501
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))  # informational only; not referenced by torchrun below

TENSORBOARD_LOGS_PATH=./tensorboard_logs
DATA_PATH=wiki_qwen_text_document
CHECKPOINT=qwen2.5-1.5B-instruct-mcore-pp2-tp2

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --seq-length 512
    --max-position-embeddings 32768
    --tokenizer-type HuggingFaceTokenizer
    --tokenizer-model $TARGET_LLM_MODEL_HF_DIR
    --exit-on-missing-checkpoint
    --use-checkpoint-args
    --no-load-optim
    --no-load-rng
    --untie-embeddings-and-output-weights
    --use-rotary-position-embeddings
    --normalization RMSNorm
    --no-position-embedding
    --no-masked-softmax-fusion
    --attention-softmax-in-fp32
    --num-layers 28
    --hidden-size 1536
    --num-attention-heads 12
    --ffn-hidden-size 8960  # intermediate_size in the HF config
    --group-query-attention
    --num-query-groups 4
)

TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 16
    --train-iters 100
    --weight-decay 0.1
    --adam-beta1 0.9
    --adam-beta2 0.95
    --init-method-std 0.006
    --clip-grad 1.0
    --bf16
    --lr 5.0e-4
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --lr-warmup-fraction .01
    --lr-decay-iters 100000
    --use-distributed-optimizer
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 2
    --pipeline-model-parallel-size 2
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 98,2,0
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 20
    --save-interval 50
    --eval-interval 20
    --save $CHECKPOINT
    --load $CHECKPOINT
    --eval-iters 20
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
)

torchrun "${DISTRIBUTED_ARGS[@]}" ../pretrain_gpt.py \
    "${GPT_MODEL_ARGS[@]}" \
    "${TRAINING_ARGS[@]}" \
    "${MODEL_PARALLEL_ARGS[@]}" \
    "${DATA_ARGS[@]}" \
    "${EVAL_AND_LOGGING_ARGS[@]}" \
    --use-mcore-models
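
Note that the script assumes $TARGET_LLM_MODEL_HF_DIR is already exported in the environment, e.g. (the path is a placeholder):

    export TARGET_LLM_MODEL_HF_DIR=/path/to/Qwen2.5-1.5B-Instruct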

The first error obtained is:

[screenshot of the first error]

After making the following change:

[screenshot of the code change]

A new error appears:

[screenshot of the new error]

Conclusion
If v0.12.2 is used for the model conversion and v0.14.0rc2 is then used for training and inference, it works, as sketched below.
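
A minimal sketch of that workaround pipeline, assuming the two versions are checked out in separate directories (both paths and the wrapper script name pretrain_qwen.sh are placeholders):

  # Step 1: convert HF -> mcore with the v0.12.2 checkout
  # (as reported above, the conversion itself succeeds on this version)
  cd /path/to/Megatron-LM-v0.12.2
  python tools/checkpoint/convert.py \
      --loader llama_mistral \
      --saver mcore \
      --checkpoint-type hf \
      --target-tensor-parallel-size 2 \
      --target-pipeline-parallel-size 2 \
      --model-type GPT \
      --model-size qwen2.5 \
      --saver-transformer-impl local \
      --tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
      --load-dir $TARGET_LLM_MODEL_HF_DIR \
      --save-dir $TARGET_LLM_MODEL_MCORE_DIR

  # Step 2: train with the v0.14.0rc2 checkout, loading the converted checkpoint
  cd /path/to/Megatron-LM-v0.14.0rc2
  bash pretrain_qwen.sh   # the training script above, with --load pointing at $TARGET_LLM_MODEL_MCORE_DIR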

What I would like to know:

  • Which version should I use?
  • Is the issue in item 1 a bug?
  • How can I solve the problem in item 2?
  • Does --model-size qwen2.5 mean any parameter size is supported (e.g. 1.5B, 8B, etc.)?

Any advice would be appreciated.
