Description
My goal is to train and run inference with Qwen2.5-1.5B-Instruct.
OS: Ubuntu 24.04; GPUs: 4 × 4090D (pp: 2, tp: 2)
$ pip list | grep cuda
nvidia-cuda-cupti-cu12 12.9.79
nvidia-cuda-nvrtc-cu12 12.9.86
nvidia-cuda-runtime-cu12 12.9.79
$ pip list | grep torch
pytorch-triton 3.4.0+gitae848267
torch 2.9.0.dev20250716+cu129
torchaudio 2.8.0.dev20250716+cu129
torchprofile 0.0.4
torchvision 0.24.0.dev20250716+cu129
transformer_engine_torch 2.5.0
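As a quick sanity check that this PyTorch build actually sees all four GPUs, something like the following can be run (standard PyTorch calls only; nothing Megatron-specific is assumed):
import torch
# Environment sanity check: should report the cu129 nightly build and 4 visible devices.
print(torch.__version__)          # e.g. 2.9.0.dev20250716+cu129
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.device_count())  # expect 4 for the 4 x 4090D setup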
- At first, I tried using vcore-0.14.0rc2. The model conversion command used is as follows:
python ../tools/checkpoint/convert.py \
--loader llama_mistral \
--saver mcore \
--checkpoint-type hf \
--target-tensor-parallel-size 2 \
--target-pipeline-parallel-size 2 \
--model-type GPT \
--tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
--model-size qwen2.5 \
--saver-transformer-impl local \
--load-dir $TARGET_LLM_MODEL_HF_DIR \
--save-dir $TARGET_LLM_MODEL_MCORE_DIR
The error obtained is as follows:
[screenshot of the conversion error]
(If pp: 1 and tp: 1 are used, the model conversion works.)
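(As an aside, the architecture arguments used in the training script further below have to match the model's HF config.json. Here is a minimal sketch for cross-checking them, assuming the standard Qwen2-style field names in config.json and a hypothetical path variable:)
import json, os

# Hypothetical path; point this at $TARGET_LLM_MODEL_HF_DIR.
hf_dir = os.environ.get("TARGET_LLM_MODEL_HF_DIR", "Qwen2.5-1.5B-Instruct")
with open(os.path.join(hf_dir, "config.json")) as f:
    cfg = json.load(f)

# Qwen2-style field names; each maps to one of the Megatron args in the training script below.
print("num_hidden_layers      :", cfg["num_hidden_layers"])        # --num-layers
print("hidden_size            :", cfg["hidden_size"])              # --hidden-size
print("num_attention_heads    :", cfg["num_attention_heads"])      # --num-attention-heads
print("intermediate_size      :", cfg["intermediate_size"])        # --ffn-hidden-size
print("num_key_value_heads    :", cfg["num_key_value_heads"])      # --num-query-groups
print("max_position_embeddings:", cfg["max_position_embeddings"])  # --max-position-embeddings
print("tie_word_embeddings    :", cfg.get("tie_word_embeddings"))  # compare with --untie-embeddings-and-output-weights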
- From Error occured while converting Mistral-7B-v0.3-hf to Mistral-7B-v0.3-mcore #1708, it appears that the issue mentioned above is a bug, so I switched to the v0.12.2 version.
- The model was successfully converted from HF to mcore format.
- Pre-training failed. The command used is as follows:
#!/usr/bin/env bash
GPUS_PER_NODE=4
MASTER_ADDR=localhost
MASTER_PORT=29501
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
TENSORBOARD_LOGS_PATH=./tensorboard_logs
DATA_PATH=wiki_qwen_text_document
CHECKPOINT=qwen2.5-1.5B-instruct-mcore-pp2-tp2
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NUM_NODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
GPT_MODEL_ARGS=(
--seq-length 512
--max-position-embeddings 32768
--tokenizer-type HuggingFaceTokenizer
--tokenizer-model $TARGET_LLM_MODEL_HF_DIR
--exit-on-missing-checkpoint
--use-checkpoint-args
--no-load-optim
--no-load-rng
--untie-embeddings-and-output-weights
--use-rotary-position-embeddings
--normalization RMSNorm
--no-position-embedding
--no-masked-softmax-fusion
--attention-softmax-in-fp32
--num-layers 28
--hidden-size 1536
--num-attention-heads 12
--ffn-hidden-size 8960 #5632 # intermediate_size
--group-query-attention
--num-query-groups 4
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 16
--train-iters 100
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--init-method-std 0.006
--clip-grad 1.0
--bf16
--lr 5.0e-4
--lr-decay-style cosine
--min-lr 1.0e-5
--lr-warmup-fraction .01
--lr-decay-iters 100000
--use-distributed-optimizer
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 2
)
DATA_ARGS=(
--data-path $DATA_PATH
--split 98,2,0
)
EVAL_AND_LOGGING_ARGS=(
--log-interval 20
--save-interval 50
--eval-interval 20
--save $CHECKPOINT
--load $CHECKPOINT
--eval-iters 20
--tensorboard-dir $TENSORBOARD_LOGS_PATH
)
torchrun ${DISTRIBUTED_ARGS[@]} ../pretrain_gpt.py \
${GPT_MODEL_ARGS[@]} \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${DATA_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]} \
--use-mcore-models
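For reference, the parallel layout this script implies can be sanity-checked with a bit of arithmetic (only the generic relation global_batch = micro_batch × data_parallel_size × accumulation_steps is assumed here):
# Parallel layout implied by the script above.
gpus_per_node, num_nodes = 4, 1
world_size = gpus_per_node * num_nodes               # 4
tp, pp = 2, 2                                        # --tensor/--pipeline-model-parallel-size
assert world_size % (tp * pp) == 0, "world size must be divisible by tp * pp"
dp = world_size // (tp * pp)                         # data-parallel size = 1

micro_batch, global_batch = 1, 16
assert global_batch % (micro_batch * dp) == 0
accum_steps = global_batch // (micro_batch * dp)     # 16 gradient-accumulation steps
print(f"dp={dp}, gradient accumulation steps={accum_steps}")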
The first error obtained from running this script is:
[screenshot of the first error]
After making the following change:
[screenshot of the change]
A new error appears:
[screenshot of the second error]
Conclusion
If v0.12.2 is used for model conversion and vcore-0.14.0rc2 is then used for training and inference, it works.
The information I want to get:
- Which version should I use?
- Is the issue in bullet 1 a bug?
- How can the problem in bullet 2 be solved?
- Does --model-size qwen2.5 mean it supports any parameter size (e.g. 1.5B, 8B, etc.)?
Any advice would be appreciated.