v0.12.2 pretrain failed, 0.14.0rc2 model conversion failed #1718

Description

@justalittlenoob

My goal is to train and run inference with Qwen2.5-1.5B-Instruct.

OS: Ubuntu 24.04; GPUs: 4 × 4090D (pp=2, tp=2)

$ pip list | grep cuda
nvidia-cuda-cupti-cu12   12.9.79
nvidia-cuda-nvrtc-cu12   12.9.86
nvidia-cuda-runtime-cu12 12.9.79
$ pip list | grep torch
pytorch-triton           3.4.0+gitae848267
torch                    2.9.0.dev20250716+cu129
torchaudio               2.8.0.dev20250716+cu129
torchprofile             0.0.4
torchvision              0.24.0.dev20250716+cu129
transformer_engine_torch 2.5.0
  1. First, I tried v0.14.0rc2. The model conversion command is as follows:
  python ../tools/checkpoint/convert.py \
      --loader llama_mistral \
      --saver mcore \
      --checkpoint-type hf \
      --target-tensor-parallel-size 2 \
      --target-pipeline-parallel-size 2 \
      --model-type GPT \
      --tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
      --model-size qwen2.5 \
      --saver-transformer-impl local \
      --load-dir $TARGET_LLM_MODEL_HF_DIR \
      --save-dir $TARGET_LLM_MODEL_MCORE_DIR

The error obtained is as follows:

[screenshot of the conversion error]

(With pp=1 and tp=1, the model conversion works; see the sketch below.)
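
For reference, a minimal sketch of the pp=1/tp=1 conversion that does succeed; it is the same command as above with only the target parallel sizes changed:

  python ../tools/checkpoint/convert.py \
      --loader llama_mistral \
      --saver mcore \
      --checkpoint-type hf \
      --target-tensor-parallel-size 1 \
      --target-pipeline-parallel-size 1 \
      --model-type GPT \
      --tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
      --model-size qwen2.5 \
      --saver-transformer-impl local \
      --load-dir $TARGET_LLM_MODEL_HF_DIR \
      --save-dir $TARGET_LLM_MODEL_MCORE_DIR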

  2. From Error occured while converting Mistral-7B-v0.3-hf to Mistral-7B-v0.3-mcore #1708, it appears that the issue above is a known bug, so I switched to v0.12.2.
  • The model converted successfully from HF to mcore format.
  • Pre-training failed. The command used is as follows:
#!/usr/bin/env bash

GPUS_PER_NODE=4
MASTER_ADDR=localhost
MASTER_PORT=29501
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))  # informational only; not referenced by torchrun below

TENSORBOARD_LOGS_PATH=./tensorboard_logs
DATA_PATH=wiki_qwen_text_document
CHECKPOINT=qwen2.5-1.5B-instruct-mcore-pp2-tp2

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --seq-length 512
    --max-position-embeddings 32768
    --tokenizer-type HuggingFaceTokenizer
    --tokenizer-model $TARGET_LLM_MODEL_HF_DIR
    --exit-on-missing-checkpoint
    --use-checkpoint-args
    --no-load-optim
    --no-load-rng
    --untie-embeddings-and-output-weights
    --use-rotary-position-embeddings
    --normalization RMSNorm
    --no-position-embedding
    --no-masked-softmax-fusion
    --attention-softmax-in-fp32
    --num-layers 28
    --hidden-size 1536
    --num-attention-heads 12
    --ffn-hidden-size 8960  # intermediate_size in the HF config
    --group-query-attention
    --num-query-groups 4
)

TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 16
    --train-iters 100
    --weight-decay 0.1
    --adam-beta1 0.9
    --adam-beta2 0.95
    --init-method-std 0.006
    --clip-grad 1.0
    --bf16
    --lr 5.0e-4
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --lr-warmup-fraction .01
    --lr-decay-iters 100000
    --use-distributed-optimizer
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 2
    --pipeline-model-parallel-size 2
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 98,2,0
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 20
    --save-interval 50
    --eval-interval 20
    --save $CHECKPOINT
    --load $CHECKPOINT
    --eval-iters 20
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
)

torchrun "${DISTRIBUTED_ARGS[@]}" ../pretrain_gpt.py \
    "${GPT_MODEL_ARGS[@]}" \
    "${TRAINING_ARGS[@]}" \
    "${MODEL_PARALLEL_ARGS[@]}" \
    "${DATA_ARGS[@]}" \
    "${EVAL_AND_LOGGING_ARGS[@]}" \
    --use-mcore-models
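
Note that the script assumes $TARGET_LLM_MODEL_HF_DIR is already exported in the environment, e.g. (the path is a placeholder):

    export TARGET_LLM_MODEL_HF_DIR=/path/to/Qwen2.5-1.5B-Instruct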

The first error obtained is:

[screenshot of the first error]

After making the following change:

[screenshot of the code change]

A new error appears:

[screenshot of the new error]

Conclusion
If v0.12.2 is used for the model conversion and v0.14.0rc2 is then used for training and inference, it works, as sketched below.
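
A minimal sketch of that workaround pipeline, assuming the two versions are checked out in separate directories (both paths and the wrapper script name pretrain_qwen.sh are placeholders):

  # Step 1: convert HF -> mcore with the v0.12.2 checkout
  # (as reported above, the conversion itself succeeds on this version)
  cd /path/to/Megatron-LM-v0.12.2
  python tools/checkpoint/convert.py \
      --loader llama_mistral \
      --saver mcore \
      --checkpoint-type hf \
      --target-tensor-parallel-size 2 \
      --target-pipeline-parallel-size 2 \
      --model-type GPT \
      --model-size qwen2.5 \
      --saver-transformer-impl local \
      --tokenizer-model $TARGET_LLM_MODEL_HF_DIR \
      --load-dir $TARGET_LLM_MODEL_HF_DIR \
      --save-dir $TARGET_LLM_MODEL_MCORE_DIR

  # Step 2: train with the v0.14.0rc2 checkout, loading the converted checkpoint
  cd /path/to/Megatron-LM-v0.14.0rc2
  bash pretrain_qwen.sh   # the training script above, with --load pointing at $TARGET_LLM_MODEL_MCORE_DIR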

What I would like to know:

  • Which version should I use?
  • Is the issue in item 1 a bug?
  • How can I solve the problem in item 2?
  • Does --model-size qwen2.5 mean any parameter size is supported (e.g. 1.5B, 8B, etc.)?

Any advice would be appreciated.
