
[QUESTION] Epochs Larger Than 1 When Specified with Trained Samples #1127

@zixianwang2022

Description


Your question
Ask a clear and concise question about Megatron-LM.

Hi, I am doing a toy experiment in training the model.

I specified TRAIN_SAMPLES=100 in my train.sh, and there are only 100 data points in my training dataset.

TRAIN_SAMPLES=100  # 300B tokens / 4096
LR_WARMUP_SAMPLES=0
LR_DECAY_SAMPLES=100 # TRAIN_SAMPLES - LR_WARMUP_SAMPLES

options=" \
    ...
    --train-samples ${TRAIN_SAMPLES} \
    --lr-warmup-samples ${LR_WARMUP_SAMPLES} \
    --lr-decay-samples ${LR_DECAY_SAMPLES} \
    ...
    --split 99,1,0 \
    "

torchrun --nproc_per_node 1 pretrain_model.py ${options}

But the log shows
total number of epochs: 165, even though I set TRAIN_SAMPLES=100.

Why does this happen when I use the --train-samples flag instead of --train-iters?

[Screenshot of the training log reporting total number of epochs: 165]
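A minimal sketch of why the epoch count can exceed 1 here, assuming the behavior of the legacy sample-index builder in megatron/data/gpt_dataset.py (its _num_epochs helper; the exact name and location may differ between versions, and the numbers in the example below are purely illustrative): the builder counts tokens rather than documents, so if the 100 documents together hold fewer than roughly 100 * seq_length tokens, the dataset has to be traversed for many epochs to yield 100 training samples of seq_length tokens each.

def num_epochs_needed(tokens_per_epoch: int, seq_length: int, num_samples: int) -> int:
    # Keep adding full passes over the tokenized dataset until enough
    # seq_length-token training samples can be drawn to cover num_samples.
    num_epochs = 0
    total_tokens = 0
    while True:
        num_epochs += 1
        total_tokens += tokens_per_epoch
        # Each sample consumes seq_length tokens plus one label token that
        # overlaps with the next sample, hence the -1.
        if (total_tokens - 1) // seq_length >= num_samples:
            return num_epochs

# Hypothetical numbers, not taken from this run: 100 short documents of
# ~2,500 tokens each, seq_length 4096, --train-samples 100.
print(num_epochs_needed(tokens_per_epoch=100 * 2500, seq_length=4096, num_samples=100))

With these hypothetical numbers the function returns 2; with much shorter documents the count grows quickly, which would be consistent with a log line such as "total number of epochs: 165".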
