### Description

### Checks
- [x] This template is only for research questions, not usage problems, feature requests, or bug reports.
- [x] I have thoroughly reviewed the project documentation and read the related paper(s).
- [x] I have searched existing issues, including closed ones, and found no similar questions.
- [x] I am submitting this issue in English to facilitate community communication.
### Problem
To evaluate the capabilities of this repository, I attempted to train a single-speaker model from scratch on a dataset of slightly over 24 hours, similar to LJSpeech. After training, the speaker identity is preserved well, but the generated textual content is unintelligible (babbling nonsense), even with text prompts taken from the training set.
### Details
- Commit used: ebbd7bd (less than a month old)
- Training setup:
  - Dataset: ~24 hours, single speaker
  - Hardware: 4× H200 GPUs
  - Training duration: ~3 days (16,000 epochs)
  - Configuration attached below.
- Observations:
  - Loss curve attached below.
  - The output suggests that alignment was not learned successfully.
### Request for Help
Could anyone provide guidance on what I might be missing for successful single-speaker, from-scratch training of the F5-TTS model? Are there specific hyperparameters, preprocessing steps, or alignment-related configurations I should pay closer attention to? Or is the model simply unable to learn alignment from a single-speaker, 24-hour dataset, so that multiple speakers are strictly required for training from scratch? Also, is there a way to tell whether alignment has been learned, e.g. once the loss falls below a certain threshold? Any advice would be greatly appreciated.
Thanks in advance!
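For context, here is my back-of-envelope estimate (my own arithmetic, not project code) of how many optimizer updates this run performed, using the frame-based batch size and mel settings from the config below. The dataset size (~24 h) and GPU count (4) are taken from my setup above; everything else is simple arithmetic:

```python
import math

# Assumed values from my config and setup (not measured from the run itself)
hours = 24                        # approximate dataset duration
sample_rate = 24_000              # mel_spec.target_sample_rate
hop_length = 256                  # mel_spec.hop_length
batch_frames_per_gpu = 76_800     # datasets.batch_size_per_gpu (frame type)
num_gpus = 4
epochs = 16_000
warmup_updates = 20_000           # optim.num_warmup_updates

frames_per_second = sample_rate / hop_length        # 93.75 mel frames per second
total_frames = hours * 3600 * frames_per_second     # ~8.1M mel frames in the dataset
frames_per_update = batch_frames_per_gpu * num_gpus # 307,200 frames per optimizer step

updates_per_epoch = math.ceil(total_frames / frames_per_update)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)              # 27 updates per epoch
print(total_updates)                  # 432,000 updates total
print(warmup_updates / total_updates) # warmup covers ~4.6% of training
```

So if my arithmetic is right, 16,000 epochs on this dataset is roughly 430k updates, of which the 20k warmup is only a small fraction. I would appreciate confirmation on whether this update count is in the right ballpark for learning alignment from scratch.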

```yaml
hydra:
  run:
    dir: <target_directory>

datasets:
  name: <dataset_name>
  batch_size_per_gpu: 76800
  batch_size_type: frame
  max_samples: 64
  num_workers: 12

optim:
  epochs: 16000
  learning_rate: 7.5e-5
  num_warmup_updates: 20000
  grad_accumulation_steps: 1
  max_grad_norm: 1.0
  bnb_optimizer: False

model:
  name: F5TTS_Small
  tokenizer: char
  tokenizer_path: null
  backbone: DiT
  arch:
    dim: 768
    depth: 18
    heads: 12
    ff_mult: 2
    text_dim: 512
    text_mask_padding: False
    conv_layers: 4
    pe_attn_head: 1
    attn_backend: torch
    attn_mask_enabled: False
    checkpoint_activations: False
  mel_spec:
    target_sample_rate: 24000
    n_mel_channels: 100
    hop_length: 256
    win_length: 1024
    n_fft: 1024
    mel_spec_type: vocos
  vocoder:
    is_local: False
    local_path: null

ckpts:
  logger: tensorboard
  log_samples: True
  save_per_updates: 50000
  keep_last_n_checkpoints: -1
  last_per_updates: 5000
  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
```
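To sanity-check the frame-based batch size above, here is how I convert `batch_size_per_gpu: 76800` into seconds of audio per GPU per update (my own conversion, assuming these mel settings):

```python
# Assumed values copied from my config above
sample_rate = 24_000   # mel_spec.target_sample_rate
hop_length = 256       # mel_spec.hop_length
batch_frames = 76_800  # datasets.batch_size_per_gpu, in mel frames

# Each mel frame covers hop_length samples, so:
seconds_per_batch = batch_frames * hop_length / sample_rate
print(seconds_per_batch)       # 819.2 seconds
print(seconds_per_batch / 60)  # ~13.7 minutes of audio per GPU per update
```

If this conversion is correct, each update sees roughly 13.7 minutes of audio per GPU, which seems large relative to a 24-hour dataset; I would welcome feedback on whether that is a sensible setting.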