### Description

### Checks
- [x] This template is only for research questions, not usage problems, feature requests, or bug reports.
- [x] I have thoroughly reviewed the project documentation and read the related paper(s).
- [x] I have searched existing issues, including closed ones, and found no similar questions.
- [x] I am submitting this issue in English to facilitate community communication.
### Problem
To evaluate the capabilities of this repository, I attempted to train a single-speaker model from scratch on a dataset of slightly over 24 hours, similar to LJSpeech. After training, the speaker identity is preserved well, but the generated textual content is unintelligible (babbling nonsense), even with text prompts taken from the training set.
### Details
- Commit used: ebbd7bd (less than a month old)
- Training setup:
  - Dataset: ~24 hours, single speaker
  - Hardware: 4× H200 GPUs
  - Training duration: ~3 days (16,000 epochs)
  - Configuration attached below.
- Observations:
  - Loss curve attached below.
  - The output suggests that alignment was not learned successfully.
### Request for Help
Could anyone provide guidance on what I might be missing for successful single-speaker, from-scratch training of the F5-TTS model? Are there specific hyperparameters, preprocessing steps, or alignment-related configurations I should pay closer attention to? Or is the model simply unable to learn alignment from a single-speaker, 24-hour dataset, so that multiple speakers are strictly required for training from scratch? Also, is there a way to tell whether alignment has been learned, e.g. once the loss falls below a certain threshold? Any advice would be greatly appreciated.
Thanks in advance!
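For context, here is my back-of-envelope estimate (my own arithmetic, not project code) of how many optimizer updates this run performed, using the frame-based batch size and mel settings from the config below. The dataset size (~24 h) and GPU count (4) are taken from my setup above; everything else is simple arithmetic:

```python
import math

# Assumed values from my config and setup (not measured from the run itself)
hours = 24                        # approximate dataset duration
sample_rate = 24_000              # mel_spec.target_sample_rate
hop_length = 256                  # mel_spec.hop_length
batch_frames_per_gpu = 76_800     # datasets.batch_size_per_gpu (frame type)
num_gpus = 4
epochs = 16_000
warmup_updates = 20_000           # optim.num_warmup_updates

frames_per_second = sample_rate / hop_length        # 93.75 mel frames per second
total_frames = hours * 3600 * frames_per_second     # ~8.1M mel frames in the dataset
frames_per_update = batch_frames_per_gpu * num_gpus # 307,200 frames per optimizer step

updates_per_epoch = math.ceil(total_frames / frames_per_update)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)              # 27 updates per epoch
print(total_updates)                  # 432,000 updates total
print(warmup_updates / total_updates) # warmup covers ~4.6% of training
```

So if my arithmetic is right, 16,000 epochs on this dataset is roughly 430k updates, of which the 20k warmup is only a small fraction. I would appreciate confirmation on whether this update count is in the right ballpark for learning alignment from scratch.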

```yaml
hydra:
  run:
    dir: <target_directory>

datasets:
  name: <dataset_name>
  batch_size_per_gpu: 76800
  batch_size_type: frame
  max_samples: 64
  num_workers: 12

optim:
  epochs: 16000
  learning_rate: 7.5e-5
  num_warmup_updates: 20000
  grad_accumulation_steps: 1
  max_grad_norm: 1.0
  bnb_optimizer: False

model:
  name: F5TTS_Small
  tokenizer: char
  tokenizer_path: null
  backbone: DiT
  arch:
    dim: 768
    depth: 18
    heads: 12
    ff_mult: 2
    text_dim: 512
    text_mask_padding: False
    conv_layers: 4
    pe_attn_head: 1
    attn_backend: torch
    attn_mask_enabled: False
    checkpoint_activations: False
  mel_spec:
    target_sample_rate: 24000
    n_mel_channels: 100
    hop_length: 256
    win_length: 1024
    n_fft: 1024
    mel_spec_type: vocos
  vocoder:
    is_local: False
    local_path: null

ckpts:
  logger: tensorboard
  log_samples: True
  save_per_updates: 50000
  keep_last_n_checkpoints: -1
  last_per_updates: 5000
  save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
```
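To sanity-check the frame-based batch size above, here is how I convert `batch_size_per_gpu: 76800` into seconds of audio per GPU per update (my own conversion, assuming these mel settings):

```python
# Assumed values copied from my config above
sample_rate = 24_000   # mel_spec.target_sample_rate
hop_length = 256       # mel_spec.hop_length
batch_frames = 76_800  # datasets.batch_size_per_gpu, in mel frames

# Each mel frame covers hop_length samples, so:
seconds_per_batch = batch_frames * hop_length / sample_rate
print(seconds_per_batch)       # 819.2 seconds
print(seconds_per_batch / 60)  # ~13.7 minutes of audio per GPU per update
```

If this conversion is correct, each update sees roughly 13.7 minutes of audio per GPU, which seems large relative to a 24-hour dataset; I would welcome feedback on whether that is a sensible setting.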