Long Silence in Generated Speech After ~800k Steps #1101

Open
@atlonxp

Description

Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

After a certain number of training steps, the F5-TTS model begins generating speech outputs that contain long stretches of silence. This occurs with both multilingual and monolingual datasets:

  • Dataset 1: Thai, English, Chinese → issue starts around 800k steps
  • Dataset 2: Thai only → issue starts around 750k steps

After reaching these steps, the generated output often becomes mostly or entirely silent despite no apparent changes to the training process or hyperparameters.

Detailed Observation

  • Before reaching the threshold step count (~750k–800k), there are occasional silent outputs.
  • As training progresses closer to this point, the frequency of silent generations increases.
  • Eventually, the model almost always outputs silence, even for previously working prompts.
  • This degradation appears to be gradual, not sudden.
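
To put a number on "mostly or entirely silent", one option is a frame-level energy check on each generated waveform. The sketch below is not part of F5-TTS; the function name `silence_fraction`, the 25 ms frame length, the -50 dBFS threshold, and the 24 kHz sample rate are all assumptions chosen for illustration, and it expects a mono float waveform in the range [-1, 1].

```python
import numpy as np


def silence_fraction(wave, sample_rate, frame_ms=25.0, threshold_db=-50.0):
    """Fraction of fixed-length frames whose RMS level is below threshold_db (dBFS).

    All defaults are illustrative assumptions, not values taken from F5-TTS.
    """
    wave = np.asarray(wave, dtype=np.float64)
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 1.0  # treat an empty or too-short clip as silent
    # Drop the trailing partial frame and compute one RMS value per frame.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Clamp before the log so digital silence does not produce -inf.
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    return float(np.mean(level_db < threshold_db))


# Synthetic check: 1 s of a 220 Hz tone followed by 3 s of digital silence.
sr = 24000  # assumed sample rate
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
clip = np.concatenate([tone, np.zeros(3 * sr)])
print(silence_fraction(clip, sr))  # 0.75
```

Running this over a fixed set of prompts at each saved checkpoint would turn "the frequency of silent generations increases" into a curve that can be compared between the two datasets.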

Questions

  • Have you observed similar behavior in your experiments?
  • Could this be related to overfitting, stability in the diffusion process, or dataset imbalance?
  • Are there recommended mitigation strategies (e.g., dynamic loss balancing, early stopping, discriminator tuning)?

Environment

  • F5-TTS latest commit
  • GPU: A100 x 4

Steps to Reproduce

  1. Train F5-TTS with:
     • Dataset 1: Mixed Thai, English, Chinese
     • Dataset 2: Thai only
  2. Continue training past 700k steps
  3. Generate speech with a standard prompt
  4. Observe increased silence or entirely silent outputs beyond ~750k–800k steps

Note: I have checked FFMPEG; it is working fine.
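
To make the reproduction comparable across runs, the per-checkpoint results could be summarized as a silent-output rate and an onset step. This is a minimal sketch under stated assumptions: the helper names `silent_rate_by_step` and `first_degraded_step`, the 0.8 and 0.5 cutoffs, and the example numbers are all hypothetical placeholders, not measurements from these training runs. The input is a per-sample silence score between 0 and 1 for each checkpoint step, computed by whatever method is preferred.

```python
def silent_rate_by_step(fractions_by_step, silent_cutoff=0.8):
    """Map each checkpoint step to the share of its samples counted as silent.

    A sample counts as silent when its silence score is >= silent_cutoff
    (an assumed cutoff, to be tuned per dataset).
    """
    rates = {}
    for step, fractions in sorted(fractions_by_step.items()):
        if not fractions:
            continue  # skip checkpoints with no generated samples
        silent = sum(1 for f in fractions if f >= silent_cutoff)
        rates[step] = silent / len(fractions)
    return rates


def first_degraded_step(rates, max_rate=0.5):
    """Return the earliest step whose silent rate exceeds max_rate, or None."""
    for step in sorted(rates):
        if rates[step] > max_rate:
            return step
    return None


# Made-up illustrative scores, NOT measurements from the issue.
example = {
    600_000: [0.10, 0.15, 0.12, 0.20],
    700_000: [0.10, 0.90, 0.15, 0.20],
    800_000: [0.95, 0.90, 0.15, 1.00],
}
rates = silent_rate_by_step(example)
print(rates)                       # {600000: 0.0, 700000: 0.25, 800000: 0.75}
print(first_degraded_step(rates))  # 800000
```

Reporting the rate per checkpoint, rather than a single threshold step, also captures the gradual onset described above.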

✔️ Expected Behavior

The model should continue to produce natural speech without abnormal silence as long as training progresses normally.

❌ Actual Behavior

  • Outputs degrade into silence permanently after a certain number of training steps.
  • However, at certain training step counts close to the point where the degradation turns into permanent silence,

Labels: bug (Something isn't working)
