Description
Checks
- This template is only for bug reports, usage problems go with 'Help Wanted'.
- I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
- I have searched for existing issues, including closed ones, and couldn't find a solution.
- I am using English to submit this issue to facilitate community communication.
Environment Details
After a certain number of training steps, the F5-TTS model begins generating speech outputs that contain long stretches of silence. This occurs with both multilingual and monolingual datasets:
- Dataset 1: Thai, English, Chinese → issue starts around 800k steps
- Dataset 2: Thai only → issue starts around 750k steps
After reaching these steps, the generated output often becomes mostly or entirely silent despite no apparent changes to the training process or hyperparameters.
Detailed Observation
- Before reaching the threshold step count (~750k–800k), there are occasional silent outputs.
- As training progresses closer to this point, the frequency of silent generations increases.
- Eventually, the model almost always outputs silence, even for previously working prompts.
- This degradation appears to be gradual, not sudden.
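To put numbers on "frequency of silent generations", one option is to score each generated clip by how much of it is silent and track that across checkpoints. The following is only a minimal sketch, not part of F5-TTS: the function names, the 25 ms frame size, the -40 dBFS threshold, and the 0.8 cutoff are all assumptions chosen for illustration, and it expects mono float audio scaled to [-1, 1].

```python
import numpy as np


def silence_fraction(wave, sample_rate, frame_ms=25.0, threshold_db=-40.0):
    """Fraction of fixed-size frames whose RMS level is below threshold_db (dBFS).

    Hypothetical helper: assumes `wave` is mono float audio in [-1, 1].
    """
    wave = np.asarray(wave, dtype=np.float64)
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 1.0  # treat empty or too-short audio as silent

    # Drop the trailing partial frame, then compute per-frame RMS.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))

    # Convert to dB relative to full scale; the floor avoids log10(0).
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    return float(np.mean(level_db < threshold_db))


def is_mostly_silent(wave, sample_rate, max_silence=0.8):
    """Flag a clip as a 'silent generation' (the cutoff is an assumed value)."""
    return silence_fraction(wave, sample_rate) >= max_silence
```

Running this over the same fixed set of prompts at each saved checkpoint and counting the flagged clips would show whether the silent rate rises gradually toward ~750k–800k steps, as described above.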
Questions
- Have you observed similar behavior in your experiments?
- Could this be related to overfitting, stability in the diffusion process, or dataset imbalance?
- Are there recommended mitigation strategies (e.g., dynamic loss balancing, early stopping, discriminator tuning)?
Environment
- F5-TTS latest commit
- GPU: A100 x 4
Steps to Reproduce
- Train F5-TTS with:
  - Dataset 1: Mixed Thai, English, Chinese
  - Dataset 2: Thai only
- Continue training past 700k steps
- Generate speech with a standard prompt
- Observe increased silence or entirely silent outputs beyond ~750k–800k steps
- Note: I have checked FFMPEG; it is working fine.
✔️ Expected Behavior
The model should continue producing natural speech, without abnormal silence, as long as training progresses normally.
❌ Actual Behavior
- Outputs degrade into silence permanently after a certain number of training steps.
- However, at certain training step counts close to the point where the degradation into permanent silence sets in,