Long Silence in Generated Speech After ~800k Steps

### Checks

- [x] This template is only for bug reports, usage problems go with 'Help Wanted'.
- [x] I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
- [x] I have searched for existing issues, including closed ones, and couldn't find a solution.
- [x] I am using English to submit this issue to facilitate community communication.

### Environment Details

After a certain number of training steps, the F5-TTS model begins generating speech that contains long stretches of silence. This occurs with both multilingual and monolingual datasets:
* Dataset 1: Thai, English, Chinese → issue starts around 800k steps
* Dataset 2: Thai only → issue starts around 750k steps

After reaching these steps, the generated output often becomes mostly or entirely silent despite no apparent changes to the training process or hyperparameters.

**Detailed Observation**
* Before reaching the threshold step count (~750k–800k), there are occasional silent outputs.
* As training progresses closer to this point, the frequency of silent generations increases.
* Eventually, the model almost always outputs silence, even for previously working prompts.
* This degradation appears to be gradual, not sudden.
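
One way to make the trend above measurable is to compute a per-clip silence ratio for outputs sampled at each checkpoint and watch how it changes with step count. The following is a minimal sketch, not part of F5-TTS: it assumes mono float samples in [-1, 1], and the function name `silence_ratio`, the 24 kHz default sample rate, the 20 ms frame size, and the -50 dBFS threshold are all illustrative assumptions that would need tuning for real data.

```python
import math
from typing import Sequence


def silence_ratio(
    samples: Sequence[float],
    sample_rate: int = 24000,      # assumed default, adjust to your audio
    frame_ms: float = 20.0,        # illustrative frame size
    threshold_db: float = -50.0,   # illustrative silence threshold in dBFS
) -> float:
    """Return the fraction of frames whose RMS level is below threshold_db."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return 1.0  # treat empty or too-short audio as silent

    silent = 0
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
        if level_db < threshold_db:
            silent += 1
    return silent / n_frames


if __name__ == "__main__":
    sr = 24000
    # One second of a 220 Hz tone followed by one second of digital silence.
    tone = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
    print(silence_ratio(tone + [0.0] * sr, sr))  # 0.5
```

Logging this value for a fixed set of prompts at each saved checkpoint would show whether the silence grows steadily or jumps at a particular step.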

**Questions**
* Have you observed similar behavior in your experiments?
* Could this be related to overfitting, stability in the diffusion process, or dataset imbalance?
* Are there recommended mitigation strategies (e.g., dynamic loss balancing, early stopping, discriminator tuning)?

**Environment**
* F5-TTS latest commit
* GPU: A100 x 4

### Steps to Reproduce

1. Train F5-TTS with:
   * Dataset 1: Mixed Thai, English, Chinese
   * Dataset 2: Thai only
2. Continue training past 700k steps
3. Generate speech with a standard prompt
4. Observe increased silence or entirely silent outputs beyond ~750k–800k steps
5. **Note**: I have checked FFmpeg; it is working correctly.


### ✔️ Expected Behavior

The model should continue producing natural speech without abnormal silence as long as training progresses normally.


### ❌ Actual Behavior

* Outputs degrade into permanent silence after a certain number of training steps.
* However, at certain training step counts close to the point where the degradation becomes permanent,


Long Silence in Generated Speech After ~800k Steps #1101