This is a PyTorch implementation of Microsoft's FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

This project is based on ming024's implementation. Any suggestions for improvement are appreciated.
Now supporting about 900 speakers in 🔥 LibriTTS for multi-speaker text-to-speech.

This project supports 4 datasets, including both multi-speaker and single-speaker datasets:
- LibriTTS
- VCTK
- LJSpeech
- Blizzard2013
After downloading a dataset, extract the compressed files. You have to modify `hp.data_path` and some other parameters in `hparams.py`; the default parameters are for the LibriTTS dataset.
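For example, a minimal sketch of the relevant settings in `hparams.py` might look like this (only `hp.data_path` is named in this README; the other variable names and all values are hypothetical):

```python
# hparams.py (sketch) -- only data_path is documented in this README;
# the dataset switch, the MFA path, and all values here are
# hypothetical examples. Check hparams.py for the real parameter list.
dataset = "LibriTTS"                           # hypothetical: which of the 4 datasets to use
data_path = "/path/to/LibriTTS"                # hp.data_path: root of the extracted corpus
mfa_path = "/path/to/montreal-forced-aligner"  # hypothetical name for the MFA path set in the preprocessing section
```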
- Download the pretrained model.
- Put `checkpoint_600000.pth.tar` in `./states/ckpt/`.
- Run `python synthesize.py` (a quick checkpoint sanity check is sketched below).
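Before synthesizing, you can verify that the checkpoint file loads cleanly; a minimal sketch using `torch.load` (the checkpoint's internal keys are repo-specific, so this only inspects them):

```python
import torch

# Load the pretrained checkpoint on CPU and list its top-level keys.
# What those keys hold (model weights, optimizer state, step count, ...)
# depends on how this repo saved the checkpoint.
ckpt = torch.load("./states/ckpt/checkpoint_600000.pth.tar", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))
```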
Preprocessing contains 3 stages:
- Preparing Alignment Data
- Montreal Forced Alignment (MFA)
- Creating Training Dataset
For stage 2, Montreal Forced Alignment (MFA), please refer to Montreal-Forced-Aligner.
Download and extract the tar.gz file, then specify the path to MFA in `hparams.py`. Then run:

```
python preprocess.py --prepare_align --mfa --create_dataset
```
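Preprocessing can take a while on the larger corpora. A quick sanity check that it finished is to look for the `stat.txt` file described below (`./preprocessed` here is a stand-in for your actual `hp.preprocessed_path`):

```python
import os

preprocessed_path = "./preprocessed"  # stand-in for hp.preprocessed_path
stat_file = os.path.join(preprocessed_path, "stat.txt")
print("preprocessing finished" if os.path.exists(stat_file) else "stat.txt not found yet")
```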
After preprocessing, you will get a `stat.txt` file in your `hp.preprocessed_path/`, recording the maximum and minimum values of fundamental frequency (f0) and energy across the entire corpus. You have to modify the f0 and energy parameters in `data/dataset.yaml` according to the content of `stat.txt`.
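Since the exact layout of `stat.txt` is repo-specific, the safest route is to print it and copy the values into `data/dataset.yaml` by hand; a minimal sketch (assuming PyYAML is installed for inspecting the YAML side):

```python
from pathlib import Path
import yaml  # PyYAML, assumed available

# Print the corpus statistics so the f0/energy min/max can be copied
# into data/dataset.yaml by hand; stat.txt's layout is repo-specific.
print(Path("./preprocessed/stat.txt").read_text())  # stand-in for hp.preprocessed_path

# Show what data/dataset.yaml currently contains; the f0/energy key
# names inside it are defined by this repo, not by this sketch.
print(yaml.safe_load(Path("data/dataset.yaml").read_text()))
```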
Train your model with:

```
python train.py
```

The training output, including log messages, checkpoints, and synthesized audio samples, will be put in `./states`.
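Checkpoints accumulate under `./states` as training runs; a small helper for locating the most recent one (the exact subdirectory layout is an assumption) could look like:

```python
import glob
import os

# Find the newest *.pth.tar checkpoint anywhere under ./states.
# The directory layout is an assumption; adjust the pattern if needed.
ckpts = glob.glob("./states/**/*.pth.tar", recursive=True)
print(max(ckpts, key=os.path.getmtime) if ckpts else "no checkpoints yet")
```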
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.
- FastSpeech: Fast, Robust and Controllable Text to Speech, Y. Ren, et al.
- xcmyz's FastSpeech implementation
- rishikksh20's FastSpeech2 implementation
- TensorSpeech's FastSpeech2 implementation
- NVIDIA's WaveGlow implementation
- seungwonpark's MelGAN implementation