Commit 4b33699

Update docs

1 parent b6e1ac6 commit 4b33699

1 file changed: docs/source/models/xtts.md (179 additions, 3 deletions)

@@ -65,7 +65,7 @@ You can do inference using one of the available speakers using the following command:
```

##### Clone a voice
-You can clone a speaker voice with a single or multiple references:
+You can clone a speaker voice using a single or multiple references:

###### Single reference

@@ -98,7 +98,7 @@ or for all wav files in a directory you can use:
#### 🐸TTS API

##### Clone a voice
-You can clone a speaker voice with a single or multiple references:
+You can clone a speaker voice using a single or multiple references:

###### Single reference

@@ -208,4 +208,180 @@ model.cuda()
print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7, # Add custom parameters here
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

##### Streaming manually

Here the goal is to stream the audio as it is being generated. This is useful for real-time applications.
Streaming inference is typically slower than regular inference, but it lets you get the first chunk of audio sooner.

```python
import os
import time
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
t0 = time.time()
chunks = model.inference_stream(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunk: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chunks.append(chunk)
wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
```
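
For a real-time application you would typically play each chunk as it arrives instead of saving the concatenated audio at the end. Below is a minimal sketch of that idea, continuing from the example above; it assumes the optional `sounddevice` package (not a 🐸TTS dependency) and an output device that accepts 24 kHz mono audio.

```python
# Minimal sketch: play the stream as it is generated (assumes `pip install sounddevice`).
import sounddevice as sd

with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as stream:
    for chunk in model.inference_stream(
        "This sentence is played back while it is still being generated.",
        "en",
        gpt_cond_latent,
        speaker_embedding,
    ):
        # Each chunk is a torch tensor on the model's device; move it to the CPU
        # and convert it to a float32 numpy column before writing it to the stream.
        stream.write(chunk.squeeze().cpu().numpy().astype("float32").reshape(-1, 1))
```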

### Training

#### Easy training
To make `XTTS_v2` GPT encoder training easier for beginner users, we provide a Gradio demo that implements the whole fine-tuning pipeline. The Gradio demo lets the user easily perform the following steps:

- Preprocess the uploaded audio file or files with the 🐸 TTS Coqui formatter
- Train the XTTS GPT encoder on the processed data
- Run inference with the fine-tuned model

The user can run this Gradio demo locally or remotely using a Colab Notebook.

##### Run demo on Colab
To make `XTTS_v2` fine-tuning more accessible for users who do not have a good GPU available, we provide a Google Colab Notebook.

The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).

To learn how to use this Colab Notebook, please check the [XTTS fine-tuning video]().

If you are not able to access the video, follow these steps:

1. Open the Colab notebook and start the demo by running the first two cells (ignore the pip install errors in the first one).
2. Click the "Running on public URL:" link in the output of the second cell.
3. On the first tab (1 - Data processing), select the audio file or files, wait for the upload to finish, click the "Step 1 - Create dataset" button, and wait until the dataset processing is done.
4. As soon as the dataset processing is done, go to the second tab (2 - Fine-tuning XTTS Encoder), press the "Step 2 - Run the training" button, and wait until the training is finished. Note that it can take up to 40 minutes.
5. As soon as the training is done, go to the third tab (3 - Inference), click the "Step 3 - Load Fine-tuned XTTS model" button, and wait until the fine-tuned model is loaded. Then you can run inference with the model by clicking the "Step 4 - Inference" button.

##### Run demo locally

To run the demo locally, follow these steps:
1. Install 🐸 TTS following the instructions available [here](https://tts.readthedocs.io/en/dev/installation.html#installation).
2. Install the Gradio demo requirements with the command `python3 -m pip install -r TTS/demos/xtts_ft_demo/requirements.txt`.
3. Run the Gradio demo using the command `python3 TTS/demos/xtts_ft_demo/xtts_demo.py`.
4. Follow the steps presented in the [tutorial video](https://www.youtube.com/watch?v=8tpDiiouGxc&feature=youtu.be) to fine-tune and test the fine-tuned model.

If you are not able to access the video, here is what you need to do:

1. On the first tab (1 - Data processing), select the audio file or files and wait for the upload to finish.
2. Click the "Step 1 - Create dataset" button and wait until the dataset processing is done.
3. Go to the second tab (2 - Fine-tuning XTTS Encoder), press the "Step 2 - Run the training" button, and wait until the training is finished. It will take some time.
4. Go to the third tab (3 - Inference), click the "Step 3 - Load Fine-tuned XTTS model" button, and wait until the fine-tuned model is loaded.
5. Now you can run inference with the model by clicking the "Step 4 - Inference" button.

#### Advanced training

A recipe for `XTTS_v2` GPT encoder training using the `LJSpeech` dataset is available at https://github.com/coqui-ai/TTS/tree/dev/recipes/ljspeech/xtts_v1/train_gpt_xtts.py

You need to change the fields of `BaseDatasetConfig` to match your dataset and then update the `GPTArgs` and `GPTTrainerConfig` fields as needed. By default, the recipe uses the same parameters that the XTTS v1.1 model was trained with. To speed up model convergence, it also downloads the XTTS v1.1 checkpoint and loads it by default.
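
As an illustration, pointing the recipe's dataset configuration at your own data might look like the sketch below. The field values are placeholders (assumptions, not defaults from the recipe), so check `train_gpt_xtts.py` for the exact configuration it builds.

```python
# Minimal sketch: a dataset config for your own data.
# All values below are placeholders, not defaults taken from the recipe.
from TTS.config.shared_configs import BaseDatasetConfig

config_dataset = BaseDatasetConfig(
    formatter="ljspeech",            # formatter matching your metadata layout
    dataset_name="my_dataset",       # identifier for your dataset
    path="/path/to/my_dataset/",     # root folder that contains the wav files
    meta_file_train="metadata.csv",  # transcript file read by the formatter
    language="en",
)
DATASETS_CONFIG_LIST = [config_dataset]  # list of datasets used for training
```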

After training you can do inference following the code below.

```python
import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Add here the xtts_config path
CONFIG_PATH = "recipes/ljspeech/xtts_v1/run/training/GPT_XTTS_LJSpeech_FT-October-23-2023_10+36AM-653f2e75/config.json"
# Add here the vocab file that you have used to train the model
TOKENIZER_PATH = "recipes/ljspeech/xtts_v1/run/training/XTTS_v2_original_model_files/vocab.json"
# Add here the checkpoint that you want to do inference with
XTTS_CHECKPOINT = "recipes/ljspeech/xtts_v1/run/training/GPT_XTTS_LJSpeech_FT/best_model.pth"
# Add here the speaker reference
SPEAKER_REFERENCE = "LjSpeech_reference.wav"

# output wav path
OUTPUT_WAV_PATH = "xtts-ft.wav"

print("Loading model...")
config = XttsConfig()
config.load_json(CONFIG_PATH)
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[SPEAKER_REFERENCE])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7, # Add custom parameters here
)
torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
```

## References and Acknowledgements
- VallE: https://arxiv.org/abs/2301.02111
- Tortoise Repo: https://github.com/neonbjb/tortoise-tts
- Faster implementation: https://github.com/152334H/tortoise-tts-fast
- Univnet: https://arxiv.org/abs/2106.07889
- Latent Diffusion: https://arxiv.org/abs/2112.10752
- DALL-E: https://arxiv.org/abs/2102.12092
- Perceiver: https://arxiv.org/abs/2103.03206

## XttsConfig
```{eval-rst}
.. autoclass:: TTS.tts.configs.xtts_config.XttsConfig
    :members:
```

## XttsArgs
```{eval-rst}
.. autoclass:: TTS.tts.models.xtts.XttsArgs
    :members:
```

## XTTS Model
```{eval-rst}
.. autoclass:: TTS.tts.models.xtts.Xtts
    :members:
```