
#286: Advanced Text-to-Speech With Coqui TTS

Coqui TTS is a library for advanced Text-to-Speech generation. Not only does it run on our local machine (like pyttsx3, which we covered in last week's post), but it also allows us to use different models and even train our own. Let us see how we can work with Coqui TTS.

Coqui AI is dead, use the right fork

In early 2024, coqui.ai shut down and left the Git repository as-is. Unfortunately, the repository does not contain even a small hint that the company has ended, which leads to a range of problems, most notably that the Python package does not install.

Luckily for us, there is a maintained fork, idiap/coqui-ai-TTS, that has published a new package. That way we can keep using this interesting tool for text-to-speech. Just make sure that you use this repository and not the original one.

Requires Python 3.12

Unfortunately, at the time of writing Coqui TTS does not work with the most recent Python version. We need to build our virtual environment with Python 3.12 (the activation command shown here is the one for Windows):

uv venv --python 3.12
.\.venv\Scripts\activate
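
To make sure a script really runs inside the Python 3.12 environment, a small guard at the top of the script can help. This is only a sketch: the post states that Python 3.12 works but does not give the full range of supported versions, so the strict comparison and the name is_expected_python are assumptions made for illustration.

```python
import sys

# Assumption: only Python 3.12 is known to work from this post; the exact
# range of supported versions is not stated, so the check is deliberately strict.
REQUIRED = (3, 12)


def is_expected_python(version_info=sys.version_info) -> bool:
    """Return True if major.minor matches the version used in this post."""
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    if not is_expected_python():
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Warning: this post used Python 3.12, but found {found}.")
```

If the warning shows up, the virtual environment is most likely not activated or was created with a different interpreter.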

Installation

We can install the right package for Coqui TTS with this command:

uv pip install coqui-tts

This installs not only the Python module, but also a command-line tool named tts.

Find the right model

One of the great benefits of Coqui TTS is the large set of models that we can use with it. To get the list of available models, we can run this command in the terminal of our virtual environment:

tts --list_models

This gives us a long list like this one:

Name format: type/language/dataset/model
  1: tts_models/multilingual/multi-dataset/xtts_v2
  2: tts_models/multilingual/multi-dataset/xtts_v1.1
  3: tts_models/multilingual/multi-dataset/your_tts
 71: tts_models/be/common-voice/glow-tts

Name format: type/language/dataset/model
  1: vocoder_models/universal/libri-tts/wavegrad
  2: vocoder_models/universal/libri-tts/fullband-melgan
  3: vocoder_models/en/ek1/wavegrad
 19: vocoder_models/be/common-voice/hifigan

Name format: type/language/dataset/model
  1: voice_conversion_models/multilingual/vctk/freevc24
  2: voice_conversion_models/multilingual/multi-dataset/knnvc
  3: voice_conversion_models/multilingual/multi-dataset/openvoice_v1
  4: voice_conversion_models/multilingual/multi-dataset/openvoice_v2

Path to downloaded models: C:\Users\jg\AppData\Local\tts

The list shows us that there are 3 main types of models:

- tts_models are the ones we select to turn text into speech.
- vocoder_models are essential for generating the final audible output.
- voice_conversion_models convert the voice of one speaker into that of another.
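
Because every name follows the type/language/dataset/model format, the listing is easy to process in Python. The following is a minimal sketch that groups the names by their type. It works on the text output shown above (pasted in or captured from the terminal); the function name group_models_by_type and the regular expression are my own helpers and not part of Coqui TTS.

```python
import re
from collections import defaultdict

# Matches lines such as "  1: tts_models/multilingual/multi-dataset/xtts_v2"
LINE_PATTERN = re.compile(r"^\s*\d+:\s*(\S+)\s*$")


def group_models_by_type(listing: str) -> dict[str, list[str]]:
    """Group model names from the `tts --list_models` output by their type."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip headers, blank lines and the download path
        name = match.group(1)
        parts = name.split("/")
        if len(parts) != 4:
            continue  # not in type/language/dataset/model format
        groups[parts[0]].append(name)
    return dict(groups)


# A shortened copy of the listing from above
sample = """\
Name format: type/language/dataset/model
  1: tts_models/multilingual/multi-dataset/xtts_v2
 71: tts_models/be/common-voice/glow-tts

Name format: type/language/dataset/model
  1: vocoder_models/universal/libri-tts/wavegrad

Path to downloaded models: C:\\Users\\jg\\AppData\\Local\\tts
"""

for model_type, names in group_models_by_type(sample).items():
    print(model_type, len(names))
```

The same split on "/" also gives us the language code as the second part of each name, which is handy when we only care about models for one language.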

Once we have found the name of a model we are interested in, we can use this command to get some details:

tts --model_info_by_name tts_models/be/common-voice/glow-tts

Model type: tts_models
Language supported: be
Dataset used: common-voice
Model name: glow-tts
Description: Belarusian GlowTTS model created by @alex73 (Github).
Default vocoder: vocoder_models/be/common-voice/hifigan

The first time we use a model, Coqui TTS downloads it for us and stores it in this location:

OS       Path
Linux    ~/.local/share/tts
Mac      ~/Library/Application Support/tts
Windows  C:\Users\<user>\AppData\Local\tts

Attention: These models are big, and each can take more than 2GB of disk space. Do not forget to delete models when you no longer want to use them.
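
To keep an eye on the disk space, we can sum up the file sizes in the model folder. This is a rough sketch using only the standard library. It assumes the default locations from the table above and that each downloaded model sits in its own sub-folder; the helper names default_model_dir and folder_sizes are made up for this example.

```python
import sys
from pathlib import Path


def default_model_dir(platform: str = sys.platform) -> Path:
    """Return the default model folder from the table above.

    Assumption: only the default locations are covered; a custom
    location would have to be passed to folder_sizes() directly.
    """
    home = Path.home()
    if platform.startswith("win"):
        return home / "AppData" / "Local" / "tts"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "tts"
    return home / ".local" / "share" / "tts"


def folder_sizes(root: Path) -> dict[str, int]:
    """Map each sub-folder of root to the total size of its files in bytes."""
    sizes = {}
    if not root.is_dir():
        return sizes  # nothing downloaded yet (or a different location)
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return sizes


if __name__ == "__main__":
    for name, size in folder_sizes(default_model_dir()).items():
        print(f"{size / 1024**3:6.2f} GB  {name}")
```

The script only reports sizes and deletes nothing, so it is safe to run before we decide which models to remove by hand.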

Turn text into speech

After we have picked a model, we can turn text into speech. We need to initialise the TTS class with the model and move it to the device (the CPU, or CUDA to use the GPU) before we can create the audio file with the tts_to_file method. To play the audio file, we once more use playsound3:

import torch
from TTS.api import TTS
from playsound3 import playsound

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize TTS
tts = TTS("tts_models/en/jenny/jenny").to(device)

# TTS to a file
file_name = "coqui_jenny.wav"
tts.tts_to_file(
  text="Welcome to Python Friday.",
  file_path=file_name,
)

playsound(file_name)
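
If we want to convert more than one sentence, we can wrap the tts_to_file call in a small helper that writes one numbered WAV file per text. This is a sketch under a few assumptions: the helper synthesize_all, the folder name audio and the file name pattern are my own choices, and the function only relies on the tts_to_file(text=..., file_path=...) call shown above.

```python
from pathlib import Path


def synthesize_all(tts, texts, out_dir="audio", prefix="line"):
    """Write one WAV file per text and return the list of created paths.

    `tts` is any object with a tts_to_file(text=..., file_path=...) method,
    such as the TTS instance from the script above.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(texts, start=1):
        # line_001.wav, line_002.wav, ... keeps the files in reading order
        file_path = target / f"{prefix}_{number:03d}.wav"
        tts.tts_to_file(text=text, file_path=str(file_path))
        paths.append(file_path)
    return paths
```

With the tts object from the script above, a call like synthesize_all(tts, ["Welcome to Python Friday.", "See you next week."]) creates the files in the audio folder. Loading the model once and reusing it for every text avoids paying the start-up cost for each sentence.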

More to explore

After you have made your first steps with Coqui TTS, I suggest you go to the documentation and explore the many more advanced tasks you can do with it. Coqui TTS allows us to convert the voice of one speaker into the voice of another one, which offers some interesting use cases.

If you want to train your own model or fine-tune an existing one, Coqui TTS has you covered. While these tasks are complex and require extensive training data, the necessary steps are thoroughly documented.

Next

Coqui TTS offers us a lot of flexibility and many advanced features. If we want to train our own model, we can do so by following the extensive documentation. But even if we are only interested in turning text into speech, Coqui TTS gives us a wide range of customisation options.

The only downside I have found so far is that generating the audio is slow. Next week we will try to improve the speed by activating GPU support for PyTorch.