#286: Advanced Text-to-Speech With Coqui TTS
Coqui TTS is a library for advanced Text-to-Speech generation. Not only does it run on our local machine (like pyttsx3 that we covered in last week's post), but it allows us to use different models and even train our own. Let us see how we can work with Coqui TTS.
Coqui AI is dead, use the right fork
In early 2024, coqui.ai shut down and left the Git repository as-is. Unfortunately, the repository does not carry even a small hint that the company has ended, which leads to a range of problems. The most notable one is that the Python package does not install.
Luckily for us, with idiap/coqui-ai-TTS there is a maintained fork that published a new package. That way we can keep using this interesting tool for text-to-speech. Just make sure that you use this repository and not the original one.
Requires Python 3.12
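Unfortunately, Coqui TTS does not work with the current Python version. We need to build our virtual environment with Python 3.12, for example with py -3.12 -m venv .venv on Windows or python3.12 -m venv .venv on Linux and macOS (the folder name .venv is only an example).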
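If we want our scripts to tell us early that they run on the wrong interpreter, we can add a small guard. This is only a minimal sketch: it assumes that 3.12 is the newest version that works (as stated above), and the name is_supported_python as well as the missing lower bound are my own choices, not something Coqui TTS provides.

```python
import sys

# Newest Python version that works with Coqui TTS according to this post.
MAX_SUPPORTED = (3, 12)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is not newer than MAX_SUPPORTED."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) <= MAX_SUPPORTED


if not is_supported_python():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is newer than 3.12, Coqui TTS may not install."
    )
```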
Installation
We can install the right package for Coqui TTS, which the fork publishes on PyPI under the name coqui-tts, with this command:
pip install coqui-tts
This not only installs the Python module, but also a command line tool named tts.
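To verify that both parts ended up in our virtual environment, we can look for the module (it is imported as TTS) and for the tts command on the PATH. The helper name check_installation is hypothetical; the sketch only uses the standard library and does not load Coqui TTS itself.

```python
import importlib.util
import shutil


def check_installation():
    """Report whether the TTS module and the tts command line tool can be found."""
    return {
        "module": importlib.util.find_spec("TTS") is not None,
        "cli": shutil.which("tts") is not None,
    }


for part, found in check_installation().items():
    print(f"{part}: {'found' if found else 'missing'}")
```

If one of the two is reported as missing, the most likely cause is that the script runs outside of the virtual environment in which we installed the package.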
Find the right model
One of the great benefits of Coqui TTS is the large set of models that we can use with it. To get the list of available models, we can run this command in the terminal of our virtual environment:
tts --list_models
This gives us a long list like this one:
Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2
2: tts_models/multilingual/multi-dataset/xtts_v1.1
3: tts_models/multilingual/multi-dataset/your_tts
…
71: tts_models/be/common-voice/glow-tts
Name format: type/language/dataset/model
1: vocoder_models/universal/libri-tts/wavegrad
2: vocoder_models/universal/libri-tts/fullband-melgan
3: vocoder_models/en/ek1/wavegrad
…
19: vocoder_models/be/common-voice/hifigan
Name format: type/language/dataset/model
1: voice_conversion_models/multilingual/vctk/freevc24
2: voice_conversion_models/multilingual/multi-dataset/knnvc
3: voice_conversion_models/multilingual/multi-dataset/openvoice_v1
4: voice_conversion_models/multilingual/multi-dataset/openvoice_v2
Path to downloaded models: C:\Users\jg\AppData\Local\tts
The list shows us that there are 3 main types of models:
- tts_models are the ones we select to turn text into speech.
- vocoder_models turn the spectrogram produced by a TTS model into the final audible output.
- voice_conversion_models convert the voice of one speaker into the voice of another one.
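Because every name follows the format type/language/dataset/model, we can split it into its parts and filter a long list by type or language. The following is a minimal sketch; ModelName, parse_model_name and filter_models are hypothetical helpers of mine, and the names in the example are copied from the list above.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """The four parts of a Coqui TTS model name."""
    type: str
    language: str
    dataset: str
    model: str


def parse_model_name(name):
    """Split 'type/language/dataset/model' into a ModelName."""
    parts = name.strip().split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"Unexpected model name format: {name!r}")
    return ModelName(*parts)


def filter_models(names, model_type=None, language=None):
    """Keep only the names that match the given type and/or language."""
    result = []
    for name in names:
        parsed = parse_model_name(name)
        if model_type is not None and parsed.type != model_type:
            continue
        if language is not None and parsed.language != language:
            continue
        result.append(name)
    return result


names = [
    "tts_models/multilingual/multi-dataset/xtts_v2",
    "tts_models/be/common-voice/glow-tts",
    "vocoder_models/be/common-voice/hifigan",
    "voice_conversion_models/multilingual/vctk/freevc24",
]
print(filter_models(names, language="be"))
```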
Once we have found the name of a model we are interested in, we can use this command to get some details:
tts --model_info_by_name tts_models/be/common-voice/glow-tts
Model type: tts_models
Language supported: be
Dataset used: common-voice
Model name: glow-tts
Description: Belarusian GlowTTS model created by @alex73 (Github).
Default vocoder: vocoder_models/be/common-voice/hifigan
The first time we use a model, Coqui TTS downloads it for us and stores it in this location:
| OS | Path |
|---|---|
| Linux | ~/.local/share/tts |
| Mac | ~/Library/Application Support/tts |
| Windows | C:\Users\<user>\AppData\Local\tts |
Attention: These models are big, and each can take more than 2 GB of disk space. Do not forget to delete models when you no longer want to use them.
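To keep an eye on the disk space, we can add up the size of each downloaded model. This sketch only mirrors the paths from the table above and assumes that every model sits in its own sub-folder of that directory; default_model_dir and model_sizes are hypothetical helpers, not part of Coqui TTS.

```python
import sys
from pathlib import Path


def default_model_dir(platform=None, home=None):
    """Return the download folder from the table above for the given platform."""
    platform = platform or sys.platform
    home = Path(home) if home is not None else Path.home()
    if platform.startswith("win"):
        return home / "AppData" / "Local" / "tts"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "tts"
    return home / ".local" / "share" / "tts"


def model_sizes(model_dir):
    """Map each sub-folder of model_dir to the total size of its files in bytes."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return {}
    sizes = {}
    for entry in sorted(model_dir.iterdir()):
        if entry.is_dir():
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return sizes
```

A call like model_sizes(default_model_dir()) gives us a dictionary with one entry per downloaded model, which makes it easy to spot the folders that are worth deleting.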
Turn text into speech
After we have picked a model, we can turn text into speech. We need to initialise our TTS class with the model and the device (CPU, or CUDA to use the GPU) before we can create the audio file with the tts_to_file method. To play the audio file, we once more use playsound3:
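The steps above can be sketched like this. The model tts_models/en/ljspeech/tacotron2-DDC, the file name output.wav and the helper names pick_device and speak are assumptions for this example; any other name from the tts_models list should work the same way, although multilingual models such as xtts_v2 need additional arguments for the language and the speaker.

```python
def pick_device():
    """Return 'cuda' if PyTorch can use a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def speak(text,
          model_name="tts_models/en/ljspeech/tacotron2-DDC",  # assumed example model
          file_path="output.wav"):
    """Turn text into an audio file with Coqui TTS and play it."""
    # Imported here so that the helper above also works without these packages.
    from TTS.api import TTS
    from playsound3 import playsound

    # Load the model (downloaded on first use) and move it to the device.
    tts = TTS(model_name).to(pick_device())

    # Write the spoken text to a WAV file and play it.
    tts.tts_to_file(text=text, file_path=file_path)
    playsound(file_path)
    return file_path
```

A call like speak("Hello from Coqui TTS!") creates output.wav in the current folder. Expect the first call to take a while, because the model has to be downloaded first.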
More to explore
After you have taken your first steps with Coqui TTS, I suggest you go to the documentation and explore the many more advanced tasks you can do with it. Coqui TTS allows us to convert the voice of one speaker into the voice of another one, which offers some interesting use cases.
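As a starting point for voice conversion, here is a minimal sketch that uses the freevc24 model from the list above. The wrapper convert_voice and the file name converted.wav are hypothetical, and I assume that the voice_conversion_to_file method still accepts these arguments in the fork; please check the documentation before you rely on it.

```python
from pathlib import Path


def convert_voice(source_wav, target_wav, file_path="converted.wav",
                  model_name="voice_conversion_models/multilingual/vctk/freevc24"):
    """Re-speak source_wav with the voice heard in target_wav."""
    # Fail early with a clear message before the model is loaded.
    for wav in (source_wav, target_wav):
        if not Path(wav).is_file():
            raise FileNotFoundError(f"Audio file not found: {wav}")

    # Imported late so that the check above works without Coqui TTS installed.
    from TTS.api import TTS

    vc = TTS(model_name)
    vc.voice_conversion_to_file(
        source_wav=str(source_wav),
        target_wav=str(target_wav),
        file_path=str(file_path),
    )
    return file_path
```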
If you want to train your own model or fine-tune an existing one, Coqui TTS has you covered. While these tasks are complex and require extensive training data, the necessary steps are thoroughly documented.
Next
Coqui TTS offers us a lot of flexibility and many advanced features. If we want to train our own model, we can do so by following the extensive documentation. But even if we are only interested in turning text into speech, Coqui TTS gives us a wide range of customisations.
The only downside I have found so far is that generating the audio takes its time. Next week we will try to improve the speed by activating the GPU support for PyTorch.