#291: Extract Text From Audio Files With Vosk
With sounddevice and PyAudio we found two solutions to record audio files. With this new knowledge we can go one step further and find an option to extract the text from those recordings. For this task we ignore the online services and jump directly to two solutions that run on our local machine. In this post we use Vosk to transcribe our audio file, while next week we find out how well Whisper solves this task.
What is Vosk?
Vosk is an offline open-source speech recognition toolkit based on Kaldi. Unlike many cloud-based services, Vosk works entirely offline, which protects our privacy, and it needs no internet connection once we have downloaded the package and the model. It supports more than 20 languages and dialects, offers low-latency recognition, and works on multiple platforms including Windows, Linux, macOS, and even Raspberry Pi.
Installation
We can install Vosk with this command:
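Vosk is published on PyPI under the name vosk, so `pip install vosk` is the usual way to get it. As a small sketch, we can check from Python whether the package is available in the current environment. The helper name `is_installed` is hypothetical and not part of Vosk:

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("vosk"):
    print("vosk is missing, install it with: pip install vosk")
```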
Download the model
To use Vosk, we now need to download a model. For that we go to https://alphacephei.com/vosk/models and select a model that matches the language we want to transcribe. I opted for the vosk-model-small-en-us-0.15 model.
Download the model and extract the *.zip file. We can copy the whole folder next to our script or put it in a folder we can share between projects. If you opt for the latter option, you must change the path in the script from the relative one I use to an absolute one that points to the model folder.
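To make both placements work without editing the path by hand, a lookup along these lines may help. This is only a sketch: `find_model_dir` and the `shared_dir` parameter are hypothetical names, and the relative lookup assumes that we start the script from the folder that contains the model.

```python
from pathlib import Path

MODEL_NAME = "vosk-model-small-en-us-0.15"


def find_model_dir(model_name=MODEL_NAME, shared_dir=None):
    """Return the absolute path of the model folder.

    The current working directory is checked first, then the optional
    shared folder (a hypothetical location for models used by many projects).
    """
    candidates = [Path(model_name)]
    if shared_dir is not None:
        candidates.append(Path(shared_dir) / model_name)

    for candidate in candidates:
        if candidate.is_dir():
            return candidate.resolve()

    raise FileNotFoundError(f"Model folder {model_name!r} not found")
```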
Create the demo file
To extract text from an audio file, we first need an audio file. I went back to the gTTS solution and modified the English part of the script to turn this text into an audio file:
Python Friday is a weekly blog about the Python programming language. It covers a wide range of topics from web development to data visualization and artificial intelligence.
I also saved the generated output from Google Translate to the audio_demo.mp3 file.
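A minimal sketch of how such a file can be created with gTTS is shown here. It is not the script from the earlier post; the function name `create_demo_audio` is hypothetical, and the call needs the gTTS package and an internet connection because gTTS talks to Google Translate.

```python
DEMO_TEXT = (
    "Python Friday is a weekly blog about the Python programming language. "
    "It covers a wide range of topics from web development to data "
    "visualization and artificial intelligence."
)


def create_demo_audio(text=DEMO_TEXT, target="audio_demo.mp3"):
    """Turn the text into spoken English and save it as an MP3 file."""
    # Imported here so the module loads even when gTTS is not installed.
    from gtts import gTTS

    speech = gTTS(text=text, lang="en")
    speech.save(target)
    return target
```

Calling `create_demo_audio()` once should leave the audio_demo.mp3 file in the current folder.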
Transcribe the audio file
To extract the text of our audio file, we can use this script. Make sure that the path to the model matches the location where you stored it.
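One possible version of such a script is sketched below. It rests on a few assumptions that are not confirmed by this post: ffmpeg is installed and on the PATH to decode the MP3 into raw 16 kHz mono audio, the model folder sits in the current working directory, and the helper names `join_results` and `transcribe` are made up for this example. The Vosk calls used (`Model`, `KaldiRecognizer`, `AcceptWaveform`, `Result`, `FinalResult`) follow the examples that ship with Vosk.

```python
import json
import subprocess

SAMPLE_RATE = 16000  # assumed sample rate; ffmpeg resamples the audio to it


def join_results(json_chunks):
    """Combine the "text" fields of Vosk's JSON result strings."""
    texts = [json.loads(chunk).get("text", "") for chunk in json_chunks]
    return " ".join(text for text in texts if text)


def transcribe(audio_path, model_path):
    """Transcribe an audio file with Vosk and return the detected text."""
    # Imported here so the helper above works without Vosk installed.
    from vosk import KaldiRecognizer, Model

    model = Model(model_path)
    recognizer = KaldiRecognizer(model, SAMPLE_RATE)

    # ffmpeg writes 16-bit mono PCM samples to stdout.
    command = [
        "ffmpeg", "-loglevel", "quiet", "-i", str(audio_path),
        "-ar", str(SAMPLE_RATE), "-ac", "1", "-f", "s16le", "-",
    ]
    chunks = []
    with subprocess.Popen(command, stdout=subprocess.PIPE) as process:
        while True:
            data = process.stdout.read(4000)
            if len(data) == 0:
                break
            # AcceptWaveform signals when an utterance is complete.
            if recognizer.AcceptWaveform(data):
                chunks.append(recognizer.Result())
        # Fetch whatever is left in the recognizer's buffer.
        chunks.append(recognizer.FinalResult())

    return join_results(chunks)
```

With the model and the demo file in place, `print(transcribe("audio_demo.mp3", "vosk-model-small-en-us-0.15"))` would start the transcription.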
When we run the script and skip over the log output, we should get this output:
python friday is a weekly blog about the python programming language it covers a wide range of topics from web development to data visualization and artificial intelligence
There are no capital letters or punctuation, but otherwise the detected text matches the content of the audio file.
Next
This first attempt with the vosk-model-small-en-us-0.15 model worked well with my audio file. But be aware that different topics may yield totally different results. Therefore, prepare yourself for a bit of experimentation to find the model that fits your recordings best.
Next week we explore the Whisper library from OpenAI and see how that works with our audio file.