#291: Extract Text From Audio Files With Vosk

With sounddevice and PyAudio we found two ways to record audio. With this knowledge we can go one step further and extract the text from those recordings. For this task we skip the online services and go directly to two solutions that run on our local machine. In this post we use Vosk to transcribe our audio file; next week we find out how well Whisper solves this task.

What is Vosk?

Vosk is an offline, open-source speech recognition toolkit based on Kaldi. Unlike many cloud-based services, Vosk runs entirely on our machine, which keeps our audio private, and it needs no internet connection once we have downloaded the package and the model. It supports more than 20 languages and dialects, offers low-latency recognition, and works on multiple platforms including Windows, Linux, macOS, and even Raspberry Pi.

Installation

We can install Vosk with this command:

uv pip install vosk

Download the model

To use Vosk, we now need to download a model. For that we go to https://alphacephei.com/vosk/models and select a model that matches the language we want to transcribe. I opted for the vosk-model-small-en-us-0.15 model.

Download the model and extract the *.zip file. We can copy the whole folder next to our script or put it in a folder we can share between projects. If you opt for the latter option, you must change the path in the script from the relative one I use to an absolute one that points to the model folder.
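As a minimal sketch of the two placement options, the helper below looks for the model folder first in a shared location and then next to the script. The function name find_model_dir and the example folder ~/vosk-models are assumptions made for illustration; they are not part of Vosk.

```python
from pathlib import Path

MODEL_NAME = "vosk-model-small-en-us-0.15"


def find_model_dir(shared_dir=None, script_dir="."):
    """Return the absolute path of the first model folder that exists.

    shared_dir is a hypothetical folder shared between projects,
    script_dir is the folder that contains the script.
    """
    candidates = []
    if shared_dir is not None:
        candidates.append(Path(shared_dir).expanduser() / MODEL_NAME)
    candidates.append(Path(script_dir) / MODEL_NAME)

    for candidate in candidates:
        if candidate.is_dir():
            return candidate.resolve()

    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"Model folder not found, searched: {searched}")
```

With such a helper, the model could be loaded with Model(str(find_model_dir("~/vosk-models"))) instead of a hard-coded relative path.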

Create the demo file

To extract text from an audio file, we first need an audio file. I went back to the gTTS solution and modified the English part of the script to turn this text into an audio file:

Python Friday is a weekly blog about the Python programming language. It covers a wide range of topics from web development to data visualization and artificial intelligence.

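I saved the generated output from Google Translate to the audio_demo.mp3 file. The script in the next section reads a WAV file, so the MP3 must first be converted to a mono, 16-bit PCM WAV file named audio_demo.wav.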
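The transcription script opens audio_demo.wav and insists on mono, 16-bit PCM audio. One possible way to get there from the MP3 is ffmpeg, called from Python. This is a sketch under assumptions: it requires ffmpeg to be installed and on the PATH, the helper names are made up for this example, and the 16 kHz sample rate is my choice, not something the script enforces.

```python
import shutil
import subprocess


def build_ffmpeg_command(source, target, sample_rate=16000):
    """Build the ffmpeg call that writes a mono, 16-bit PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the target if it exists
        "-i", source,             # input file, for example the MP3
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # sample rate in Hz (assumed value)
        "-c:a", "pcm_s16le",      # uncompressed 16-bit PCM
        target,
    ]


def convert_to_wav(source, target, sample_rate=16000):
    """Run ffmpeg and raise an error if it is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on the PATH")
    subprocess.run(build_ffmpeg_command(source, target, sample_rate), check=True)
```

A call like convert_to_wav("audio_demo.mp3", "audio_demo.wav") would then produce the file the script expects. Any other tool that writes mono, 16-bit PCM WAV files works as well.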

Transcribe the audio file

To extract the text from our audio file, we can use this script. Make sure that the path to the model matches the location where you put it.

from vosk import Model, KaldiRecognizer
import wave
import json

wf = wave.open("audio_demo.wav", "rb")
# The script only accepts uncompressed (PCM) mono audio with 16-bit samples
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    raise ValueError("Audio file must be WAV format: PCM, mono, 16-bit.")

model = Model("vosk-model-small-en-us-0.15")  # Make sure this folder exists
rec = KaldiRecognizer(model, wf.getframerate())

result = ""
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        res = json.loads(rec.Result())
        result += res.get("text", "") + " "

# Fetch the text that is still buffered in the recognizer
res = json.loads(rec.FinalResult())
result += res.get("text", "")
wf.close()

print("Recognized text:\n" + result.strip())

When we run the script and skip over the log output, we should get this output:

python friday is a weekly blog about the python programming language it covers a wide range of topics from web development to data visualization and artificial intelligence

There are no capital letters or punctuation, but otherwise the detected text matches the content of the audio file.
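If the script stops with the ValueError instead of printing text, it helps to look at the format of the WAV file in isolation. The sketch below uses only the wave module from the standard library and mirrors the condition from the script. The function names are made up for this example, and write_silence only exists to create a test file without a recording.

```python
import wave


def describe_wav(path):
    """Read the properties that the transcription script checks."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width": wf.getsampwidth(),  # in bytes, 2 means 16-bit
            "frame_rate": wf.getframerate(),
            "compression": wf.getcomptype(),
        }


def passes_format_check(info):
    """Same condition as in the script: mono, 16-bit, uncompressed."""
    return (
        info["channels"] == 1
        and info["sample_width"] == 2
        and info["compression"] == "NONE"
    )


def write_silence(path, channels=1, seconds=1, frame_rate=16000):
    """Write a silent 16-bit PCM WAV file to experiment with."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(frame_rate)
        wf.writeframes(b"\x00\x00" * channels * frame_rate * seconds)
```

Calling passes_format_check(describe_wav("audio_demo.wav")) shows whether the file would get past the check, and the dictionary from describe_wav tells us which property is off.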