Offline Speech Recognition with WhisperX

joe_rigby · June 22, 2024

Hello, new to the forum.

Just wanted to put this out into the ether to potentially help someone else who was in my situation.

Some of this might not be 100% correct, but this is how I was able to get it to work for me.

Some backstory - I had some long recordings with relatively sensitive information that I needed to transcribe, about five hours with more expected in the near future.

If you've ever transcribed long recordings before, you'll know that it's the best fun that money can buy.

I wanted some way to transcribe them using local software (nothing uploaded to the cloud), but I didn't know a whole lot about it.

All I had was ChatGPT 4o and Google and a star and a wish.

ChatGPT 4o recommended VOSK.

I am a relative Python novice. When I began installing VOSK in a virtual environment, as was recommended, I ran into a lot of issues. I solved them by simply installing a separate copy of Python 3.x, with VOSK and all the dependencies it needed, to an alternate install location, and then calling that copy each time .. sort of a pseudo environment haha.

Once VOSK was working, I realized the output wasn't so great. Then I looked at DeepSpeech.

I had trouble getting DeepSpeech to work, so I turned to Whisper. Whisper did OK, but then I saw WhisperX.

WhisperX was a lot more resource intensive. It ran for approximately four hours before I discovered that you can run it on your GPU. Since I have a 4090, it was a no-brainer!

On a 4090, it can process 3 hours of audio in about 12 minutes.

I had to use Python 3.9.9 to get everything to work together and play nice, but even now it complains haha.

What follows is all the packages installed for Python 3.9.9 specifically, and their versions:

Package                 Version
aiohttp                 3.9.5
aiosignal               1.3.1
alembic                 1.13.1
antlr4-python3-runtime  4.9.3
asteroid-filterbanks    0.4.0
async-timeout           4.0.3
attrs                   23.2.0
audioread               3.0.1
av                      11.0.0
certifi                 2024.6.2
cffi                    1.16.0
charset-normalizer      3.3.2
click                   8.1.7
colorama                0.4.6
coloredlogs             15.0.1
colorlog                6.8.2
contourpy               1.2.1
ctranslate2             4.3.1
cycler                  0.12.1
decorator               5.1.1
docopt                  0.6.2
einops                  0.8.0
faster-whisper          1.0.0
filelock                3.15.3
flatbuffers             24.3.25
fonttools               4.53.0
frozenlist              1.4.1
fsspec                  2024.6.0
greenlet                3.0.3
huggingface-hub         0.23.4
humanfriendly           10.0
HyperPyYAML             1.2.2
idna                    3.7
importlib_resources     6.4.0
intel-openmp            2021.4.0
Jinja2                  3.1.4
joblib                  1.4.2
julius                  0.2.7
kiwisolver              1.4.5
lazy_loader             0.4
librosa                 0.10.2.post1
lightning               2.3.0
lightning-utilities     0.11.2
llvmlite                0.43.0
Mako                    1.3.5
markdown-it-py          3.0.0
MarkupSafe              2.1.5
matplotlib              3.9.0
mdurl                   0.1.2
mkl                     2021.4.0
more-itertools          10.3.0
mpmath                  1.3.0
msgpack                 1.0.8
multidict               6.0.5
networkx                3.2.1
nltk                    3.8.1
numba                   0.60.0
numpy                   1.25.0
omegaconf               2.3.0
onnxruntime             1.18.0
openai-whisper          20231117
optuna                  3.6.1
packaging               24.1
pandas                  2.2.2
pillow                  10.3.0
pip                     21.2.4
platformdirs            4.2.2
pooch                   1.8.2
primePy                 1.3
protobuf                5.27.1
pyannote.audio          3.1.1
pyannote.core           5.0.0
pyannote.database       5.1.0
pyannote.metrics        3.2.1
pyannote.pipeline       3.0.1
pycparser               2.22
pydub                   0.25.1
Pygments                2.18.0
pyparsing               3.1.2
pyreadline3             3.4.1
python-dateutil         2.9.0.post0
pytorch-lightning       2.3.0
pytorch-metric-learning 2.5.0
pytz                    2024.1
PyYAML                  6.0.1
regex                   2024.5.15
requests                2.32.3
rich                    13.7.1
ruamel.yaml             0.18.6
ruamel.yaml.clib        0.2.8
safetensors             0.4.3
scikit-learn            1.5.0
scipy                   1.13.1
semver                  3.0.2
sentencepiece           0.2.0
setuptools              70.1.0
shellingham             1.5.4
six                     1.16.0
sortedcontainers        2.4.0
soundfile               0.12.1
soxr                    0.3.7
speechbrain             1.0.0
SQLAlchemy              2.0.31
sympy                   1.12.1
tabulate                0.9.0
tbb                     2021.12.0
tensorboardX            2.6.2.2
threadpoolctl           3.5.0
tiktoken                0.7.0
tokenizers              0.15.2
torch                   2.0.0+cu117
torch-audiomentations   0.11.1
torch-pitch-shift       1.2.4
torchaudio              2.0.1+cu117
torchmetrics            1.4.0.post0
torchvision             0.11.1+cu102
tqdm                    4.66.4
transformers            4.39.3
typer                   0.12.3
typing_extensions       4.12.2
tzdata                  2024.1
urllib3                 2.2.2
whisperx                3.1.1
yarl                    1.9.4
zipp                    3.19.2

Hopefully none of these are irrelevant to WhisperX, but some could be, given all the experimentation I had to do.
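If you want to check your own install against the list above, here is a minimal sketch using the standard library's importlib.metadata. The PINS dictionary only copies a few versions from the list, and PINS and find_mismatches are names made up for this example, not part of WhisperX:

```python
from importlib.metadata import PackageNotFoundError, version

# A few pins copied from the package list above (assumption: these are the
# ones most likely to matter; extend the dictionary as needed).
PINS = {
    "whisperx": "3.1.1",
    "faster-whisper": "1.0.0",
    "ctranslate2": "4.3.1",
    "torch": "2.0.0+cu117",
    "torchaudio": "2.0.1+cu117",
    "pyannote.audio": "3.1.1",
}


def find_mismatches(pins, get_version=version):
    """Return {package: (wanted, installed)} for every pin that does not match.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINS).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

Run it with the same interpreter you use for transcription, so it inspects the right site-packages.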

Some notes on the script:

It prints all of the transcribed words to the command line and also saves them to a text file in the same directory, named after the audio file with "_transcription.txt" in place of ".mp3".
It tries to put each spoken sentence on a separate line, but it gets confused when people talk over each other.
Punctuation is only sometimes correct, though it can usually be counted on for a period.
Clear speech is transcribed nearly perfectly, but it struggles with concurrent noise.
It sometimes skips sentences for some reason??
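One way to spot skipped stretches is to keep the timestamps instead of throwing them away. This is only a sketch, and it assumes each entry in result['segments'] carries 'start' and 'end' times in seconds alongside 'text'. The helper names and the 5 second gap threshold are made up for illustration:

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def timestamped_lines(segments, gap_threshold=5.0):
    """Return one line per segment, plus a marker line for each long gap.

    segments is assumed to be a list of dicts with 'start', 'end' and 'text'.
    A gap is any stretch longer than gap_threshold seconds with no segment.
    """
    lines = []
    previous_end = 0.0
    for segment in segments:
        gap = segment["start"] - previous_end
        if gap > gap_threshold:
            lines.append(
                f"[{format_timestamp(previous_end)}] "
                f"(no speech transcribed for {gap:.0f}s)"
            )
        lines.append(
            f"[{format_timestamp(segment['start'])}] {segment['text'].strip()}"
        )
        # max() keeps the marker honest if segments overlap
        previous_end = max(previous_end, segment["end"])
    return lines
```

A marker line does not prove a sentence was dropped (it could be real silence), but it tells you where to listen.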

Anyway, here is the Python script I used. Remember, this is designed to run on a card with CUDA cores, so an NVIDIA card will be required:

import os
import whisperx
import torch
import re

# Paths to your directories
input_directory = r"ADDRESS OF FOLDER YOU WANT THE MP3 FILES TRANSCRIBED IN"
output_directory = input_directory  # Using the same directory for output

# Check if GPU is available and set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the WhisperX model
print("Loading model...")
model = whisperx.load_model("large", device=device)
print("Model loaded.")

# Function to transcribe a single file
def transcribe_file(audio_file_path, output_file_path):

    # Transcribe the audio file
    print(f"Starting transcription for {audio_file_path}...")
    result = model.transcribe(audio_file_path)
    print("Transcription completed.")

    # Inspect the result object
    print("Result object structure:")
    for key in result.keys():
        print(key)

    # Assuming the transcriptions are under a key like 'segments' or similar
    if 'segments' in result:
        transcriptions = [segment['text'] for segment in result['segments']]
        full_text = ' '.join(transcriptions)

        # Print the transcription
        print(full_text)

        # Split the transcription into sentences
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', full_text)

        # Write each sentence on a new line using utf-8 encoding
        with open(output_file_path, 'w', encoding='utf-8') as f:
            for sentence in sentences:
                f.write(sentence.strip() + '\n')
        print(f"Transcription saved to {output_file_path}")
    else:
        print("Expected key 'segments' not found in result object")

# Iterate over all files in the input directory
for filename in os.listdir(input_directory):
    if filename.endswith(".mp3"):
        audio_file_path = os.path.join(input_directory, filename)
        output_file_path = os.path.join(output_directory, filename.replace(".mp3", "_transcription.txt"))
        transcribe_file(audio_file_path, output_file_path)
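The file loop above only matches a lowercase ".mp3" ending, and filename.replace would also touch a ".mp3" that appears in the middle of a name. A possible alternative, sketched with pathlib, is below. The function name plan_transcriptions is made up for this example, and skipping files that already have a transcript is an extra that the script above does not do:

```python
from pathlib import Path


def plan_transcriptions(input_dir, suffix="_transcription.txt"):
    """Return (audio_path, output_path) pairs for MP3 files still to be done.

    Matches .mp3 regardless of case and skips any file whose transcript
    already exists, so an interrupted run can be restarted.
    """
    jobs = []
    for audio_path in sorted(Path(input_dir).iterdir()):
        if not audio_path.is_file() or audio_path.suffix.lower() != ".mp3":
            continue
        # Only the final extension is swapped out; the rest of the name is kept
        output_path = audio_path.with_name(audio_path.stem + suffix)
        if output_path.exists():
            continue
        jobs.append((audio_path, output_path))
    return jobs
```

Each pair can then be passed to transcribe_file as str(audio_path) and str(output_path).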

This is how I call it from the command line:
C:/Python399/python.exe transcribefolder.py

It gripes about these things every time I run it:

Using device: cuda
Loading model...
No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\username\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu117. Bad things might happen unless you revert torch to 1.x.
Model loaded.
Starting transcription for C:\Users\username\Desktop\E\6-13-24\Audio Files in mp3 for transcription\Recording.mp3...
Detected language: en (1.00) in first 30s of audio...

Maybe you will have luck making it run without complaining, but I am afraid to breathe near it.
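The "No language specified" message looks like the easiest one to quiet. To my knowledge, whisperx.load_model accepts a language argument, and setting it skips the per-file language detection. A minimal sketch, assuming every recording is in English. The helper names load_options and load_english_model are made up for this example:

```python
def load_options(device, language=None):
    """Build the keyword arguments passed on to whisperx.load_model."""
    options = {"device": device}
    if language is not None:
        # Assumption: all recordings share this one language
        options["language"] = language
    return options


def load_english_model(device="cuda"):
    """Load the same "large" model as the script, with the language fixed."""
    # Imported here so load_options can be used without whisperx installed
    import whisperx

    return whisperx.load_model("large", **load_options(device, language="en"))
```

The Lightning, pyannote.audio and torch messages appear to come from the bundled VAD checkpoint, so this change would not affect them.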

The output isn't perfect, but it significantly cuts down on the time required to get started, as long sections of clear audio are perfectly transcribed.

Just remember, if you run this, it will max out your GPU for the duration of the run (it's using the large model). If you have a good card, that run shouldn't last long.
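If the card is short on memory, the WhisperX README suggests a smaller batch size and a lighter compute type, at some possible cost in accuracy. The sketch below is an untested assumption about how that could look in this script. The function names are made up, and the values 16, 4, "float16" and "int8" are starting points taken from that advice rather than tuned settings:

```python
def pick_settings(device, low_memory=False):
    """Choose a compute type and batch size for the given device."""
    if device != "cuda" or low_memory:
        # int8 is the lighter option and the usual choice when running on CPU
        return {"compute_type": "int8", "batch_size": 4}
    return {"compute_type": "float16", "batch_size": 16}


def transcribe_gently(audio_path, device="cuda", low_memory=True):
    """Transcribe one file with the reduced settings."""
    # Imported here so pick_settings can be used without whisperx installed
    import whisperx

    settings = pick_settings(device, low_memory)
    model = whisperx.load_model(
        "large", device=device, compute_type=settings["compute_type"]
    )
    return model.transcribe(audio_path, batch_size=settings["batch_size"])
```

This mainly lowers memory use. The GPU will likely still be busy for the whole run.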