Paste this URL into Claude Code and tell it to set this up for you.
What This Is
Press a hotkey, speak, text appears wherever your cursor is. Any app - terminal, browser, notes, whatever.
This uses Whisper, an open-source speech recognition model that OpenAI released in 2022. The key thing: it runs entirely on your machine. Your voice never leaves your computer. No cloud API, no account, no subscription, no privacy policy to read. The model downloads once (~1.5GB), then everything is local forever.
Accuracy is excellent - comparable to commercial cloud services. On a modern CPU (i5/i7/Ryzen from the last few years), a 15-second dictation transcribes in about 6-8 seconds. You can also feed it a custom vocabulary of project names, technical terms, and proper nouns to improve recognition of domain-specific words.
Jump to: Linux · macOS · Windows · Custom Vocabulary · Speaker Diarisation
Prerequisites
All platforms:
- A decent CPU (any modern i5/i7/Ryzen 5+ works fine)
- Python 3.10+
- ~1.5GB disk space for the model
Platform-specific:
- Linux (Kubuntu/KDE): PipeWire audio, ydotool for typing
- macOS: SoX for recording, clipboard + paste for typing
- Windows 11: ffmpeg for recording, AutoHotkey for hotkey + typing
Part 1: Linux (Kubuntu / KDE Plasma)
This is what I use. No network dependency, no API costs, excellent accuracy.
Install faster-whisper
# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv
# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper
Create the Transcription Wrapper
Save this to ~/.local/bin/whisper-transcribe:
#!/home/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""
import sys
from pathlib import Path
from faster_whisper import WhisperModel
# distil-large-v3: good balance of speed and accuracy. Use "large-v3" for best accuracy.
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"
# Optional: custom vocabulary file (one term per line, sorted by frequency)
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350 # ~40-60 terms fit in Whisper's 224-token prompt budget
def load_vocab():
    """Load custom vocabulary for Whisper biasing via initial_prompt."""
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    # Take top terms that fit in the char budget
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None

def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)
    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt
    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()
Important: Replace YOUR_USERNAME with your actual username in the shebang line.
Make it executable and download the model:
chmod +x ~/.local/bin/whisper-transcribe
# First run downloads the model (~1.5GB) - takes a few minutes
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true
Model Options
| Model | Speed | Accuracy | Notes |
|---|---|---|---|
| large-v3 | ~0.4x realtime | Best | Most accurate |
| distil-large-v3 | ~2x realtime | ~1% worse | What I use - good trade-off |
| medium | ~3x realtime | Good | Lighter on resources |
For a 15-second dictation, distil-large-v3 takes about 6-8 seconds on a modern CPU.
Create the Dictation Script
Save this to ~/.local/bin/dictate-hotkey:
#!/bin/bash
# Global hotkey dictation script for Wayland/KDE
# Uses local Whisper (faster-whisper) for transcription
LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"
if [[ ! -x "$WHISPER_BIN" ]]; then
notify-send -u critical "Dictation" "whisper-transcribe not found"
exit 1
fi
cleanup_stale_lock() {
if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
local pid
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
rm -f "$LOCK_FILE" "$PID_FILE"
return 0
fi
elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
rm -f "$LOCK_FILE"
return 0
fi
return 1
}
cleanup_stale_lock
if [[ -f "$LOCK_FILE" ]]; then
# STOP recording and transcribe
if [[ -f "$PID_FILE" ]]; then
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]]; then
kill -INT "$pid" 2>/dev/null || true
for i in {1..10}; do
kill -0 "$pid" 2>/dev/null || break
sleep 0.1
done
kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
fi
rm -f "$PID_FILE"
fi
rm -f "$LOCK_FILE"
if [[ ! -f "$AUDIO_FILE" ]]; then
notify-send -u critical "Dictation" "No audio file found"
exit 1
fi
audio_size=$(stat -c%s "$AUDIO_FILE" 2>/dev/null || echo "0")
if [[ "$audio_size" -lt 1000 ]]; then
notify-send -u critical "Dictation" "Audio too short"
rm -f "$AUDIO_FILE"
exit 1
fi
notify-send -t 1500 "Dictation" "Transcribing locally..."
TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)
if [[ -z "$TRANSCRIPT" ]]; then
cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
notify-send -u critical "Dictation" "Transcription failed"
rm -f "$AUDIO_FILE"
exit 1
fi
rm -f "$AUDIO_FILE"
notify-send -t 1500 "Dictation" "Typing: ${TRANSCRIPT:0:50}..."
sleep 0.1
if ! echo -n "$TRANSCRIPT" | ydotool type --file -; then
if [[ -n "$WAYLAND_DISPLAY" ]]; then
echo -n "$TRANSCRIPT" | wl-copy
notify-send -t 2000 "Dictation" "Copied to clipboard (ydotool failed)"
fi
fi
else
# START recording
rm -f "$AUDIO_FILE"
pw-record --channels=1 "$AUDIO_FILE" &
record_pid=$!
sleep 0.2
if ! kill -0 "$record_pid" 2>/dev/null; then
notify-send -u critical "Dictation" "Failed to start recording"
exit 1
fi
touch "$LOCK_FILE"
echo "$record_pid" > "$PID_FILE"
notify-send -t 2000 "Dictation" "Recording... Press hotkey to stop"
fi
Make it executable:
chmod +x ~/.local/bin/dictate-hotkey
Set Up ydotool Permissions
sudo apt install ydotool wl-clipboard
sudo usermod -aG input $USER
Log out and back in for the group change to take effect. ydotool also needs its background daemon, ydotoold, running; on distributions that ship a systemd user unit you can start it with `systemctl --user enable --now ydotool.service`.
Set Up the Global Hotkey
- Open System Settings > Shortcuts > Custom Shortcuts
- Click Edit > New > Global Shortcut > Command/URL
- Name it “Dictate”
- Trigger tab: Set your preferred hotkey (I use ``Ctrl+` ``)
- Action tab: Enter
/home/YOUR_USERNAME/.local/bin/dictate-hotkey
Test It
- Open any text input
- Press your hotkey - “Recording…” notification
- Speak clearly
- Press hotkey again - “Transcribing locally…” then text appears
Troubleshooting
No audio captured (empty file):
# Check your audio sources
wpctl status
# Test recording manually (should produce >44 bytes)
pw-record --channels=1 /tmp/test.wav &
sleep 2
kill %1
ls -la /tmp/test.wav
If pw-record produces empty files, check that your microphone is:
- Not muted in system audio settings
- Set as the default input source
Important: Use pw-record, not parecord. On PipeWire systems, parecord often produces empty files even though the PulseAudio compatibility layer is installed. The native pw-record --channels=1 works reliably.
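When a test capture does exist but you are not sure it contains real audio, Python's standard-library wave module can report what was actually written (a quick sketch; `/tmp/test.wav` is the file from the commands above):

```python
import wave

def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / rate
    return rate, channels, duration

# e.g. wav_info("/tmp/test.wav") on a healthy mono capture
```

A duration near zero means pw-record wrote only a header - the usual sign the microphone is muted or is not the default source.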
ydotool permission denied:
- Make sure you logged out and back in after adding yourself to the `input` group
- Check: `groups` should show `input`
Hotkey not triggering:
- KDE sometimes needs a restart of the shortcuts daemon
- Try: Log out and back in, or restart Plasma (the shortcut daemon restarts with it)
Script seems stuck (pressing hotkey does nothing useful): The script auto-recovers from stale state, but if something is really stuck:
rm -f /tmp/dictate-hotkey.lock /tmp/dictate-hotkey.pid
pkill pw-record
Part 2: macOS (Sequoia 15.x)
Install Dependencies
# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install SoX for recording
brew install sox
Install faster-whisper
# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv
# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper
# Create bin directory if needed
mkdir -p ~/.local/bin
Create the Transcription Wrapper
Save this to ~/.local/bin/whisper-transcribe:
#!/Users/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""
import sys
from pathlib import Path
from faster_whisper import WhisperModel
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350
def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None

def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)
    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt
    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()
Important: Replace YOUR_USERNAME with your actual username in the shebang line.
Make it executable and download the model:
chmod +x ~/.local/bin/whisper-transcribe
# First run downloads the model (~1.5GB)
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true
Create the Dictation Script
Save this to ~/.local/bin/dictate-hotkey:
#!/bin/bash
# Global hotkey dictation script for macOS
# Uses local Whisper (faster-whisper) for transcription
LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"
if [[ ! -x "$WHISPER_BIN" ]]; then
osascript -e 'display notification "whisper-transcribe not found" with title "Dictation" sound name "Basso"'
exit 1
fi
cleanup_stale_lock() {
if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
local pid
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
rm -f "$LOCK_FILE" "$PID_FILE"
return 0
fi
elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
rm -f "$LOCK_FILE"
return 0
fi
return 1
}
cleanup_stale_lock
if [[ -f "$LOCK_FILE" ]]; then
# STOP recording and transcribe
if [[ -f "$PID_FILE" ]]; then
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]]; then
kill -INT "$pid" 2>/dev/null || true
for i in {1..10}; do
kill -0 "$pid" 2>/dev/null || break
sleep 0.1
done
kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
fi
rm -f "$PID_FILE"
fi
rm -f "$LOCK_FILE"
if [[ ! -f "$AUDIO_FILE" ]]; then
osascript -e 'display notification "No audio file found" with title "Dictation" sound name "Basso"'
exit 1
fi
audio_size=$(stat -f%z "$AUDIO_FILE" 2>/dev/null || echo "0")
if [[ "$audio_size" -lt 1000 ]]; then
osascript -e 'display notification "Audio too short" with title "Dictation" sound name "Basso"'
rm -f "$AUDIO_FILE"
exit 1
fi
osascript -e 'display notification "Transcribing locally..." with title "Dictation"'
TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)
if [[ -z "$TRANSCRIPT" ]]; then
cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
osascript -e 'display notification "Transcription failed" with title "Dictation" sound name "Basso"'
rm -f "$AUDIO_FILE"
exit 1
fi
rm -f "$AUDIO_FILE"
# Copy to clipboard and paste
echo -n "$TRANSCRIPT" | pbcopy
osascript -e 'tell application "System Events" to keystroke "v" using command down'
osascript -e "display notification \"Typed: ${TRANSCRIPT:0:50}...\" with title \"Dictation\""
else
# START recording
rm -f "$AUDIO_FILE"
osascript -e 'display notification "Recording... Press hotkey to stop" with title "Dictation"'
rec -q -r 16000 -c 1 "$AUDIO_FILE" &
record_pid=$!
sleep 0.2
if ! kill -0 "$record_pid" 2>/dev/null; then
osascript -e 'display notification "Failed to start recording" with title "Dictation" sound name "Basso"'
exit 1
fi
touch "$LOCK_FILE"
echo "$record_pid" > "$PID_FILE"
fi
Make it executable:
chmod +x ~/.local/bin/dictate-hotkey
Set Up the Global Hotkey
Option A: Automator + System Shortcuts (built-in)
- Open Automator > New Document > Quick Action
- Set “Workflow receives” to no input in any application
- Add Run Shell Script action
- Paste: `~/.local/bin/dictate-hotkey`
- Save as “Dictate Toggle”
- Open System Settings > Keyboard > Keyboard Shortcuts > Services
- Find “Dictate Toggle” and assign your hotkey (e.g., ``Ctrl+` ``)
Option B: Hammerspoon (more reliable)
If you use Hammerspoon, add to your ~/.hammerspoon/init.lua:
hs.hotkey.bind({"ctrl"}, "`", function()
hs.execute("~/.local/bin/dictate-hotkey", true)
end)
Then reload Hammerspoon config.
Grant Permissions
macOS will prompt for:
- Microphone access - Allow for Terminal/iTerm/Hammerspoon
- Accessibility access - Required for the paste keystroke
Go to System Settings > Privacy & Security to grant these if needed.
Test It
Same as Linux - press hotkey, speak, press again, text appears.
Troubleshooting
SoX not recording:
# List audio devices (SoX has no device-listing flag; use the system profiler)
system_profiler SPAudioDataType
# Test recording
rec -r 16000 -c 1 /tmp/test.wav
# Ctrl+C to stop
play /tmp/test.wav
Paste not working:
- Make sure Accessibility permissions are granted
- Try the clipboard fallback: the script already runs `pbcopy`, so the transcript is on the clipboard - paste it manually with Cmd+V
Part 3: Windows 11
Windows uses AutoHotkey for the global hotkey and ffmpeg for recording. The transcription wrapper runs in Python, same as other platforms.
Install Dependencies
Option A: Using winget (recommended)
Open PowerShell as Administrator:
winget install --id=Gyan.FFmpeg -e
winget install --id=AutoHotkey.AutoHotkey -e
winget install --id=Python.Python.3.12 -e
Option B: Using Scoop
scoop install ffmpeg autohotkey python
Install faster-whisper
# Create venv and install
python -m venv "$env:USERPROFILE\.local\share\whisper-venv"
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\pip.exe" install faster-whisper
Create the Transcription Wrapper
Save this to %USERPROFILE%\.local\bin\whisper-transcribe.py:
"""Local Whisper transcription using faster-whisper."""
import sys
from pathlib import Path
from faster_whisper import WhisperModel
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350
def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None

def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)
    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt
    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()
Create the directories and download the model:
mkdir "$env:USERPROFILE\.local\bin" -Force
# First run downloads the model (~1.5GB)
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\python.exe" "$env:USERPROFILE\.local\bin\whisper-transcribe.py" NUL 2>$null
Find Your Microphone Name
ffmpeg -list_devices true -f dshow -i dummy 2>&1 | Select-String "audio"
Look for a line like:
[dshow] "Microphone Array (Realtek(R) Audio)" (audio)
Copy the name in quotes - you’ll need it for the script below.
Create the AutoHotkey Script
Save this as dictate.ahk somewhere convenient (e.g., Documents\Scripts\dictate.ahk):
#Requires AutoHotkey v2.0
#SingleInstance Force
; Configuration
global LOCK_FILE := A_Temp "\dictate-hotkey.lock"
global PID_FILE := A_Temp "\dictate-hotkey.pid"
global AUDIO_FILE := A_Temp "\dictate-hotkey.wav"
global WHISPER_VENV := EnvGet("USERPROFILE") "\.local\share\whisper-venv\Scripts\python.exe"
global WHISPER_SCRIPT := EnvGet("USERPROFILE") "\.local\bin\whisper-transcribe.py"
; Hotkey: Ctrl+` (backtick) - change this to your preference
^`:: ToggleDictation()
ToggleDictation() {
; Clean up stale lock if recording process died
if FileExist(LOCK_FILE) && FileExist(PID_FILE) {
pid := FileRead(PID_FILE)
if !ProcessExist(pid) {
FileDelete(LOCK_FILE)
FileDelete(PID_FILE)
}
}
if FileExist(LOCK_FILE) {
StopAndTranscribe()
} else {
StartRecording()
}
}
StartRecording() {
; Clean up
if FileExist(AUDIO_FILE)
FileDelete(AUDIO_FILE)
; Create lock file
FileAppend("", LOCK_FILE)
; Show notification
TrayTip("Recording... Press Ctrl+` to stop", "Dictation", "Mute")
; Start ffmpeg recording in background
; IMPORTANT: Replace with YOUR device name from the step above
deviceName := "Microphone Array (Realtek(R) Audio)"
Run('cmd /c ffmpeg -f dshow -i audio="' deviceName '" -ar 16000 -ac 1 -y "' AUDIO_FILE '"', , "Hide", &pid)
; Save PID
FileAppend(pid, PID_FILE)
}
StopAndTranscribe() {
; Remove lock
if FileExist(LOCK_FILE)
FileDelete(LOCK_FILE)
; Kill ffmpeg
if FileExist(PID_FILE) {
pid := FileRead(PID_FILE)
try {
ProcessClose(pid)
}
FileDelete(PID_FILE)
}
; Also kill any lingering ffmpeg
Run(A_ComSpec ' /c taskkill /f /im ffmpeg.exe >nul 2>&1', , "Hide")  ; redirection needs a shell
Sleep(500) ; Let file finish writing
if !FileExist(AUDIO_FILE) || FileGetSize(AUDIO_FILE) < 1000 {
TrayTip("No audio recorded", "Dictation Error", "Icon!")
return
}
TrayTip("Transcribing locally...", "Dictation", "Mute")
; Run whisper transcription
tempTranscript := A_Temp "\transcript.txt"
cmd := '"' WHISPER_VENV '" "' WHISPER_SCRIPT '" "' AUDIO_FILE '" > "' tempTranscript '" 2>nul'
RunWait(A_ComSpec ' /c ' cmd, , "Hide")
transcript := ""
if FileExist(tempTranscript) {
transcript := Trim(FileRead(tempTranscript), "`r`n")
FileDelete(tempTranscript)
}
FileDelete(AUDIO_FILE)
if (transcript = "") {
TrayTip("Transcription failed", "Dictation Error", "Icon!")
return
}
; Type the result using clipboard + paste (most reliable)
A_Clipboard := transcript
Sleep(100)
Send("^v")
TrayTip("Typed: " SubStr(transcript, 1, 50) "...", "Dictation", "Mute")
}
Important: Update the deviceName variable with your actual microphone name from the previous step.
Run on Startup (Optional)
- Press `Win+R`, type `shell:startup`, press Enter
- Create a shortcut to your `dictate.ahk` file in this folder
Test It
- Double-click `dictate.ahk` to run it (a green “H” icon appears in the system tray)
- Open any text field
- Press ``Ctrl+` `` - a tray notification shows “Recording…”
- Speak clearly
- Press ``Ctrl+` `` again - “Transcribing locally…” then text appears
Note: The first transcription will be slow (~30-60 seconds) as the model loads. Subsequent transcriptions will be faster (~6-8 seconds for 15 seconds of audio).
Troubleshooting
“No audio recorded” error:
- Check the `deviceName` variable in the script matches your actual microphone
- Test ffmpeg recording manually: `ffmpeg -f dshow -i audio="Your Microphone Name" -t 3 -y test.wav`
- Make sure the microphone isn’t muted in Windows Sound settings
ffmpeg not found:
- Restart your terminal/PowerShell after installing
- Check it’s in PATH: `where ffmpeg`
AutoHotkey script won’t run:
- Make sure you have AutoHotkey v2 installed (not v1)
- Right-click the `.ahk` file > “Run as administrator” if needed
Hotkey conflict:
- Change the `` ^` `` hotkey to something else in the script (look for the hotkey definition), e.g.:
  - `^+d` for Ctrl+Shift+D
  - `#d` for Win+D
  - `!d` for Alt+D
Model download fails:
- Ensure you have ~3GB free disk space
- Check Python and pip are working: `python --version`
Custom Vocabulary (Optional)
Whisper’s initial_prompt parameter biases the decoder toward specific words. If you frequently dictate domain-specific terms - project names, technical jargon, people’s names - you can improve accuracy by feeding Whisper a vocabulary list.
How It Works
The whisper-transcribe script loads vocabulary from ~/.local/share/whisper-vocab.txt (or %USERPROFILE%\.local\share\whisper-vocab.txt on Windows) if it exists. The file format is simple: one term per line, sorted by frequency (most common first). The script takes the top ~40-60 terms that fit in Whisper’s 350-character prompt budget.
Example vocabulary file:
TrueNAS
Syncthing
Kubernetes
Obsidian
PostgreSQL
FastAPI
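The char-budget truncation in the wrapper is easy to sanity-check on its own. This sketch repeats the same loop as `load_vocab` above, with the file read swapped for an in-memory list:

```python
PROMPT_MAX_CHARS = 350  # same budget the wrapper uses

def build_prompt(terms, max_chars=PROMPT_MAX_CHARS):
    """Join the top terms that fit in the character budget, comma-separated."""
    parts, count = [], 0
    for term in terms:
        if count + len(term) + 2 > max_chars:
            break
        parts.append(term)
        count += len(term) + 2  # +2 for the ", " separator
    return ", ".join(parts) if parts else None

print(build_prompt(["TrueNAS", "Syncthing", "Kubernetes"]))
# → TrueNAS, Syncthing, Kubernetes
```

Because the list is frequency-sorted, whatever gets cut is always the rarest material.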
Creating Your Vocabulary File
Simple approach: Manually list terms you commonly dictate:
cat > ~/.local/share/whisper-vocab.txt << 'EOF'
YourProjectName
TechnicalTerm1
TechnicalTerm2
SomeFramework
EOF
Automated approach: If you keep notes in markdown (Obsidian, Logseq, etc.), you can extract vocabulary programmatically:
- Scan all `.md` files for capitalised words and wiki-links
- Filter against a dictionary (`/usr/share/dict/words` on Linux) to isolate non-common terms
- Sort by frequency and output to the vocab file
I wrote a Python script that does this for my Obsidian vault - it extracts ~3000 terms, and Whisper loads the top 40-50 at transcription time. The improvement is subtle but noticeable for proper nouns.
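That script isn't included here, but a minimal sketch of the same idea looks like this. The regexes, the dictionary path, and the vault location are all illustrative assumptions to adapt:

```python
import re
from collections import Counter
from pathlib import Path

def extract_terms(markdown_text, known_words):
    """Count capitalised words and [[wiki-link]] targets not in the dictionary."""
    counts = Counter()
    # Wiki-link targets count as terms, then get stripped before the word pass
    for link in re.findall(r"\[\[([^\]|]+)", markdown_text):
        counts[link.strip()] += 1
    stripped = re.sub(r"\[\[[^\]]*\]\]", " ", markdown_text)
    for word in re.findall(r"\b[A-Z][A-Za-z0-9]{2,}\b", stripped):
        if word.lower() not in known_words:  # drops sentence-initial common words
            counts[word] += 1
    return counts

def build_vocab(vault_dir, dict_path="/usr/share/dict/words"):
    """Scan all .md files, return terms sorted by frequency (most common first)."""
    known = {w.lower() for w in Path(dict_path).read_text().split()}
    counts = Counter()
    for md in Path(vault_dir).rglob("*.md"):
        counts += extract_terms(md.read_text(errors="ignore"), known)
    return [term for term, _ in counts.most_common()]

# Hypothetical usage:
# vocab = build_vocab(Path.home() / "Obsidian")
# (Path.home() / ".local/share/whisper-vocab.txt").write_text("\n".join(vocab))
```

The dictionary filter is what does the heavy lifting: it throws away ordinary capitalised English and keeps only the names Whisper is likely to misspell.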
Constraints
- Token budget: Whisper’s decoder context is 448 tokens. The `initial_prompt` should use at most half (~224 tokens ≈ 350 characters). The script enforces this.
- Acronyms are low yield: Whisper handles common acronyms fine from training data. Focus on proper nouns and domain-specific terms.
- Don’t use `hotwords` with `initial_prompt`: in faster-whisper 1.2.1, combining both causes repetition artifacts. Stick to `initial_prompt` only.
Beyond Dictation: Longer Recordings with Speaker Diarisation
The dictation setup above is optimised for short bursts — press a hotkey, speak a sentence, get text. But the same Whisper stack can handle longer recordings (meetings, interviews, voice memos) with an important addition: speaker diarisation — identifying who said what.
For this, WhisperX wraps faster-whisper with word-level timestamp alignment and pyannote speaker diarisation in a single pipeline.
Install WhisperX
# Create a separate venv (don't mix with dictation — different dependencies)
python3 -m venv ~/venvs/whisperx
~/venvs/whisperx/bin/pip install whisperx
One-Time Setup: HuggingFace Auth
Pyannote’s diarisation models are gated — you need a free HuggingFace account and must accept the model licenses:
- Create an account at huggingface.co
- Accept the licenses (click “Agree”) for the gated pyannote models: at minimum `pyannote/speaker-diarization-3.1`, used in the code below, and its dependency `pyannote/segmentation-3.0`
- Generate a read token at huggingface.co/settings/tokens
- Log in locally:
~/venvs/whisperx/bin/huggingface-cli login
Models cache locally after first download (~2-4GB). After that, everything runs offline.
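Once the models are cached, you can make the offline guarantee explicit: `huggingface_hub` honours the `HF_HUB_OFFLINE` environment variable, which forces cache-only loads instead of checking the network. A one-line config fragment for your shell profile:

```shell
# Force huggingface_hub to use the local cache only - no network calls
export HF_HUB_OFFLINE=1
```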
Transcribe with Diarisation
import torch
import whisperx
from whisperx.diarize import DiarizationPipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
audio_file = "meeting.m4a"
# 1. Transcribe
model = whisperx.load_model("distil-large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=4)
# 2. Align word-level timestamps
model_a, metadata = whisperx.load_align_model(
language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# 3. Diarise (identify speakers)
diarize_model = DiarizationPipeline(
model_name="pyannote/speaker-diarization-3.1", device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)
# 4. Print with speaker labels
for seg in result["segments"]:
    speaker = seg.get("speaker", "?")
    print(f"[{speaker}] {seg['text'].strip()}")
What to Expect
| Scenario | Accuracy |
|---|---|
| 2-3 speakers, clear audio | Excellent |
| 4-6 speakers, moderate audio | Good, some errors |
| 7+ speakers or noisy/overlapping | Needs manual review |
Specifying min_speakers / max_speakers (or num_speakers if you know the exact count) significantly improves accuracy — the model doesn’t have to guess.
Speed on CPU: Transcription runs at ~0.6x realtime (a 10 min file takes ~6 min). Diarisation adds roughly 2-3x audio length on top. A 10-minute meeting takes ~30 minutes total on CPU. On GPU, both steps are near-instant.
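Those factors give a back-of-envelope estimator. The 0.6x and 2.5x multipliers are the rough observations above, not guarantees; your hardware will differ:

```python
def whisperx_cpu_minutes(audio_min, transcribe_factor=0.6, diarize_factor=2.5):
    """Estimated wall-clock minutes on CPU: transcription plus diarisation."""
    return audio_min * transcribe_factor + audio_min * diarize_factor

# A 10-minute meeting: 10*0.6 + 10*2.5 = ~31 minutes on CPU
```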
Key detail: Speaker labels come back as SPEAKER_00, SPEAKER_01, etc. — you map these to real names yourself. WhisperX assigns speakers at the word level, so even within a single transcription segment, different words can have different speaker labels. This matters for conversational audio where people talk in short turns.
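The label-to-name mapping is a simple post-processing pass over the result. A sketch (the `names` table is a hypothetical mapping you fill in after identifying each speaker):

```python
def rename_speakers(segments, names):
    """Replace SPEAKER_XX labels with human names in WhisperX segments."""
    for seg in segments:
        if "speaker" in seg:
            seg["speaker"] = names.get(seg["speaker"], seg["speaker"])
        # Word-level labels too, since speakers can change mid-segment
        for word in seg.get("words", []):
            if "speaker" in word:
                word["speaker"] = names.get(word["speaker"], word["speaker"])
    return segments

# Hypothetical usage after the diarisation pipeline above:
# rename_speakers(result["segments"], {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"})
```

Unknown labels pass through unchanged, so a partially filled-in table still produces usable output.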
When to Use What
| Use case | Tool |
|---|---|
| Quick dictation (sentence or two) | The hotkey setup above (faster-whisper) |
| Transcribing a voice memo or lecture | WhisperX without diarisation |
| Meeting or interview (multiple speakers) | WhisperX with diarisation (see code above) |
The dictation setup stays as-is — it’s optimised for speed and simplicity. WhisperX is the heavier tool you reach for when you have a longer recording and care about who said what.
What’s Next
Possible improvements to the dictation setup:
- Hold-to-talk instead of toggle (harder with global shortcuts)
- Visual indicator showing recording state in system tray
- Multiple languages (Whisper supports many)
- GPU acceleration (significantly faster if you have a compatible NVIDIA GPU)
The cognitive overhead of “press, speak, press” is minimal once you’ve done it a few times.
Questions or improvements? Find me at hedwards.dev.