Global Dictation with Speech-to-Text
Press a hotkey, speak, have text typed into any app
Set up a global dictation system using local Whisper. Works with any application - Claude Code, Obsidian, browser, whatever has focus. Covers Linux, macOS, and Windows 11.
Paste this URL into Claude Code and tell it to set this up for you.
What This Is
Press a hotkey, speak, text appears wherever your cursor is. Any app - terminal, browser, notes, whatever.
This uses Whisper, an open-source speech recognition model that OpenAI released in 2022. The key thing: it runs entirely on your machine. Your voice never leaves your computer. No cloud API, no account, no subscription, no privacy policy to read. The model downloads once (~1.5GB), then everything is local forever.
Accuracy is excellent - comparable to commercial cloud services. On a modern CPU (i5/i7/Ryzen from the last few years), a 15-second dictation transcribes in about 6-8 seconds. You can also feed it a custom vocabulary of project names, technical terms, and proper nouns to improve recognition of domain-specific words.
Jump to: Linux · macOS · Windows · Custom Vocabulary
Prerequisites
All platforms:
- A decent CPU (any modern i5/i7/Ryzen 5+ works fine)
- Python 3.10+
- ~1.5GB disk space for the model
Platform-specific:
- Linux (Kubuntu/KDE): PipeWire audio, ydotool for typing
- macOS: SoX for recording, clipboard + paste for typing
- Windows 11: ffmpeg for recording, AutoHotkey for hotkey + typing
Part 1: Linux (Kubuntu / KDE Plasma)
This is what I use. No network dependency, no API costs, excellent accuracy.
Install faster-whisper
# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv
# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper
Create the Transcription Wrapper
Save this to ~/.local/bin/whisper-transcribe:
#!/home/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""
import sys
from pathlib import Path

from faster_whisper import WhisperModel

# distil-large-v3: good balance of speed and accuracy. Use "large-v3" for best accuracy.
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

# Optional: custom vocabulary file (one term per line, sorted by frequency)
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350  # ~40-60 terms fit in Whisper's 224-token prompt budget


def load_vocab():
    """Load custom vocabulary for Whisper biasing via initial_prompt."""
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    # Take top terms that fit in the char budget
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)
    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt
    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)


if __name__ == "__main__":
    main()
Important: Replace YOUR_USERNAME with your actual username in the shebang line.
Make it executable and download the model:
chmod +x ~/.local/bin/whisper-transcribe
# First run downloads the model (~1.5GB) - takes a few minutes
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true
Model Options
| Model | Speed | Accuracy | Notes |
|---|---|---|---|
| large-v3 | ~0.4x realtime | Best | Most accurate |
| distil-large-v3 | ~2x realtime | ~1% worse | What I use - good trade-off |
| medium | ~3x realtime | Good | Lighter on resources |
For a 15-second dictation, distil-large-v3 takes about 6-8 seconds on a modern CPU.
Create the Dictation Script
Save this to ~/.local/bin/dictate-hotkey:
#!/bin/bash
# Global hotkey dictation script for Wayland/KDE
# Uses local Whisper (faster-whisper) for transcription
LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"
if [[ ! -x "$WHISPER_BIN" ]]; then
notify-send -u critical "Dictation" "whisper-transcribe not found"
exit 1
fi
cleanup_stale_lock() {
if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
local pid
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
rm -f "$LOCK_FILE" "$PID_FILE"
return 0
fi
elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
rm -f "$LOCK_FILE"
return 0
fi
return 1
}
cleanup_stale_lock
if [[ -f "$LOCK_FILE" ]]; then
# STOP recording and transcribe
if [[ -f "$PID_FILE" ]]; then
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]]; then
kill "$pid" 2>/dev/null || true
sleep 0.3
fi
rm -f "$PID_FILE"
fi
rm -f "$LOCK_FILE"
if [[ ! -f "$AUDIO_FILE" ]]; then
notify-send -u critical "Dictation" "No audio file found"
exit 1
fi
audio_size=$(stat -c%s "$AUDIO_FILE" 2>/dev/null || echo "0")
if [[ "$audio_size" -lt 1000 ]]; then
notify-send -u critical "Dictation" "Audio too short"
rm -f "$AUDIO_FILE"
exit 1
fi
notify-send -t 1500 "Dictation" "Transcribing locally..."
TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)
if [[ -z "$TRANSCRIPT" ]]; then
cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
notify-send -u critical "Dictation" "Transcription failed"
rm -f "$AUDIO_FILE"
exit 1
fi
rm -f "$AUDIO_FILE"
notify-send -t 1500 "Dictation" "Typing: ${TRANSCRIPT:0:50}..."
sleep 0.1
if ! echo -n "$TRANSCRIPT" | ydotool type --file -; then
if [[ -n "$WAYLAND_DISPLAY" ]]; then
echo -n "$TRANSCRIPT" | wl-copy
notify-send -t 2000 "Dictation" "Copied to clipboard (ydotool failed)"
fi
fi
else
# START recording
rm -f "$AUDIO_FILE"
pw-record --channels=2 "$AUDIO_FILE" &
record_pid=$!
sleep 0.2
if ! kill -0 "$record_pid" 2>/dev/null; then
notify-send -u critical "Dictation" "Failed to start recording"
exit 1
fi
echo "$record_pid" > "$PID_FILE"
touch "$LOCK_FILE"
notify-send -t 2000 "Dictation" "Recording... Press hotkey to stop"
fi
Make it executable:
chmod +x ~/.local/bin/dictate-hotkey
Set Up ydotool Permissions
sudo apt install ydotool wl-clipboard
sudo usermod -aG input $USER
Log out and back in for the group change to take effect.
Set Up the Global Hotkey
- Open System Settings > Shortcuts > Custom Shortcuts
- Click Edit > New > Global Shortcut > Command/URL
- Name it “Dictate”
- Trigger tab: Set your preferred hotkey (I use ``Ctrl+` ``)
- Action tab: Enter
/home/YOUR_USERNAME/.local/bin/dictate-hotkey
Test It
- Open any text input
- Press your hotkey - “Recording…” notification
- Speak clearly
- Press hotkey again - “Transcribing locally…” then text appears
Troubleshooting
No audio captured (empty file):
# Check your audio sources
wpctl status
# Test recording manually (should produce >44 bytes)
pw-record --channels=2 /tmp/test.wav &
sleep 2
kill %1
ls -la /tmp/test.wav
If pw-record produces empty files, check that your microphone is:
- Not muted in system audio settings
- Set as the default input source
Important: Use pw-record, not parecord. On PipeWire systems, parecord often produces empty files even though the PulseAudio compatibility layer is installed. The native pw-record --channels=2 works reliably.
ydotool permission denied:
- Make sure you logged out and back in after adding yourself to the `input` group
- Check: `groups` should show `input`
Hotkey not triggering:
- KDE sometimes needs a restart of the shortcuts daemon
- Try: Log out and back in, or restart Plasma (the shortcut daemon restarts with it)
Script seems stuck (pressing hotkey does nothing useful): The script auto-recovers from stale state, but if something is really stuck:
rm -f /tmp/dictate-hotkey.lock /tmp/dictate-hotkey.pid
pkill pw-record
Part 2: macOS (Sequoia 15.x)
Install Dependencies
# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install SoX for recording
brew install sox
Install faster-whisper
# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv
# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper
# Create bin directory if needed
mkdir -p ~/.local/bin
Create the Transcription Wrapper
Save this to ~/.local/bin/whisper-transcribe:
#!/Users/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""
import sys
from pathlib import Path

from faster_whisper import WhisperModel

MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350


def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)
    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt
    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)


if __name__ == "__main__":
    main()
Important: Replace YOUR_USERNAME with your actual username in the shebang line.
Make it executable and download the model:
chmod +x ~/.local/bin/whisper-transcribe
# First run downloads the model (~1.5GB)
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true
Create the Dictation Script
Save this to ~/.local/bin/dictate-hotkey:
#!/bin/bash
# Global hotkey dictation script for macOS
# Uses local Whisper (faster-whisper) for transcription
LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"
if [[ ! -x "$WHISPER_BIN" ]]; then
osascript -e 'display notification "whisper-transcribe not found" with title "Dictation" sound name "Basso"'
exit 1
fi
cleanup_stale_lock() {
if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
local pid
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
rm -f "$LOCK_FILE" "$PID_FILE"
return 0
fi
elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
rm -f "$LOCK_FILE"
return 0
fi
return 1
}
cleanup_stale_lock
if [[ -f "$LOCK_FILE" ]]; then
# STOP recording and transcribe
if [[ -f "$PID_FILE" ]]; then
pid=$(cat "$PID_FILE" 2>/dev/null)
if [[ -n "$pid" ]]; then
kill "$pid" 2>/dev/null || true
sleep 0.3
fi
rm -f "$PID_FILE"
fi
rm -f "$LOCK_FILE"
if [[ ! -f "$AUDIO_FILE" ]]; then
osascript -e 'display notification "No audio file found" with title "Dictation" sound name "Basso"'
exit 1
fi
audio_size=$(stat -f%z "$AUDIO_FILE" 2>/dev/null || echo "0")
if [[ "$audio_size" -lt 1000 ]]; then
osascript -e 'display notification "Audio too short" with title "Dictation" sound name "Basso"'
rm -f "$AUDIO_FILE"
exit 1
fi
osascript -e 'display notification "Transcribing locally..." with title "Dictation"'
TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)
if [[ -z "$TRANSCRIPT" ]]; then
cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
osascript -e 'display notification "Transcription failed" with title "Dictation" sound name "Basso"'
rm -f "$AUDIO_FILE"
exit 1
fi
rm -f "$AUDIO_FILE"
# Copy to clipboard and paste
echo -n "$TRANSCRIPT" | pbcopy
osascript -e 'tell application "System Events" to keystroke "v" using command down'
osascript -e "display notification \"Typed: ${TRANSCRIPT:0:50}...\" with title \"Dictation\""
else
# START recording
rm -f "$AUDIO_FILE"
osascript -e 'display notification "Recording... Press hotkey to stop" with title "Dictation"'
rec -q -r 16000 -c 1 "$AUDIO_FILE" &
record_pid=$!
sleep 0.2
if ! kill -0 "$record_pid" 2>/dev/null; then
osascript -e 'display notification "Failed to start recording" with title "Dictation" sound name "Basso"'
exit 1
fi
echo "$record_pid" > "$PID_FILE"
touch "$LOCK_FILE"
fi
Make it executable:
chmod +x ~/.local/bin/dictate-hotkey
Set Up the Global Hotkey
Option A: Automator + System Shortcuts (built-in)
- Open Automator > New Document > Quick Action
- Set “Workflow receives” to no input in any application
- Add Run Shell Script action
- Paste: `~/.local/bin/dictate-hotkey`
- Save as “Dictate Toggle”
- Open System Settings > Keyboard > Keyboard Shortcuts > Services
- Find “Dictate Toggle” and assign your hotkey (e.g., ``Ctrl+` ``)
Option B: Hammerspoon (more reliable)
If you use Hammerspoon, add to your ~/.hammerspoon/init.lua:
hs.hotkey.bind({"ctrl"}, "`", function()
hs.execute("~/.local/bin/dictate-hotkey", true)
end)
Then reload Hammerspoon config.
Grant Permissions
macOS will prompt for:
- Microphone access - Allow for Terminal/iTerm/Hammerspoon
- Accessibility access - Required for the paste keystroke
Go to System Settings > Privacy & Security to grant these if needed.
Test It
Same as Linux - press hotkey, speak, press again, text appears.
Troubleshooting
SoX not recording:
# List audio devices
rec -l
# Test recording
rec -r 16000 -c 1 /tmp/test.wav
# Ctrl+C to stop
play /tmp/test.wav
Paste not working:
- Make sure Accessibility permissions are granted
- Try the clipboard fallback: the script already runs `pbcopy`, so the transcript is on the clipboard - paste manually with Cmd+V
Part 3: Windows 11
Windows uses AutoHotkey for the global hotkey and ffmpeg for recording. The transcription wrapper runs in Python, same as other platforms.
Install Dependencies
Option A: Using winget (recommended)
Open PowerShell as Administrator:
winget install --id=Gyan.FFmpeg -e
winget install --id=AutoHotkey.AutoHotkey -e
winget install --id=Python.Python.3.12 -e
Option B: Using Scoop
scoop install ffmpeg autohotkey python
Install faster-whisper
# Create venv and install
python -m venv "$env:USERPROFILE\.local\share\whisper-venv"
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\pip.exe" install faster-whisper
Create the Transcription Wrapper
Save this to %USERPROFILE%\.local\bin\whisper-transcribe.py:
"""Local Whisper transcription using faster-whisper."""
import sys
from pathlib import Path
from faster_whisper import WhisperModel
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350
def load_vocab():
if not VOCAB_PATH.exists():
return None
terms = VOCAB_PATH.read_text().strip().splitlines()
if not terms:
return None
prompt_parts, char_count = [], 0
for term in terms:
if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
break
prompt_parts.append(term)
char_count += len(term) + 2
return ", ".join(prompt_parts) if prompt_parts else None
def main():
if len(sys.argv) < 2:
print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
sys.exit(1)
audio_file = sys.argv[1]
initial_prompt = load_vocab()
model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)
kwargs = dict(
beam_size=5,
language="en",
vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=500),
)
if initial_prompt:
kwargs["initial_prompt"] = initial_prompt
segments, info = model.transcribe(audio_file, **kwargs)
transcript = " ".join(segment.text.strip() for segment in segments)
print(transcript)
if __name__ == "__main__":
main()
Create the directories and download the model:
mkdir "$env:USERPROFILE\.local\bin" -Force
# First run downloads the model (~1.5GB)
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\python.exe" "$env:USERPROFILE\.local\bin\whisper-transcribe.py" NUL 2>$null
Find Your Microphone Name
ffmpeg -list_devices true -f dshow -i dummy 2>&1 | Select-String "audio"
Look for a line like:
[dshow] "Microphone Array (Realtek(R) Audio)" (audio)
Copy the name in quotes - you’ll need it for the script below.
Create the AutoHotkey Script
Save this as dictate.ahk somewhere convenient (e.g., Documents\Scripts\dictate.ahk):
#Requires AutoHotkey v2.0
#SingleInstance Force
; Configuration
global LOCK_FILE := A_Temp "\dictate-hotkey.lock"
global PID_FILE := A_Temp "\dictate-hotkey.pid"
global AUDIO_FILE := A_Temp "\dictate-hotkey.wav"
global WHISPER_VENV := EnvGet("USERPROFILE") "\.local\share\whisper-venv\Scripts\python.exe"
global WHISPER_SCRIPT := EnvGet("USERPROFILE") "\.local\bin\whisper-transcribe.py"
; Hotkey: Ctrl+` (backtick) - change this to your preference
^`:: ToggleDictation()
ToggleDictation() {
; Clean up stale lock if recording process died
if FileExist(LOCK_FILE) && FileExist(PID_FILE) {
pid := FileRead(PID_FILE)
if !ProcessExist(pid) {
FileDelete(LOCK_FILE)
FileDelete(PID_FILE)
}
}
if FileExist(LOCK_FILE) {
StopAndTranscribe()
} else {
StartRecording()
}
}
StartRecording() {
; Clean up
if FileExist(AUDIO_FILE)
FileDelete(AUDIO_FILE)
; Create lock file
FileAppend("", LOCK_FILE)
; Show notification
TrayTip("Recording... Press Ctrl+` to stop", "Dictation", "Mute")
; Start ffmpeg recording in background
; IMPORTANT: Replace with YOUR device name from the step above
deviceName := "Microphone Array (Realtek(R) Audio)"
Run('cmd /c ffmpeg -f dshow -i audio="' deviceName '" -ar 16000 -ac 1 -y "' AUDIO_FILE '"', , "Hide", &pid)
; Save PID
FileAppend(pid, PID_FILE)
}
StopAndTranscribe() {
; Remove lock
if FileExist(LOCK_FILE)
FileDelete(LOCK_FILE)
; Kill ffmpeg
if FileExist(PID_FILE) {
pid := FileRead(PID_FILE)
try {
ProcessClose(pid)
}
FileDelete(PID_FILE)
}
; Also kill any lingering ffmpeg
Run('taskkill /f /im ffmpeg.exe 2>nul', , "Hide")
Sleep(500) ; Let file finish writing
if !FileExist(AUDIO_FILE) || FileGetSize(AUDIO_FILE) < 1000 {
TrayTip("No audio recorded", "Dictation Error", "Icon!")
return
}
TrayTip("Transcribing locally...", "Dictation", "Mute")
; Run whisper transcription
tempTranscript := A_Temp "\transcript.txt"
cmd := '"' WHISPER_VENV '" "' WHISPER_SCRIPT '" "' AUDIO_FILE '" > "' tempTranscript '" 2>nul'
RunWait(A_ComSpec ' /c ' cmd, , "Hide")
transcript := ""
if FileExist(tempTranscript) {
transcript := Trim(FileRead(tempTranscript), "`r`n")
FileDelete(tempTranscript)
}
FileDelete(AUDIO_FILE)
if (transcript = "") {
TrayTip("Transcription failed", "Dictation Error", "Icon!")
return
}
; Type the result using clipboard + paste (most reliable)
A_Clipboard := transcript
Sleep(100)
Send("^v")
TrayTip("Typed: " SubStr(transcript, 1, 50) "...", "Dictation", "Mute")
}
Important: Update the deviceName variable with your actual microphone name from the previous step.
Run on Startup (Optional)
- Press `Win+R`, type `shell:startup`, press Enter
- Create a shortcut to your `dictate.ahk` file in this folder
Test It
- Double-click `dictate.ahk` to run it (a green H icon appears in the system tray)
- Open any text field
- Press ``Ctrl+` `` - tray notification shows “Recording…”
- Speak clearly
- Press ``Ctrl+` `` again - “Transcribing locally…” then text appears
Note: The first transcription will be slow (~30-60 seconds) as the model loads. Subsequent transcriptions will be faster (~6-8 seconds for 15 seconds of audio).
Troubleshooting
“No audio recorded” error:
- Check that the `deviceName` variable in the script matches your actual microphone
- Test ffmpeg recording manually: `ffmpeg -f dshow -i audio="Your Microphone Name" -t 3 -y test.wav`
- Make sure the microphone isn’t muted in Windows Sound settings
ffmpeg not found:
- Restart your terminal/PowerShell after installing
- Check it’s in PATH: `where ffmpeg`
AutoHotkey script won’t run:
- Make sure you have AutoHotkey v2 installed (not v1)
- Right-click the `.ahk` file > “Run as administrator” if needed
Hotkey conflict:
- Change ``^` `` to something else in the script (look for the hotkey definition), e.g.:
  - `^+d` for Ctrl+Shift+D
  - `#d` for Win+D
  - `!d` for Alt+D
Model download fails:
- Ensure you have ~3GB free disk space
- Check Python and pip are working: `python --version`
Custom Vocabulary (Optional)
Whisper’s initial_prompt parameter biases the decoder toward specific words. If you frequently dictate domain-specific terms - project names, technical jargon, people’s names - you can improve accuracy by feeding Whisper a vocabulary list.
How It Works
The whisper-transcribe script loads vocabulary from ~/.local/share/whisper-vocab.txt (or %USERPROFILE%\.local\share\whisper-vocab.txt on Windows) if it exists. The file format is simple: one term per line, sorted by frequency (most common first). The script takes the top ~40-60 terms that fit in Whisper’s 350-character prompt budget.
Example vocabulary file:
TrueNAS
Syncthing
Kubernetes
Obsidian
PostgreSQL
FastAPI
Creating Your Vocabulary File
Simple approach: Manually list terms you commonly dictate:
cat > ~/.local/share/whisper-vocab.txt << 'EOF'
YourProjectName
TechnicalTerm1
TechnicalTerm2
SomeFramework
EOF
Automated approach: If you keep notes in markdown (Obsidian, Logseq, etc.), you can extract vocabulary programmatically:
- Scan all `.md` files for capitalised words and wiki-links
- Filter against a dictionary (`/usr/share/dict/words` on Linux) to isolate non-common terms
- Sort by frequency and output to the vocab file
I wrote a Python script that does this for my Obsidian vault - it extracts ~3000 terms, and Whisper loads the top 40-50 at transcription time. The improvement is subtle but noticeable for proper nouns.
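The extraction steps above can be sketched in a short script. Everything here is illustrative, not my actual script: the notes directory, the output path, and the regexes are assumptions you'd adapt to your own vault.

```python
"""Sketch: build a Whisper vocabulary file from markdown notes."""
import re
from collections import Counter
from pathlib import Path

NOTES_DIR = Path.home() / "notes"          # adjust to your vault location
DICT_PATH = Path("/usr/share/dict/words")  # common-word filter (Linux path)
VOCAB_OUT = Path.home() / ".local/share/whisper-vocab.txt"


def extract_terms(text: str) -> list[str]:
    """Pull candidate terms: [[wiki-link]] targets and CapitalisedWords."""
    terms = re.findall(r"\[\[([^\]|#]+)", text)             # wiki-link targets
    terms += re.findall(r"\b[A-Z][A-Za-z0-9]{2,}\b", text)  # capitalised words
    return terms


def build_vocab() -> list[str]:
    """Count terms across all notes, dropping ordinary dictionary words."""
    common = set()
    if DICT_PATH.exists():
        common = {w.strip().lower() for w in DICT_PATH.read_text().splitlines()}
    counts = Counter()
    for md in NOTES_DIR.rglob("*.md"):
        for term in extract_terms(md.read_text(errors="ignore")):
            if term.lower() not in common:
                counts[term] += 1
    # Most frequent first, so the transcriber's char budget keeps the best terms
    return [term for term, _ in counts.most_common()]


if __name__ == "__main__":
    vocab = build_vocab()
    VOCAB_OUT.write_text("\n".join(vocab) + "\n")
    print(f"Wrote {len(vocab)} terms to {VOCAB_OUT}")
```

Re-run it whenever your notes change; the transcription wrapper reads the refreshed file on its next run, so no restart is needed.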
Constraints
- Token budget: Whisper’s decoder context is 448 tokens. The `initial_prompt` should use at most half (~224 tokens ≈ 350 characters). The script enforces this.
- Acronyms are low yield: Whisper handles common acronyms fine from training data. Focus on proper nouns and domain-specific terms.
- Don’t use `hotwords` with `initial_prompt`: in faster-whisper 1.2.1, combining both causes repetition artifacts. Stick to `initial_prompt` only.
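To see the budget enforcement in isolation, here is a standalone sketch of the same truncation logic used by the wrapper (the function name is mine, for illustration):

```python
PROMPT_MAX_CHARS = 350  # the wrapper's character budget


def build_prompt(terms, max_chars=PROMPT_MAX_CHARS):
    """Join terms with ", " until adding one more would exceed the budget."""
    parts, count = [], 0
    for term in terms:
        if count + len(term) + 2 > max_chars:  # +2 for the ", " separator
            break
        parts.append(term)
        count += len(term) + 2
    return ", ".join(parts) if parts else None


# 200 candidate terms get cut down to whatever fits in 350 characters;
# the resulting prompt length is always <= the budget
prompt = build_prompt([f"Term{i}" for i in range(200)])
print(len(prompt))
```

Because the vocab file is sorted by frequency, truncation always drops the rarest terms first.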
What’s Next
This is a minimal implementation. Possible improvements:
- Hold-to-talk instead of toggle (harder with global shortcuts)
- Visual indicator showing recording state in system tray
- Multiple languages (Whisper supports many)
- GPU acceleration (significantly faster if you have a compatible NVIDIA GPU)
For now, the toggle works well enough. The cognitive overhead of “press, speak, press” is minimal once you’ve done it a few times.
Questions or improvements? Find me at hedwards.dev.