Paste this URL into Claude Code and tell it to set this up for you.


What This Is

Press a hotkey, speak, text appears wherever your cursor is. Any app - terminal, browser, notes, whatever.

This uses Whisper, an open-source speech recognition model that OpenAI released in 2022. The key thing: it runs entirely on your machine. Your voice never leaves your computer. No cloud API, no account, no subscription, no privacy policy to read. The model downloads once (~1.5GB), then everything is local forever.

Accuracy is excellent - comparable to commercial cloud services. On a modern CPU (i5/i7/Ryzen from the last few years), a 15-second dictation transcribes in about 6-8 seconds. You can also feed it a custom vocabulary of project names, technical terms, and proper nouns to improve recognition of domain-specific words.

Jump to: Linux · macOS · Windows · Custom Vocabulary


Prerequisites

All platforms:

  • A decent CPU (any modern i5/i7/Ryzen 5+ works fine)
  • Python 3.10+
  • ~1.5GB disk space for the model

Platform-specific:

  • Linux (Kubuntu/KDE): PipeWire audio, ydotool for typing
  • macOS: SoX for recording, clipboard + paste for typing
  • Windows 11: ffmpeg for recording, AutoHotkey for hotkey + typing

Part 1: Linux (Kubuntu / KDE Plasma)

This is what I use. No network dependency, no API costs, excellent accuracy.

Install faster-whisper

# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv

# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper

Create the Transcription Wrapper

Save this to ~/.local/bin/whisper-transcribe:

#!/home/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""

import sys
from pathlib import Path
from faster_whisper import WhisperModel

# distil-large-v3: good balance of speed and accuracy. Use "large-v3" for best accuracy.
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

# Optional: custom vocabulary file (one term per line, sorted by frequency)
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350  # ~40-60 terms fit in Whisper's 224-token prompt budget


def load_vocab():
    """Load custom vocabulary for Whisper biasing via initial_prompt."""
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    # Take top terms that fit in the char budget
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)

    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt

    segments, info = model.transcribe(audio_file, **kwargs)

    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()

Important: Replace YOUR_USERNAME with your actual username in the shebang line.

Make it executable and download the model:

chmod +x ~/.local/bin/whisper-transcribe

# First run downloads the model (~1.5GB) - takes a few minutes
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true

Model Options

| Model           | Speed          | Accuracy  | Notes                       |
|-----------------|----------------|-----------|-----------------------------|
| large-v3        | ~0.4x realtime | Best      | Most accurate               |
| distil-large-v3 | ~2x realtime   | ~1% worse | What I use - good trade-off |
| medium          | ~3x realtime   | Good      | Lighter on resources        |

For a 15-second dictation, distil-large-v3 takes about 6-8 seconds on a modern CPU.
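The timing claim follows directly from the realtime factors in the table: at Nx realtime, transcription takes the audio length divided by N. A quick back-of-envelope check (the factors are the approximate values above, so actual times will vary by CPU):

```python
# Estimate transcription latency from a model's approximate realtime factor.
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """At Nx realtime, transcription time is audio length divided by N."""
    return audio_seconds / realtime_factor

# Approximate factors from the table above
for model, factor in [("large-v3", 0.4), ("distil-large-v3", 2.0), ("medium", 3.0)]:
    print(f"{model}: ~{transcription_seconds(15, factor):.1f}s for a 15 s clip")
```

For distil-large-v3 this gives ~7.5 s, consistent with the 6-8 second range observed in practice.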

Create the Dictation Script

Save this to ~/.local/bin/dictate-hotkey:

#!/bin/bash
# Global hotkey dictation script for Wayland/KDE
# Uses local Whisper (faster-whisper) for transcription

LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"

if [[ ! -x "$WHISPER_BIN" ]]; then
    notify-send -u critical "Dictation" "whisper-transcribe not found"
    exit 1
fi

cleanup_stale_lock() {
    if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
        local pid
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$LOCK_FILE" "$PID_FILE"
            return 0
        fi
    elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
        rm -f "$LOCK_FILE"
        return 0
    fi
    return 1
}

cleanup_stale_lock

if [[ -f "$LOCK_FILE" ]]; then
    # STOP recording and transcribe
    if [[ -f "$PID_FILE" ]]; then
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]]; then
            kill "$pid" 2>/dev/null || true
            sleep 0.3
        fi
        rm -f "$PID_FILE"
    fi

    rm -f "$LOCK_FILE"

    if [[ ! -f "$AUDIO_FILE" ]]; then
        notify-send -u critical "Dictation" "No audio file found"
        exit 1
    fi

    audio_size=$(stat -c%s "$AUDIO_FILE" 2>/dev/null || echo "0")
    if [[ "$audio_size" -lt 1000 ]]; then
        notify-send -u critical "Dictation" "Audio too short"
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    notify-send -t 1500 "Dictation" "Transcribing locally..."

    TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)

    if [[ -z "$TRANSCRIPT" ]]; then
        cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
        notify-send -u critical "Dictation" "Transcription failed"
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    rm -f "$AUDIO_FILE"
    notify-send -t 1500 "Dictation" "Typing: ${TRANSCRIPT:0:50}..."

    sleep 0.1
    if ! echo -n "$TRANSCRIPT" | ydotool type --file -; then
        if [[ -n "$WAYLAND_DISPLAY" ]]; then
            echo -n "$TRANSCRIPT" | wl-copy
            notify-send -t 2000 "Dictation" "Copied to clipboard (ydotool failed)"
        fi
    fi

else
    # START recording
    rm -f "$AUDIO_FILE"

    pw-record --channels=2 "$AUDIO_FILE" &
    record_pid=$!

    sleep 0.2
    if ! kill -0 "$record_pid" 2>/dev/null; then
        notify-send -u critical "Dictation" "Failed to start recording"
        exit 1
    fi

    echo "$record_pid" > "$PID_FILE"
    touch "$LOCK_FILE"

    notify-send -t 2000 "Dictation" "Recording... Press hotkey to stop"
fi

Make it executable:

chmod +x ~/.local/bin/dictate-hotkey

Set Up ydotool Permissions

sudo apt install ydotool wl-clipboard
sudo usermod -aG input $USER

Log out and back in for the group change to take effect.

Set Up the Global Hotkey

  1. Open System Settings > Shortcuts > Custom Shortcuts
  2. Click Edit > New > Global Shortcut > Command/URL
  3. Name it “Dictate”
  4. Trigger tab: Set your preferred hotkey (I use Ctrl+`)
  5. Action tab: Enter /home/YOUR_USERNAME/.local/bin/dictate-hotkey

Test It

  1. Open any text input
  2. Press your hotkey - “Recording…” notification
  3. Speak clearly
  4. Press hotkey again - “Transcribing locally…” then text appears

Troubleshooting

No audio captured (empty file):

# Check your audio sources
wpctl status

# Test recording manually (should produce >44 bytes)
pw-record --channels=2 /tmp/test.wav &
sleep 2
kill %1
ls -la /tmp/test.wav

If pw-record produces empty files, check that your microphone is:

  • Not muted in system audio settings
  • Set as the default input source

Important: Use pw-record, not parecord. On PipeWire systems, parecord often produces empty files even though the PulseAudio compatibility layer is installed. The native pw-record --channels=2 works reliably.

ydotool permission denied:

  • Make sure you logged out and back in after adding yourself to the input group
  • Check: groups should show input

Hotkey not triggering:

  • KDE sometimes needs a restart of the shortcuts daemon
  • Try: Log out and back in, or restart Plasma (the shortcut daemon restarts with it)

Script seems stuck (pressing hotkey does nothing useful): The script auto-recovers from stale state, but if something is really stuck:

rm -f /tmp/dictate-hotkey.lock /tmp/dictate-hotkey.pid
pkill pw-record

Part 2: macOS (Sequoia 15.x)

Install Dependencies

# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install SoX for recording
brew install sox

Install faster-whisper

# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv

# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper

# Create bin directory if needed
mkdir -p ~/.local/bin

Create the Transcription Wrapper

Save this to ~/.local/bin/whisper-transcribe:

#!/Users/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""

import sys
from pathlib import Path
from faster_whisper import WhisperModel

MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350


def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)

    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt

    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()

Important: Replace YOUR_USERNAME with your actual username in the shebang line.

Make it executable and download the model:

chmod +x ~/.local/bin/whisper-transcribe

# First run downloads the model (~1.5GB)
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true

Create the Dictation Script

Save this to ~/.local/bin/dictate-hotkey:

#!/bin/bash
# Global hotkey dictation script for macOS
# Uses local Whisper (faster-whisper) for transcription

LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"

if [[ ! -x "$WHISPER_BIN" ]]; then
    osascript -e 'display notification "whisper-transcribe not found" with title "Dictation" sound name "Basso"'
    exit 1
fi

cleanup_stale_lock() {
    if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
        local pid
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$LOCK_FILE" "$PID_FILE"
            return 0
        fi
    elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
        rm -f "$LOCK_FILE"
        return 0
    fi
    return 1
}

cleanup_stale_lock

if [[ -f "$LOCK_FILE" ]]; then
    # STOP recording and transcribe
    if [[ -f "$PID_FILE" ]]; then
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]]; then
            kill "$pid" 2>/dev/null || true
            sleep 0.3
        fi
        rm -f "$PID_FILE"
    fi

    rm -f "$LOCK_FILE"

    if [[ ! -f "$AUDIO_FILE" ]]; then
        osascript -e 'display notification "No audio file found" with title "Dictation" sound name "Basso"'
        exit 1
    fi

    audio_size=$(stat -f%z "$AUDIO_FILE" 2>/dev/null || echo "0")
    if [[ "$audio_size" -lt 1000 ]]; then
        osascript -e 'display notification "Audio too short" with title "Dictation" sound name "Basso"'
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    osascript -e 'display notification "Transcribing locally..." with title "Dictation"'

    TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)

    if [[ -z "$TRANSCRIPT" ]]; then
        cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
        osascript -e 'display notification "Transcription failed" with title "Dictation" sound name "Basso"'
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    rm -f "$AUDIO_FILE"

    # Copy to clipboard and paste
    echo -n "$TRANSCRIPT" | pbcopy
    osascript -e 'tell application "System Events" to keystroke "v" using command down'

    osascript -e "display notification \"Typed: ${TRANSCRIPT:0:50}...\" with title \"Dictation\""

else
    # START recording
    rm -f "$AUDIO_FILE"

    osascript -e 'display notification "Recording... Press hotkey to stop" with title "Dictation"'

    rec -q -r 16000 -c 1 "$AUDIO_FILE" &
    record_pid=$!

    sleep 0.2
    if ! kill -0 "$record_pid" 2>/dev/null; then
        osascript -e 'display notification "Failed to start recording" with title "Dictation" sound name "Basso"'
        exit 1
    fi

    echo "$record_pid" > "$PID_FILE"
    touch "$LOCK_FILE"
fi

Make it executable:

chmod +x ~/.local/bin/dictate-hotkey

Set Up the Global Hotkey

Option A: Automator + System Shortcuts (built-in)

  1. Open Automator > New Document > Quick Action
  2. Set “Workflow receives” to no input in any application
  3. Add Run Shell Script action
  4. Paste: ~/.local/bin/dictate-hotkey
  5. Save as “Dictate Toggle”
  6. Open System Settings > Keyboard > Keyboard Shortcuts > Services
  7. Find “Dictate Toggle” and assign your hotkey (e.g., Ctrl+`)

Option B: Hammerspoon (more reliable)

If you use Hammerspoon, add to your ~/.hammerspoon/init.lua:

hs.hotkey.bind({"ctrl"}, "`", function()
    hs.execute("~/.local/bin/dictate-hotkey", true)
end)

Then reload Hammerspoon config.

Grant Permissions

macOS will prompt for:

  • Microphone access - Allow for Terminal/iTerm/Hammerspoon
  • Accessibility access - Required for the paste keystroke

Go to System Settings > Privacy & Security to grant these if needed.

Test It

Same as Linux - press hotkey, speak, press again, text appears.

Troubleshooting

SoX not recording:

# List audio input devices (SoX has no device-listing flag; use system_profiler)
system_profiler SPAudioDataType

# Test recording
rec -r 16000 -c 1 /tmp/test.wav
# Ctrl+C to stop
play /tmp/test.wav

Paste not working:

  • Make sure Accessibility permissions are granted
  • Try the clipboard fallback: just pbcopy and manually Cmd+V

Part 3: Windows 11

Windows uses AutoHotkey for the global hotkey and ffmpeg for recording. The transcription wrapper runs in Python, same as other platforms.

Install Dependencies

Option A: Using winget (recommended)

Open PowerShell as Administrator:

winget install --id=Gyan.FFmpeg -e
winget install --id=AutoHotkey.AutoHotkey -e
winget install --id=Python.Python.3.12 -e

Option B: Using Scoop

scoop install ffmpeg autohotkey python

Install faster-whisper

# Create venv and install
python -m venv "$env:USERPROFILE\.local\share\whisper-venv"
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\pip.exe" install faster-whisper

Create the Transcription Wrapper

Save this to %USERPROFILE%\.local\bin\whisper-transcribe.py:

"""Local Whisper transcription using faster-whisper."""

import sys
from pathlib import Path
from faster_whisper import WhisperModel

MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350


def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)

    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt

    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()

Create the directories and download the model:

mkdir "$env:USERPROFILE\.local\bin" -Force

# First run downloads the model (~1.5GB)
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\python.exe" "$env:USERPROFILE\.local\bin\whisper-transcribe.py" NUL 2>$null

Find Your Microphone Name

ffmpeg -list_devices true -f dshow -i dummy 2>&1 | Select-String "audio"

Look for a line like:

[dshow] "Microphone Array (Realtek(R) Audio)" (audio)

Copy the name in quotes - you’ll need it for the script below.

Create the AutoHotkey Script

Save this as dictate.ahk somewhere convenient (e.g., Documents\Scripts\dictate.ahk):

#Requires AutoHotkey v2.0
#SingleInstance Force

; Configuration
global LOCK_FILE := A_Temp "\dictate-hotkey.lock"
global PID_FILE := A_Temp "\dictate-hotkey.pid"
global AUDIO_FILE := A_Temp "\dictate-hotkey.wav"
global WHISPER_VENV := EnvGet("USERPROFILE") "\.local\share\whisper-venv\Scripts\python.exe"
global WHISPER_SCRIPT := EnvGet("USERPROFILE") "\.local\bin\whisper-transcribe.py"

; Hotkey: Ctrl+` (backtick) - change this to your preference
^``:: ToggleDictation()  ; backtick is AHK's escape char, so it must be doubled

ToggleDictation() {
    ; Clean up stale lock if recording process died
    if FileExist(LOCK_FILE) && FileExist(PID_FILE) {
        pid := FileRead(PID_FILE)
        if !ProcessExist(pid) {
            FileDelete(LOCK_FILE)
            FileDelete(PID_FILE)
        }
    }

    if FileExist(LOCK_FILE) {
        StopAndTranscribe()
    } else {
        StartRecording()
    }
}

StartRecording() {
    ; Clean up
    if FileExist(AUDIO_FILE)
        FileDelete(AUDIO_FILE)

    ; Create lock file
    FileAppend("", LOCK_FILE)

    ; Show notification
    TrayTip("Recording... Press Ctrl+`` to stop", "Dictation", "Mute")

    ; Start ffmpeg recording in background
    ; IMPORTANT: Replace with YOUR device name from the step above
    deviceName := "Microphone Array (Realtek(R) Audio)"
    Run('cmd /c ffmpeg -f dshow -i audio="' deviceName '" -ar 16000 -ac 1 -y "' AUDIO_FILE '"', , "Hide", &pid)

    ; Save PID
    FileAppend(pid, PID_FILE)
}

StopAndTranscribe() {
    ; Remove lock
    if FileExist(LOCK_FILE)
        FileDelete(LOCK_FILE)

    ; Kill ffmpeg
    if FileExist(PID_FILE) {
        pid := FileRead(PID_FILE)
        try {
            ProcessClose(pid)
        }
        FileDelete(PID_FILE)
    }

    ; Also kill any lingering ffmpeg
    Run(A_ComSpec ' /c taskkill /f /im ffmpeg.exe >nul 2>&1', , "Hide")
    Sleep(500)  ; Let file finish writing

    if !FileExist(AUDIO_FILE) || FileGetSize(AUDIO_FILE) < 1000 {
        TrayTip("No audio recorded", "Dictation Error", "Icon!")
        return
    }

    TrayTip("Transcribing locally...", "Dictation", "Mute")

    ; Run whisper transcription
    tempTranscript := A_Temp "\transcript.txt"

    cmd := '"' WHISPER_VENV '" "' WHISPER_SCRIPT '" "' AUDIO_FILE '" > "' tempTranscript '" 2>nul'
    RunWait(A_ComSpec ' /c ' cmd, , "Hide")

    transcript := ""
    if FileExist(tempTranscript) {
        transcript := Trim(FileRead(tempTranscript), "`r`n")
        FileDelete(tempTranscript)
    }
    FileDelete(AUDIO_FILE)

    if (transcript = "") {
        TrayTip("Transcription failed", "Dictation Error", "Icon!")
        return
    }

    ; Type the result using clipboard + paste (most reliable)
    A_Clipboard := transcript
    Sleep(100)
    Send("^v")

    TrayTip("Typed: " SubStr(transcript, 1, 50) "...", "Dictation", "Mute")
}

Important: Update the deviceName variable with your actual microphone name from the previous step.

Run on Startup (Optional)

  1. Press Win+R, type shell:startup, press Enter
  2. Create a shortcut to your dictate.ahk file in this folder

Test It

  1. Double-click dictate.ahk to run it (green H icon appears in system tray)
  2. Open any text field
  3. Press Ctrl+` - tray notification shows “Recording…”
  4. Speak clearly
  5. Press Ctrl+` again - “Transcribing locally…” then text appears

Note: The first transcription will be slow (~30-60 seconds) as the model loads. Subsequent transcriptions will be faster (~6-8 seconds for 15 seconds of audio).

Troubleshooting

“No audio recorded” error:

  • Check the deviceName variable in the script matches your actual microphone
  • Test ffmpeg recording manually:
    ffmpeg -f dshow -i audio="Your Microphone Name" -t 3 -y test.wav
    
  • Make sure microphone isn’t muted in Windows Sound settings

ffmpeg not found:

  • Restart your terminal/PowerShell after installing
  • Check it’s in PATH: where ffmpeg

AutoHotkey script won’t run:

  • Make sure you have AutoHotkey v2 installed (not v1)
  • Right-click the .ahk file > “Run as administrator” if needed

Hotkey conflict:

  • Change ^` to something else in the script (look for the hotkey definition), e.g.:
    • ^+d for Ctrl+Shift+D
    • #d for Win+D
    • !d for Alt+D

Model download fails:

  • Ensure you have ~3GB free disk space
  • Check Python and pip are working: python --version

Custom Vocabulary (Optional)

Whisper’s initial_prompt parameter biases the decoder toward specific words. If you frequently dictate domain-specific terms - project names, technical jargon, people’s names - you can improve accuracy by feeding Whisper a vocabulary list.

How It Works

The whisper-transcribe script loads vocabulary from ~/.local/share/whisper-vocab.txt (or %USERPROFILE%\.local\share\whisper-vocab.txt on Windows) if it exists. The file format is simple: one term per line, sorted by frequency (most common first). The script takes the top ~40-60 terms that fit in Whisper’s 350-character prompt budget.

Example vocabulary file:

TrueNAS
Syncthing
Kubernetes
Obsidian
PostgreSQL
FastAPI
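To see why roughly 40-60 terms fit, here is the same greedy packing the wrapper's load_vocab() performs, run against a synthetic list of 5-character terms (the term length is an illustrative assumption; your real terms will vary):

```python
# Greedy packing: take terms in order until the comma-separated prompt
# would exceed the character budget. Mirrors load_vocab() in the wrapper.
PROMPT_MAX_CHARS = 350

def pack_terms(terms, budget=PROMPT_MAX_CHARS):
    picked, used = [], 0
    for term in terms:
        if used + len(term) + 2 > budget:  # +2 for the ", " separator
            break
        picked.append(term)
        used += len(term) + 2
    return picked

terms = [f"T{i:04d}" for i in range(3000)]  # synthetic terms, 5 chars each
print(len(pack_terms(terms)))  # → 50 terms fit the 350-char budget
```

Shorter terms pack more densely and longer ones less, which is where the ~40-60 range comes from.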

Creating Your Vocabulary File

Simple approach: Manually list terms you commonly dictate:

cat > ~/.local/share/whisper-vocab.txt << 'EOF'
YourProjectName
TechnicalTerm1
TechnicalTerm2
SomeFramework
EOF

Automated approach: If you keep notes in markdown (Obsidian, Logseq, etc.), you can extract vocabulary programmatically:

  1. Scan all .md files for capitalised words and wiki-links
  2. Filter against a dictionary (/usr/share/dict/words on Linux) to isolate non-common terms
  3. Sort by frequency and output to the vocab file
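The three steps above can be sketched as follows. This is a minimal version, not the exact script I use: NOTES_DIR, the regexes, and the output path are illustrative assumptions you should adapt to your own vault.

```python
"""Sketch: extract candidate vocabulary terms from a markdown notes folder."""
import re
from collections import Counter
from pathlib import Path

NOTES_DIR = Path.home() / "notes"          # assumption: your vault location
DICT_FILE = Path("/usr/share/dict/words")  # common wordlist on Linux
VOCAB_OUT = Path.home() / ".local/share/whisper-vocab.txt"

def extract_terms(notes_dir: Path) -> Counter:
    """Count [[wiki-links]] and capitalised words across all .md files."""
    counts = Counter()
    for md in notes_dir.rglob("*.md"):
        text = md.read_text(errors="ignore")
        counts.update(re.findall(r"\[\[([^\]|#]+)", text))           # wiki-links
        counts.update(re.findall(r"\b[A-Z][A-Za-z0-9]{2,}\b", text))  # CapWords
    return counts

def build_vocab() -> None:
    """Drop common dictionary words, sort by frequency, write one term per line."""
    common = set()
    if DICT_FILE.exists():
        common = {w.strip().lower() for w in DICT_FILE.read_text().splitlines()}
    counts = extract_terms(NOTES_DIR)
    terms = [t for t, _ in counts.most_common() if t.lower() not in common]
    VOCAB_OUT.parent.mkdir(parents=True, exist_ok=True)
    VOCAB_OUT.write_text("\n".join(terms) + "\n")

if __name__ == "__main__":
    if NOTES_DIR.is_dir():
        build_vocab()
```

Run it whenever your notes change; the wrapper picks up the new vocab file on the next dictation automatically.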

I wrote a Python script that does this for my Obsidian vault - it extracts ~3000 terms, and Whisper loads the top 40-50 at transcription time. The improvement is subtle but noticeable for proper nouns.

Constraints

  • Token budget: Whisper’s decoder context is 448 tokens. The initial_prompt should use at most half (~224 tokens ≈ 350 characters). The script enforces this.
  • Acronyms are low yield: Whisper handles common acronyms fine from training data. Focus on proper nouns and domain-specific terms.
  • Don’t use hotwords with initial_prompt: In faster-whisper 1.2.1, combining both causes repetition artifacts. Stick to initial_prompt only.

What’s Next

This is a minimal implementation. Possible improvements:

  • Hold-to-talk instead of toggle (harder with global shortcuts)
  • Visual indicator showing recording state in system tray
  • Multiple languages (Whisper supports many)
  • GPU acceleration (significantly faster if you have a compatible NVIDIA GPU)

For now, the toggle works well enough. The cognitive overhead of “press, speak, press” is minimal once you’ve done it a few times.


Questions or improvements? Find me at hedwards.dev.