Paste this URL into Claude Code and tell it to set this up for you.


What This Is

Press a hotkey, speak, text appears wherever your cursor is. Any app - terminal, browser, notes, whatever.

This uses Whisper, an open-source speech recognition model that OpenAI released in 2022. The key thing: it runs entirely on your machine. Your voice never leaves your computer. No cloud API, no account, no subscription, no privacy policy to read. The model downloads once (~1.5GB), then everything is local forever.

Accuracy is excellent - comparable to commercial cloud services. On a modern CPU (i5/i7/Ryzen from the last few years), a 15-second dictation transcribes in about 6-8 seconds. You can also feed it a custom vocabulary of project names, technical terms, and proper nouns to improve recognition of domain-specific words.

Jump to: Linux · macOS · Windows · Custom Vocabulary · Speaker Diarisation


Prerequisites

All platforms:

  • A decent CPU (any modern i5/i7/Ryzen 5+ works fine)
  • Python 3.10+
  • ~1.5GB disk space for the model

Platform-specific:

  • Linux (Kubuntu/KDE): PipeWire audio, ydotool for typing
  • macOS: SoX for recording, clipboard + paste for typing
  • Windows 11: ffmpeg for recording, AutoHotkey for hotkey + typing

Part 1: Linux (Kubuntu / KDE Plasma)

This is what I use. No network dependency, no API costs, excellent accuracy.

Install faster-whisper

# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv

# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper

Create the Transcription Wrapper

Save this to ~/.local/bin/whisper-transcribe:

#!/home/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""

import sys
from pathlib import Path
from faster_whisper import WhisperModel

# distil-large-v3: good balance of speed and accuracy. Use "large-v3" for best accuracy.
MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

# Optional: custom vocabulary file (one term per line, sorted by frequency)
VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350  # ~40-60 terms fit in Whisper's 224-token prompt budget


def load_vocab():
    """Load custom vocabulary for Whisper biasing via initial_prompt."""
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    # Take top terms that fit in the char budget
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)

    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt

    segments, info = model.transcribe(audio_file, **kwargs)

    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()

Important: Replace YOUR_USERNAME with your actual username in the shebang line.

Make it executable and download the model:

chmod +x ~/.local/bin/whisper-transcribe

# First run downloads the model (~1.5GB) - takes a few minutes
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true

Model Options

Model             Speed             Accuracy    Notes
large-v3          ~0.4x realtime    Best        Most accurate
distil-large-v3   ~2x realtime      ~1% worse   What I use - good trade-off
medium            ~3x realtime      Good        Lighter on resources

For a 15-second dictation, distil-large-v3 takes about 6-8 seconds on a modern CPU.

Create the Dictation Script

Save this to ~/.local/bin/dictate-hotkey:

#!/bin/bash
# Global hotkey dictation script for Wayland/KDE
# Uses local Whisper (faster-whisper) for transcription

LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"

if [[ ! -x "$WHISPER_BIN" ]]; then
    notify-send -u critical "Dictation" "whisper-transcribe not found"
    exit 1
fi

cleanup_stale_lock() {
    if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
        local pid
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$LOCK_FILE" "$PID_FILE"
            return 0
        fi
    elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
        rm -f "$LOCK_FILE"
        return 0
    fi
    return 1
}

cleanup_stale_lock

if [[ -f "$LOCK_FILE" ]]; then
    # STOP recording and transcribe
    if [[ -f "$PID_FILE" ]]; then
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]]; then
            kill -INT "$pid" 2>/dev/null || true
            for i in {1..10}; do
                kill -0 "$pid" 2>/dev/null || break
                sleep 0.1
            done
            kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
        fi
        rm -f "$PID_FILE"
    fi

    rm -f "$LOCK_FILE"

    if [[ ! -f "$AUDIO_FILE" ]]; then
        notify-send -u critical "Dictation" "No audio file found"
        exit 1
    fi

    audio_size=$(stat -c%s "$AUDIO_FILE" 2>/dev/null || echo "0")
    if [[ "$audio_size" -lt 1000 ]]; then
        notify-send -u critical "Dictation" "Audio too short"
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    notify-send -t 1500 "Dictation" "Transcribing locally..."

    TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)

    if [[ -z "$TRANSCRIPT" ]]; then
        cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
        notify-send -u critical "Dictation" "Transcription failed"
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    rm -f "$AUDIO_FILE"
    notify-send -t 1500 "Dictation" "Typing: ${TRANSCRIPT:0:50}..."

    sleep 0.1
    if ! echo -n "$TRANSCRIPT" | ydotool type --file -; then
        if [[ -n "$WAYLAND_DISPLAY" ]]; then
            echo -n "$TRANSCRIPT" | wl-copy
            notify-send -t 2000 "Dictation" "Copied to clipboard (ydotool failed)"
        fi
    fi

else
    # START recording
    rm -f "$AUDIO_FILE"

    pw-record --channels=1 "$AUDIO_FILE" &
    record_pid=$!

    sleep 0.2
    if ! kill -0 "$record_pid" 2>/dev/null; then
        notify-send -u critical "Dictation" "Failed to start recording"
        exit 1
    fi

    touch "$LOCK_FILE"
    echo "$record_pid" > "$PID_FILE"

    notify-send -t 2000 "Dictation" "Recording... Press hotkey to stop"
fi

Make it executable:

chmod +x ~/.local/bin/dictate-hotkey

Set Up ydotool Permissions

sudo apt install ydotool wl-clipboard
sudo usermod -aG input $USER

Log out and back in for the group change to take effect.

Set Up the Global Hotkey

  1. Open System Settings > Shortcuts > Custom Shortcuts
  2. Click Edit > New > Global Shortcut > Command/URL
  3. Name it “Dictate”
  4. Trigger tab: Set your preferred hotkey (I use Ctrl+`)
  5. Action tab: Enter /home/YOUR_USERNAME/.local/bin/dictate-hotkey

Test It

  1. Open any text input
  2. Press your hotkey - “Recording…” notification
  3. Speak clearly
  4. Press hotkey again - “Transcribing locally…” then text appears

Troubleshooting

No audio captured (empty file):

# Check your audio sources
wpctl status

# Test recording manually (should produce >44 bytes)
pw-record --channels=1 /tmp/test.wav &
sleep 2
kill %1
ls -la /tmp/test.wav

If pw-record produces empty files, check that your microphone is:

  • Not muted in system audio settings
  • Set as the default input source

Important: Use pw-record, not parecord. On PipeWire systems, parecord often produces empty files even though the PulseAudio compatibility layer is installed. The native pw-record --channels=1 works reliably.

ydotool permission denied:

  • Make sure you logged out and back in after adding yourself to the input group
  • Check: groups should show input

Hotkey not triggering:

  • KDE sometimes needs a restart of the shortcuts daemon
  • Try: Log out and back in, or restart Plasma (the shortcut daemon restarts with it)

Script seems stuck (pressing hotkey does nothing useful): The script auto-recovers from stale state, but if something is really stuck:

rm -f /tmp/dictate-hotkey.lock /tmp/dictate-hotkey.pid
pkill pw-record

Part 2: macOS (Sequoia 15.x)

Install Dependencies

# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install SoX for recording
brew install sox

Install faster-whisper

# Create a dedicated venv
python3 -m venv ~/.local/share/whisper-venv

# Install faster-whisper
~/.local/share/whisper-venv/bin/pip install faster-whisper

# Create bin directory if needed
mkdir -p ~/.local/bin

Create the Transcription Wrapper

Save this to ~/.local/bin/whisper-transcribe:

#!/Users/YOUR_USERNAME/.local/share/whisper-venv/bin/python3
"""Local Whisper transcription using faster-whisper."""

import sys
from pathlib import Path
from faster_whisper import WhisperModel

MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350


def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)

    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt

    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()

Important: Replace YOUR_USERNAME with your actual username in the shebang line.

Make it executable and download the model:

chmod +x ~/.local/bin/whisper-transcribe

# First run downloads the model (~1.5GB)
~/.local/bin/whisper-transcribe /dev/null 2>/dev/null || true

Create the Dictation Script

Save this to ~/.local/bin/dictate-hotkey:

#!/bin/bash
# Global hotkey dictation script for macOS
# Uses local Whisper (faster-whisper) for transcription

LOCK_FILE="/tmp/dictate-hotkey.lock"
PID_FILE="/tmp/dictate-hotkey.pid"
AUDIO_FILE="/tmp/dictate-hotkey.wav"
DEBUG_AUDIO="/tmp/dictate-hotkey-debug.wav"
WHISPER_BIN="$HOME/.local/bin/whisper-transcribe"

if [[ ! -x "$WHISPER_BIN" ]]; then
    osascript -e 'display notification "whisper-transcribe not found" with title "Dictation" sound name "Basso"'
    exit 1
fi

cleanup_stale_lock() {
    if [[ -f "$LOCK_FILE" ]] && [[ -f "$PID_FILE" ]]; then
        local pid
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]] && ! kill -0 "$pid" 2>/dev/null; then
            rm -f "$LOCK_FILE" "$PID_FILE"
            return 0
        fi
    elif [[ -f "$LOCK_FILE" ]] && [[ ! -f "$PID_FILE" ]]; then
        rm -f "$LOCK_FILE"
        return 0
    fi
    return 1
}

cleanup_stale_lock

if [[ -f "$LOCK_FILE" ]]; then
    # STOP recording and transcribe
    if [[ -f "$PID_FILE" ]]; then
        pid=$(cat "$PID_FILE" 2>/dev/null)
        if [[ -n "$pid" ]]; then
            kill -INT "$pid" 2>/dev/null || true
            for i in {1..10}; do
                kill -0 "$pid" 2>/dev/null || break
                sleep 0.1
            done
            kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
        fi
        rm -f "$PID_FILE"
    fi

    rm -f "$LOCK_FILE"

    if [[ ! -f "$AUDIO_FILE" ]]; then
        osascript -e 'display notification "No audio file found" with title "Dictation" sound name "Basso"'
        exit 1
    fi

    audio_size=$(stat -f%z "$AUDIO_FILE" 2>/dev/null || echo "0")
    if [[ "$audio_size" -lt 1000 ]]; then
        osascript -e 'display notification "Audio too short" with title "Dictation" sound name "Basso"'
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    osascript -e 'display notification "Transcribing locally..." with title "Dictation"'

    TRANSCRIPT=$("$WHISPER_BIN" "$AUDIO_FILE" 2>/dev/null)

    if [[ -z "$TRANSCRIPT" ]]; then
        cp "$AUDIO_FILE" "$DEBUG_AUDIO" 2>/dev/null
        osascript -e 'display notification "Transcription failed" with title "Dictation" sound name "Basso"'
        rm -f "$AUDIO_FILE"
        exit 1
    fi

    rm -f "$AUDIO_FILE"

    # Copy to clipboard and paste
    echo -n "$TRANSCRIPT" | pbcopy
    osascript -e 'tell application "System Events" to keystroke "v" using command down'

    osascript -e "display notification \"Typed: ${TRANSCRIPT:0:50}...\" with title \"Dictation\""

else
    # START recording
    rm -f "$AUDIO_FILE"

    osascript -e 'display notification "Recording... Press hotkey to stop" with title "Dictation"'

    rec -q -r 16000 -c 1 "$AUDIO_FILE" &
    record_pid=$!

    sleep 0.2
    if ! kill -0 "$record_pid" 2>/dev/null; then
        osascript -e 'display notification "Failed to start recording" with title "Dictation" sound name "Basso"'
        exit 1
    fi

    touch "$LOCK_FILE"
    echo "$record_pid" > "$PID_FILE"
fi

Make it executable:

chmod +x ~/.local/bin/dictate-hotkey

Set Up the Global Hotkey

Option A: Automator + System Shortcuts (built-in)

  1. Open Automator > New Document > Quick Action
  2. Set “Workflow receives” to no input in any application
  3. Add Run Shell Script action
  4. Paste: ~/.local/bin/dictate-hotkey
  5. Save as “Dictate Toggle”
  6. Open System Settings > Keyboard > Keyboard Shortcuts > Services
  7. Find “Dictate Toggle” and assign your hotkey (e.g., Ctrl+`)

Option B: Hammerspoon (more reliable)

If you use Hammerspoon, add to your ~/.hammerspoon/init.lua:

hs.hotkey.bind({"ctrl"}, "`", function()
    hs.execute("~/.local/bin/dictate-hotkey", true)
end)

Then reload Hammerspoon config.

Grant Permissions

macOS will prompt for:

  • Microphone access - Allow for Terminal/iTerm/Hammerspoon
  • Accessibility access - Required for the paste keystroke

Go to System Settings > Privacy & Security to grant these if needed.

Test It

Same as Linux - press hotkey, speak, press again, text appears.

Troubleshooting

SoX not recording:

# List audio devices
rec -l

# Test recording
rec -r 16000 -c 1 /tmp/test.wav
# Ctrl+C to stop
play /tmp/test.wav

Paste not working:

  • Make sure Accessibility permissions are granted
  • Try the clipboard fallback: just pbcopy and manually Cmd+V

Part 3: Windows 11

Windows uses AutoHotkey for the global hotkey and ffmpeg for recording. The transcription wrapper runs in Python, same as other platforms.

Install Dependencies

Option A: Using winget (recommended)

Open PowerShell as Administrator:

winget install --id=Gyan.FFmpeg -e
winget install --id=AutoHotkey.AutoHotkey -e
winget install --id=Python.Python.3.12 -e

Option B: Using Scoop

scoop install ffmpeg autohotkey python

Install faster-whisper

# Create venv and install
python -m venv "$env:USERPROFILE\.local\share\whisper-venv"
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\pip.exe" install faster-whisper

Create the Transcription Wrapper

Save this to %USERPROFILE%\.local\bin\whisper-transcribe.py:

"""Local Whisper transcription using faster-whisper."""

import sys
from pathlib import Path
from faster_whisper import WhisperModel

MODEL_SIZE = "distil-large-v3"
DEVICE = "cpu"
COMPUTE_TYPE = "int8"

VOCAB_PATH = Path.home() / ".local/share/whisper-vocab.txt"
PROMPT_MAX_CHARS = 350


def load_vocab():
    if not VOCAB_PATH.exists():
        return None
    terms = VOCAB_PATH.read_text().strip().splitlines()
    if not terms:
        return None
    prompt_parts, char_count = [], 0
    for term in terms:
        if char_count + len(term) + 2 > PROMPT_MAX_CHARS:
            break
        prompt_parts.append(term)
        char_count += len(term) + 2
    return ", ".join(prompt_parts) if prompt_parts else None


def main():
    if len(sys.argv) < 2:
        print("Usage: whisper-transcribe <audio_file>", file=sys.stderr)
        sys.exit(1)

    audio_file = sys.argv[1]
    initial_prompt = load_vocab()
    model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

    kwargs = dict(
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt

    segments, info = model.transcribe(audio_file, **kwargs)
    transcript = " ".join(segment.text.strip() for segment in segments)
    print(transcript)

if __name__ == "__main__":
    main()

Create the directories and download the model:

mkdir "$env:USERPROFILE\.local\bin" -Force

# First run downloads the model (~1.5GB)
& "$env:USERPROFILE\.local\share\whisper-venv\Scripts\python.exe" "$env:USERPROFILE\.local\bin\whisper-transcribe.py" NUL 2>$null

Find Your Microphone Name

ffmpeg -list_devices true -f dshow -i dummy 2>&1 | Select-String "audio"

Look for a line like:

[dshow] "Microphone Array (Realtek(R) Audio)" (audio)

Copy the name in quotes - you’ll need it for the script below.

Create the AutoHotkey Script

Save this as dictate.ahk somewhere convenient (e.g., Documents\Scripts\dictate.ahk):

#Requires AutoHotkey v2.0
#SingleInstance Force

; Configuration
global LOCK_FILE := A_Temp "\dictate-hotkey.lock"
global PID_FILE := A_Temp "\dictate-hotkey.pid"
global AUDIO_FILE := A_Temp "\dictate-hotkey.wav"
global WHISPER_VENV := EnvGet("USERPROFILE") "\.local\share\whisper-venv\Scripts\python.exe"
global WHISPER_SCRIPT := EnvGet("USERPROFILE") "\.local\bin\whisper-transcribe.py"

; Hotkey: Ctrl+` (backtick) - change this to your preference
^`:: ToggleDictation()

ToggleDictation() {
    ; Clean up stale lock if recording process died
    if FileExist(LOCK_FILE) && FileExist(PID_FILE) {
        pid := FileRead(PID_FILE)
        if !ProcessExist(pid) {
            FileDelete(LOCK_FILE)
            FileDelete(PID_FILE)
        }
    }

    if FileExist(LOCK_FILE) {
        StopAndTranscribe()
    } else {
        StartRecording()
    }
}

StartRecording() {
    ; Clean up
    if FileExist(AUDIO_FILE)
        FileDelete(AUDIO_FILE)

    ; Create lock file
    FileAppend("", LOCK_FILE)

    ; Show notification
    TrayTip("Recording... Press Ctrl+` to stop", "Dictation", "Mute")

    ; Start ffmpeg recording in background
    ; IMPORTANT: Replace with YOUR device name from the step above
    deviceName := "Microphone Array (Realtek(R) Audio)"
    Run('cmd /c ffmpeg -f dshow -i audio="' deviceName '" -ar 16000 -ac 1 -y "' AUDIO_FILE '"', , "Hide", &pid)

    ; Save PID
    FileAppend(pid, PID_FILE)
}

StopAndTranscribe() {
    ; Remove lock
    if FileExist(LOCK_FILE)
        FileDelete(LOCK_FILE)

    ; Kill ffmpeg
    if FileExist(PID_FILE) {
        pid := FileRead(PID_FILE)
        try {
            ProcessClose(pid)
        }
        FileDelete(PID_FILE)
    }

    ; Also kill any lingering ffmpeg
    ; Note: Run() doesn't go through a shell, so redirections like 2>nul don't work here
    Run('taskkill /f /im ffmpeg.exe', , "Hide")
    Sleep(500)  ; Let file finish writing

    if !FileExist(AUDIO_FILE) || FileGetSize(AUDIO_FILE) < 1000 {
        TrayTip("No audio recorded", "Dictation Error", "Icon!")
        return
    }

    TrayTip("Transcribing locally...", "Dictation", "Mute")

    ; Run whisper transcription
    tempTranscript := A_Temp "\transcript.txt"

    cmd := '"' WHISPER_VENV '" "' WHISPER_SCRIPT '" "' AUDIO_FILE '" > "' tempTranscript '" 2>nul'
    RunWait(A_ComSpec ' /c ' cmd, , "Hide")

    transcript := ""
    if FileExist(tempTranscript) {
        transcript := Trim(FileRead(tempTranscript), "`r`n")
        FileDelete(tempTranscript)
    }
    FileDelete(AUDIO_FILE)

    if (transcript = "") {
        TrayTip("Transcription failed", "Dictation Error", "Icon!")
        return
    }

    ; Type the result using clipboard + paste (most reliable)
    A_Clipboard := transcript
    Sleep(100)
    Send("^v")

    TrayTip("Typed: " SubStr(transcript, 1, 50) "...", "Dictation", "Mute")
}

Important: Update the deviceName variable with your actual microphone name from the previous step.

Run on Startup (Optional)

  1. Press Win+R, type shell:startup, press Enter
  2. Create a shortcut to your dictate.ahk file in this folder

Test It

  1. Double-click dictate.ahk to run it (green H icon appears in system tray)
  2. Open any text field
  3. Press Ctrl+` - tray notification shows “Recording…”
  4. Speak clearly
  5. Press Ctrl+` again - “Transcribing locally…” then text appears

Note: The first transcription will be slow (~30-60 seconds) as the model loads. Subsequent transcriptions will be faster (~6-8 seconds for 15 seconds of audio).

Troubleshooting

“No audio recorded” error:

  • Check the deviceName variable in the script matches your actual microphone
  • Test ffmpeg recording manually:
    ffmpeg -f dshow -i audio="Your Microphone Name" -t 3 -y test.wav
    
  • Make sure microphone isn’t muted in Windows Sound settings

ffmpeg not found:

  • Restart your terminal/PowerShell after installing
  • Check it’s in PATH: where ffmpeg

AutoHotkey script won’t run:

  • Make sure you have AutoHotkey v2 installed (not v1)
  • Right-click the .ahk file > “Run as administrator” if needed

Hotkey conflict:

  • Change ^` to something else in the script (look for the hotkey definition), e.g.:
    • ^+d for Ctrl+Shift+D
    • #d for Win+D
    • !d for Alt+D

Model download fails:

  • Ensure you have ~3GB free disk space
  • Check Python and pip are working: python --version

Custom Vocabulary (Optional)

Whisper’s initial_prompt parameter biases the decoder toward specific words. If you frequently dictate domain-specific terms - project names, technical jargon, people’s names - you can improve accuracy by feeding Whisper a vocabulary list.

How It Works

The whisper-transcribe script loads vocabulary from ~/.local/share/whisper-vocab.txt (or %USERPROFILE%\.local\share\whisper-vocab.txt on Windows) if it exists. The file format is simple: one term per line, sorted by frequency (most common first). The script takes the top ~40-60 terms that fit in Whisper’s 350-character prompt budget.

Example vocabulary file:

TrueNAS
Syncthing
Kubernetes
Obsidian
PostgreSQL
FastAPI

Creating Your Vocabulary File

Simple approach: Manually list terms you commonly dictate:

cat > ~/.local/share/whisper-vocab.txt << 'EOF'
YourProjectName
TechnicalTerm1
TechnicalTerm2
SomeFramework
EOF

Automated approach: If you keep notes in markdown (Obsidian, Logseq, etc.), you can extract vocabulary programmatically:

  1. Scan all .md files for capitalised words and wiki-links
  2. Filter against a dictionary (/usr/share/dict/words on Linux) to isolate non-common terms
  3. Sort by frequency and output to the vocab file

I wrote a Python script that does this for my Obsidian vault - it extracts ~3000 terms, and Whisper loads the top 40-50 at transcription time. The improvement is subtle but noticeable for proper nouns.
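A minimal sketch of that extraction pipeline (the extract_vocab helper, the inline COMMON_WORDS stand-in, and the regexes are illustrative - in practice you'd load a real dictionary like /usr/share/dict/words and glob your actual notes directory):

```python
import re
from collections import Counter

# Tiny stand-in for a real dictionary such as /usr/share/dict/words
COMMON_WORDS = {"the", "a", "i", "my", "and", "with", "notes", "today", "synced"}

def extract_vocab(markdown_texts):
    """Return candidate vocabulary terms sorted by frequency (most common first)."""
    counts = Counter()
    for text in markdown_texts:
        # Wiki-links like [[TrueNAS]] are strong vocabulary signals
        counts.update(re.findall(r"\[\[([^\]|#]+)", text))
        # Capitalised words (incl. CamelCase) that aren't everyday English
        for word in re.findall(r"\b[A-Z][A-Za-z0-9]+\b", text):
            if word.lower() not in COMMON_WORDS:
                counts[word] += 1
    return [term for term, _ in counts.most_common()]

if __name__ == "__main__":
    sample = "Synced [[TrueNAS]] notes; the Syncthing job and TrueNAS backup ran today."
    print(extract_vocab([sample]))
```

To use it for real, feed extract_vocab the contents of your .md files and write the result, one term per line, to ~/.local/share/whisper-vocab.txt - the transcription wrapper picks it up automatically.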

Constraints

  • Token budget: Whisper’s decoder context is 448 tokens. The initial_prompt should use at most half (~224 tokens ≈ 350 characters). The script enforces this.
  • Acronyms are low yield: Whisper handles common acronyms fine from training data. Focus on proper nouns and domain-specific terms.
  • Don’t use hotwords with initial_prompt: In faster-whisper 1.2.1, combining both causes repetition artifacts. Stick to initial_prompt only.

Beyond Dictation: Longer Recordings with Speaker Diarisation

The dictation setup above is optimised for short bursts — press a hotkey, speak a sentence, get text. But the same Whisper stack can handle longer recordings (meetings, interviews, voice memos) with an important addition: speaker diarisation — identifying who said what.

For this, WhisperX wraps faster-whisper with word-level timestamp alignment and pyannote speaker diarisation in a single pipeline.

Install WhisperX

# Create a separate venv (don't mix with dictation — different dependencies)
python3 -m venv ~/venvs/whisperx
~/venvs/whisperx/bin/pip install whisperx

One-Time Setup: HuggingFace Auth

Pyannote’s diarisation models are gated — you need a free HuggingFace account and must accept the model licenses:

  1. Create an account at huggingface.co
  2. Accept the licenses (click “Agree” on each): pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0
  3. Generate a read token at huggingface.co/settings/tokens
  4. Log in locally:
    ~/venvs/whisperx/bin/huggingface-cli login
    

Models cache locally after first download (~2-4GB). After that, everything runs offline.

Transcribe with Diarisation

import torch
import whisperx
from whisperx.diarize import DiarizationPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

audio_file = "meeting.m4a"

# 1. Transcribe
model = whisperx.load_model("distil-large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=4)

# 2. Align word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Diarise (identify speakers)
diarize_model = DiarizationPipeline(
    model_name="pyannote/speaker-diarization-3.1", device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

# 4. Print with speaker labels
for seg in result["segments"]:
    speaker = seg.get("speaker", "?")
    print(f"[{speaker}] {seg['text'].strip()}")

What to Expect

Scenario                            Accuracy
2-3 speakers, clear audio           Excellent
4-6 speakers, moderate audio        Good, some errors
7+ speakers or noisy/overlapping    Needs manual review

Specifying min_speakers / max_speakers (or num_speakers if you know the exact count) significantly improves accuracy — the model doesn’t have to guess.

Speed on CPU: Transcription takes about 0.6x the audio length (a 10-minute file takes ~6 minutes). Diarisation adds roughly 2-3x the audio length on top, so a 10-minute meeting takes ~30 minutes total on CPU. On GPU, both steps are near-instant.

Key detail: Speaker labels come back as SPEAKER_00, SPEAKER_01, etc. — you map these to real names yourself. WhisperX assigns speakers at the word level, so even within a single transcription segment, different words can have different speaker labels. This matters for conversational audio where people talk in short turns.
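Mapping those generic labels to real names is a one-dict post-processing step. A minimal sketch (the relabel helper and the speaker_names mapping are illustrative, not part of WhisperX):

```python
def relabel(segments, speaker_names):
    """Replace SPEAKER_XX labels with real names; unknown labels pass through."""
    lines = []
    for seg in segments:
        speaker = seg.get("speaker", "?")
        name = speaker_names.get(speaker, speaker)
        lines.append(f"[{name}] {seg['text'].strip()}")
    return lines

# Example against the shape WhisperX returns in result["segments"]
speaker_names = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
segments = [
    {"speaker": "SPEAKER_00", "text": " Morning, everyone. "},
    {"speaker": "SPEAKER_01", "text": " Morning. "},
]
for line in relabel(segments, speaker_names):
    print(line)
```

You typically listen to the first few seconds of each SPEAKER_XX turn once, fill in the dict, and re-run the print loop.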

When to Use What

Use case                                    Tool
Quick dictation (sentence or two)           The hotkey setup above (faster-whisper)
Transcribing a voice memo or lecture        WhisperX without diarisation
Meeting or interview (multiple speakers)    WhisperX with diarisation (see code above)

The dictation setup stays as-is — it’s optimised for speed and simplicity. WhisperX is the heavier tool you reach for when you have a longer recording and care about who said what.


What’s Next

Possible improvements to the dictation setup:

  • Hold-to-talk instead of toggle (harder with global shortcuts)
  • Visual indicator showing recording state in system tray
  • Multiple languages (Whisper supports many)
  • GPU acceleration (significantly faster if you have a compatible NVIDIA GPU)

The cognitive overhead of “press, speak, press” is minimal once you’ve done it a few times.


Questions or improvements? Find me at hedwards.dev.