Works on Linux, macOS, or cloud GPUs (RunPod, Vast.ai).


What This Is

Take any song. Separate the vocals from the instrumentals. Convert the vocals to sound like a different singer. Remix. You now have Eddie Vedder singing Complicated by Avril Lavigne, or whatever combination you want.

This uses Retrieval-based Voice Conversion (RVC), an open-source voice conversion algorithm that’s become the standard for AI covers. It takes an input voice and converts it to sound like a target speaker, preserving the original melody, timing, and emotion. With a pre-trained voice model (available for most well-known artists), the whole process takes a few minutes per song on a cloud GPU and costs cents.

The pipeline:

  1. Download the source song
  2. Separate vocals from instrumentals
  3. Convert the vocals to your target singer’s voice
  4. Remix the converted vocals with the original instrumentals

Jump to: Prerequisites · Step 1: Get the Song · Step 2: Separate Vocals · Step 3: Voice Conversion · Step 4: Remix · RunPod Cloud Setup · Quality Tips · Voice Models · Tools Overview


Prerequisites

  • A cloud GPU: RunPod ($0.09-0.20/hr), Vast.ai, or similar. A local GPU (8GB+ VRAM) also works.
  • Python 3.10+
  • ffmpeg
  • yt-dlp (for downloading source audio)

This guide uses RunPod for GPU compute. You don’t need to own a GPU: rent one for the duration of the job (minutes), then delete it. See the RunPod section for setup.


Step 1: Get the Song

Download the source audio. yt-dlp extracts best-quality audio:

# Install yt-dlp if you don't have it
pip install yt-dlp

# Download audio only, best quality
yt-dlp -x --audio-format wav -o "complicated.wav" "https://www.youtube.com/watch?v=SONG_ID"

Download as WAV; this avoids a second lossy transcode. YouTube’s source audio is already lossy (~251kbps Opus), so if you have access to a lossless source (CD rip, Bandcamp FLAC, Qobuz), prefer that. Avoid --audio-format mp3, which would add another generation of compression artifacts on top of YouTube’s.
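Before spending GPU time, it's worth sanity-checking what you actually downloaded. A stdlib-only sketch; the synthesized demo.wav below stands in for your real download:

```python
import wave

def wav_info(path: str) -> dict:
    """Read basic properties from a WAV file header."""
    with wave.open(path, "rb") as w:
        return {
            "sample_rate": w.getframerate(),
            "channels": w.getnchannels(),
            "duration_s": w.getnframes() / w.getframerate(),
        }

# Demo on half a second of synthesized silence (stands in for complicated.wav)
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(2)        # stereo
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(44100)
    w.writeframes(b"\x00" * 4 * 22050)  # 22050 frames x 2 ch x 2 bytes

print(wav_info("demo.wav"))  # {'sample_rate': 44100, 'channels': 2, 'duration_s': 0.5}
```

If the numbers look wrong (say, 8 kHz mono), re-download before running separation.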


Step 2: Separate Vocals

You need to isolate the vocals from the instrumentals. The vocals go through voice conversion; the instrumentals stay untouched.

Option A: UVR5 + BS-RoFormer (best quality)

Ultimate Vocal Remover 5 (UVR5) is a GUI that wraps multiple separation models. BS-RoFormer is the current state of the art; it won the SDX23 Challenge with 12.9 dB SDR for vocals.

Install from the UVR5 GitHub repo; follow their README for current install instructions (dependencies vary by platform and GPU).

In the GUI:

  1. Select BS-RoFormer as your model
  2. Load your song
  3. Process. You’ll get vocals.wav and instrumental.wav

Optional second pass: Run the vocals through a de-reverb model to strip room reverb and backing vocals. This gives the voice conversion cleaner input.

Option B: Demucs (simpler, still good)

If you’ve used Demucs before or just want a CLI approach:

pip install demucs

# Separate into vocals + accompaniment
python -m demucs --two-stems=vocals complicated.wav

Output lands in separated/htdemucs/complicated/vocals.wav and no_vocals.wav.

Demucs v4 (htdemucs) is still very capable: it falls slightly behind BS-RoFormer for pure vocal isolation but has the advantage of a clean CLI and no GUI dependency.


Step 3: Voice Conversion

This is the core step. Applio is the community consensus tool for RVC voice conversion: the most actively maintained fork, with the best documentation.

Install Applio

# Clone the repo
git clone https://github.com/IAHispano/Applio.git
cd Applio

# Install dependencies (uses uv if available, pip otherwise)
pip install -r requirements.txt

# Run the web UI
python app.py

The Gradio web UI opens in your browser.

Run Inference

  1. Download a voice model (see Finding Voice Models below) and place the .pth and .index files in logs/ inside the Applio directory
  2. Select your model in the Inference tab
  3. Upload the isolated vocals.wav from Step 2
  4. Set pitch. If the original singer and target singer are different genders, you’ll need to transpose:
    • Female → Male: try -4 to -6 semitones
    • Male → Female: try +4 to +6 semitones
    • Same gender, different range: adjust by ear (±1-3 semitones)
  5. Export the converted vocals
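The transposition ranges above are just the equal-tempered semitone formula in disguise. If you know roughly where each singer's comfortable pitch sits, you can compute a starting offset instead of guessing; the frequencies below are illustrative, not measurements:

```python
import math

def semitone_shift(f_source_hz: float, f_target_hz: float) -> int:
    """Semitones to transpose so the source pitch lands on the target pitch."""
    return round(12 * math.log2(f_target_hz / f_source_hz))

# A3 (220 Hz, an assumed comfortable female pitch) down to E3 (~164.8 Hz):
print(semitone_shift(220.0, 164.81))  # -5, inside the -4 to -6 range above
# A full octave down:
print(semitone_shift(440.0, 220.0))   # -12
```

Treat the result as a starting point; fine-tune by ear from there.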

Applio also has a CLI for scripted/batch use โ€” check their docs for the current interface, as it changes between releases.
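For reference, a scripted run has looked roughly like this in recent releases. All flag names here are assumptions, and the model filenames are placeholders; verify against Applio's current docs before relying on any of it:

```shell
# Hypothetical CLI invocation -- flag names change between Applio releases
python core.py infer \
  --input_path vocals.wav \
  --output_path vocals_converted.wav \
  --pth_path logs/target_voice.pth \
  --index_path logs/target_voice.index \
  --pitch -5 \
  --f0_method rmvpe \
  --index_rate 0.5
```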


Step 4: Remix

Combine the converted vocals with the original instrumentals:

ffmpeg -i vocals_converted.wav -i instrumental.wav \
  -filter_complex "[0:a][1:a]amix=inputs=2:duration=longest" \
  -ac 2 output_cover.wav

For better results, use a proper mixing approach that preserves volume levels:

# Overlay vocals on instrumentals with volume control
ffmpeg -i vocals_converted.wav -i instrumental.wav \
  -filter_complex "[0:a]volume=1.0[v];[1:a]volume=0.85[i];[v][i]amix=inputs=2:duration=longest:normalize=0" \
  -ac 2 output_cover.wav

Adjust the volume values to taste; the vocals usually want to sit slightly above the instrumental. For more control, open both tracks in Audacity or any DAW and mix manually, where you can add EQ, compression, and reverb to glue it together.
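ffmpeg's volume filter takes a linear factor, while mixing intuition usually runs in dB. A small converter helps pick values; the 0.85 below matches the command above:

```python
import math

def gain_to_db(gain: float) -> float:
    """Linear ffmpeg volume factor -> decibels."""
    return 20 * math.log10(gain)

def db_to_gain(db: float) -> float:
    """Decibels -> linear ffmpeg volume factor."""
    return 10 ** (db / 20)

print(round(gain_to_db(0.85), 2))   # -1.41: instrumental sits ~1.4 dB under the vocal
print(round(db_to_gain(-3.0), 3))   # 0.708: the factor for a -3 dB duck
```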


Running on RunPod (Cloud GPU)

No local GPU? RunPod lets you rent one for cents. An RTX A4000 (16GB VRAM, $0.09/hr spot) handles the entire pipeline comfortably.

One-shot approach (first time)

  1. Create an account at runpod.io, load $5 in credits
  2. Create a pod:
    runpodctl create pod \
      --name "ai-covers" \
      --gpuType "NVIDIA RTX A4000" \
      --gpuCount 1 \
      --templateId "runpod-torch-v240" \
      --imageName "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04" \
      --containerDiskSize 30 \
      --startSSH \
      --communityCloud
    
  3. SSH in (get the full SSH command from the RunPod web UI โ€” it includes a hash suffix not available from the CLI)
  4. Install everything:
    apt-get update && apt-get install -y ffmpeg
    
    # Install Applio
    git clone https://github.com/IAHispano/Applio.git
    cd Applio && pip install -r requirements.txt
    
    # Install demucs (for stem separation)
    pip install demucs
    
  5. Upload your audio via runpodctl send/receive or the Jupyter upload interface
  6. Run the pipeline (Steps 1-4 above)
  7. Download results and delete the pod
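On the pod, steps 1-4 can be strung together in one script. This is a sketch assuming the Demucs output layout described earlier; the Applio step is left as a placeholder since its CLI varies by release:

```shell
#!/usr/bin/env bash
set -euo pipefail

URL="$1"     # YouTube URL of the source song
SONG=song

# Step 1: download best-quality audio as WAV
yt-dlp -x --audio-format wav -o "${SONG}.wav" "$URL"

# Step 2: two-stem separation with Demucs
python -m demucs --two-stems=vocals "${SONG}.wav"
INSTR="separated/htdemucs/${SONG}/no_vocals.wav"

# Step 3: voice conversion -- run separated/htdemucs/${SONG}/vocals.wav
# through Applio (UI or CLI), writing the result to vocals_converted.wav

# Step 4: remix converted vocals over the untouched instrumental
ffmpeg -i vocals_converted.wav -i "$INSTR" \
  -filter_complex "[0:a]volume=1.0[v];[1:a]volume=0.85[i];[v][i]amix=inputs=2:duration=longest:normalize=0" \
  -ac 2 output_cover.wav
```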

Recurring use: Docker image + Network Volume

If you’re doing this regularly (every week or two), save setup time:

  1. Build a custom Docker image with Applio + Demucs + ffmpeg + all dependencies pre-installed. Push to Docker Hub.
  2. Create a Network Volume (~5-10GB, ~$0.50-1/month) to persist voice models and output files across sessions.
  3. Session workflow: spin up a pod with your image + volume (ready in 1-3 min), process, download, delete the pod.

This turns a 15-30 minute setup into a 1-3 minute cold start.
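A minimal image for that workflow might look like the following sketch. The base image tag matches the one used in the pod-creation command above; pinning exact dependency versions is left to you:

```dockerfile
# Assumed base: the RunPod PyTorch image used earlier in this guide
FROM runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Applio for voice conversion, Demucs for stem separation
RUN git clone https://github.com/IAHispano/Applio.git /opt/Applio \
    && pip install --no-cache-dir -r /opt/Applio/requirements.txt \
    && pip install --no-cache-dir demucs

WORKDIR /opt/Applio
```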


Quality Tips

These parameters make a real difference in Applio:

Pitch Algorithm

Algorithm   Speed     Quality      Use when
RMVPE       Fast      Good         Default choice for most covers
Crepe       Slower    Best         Very clean source audio, breathy/soft voices
FCPE        Fastest   Acceptable   Real-time conversion; not ideal for covers

Key Parameters

  • Index Ratio (Search Feature Ratio): Controls how much the voice model’s stored characteristics influence the output. Lower it (0.3-0.5) to reduce artifacts, especially with noisy source audio. Higher (0.7-0.8) for more faithful voice reproduction with clean input.
  • Protection: Keep at 0.33-0.5. Going too low strips breath sounds and makes the output sound robotic.
  • Split Audio: Enable this; it gives faster inference and more consistent volume across the track.
  • Embedder: ContentVec is the default. Spin V2 may give cleaner pronunciation โ€” worth A/B testing on your specific cover.

General Tips

  • Clean input matters most. Spend time getting good vocal isolation in Step 2. A second de-reverb pass is worth it.
  • Pitch transposition is trial and error. Start with the recommended range, then adjust by ear. Half-semitone tweaks can make a big difference.
  • Training your own model from 10+ minutes of clean isolated vocals (200-300 epochs) will almost always beat a random community model. More epochs risk overfitting.

Finding Voice Models

Pre-trained RVC voice models for popular artists are widely available:

Source             Notes
weights.gg         Largest curated library. Ratings, comments, previews. Requires account.
voice-models.com   200k+ model index. Links to HuggingFace/Google Drive downloads.
HuggingFace        Search “[artist name] RVC” for community uploads.

Popular artists (Eddie Vedder, Michael Jackson, Adele, etc.) have well-trained models with 300+ epochs available. For obscure artists, you’ll need to train your own, which only requires 5-10 minutes of clean vocal audio and a few hundred training epochs.

Model files are small: a .pth file (~150MB) and an .index file. Download both.


Tools Overview

Voice Conversion

Tool           Recommendation   Notes
Applio         Use this         Community consensus RVC fork. 3.1k stars, 149 contributors. Best docs, biggest community.
ultimate-rvc   Alternative      Wraps the full pipeline (separation + conversion + remix) into one tool. Smaller community.
RVC WebUI      Legacy           The original project. No major quality improvements since 2023.
Seed-VC        Experimental     Zero-shot voice conversion (no training needed). Archived Nov 2025.

Vocal Separation

Tool                 Recommendation   Notes
UVR5 + BS-RoFormer   Best quality     Won SDX23 Challenge. Use through UVR5’s GUI.
Demucs v4            Simplest         Clean CLI, still very good. Falls slightly behind BS-RoFormer.

Commercial Alternatives

If you don’t want to run your own pipeline, Jammable (formerly Voicify AI) does this end-to-end for ~$8/month. Upload a song, pick a voice, get a cover in 30-60 seconds. Quality is hit-or-miss (reviews cite pitch issues and robotic artifacts), but it’s zero-effort. Other similar services: Musicfy, Covers.ai, MusicAI, Kit AI.

The DIY route gives you more control over quality parameters, access to any voice model (not just what the platform offers), and costs cents per song instead of a monthly subscription.


Cost Summary

Approach                  Cost                      Setup time                                        Per-song time
RunPod spot (RTX A4000)   ~$0.01-0.05/song          15-30 min first time; 1-3 min with Docker image   3-5 min
Local GPU (8GB+ VRAM)     Free (hardware you own)   30 min (one-time)                                 5-10 min
Jammable (commercial)     $8/month                  None                                              30-60 sec

RunPod with a pre-built Docker image is the sweet spot. Costs almost nothing, takes a few minutes per song, and you get full control over quality.
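The per-song figure in the table reduces to simple arithmetic on the hourly rate. A quick sanity check, using the rates and timings quoted above:

```python
def cost_per_song(hourly_rate_usd: float, minutes_per_song: float) -> float:
    """GPU cost of one cover at a given hourly rental rate."""
    return hourly_rate_usd * minutes_per_song / 60

# RTX A4000 spot at $0.09/hr, ~5 minutes of GPU time per song:
print(round(cost_per_song(0.09, 5), 4))    # 0.0075 -> under a cent
# Even the high end ($0.20/hr, 10 min) stays around three cents:
print(round(cost_per_song(0.20, 10), 4))   # 0.0333
```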