Works on Linux, macOS, or cloud GPUs (RunPod, Vast.ai).
What This Is
Take any song. Separate the vocals from the instrumentals. Convert the vocals to sound like a different singer. Remix. You now have Eddie Vedder singing Complicated by Avril Lavigne, or whatever combination you want.
This uses Retrieval-based Voice Conversion (RVC), an open-source voice conversion algorithm that’s become the standard for AI covers. It takes an input voice and converts it to sound like a target speaker, preserving the original melody, timing, and emotion. With a pre-trained voice model (available for most well-known artists), the whole process takes a few minutes per song on a cloud GPU and costs cents.
The pipeline:
- Download the source song
- Separate vocals from instrumentals
- Convert the vocals to your target singer’s voice
- Remix the converted vocals with the original instrumentals
Jump to: Prerequisites · Step 1: Get the Song · Step 2: Separate Vocals · Step 3: Voice Conversion · Step 4: Remix · RunPod Cloud Setup · Quality Tips · Voice Models · Tools Overview
Prerequisites
- A cloud GPU: RunPod ($0.09-0.20/hr), Vast.ai, or similar. A local GPU (8GB+ VRAM) also works.
- Python 3.10+
- ffmpeg
- yt-dlp (for downloading source audio)
This guide uses RunPod for GPU compute. You don’t need to own a GPU: rent one for the duration of the job (minutes), then delete it. See the RunPod section for setup.
Step 1: Get the Song
Download the source audio. yt-dlp extracts best-quality audio:
# Install yt-dlp if you don't have it
pip install yt-dlp
# Download audio only, best quality; %(ext)s lets yt-dlp set the extension after conversion
yt-dlp -x --audio-format wav -o "complicated.%(ext)s" "https://www.youtube.com/watch?v=SONG_ID"
Download as WAV: this avoids a second lossy transcode. YouTube’s source audio is already lossy (typically Opus, format 251, at roughly 130-160 kbps), so if you have access to a lossless source (CD rip, Bandcamp FLAC, Qobuz), prefer that. Avoid --audio-format mp3, which would add another generation of compression artifacts on top of YouTube’s.
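For scripted use, the download command can be wrapped in a small Python helper. This is a minimal sketch: `build_download_command` is a hypothetical name, and the `%(ext)s` output template is an assumption chosen so that yt-dlp names the file after the converted format. The helper only builds the argument list; nothing is downloaded until you pass it to `subprocess.run`.

```python
import shlex


def build_download_command(url: str, basename: str = "complicated") -> list[str]:
    """Return the yt-dlp argument list for a WAV audio-only download.

    Hypothetical helper: it only builds the command and does not run it.
    """
    return [
        "yt-dlp",
        "-x",                          # extract audio only
        "--audio-format", "wav",       # convert the extracted audio to WAV
        "-o", f"{basename}.%(ext)s",   # yt-dlp fills in the final extension
        url,
    ]


cmd = build_download_command("https://www.youtube.com/watch?v=SONG_ID")
print(shlex.join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` so a failed download raises an error instead of passing silently.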
Step 2: Separate Vocals
You need to isolate the vocals from the instrumentals. The vocals go through voice conversion; the instrumentals stay untouched.
Option A: UVR5 + BS-RoFormer (best quality)
Ultimate Vocal Remover 5 (UVR5) is a GUI that wraps multiple separation models. BS-RoFormer is the current state of the art: it won the SDX23 Challenge with 12.9 dB SDR for vocals.
Install from the UVR5 GitHub repo and follow their README for current install instructions (dependencies vary by platform and GPU).
In the GUI:
- Select BS-RoFormer as your model
- Load your song
- Process. You’ll get vocals.wav and instrumental.wav
Optional second pass: Run the vocals through a de-reverb model to strip room reverb and backing vocals. This gives the voice conversion cleaner input.
Option B: Demucs (simpler, still good)
If you’ve used Demucs before or just want a CLI approach:
pip install demucs
# Separate into vocals + accompaniment
python -m demucs --two-stems=vocals complicated.wav
Output lands in separated/htdemucs/complicated/vocals.wav and no_vocals.wav.
Demucs v4 (htdemucs) is still very capable. It falls slightly behind BS-RoFormer for pure vocal isolation but has the advantage of a clean CLI and no GUI dependency.
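If you script the later steps, it helps to compute where Demucs writes its stems rather than hard-coding the paths. The sketch below assumes the default output layout described above (`separated/<model>/<track name>/`); `demucs_output_paths` is a hypothetical helper name, and the layout may differ if you pass a custom output directory or model.

```python
from pathlib import Path


def demucs_output_paths(song: str, model: str = "htdemucs",
                        out_dir: str = "separated") -> tuple[Path, Path]:
    """Return (vocals, accompaniment) paths for a --two-stems=vocals run.

    Assumes the default layout:
    <out_dir>/<model>/<song name without extension>/{vocals,no_vocals}.wav
    """
    track_dir = Path(out_dir) / model / Path(song).stem
    return track_dir / "vocals.wav", track_dir / "no_vocals.wav"


vocals, instrumental = demucs_output_paths("complicated.wav")
print(vocals)
print(instrumental)
```

On Linux or macOS this prints the two paths given above for `complicated.wav`.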
Step 3: Voice Conversion
This is the core step. Applio is the community consensus tool for RVC voice conversion: the most actively maintained fork with the best documentation.
Install Applio
# Clone the repo
git clone https://github.com/IAHispano/Applio.git
cd Applio
# Install dependencies with pip
pip install -r requirements.txt
# Run the web UI
python app.py
The Gradio web UI opens in your browser.
Run Inference
- Download a voice model (see Finding Voice Models below) and place the .pth and .index files in logs/ inside the Applio directory
- Select your model in the Inference tab
- Upload the isolated vocals.wav from Step 2
- Set pitch: if the original singer and target singer are different genders, you’ll need to transpose:
  - Female → Male: try -4 to -6 semitones
  - Male → Female: try +4 to +6 semitones
  - Same gender, different range: adjust by ear (±1-3 semitones)
- Export the converted vocals
Applio also has a CLI for scripted/batch use. Check their docs for the current interface, as it changes between releases.
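To get a feel for how large the semitone offsets in the pitch step are, it helps to convert them to frequency ratios. In equal temperament, a shift of n semitones multiplies every frequency by 2^(n/12). The function name below is illustrative and not part of Applio.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal temperament: 2^(n/12)."""
    return 2.0 ** (semitones / 12.0)


for shift in (-6, -4, 4, 6, 12):
    print(f"{shift:+d} semitones -> x{semitones_to_ratio(shift):.3f}")
```

A shift of -6 semitones scales frequencies by about 0.707, and +12 is exactly one octave (x2). That is why the suggested -4 to -6 range is enough to move a typical female vocal line into a male register without dropping a full octave.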
Step 4: Remix
Combine the converted vocals with the original instrumentals:
ffmpeg -i vocals_converted.wav -i instrumental.wav \
-filter_complex "[0:a][1:a]amix=inputs=2:duration=longest" \
-ac 2 output_cover.wav
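By default, amix scales each input down to prevent clipping, which makes the whole mix quieter. To keep the original levels, disable that normalization (normalize=0, available in recent ffmpeg releases) and set each track’s volume explicitly, watching for clipping: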
# Overlay vocals on instrumentals with volume control
ffmpeg -i vocals_converted.wav -i instrumental.wav \
-filter_complex "[0:a]volume=1.0[v];[1:a]volume=0.85[i];[v][i]amix=inputs=2:duration=longest:normalize=0" \
-ac 2 output_cover.wav
Adjust the volume values to taste. The vocals usually want to be slightly louder than the instrumentals. If you want more control, open both tracks in Audacity or any DAW and mix manually, where you can add EQ, compression, and reverb to glue it together.
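If you think about levels in decibels rather than linear gain, the second ffmpeg command above can be generated from a small helper. This is a sketch under stated assumptions: `db_to_gain` and `build_mix_command` are hypothetical names, the filter graph simply mirrors the command shown in this step, and the helper only builds the argument list without running ffmpeg.

```python
import shlex


def db_to_gain(db: float) -> float:
    """Convert a level change in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)


def build_mix_command(vocals: str, instrumental: str, output: str,
                      vocal_gain: float = 1.0, inst_gain: float = 0.85) -> list[str]:
    """Build the ffmpeg call that overlays vocals on the instrumental.

    normalize=0 keeps amix from rescaling the inputs, so the gains
    apply exactly as given.
    """
    graph = (
        f"[0:a]volume={vocal_gain:.3f}[v];"
        f"[1:a]volume={inst_gain:.3f}[i];"
        "[v][i]amix=inputs=2:duration=longest:normalize=0"
    )
    return ["ffmpeg", "-i", vocals, "-i", instrumental,
            "-filter_complex", graph, "-ac", "2", output]


# Example: instrumental 1.5 dB below the vocals (a gain of about 0.841).
cmd = build_mix_command("vocals_converted.wav", "instrumental.wav",
                        "output_cover.wav", inst_gain=db_to_gain(-1.5))
print(shlex.join(cmd))
```

The 0.85 used in the command above corresponds to roughly -1.4 dB, so small decibel steps (0.5 to 1 dB) are a reasonable increment when adjusting by ear.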
Running on RunPod (Cloud GPU)
No local GPU? RunPod lets you rent one for cents. An RTX A4000 (16GB VRAM, $0.09/hr spot) handles the entire pipeline comfortably.
One-shot approach (first time)
- Create an account at runpod.io, load $5 in credits
- Create a pod:
runpodctl create pod \
  --name "ai-covers" \
  --gpuType "NVIDIA RTX A4000" \
  --gpuCount 1 \
  --templateId "runpod-torch-v240" \
  --imageName "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04" \
  --containerDiskSize 30 \
  --startSSH \
  --communityCloud
- SSH in (get the full SSH command from the RunPod web UI; it includes a hash suffix not available from the CLI)
- Install everything:
apt-get update && apt-get install -y ffmpeg
# Install Applio
git clone https://github.com/IAHispano/Applio.git
cd Applio && pip install -r requirements.txt
# Install demucs (for stem separation)
pip install demucs
- Upload your audio via runpodctl send/receive or the Jupyter upload interface
- Run the pipeline (Steps 1-4 above)
- Download results and delete the pod
Recurring use: Docker image + Network Volume
If you’re doing this regularly (every week or two), save setup time:
- Build a custom Docker image with Applio + Demucs + ffmpeg + all dependencies pre-installed. Push to Docker Hub.
- Create a Network Volume (~5-10GB, ~$0.50-1/month) to persist voice models and output files across sessions.
- Session workflow: Spin up pod with your image + volume → ready in 1-3 min → process → download → delete pod.
This turns a 15-30 minute setup into a 1-3 minute cold start.
Quality Tips
These parameters make a real difference in Applio:
Pitch Algorithm
| Algorithm | Speed | Quality | Use when |
|---|---|---|---|
| RMVPE | Fast | Good | Default choice for most covers |
| Crepe | Slower | Best | Very clean source audio, breathy/soft voices |
| FCPE | Fastest | Acceptable | Real-time conversion, not ideal for covers |
Key Parameters
- Index Ratio (Search Feature Ratio): Controls how much the voice model’s stored characteristics influence the output. Lower it (0.3-0.5) to reduce artifacts, especially with noisy source audio; raise it (0.7-0.8) for more faithful voice reproduction with clean input.
- Protection: Keep at 0.33-0.5. Going too low strips breath sounds and makes the output sound robotic.
- Split Audio: Enable this for faster inference and more consistent volume across the track.
- Embedder: ContentVec is the default. Spin V2 may give cleaner pronunciation โ worth A/B testing on your specific cover.
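The parameter guidance above can be collected into a starting preset that you then tune by ear. This is only a sketch: `suggested_settings` is a hypothetical helper, the dictionary keys are illustrative labels rather than Applio's actual option names, and the values are simply picked from the ranges listed in this section.

```python
def suggested_settings(noisy_source: bool) -> dict[str, object]:
    """Starting values taken from the ranges above; tune by ear.

    The keys are illustrative labels, not Applio's actual option names.
    """
    return {
        "pitch_algorithm": "rmvpe",                     # default choice for covers
        "index_ratio": 0.4 if noisy_source else 0.75,   # lower = fewer artifacts
        "protection": 0.33,                             # keep within 0.33-0.5
        "split_audio": True,                            # faster, more even volume
        "embedder": "contentvec",                       # the default embedder
    }


print(suggested_settings(noisy_source=True))
```

Keeping the preset in one place makes A/B tests easier: change a single value (for example the embedder), re-run the conversion, and compare the two exports.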
General Tips
- Clean input matters most. Spend time getting good vocal isolation in Step 2. A second de-reverb pass is worth it.
- Pitch transposition is trial and error. Start with the recommended range, then adjust by ear. Half-semitone tweaks can make a big difference.
- Training your own model from 10+ minutes of clean isolated vocals (200-300 epochs) will almost always beat a random community model. More epochs risk overfitting.
Finding Voice Models
Pre-trained RVC voice models for popular artists are widely available:
| Source | Notes |
|---|---|
| weights.gg | Largest curated library. Ratings, comments, previews. Requires account. |
| voice-models.com | 200k+ model index. Links to HuggingFace/Google Drive downloads. |
| HuggingFace | Search “[artist name] RVC” for community uploads. |
Popular artists (Eddie Vedder, Michael Jackson, Adele, etc.) have well-trained models with 300+ epochs available. For obscure artists, you’ll need to train your own, which takes as little as 5-10 minutes of clean vocal audio (10+ minutes gives better results) and a few hundred training epochs.
Model files are small: a .pth file (~150MB) and an .index file. Download both.
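A missing .index file is an easy mistake to make when downloading models, so a quick check before starting a session can save time. The helper below is a minimal sketch with a hypothetical name (`find_model_files`); it assumes each model sits in its own folder containing one .pth and one .index file.

```python
from pathlib import Path


def find_model_files(folder: str) -> tuple[Path, Path]:
    """Return the (.pth, .index) pair in a model folder, or raise."""
    root = Path(folder)
    weights = sorted(root.glob("*.pth"))
    indexes = sorted(root.glob("*.index"))
    if not weights:
        raise FileNotFoundError(f"no .pth file in {root}")
    if not indexes:
        raise FileNotFoundError(f"no .index file in {root}")
    # If several files match, the alphabetically first one is returned.
    return weights[0], indexes[0]
```

Calling `find_model_files("logs/my-model")` (a placeholder path) returns both paths, or raises `FileNotFoundError` naming the missing piece.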
Tools Overview
Voice Conversion
| Tool | Recommendation | Notes |
|---|---|---|
| Applio | Use this | Community consensus RVC fork. 3.1k stars, 149 contributors. Best docs, biggest community. |
| ultimate-rvc | Alternative | Wraps the full pipeline (separation + conversion + remix) into one tool. Smaller community. |
| RVC WebUI | Legacy | The original project. No major quality improvements since 2023. |
| Seed-VC | Experimental | Zero-shot voice conversion (no training needed). Archived Nov 2025. |
Vocal Separation
| Tool | Recommendation | Notes |
|---|---|---|
| UVR5 + BS-RoFormer | Best quality | Won SDX23 Challenge. Use through UVR5’s GUI. |
| Demucs v4 | Simplest | Clean CLI, still very good. Falls slightly behind BS-RoFormer. |
Commercial Alternatives
If you don’t want to run your own pipeline, Jammable (formerly Voicify AI) does this end-to-end for ~$8/month. Upload a song, pick a voice, get a cover in 30-60 seconds. Quality is hit-or-miss (reviews cite pitch issues and robotic artifacts), but it’s zero-effort. Other similar services: Musicfy, Covers.ai, MusicAI, Kit AI.
The DIY route gives you more control over quality parameters, access to any voice model (not just what the platform offers), and costs cents per song instead of a monthly subscription.
Cost Summary
| Approach | Cost | Setup time | Per-song time |
|---|---|---|---|
| RunPod spot (RTX A4000) | ~$0.01-0.05/song | 15-30 min first time, 1-3 min with Docker image | 3-5 min |
| Local GPU (8GB+ VRAM) | Free (hardware you own) | 30 min (one-time) | 5-10 min |
| Jammable (commercial) | $8/month | None | 30-60 sec |
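The per-song figures in the table follow from simple arithmetic: hourly rate times minutes used, divided by 60. The sketch below reproduces that calculation using the rates and times quoted in this guide; the function names are illustrative, and real costs depend on current RunPod pricing.

```python
def pod_cost(hourly_rate: float, minutes: float) -> float:
    """Cost in dollars of renting a pod for the given number of minutes."""
    return hourly_rate * minutes / 60.0


def break_even_songs(subscription: float, cost_per_song: float) -> float:
    """Songs per month at which DIY spending equals the subscription."""
    return subscription / cost_per_song


per_song = pod_cost(0.09, 5)          # 5 min of processing at $0.09/hr
first_run = pod_cost(0.09, 30 + 5)    # 30 min of setup plus one song
print(f"per song:  ${per_song:.4f}")
print(f"first run: ${first_run:.4f}")
print(f"break-even vs $8/month: {break_even_songs(8.0, 0.05):.0f} songs")
```

At $0.09/hr, five minutes of processing costs under a cent, and even a first run with 30 minutes of setup stays near five cents. At the upper estimate of $0.05 per song, you would need about 160 songs a month before the $8 subscription became cheaper.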
RunPod with a pre-built Docker image is the sweet spot. Costs almost nothing, takes a few minutes per song, and you get full control over quality.