Works on Linux, macOS, or cloud GPUs (RunPod, Vast.ai).


What This Is

Take any song. Separate the vocals from the instrumentals. Convert the vocals to sound like a different singer. Remix. You now have Eddie Vedder singing Complicated by Avril Lavigne, or whatever combination you want.

This uses Retrieval-based Voice Conversion (RVC), an open-source voice conversion algorithm that’s become the standard for AI covers. It takes an input voice and converts it to sound like a target speaker, preserving the original melody, timing, and emotion. With a pre-trained voice model (available for most well-known artists), the whole process takes a few minutes per song on a cloud GPU and costs cents.

The pipeline:

  1. Download the source song
  2. Separate vocals from instrumentals
  3. Convert the vocals to your target singer’s voice
  4. Remix the converted vocals with the original instrumentals

Jump to: Prerequisites · Step 1: Get the Song · Step 2: Separate Vocals · Step 3: Voice Conversion · Step 4: Remix · RunPod Cloud Setup · Quality Tips · Voice Models · Tools Overview


Prerequisites

  • A cloud GPU: RunPod ($0.09-0.20/hr), Vast.ai, or similar. A local GPU (8GB+ VRAM) also works.
  • Python 3.10+
  • ffmpeg
  • yt-dlp (for downloading source audio)

This guide uses RunPod for GPU compute. You don’t need to own a GPU: rent one for the duration of the job (minutes), then delete it. See the RunPod section for setup.


Step 1: Get the Song

Download the source audio. yt-dlp extracts best-quality audio:

# Install yt-dlp if you don't have it
pip install yt-dlp

# Download audio only, best quality
yt-dlp -x --audio-format wav -o "complicated.wav" "https://www.youtube.com/watch?v=SONG_ID"

Download as WAV; this avoids a second lossy transcode. YouTube’s source audio is already lossy (~251kbps Opus), so if you have access to a lossless source (CD rip, Bandcamp FLAC, Qobuz), prefer that. Avoid --audio-format mp3, which would add another generation of compression artifacts on top of YouTube’s.
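Before spending GPU time, it's worth sanity-checking what you actually downloaded. A stdlib-only sketch; the synthesized demo.wav below stands in for your real download:

```python
import wave

def wav_info(path: str) -> dict:
    """Read basic properties from a WAV file header."""
    with wave.open(path, "rb") as w:
        return {
            "sample_rate": w.getframerate(),
            "channels": w.getnchannels(),
            "duration_s": w.getnframes() / w.getframerate(),
        }

# Demo on half a second of synthesized silence (stands in for complicated.wav)
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(2)        # stereo
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(44100)
    w.writeframes(b"\x00" * 4 * 22050)  # 22050 frames x 2 ch x 2 bytes

print(wav_info("demo.wav"))  # {'sample_rate': 44100, 'channels': 2, 'duration_s': 0.5}
```

If the numbers look wrong (say, 8 kHz mono), re-download before running separation.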


Step 2: Separate Vocals

You need to isolate the vocals from the instrumentals. The vocals go through voice conversion; the instrumentals stay untouched.

Option A: UVR5 + BS-RoFormer (best quality)

Ultimate Vocal Remover 5 (UVR5) is a GUI that wraps multiple separation models. BS-RoFormer is the current state of the art; it won the SDX23 Challenge with 12.9 dB SDR for vocals.

Install from the UVR5 GitHub repo; follow their README for current install instructions (dependencies vary by platform and GPU).

In the GUI:

  1. Select BS-RoFormer as your model
  2. Load your song
  3. Process. You’ll get vocals.wav and instrumental.wav

Optional second pass: Run the vocals through a de-reverb model to strip room reverb and backing vocals. This gives the voice conversion cleaner input.

Option B: Demucs (simpler, still good)

If you’ve used Demucs before or just want a CLI approach:

pip install demucs

# Separate into vocals + accompaniment
python -m demucs --two-stems=vocals complicated.wav

Output lands in separated/htdemucs/complicated/vocals.wav and no_vocals.wav.

Demucs v4 (htdemucs) is still very capable: it falls slightly behind BS-RoFormer for pure vocal isolation but has the advantage of a clean CLI and no GUI dependency.


Step 3: Voice Conversion

This is the core step. Applio is the community consensus tool for RVC voice conversion: the most actively maintained fork, with the best documentation.

Install Applio

# Clone the repo
git clone https://github.com/IAHispano/Applio.git
cd Applio

# Install dependencies (uses uv if available, pip otherwise)
pip install -r requirements.txt

# Run the web UI
python app.py

The Gradio web UI opens in your browser.

Run Inference

  1. Download a voice model (see Finding Voice Models below) and place the .pth and .index files in logs/ inside the Applio directory
  2. Select your model in the Inference tab
  3. Upload the isolated vocals.wav from Step 2
  4. Set pitch. If the original singer and target singer are different genders, you’ll need to transpose:
    • Female → Male: try -4 to -6 semitones
    • Male → Female: try +4 to +6 semitones
    • Same gender, different range: adjust by ear (±1-3 semitones)
  5. Export the converted vocals
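The transposition ranges above are just the equal-tempered semitone formula in disguise. If you know roughly where each singer's comfortable pitch sits, you can compute a starting offset instead of guessing; the frequencies below are illustrative, not measurements:

```python
import math

def semitone_shift(f_source_hz: float, f_target_hz: float) -> int:
    """Semitones to transpose so the source pitch lands on the target pitch."""
    return round(12 * math.log2(f_target_hz / f_source_hz))

# A3 (220 Hz, an assumed comfortable female pitch) down to E3 (~164.8 Hz):
print(semitone_shift(220.0, 164.81))  # -5, inside the -4 to -6 range above
# A full octave down:
print(semitone_shift(440.0, 220.0))   # -12
```

Treat the result as a starting point; fine-tune by ear from there.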

Applio also has a CLI for scripted/batch use โ€” check their docs for the current interface, as it changes between releases.
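For reference, a scripted run has looked roughly like this in recent releases. All flag names here are assumptions, and the model filenames are placeholders; verify against Applio's current docs before relying on any of it:

```shell
# Hypothetical CLI invocation -- flag names change between Applio releases
python core.py infer \
  --input_path vocals.wav \
  --output_path vocals_converted.wav \
  --pth_path logs/target_voice.pth \
  --index_path logs/target_voice.index \
  --pitch -5 \
  --f0_method rmvpe \
  --index_rate 0.5
```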


Step 4: Remix

Combine the converted vocals with the original instrumentals:

ffmpeg -i vocals_converted.wav -i instrumental.wav \
  -filter_complex "[0:a][1:a]amix=inputs=2:duration=longest" \
  -ac 2 output_cover.wav

For better results, use a proper mixing approach that preserves volume levels:

# Overlay vocals on instrumentals with volume control
ffmpeg -i vocals_converted.wav -i instrumental.wav \
  -filter_complex "[0:a]volume=1.0[v];[1:a]volume=0.85[i];[v][i]amix=inputs=2:duration=longest:normalize=0" \
  -ac 2 output_cover.wav

Adjust the volume values to taste; the vocals usually want to sit slightly above the instrumental. For more control, open both tracks in Audacity or any DAW and mix manually, where you can add EQ, compression, and reverb to glue it together.
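ffmpeg's volume filter takes a linear factor, while mixing intuition usually runs in dB. A small converter helps pick values; the 0.85 below matches the command above:

```python
import math

def gain_to_db(gain: float) -> float:
    """Linear ffmpeg volume factor -> decibels."""
    return 20 * math.log10(gain)

def db_to_gain(db: float) -> float:
    """Decibels -> linear ffmpeg volume factor."""
    return 10 ** (db / 20)

print(round(gain_to_db(0.85), 2))   # -1.41: instrumental sits ~1.4 dB under the vocal
print(round(db_to_gain(-3.0), 3))   # 0.708: the factor for a -3 dB duck
```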


Running on RunPod (Cloud GPU)

No local GPU? RunPod lets you rent one for cents. An RTX A4000 (16GB VRAM, $0.09/hr spot) handles the entire pipeline comfortably.

One-shot approach (first time)

  1. Create an account at runpod.io, load $5 in credits
  2. Create a pod:
    runpodctl create pod \
      --name "ai-covers" \
      --gpuType "NVIDIA RTX A4000" \
      --gpuCount 1 \
      --templateId "runpod-torch-v240" \
      --imageName "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04" \
      --containerDiskSize 30 \
      --startSSH \
      --communityCloud
    
  3. SSH in (get the full SSH command from the RunPod web UI โ€” it includes a hash suffix not available from the CLI)
  4. Install everything:
    apt-get update && apt-get install -y ffmpeg
    
    # Install Applio
    git clone https://github.com/IAHispano/Applio.git
    cd Applio && pip install -r requirements.txt
    
    # Install demucs (for stem separation)
    pip install demucs
    
  5. Upload your audio via runpodctl send/receive or the Jupyter upload interface
  6. Run the pipeline (Steps 1-4 above)
  7. Download results and delete the pod
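On the pod, steps 1-4 can be strung together in one script. This is a sketch assuming the Demucs output layout described earlier; the Applio step is left as a placeholder since its CLI varies by release:

```shell
#!/usr/bin/env bash
set -euo pipefail

URL="$1"     # YouTube URL of the source song
SONG=song

# Step 1: download best-quality audio as WAV
yt-dlp -x --audio-format wav -o "${SONG}.wav" "$URL"

# Step 2: two-stem separation with Demucs
python -m demucs --two-stems=vocals "${SONG}.wav"
INSTR="separated/htdemucs/${SONG}/no_vocals.wav"

# Step 3: voice conversion -- run separated/htdemucs/${SONG}/vocals.wav
# through Applio (UI or CLI), writing the result to vocals_converted.wav

# Step 4: remix converted vocals over the untouched instrumental
ffmpeg -i vocals_converted.wav -i "$INSTR" \
  -filter_complex "[0:a]volume=1.0[v];[1:a]volume=0.85[i];[v][i]amix=inputs=2:duration=longest:normalize=0" \
  -ac 2 output_cover.wav
```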

Recurring use: Docker image + Network Volume

If you’re doing this regularly (every week or two), save setup time:

  1. Build a custom Docker image with Applio + Demucs + ffmpeg + all dependencies pre-installed. Push to Docker Hub.
  2. Create a Network Volume (~5-10GB, ~$0.50-1/month) to persist voice models and output files across sessions.
  3. Session workflow: spin up a pod with your image + volume (ready in 1-3 min), process, download, delete the pod.

This turns a 15-30 minute setup into a 1-3 minute cold start.
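A minimal image for that workflow might look like the following sketch. The base image tag matches the one used in the pod-creation command above; pinning exact dependency versions is left to you:

```dockerfile
# Assumed base: the RunPod PyTorch image used earlier in this guide
FROM runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Applio for voice conversion, Demucs for stem separation
RUN git clone https://github.com/IAHispano/Applio.git /opt/Applio \
    && pip install --no-cache-dir -r /opt/Applio/requirements.txt \
    && pip install --no-cache-dir demucs

WORKDIR /opt/Applio
```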


Quality Tips

These parameters make a real difference in Applio:

Pitch Algorithm

Algorithm   Speed     Quality      Use when
RMVPE       Fast      Good         Default choice for most covers
Crepe       Slower    Best         Very clean source audio, breathy/soft voices
FCPE        Fastest   Acceptable   Real-time conversion; not ideal for covers

Key Parameters

  • Index Ratio (Search Feature Ratio): Controls how much the voice model’s stored characteristics influence the output. Lower it (0.3-0.5) to reduce artifacts, especially with noisy source audio. Higher (0.7-0.8) for more faithful voice reproduction with clean input.
  • Protection: Keep at 0.33-0.5. Going too low strips breath sounds and makes the output sound robotic.
  • Split Audio: Enable this; it gives faster inference and more consistent volume across the track.
  • Embedder: ContentVec is the default. Spin V2 may give cleaner pronunciation โ€” worth A/B testing on your specific cover.

General Tips

  • Clean input matters most. Spend time getting good vocal isolation in Step 2. A second de-reverb pass is worth it.
  • Pitch transposition is trial and error. Start with the recommended range, then adjust by ear. Half-semitone tweaks can make a big difference.
  • Training your own model from 10+ minutes of clean isolated vocals (200-300 epochs) will almost always beat a random community model. More epochs risk overfitting.

Finding Voice Models

Pre-trained RVC voice models for popular artists are widely available:

Source             Notes
weights.gg         Largest curated library. Ratings, comments, previews. Requires account.
voice-models.com   200k+ model index. Links to HuggingFace/Google Drive downloads.
HuggingFace        Search “[artist name] RVC” for community uploads.

Popular artists (Eddie Vedder, Michael Jackson, Adele, etc.) have well-trained models with 300+ epochs available. For obscure artists, you’ll need to train your own, which only requires 5-10 minutes of clean vocal audio and a few hundred training epochs.

Model files are small: a .pth file (~150MB) and an .index file. Download both.


Tools Overview

Voice Conversion

Tool           Recommendation   Notes
Applio         Use this         Community consensus RVC fork. 3.1k stars, 149 contributors. Best docs, biggest community.
ultimate-rvc   Alternative      Wraps the full pipeline (separation + conversion + remix) into one tool. Smaller community.
RVC WebUI      Legacy           The original project. No major quality improvements since 2023.
Seed-VC        Experimental     Zero-shot voice conversion (no training needed). Archived Nov 2025.

Vocal Separation

Tool                 Recommendation   Notes
UVR5 + BS-RoFormer   Best quality     Won SDX23 Challenge. Use through UVR5’s GUI.
Demucs v4            Simplest         Clean CLI, still very good. Falls slightly behind BS-RoFormer.

Commercial Alternatives

If you don’t want to run your own pipeline, Jammable (formerly Voicify AI) does this end-to-end for ~$8/month. Upload a song, pick a voice, get a cover in 30-60 seconds. Quality is hit-or-miss (reviews cite pitch issues and robotic artifacts), but it’s zero-effort. Other similar services: Musicfy, Covers.ai, MusicAI, Kit AI.

The DIY route gives you more control over quality parameters, access to any voice model (not just what the platform offers), and costs cents per song instead of a monthly subscription.


Cost Summary

Approach                  Cost                      Setup time                                        Per-song time
RunPod spot (RTX A4000)   ~$0.01-0.05/song          15-30 min first time; 1-3 min with Docker image   3-5 min
Local GPU (8GB+ VRAM)     Free (hardware you own)   30 min (one-time)                                 5-10 min
Jammable (commercial)     $8/month                  None                                              30-60 sec

RunPod with a pre-built Docker image is the sweet spot. Costs almost nothing, takes a few minutes per song, and you get full control over quality.
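The per-song figure in the table reduces to simple arithmetic on the hourly rate. A quick sanity check, using the rates and timings quoted above:

```python
def cost_per_song(hourly_rate_usd: float, minutes_per_song: float) -> float:
    """GPU cost of one cover at a given hourly rental rate."""
    return hourly_rate_usd * minutes_per_song / 60

# RTX A4000 spot at $0.09/hr, ~5 minutes of GPU time per song:
print(round(cost_per_song(0.09, 5), 4))    # 0.0075 -> under a cent
# Even the high end ($0.20/hr, 10 min) stays around three cents:
print(round(cost_per_song(0.20, 10), 4))   # 0.0333
```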