---
sidebar_position: 5
title: Self-Hosted GPU Setup
---
This guide covers deploying Reflector's GPU processing on your own server instead of Modal.com. For the complete deployment guide, see Deployment Guide.
## When to Use Self-Hosted GPU
Choose self-hosted GPU if you:
- Have GPU hardware available (NVIDIA required)
- Want full control over processing
- Prefer fixed infrastructure costs over pay-per-use
- Have privacy or data locality requirements
- Need to process audio without external API calls
Choose Modal.com instead if you:
- Don't have GPU hardware
- Want zero infrastructure management
- Prefer pay-per-use pricing
- Need instant scaling for variable workloads
See Modal.com Setup for cloud GPU deployment.
## What Gets Deployed
The self-hosted GPU service provides the same API endpoints as Modal:
- `POST /v1/audio/transcriptions` - Whisper transcription
- `POST /v1/audio/transcriptions-from-url` - Transcribe from URL
- `POST /diarize` - Pyannote speaker diarization
- `POST /translate` - Audio translation
Your main Reflector server connects to this service exactly like it connects to Modal - only the URL changes.
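Once the service is up, you can smoke-test an endpoint directly. The sketch below assumes Bearer auth with your `REFLECTOR_GPU_APIKEY` and a multipart `file` field, in line with the OpenAI-style transcription contract; check the interactive docs at `/docs` for the exact schema.

```bash
# Hedged example: transcribe a local audio file against the self-hosted service.
# Auth header and form field name are assumptions; verify against /docs.
curl -X POST https://gpu.example.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
  -F "file=@sample.wav"
```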
## Prerequisites
### Hardware
- GPU: NVIDIA GPU with 8GB+ VRAM (tested on Tesla T4 with 15GB)
- CPU: 4+ cores recommended
- RAM: 8GB minimum, 16GB recommended
- Disk: 40-50GB minimum
### Network
- Public IP address
- Domain name with DNS A record pointing to server
### Accounts
- HuggingFace account with the gated Pyannote model licenses accepted (e.g. https://huggingface.co/pyannote/speaker-diarization-3.1)
- HuggingFace access token from https://huggingface.co/settings/tokens
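To sanity-check the token before deployment, you can query HuggingFace's `whoami` endpoint:

```bash
# Prints your account details on success; a 401 means the token is invalid
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
```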
## Docker Deployment
### Step 1: Install NVIDIA Driver
```bash
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot

# After reboot, verify installation
nvidia-smi
```
Expected output: GPU details with driver version and CUDA version.
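For a compact, scriptable check instead of the full table, `nvidia-smi` supports CSV queries:

```bash
# One-line summary: GPU name, driver version, total VRAM
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```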
### Step 2: Install Docker
Follow the official Docker installation guide for your distribution.
After installation, add your user to the docker group:
```bash
sudo usermod -aG docker $USER

# Log out and back in for the group change to take effect
exit
# SSH back in
```
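After logging back in, verify Docker works without `sudo`:

```bash
# Runs a minimal test container; success confirms the group change took effect
docker run --rm hello-world
```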
### Step 3: Install NVIDIA Container Toolkit
```bash
# Add NVIDIA repository and install toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
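Before building the Reflector image, confirm containers can see the GPU. The CUDA image tag below is an assumption; any recent `nvidia/cuda` base image works:

```bash
# Should print the same nvidia-smi table as on the host
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```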
### Step 4: Clone Repository and Configure
```bash
git clone https://github.com/monadical-sas/reflector.git
cd reflector/gpu/self_hosted

# Create environment file
cat > .env << EOF
REFLECTOR_GPU_APIKEY=$(openssl rand -hex 16)
HF_TOKEN=your_huggingface_token_here
EOF

# Note the generated API key - you'll need it for the main server config
cat .env
```
### Step 5: Build and Start

The repository includes a `compose.yml` file. Build and start:
```bash
# Build image (takes ~5 minutes, downloads ~10GB)
sudo docker compose build

# Start service
sudo docker compose up -d

# Wait for startup and verify
sleep 30
sudo docker compose logs
```
Look for: `INFO: Application startup complete. Uvicorn running on http://0.0.0.0:8000`
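The service exposes interactive API docs at `/docs` (used again for the HTTPS check below), so you can confirm it answers locally before putting it behind a proxy:

```bash
# Expect an HTTP 200 from the local docs page
curl -sI http://localhost:8000/docs
```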
### Step 6: Verify GPU Access

```bash
# Check the GPU is accessible from inside the container
sudo docker exec $(sudo docker ps -q) nvidia-smi
```

Should show the GPU with ~3GB VRAM used (models loaded).
## Configure HTTPS with Caddy
Caddy handles SSL automatically.
### Install Caddy
```bash
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | \
  sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | \
  sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install -y caddy
```
### Configure Reverse Proxy
Edit the Caddyfile with your domain:
```bash
sudo nano /etc/caddy/Caddyfile
```
Add (replace `gpu.example.com` with your domain):

```
gpu.example.com {
    reverse_proxy localhost:8000
}
```
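Optionally validate the configuration before reloading:

```bash
# Parses the Caddyfile and reports errors without applying it
caddy validate --config /etc/caddy/Caddyfile
```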
Reload Caddy (auto-provisions SSL certificate):
```bash
sudo systemctl reload caddy
```
### Verify HTTPS
```bash
curl -I https://gpu.example.com/docs
# Should return HTTP/2 200
```
## Configure Main Reflector Server
On your main Reflector server, update `server/.env`:

```bash
# GPU Processing - Self-hosted
TRANSCRIPT_BACKEND=modal
TRANSCRIPT_URL=https://gpu.example.com
TRANSCRIPT_MODAL_API_KEY=<your-generated-api-key>
DIARIZATION_BACKEND=modal
DIARIZATION_URL=https://gpu.example.com
DIARIZATION_MODAL_API_KEY=<your-generated-api-key>
```
**Note:** The backend type is `modal` because the self-hosted GPU service implements the same API contract as Modal.com. This allows you to switch between cloud and self-hosted GPU processing by changing only the URL and API key.
Restart services to apply:
```bash
docker compose -f docker-compose.prod.yml restart server worker
```
## Service Management

All commands in this section assume you're in `~/reflector/gpu/self_hosted/`.
```bash
# View logs
sudo docker compose logs -f

# Restart service
sudo docker compose restart

# Stop service
sudo docker compose down

# Check status
sudo docker compose ps
```
### Monitor GPU
```bash
# Check GPU usage
nvidia-smi

# Watch in real-time
watch -n 1 nvidia-smi
```
Typical GPU memory usage:
- Idle (models loaded): ~3GB VRAM
- During transcription: ~4-5GB VRAM
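To capture usage over time (e.g. during a transcription run), `nvidia-smi` can emit CSV on an interval:

```bash
# Print used/total VRAM every 5 seconds in CSV form
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```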
## Troubleshooting
### `nvidia-smi` fails after driver install
```bash
# Manually load the kernel module
sudo modprobe nvidia
nvidia-smi
```
### Service fails with "Could not download pyannote pipeline"
- Verify `HF_TOKEN` is valid: `echo $HF_TOKEN`
- Check model access at https://huggingface.co/pyannote/speaker-diarization-3.1
- Update `.env` with the correct token
- Restart the service: `sudo docker compose restart`
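To test gated-model access from the command line (an optional diagnostic), query the HuggingFace model API; a 401/403 status means the token is wrong or the license hasn't been accepted:

```bash
# Prints the HTTP status code only: 200 on success, 401/403 on token/license problems
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/models/pyannote/speaker-diarization-3.1
```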
### Cannot connect to HTTPS endpoint
- Verify DNS resolves: `dig +short gpu.example.com`
- Check firewall: `sudo ufw status` (ports 80 and 443 must be open)
- Check Caddy: `sudo systemctl status caddy`
- View Caddy logs: `sudo journalctl -u caddy -n 50`
### SSL certificate not provisioning
Requirements for Let's Encrypt:
- Ports 80 and 443 publicly accessible
- DNS resolves to server's public IP
- Valid domain (not localhost or private IP)
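Quick checks for each requirement (the domain is a placeholder):

```bash
# DNS should print this server's public IP
dig +short gpu.example.com

# Caddy should be listening on ports 80 and 443
sudo ss -tlnp | grep -E ':(80|443)\b'
```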
### Docker container won't start
```bash
# Check logs
sudo docker compose logs

# Common issues:
# - Port 8000 already in use
# - GPU not accessible (nvidia-ctk not configured)
# - Missing .env file
```
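One quick check per cause listed above:

```bash
# Is port 8000 already in use?
sudo ss -tlnp | grep ':8000'

# Is the NVIDIA runtime registered with Docker?
docker info | grep -i nvidia

# Is the .env file present in the compose directory?
ls -la .env
```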
## Updating
```bash
cd ~/reflector/gpu/self_hosted
git pull
sudo docker compose build
sudo docker compose up -d
```