Reflector GPU Transcription API (Specification)
This document defines the Reflector GPU transcription API that all implementations must adhere to. Current implementations include NVIDIA Parakeet (NeMo) and Whisper (faster-whisper), both deployed on Modal.com. The API surface and response shapes are OpenAI/Whisper-compatible, so clients can switch implementations by changing only the base URL.
Base URL and Authentication
Example base URLs (Modal web endpoints):
- Parakeet: https://<account>--reflector-transcriber-parakeet-web.modal.run
- Whisper: https://<account>--reflector-transcriber-web.modal.run
All endpoints are served under `/v1` and require a Bearer token:
Authorization: Bearer <REFLECTOR_GPU_APIKEY>
Note: To switch implementations, deploy the desired variant and point TRANSCRIPT_URL to its base URL. The API is identical.
Supported file types
mp3, mp4, mpeg, mpga, m4a, wav, webm
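As a convenience, clients can pre-check extensions before uploading; this is only a sketch of a client-side guard (the server-side check remains authoritative, and `is_supported` is a hypothetical helper, not part of the API):

```python
from pathlib import Path

# Extensions listed in the spec above; the server's validation is authoritative.
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}

def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    return Path(filename).suffix.lstrip(".").lower() in SUPPORTED_EXTENSIONS
```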
Models and languages
- Parakeet (NVIDIA NeMo): default `nvidia/parakeet-tdt-0.6b-v2`
  - Language support: only `en`. Other languages return HTTP 400.
- Whisper (faster-whisper): default `large-v2` (or deployment-specific)
  - Language support: multilingual (per Whisper model capabilities).
Note: The `model` parameter is accepted by all implementations for interface parity. Some backends may treat it as informational.
Endpoints
POST /v1/audio/transcriptions
Transcribe one or more uploaded audio files.
Request: multipart/form-data
- `file` (File) — optional. Single file to transcribe.
- `files` (File[]) — optional. One or more files to transcribe.
- `model` (string) — optional. Defaults to the implementation-specific model (see above).
- `language` (string) — optional, defaults to `en`.
  - Parakeet: only `en` is accepted; other values return HTTP 400
  - Whisper: model-dependent; typically multilingual
- `batch` (boolean) — optional, defaults to `false`.
Notes:
- Provide either `file` or `files`, not both. If neither is provided, HTTP 400.
- `batch` requires `files`; using `batch=true` without `files` returns HTTP 400.
- Response shape for multiple files is the same regardless of `batch`.
- Files sent to this endpoint are processed in a single pass (no VAD/chunking). This is intended for short clips (roughly ≤ 30s; depends on GPU memory/model). For longer audio, prefer `/v1/audio/transcriptions-from-url`, which supports VAD-based chunking.
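The guidance above can be captured in a small routing helper; this is a hypothetical client-side sketch (the 30-second threshold is the rough guide from the spec, not a hard server limit):

```python
def choose_endpoint(duration_seconds: float, threshold: float = 30.0) -> str:
    """Pick the upload endpoint for short clips and the URL endpoint
    (which supports VAD-based chunking) for longer audio."""
    if duration_seconds <= threshold:
        return "/v1/audio/transcriptions"
    return "/v1/audio/transcriptions-from-url"
```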
Responses
Single file response:
{
"text": "transcribed text",
"words": [
{ "word": "hello", "start": 0.0, "end": 0.5 },
{ "word": "world", "start": 0.5, "end": 1.0 }
],
"filename": "audio.mp3"
}
Multiple files response:
{
  "results": [
    { "filename": "a1.mp3", "text": "...", "words": [...] },
    { "filename": "a2.mp3", "text": "...", "words": [...] }
  ]
}
Notes:
- Word objects always include the keys `word`, `start`, `end`.
- Some implementations may include a trailing space in `word` to match Whisper tokenization behavior; clients should trim if needed.
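A client can normalize both response shapes (and the trailing-space behavior) in one place; `normalize_response` below is a hypothetical helper sketch, not part of the API:

```python
def normalize_response(payload: dict) -> list[dict]:
    """Coerce single-file and multi-file responses into a list of results,
    trimming the trailing space some backends leave on each word."""
    results = payload["results"] if "results" in payload else [payload]
    for result in results:
        for w in result.get("words", []):
            w["word"] = w["word"].strip()
    return results
```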
Example curl (single file):
curl -X POST \
-H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
-F "file=@/path/to/audio.mp3" \
-F "language=en" \
"$BASE_URL/v1/audio/transcriptions"
Example curl (multiple files, batch):
curl -X POST \
-H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
-F "files=@/path/a1.mp3" -F "files=@/path/a2.mp3" \
-F "batch=true" -F "language=en" \
"$BASE_URL/v1/audio/transcriptions"
POST /v1/audio/transcriptions-from-url
Transcribe a single remote audio file by URL.
Request: application/json
Body parameters:
- `audio_file_url` (string) — required. URL of the audio file to transcribe.
- `model` (string) — optional. Defaults to the implementation-specific model (see above).
- `language` (string) — optional, defaults to `en`. Parakeet only accepts `en`.
- `timestamp_offset` (number) — optional, defaults to `0.0`. Added to each word's `start`/`end` in the response.
{
"audio_file_url": "https://example.com/audio.mp3",
"model": "nvidia/parakeet-tdt-0.6b-v2",
"language": "en",
"timestamp_offset": 0.0
}
Response:
{
"text": "transcribed text",
"words": [
{ "word": "hello", "start": 10.0, "end": 10.5 },
{ "word": "world", "start": 10.5, "end": 11.0 }
]
}
Notes:
- `timestamp_offset` is added to each word's `start`/`end` in the response.
- Implementations may perform VAD-based chunking and batching for long-form audio; word timings are adjusted accordingly.
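The offset arithmetic is simple addition per word; a minimal sketch of the adjustment the server applies (`apply_offset` is an illustrative name, not a function in any implementation):

```python
def apply_offset(words: list[dict], timestamp_offset: float) -> list[dict]:
    """Add timestamp_offset to each word's start/end, returning new dicts
    so the input list is left untouched."""
    return [
        {**w, "start": w["start"] + timestamp_offset, "end": w["end"] + timestamp_offset}
        for w in words
    ]
```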
Example curl:
curl -X POST \
-H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
-H "Content-Type: application/json" \
-d '{
"audio_file_url": "https://example.com/audio.mp3",
"language": "en",
"timestamp_offset": 0
}' \
"$BASE_URL/v1/audio/transcriptions-from-url"
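The same request can be issued from Python with only the standard library; this sketch builds the request without sending it (sending is left to the caller, e.g. `urllib.request.urlopen(req)`), and `build_transcription_request` is a hypothetical helper:

```python
import json
import urllib.request

def build_transcription_request(base_url: str, api_key: str, audio_file_url: str,
                                language: str = "en",
                                timestamp_offset: float = 0.0) -> urllib.request.Request:
    """Construct the POST request for /v1/audio/transcriptions-from-url
    with the Bearer token and JSON body described in the spec."""
    body = json.dumps({
        "audio_file_url": audio_file_url,
        "language": language,
        "timestamp_offset": timestamp_offset,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions-from-url",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```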
Error handling
- 400 Bad Request
  - Parakeet: `language` other than `en`
  - Missing required parameters (`file`/`files` for upload; `audio_file_url` for the URL endpoint)
  - Unsupported file extension
- 401 Unauthorized
  - Missing or invalid Bearer token
- 404 Not Found
  - `audio_file_url` does not exist
Implementation details
- GPUs: A10G for small-file/live, L40S for large-file URL transcription (subject to deployment)
- VAD chunking and segment batching; word timings adjusted and overlapping ends constrained
- Pads very short segments (< 0.5s) to avoid model crashes on some backends
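The short-segment guard amounts to zero-padding up to a minimum duration; a minimal sketch assuming float PCM samples in a plain list (`pad_segment` is an illustrative name, not code from any implementation):

```python
def pad_segment(samples: list[float], sample_rate: int,
                min_duration: float = 0.5) -> list[float]:
    """Zero-pad a segment shorter than min_duration seconds, mirroring
    the guard against model crashes on very short inputs."""
    min_len = int(min_duration * sample_rate)
    if len(samples) < min_len:
        return samples + [0.0] * (min_len - len(samples))
    return samples
```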
Server configuration (Reflector API)
Set the Reflector server to use the Modal backend and point TRANSCRIPT_URL to your chosen deployment:
TRANSCRIPT_BACKEND=modal
TRANSCRIPT_URL=https://<account>--reflector-transcriber-parakeet-web.modal.run
TRANSCRIPT_MODAL_API_KEY=<REFLECTOR_GPU_APIKEY>
Conformance tests
Use the pytest-based conformance tests to validate any new implementation (including self-hosted) against this spec:
TRANSCRIPT_URL=https://<your-deployment-base> \
TRANSCRIPT_MODAL_API_KEY=your-api-key \
uv run -m pytest -m model_api --no-cov server/tests/test_model_api_transcript.py