Reflector GPU implementation - Transcription and LLM
This repository holds the API for the GPU implementation of the Reflector API service, built on Modal.com.
reflector_diarizer.py - Diarization API
reflector_transcriber.py - Transcription API (Whisper)
reflector_transcriber_parakeet.py - Transcription API (NVIDIA Parakeet)
reflector_translator.py - Translation API
Modal.com deployment
Create a Modal secret and name it reflector-gpu.
It should contain a REFLECTOR_APIKEY environment variable set to the API key of your choice.
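As a sketch, assuming the Modal CLI is installed and you are logged in, the secret can be created from the command line (the key value is a placeholder):
$ modal secret create reflector-gpu REFLECTOR_APIKEY=<your-api-key>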
The deployment is done using the Modal.com service:
$ modal deploy reflector_transcriber.py
...
└── 🔨 Created web => https://xxxx--reflector-transcriber-web.modal.run
$ modal deploy reflector_transcriber_parakeet.py
...
└── 🔨 Created web => https://xxxx--reflector-transcriber-parakeet-web.modal.run
$ modal deploy reflector_llm.py
...
└── 🔨 Created web => https://xxxx--reflector-llm-web.modal.run
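The diarization and translation services referenced below are deployed the same way (the generated hostnames will differ per deployment):
$ modal deploy reflector_diarizer.py
$ modal deploy reflector_translator.py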
Then, in your Reflector API configuration .env, you can set these keys:
TRANSCRIPT_BACKEND=modal
TRANSCRIPT_URL=https://xxxx--reflector-transcriber-web.modal.run
TRANSCRIPT_MODAL_API_KEY=REFLECTOR_APIKEY
DIARIZATION_BACKEND=modal
DIARIZATION_URL=https://xxxx--reflector-diarizer-web.modal.run
DIARIZATION_MODAL_API_KEY=REFLECTOR_APIKEY
TRANSLATION_BACKEND=modal
TRANSLATION_URL=https://xxxx--reflector-translator-web.modal.run
TRANSLATION_MODAL_API_KEY=REFLECTOR_APIKEY
API
Authentication must be passed with the Authorization header, using the bearer scheme.
Authorization: bearer <REFLECTOR_APIKEY>
LLM
POST /llm
request
{
"prompt": "xxx"
}
response
{
"text": "xxx completed"
}
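For illustration, a request against the deployed LLM endpoint could look like the following sketch (the hostname is the one printed by modal deploy, and the prompt is a placeholder):
$ curl -X POST https://xxxx--reflector-llm-web.modal.run/llm \
    -H "Authorization: bearer <REFLECTOR_APIKEY>" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "xxx"}'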
Transcription
Parakeet Transcriber (reflector_transcriber_parakeet.py)
NVIDIA Parakeet is a state-of-the-art ASR model optimized for real-time transcription with superior word-level timestamps.
GPU Configuration:
- A10G GPU - Used for the /v1/audio/transcriptions endpoint (small files, live transcription)
  - Higher concurrency (max_inputs=10)
  - Optimized for multiple small audio files
  - Supports batch processing for efficiency
- L40S GPU - Used for the /v1/audio/transcriptions-from-url endpoint (large files)
  - Lower concurrency but more powerful processing
  - Optimized for single large audio files
  - VAD-based chunking for long-form audio
/v1/audio/transcriptions - Small file transcription
request (multipart/form-data)
file or files[] - audio file(s) to transcribe
model - model name (default: nvidia/parakeet-tdt-0.6b-v2)
language - language code (default: en)
batch - whether to use batch processing for multiple files (default: true)
response
{
"text": "transcribed text",
"words": [
{"word": "hello", "start": 0.0, "end": 0.5},
{"word": "world", "start": 0.5, "end": 1.0}
],
"filename": "audio.mp3"
}
For multiple files with batch=true:
{
"results": [
{
"filename": "audio1.mp3",
"text": "transcribed text",
"words": [...]
},
{
"filename": "audio2.mp3",
"text": "transcribed text",
"words": [...]
}
]
}
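As a non-authoritative sketch, a single small file could be sent like this (assuming the Parakeet hostname printed by modal deploy, with audio.mp3 as a placeholder file):
$ curl -X POST https://xxxx--reflector-transcriber-parakeet-web.modal.run/v1/audio/transcriptions \
    -H "Authorization: bearer <REFLECTOR_APIKEY>" \
    -F "file=@audio.mp3" \
    -F "language=en"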
/v1/audio/transcriptions-from-url - Large file transcription
request (application/json)
{
"audio_file_url": "https://example.com/audio.mp3",
"model": "nvidia/parakeet-tdt-0.6b-v2",
"language": "en",
"timestamp_offset": 0.0
}
response
{
"text": "transcribed text from large file",
"words": [
{"word": "hello", "start": 0.0, "end": 0.5},
{"word": "world", "start": 0.5, "end": 1.0}
]
}
Supported file types: mp3, mp4, mpeg, mpga, m4a, wav, webm
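A sketch of a request for this endpoint, assuming the same Parakeet hostname and a publicly reachable audio URL:
$ curl -X POST https://xxxx--reflector-transcriber-parakeet-web.modal.run/v1/audio/transcriptions-from-url \
    -H "Authorization: bearer <REFLECTOR_APIKEY>" \
    -H "Content-Type: application/json" \
    -d '{"audio_file_url": "https://example.com/audio.mp3", "language": "en"}'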
Whisper Transcriber (reflector_transcriber.py)
POST /transcribe
request (multipart/form-data)
file - audio file
language - language code (e.g. en)
response
{
"text": "xxx",
"words": [
{"text": "xxx", "start": 0.0, "end": 1.0}
]
}
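A minimal sketch of a Whisper transcription request, assuming the transcriber hostname printed by modal deploy and a placeholder audio.wav file:
$ curl -X POST https://xxxx--reflector-transcriber-web.modal.run/transcribe \
    -H "Authorization: bearer <REFLECTOR_APIKEY>" \
    -F "file=@audio.wav" \
    -F "language=en"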