Reflector GPU Transcription API (Specification)
This document defines the Reflector GPU transcription API that all implementations must adhere to. Current implementations include NVIDIA Parakeet (NeMo) and Whisper (faster-whisper), both deployed on Modal.com. The API surface and response shapes are OpenAI/Whisper-compatible, so clients can switch implementations by changing only the base URL.
Base URL and Authentication
Example base URLs (Modal web endpoints):
- Parakeet: https://<account>--reflector-transcriber-parakeet-web.modal.run
- Whisper: https://<account>--reflector-transcriber-web.modal.run
All endpoints are served under `/v1` and require a Bearer token:
Authorization: Bearer <REFLECTOR_GPU_APIKEY>
Note: To switch implementations, deploy the desired variant and point TRANSCRIPT_URL to its base URL. The API is identical.
Supported file types
mp3, mp4, mpeg, mpga, m4a, wav, webm
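As a convenience, clients can pre-check extensions before uploading; this is only a sketch of a client-side guard (the server-side check remains authoritative, and `is_supported` is a hypothetical helper, not part of the API):

```python
from pathlib import Path

# Extensions listed in the spec above; the server's validation is authoritative.
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}

def is_supported(filename: str) -> bool:
    """Return True if the file extension is one the API accepts."""
    return Path(filename).suffix.lstrip(".").lower() in SUPPORTED_EXTENSIONS
```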
Models and languages
- Parakeet (NVIDIA NeMo): default `nvidia/parakeet-tdt-0.6b-v2`
  - Language support: only `en`. Other languages return HTTP 400.
- Whisper (faster-whisper): default `large-v2` (or deployment-specific)
  - Language support: multilingual (per Whisper model capabilities).
Note: The `model` parameter is accepted by all implementations for interface parity. Some backends may treat it as informational.
Endpoints
POST /v1/audio/transcriptions
Transcribe one or more uploaded audio files.
Request: multipart/form-data
- `file` (File) — optional. Single file to transcribe.
- `files` (File[]) — optional. One or more files to transcribe.
- `model` (string) — optional. Defaults to the implementation-specific model (see above).
- `language` (string) — optional, defaults to `en`.
  - Parakeet: only `en` is accepted; other values return HTTP 400
  - Whisper: model-dependent; typically multilingual
- `batch` (boolean) — optional, defaults to `false`.
Notes:
- Provide either `file` or `files`, not both. If neither is provided, HTTP 400.
- `batch` requires `files`; using `batch=true` without `files` returns HTTP 400.
- Response shape for multiple files is the same regardless of `batch`.
- Files sent to this endpoint are processed in a single pass (no VAD/chunking). This is intended for short clips (roughly ≤ 30s; depends on GPU memory/model). For longer audio, prefer `/v1/audio/transcriptions-from-url`, which supports VAD-based chunking.
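The guidance above can be captured in a small routing helper; this is a hypothetical client-side sketch (the 30-second threshold is the rough guide from the spec, not a hard server limit):

```python
def choose_endpoint(duration_seconds: float, threshold: float = 30.0) -> str:
    """Pick the upload endpoint for short clips and the URL endpoint
    (which supports VAD-based chunking) for longer audio."""
    if duration_seconds <= threshold:
        return "/v1/audio/transcriptions"
    return "/v1/audio/transcriptions-from-url"
```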
Responses
Single file response:
{
"text": "transcribed text",
"words": [
{ "word": "hello", "start": 0.0, "end": 0.5 },
{ "word": "world", "start": 0.5, "end": 1.0 }
],
"filename": "audio.mp3"
}
Multiple files response:
{
  "results": [
    { "filename": "a1.mp3", "text": "...", "words": [...] },
    { "filename": "a2.mp3", "text": "...", "words": [...] }
  ]
}
Notes:
- Word objects always include the keys `word`, `start`, `end`.
- Some implementations may include a trailing space in `word` to match Whisper tokenization behavior; clients should trim if needed.
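A client can normalize both response shapes (and the trailing-space behavior) in one place; `normalize_response` below is a hypothetical helper sketch, not part of the API:

```python
def normalize_response(payload: dict) -> list[dict]:
    """Coerce single-file and multi-file responses into a list of results,
    trimming the trailing space some backends leave on each word."""
    results = payload["results"] if "results" in payload else [payload]
    for result in results:
        for w in result.get("words", []):
            w["word"] = w["word"].strip()
    return results
```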
Example curl (single file):
curl -X POST \
-H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
-F "file=@/path/to/audio.mp3" \
-F "language=en" \
"$BASE_URL/v1/audio/transcriptions"
Example curl (multiple files, batch):
curl -X POST \
-H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
-F "files=@/path/a1.mp3" -F "files=@/path/a2.mp3" \
-F "batch=true" -F "language=en" \
"$BASE_URL/v1/audio/transcriptions"
POST /v1/audio/transcriptions-from-url
Transcribe a single remote audio file by URL.
Request: application/json
Body parameters:
- `audio_file_url` (string) — required. URL of the audio file to transcribe.
- `model` (string) — optional. Defaults to the implementation-specific model (see above).
- `language` (string) — optional, defaults to `en`. Parakeet only accepts `en`.
- `timestamp_offset` (number) — optional, defaults to `0.0`. Added to each word's `start`/`end` in the response.
{
"audio_file_url": "https://example.com/audio.mp3",
"model": "nvidia/parakeet-tdt-0.6b-v2",
"language": "en",
"timestamp_offset": 0.0
}
Response:
{
"text": "transcribed text",
"words": [
{ "word": "hello", "start": 10.0, "end": 10.5 },
{ "word": "world", "start": 10.5, "end": 11.0 }
]
}
Notes:
- `timestamp_offset` is added to each word's `start`/`end` in the response.
- Implementations may perform VAD-based chunking and batching for long-form audio; word timings are adjusted accordingly.
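The offset arithmetic is simple addition per word; a minimal sketch of the adjustment the server applies (`apply_offset` is an illustrative name, not a function in any implementation):

```python
def apply_offset(words: list[dict], timestamp_offset: float) -> list[dict]:
    """Add timestamp_offset to each word's start/end, returning new dicts
    so the input list is left untouched."""
    return [
        {**w, "start": w["start"] + timestamp_offset, "end": w["end"] + timestamp_offset}
        for w in words
    ]
```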
Example curl:
curl -X POST \
-H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
-H "Content-Type: application/json" \
-d '{
"audio_file_url": "https://example.com/audio.mp3",
"language": "en",
"timestamp_offset": 0
}' \
"$BASE_URL/v1/audio/transcriptions-from-url"
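The same request can be issued from Python with only the standard library; this sketch builds the request without sending it (sending is left to the caller, e.g. `urllib.request.urlopen(req)`), and `build_transcription_request` is a hypothetical helper:

```python
import json
import urllib.request

def build_transcription_request(base_url: str, api_key: str, audio_file_url: str,
                                language: str = "en",
                                timestamp_offset: float = 0.0) -> urllib.request.Request:
    """Construct the POST request for /v1/audio/transcriptions-from-url
    with the Bearer token and JSON body described in the spec."""
    body = json.dumps({
        "audio_file_url": audio_file_url,
        "language": language,
        "timestamp_offset": timestamp_offset,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions-from-url",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```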
Error handling
- 400 Bad Request
  - Parakeet: `language` other than `en`
  - Missing required parameters (`file`/`files` for upload; `audio_file_url` for the URL endpoint)
  - Unsupported file extension
- 401 Unauthorized
  - Missing or invalid Bearer token
- 404 Not Found
  - `audio_file_url` does not exist
Implementation details
- GPUs: A10G for small-file/live, L40S for large-file URL transcription (subject to deployment)
- VAD chunking and segment batching; word timings adjusted and overlapping ends constrained
- Pads very short segments (< 0.5s) to avoid model crashes on some backends
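The short-segment guard amounts to zero-padding up to a minimum duration; a minimal sketch assuming float PCM samples in a plain list (`pad_segment` is an illustrative name, not code from any implementation):

```python
def pad_segment(samples: list[float], sample_rate: int,
                min_duration: float = 0.5) -> list[float]:
    """Zero-pad a segment shorter than min_duration seconds, mirroring
    the guard against model crashes on very short inputs."""
    min_len = int(min_duration * sample_rate)
    if len(samples) < min_len:
        return samples + [0.0] * (min_len - len(samples))
    return samples
```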
Server configuration (Reflector API)
Set the Reflector server to use the Modal backend and point TRANSCRIPT_URL to your chosen deployment:
TRANSCRIPT_BACKEND=modal
TRANSCRIPT_URL=https://<account>--reflector-transcriber-parakeet-web.modal.run
TRANSCRIPT_MODAL_API_KEY=<REFLECTOR_GPU_APIKEY>
Conformance tests
Use the pytest-based conformance tests to validate any new implementation (including self-hosted) against this spec:
TRANSCRIPT_URL=https://<your-deployment-base> \
TRANSCRIPT_MODAL_API_KEY=your-api-key \
uv run -m pytest -m model_api --no-cov server/tests/test_model_api_transcript.py