fix: align whisper transcriber api with parakeet (#602)

* Documents transcriber api

* Update whisper transcriber api to match parakeet

* Update api transcription spec

* Return 400 for unsupported file type

* Add params to api spec

* Update whisper transcriber implementation to match parakeet
2025-09-05 10:52:14 +02:00
committed by GitHub
parent dc82f8bb3b
commit 0663700a61
3 changed files with 705 additions and 61 deletions


@@ -0,0 +1,194 @@
## Reflector GPU Transcription API (Specification)
This document defines the Reflector GPU transcription API that all implementations must adhere to. Current implementations include NVIDIA Parakeet (NeMo) and Whisper (faster-whisper), both deployed on Modal.com. The API surface and response shapes are OpenAI/Whisper-compatible, so clients can switch implementations by changing only the base URL.
### Base URL and Authentication
- Example base URLs (Modal web endpoints):
- Parakeet: `https://<account>--reflector-transcriber-parakeet-web.modal.run`
- Whisper: `https://<account>--reflector-transcriber-web.modal.run`
- All endpoints are served under `/v1` and require a Bearer token:
```
Authorization: Bearer <REFLECTOR_GPU_APIKEY>
```
Note: To switch implementations, deploy the desired variant and point `TRANSCRIPT_URL` to its base URL. The API is identical.
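For example, a minimal authenticated request from Python might look like the following (illustrative sketch only; it assumes the `requests` package and reuses the `TRANSCRIPT_URL`/`TRANSCRIPT_MODAL_API_KEY` values described in the server configuration section below):
```python
import os

import requests  # illustrative client; any HTTP library works

BASE_URL = os.environ["TRANSCRIPT_URL"]           # base URL of the chosen deployment
API_KEY = os.environ["TRANSCRIPT_MODAL_API_KEY"]  # the REFLECTOR_GPU_APIKEY value

with open("audio.mp3", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("audio.mp3", f, "audio/mpeg")},
        data={"language": "en"},
    )
resp.raise_for_status()
print(resp.json()["text"])
```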
### Supported file types
`mp3, mp4, mpeg, mpga, m4a, wav, webm`
### Models and languages
- Parakeet (NVIDIA NeMo): default `nvidia/parakeet-tdt-0.6b-v2`
- Language support: only `en`. Other languages return HTTP 400.
- Whisper (faster-whisper): default `large-v2` (or deployment-specific)
- Language support: multilingual (per Whisper model capabilities).
Note: The `model` parameter is accepted by all implementations for interface parity. Some backends may treat it as informational.
### Endpoints
#### POST /v1/audio/transcriptions
Transcribe one or more uploaded audio files.
Request: multipart/form-data
- `file` (File) — optional. Single file to transcribe.
- `files` (File[]) — optional. One or more files to transcribe.
- `model` (string) — optional. Defaults to the implementation-specific model (see above).
- `language` (string) — optional, defaults to `en`.
- Parakeet: only `en` is accepted; other values return HTTP 400
- Whisper: model-dependent; typically multilingual
- `batch` (boolean) — optional, defaults to `false`.
Notes:
- Provide either `file` or `files`, not both. If neither is provided, HTTP 400.
- `batch` requires `files`; using `batch=true` without `files` returns HTTP 400.
- Response shape for multiple files is the same regardless of `batch`.
- Files sent to this endpoint are processed in a single pass (no VAD/chunking). This is intended for short clips (roughly ≤ 30s; depends on GPU memory/model). For longer audio, prefer `/v1/audio/transcriptions-from-url` which supports VAD-based chunking.
Responses
Single file response:
```json
{
  "text": "transcribed text",
  "words": [
    { "word": "hello", "start": 0.0, "end": 0.5 },
    { "word": "world", "start": 0.5, "end": 1.0 }
  ],
  "filename": "audio.mp3"
}
```
Multiple files response:
```json
{
  "results": [
    { "filename": "a1.mp3", "text": "...", "words": [...] },
    { "filename": "a2.mp3", "text": "...", "words": [...] }
  ]
}
```
Notes:
- Word objects always include keys: `word`, `start`, `end`.
- Some implementations may include a trailing space in `word` to match Whisper tokenization behavior; clients should trim if needed.
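For example, a client that wants clean tokens can strip them defensively (hypothetical helper, not part of the API):
```python
def normalize_words(words: list[dict]) -> list[dict]:
    """Strip the trailing space some backends keep on Whisper-style tokens."""
    return [{**w, "word": w["word"].strip()} for w in words]
```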
Example curl (single file):
```bash
curl -X POST \
  -H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
  -F "file=@/path/to/audio.mp3" \
  -F "language=en" \
  "$BASE_URL/v1/audio/transcriptions"
```
Example curl (multiple files, batch):
```bash
curl -X POST \
  -H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
  -F "files=@/path/a1.mp3" -F "files=@/path/a2.mp3" \
  -F "batch=true" -F "language=en" \
  "$BASE_URL/v1/audio/transcriptions"
```
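The same batch request from Python (sketch only; repeating the `files` field name is how multipart encodes a file list, and the environment variables are the same as in the single-file example above):
```python
import os

import requests  # illustrative client

BASE_URL = os.environ["TRANSCRIPT_URL"]
API_KEY = os.environ["TRANSCRIPT_MODAL_API_KEY"]

paths = ["/path/a1.mp3", "/path/a2.mp3"]
multipart = [("files", (os.path.basename(p), open(p, "rb"), "audio/mpeg")) for p in paths]

resp = requests.post(
    f"{BASE_URL}/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    files=multipart,
    data={"batch": "true", "language": "en"},
)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["filename"], result["text"])
```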
#### POST /v1/audio/transcriptions-from-url
Transcribe a single remote audio file by URL.
Request: application/json
Body parameters:
- `audio_file_url` (string) — required. URL of the audio file to transcribe.
- `model` (string) — optional. Defaults to the implementation-specific model (see above).
- `language` (string) — optional, defaults to `en`. Parakeet only accepts `en`.
- `timestamp_offset` (number) — optional, defaults to `0.0`. Added to each word's `start`/`end` in the response.
```json
{
  "audio_file_url": "https://example.com/audio.mp3",
  "model": "nvidia/parakeet-tdt-0.6b-v2",
  "language": "en",
  "timestamp_offset": 0.0
}
```
Response:
```json
{
  "text": "transcribed text",
  "words": [
    { "word": "hello", "start": 10.0, "end": 10.5 },
    { "word": "world", "start": 10.5, "end": 11.0 }
  ]
}
```
Notes:
- `timestamp_offset` is added to each word's `start`/`end` in the response.
- Implementations may perform VAD-based chunking and batching for long-form audio; word timings are adjusted accordingly.
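For example, with `timestamp_offset: 120.0`, a word recognized at 3.2–3.6 seconds into the file is returned as `"start": 123.2, "end": 123.6`.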
Example curl:
```bash
curl -X POST \
  -H "Authorization: Bearer $REFLECTOR_GPU_APIKEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file_url": "https://example.com/audio.mp3",
    "language": "en",
    "timestamp_offset": 0
  }' \
  "$BASE_URL/v1/audio/transcriptions-from-url"
```
### Error handling
- 400 Bad Request
- Parakeet: `language` other than `en`
- Missing required parameters (`file`/`files` for upload; `audio_file_url` for URL endpoint)
- Unsupported file extension
- 401 Unauthorized
- Missing or invalid Bearer token
- 404 Not Found
- `audio_file_url` points to a file that does not exist (the upstream URL returns 404)
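A client can map these status codes onto application-level errors roughly as follows (sketch only; the exact `detail` strings are implementation-specific):
```python
import requests


def parse_transcription_response(resp: requests.Response) -> dict:
    """Translate the documented error codes into client-side exceptions."""
    if resp.status_code == 400:
        raise ValueError(f"Bad request (unsupported language or file type?): {resp.text}")
    if resp.status_code == 401:
        raise PermissionError("Missing or invalid Bearer token (REFLECTOR_GPU_APIKEY)")
    if resp.status_code == 404:
        raise FileNotFoundError("audio_file_url does not point to an existing file")
    resp.raise_for_status()  # any other unexpected status
    return resp.json()
```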
### Implementation details
- GPUs: A10G for small-file/live, L40S for large-file URL transcription (subject to deployment)
- VAD chunking and segment batching; word timings are adjusted to absolute positions (see the sketch below) and overlapping ends are constrained
- Pads very short segments (< 0.5s) to avoid model crashes on some backends
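The word-timing adjustment amounts to shifting per-segment times into absolute positions (illustrative sketch, not normative; the function name is made up):
```python
def absolute_word_time(word_time: float, segment_start: float, timestamp_offset: float) -> float:
    """Per-segment word times are shifted by the segment's position in the file plus the requested offset."""
    return round(segment_start + word_time + timestamp_offset, 2)
```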
### Server configuration (Reflector API)
Set the Reflector server to use the Modal backend and point `TRANSCRIPT_URL` to your chosen deployment:
```
TRANSCRIPT_BACKEND=modal
TRANSCRIPT_URL=https://<account>--reflector-transcriber-parakeet-web.modal.run
TRANSCRIPT_MODAL_API_KEY=<REFLECTOR_GPU_APIKEY>
```
### Conformance tests
Use the pytest-based conformance tests to validate any new implementation (including self-hosted) against this spec:
```
TRANSCRIPT_URL=https://<your-deployment-base> \
TRANSCRIPT_MODAL_API_KEY=your-api-key \
uv run -m pytest -m gpu_modal --no-cov server/tests/test_gpu_modal_transcript.py
```


@@ -1,41 +1,78 @@
import os
import sys
import threading
import uuid
from typing import Generator, Mapping, NamedTuple, NewType, TypedDict
from urllib.parse import urlparse

import modal

MODEL_NAME = "large-v2"
MODEL_COMPUTE_TYPE: str = "float16"
MODEL_NUM_WORKERS: int = 1
MINUTES = 60  # seconds
SAMPLERATE = 16000
UPLOADS_PATH = "/uploads"
CACHE_PATH = "/models"
SUPPORTED_FILE_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"]
VAD_CONFIG = {
    "batch_max_duration": 30.0,
    "silence_padding": 0.5,
    "window_size": 512,
}

WhisperUniqFilename = NewType("WhisperUniqFilename", str)
AudioFileExtension = NewType("AudioFileExtension", str)

app = modal.App("reflector-transcriber")
model_cache = modal.Volume.from_name("models", create_if_missing=True)
upload_volume = modal.Volume.from_name("whisper-uploads", create_if_missing=True)


class TimeSegment(NamedTuple):
    """Represents a time segment with start and end times."""

    start: float
    end: float


class AudioSegment(NamedTuple):
    """Represents an audio segment with timing and audio data."""

    start: float
    end: float
    audio: any


class TranscriptResult(NamedTuple):
    """Represents a transcription result with text and word timings."""

    text: str
    words: list["WordTiming"]


class WordTiming(TypedDict):
    """Represents a word with its timing information."""

    word: str
    start: float
    end: float


def download_model():
    from faster_whisper import download_model

    model_cache.reload()
    download_model(MODEL_NAME, cache_dir=CACHE_PATH)
    model_cache.commit()


image = (
    modal.Image.debian_slim(python_version="3.12")
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",
@@ -45,19 +82,98 @@ image = (
            ),
        }
    )
    .apt_install("ffmpeg")
    .pip_install(
        "huggingface_hub==0.27.1",
        "hf-transfer==0.1.9",
        "torch==2.5.1",
        "faster-whisper==1.1.1",
        "fastapi==0.115.12",
        "requests",
        "librosa==0.10.1",
        "numpy<2",
        "silero-vad==5.1.0",
    )
    .run_function(download_model, volumes={CACHE_PATH: model_cache})
)


def detect_audio_format(url: str, headers: Mapping[str, str]) -> AudioFileExtension:
    parsed_url = urlparse(url)
    url_path = parsed_url.path

    for ext in SUPPORTED_FILE_EXTENSIONS:
        if url_path.lower().endswith(f".{ext}"):
            return AudioFileExtension(ext)

    content_type = headers.get("content-type", "").lower()
    if "audio/mpeg" in content_type or "audio/mp3" in content_type:
        return AudioFileExtension("mp3")
    if "audio/wav" in content_type:
        return AudioFileExtension("wav")
    if "audio/mp4" in content_type:
        return AudioFileExtension("mp4")

    raise ValueError(
        f"Unsupported audio format for URL: {url}. "
        f"Supported extensions: {', '.join(SUPPORTED_FILE_EXTENSIONS)}"
    )


def download_audio_to_volume(
    audio_file_url: str,
) -> tuple[WhisperUniqFilename, AudioFileExtension]:
    import requests
    from fastapi import HTTPException

    response = requests.head(audio_file_url, allow_redirects=True)
    if response.status_code == 404:
        raise HTTPException(status_code=404, detail="Audio file not found")

    response = requests.get(audio_file_url, allow_redirects=True)
    response.raise_for_status()

    audio_suffix = detect_audio_format(audio_file_url, response.headers)
    unique_filename = WhisperUniqFilename(f"{uuid.uuid4()}.{audio_suffix}")
    file_path = f"{UPLOADS_PATH}/{unique_filename}"

    with open(file_path, "wb") as f:
        f.write(response.content)

    upload_volume.commit()
    return unique_filename, audio_suffix


def pad_audio(audio_array, sample_rate: int = SAMPLERATE):
    """Add 0.5s of silence if audio is shorter than the silence_padding window.

    Whisper does not require this strictly, but aligning behavior with Parakeet
    avoids edge-case crashes on extremely short inputs and makes comparisons easier.
    """
    import numpy as np

    audio_duration = len(audio_array) / sample_rate
    if audio_duration < VAD_CONFIG["silence_padding"]:
        silence_samples = int(sample_rate * VAD_CONFIG["silence_padding"])
        silence = np.zeros(silence_samples, dtype=np.float32)
        return np.concatenate([audio_array, silence])
    return audio_array


@app.cls(
    gpu="A10G",
    timeout=5 * MINUTES,
    scaledown_window=5 * MINUTES,
    image=image,
    volumes={CACHE_PATH: model_cache, UPLOADS_PATH: upload_volume},
)
@modal.concurrent(max_inputs=10)
class TranscriberWhisperLive:
    """Live transcriber class for small audio segments (A10G).

    Mirrors the Parakeet live class API but uses Faster-Whisper under the hood.
    """

    @modal.enter()
    def enter(self):
        import faster_whisper
@@ -71,23 +187,200 @@ class Transcriber:
            device=self.device,
            compute_type=MODEL_COMPUTE_TYPE,
            num_workers=MODEL_NUM_WORKERS,
            download_root=CACHE_PATH,
            local_files_only=True,
        )
        print(f"Model is on device: {self.device}")

    @modal.method()
    def transcribe_segment(
        self,
        filename: str,
        language: str = "en",
    ):
        """Transcribe a single uploaded audio file by filename."""
        upload_volume.reload()

        file_path = f"{UPLOADS_PATH}/{filename}"
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        with self.lock:
            with NoStdStreams():
                segments, _ = self.model.transcribe(
                    file_path,
                    language=language,
                    beam_size=5,
                    word_timestamps=True,
                    vad_filter=True,
                    vad_parameters={"min_silence_duration_ms": 500},
                )
                segments = list(segments)

        text = "".join(segment.text for segment in segments).strip()
        words = [
            {
                "word": word.word,
                "start": round(float(word.start), 2),
                "end": round(float(word.end), 2),
            }
            for segment in segments
            for word in segment.words
        ]
        return {"text": text, "words": words}

    @modal.method()
    def transcribe_batch(
        self,
        filenames: list[str],
        language: str = "en",
    ):
        """Transcribe multiple uploaded audio files and return per-file results."""
        upload_volume.reload()

        results = []
        for filename in filenames:
            file_path = f"{UPLOADS_PATH}/{filename}"
            if not os.path.exists(file_path):
                raise FileNotFoundError(f"Batch file not found: {file_path}")

            with self.lock:
                with NoStdStreams():
                    segments, _ = self.model.transcribe(
                        file_path,
                        language=language,
                        beam_size=5,
                        word_timestamps=True,
                        vad_filter=True,
                        vad_parameters={"min_silence_duration_ms": 500},
                    )
                    segments = list(segments)

            text = "".join(seg.text for seg in segments).strip()
            words = [
                {
                    "word": w.word,
                    "start": round(float(w.start), 2),
                    "end": round(float(w.end), 2),
                }
                for seg in segments
                for w in seg.words
            ]
            results.append(
                {
                    "filename": filename,
                    "text": text,
                    "words": words,
                }
            )

        return results


@app.cls(
    gpu="L40S",
    timeout=15 * MINUTES,
    image=image,
    volumes={CACHE_PATH: model_cache, UPLOADS_PATH: upload_volume},
)
class TranscriberWhisperFile:
    """File transcriber for larger/longer audio, using VAD-driven batching (L40S)."""

    @modal.enter()
    def enter(self):
        import faster_whisper
        import torch
        from silero_vad import load_silero_vad

        self.lock = threading.Lock()
        self.use_gpu = torch.cuda.is_available()
        self.device = "cuda" if self.use_gpu else "cpu"
        self.model = faster_whisper.WhisperModel(
            MODEL_NAME,
            device=self.device,
            compute_type=MODEL_COMPUTE_TYPE,
            num_workers=MODEL_NUM_WORKERS,
            download_root=CACHE_PATH,
            local_files_only=True,
        )
        self.vad_model = load_silero_vad(onnx=False)

    @modal.method()
    def transcribe_segment(
        self, filename: str, timestamp_offset: float = 0.0, language: str = "en"
    ):
        import librosa
        import numpy as np
        from silero_vad import VADIterator

        def vad_segments(
            audio_array,
            sample_rate: int = SAMPLERATE,
            window_size: int = VAD_CONFIG["window_size"],
        ) -> Generator[TimeSegment, None, None]:
            """Generate speech segments as TimeSegment using Silero VAD."""
            iterator = VADIterator(self.vad_model, sampling_rate=sample_rate)
            start = None
            for i in range(0, len(audio_array), window_size):
                chunk = audio_array[i : i + window_size]
                if len(chunk) < window_size:
                    chunk = np.pad(
                        chunk, (0, window_size - len(chunk)), mode="constant"
                    )
                speech = iterator(chunk)
                if not speech:
                    continue
                if "start" in speech:
                    start = speech["start"]
                    continue
                if "end" in speech and start is not None:
                    end = speech["end"]
                    yield TimeSegment(
                        start / float(SAMPLERATE), end / float(SAMPLERATE)
                    )
                    start = None
            iterator.reset_states()

        upload_volume.reload()

        file_path = f"{UPLOADS_PATH}/{filename}"
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        audio_array, _sr = librosa.load(file_path, sr=SAMPLERATE, mono=True)

        # Batch segments up to ~30s windows by merging contiguous VAD segments
        merged_batches: list[TimeSegment] = []
        batch_start = None
        batch_end = None
        max_duration = VAD_CONFIG["batch_max_duration"]
        for segment in vad_segments(audio_array):
            seg_start, seg_end = segment.start, segment.end
            if batch_start is None:
                batch_start, batch_end = seg_start, seg_end
                continue
            if seg_end - batch_start <= max_duration:
                batch_end = seg_end
            else:
                merged_batches.append(TimeSegment(batch_start, batch_end))
                batch_start, batch_end = seg_start, seg_end
        if batch_start is not None and batch_end is not None:
            merged_batches.append(TimeSegment(batch_start, batch_end))

        all_text = []
        all_words = []
        for segment in merged_batches:
            start_time, end_time = segment.start, segment.end
            s_idx = int(start_time * SAMPLERATE)
            e_idx = int(end_time * SAMPLERATE)
            segment = audio_array[s_idx:e_idx]
            segment = pad_audio(segment, SAMPLERATE)

            with self.lock:
                segments, _ = self.model.transcribe(
                    segment,
                    language=language,
                    beam_size=5,
                    word_timestamps=True,
@@ -96,66 +389,220 @@ class Transcriber:
                )
                segments = list(segments)

            text = "".join(seg.text for seg in segments).strip()
            words = [
                {
                    "word": w.word,
                    "start": round(float(w.start) + start_time + timestamp_offset, 2),
                    "end": round(float(w.end) + start_time + timestamp_offset, 2),
                }
                for seg in segments
                for w in seg.words
            ]
            if text:
                all_text.append(text)
                all_words.extend(words)

        return {"text": " ".join(all_text), "words": all_words}


def detect_audio_format(url: str, headers: dict) -> str:
    from urllib.parse import urlparse

    from fastapi import HTTPException

    url_path = urlparse(url).path
    for ext in SUPPORTED_FILE_EXTENSIONS:
        if url_path.lower().endswith(f".{ext}"):
            return ext

    content_type = headers.get("content-type", "").lower()
    if "audio/mpeg" in content_type or "audio/mp3" in content_type:
        return "mp3"
    if "audio/wav" in content_type:
        return "wav"
    if "audio/mp4" in content_type:
        return "mp4"

    raise HTTPException(
        status_code=400,
        detail=(
            f"Unsupported audio format for URL. Supported extensions: {', '.join(SUPPORTED_FILE_EXTENSIONS)}"
        ),
    )


def download_audio_to_volume(audio_file_url: str) -> tuple[str, str]:
    import requests
    from fastapi import HTTPException

    response = requests.head(audio_file_url, allow_redirects=True)
    if response.status_code == 404:
        raise HTTPException(status_code=404, detail="Audio file not found")

    response = requests.get(audio_file_url, allow_redirects=True)
    response.raise_for_status()

    audio_suffix = detect_audio_format(audio_file_url, response.headers)
    unique_filename = f"{uuid.uuid4()}.{audio_suffix}"
    file_path = f"{UPLOADS_PATH}/{unique_filename}"

    with open(file_path, "wb") as f:
        f.write(response.content)

    upload_volume.commit()
    return unique_filename, audio_suffix


@app.function(
    scaledown_window=60,
    timeout=600,
    secrets=[
        modal.Secret.from_name("reflector-gpu"),
    ],
    volumes={CACHE_PATH: model_cache, UPLOADS_PATH: upload_volume},
    image=image,
)
@modal.concurrent(max_inputs=40)
@modal.asgi_app()
def web():
    from fastapi import (
        Body,
        Depends,
        FastAPI,
        Form,
        HTTPException,
        UploadFile,
        status,
    )
    from fastapi.security import OAuth2PasswordBearer

    transcriber_live = TranscriberWhisperLive()
    transcriber_file = TranscriberWhisperFile()

    app = FastAPI()
    oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

    def apikey_auth(apikey: str = Depends(oauth2_scheme)):
        if apikey == os.environ["REFLECTOR_GPU_APIKEY"]:
            return
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid API key",
            headers={"WWW-Authenticate": "Bearer"},
        )

    class TranscriptResponse(dict):
        pass

    @app.post("/v1/audio/transcriptions", dependencies=[Depends(apikey_auth)])
    def transcribe(
        file: UploadFile = None,
        files: list[UploadFile] | None = None,
        model: str = Form(MODEL_NAME),
        language: str = Form("en"),
        batch: bool = Form(False),
    ):
        if not file and not files:
            raise HTTPException(
                status_code=400, detail="Either 'file' or 'files' parameter is required"
            )
        if batch and not files:
            raise HTTPException(
                status_code=400, detail="Batch transcription requires 'files'"
            )

        upload_files = [file] if file else files

        uploaded_filenames: list[str] = []
        for upload_file in upload_files:
            audio_suffix = upload_file.filename.split(".")[-1]
            if audio_suffix not in SUPPORTED_FILE_EXTENSIONS:
                raise HTTPException(
                    status_code=400,
                    detail=(
                        f"Unsupported audio format. Supported extensions: {', '.join(SUPPORTED_FILE_EXTENSIONS)}"
                    ),
                )

            unique_filename = f"{uuid.uuid4()}.{audio_suffix}"
            file_path = f"{UPLOADS_PATH}/{unique_filename}"
            with open(file_path, "wb") as f:
                content = upload_file.file.read()
                f.write(content)
            uploaded_filenames.append(unique_filename)

        upload_volume.commit()

        try:
            if batch and len(upload_files) > 1:
                func = transcriber_live.transcribe_batch.spawn(
                    filenames=uploaded_filenames,
                    language=language,
                )
                results = func.get()
                return {"results": results}

            results = []
            for filename in uploaded_filenames:
                func = transcriber_live.transcribe_segment.spawn(
                    filename=filename,
                    language=language,
                )
                result = func.get()
                result["filename"] = filename
                results.append(result)

            return {"results": results} if len(results) > 1 else results[0]
        finally:
            for filename in uploaded_filenames:
                try:
                    file_path = f"{UPLOADS_PATH}/{filename}"
                    os.remove(file_path)
                except Exception:
                    pass
            upload_volume.commit()

    @app.post("/v1/audio/transcriptions-from-url", dependencies=[Depends(apikey_auth)])
    def transcribe_from_url(
        audio_file_url: str = Body(
            ..., description="URL of the audio file to transcribe"
        ),
        model: str = Body(MODEL_NAME),
        language: str = Body("en"),
        timestamp_offset: float = Body(0.0),
    ):
        unique_filename, _audio_suffix = download_audio_to_volume(audio_file_url)

        try:
            func = transcriber_file.transcribe_segment.spawn(
                filename=unique_filename,
                timestamp_offset=timestamp_offset,
                language=language,
            )
            result = func.get()
            return result
        finally:
            try:
                file_path = f"{UPLOADS_PATH}/{unique_filename}"
                os.remove(file_path)
                upload_volume.commit()
            except Exception:
                pass

    return app


class NoStdStreams:
    def __init__(self):
        self.devnull = open(os.devnull, "w")

    def __enter__(self):
        self._stdout, self._stderr = sys.stdout, sys.stderr
        self._stdout.flush()
        self._stderr.flush()
        sys.stdout, sys.stderr = self.devnull, self.devnull

    def __exit__(self, exc_type, exc_value, traceback):
        sys.stdout, sys.stderr = self._stdout, self._stderr
        self.devnull.close()


@@ -272,6 +272,9 @@ class TestGPUModalTranscript:
            for f in temp_files:
                Path(f).unlink(missing_ok=True)

    @pytest.mark.skipif(
        "parakeet" not in get_model_name(), reason="Parakeet only supports English"
    )
    def test_transcriptions_error_handling(self):
        """Test error handling for invalid requests."""
        url = get_modal_transcript_url()