feat: pipeline improvement with file processing, parakeet, silero-vad (#540)

* feat: improve pipeline threading, and transcriber (parakeet and silero vad)
* refactor: remove whisperx, implement parakeet
* refactor: make audio_chunker more smart and wait for speech, instead of fixed frame
* refactor: make audio merge to always downscale the audio to 16k for transcription
* refactor: make the audio transcript modal accepting batches
* refactor: improve type safety and remove prometheus metrics
  - Add DiarizationSegment TypedDict for proper diarization typing
  - Replace List/Optional with modern Python list/| None syntax
  - Remove all Prometheus metrics from TranscriptDiarizationAssemblerProcessor
  - Add comprehensive file processing pipeline with parallel execution
  - Update processor imports and type annotations throughout
  - Implement optimized file pipeline as default in process.py tool
* refactor: convert FileDiarizationProcessor I/O types to BaseModel
  Update FileDiarizationInput and FileDiarizationOutput to inherit from BaseModel instead of plain classes, following the standard pattern used by other processors in the codebase.
* test: add tests for file transcript and diarization with pytest-recording
* build: add pytest-recording
* feat: add local pyannote for testing
* fix: replace PyAV AudioResampler with torchaudio for reliable audio processing
  - Replace problematic PyAV AudioResampler that was causing ValueError: [Errno 22] Invalid argument
  - Use torchaudio.functional.resample for robust sample rate conversion
  - Optimize processing: skip conversion for already 16kHz mono audio
  - Add direct WAV writing with Python wave module for better performance
  - Consolidate duplicate downsample checks for cleaner code
  - Maintain list[av.AudioFrame] input interface
  - Required for Silero VAD which needs 16kHz mono audio
* fix: replace PyAV AudioResampler with torchaudio solution
  - Resolves ValueError: [Errno 22] Invalid argument in AudioMergeProcessor
  - Replaces problematic PyAV AudioResampler with torchaudio.functional.resample
  - Optimizes processing to skip unnecessary conversions when audio is already 16kHz mono
  - Uses direct WAV writing with Python's wave module for better performance
  - Fixes test_basic_process to disable diarization (pyannote dependency not installed)
  - Updates test expectations to match actual processor behavior
  - Removes unused pydub dependency from pyproject.toml
  - Adds comprehensive TEST_ANALYSIS.md documenting test suite status
* feat: add parameterized test for both diarization modes
  - Adds @pytest.mark.parametrize to test_basic_process with enable_diarization=[False, True]
  - Test with diarization=False always passes (tests core AudioMergeProcessor functionality)
  - Test with diarization=True gracefully skips when pyannote.audio is not installed
  - Provides comprehensive test coverage for both pipeline configurations
* fix: resolve pipeline property naming conflict in AudioDiarizationPyannoteProcessor
  - Renames 'pipeline' property to 'diarization_pipeline' to avoid conflict with base Processor.pipeline attribute
  - Fixes AttributeError: 'property 'pipeline' object has no setter' when set_pipeline() is called
  - Updates property usage in _diarize method to use new name
  - Now correctly supports pipeline initialization for diarization processing
* fix: add local for pyannote
* test: add diarization test
* fix: resample on audio merge now working
* fix: correctly restore timestamp
* fix: display exception in a threaded processor if that happen
* Update pyproject.toml
* ci: remove option
* ci: update astral-sh/setup-uv
* test: add monadical url for pytest-recording
* refactor: remove previous version
* build: move faster whisper to local dep
* test: fix missing import
* refactor: improve main_file_pipeline organization and error handling
  - Move all imports to the top of the file
  - Create unified EmptyPipeline class to replace duplicate mock pipeline code
  - Remove timeout and fallback logic - let processors handle their own retries
  - Fix error handling to raise any exception from parallel tasks
  - Add proper type hints and validation for captured results
* fix: wrong function
* fix: remove task_done
* feat: add configurable file processing timeouts for modal processors
  - Add TRANSCRIPT_FILE_TIMEOUT setting (default: 600s) for file transcription
  - Add DIARIZATION_FILE_TIMEOUT setting (default: 600s) for file diarization
  - Replace hardcoded timeout=600 with configurable settings in modal processors
  - Allows customization of timeout values via environment variables
* fix: use logger
* fix: worker process meetings now use file pipeline
* fix: topic not gathered
* refactor: remove prepare(), pipeline now work
* refactor: implement many review from Igor
* test: add test for test_pipeline_main_file
* refactor: remove doc
* doc: add doc
* ci: update build to use native arm64 builder
* fix: merge fixes
* refactor: changes from Igor review + add test (not by default) to test gpu modal part
* ci: update to our own runner linux-amd64
* ci: try using suggested mode=min
* fix: update diarizer for latest modal, and use volume
* fix: modal file extension detection
* fix: put the diarizer as A100
2025-12-22 05:09:05 +00:00 · 2025-08-20 20:07:19 -06:00
parent 009590c080
commit 3ea7f6b7b6
37 changed files with 5086 additions and 198 deletions
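
Note on the README changes in this commit: the Parakeet `/v1/audio/transcriptions` endpoint is documented with two response shapes, a single object for one file and a `results` list when several files are sent with `batch=true`. A client may want to fold both into one list. The following is a minimal sketch of that idea, assuming the payloads look exactly like the documented examples; `normalize_transcription_response` is a hypothetical helper and is not part of the repository.

```python
from typing import Any


def normalize_transcription_response(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one entry per file for either documented response shape.

    Hypothetical client-side helper: it only assumes the keys shown in the
    README examples ("results", "filename", "text", "words").
    """
    # Batch responses wrap the per-file objects in a "results" list;
    # single-file responses are the per-file object itself.
    items = payload["results"] if "results" in payload else [payload]

    normalized = []
    for item in items:
        normalized.append(
            {
                # The from-url response example carries no filename, so allow None.
                "filename": item.get("filename"),
                "text": item.get("text", ""),
                "words": list(item.get("words", [])),
            }
        )
    return normalized


if __name__ == "__main__":
    single = {
        "text": "hello world",
        "words": [
            {"word": "hello", "start": 0.0, "end": 0.5},
            {"word": "world", "start": 0.5, "end": 1.0},
        ],
        "filename": "audio.mp3",
    }
    batch = {
        "results": [
            {"filename": "audio1.mp3", "text": "first", "words": []},
            {"filename": "audio2.mp3", "text": "second", "words": []},
        ]
    }
    for result in normalize_transcription_response(single) + normalize_transcription_response(batch):
        print(result["filename"], "->", result["text"])
```

Treating the single-file case as a one-element batch keeps downstream code on a single path; whether the real service always returns these exact keys is an assumption drawn from the README examples only.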
--- a/server/gpu/modal_deployments/README.md
+++ b/server/gpu/modal_deployments/README.md
@@ -4,7 +4,8 @@ This repository hold an API for the GPU implementation of the Reflector API serv
 and use [Modal.com](https://modal.com)

 - `reflector_diarizer.py` - Diarization API
- `reflector_transcriber.py` - Transcription API
+- `reflector_transcriber.py` - Transcription API (Whisper)
+- `reflector_transcriber_parakeet.py` - Transcription API (NVIDIA Parakeet)
 - `reflector_translator.py` - Translation API

 ## Modal.com deployment
@@ -19,6 +20,10 @@ $ modal deploy reflector_transcriber.py
 ...
 └── 🔨 Created web => https://xxxx--reflector-transcriber-web.modal.run

+$ modal deploy reflector_transcriber_parakeet.py
+...
+└── 🔨 Created web => https://xxxx--reflector-transcriber-parakeet-web.modal.run
+
 $ modal deploy reflector_llm.py
 ...
 └── 🔨 Created web => https://xxxx--reflector-llm-web.modal.run
@@ -68,6 +73,86 @@ Authorization: bearer <REFLECTOR_APIKEY>

 ### Transcription

+#### Parakeet Transcriber (`reflector_transcriber_parakeet.py`)
+
+NVIDIA Parakeet is a state-of-the-art ASR model optimized for real-time transcription with superior word-level timestamps.
+
+**GPU Configuration:**
+- **A10G GPU** - Used for `/v1/audio/transcriptions` endpoint (small files, live transcription)
+  - Higher concurrency (max_inputs=10)
+  - Optimized for multiple small audio files
+  - Supports batch processing for efficiency
+
+- **L40S GPU** - Used for `/v1/audio/transcriptions-from-url` endpoint (large files)
+  - Lower concurrency but more powerful processing
+  - Optimized for single large audio files
+  - VAD-based chunking for long-form audio
+
+##### `/v1/audio/transcriptions` - Small file transcription
+
+**request** (multipart/form-data)
+- `file` or `files[]` - audio file(s) to transcribe
+- `model` - model name (default: `nvidia/parakeet-tdt-0.6b-v2`)
+- `language` - language code (default: `en`)
+- `batch` - whether to use batch processing for multiple files (default: `true`)
+
+**response**
+```json
+{
+    "text": "transcribed text",
+    "words": [
+        {"word": "hello", "start": 0.0, "end": 0.5},
+        {"word": "world", "start": 0.5, "end": 1.0}
+    ],
+    "filename": "audio.mp3"
+}
+```
+
+For multiple files with batch=true:
+```json
+{
+    "results": [
+        {
+            "filename": "audio1.mp3",
+            "text": "transcribed text",
+            "words": [...]
+        },
+        {
+            "filename": "audio2.mp3",
+            "text": "transcribed text",
+            "words": [...]
+        }
+    ]
+}
+```
+
+##### `/v1/audio/transcriptions-from-url` - Large file transcription
+
+**request** (application/json)
+```json
+{
+    "audio_file_url": "https://example.com/audio.mp3",
+    "model": "nvidia/parakeet-tdt-0.6b-v2",
+    "language": "en",
+    "timestamp_offset": 0.0
+}
+```
+
+**response**
+```json
+{
+    "text": "transcribed text from large file",
+    "words": [
+        {"word": "hello", "start": 0.0, "end": 0.5},
+        {"word": "world", "start": 0.5, "end": 1.0}
+    ]
+}
+```
+
+**Supported file types:** mp3, mp4, mpeg, mpga, m4a, wav, webm
+
+#### Whisper Transcriber (`reflector_transcriber.py`)
+
 `POST /transcribe`

 **request** (multipart/form-data)