feat: pipeline improvement with file processing, parakeet, silero-vad (#540)

* feat: improve pipeline threading, and transcriber (parakeet and silero vad)
* refactor: remove whisperx, implement parakeet
* refactor: make audio_chunker more smart and wait for speech, instead of fixed frame
* refactor: make audio merge to always downscale the audio to 16k for transcription
* refactor: make the audio transcript modal accepting batches
* refactor: improve type safety and remove prometheus metrics
- Add DiarizationSegment TypedDict for proper diarization typing
- Replace List/Optional with modern Python list/| None syntax
- Remove all Prometheus metrics from TranscriptDiarizationAssemblerProcessor
- Add comprehensive file processing pipeline with parallel execution
- Update processor imports and type annotations throughout
- Implement optimized file pipeline as default in process.py tool
* refactor: convert FileDiarizationProcessor I/O types to BaseModel
Update FileDiarizationInput and FileDiarizationOutput to inherit from BaseModel instead of plain classes, following the standard pattern used by other processors in the codebase.
* test: add tests for file transcript and diarization with pytest-recording
* build: add pytest-recording
* feat: add local pyannote for testing
* fix: replace PyAV AudioResampler with torchaudio for reliable audio processing
- Replace problematic PyAV AudioResampler that was causing ValueError: [Errno 22] Invalid argument
- Use torchaudio.functional.resample for robust sample rate conversion
- Optimize processing: skip conversion for already 16kHz mono audio
- Add direct WAV writing with Python wave module for better performance
- Consolidate duplicate downsample checks for cleaner code
- Maintain list[av.AudioFrame] input interface
- Required for Silero VAD which needs 16kHz mono audio
* fix: replace PyAV AudioResampler with torchaudio solution
- Resolves ValueError: [Errno 22] Invalid argument in AudioMergeProcessor
- Replaces problematic PyAV AudioResampler with torchaudio.functional.resample
- Optimizes processing to skip unnecessary conversions when audio is already 16kHz mono
- Uses direct WAV writing with Python's wave module for better performance
- Fixes test_basic_process to disable diarization (pyannote dependency not installed)
- Updates test expectations to match actual processor behavior
- Removes unused pydub dependency from pyproject.toml
- Adds comprehensive TEST_ANALYSIS.md documenting test suite status
* feat: add parameterized test for both diarization modes
- Adds @pytest.mark.parametrize to test_basic_process with enable_diarization=[False, True]
- Test with diarization=False always passes (tests core AudioMergeProcessor functionality)
- Test with diarization=True gracefully skips when pyannote.audio is not installed
- Provides comprehensive test coverage for both pipeline configurations
* fix: resolve pipeline property naming conflict in AudioDiarizationPyannoteProcessor
- Renames 'pipeline' property to 'diarization_pipeline' to avoid conflict with base Processor.pipeline attribute
- Fixes AttributeError: 'property 'pipeline' object has no setter' when set_pipeline() is called
- Updates property usage in _diarize method to use new name
- Now correctly supports pipeline initialization for diarization processing
* fix: add local for pyannote
* test: add diarization test
* fix: resample on audio merge now working
* fix: correctly restore timestamp
* fix: display exception in a threaded processor if that happen
* Update pyproject.toml
* ci: remove option
* ci: update astral-sh/setup-uv
* test: add monadical url for pytest-recording
* refactor: remove previous version
* build: move faster whisper to local dep
* test: fix missing import
* refactor: improve main_file_pipeline organization and error handling
- Move all imports to the top of the file
- Create unified EmptyPipeline class to replace duplicate mock pipeline code
- Remove timeout and fallback logic - let processors handle their own retries
- Fix error handling to raise any exception from parallel tasks
- Add proper type hints and validation for captured results
* fix: wrong function
* fix: remove task_done
* feat: add configurable file processing timeouts for modal processors
- Add TRANSCRIPT_FILE_TIMEOUT setting (default: 600s) for file transcription
- Add DIARIZATION_FILE_TIMEOUT setting (default: 600s) for file diarization
- Replace hardcoded timeout=600 with configurable settings in modal processors
- Allows customization of timeout values via environment variables
* fix: use logger
* fix: worker process meetings now use file pipeline
* fix: topic not gathered
* refactor: remove prepare(), pipeline now work
* refactor: implement many review from Igor
* test: add test for test_pipeline_main_file
* refactor: remove doc
* doc: add doc
* ci: update build to use native arm64 builder
* fix: merge fixes
* refactor: changes from Igor review + add test (not by default) to test gpu modal part
* ci: update to our own runner linux-amd64
* ci: try using suggested mode=min
* fix: update diarizer for latest modal, and use volume
* fix: modal file extension detection
* fix: put the diarizer as A100
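
The property naming conflict described in the message can be reproduced in a few lines. The sketch below is not the project's code: `Processor`, `BrokenDiarizer`, `FixedDiarizer`, and `try_set_pipeline` are hypothetical stand-ins, and it assumes the base class stores the owning pipeline through a plain attribute assignment in `set_pipeline()`.

```python
class Processor:
    """Hypothetical stand-in for a base processor that stores its owning pipeline."""

    pipeline = None  # class-level default, overwritten per instance

    def set_pipeline(self, pipeline):
        # Plain attribute assignment: fails if a subclass shadows
        # `pipeline` with a read-only property.
        self.pipeline = pipeline


class BrokenDiarizer(Processor):
    @property
    def pipeline(self):
        # Read-only property with the same name as the base attribute.
        return "diarization model"


class FixedDiarizer(Processor):
    @property
    def diarization_pipeline(self):
        # Renamed property: no longer collides with Processor.pipeline.
        return "diarization model"


def try_set_pipeline(processor):
    """Return 'ok' or a short description of the AttributeError raised."""
    try:
        processor.set_pipeline("owning pipeline")
    except AttributeError as exc:
        return f"AttributeError: {exc}"
    return "ok"


broken_result = try_set_pipeline(BrokenDiarizer())
fixed_result = try_set_pipeline(FixedDiarizer())
```

A `property` without a setter is a data descriptor, so it intercepts instance assignment and raises `AttributeError`. Renaming the property leaves the base attribute free to be assigned.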
2025-12-24 06:09:07 +00:00 · 2025-08-20 20:07:19 -06:00
parent 009590c080
commit 3ea7f6b7b6
37 changed files with 5086 additions and 198 deletions
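
The commit message mentions writing 16 kHz mono WAV directly with Python's `wave` module. The following is a minimal, standard-library-only sketch of that idea, not the code from `AudioMergeProcessor`. The helper name `write_pcm16_mono_wav` is hypothetical, and it assumes the samples are already resampled floats in the range [-1.0, 1.0].

```python
import io
import math
import struct
import wave


def write_pcm16_mono_wav(target, samples, sample_rate=16000):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM WAV.

    `target` may be a filename or a writable binary file object.
    """
    # Clamp each sample, then pack it as little-endian signed 16-bit PCM.
    frames = b"".join(
        struct.pack("<h", max(-32768, min(32767, int(s * 32767))))
        for s in samples
    )
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Usage: 100 ms of a 440 Hz tone written to an in-memory buffer.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
buffer = io.BytesIO()
write_pcm16_mono_wav(buffer, tone)

# Read the header back to confirm the format.
with wave.open(io.BytesIO(buffer.getvalue()), "rb") as check:
    wav_info = (check.getnchannels(), check.getsampwidth(),
                check.getframerate(), check.getnframes())
```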
--- a/server/reflector/tools/process.py
+++ b/server/reflector/tools/process.py
@@ -1,10 +1,23 @@
+"""
+Process audio file with diarization support
+===========================================
+
+This version of process.py adds optional speaker diarization to the audio pipeline.
+This tool processes audio files locally without requiring the full server infrastructure.
+"""
+
 import asyncio
+import tempfile
+import uuid
+from pathlib import Path
+from typing import List

 import av

 from reflector.logger import logger
 from reflector.processors import (
    AudioChunkerProcessor,
+    AudioFileWriterProcessor,
    AudioMergeProcessor,
    AudioTranscriptAutoProcessor,
    Pipeline,
@@ -15,7 +28,43 @@ from reflector.processors import (
    TranscriptTopicDetectorProcessor,
    TranscriptTranslatorAutoProcessor,
 )
-from reflector.processors.base import BroadcastProcessor
+from reflector.processors.base import BroadcastProcessor, Processor
+from reflector.processors.types import (
+    AudioDiarizationInput,
+    TitleSummary,
+    TitleSummaryWithId,
+)
+
+
+class TopicCollectorProcessor(Processor):
+    """Collect topics for diarization"""
+
+    INPUT_TYPE = TitleSummary
+    OUTPUT_TYPE = TitleSummary
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.topics: List[TitleSummaryWithId] = []
+        self._topic_id = 0
+
+    async def _push(self, data: TitleSummary):
+        # Convert to TitleSummaryWithId and collect
+        self._topic_id += 1
+        topic_with_id = TitleSummaryWithId(
+            id=str(self._topic_id),
+            title=data.title,
+            summary=data.summary,
+            timestamp=data.timestamp,
+            duration=data.duration,
+            transcript=data.transcript,
+        )
+        self.topics.append(topic_with_id)
+
+        # Pass through the original topic
+        await self.emit(data)
+
+    def get_topics(self) -> List[TitleSummaryWithId]:
+        return self.topics


 async def process_audio_file(
@@ -24,18 +73,40 @@ async def process_audio_file(
    only_transcript=False,
    source_language="en",
    target_language="en",
+    enable_diarization=True,
+    diarization_backend="pyannote",
 ):
-    # build pipeline for audio processing
-    processors = [
+    # Create temp file for audio if diarization is enabled
+    audio_temp_path = None
+    if enable_diarization:
+        audio_temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+        audio_temp_path = audio_temp_file.name
+        audio_temp_file.close()
+
+    # Create processor for collecting topics
+    topic_collector = TopicCollectorProcessor()
+
+    # Build pipeline for audio processing
+    processors = []
+
+    # Add audio file writer at the beginning if diarization is enabled
+    if enable_diarization:
+        processors.append(AudioFileWriterProcessor(audio_temp_path))
+
+    # Add the rest of the processors
+    processors += [
        AudioChunkerProcessor(),
        AudioMergeProcessor(),
        AudioTranscriptAutoProcessor.as_threaded(),
        TranscriptLinerProcessor(),
        TranscriptTranslatorAutoProcessor.as_threaded(),
    ]
+
    if not only_transcript:
        processors += [
            TranscriptTopicDetectorProcessor.as_threaded(),
+            # Collect topics for diarization
+            topic_collector,
            BroadcastProcessor(
                processors=[
                    TranscriptFinalTitleProcessor.as_threaded(),
@@ -44,14 +115,14 @@ async def process_audio_file(
            ),
        ]

-    # transcription output
+    # Create main pipeline
    pipeline = Pipeline(*processors)
    pipeline.set_pref("audio:source_language", source_language)
    pipeline.set_pref("audio:target_language", target_language)
    pipeline.describe()
    pipeline.on(event_callback)

-    # start processing audio
+    # Start processing audio
    logger.info(f"Opening {filename}")
    container = av.open(filename)
    try:
@@ -62,43 +133,242 @@ async def process_audio_file(
        logger.info("Flushing the pipeline")
        await pipeline.flush()

-    logger.info("All done !")
+    # Run diarization if enabled and we have topics
+    if enable_diarization and not only_transcript and audio_temp_path:
+        topics = topic_collector.get_topics()
+
+        if topics:
+            logger.info(f"Starting diarization with {len(topics)} topics")
+
+            try:
+                from reflector.processors import AudioDiarizationAutoProcessor
+
+                diarization_processor = AudioDiarizationAutoProcessor(
+                    name=diarization_backend
+                )
+
+                diarization_processor.set_pipeline(pipeline)
+
+                # For Modal backend, we need to upload the file to S3 first
+                if diarization_backend == "modal":
+                    from datetime import datetime, timezone
+
+                    from reflector.storage import get_transcripts_storage
+                    from reflector.utils.s3_temp_file import S3TemporaryFile
+
+                    storage = get_transcripts_storage()
+
+                    # Generate a unique filename in evaluation folder
+                    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
+                    audio_filename = f"evaluation/diarization_temp/{timestamp}_{uuid.uuid4().hex}.wav"
+
+                    # Use context manager for automatic cleanup
+                    async with S3TemporaryFile(storage, audio_filename) as s3_file:
+                        # Read and upload the audio file
+                        with open(audio_temp_path, "rb") as f:
+                            audio_data = f.read()
+
+                        audio_url = await s3_file.upload(audio_data)
+                        logger.info(f"Uploaded audio to S3: {audio_filename}")
+
+                        # Create diarization input with S3 URL
+                        diarization_input = AudioDiarizationInput(
+                            audio_url=audio_url, topics=topics
+                        )
+
+                        # Run diarization
+                        await diarization_processor.push(diarization_input)
+                        await diarization_processor.flush()
+
+                        logger.info("Diarization complete")
+                        # File will be automatically cleaned up when exiting the context
+                else:
+                    # For local backend, use local file path
+                    audio_url = audio_temp_path
+
+                    # Create diarization input
+                    diarization_input = AudioDiarizationInput(
+                        audio_url=audio_url, topics=topics
+                    )
+
+                    # Run diarization
+                    await diarization_processor.push(diarization_input)
+                    await diarization_processor.flush()
+
+                    logger.info("Diarization complete")
+
+            except ImportError as e:
+                logger.error(f"Failed to import diarization dependencies: {e}")
+                logger.error(
+                    "Install with: uv pip install pyannote.audio torch torchaudio"
+                )
+                logger.error(
+                    "And set HF_TOKEN environment variable for pyannote models"
+                )
+                raise SystemExit(1)
+            except Exception as e:
+                logger.error(f"Diarization failed: {e}")
+                raise SystemExit(1)
+        else:
+            logger.warning("Skipping diarization: no topics available")
+
+    # Clean up temp file
+    if audio_temp_path:
+        try:
+            Path(audio_temp_path).unlink()
+        except Exception as e:
+            logger.warning(f"Failed to clean up temp file {audio_temp_path}: {e}")
+
+    logger.info("All done!")
+
+
+async def process_file_pipeline(
+    filename: str,
+    event_callback,
+    source_language="en",
+    target_language="en",
+    enable_diarization=True,
+    diarization_backend="modal",
+):
+    """Process audio/video file using the optimized file pipeline"""
+    try:
+        from reflector.db import database
+        from reflector.db.transcripts import SourceKind, transcripts_controller
+        from reflector.pipelines.main_file_pipeline import PipelineMainFile
+
+        await database.connect()
+        try:
+            # Create a temporary transcript for processing
+            transcript = await transcripts_controller.add(
+                "",
+                source_kind=SourceKind.FILE,
+                source_language=source_language,
+                target_language=target_language,
+            )
+
+            # Process the file
+            pipeline = PipelineMainFile(transcript_id=transcript.id)
+            await pipeline.process(Path(filename))
+
+            logger.info("File pipeline processing complete")
+
+        finally:
+            await database.disconnect()
+    except ImportError as e:
+        logger.error(f"File pipeline not available: {e}")
+        logger.info("Falling back to stream pipeline")
+        # Fall back to stream pipeline
+        await process_audio_file(
+            filename,
+            event_callback,
+            only_transcript=False,
+            source_language=source_language,
+            target_language=target_language,
+            enable_diarization=enable_diarization,
+            diarization_backend=diarization_backend,
+        )


 if __name__ == "__main__":
    import argparse
+    import os

-    parser = argparse.ArgumentParser()
+    parser = argparse.ArgumentParser(
+        description="Process audio files with optional speaker diarization"
+    )
    parser.add_argument("source", help="Source file (mp3, wav, mp4...)")
-    parser.add_argument("--only-transcript", "-t", action="store_true")
-    parser.add_argument("--source-language", default="en")
-    parser.add_argument("--target-language", default="en")
+    parser.add_argument(
+        "--stream",
+        action="store_true",
+        help="Use streaming pipeline (original frame-based processing)",
+    )
+    parser.add_argument(
+        "--only-transcript",
+        "-t",
+        action="store_true",
+        help="Only generate transcript without topics/summaries",
+    )
+    parser.add_argument(
+        "--source-language", default="en", help="Source language code (default: en)"
+    )
+    parser.add_argument(
+        "--target-language", default="en", help="Target language code (default: en)"
+    )
    parser.add_argument("--output", "-o", help="Output file (output.jsonl)")
+    parser.add_argument(
+        "--enable-diarization",
+        "-d",
+        action="store_true",
+        help="Enable speaker diarization",
+    )
+    parser.add_argument(
+        "--diarization-backend",
+        default="pyannote",
+        choices=["pyannote", "modal"],
+        help="Diarization backend to use (default: pyannote)",
+    )
    args = parser.parse_args()

+    if "REDIS_HOST" not in os.environ:
+        os.environ["REDIS_HOST"] = "localhost"
+
    output_fd = None
    if args.output:
        output_fd = open(args.output, "w")

    async def event_callback(event: PipelineEvent):
        processor = event.processor
-        # ignore some processor
-        if processor in ("AudioChunkerProcessor", "AudioMergeProcessor"):
+        data = event.data
+
+        # Ignore internal processors
+        if processor in (
+            "AudioChunkerProcessor",
+            "AudioMergeProcessor",
+            "AudioFileWriterProcessor",
+            "TopicCollectorProcessor",
+            "BroadcastProcessor",
+        ):
            return
-        logger.info(f"Event: {event}")
+
+        # If diarization is enabled, skip the original topic events from the pipeline
+        # The diarization processor will emit the same topics but with speaker info
+        if processor == "TranscriptTopicDetectorProcessor" and args.enable_diarization:
+            return
+
+        # Log all events
+        logger.info(f"Event: {processor} - {type(data).__name__}")
+
+        # Write to output
        if output_fd:
            output_fd.write(event.model_dump_json())
            output_fd.write("\n")
+            output_fd.flush()

-    asyncio.run(
-        process_audio_file(
-            args.source,
-            event_callback,
-            only_transcript=args.only_transcript,
-            source_language=args.source_language,
-            target_language=args.target_language,
+    if args.stream:
+        # Use original streaming pipeline
+        asyncio.run(
+            process_audio_file(
+                args.source,
+                event_callback,
+                only_transcript=args.only_transcript,
+                source_language=args.source_language,
+                target_language=args.target_language,
+                enable_diarization=args.enable_diarization,
+                diarization_backend=args.diarization_backend,
+            )
+        )
+    else:
+        # Use optimized file pipeline (default)
+        asyncio.run(
+            process_file_pipeline(
+                args.source,
+                event_callback,
+                source_language=args.source_language,
+                target_language=args.target_language,
+                enable_diarization=args.enable_diarization,
+                diarization_backend=args.diarization_backend,
+            )
        )
-    )

    if output_fd:
        output_fd.close()
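
The `TopicCollectorProcessor` in the diff above records each topic with a sequential id and forwards the original topic downstream unchanged. The sketch below illustrates that pass-through pattern in isolation. It does not use the `reflector` classes: `Topic`, `TopicWithId`, and `CollectingStage` are hypothetical stand-ins, and the downstream stage is assumed to be a plain async callable.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Topic:
    title: str
    summary: str


@dataclass
class TopicWithId:
    id: str
    title: str
    summary: str


class CollectingStage:
    """Record every topic with a sequential id, then forward it unchanged."""

    def __init__(self, downstream):
        self.collected = []
        self._next_id = 0
        self._downstream = downstream  # async callable receiving the original topic

    async def push(self, topic):
        self._next_id += 1
        self.collected.append(
            TopicWithId(id=str(self._next_id), title=topic.title, summary=topic.summary)
        )
        # Pass the original object through so later stages are unaffected.
        await self._downstream(topic)


async def demo():
    seen = []

    async def downstream(topic):
        seen.append(topic)

    stage = CollectingStage(downstream)
    for title in ("Intro", "Roadmap"):
        await stage.push(Topic(title=title, summary=f"About {title}"))
    return stage.collected, seen


collected, seen = asyncio.run(demo())
```

Keeping the collected copies separate from the forwarded objects means a later step, such as diarization, can consume the id-tagged list after the main pipeline has flushed.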