Compare commits


1 commit

6 changed files with 35 additions and 73 deletions

.gitignore

@@ -14,6 +14,4 @@ data/
 www/REFACTOR.md
 www/reload-frontend
 server/test.sqlite
-CLAUDE.local.md
-www/.env.development
-www/.env.production
+CLAUDE.local.md

.gitleaksignore

@@ -1 +0,0 @@
-b9d891d3424f371642cb032ecfd0e2564470a72c:server/tests/test_transcripts_recording_deletion.py:generic-api-key:15

.pre-commit-config.yaml

@@ -27,8 +27,3 @@ repos:
       files: ^server/
     - id: ruff-format
       files: ^server/
-  - repo: https://github.com/gitleaks/gitleaks
-    rev: v8.28.0
-    hooks:
-      - id: gitleaks

CHANGELOG.md

@@ -1,13 +1,5 @@
 # Changelog

-## [0.7.3](https://github.com/Monadical-SAS/reflector/compare/v0.7.2...v0.7.3) (2025-08-22)
-
-### Bug Fixes
-
-* cleaned repo, and get git-leaks clean ([359280d](https://github.com/Monadical-SAS/reflector/commit/359280dd340433ba4402ed69034094884c825e67))
-* restore previous behavior on live pipeline + audio downscaler ([#561](https://github.com/Monadical-SAS/reflector/issues/561)) ([9265d20](https://github.com/Monadical-SAS/reflector/commit/9265d201b590d23c628c5f19251b70f473859043))
-
 ## [0.7.2](https://github.com/Monadical-SAS/reflector/compare/v0.7.1...v0.7.2) (2025-08-21)

README.md

@@ -1,60 +1,43 @@
 <div align="center">
 <img width="100" alt="image" src="https://github.com/user-attachments/assets/66fb367b-2c89-4516-9912-f47ac59c6a7f"/>

 # Reflector

-Reflector is an AI-powered audio transcription and meeting analysis platform that provides real-time transcription, speaker diarization, translation and summarization for audio content and live meetings. It works 100% with local models (whisper/parakeet, pyannote, seamless-m4t, and your local llm like phi-4).
+Reflector Audio Management and Analysis is a cutting-edge web application under development by Monadical. It utilizes AI to record meetings, providing a permanent record with transcripts, translations, and automated summaries.

-[![Tests](https://github.com/monadical-sas/reflector/actions/workflows/test_server.yml/badge.svg?branch=main&event=push)](https://github.com/monadical-sas/reflector/actions/workflows/test_server.yml)
+[![Tests](https://github.com/monadical-sas/reflector/actions/workflows/pytests.yml/badge.svg?branch=main&event=push)](https://github.com/monadical-sas/reflector/actions/workflows/pytests.yml)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://opensource.org/licenses/MIT)
 </div>

 ## Screenshots

 <table>
 <tr>
 <td>
-<a href="https://github.com/user-attachments/assets/21f5597c-2930-4899-a154-f7bd61a59e97">
-<img width="700" alt="image" src="https://github.com/user-attachments/assets/21f5597c-2930-4899-a154-f7bd61a59e97" />
+<a href="https://github.com/user-attachments/assets/3a976930-56c1-47ef-8c76-55d3864309e3">
+<img width="700" alt="image" src="https://github.com/user-attachments/assets/3a976930-56c1-47ef-8c76-55d3864309e3" />
 </a>
 </td>
 <td>
-<a href="https://github.com/user-attachments/assets/f6b9399a-5e51-4bae-b807-59128d0a940c">
-<img width="700" alt="image" src="https://github.com/user-attachments/assets/f6b9399a-5e51-4bae-b807-59128d0a940c" />
+<a href="https://github.com/user-attachments/assets/bfe3bde3-08af-4426-a9a1-11ad5cd63b33">
+<img width="700" alt="image" src="https://github.com/user-attachments/assets/bfe3bde3-08af-4426-a9a1-11ad5cd63b33" />
 </a>
 </td>
 <td>
 <a href="https://github.com/user-attachments/assets/a42ce460-c1fd-4489-a995-270516193897">
 <img width="700" alt="image" src="https://github.com/user-attachments/assets/a42ce460-c1fd-4489-a995-270516193897" />
 </a>
 </td>
 <td>
-<a href="https://github.com/user-attachments/assets/21929f6d-c309-42fe-9c11-f1299e50fbd4">
-<img width="700" alt="image" src="https://github.com/user-attachments/assets/21929f6d-c309-42fe-9c11-f1299e50fbd4" />
+<a href="https://github.com/user-attachments/assets/7b60c9d0-efe4-474f-a27b-ea13bd0fabdc">
+<img width="700" alt="image" src="https://github.com/user-attachments/assets/7b60c9d0-efe4-474f-a27b-ea13bd0fabdc" />
 </a>
 </td>
 </tr>
 </table>

-## What is Reflector?
-
-Reflector is a web application that utilizes AI to process audio content, providing:
-
-- **Real-time Transcription**: Convert speech to text using [Whisper](https://github.com/openai/whisper) (multi-language) or [Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) (English) models
-- **Speaker Diarization**: Identify and label different speakers using [Pyannote](https://github.com/pyannote/pyannote-audio) 3.1
-- **Live Translation**: Translate audio content in real-time to many languages with [Facebook Seamless-M4T](https://github.com/facebookresearch/seamless_communication)
-- **Topic Detection & Summarization**: Extract key topics and generate concise summaries using LLMs
-- **Meeting Recording**: Create permanent records of meetings with searchable transcripts
-
-Currently we provide [modal.com](https://modal.com/) gpu template to deploy.

 ## Background

 The project architecture consists of three primary components:

-- **Back-End**: Python server that offers an API and data persistence, found in `server/`.
-- **Front-End**: NextJS React project hosted on Vercel, located in `www/`.
-- **GPU implementation**: Providing services such as speech-to-text transcription, topic generation, automated summaries, and translations.
+- **Back-End**: Python server that offers an API and data persistence, found in `server/`.
+- **GPU implementation**: Providing services such as speech-to-text transcription, topic generation, automated summaries, and translations. Most reliable option is Modal deployment

-It also uses authentik for authentication if activated.
+It also uses authentik for authentication if activated, and Vercel for deployment and configuration of the front-end.

 ## Contribution Guidelines

server/reflector/processors/audio_chunker_silero.py

@@ -11,7 +11,10 @@ from reflector.processors.audio_chunker_auto import AudioChunkerAutoProcessor
 class AudioChunkerSileroProcessor(AudioChunkerProcessor):
     """
-    Assemble audio frames into chunks with VAD-based speech detection using Silero VAD
+    Assemble audio frames into chunks with VAD-based speech detection using Silero VAD.
+
+    Expects input audio to be already downscaled to 16kHz mono s16 format
+    (handled by AudioDownscaleProcessor in the pipeline).
     """

     def __init__(
@@ -31,12 +34,13 @@ class AudioChunkerSileroProcessor(AudioChunkerProcessor):
         self._init_vad(use_onnx)

     def _init_vad(self, use_onnx=False):
-        """Initialize Silero VAD model"""
+        """Initialize Silero VAD model for 16kHz audio"""
         try:
             torch.set_num_threads(1)
             self.vad_model = load_silero_vad(onnx=use_onnx)
+            # VAD expects 16kHz audio (guaranteed by AudioDownscaleProcessor)
             self.vad_iterator = VADIterator(self.vad_model, sampling_rate=16000)
-            self.logger.info("Silero VAD initialized successfully")
+            self.logger.info("Silero VAD initialized for 16kHz audio")
         except Exception as e:
             self.logger.error(f"Failed to initialize Silero VAD: {e}")
@@ -75,7 +79,7 @@ class AudioChunkerSileroProcessor(AudioChunkerProcessor):
             return None

         # Processing block with current buffer size
-        print(f"Processing block: {len(self.frames)} frames in buffer")
+        # print(f"Processing block: {len(self.frames)} frames in buffer")

         try:
             # Convert frames to numpy array for VAD
@@ -189,38 +193,29 @@ class AudioChunkerSileroProcessor(AudioChunkerProcessor):
             return None

     def _frames_to_numpy(self, frames: list[av.AudioFrame]) -> Optional[np.ndarray]:
-        """Convert av.AudioFrame list to numpy array for VAD processing"""
+        """Convert av.AudioFrame list to numpy array for VAD processing
+
+        Input frames are already 16kHz mono s16 format from AudioDownscaleProcessor.
+        Only need to convert s16 to float32 for Silero VAD.
+        """
         if not frames:
             return None

         try:
-            audio_data = []
-            for frame in frames:
-                frame_array = frame.to_ndarray()
-                if len(frame_array.shape) == 2:
-                    frame_array = frame_array.flatten()
-                audio_data.append(frame_array)
-
-            if not audio_data:
+            # Concatenate all frame arrays
+            audio_arrays = [frame.to_ndarray().flatten() for frame in frames]
+            if not audio_arrays:
                 return None

-            combined_audio = np.concatenate(audio_data)
+            combined_audio = np.concatenate(audio_arrays)

-            # Ensure float32 format
-            if combined_audio.dtype == np.int16:
-                # Normalize int16 audio to float32 in range [-1.0, 1.0]
-                combined_audio = combined_audio.astype(np.float32) / 32768.0
-            elif combined_audio.dtype != np.float32:
-                combined_audio = combined_audio.astype(np.float32)
-
-            return combined_audio
+            # Convert s16 to float32 (Silero VAD requires float32 in range [-1.0, 1.0])
+            # Input is guaranteed to be s16 from AudioDownscaleProcessor
+            return combined_audio.astype(np.float32) / 32768.0
         except Exception as e:
             self.logger.error(f"Error converting frames to numpy: {e}")
             return None
-        return None

     def _find_speech_segment_end(self, audio_array: np.ndarray) -> Optional[int]:
         """Find complete speech segments and return frame index at segment end"""