
---
sidebar_position: 2
title: File Processing Pipeline
---

# File Processing Pipeline

The file processing pipeline handles uploaded audio files, optimizing for accuracy and throughput.

## Pipeline Stages

### 1. Input Stage

**Accepted Formats:**

- MP3 (most common)
- WAV (uncompressed)
- M4A (Apple format)
- WebM (browser recordings)
- MP4 (video with audio track)

**File Validation:**

- Sample rate: any (will be resampled to 16kHz; see the probe sketch below)
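
As a sketch of how an upload might be probed before acceptance, the container, codec, and sample rate can be read with `ffprobe` (the function name and allow-list are illustrative, not Reflector's actual code):

```python
import json
import subprocess

ACCEPTED_FORMATS = {"mp3", "wav", "m4a", "webm", "mp4"}  # illustrative allow-list

def probe_audio(path: str) -> dict:
    """Read codec, sample rate, and channel count for the first audio stream."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a:0",
         "-show_entries", "stream=codec_name,sample_rate,channels",
         "-show_entries", "format=format_name",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```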

### 2. Pre-processing

**Audio Normalization:**

```text
# Convert to standard format
- Sample rate: 16kHz (Whisper requirement)
- Channels: Mono
- Bit depth: 16-bit
- Format: WAV
```
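
This normalization maps directly onto a single ffmpeg invocation; a minimal sketch (the exact flags Reflector uses may differ):

```python
import subprocess

def normalize(src: str, dst: str = "normalized.wav") -> str:
    """Resample to 16kHz mono 16-bit PCM WAV, the format Whisper expects."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",       # sample rate: 16kHz
         "-ac", "1",           # channels: mono
         "-c:a", "pcm_s16le",  # bit depth: 16-bit PCM
         dst],
        check=True,
    )
    return dst
```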

**Noise Reduction (Optional):**

- Background noise removal
- Echo cancellation
- High-pass filter for rumble (see the sketch after this list)
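
For the rumble filter specifically, a hedged sketch using `scipy` (the 80 Hz cutoff is an assumption, not a documented Reflector setting):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def highpass(samples: np.ndarray, fs: int = 16000, cutoff: float = 80.0) -> np.ndarray:
    """Attenuate low-frequency rumble below `cutoff` Hz with a 4th-order Butterworth."""
    sos = butter(4, cutoff, btype="highpass", fs=fs, output="sos")
    return sosfilt(sos, samples)
```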

### 3. Chunking Strategy

Audio is split into segments for processing:

- Configurable chunk sizes
- Optional silence detection for natural breaks (sketched below)
- Parallel processing of chunks
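
A sketch of silence-based chunking with `pydub` (the thresholds are illustrative defaults, not Reflector's configuration):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("normalized.wav")

chunks = split_on_silence(
    audio,
    min_silence_len=700,             # a pause of 700 ms or more counts as a break
    silence_thresh=audio.dBFS - 16,  # "silence" relative to the clip's average loudness
    keep_silence=200,                # pad each chunk so words are not clipped
)

# Chunks are independent, so they can be transcribed in parallel.
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:04d}.wav", format="wav")
```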

### 4. Transcription Processing

Transcription uses OpenAI Whisper models via Modal.com or a self-hosted GPU, as sketched below:

- Automatic language detection
- Word-level timestamps
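
Both features are visible in a minimal sketch using the open-source `faster-whisper` package (the model size and device are assumptions; the deployed backend may differ):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")

# Passing no language triggers automatic language detection.
segments, info = model.transcribe("chunk_0000.wav", word_timestamps=True)
print(f"detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:6.2f} -> {word.end:6.2f}]{word.word}")
```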

### 5. Diarization (Speaker Identification)

Speaker diarization uses Pyannote 3.1:

1. **Voice Activity Detection (VAD)** - Identifies speech segments
2. **Speaker Embedding** - Extracts voice characteristics
3. **Clustering** - Groups similar voices
4. **Segmentation** - Assigns speaker labels to time segments
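
All four stages are bundled inside the pretrained pyannote pipeline, so invoking it is short; a sketch (the token handling is an assumption, since the model is gated on Hugging Face):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # gated model: requires a Hugging Face token
)

diarization = pipeline("normalized.wav")

# Each track is a (segment, track_id, speaker_label) triple.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:6.2f} -> {turn.end:6.2f}] {speaker}")
```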

### 6. Alignment & Merging

- Combines transcription with speaker diarization
- Maps speaker labels to transcript segments
- Resolves timing overlaps
- Validates timeline consistency
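
A hedged sketch of the core merge step: assign each transcript segment the speaker whose diarization turn overlaps it most (the data shapes are illustrative, not Reflector's internal types):

```python
def merge_speakers(segments, turns):
    """segments: [(start, end, text)]; turns: [(start, end, speaker)]."""
    merged = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        merged.append((seg_start, seg_end, best_speaker, text))
    return merged
```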

### 7. Post-processing Chain

- **Text Formatting**: Punctuation, capitalization
- **Topic Detection**: LLM-based topic extraction
- **Summarization**: AI-generated summaries and action items
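
The LLM steps reduce to prompting over the merged transcript; a sketch with the `openai` client (the model name and prompt are assumptions; any OpenAI-compatible endpoint can be substituted via `base_url`):

```python
from openai import OpenAI

client = OpenAI()

def summarize(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; swap in the configured model
        messages=[
            {"role": "system",
             "content": "Summarize this meeting transcript and list action items."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```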

### 8. Storage & Delivery

**File Storage:**

- Original audio: S3 (optional)
- Transcript exports: JSON, VTT, TXT (VTT export sketched below)
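
Of the export formats, VTT is the least obvious; a sketch that serializes merged segments as WebVTT, using voice tags for speakers (the input shape matches the merge sketch above):

```python
def to_vtt(merged) -> str:
    """merged: [(start, end, speaker, text)] -> WebVTT document string."""
    def timestamp(seconds: float) -> str:
        hours, rem = divmod(seconds, 3600)
        minutes, secs = divmod(rem, 60)
        return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

    lines = ["WEBVTT", ""]
    for start, end, speaker, text in merged:
        lines.append(f"{timestamp(start)} --> {timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")
    return "\n".join(lines)
```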

**Notifications:**

- WebSocket updates during processing
- Webhook notifications on completion (optional)
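
A minimal sketch of the completion webhook (the endpoint and payload shape are assumptions, not Reflector's documented contract):

```python
import httpx

def notify_completion(webhook_url: str, transcript_id: str) -> None:
    """POST a completion event to the subscriber's webhook URL."""
    httpx.post(
        webhook_url,
        json={"event": "transcript.completed", "transcript_id": transcript_id},
        timeout=10.0,
    )
```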