---
sidebar_position: 2
title: File Processing Pipeline
---
# File Processing Pipeline
The file processing pipeline handles uploaded audio files, optimizing for accuracy and throughput.
## Pipeline Stages

### 1. Input Stage

**Accepted Formats:**
- MP3 (most common)
- WAV (uncompressed)
- M4A (Apple format)
- WebM (browser recordings)
- MP4 (video with audio track)
**File Validation:**
- Sample rate: Any (will be resampled to 16kHz)
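A minimal sketch of how the input stage might gate uploads by extension. The names (`ACCEPTED_EXTENSIONS`, `validate_upload`) are illustrative, not Reflector's actual API; the format list comes from the table above.

```python
from pathlib import Path

# Extensions from the accepted-formats list above (illustrative constant)
ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".webm", ".mp4"}

def validate_upload(filename: str) -> None:
    """Reject files whose extension is not in the accepted list."""
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {ext or '(none)'}")

validate_upload("meeting.mp3")  # passes silently
```

Sample rate is deliberately not validated here, since anything is accepted and resampled later.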
### 2. Pre-processing

**Audio Normalization:** each file is converted to a standard format:
- Sample rate: 16kHz (Whisper requirement)
- Channels: Mono
- Bit depth: 16-bit
- Format: WAV
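One common way to perform this conversion is with ffmpeg; the doc does not name the exact tool, so treat this sketch as an assumption. The flags map directly to the parameters above.

```python
import subprocess

def normalize_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command producing 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", "16000",        # sample rate: 16 kHz (Whisper requirement)
        "-ac", "1",            # channels: mono
        "-c:a", "pcm_s16le",   # bit depth: 16-bit PCM
        dst,                   # .wav extension selects the WAV container
    ]

def normalize(src: str, dst: str) -> None:
    subprocess.run(normalize_cmd(src, dst), check=True)
```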
**Noise Reduction (Optional):**
- Background noise removal
- Echo cancellation
- High-pass filter for rumble
### 3. Chunking Strategy
Audio is split into segments for processing:
- Configurable chunk sizes
- Optional silence detection for natural breaks
- Parallel processing of chunks
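The silence-detection idea can be illustrated with a toy energy-based splitter over per-frame RMS values. The threshold and minimum-silence defaults are made-up; Reflector's real chunking is configurable, as noted above.

```python
def split_on_silence(rms: list[float], threshold: float = 0.01,
                     min_silence_frames: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges for non-silent chunks.

    A chunk ends once `min_silence_frames` consecutive frames fall
    below `threshold`; ranges are end-exclusive.
    """
    chunks, start, quiet = [], None, 0
    for i, level in enumerate(rms):
        if level >= threshold:
            if start is None:
                start = i          # chunk begins on first loud frame
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence_frames:
                chunks.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        chunks.append((start, len(rms) - quiet))  # trim trailing silence
    return chunks

# Two bursts of speech separated by three quiet frames → two chunks
split_on_silence([0.5, 0.5, 0.0, 0.0, 0.0, 0.5, 0.5])  # → [(0, 2), (5, 7)]
```

Because each resulting range is independent, the chunks can then be transcribed in parallel.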
### 4. Transcription Processing

Transcription uses OpenAI Whisper models via Modal.com or a self-hosted GPU:
- Automatic language detection
- Word-level timestamps
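Downstream stages consume the word-level timestamps; the shape below (a `Word` with start/end seconds, merged into sentence-like segments) is an illustrative model, not Reflector's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def words_to_segments(words: list[Word]) -> list[dict]:
    """Merge consecutive words into segments, breaking at ./?/!."""
    segments, cur = [], []
    for w in words:
        cur.append(w)
        if w.text.rstrip().endswith((".", "?", "!")):
            segments.append({"text": " ".join(x.text for x in cur),
                             "start": cur[0].start, "end": cur[-1].end})
            cur = []
    if cur:  # flush any trailing words without final punctuation
        segments.append({"text": " ".join(x.text for x in cur),
                         "start": cur[0].start, "end": cur[-1].end})
    return segments
```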
### 5. Diarization (Speaker Identification)
Speaker diarization uses Pyannote 3.1:
- Voice Activity Detection (VAD) - Identifies speech segments
- Speaker Embedding - Extracts voice characteristics
- Clustering - Groups similar voices
- Segmentation - Assigns speaker labels to time segments
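To make the clustering step concrete, here is a toy greedy clusterer: each speaker embedding joins the nearest existing cluster by cosine similarity, or opens a new one. Pyannote's actual clustering is more sophisticated; this only shows the idea, and the threshold is invented.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_embeddings(embs: list[list[float]],
                       threshold: float = 0.8) -> list[int]:
    """Return a speaker label for each embedding."""
    centroids: list[list[float]] = []
    labels = []
    for e in embs:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(e, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:          # no centroid close enough: new speaker
            centroids.append(list(e))
            labels.append(len(centroids) - 1)
        else:
            labels.append(best)
    return labels
```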
### 6. Alignment & Merging
- Combines transcription with speaker diarization
- Maps speaker labels to transcript segments
- Resolves timing overlaps
- Validates timeline consistency
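A minimal sketch of the mapping step, assuming each transcript segment gets the speaker whose diarization turn overlaps it most. Field names are illustrative.

```python
def overlap(a_start: float, a_end: float,
            b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """segments: [{'start','end',...}]; turns: [{'start','end','speaker'}]."""
    for seg in segments:
        best, best_ov = None, 0.0
        for t in turns:
            ov = overlap(seg["start"], seg["end"], t["start"], t["end"])
            if ov > best_ov:
                best, best_ov = t["speaker"], ov
        seg["speaker"] = best  # None if no turn overlaps at all
    return segments
```

Picking the maximum overlap is one simple way to resolve timing overlaps between the two timelines.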
### 7. Post-processing Chain
- Text Formatting: Punctuation, capitalization
- Topic Detection: LLM-based topic extraction
- Summarization: AI-generated summaries and action items
### 8. Storage & Delivery

**File Storage:**
- Original audio: S3 (optional)
- Transcript exports: JSON, VTT, TXT
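As an example of the export formats, a VTT serializer can be sketched in a few lines; the timestamp layout follows the WebVTT spec, while the segment shape here is illustrative.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm per WebVTT."""
    h = int(seconds // 3600)
    m = int(seconds % 3600 // 60)
    s = seconds % 60
    return f"{h:02d}:{m:02d}:{s:06.3f}"

def to_vtt(segments: list[dict]) -> str:
    """Render [{'start','end','text'}] segments as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> "
                     f"{vtt_timestamp(seg['end'])}")
        lines.append(seg["text"])
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```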
**Notifications:**
- WebSocket updates during processing
- Webhook notifications on completion (optional)