dailico track merge vibe
@@ -1,27 +1,27 @@
 # Daily.co Integration Test Plan
 
-## ⚠️ IMPORTANT: Stub Implementation
+## ✅ IMPLEMENTATION STATUS: Real Transcription Active
 
-**This test validates Daily.co webhook integration with MOCK transcription data.**
+**This test validates Daily.co multitrack recording integration with REAL transcription/diarization.**
 
-The actual audio/video files are recorded to S3, but transcription/diarization is NOT performed. Instead:
-- A **stub processor** generates fake transcript with predetermined text ("The Great Fish Eating Argument")
-- **Audio track is downloaded from Daily.co S3** to local storage for playback in the frontend
-- All database entities (recording, transcript, topics, participants, words) are created with **fake "fish" conversation data**
-- This allows testing the complete webhook → database flow WITHOUT expensive GPU processing
+The implementation includes complete audio processing pipeline:
+- **Multitrack recordings** from Daily.co S3 (separate audio stream per participant)
+- **PyAV-based audio mixdown** with PTS-based track alignment
+- **Real transcription** via Modal GPU backend (Whisper)
+- **Real diarization** via Modal GPU backend (speaker identification)
+- **Per-track transcription** with timestamp synchronization
+- **Complete database entities** (recording, transcript, topics, participants, words)
 
-**Expected transcript content:**
-- Title: "The Great Fish Eating Argument"
-- Participants: "Fish Eater" (speaker 0), "Annoying Person" (speaker 1)
-- Transcription: Nonsensical argument about eating fish (see `reflector/worker/daily_stub_data.py`)
-- Audio file: Downloaded WebM from Daily.co S3 (stored in `data/{transcript_id}/upload.webm`)
+**Processing pipeline** (`PipelineMainMultitrack`):
+1. Download all audio tracks from Daily.co S3
+2. Align tracks by PTS (presentation timestamp) to handle late joiners
+3. Mix tracks into single audio file for unified playback
+4. Transcribe each track individually with proper offset handling
+5. Perform diarization on mixed audio
+6. Generate topics, summaries, and word-level timestamps
+7. Convert audio to MP3 and generate waveform visualization
 
-**File processing pipeline** then:
-- Converts WebM to MP3 format (for frontend audio player)
-- Generates waveform visualization data (audio.json)
-- These files enable proper frontend transcript page display
-
-**Next implementation step:** Replace stub with real transcription pipeline (merge audio tracks, run Whisper/diarization).
+**Note:** A stub processor (`process_daily_recording`) exists for testing webhook flow without GPU costs, but the production code path uses `process_multitrack_recording` with full ML pipeline.
 
 ---
 
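Step 2 above (PTS alignment) reduces to reading each track's start offset from its container metadata. A minimal sketch with PyAV, the library the pipeline already uses (`probe_track_offset` is a hypothetical helper name, not part of the codebase):

```python
import io

import av  # PyAV


def probe_track_offset(track_data: bytes) -> float:
    """Return a track's start offset in seconds, read from stream metadata."""
    container = av.open(io.BytesIO(track_data))
    try:
        stream = container.streams.audio[0]
        if stream.start_time is None or stream.time_base is None:
            return 0.0
        # start_time is expressed in time_base units (a fractions.Fraction)
        return float(stream.start_time * stream.time_base)
    finally:
        container.close()
```

A late joiner's track reports a larger offset, so the mixdown delays that track accordingly and all words land on one shared recording timeline.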
@@ -29,6 +29,7 @@ The actual audio/video files are recorded to S3, but transcription/diarization i
 
 **1. Environment Variables** (check in `.env.development.local`):
 ```bash
+# Daily.co API Configuration
 DAILY_API_KEY=<key>
 DAILY_SUBDOMAIN=monadical
 DAILY_WEBHOOK_SECRET=<base64-encoded-secret>
@@ -37,25 +38,43 @@ AWS_DAILY_S3_REGION=us-east-1
 AWS_DAILY_ROLE_ARN=arn:aws:iam::950402358378:role/DailyCo
 DAILY_MIGRATION_ENABLED=true
 DAILY_MIGRATION_ROOM_IDS=["552640fd-16f2-4162-9526-8cf40cd2357e"]
+
+# Transcription/Diarization Backend (Required for real processing)
+DIARIZATION_BACKEND=modal
+DIARIZATION_MODAL_API_KEY=<modal-api-key>
+# TRANSCRIPTION_BACKEND is not explicitly set (uses default/modal)
 ```
 
 **2. Services Running:**
 ```bash
-docker-compose ps  # server, postgres, redis should be UP
+docker compose ps  # server, postgres, redis, worker, beat should be UP
+```
+
+**IMPORTANT:** Worker and beat services MUST be running for transcription processing:
+```bash
+docker compose up -d worker beat
 ```
 
 **3. ngrok Tunnel for Webhooks:**
 ```bash
-ngrok http 1250  # Note the URL (e.g., https://abc123.ngrok-free.app)
+# Start ngrok (if not already running)
+ngrok http 1250 --log=stdout > /tmp/ngrok.log 2>&1 &
+
+# Get public URL
+curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; data=json.load(sys.stdin); print(data['tunnels'][0]['public_url'])"
 ```
 
+**Current ngrok URL:** `https://0503947384a3.ngrok-free.app` (as of last registration)
+
 **4. Webhook Created:**
 ```bash
 cd server
-uv run python scripts/recreate_daily_webhook.py https://abc123.ngrok-free.app/v1/daily/webhook
+uv run python scripts/recreate_daily_webhook.py https://0503947384a3.ngrok-free.app/v1/daily/webhook
 # Verify: "Created webhook <uuid> (state: ACTIVE)"
 ```
 
+**Current webhook status:** ✅ ACTIVE (webhook ID: dad5ad16-ceca-488e-8fc5-dae8650b51d0)
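The URL lookup can also be done without the shell one-liner; a sketch using only the Python standard library against ngrok's local inspection API (same endpoint as above):

```python
import json
import urllib.request

# ngrok exposes its tunnels on a local inspection API (port 4040 by default)
with urllib.request.urlopen("http://localhost:4040/api/tunnels") as resp:
    tunnels = json.load(resp)["tunnels"]

public_url = tunnels[0]["public_url"]
print(f"{public_url}/v1/daily/webhook")  # the URL to register as the webhook
```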
 
 ---
 
 ## Test 1: Database Configuration
@@ -338,23 +357,25 @@ recorded_at: <recent-timestamp>
 
 **Check transcript created:**
 ```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
 "SELECT id, title, status, duration, recording_id, meeting_id, room_id
 FROM transcript
 ORDER BY created_at DESC LIMIT 1;"
 ```
 
-**Expected:**
+**Expected (REAL transcription):**
 ```
 id: <transcript-id>
-title: The Great Fish Eating Argument
-status: uploaded (audio file downloaded for playback)
-duration: ~200-300 seconds (depends on fish text parsing)
+title: <AI-generated title based on actual conversation content>
+status: uploaded (audio file processed and available)
+duration: <actual meeting duration in seconds>
 recording_id: <same-as-recording-id-above>
 meeting_id: <meeting-id>
 room_id: 552640fd-16f2-4162-9526-8cf40cd2357e
 ```
 
+**Note:** Title and content will reflect the ACTUAL conversation, not mock data. Processing time depends on recording length and GPU backend availability (Modal).
+
 **Verify audio file exists:**
 ```bash
 ls -lh data/<transcript-id>/upload.webm
@@ -365,12 +386,12 @@ ls -lh data/<transcript-id>/upload.webm
 -rw-r--r--  1 user  staff  ~100-200K Oct 10 18:48 upload.webm
 ```
 
-**Check transcript topics (stub data):**
+**Check transcript topics (REAL transcription):**
 ```bash
-TRANSCRIPT_ID=$(docker-compose exec -T postgres psql -U reflector -d reflector -t -c \
+TRANSCRIPT_ID=$(docker compose exec -T postgres psql -U reflector -d reflector -t -c \
 "SELECT id FROM transcript ORDER BY created_at DESC LIMIT 1;")
 
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
 "SELECT
 jsonb_array_length(topics) as num_topics,
 jsonb_array_length(participants) as num_participants,
@@ -380,55 +401,52 @@ docker compose exec -T postgres psql -U reflector -d reflector -c \
 WHERE id = '$TRANSCRIPT_ID';"
 ```
 
-**Expected:**
+**Expected (REAL data):**
 ```
-num_topics: 3
-num_participants: 2
-short_summary: Two people argue about eating fish
-title: The Great Fish Eating Argument
+num_topics: <varies based on conversation>
+num_participants: <actual number of participants who spoke>
+short_summary: <AI-generated summary of actual conversation>
+title: <AI-generated title based on content>
 ```
 
-**Check topics contain fish text:**
+**Check topics contain actual transcription:**
 ```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
 "SELECT topics->0->'title', topics->0->'summary', topics->0->'transcript'
 FROM transcript
 ORDER BY created_at DESC LIMIT 1;" | head -20
 ```
 
-**Expected output should contain:**
-```
-Fish Argument Part 1
-Argument about eating fish continues (part 1)
-Fish for dinner are nothing wrong with you? There's nothing...
-```
+**Expected output:** Will contain the ACTUAL transcribed conversation from the Daily.co meeting, not mock data.
 
 **Check participants:**
 ```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
 "SELECT participants FROM transcript ORDER BY created_at DESC LIMIT 1;" \
 | python3 -c "import sys, json; data=json.loads(sys.stdin.read()); print(json.dumps(data, indent=2))"
 ```
 
-**Expected:**
+**Expected (REAL diarization):**
 ```json
 [
   {
     "id": "<uuid>",
     "speaker": 0,
-    "name": "Fish Eater"
+    "name": "Speaker 1"
   },
   {
     "id": "<uuid>",
     "speaker": 1,
-    "name": "Annoying Person"
+    "name": "Speaker 2"
   }
 ]
 ```
 
+**Note:** Speaker names will be generic ("Speaker 1", "Speaker 2", etc.) as determined by the diarization backend. Number of participants depends on how many actually spoke during the meeting.
+
 **Check word-level data:**
 ```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
 "SELECT jsonb_array_length(topics->0->'words') as num_words_first_topic
 FROM transcript
 ORDER BY created_at DESC LIMIT 1;"
@@ -436,12 +454,12 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
 
 **Expected:**
 ```
-num_words_first_topic: ~100-150 (varies based on topic chunking)
+num_words_first_topic: <varies based on actual conversation length and topic chunking>
 ```
 
 **Verify speaker diarization in words:**
 ```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
 "SELECT
 topics->0->'words'->0->>'text' as first_word,
 topics->0->'words'->0->>'speaker' as speaker,
@@ -451,14 +469,16 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
 ORDER BY created_at DESC LIMIT 1;"
 ```
 
-**Expected:**
+**Expected (REAL transcription):**
 ```
-first_word: Fish
-speaker: 0 or 1 (depends on parsing)
-start_time: 0.0
-end_time: 0.35 (approximate)
+first_word: <actual first word from transcription>
+speaker: 0, 1, 2, ... (actual speaker ID from diarization)
+start_time: <actual timestamp in seconds>
+end_time: <actual end timestamp>
 ```
 
+**Note:** All timestamps and speaker IDs are from real transcription/diarization, synchronized across tracks.
+
 ---
 
 ## Test 8: Recording Type Verification
@@ -579,13 +599,15 @@ Recording: raw-tracks
 - [x] S3 path: `monadical/test2-{timestamp}/{recording-start-ts}-{participant-uuid}-cam-{audio|video}-{track-start-ts}`
 - [x] Database `num_clients` increments/decrements correctly
 - [x] **Database recording entry created** with correct S3 path and status `completed`
-- [x] **Database transcript entry created** with status `uploaded`
-- [x] **Audio file downloaded** to `data/{transcript_id}/upload.webm` (~100-200KB)
-- [x] **Transcript has stub data**: title "The Great Fish Eating Argument"
-- [x] **Transcript has 3 topics** about fish argument
-- [x] **Transcript has 2 participants**: "Fish Eater" (speaker 0) and "Annoying Person" (speaker 1)
-- [x] **Topics contain word-level data** with timestamps and speaker IDs
-- [x] **Total duration** ~200-300 seconds based on fish text parsing
-- [x] **MP3 and waveform files generated** by file processing pipeline
-- [x] **Frontend transcript page loads** without "Failed to load audio" error
-- [x] **Audio player functional** with working playback and waveform visualization
+- [ ] **Database transcript entry created** with status `uploaded`
+- [ ] **Audio file downloaded** to `data/{transcript_id}/upload.webm`
+- [ ] **Transcript has REAL data**: AI-generated title based on conversation
+- [ ] **Transcript has topics** generated from actual content
+- [ ] **Transcript has participants** with proper speaker diarization
+- [ ] **Topics contain word-level data** with accurate timestamps and speaker IDs
+- [ ] **Total duration** matches actual meeting length
+- [ ] **MP3 and waveform files generated** by file processing pipeline
+- [ ] **Frontend transcript page loads** without "Failed to load audio" error
+- [ ] **Audio player functional** with working playback and waveform visualization
+- [ ] **Multitrack processing completed** without errors in worker logs
+- [ ] **Modal GPU backends accessible** (transcription and diarization)
@@ -6,7 +6,7 @@ ENV PYTHONUNBUFFERED=1 \
 
 # builder install base dependencies
 WORKDIR /tmp
-RUN apt-get update && apt-get install -y curl && apt-get clean
+RUN apt-get update && apt-get install -y curl ffmpeg && apt-get clean
 ADD https://astral.sh/uv/install.sh /uv-installer.sh
 RUN sh /uv-installer.sh && rm /uv-installer.sh
 ENV PATH="/root/.local/bin/:$PATH"
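The `ffmpeg` package is added because the track-padding step in the pipeline changes below shells out to `ffmpeg` and `ffprobe`. A quick availability check (a sketch, not part of the codebase):

```python
import shutil

# The padding step invokes both binaries via subprocess, so both must be on PATH
for tool in ("ffmpeg", "ffprobe"):
    if shutil.which(tool) is None:
        raise RuntimeError(f"{tool} not found on PATH - is the ffmpeg package installed?")
```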
server/reflector/pipelines/MULTITRACK_FIX_SUMMARY.md (new file, 84 lines)
@@ -0,0 +1,84 @@
# Multitrack Pipeline Fix Summary

## Problem

Whisper timestamps were incorrect because it ignores leading silence in audio files. Daily.co tracks can have arbitrary amounts of silence before speech starts.

## Solution

**Pad tracks BEFORE transcription using stream metadata `start_time`**

This makes Whisper timestamps automatically correct relative to recording start.

## Key Changes in `main_multitrack_pipeline_fixed.py`

### 1. Added `pad_track_for_transcription()` method (lines 55-172)

```python
async def pad_track_for_transcription(
    self,
    track_data: bytes,
    track_idx: int,
    storage,
) -> tuple[bytes, str]:
```

- Extracts stream metadata `start_time` using PyAV
- Creates PyAV filter graph with `adelay` filter to add padding
- Stores padded track to S3 and returns URL
- Uses same audio processing library (PyAV) already in the pipeline

### 2. Modified `process()` method

#### REMOVED (lines 255-302):

- Entire filename parsing for offsets - NOT NEEDED ANYMORE
- The complex regex parsing of Daily.co filenames
- Offset adjustment after transcription

#### ADDED (lines 371-382):

- Padding step BEFORE transcription:

```python
# PAD TRACKS BEFORE TRANSCRIPTION - THIS IS THE KEY FIX!
padded_track_urls: list[str] = []
for idx, data in enumerate(track_datas):
    if not data:
        padded_track_urls.append("")
        continue

    _, padded_url = await self.pad_track_for_transcription(
        data, idx, storage
    )
    padded_track_urls.append(padded_url)
```

#### MODIFIED (lines 385-435):

- Transcribe PADDED tracks instead of raw tracks
- Removed all timestamp offset adjustment code
- Just set speaker ID - timestamps already correct!

```python
# NO OFFSET ADJUSTMENT NEEDED!
# Timestamps are already correct because we transcribed padded tracks
# Just set speaker ID
for w in t.words:
    w.speaker = idx
```

## Why This Works

1. **Stream metadata is authoritative**: Daily.co sets `start_time` in the WebM container
2. **PyAV respects metadata**: `audio_stream.start_time * audio_stream.time_base` gives seconds
3. **Padding before transcription**: Whisper sees continuous audio from time 0
4. **Automatic alignment**: Word at 51s in padded track = 51s in recording

## Testing

Process the test recording (daily-20251020193458) and verify:

- Participant 0 words appear at ~2s
- Participant 1 words appear at ~51s
- No word interleaving
- Correct chronological order

## Files

- **Original**: `main_multitrack_pipeline.py`
- **Fixed**: `main_multitrack_pipeline_fixed.py`
- **Test data**: `/Users/firfi/work/clients/monadical/reflector/1760988935484-*.webm`
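To make the timestamp arithmetic concrete, a small sketch using illustrative numbers consistent with the testing section above (a participant whose track metadata says speech starts 49s into the recording):

```python
import math

start_time_seconds = 49.0  # stream metadata start_time for the late joiner
delay_ms = math.floor(start_time_seconds * 1000)  # 49000 ms of silence prepended

# After padding, Whisper hears 49s of silence before speech, so a word spoken
# 2s into the participant's own audio is reported at 51s - already relative
# to recording start, with no post-hoc offset adjustment.
word_time_in_track = 2.0
assert start_time_seconds + word_time_in_track == 51.0
```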
server/reflector/pipelines/main_multitrack_pipeline.backup.py (new file, 510 lines)
@@ -0,0 +1,510 @@
import asyncio
import io
from fractions import Fraction

import av
import boto3
import structlog
from av.audio.resampler import AudioResampler
from celery import chain, shared_task

from reflector.asynctask import asynctask
from reflector.db.transcripts import (
    TranscriptStatus,
    TranscriptText,
    transcripts_controller,
)
from reflector.logger import logger
from reflector.pipelines.main_file_pipeline import task_send_webhook_if_needed
from reflector.pipelines.main_live_pipeline import (
    PipelineMainBase,
    task_cleanup_consent,
    task_pipeline_post_to_zulip,
)
from reflector.processors import (
    AudioFileWriterProcessor,
    TranscriptFinalSummaryProcessor,
    TranscriptFinalTitleProcessor,
    TranscriptTopicDetectorProcessor,
)
from reflector.processors.file_transcript import FileTranscriptInput
from reflector.processors.file_transcript_auto import FileTranscriptAutoProcessor
from reflector.processors.types import TitleSummary
from reflector.processors.types import (
    Transcript as TranscriptType,
)
from reflector.settings import settings
from reflector.storage import get_transcripts_storage


class EmptyPipeline:
    def __init__(self, logger: structlog.BoundLogger):
        self.logger = logger

    def get_pref(self, k, d=None):
        return d

    async def emit(self, event):
        pass


class PipelineMainMultitrack(PipelineMainBase):
    """Process multiple participant tracks for a transcript without mixing audio."""

    def __init__(self, transcript_id: str):
        super().__init__(transcript_id=transcript_id)
        self.logger = logger.bind(transcript_id=self.transcript_id)
        self.empty_pipeline = EmptyPipeline(logger=self.logger)

    async def mixdown_tracks(
        self,
        track_datas: list[bytes],
        writer: AudioFileWriterProcessor,
        offsets_seconds: list[float] | None = None,
    ) -> None:
        """
        Minimal multi-track mixdown using a PyAV filter graph (amix), no resampling.
        """

        # Discover target sample rate from first decodable frame
        target_sample_rate: int | None = None
        for data in track_datas:
            if not data:
                continue
            try:
                container = av.open(io.BytesIO(data))
                try:
                    for frame in container.decode(audio=0):
                        target_sample_rate = frame.sample_rate
                        break
                finally:
                    container.close()
            except Exception:
                continue
            if target_sample_rate:
                break

        if not target_sample_rate:
            self.logger.warning("Mixdown skipped - no decodable audio frames found")
            return

        # Build PyAV filter graph:
        # N abuffer (s32/stereo)
        #   -> optional adelay per input (for alignment)
        #   -> amix (s32)
        #   -> aformat(s16)
        #   -> sink
        graph = av.filter.Graph()
        inputs = []
        valid_track_datas = [d for d in track_datas if d]
        # Align offsets list with the filtered inputs (skip empties)
        input_offsets_seconds = None
        if offsets_seconds is not None:
            input_offsets_seconds = [
                offsets_seconds[i] for i, d in enumerate(track_datas) if d
            ]
        for idx, data in enumerate(valid_track_datas):
            args = (
                f"time_base=1/{target_sample_rate}:"
                f"sample_rate={target_sample_rate}:"
                f"sample_fmt=s32:"
                f"channel_layout=stereo"
            )
            in_ctx = graph.add("abuffer", args=args, name=f"in{idx}")
            inputs.append(in_ctx)

        if not inputs:
            self.logger.warning("Mixdown skipped - no valid inputs for graph")
            return

        mixer = graph.add("amix", args=f"inputs={len(inputs)}:normalize=0", name="mix")

        fmt = graph.add(
            "aformat",
            args=(
                f"sample_fmts=s32:channel_layouts=stereo:sample_rates={target_sample_rate}"
            ),
            name="fmt",
        )

        sink = graph.add("abuffersink", name="out")

        # Optional per-input delay before mixing
        delays_ms: list[int] = []
        if input_offsets_seconds is not None:
            base = min(input_offsets_seconds) if input_offsets_seconds else 0.0
            delays_ms = [
                max(0, int(round((o - base) * 1000))) for o in input_offsets_seconds
            ]
        else:
            delays_ms = [0 for _ in inputs]

        for idx, in_ctx in enumerate(inputs):
            delay_ms = delays_ms[idx] if idx < len(delays_ms) else 0
            if delay_ms > 0:
                # adelay requires one value per channel; use same for stereo
                adelay = graph.add(
                    "adelay",
                    args=f"delays={delay_ms}|{delay_ms}:all=1",
                    name=f"delay{idx}",
                )
                in_ctx.link_to(adelay)
                adelay.link_to(mixer, 0, idx)
            else:
                in_ctx.link_to(mixer, 0, idx)
        mixer.link_to(fmt)
        fmt.link_to(sink)
        graph.configure()

        # Open containers for decoding
        containers = []
        for i, d in enumerate(valid_track_datas):
            try:
                c = av.open(io.BytesIO(d))
                containers.append(c)
            except Exception as e:
                self.logger.warning(
                    "Mixdown: failed to open container", input=i, error=str(e)
                )
                containers.append(None)
        # Filter out Nones for decoders
        containers = [c for c in containers if c is not None]
        decoders = [c.decode(audio=0) for c in containers]
        active = [True] * len(decoders)
        # Per-input resamplers to enforce s32/stereo at the same rate (no resample of rate)
        resamplers = [
            AudioResampler(format="s32", layout="stereo", rate=target_sample_rate)
            for _ in decoders
        ]

        try:
            # Round-robin feed frames into graph, pull mixed frames as they become available
            while any(active):
                for i, (dec, is_active) in enumerate(zip(decoders, active)):
                    if not is_active:
                        continue
                    try:
                        frame = next(dec)
                    except StopIteration:
                        active[i] = False
                        continue

                    # Enforce same sample rate; convert format/layout to s16/stereo (no resample)
                    if frame.sample_rate != target_sample_rate:
                        # Skip frames with differing rate
                        continue
                    out_frames = resamplers[i].resample(frame) or []
                    for rf in out_frames:
                        rf.sample_rate = target_sample_rate
                        rf.time_base = Fraction(1, target_sample_rate)
                        inputs[i].push(rf)

                # Drain available mixed frames
                while True:
                    try:
                        mixed = sink.pull()
                    except Exception:
                        break
                    mixed.sample_rate = target_sample_rate
                    mixed.time_base = Fraction(1, target_sample_rate)
                    await writer.push(mixed)

            # Signal EOF to inputs and drain remaining
            for in_ctx in inputs:
                in_ctx.push(None)
            while True:
                try:
                    mixed = sink.pull()
                except Exception:
                    break
                mixed.sample_rate = target_sample_rate
                mixed.time_base = Fraction(1, target_sample_rate)
                await writer.push(mixed)
        finally:
            for c in containers:
                c.close()

    async def set_status(self, transcript_id: str, status: TranscriptStatus):
        async with self.lock_transaction():
            return await transcripts_controller.set_status(transcript_id, status)

    async def process(self, bucket_name: str, track_keys: list[str]):
        transcript = await self.get_transcript()

        s3 = boto3.client(
            "s3",
            region_name=settings.RECORDING_STORAGE_AWS_REGION,
            aws_access_key_id=settings.RECORDING_STORAGE_AWS_ACCESS_KEY_ID,
            aws_secret_access_key=settings.RECORDING_STORAGE_AWS_SECRET_ACCESS_KEY,
        )

        storage = get_transcripts_storage()

        # Pre-download bytes for all tracks for mixing and transcription
        track_datas: list[bytes] = []
        for key in track_keys:
            try:
                obj = s3.get_object(Bucket=bucket_name, Key=key)
                track_datas.append(obj["Body"].read())
            except Exception as e:
                self.logger.warning(
                    "Skipping track - cannot read S3 object", key=key, error=str(e)
                )
                track_datas.append(b"")

        # Extract offsets from Daily.co filename timestamps
        # Format: {rec_start_ts}-{uuid}-{media_type}-{track_start_ts}.{ext}
        # Example: 1760988935484-uuid-cam-audio-1760988935922
        import re

        offsets_seconds: list[float] = []
        recording_start_ts: int | None = None

        for key in track_keys:
            # Parse Daily.co raw-tracks filename pattern
            match = re.search(r"(\d+)-([0-9a-f-]{36})-(cam-audio)-(\d+)", key)
            if not match:
                self.logger.warning(
                    "Track key doesn't match Daily.co pattern, using 0.0 offset",
                    key=key,
                )
                offsets_seconds.append(0.0)
                continue

            rec_start_ts = int(match.group(1))
            track_start_ts = int(match.group(4))

            # Validate all tracks belong to same recording
            if recording_start_ts is None:
                recording_start_ts = rec_start_ts
            elif rec_start_ts != recording_start_ts:
                self.logger.error(
                    "Track belongs to different recording",
                    key=key,
                    expected_start=recording_start_ts,
                    got_start=rec_start_ts,
                )
                offsets_seconds.append(0.0)
                continue

            # Calculate offset in seconds
            offset_ms = track_start_ts - rec_start_ts
            offset_s = offset_ms / 1000.0

            self.logger.info(
                "Parsed track offset from filename",
                key=key,
                recording_start=rec_start_ts,
                track_start=track_start_ts,
                offset_seconds=offset_s,
            )

            offsets_seconds.append(max(0.0, offset_s))

        # Mixdown all available tracks into transcript.audio_mp3_filename, preserving sample rate
        try:
            mp3_writer = AudioFileWriterProcessor(
                path=str(transcript.audio_mp3_filename)
            )
            await self.mixdown_tracks(track_datas, mp3_writer, offsets_seconds)
            await mp3_writer.flush()
        except Exception as e:
            self.logger.error("Mixdown failed", error=str(e))

        speaker_transcripts: list[TranscriptType] = []
        for idx, key in enumerate(track_keys):
            ext = ".mp4"

            try:
                obj = s3.get_object(Bucket=bucket_name, Key=key)
                data = obj["Body"].read()
            except Exception as e:
                self.logger.error(
                    "Skipping track - cannot read S3 object", key=key, error=str(e)
                )
                continue

            storage_path = f"file_pipeline/{transcript.id}/tracks/track_{idx}{ext}"
            try:
                await storage.put_file(storage_path, data)
                audio_url = await storage.get_file_url(storage_path)
            except Exception as e:
                self.logger.error(
                    "Skipping track - cannot upload to storage", key=key, error=str(e)
                )
                continue

            try:
                t = await self.transcribe_file(audio_url, transcript.source_language)
            except Exception as e:
                self.logger.error(
                    "Transcription via default backend failed, trying local whisper",
                    key=key,
                    url=audio_url,
                    error=str(e),
                )
                try:
                    fallback = FileTranscriptAutoProcessor(name="whisper")
                    result = None

                    async def capture_result(r):
                        nonlocal result
                        result = r

                    fallback.on(capture_result)
                    await fallback.push(
                        FileTranscriptInput(
                            audio_url=audio_url, language=transcript.source_language
                        )
                    )
                    await fallback.flush()
                    if not result:
                        raise Exception("No transcript captured in fallback")
                    t = result
                except Exception as e2:
                    self.logger.error(
                        "Skipping track - transcription failed after fallback",
                        key=key,
                        url=audio_url,
                        error=str(e2),
                    )
                    continue

            if not t.words:
                continue
            # Shift word timestamps by the track's offset so all are relative to 00:00
            track_offset = offsets_seconds[idx] if idx < len(offsets_seconds) else 0.0
            for w in t.words:
                try:
                    if hasattr(w, "start") and w.start is not None:
                        w.start = float(w.start) + track_offset
                    if hasattr(w, "end") and w.end is not None:
                        w.end = float(w.end) + track_offset
                except Exception:
                    pass
                w.speaker = idx
            speaker_transcripts.append(t)

        if not speaker_transcripts:
            raise Exception("No valid track transcriptions")

        merged_words = []
        for t in speaker_transcripts:
            merged_words.extend(t.words)
        merged_words.sort(key=lambda w: w.start)

        merged_transcript = TranscriptType(words=merged_words, translation=None)

        await transcripts_controller.append_event(
            transcript,
            event="TRANSCRIPT",
            data=TranscriptText(
                text=merged_transcript.text, translation=merged_transcript.translation
            ),
        )

        topics = await self.detect_topics(merged_transcript, transcript.target_language)
        await asyncio.gather(
            self.generate_title(topics),
            self.generate_summaries(topics),
            return_exceptions=False,
        )

        await self.set_status(transcript.id, "ended")

    async def transcribe_file(self, audio_url: str, language: str) -> TranscriptType:
        processor = FileTranscriptAutoProcessor()
        input_data = FileTranscriptInput(audio_url=audio_url, language=language)

        result: TranscriptType | None = None

        async def capture_result(transcript):
            nonlocal result
            result = transcript

        processor.on(capture_result)
        await processor.push(input_data)
        await processor.flush()

        if not result:
            raise ValueError("No transcript captured")

        return result

    async def detect_topics(
        self, transcript: TranscriptType, target_language: str
    ) -> list[TitleSummary]:
        chunk_size = 300
        topics: list[TitleSummary] = []

        async def on_topic(topic: TitleSummary):
            topics.append(topic)
            return await self.on_topic(topic)

        topic_detector = TranscriptTopicDetectorProcessor(callback=on_topic)
        topic_detector.set_pipeline(self.empty_pipeline)

        for i in range(0, len(transcript.words), chunk_size):
            chunk_words = transcript.words[i : i + chunk_size]
            if not chunk_words:
                continue

            chunk_transcript = TranscriptType(
                words=chunk_words, translation=transcript.translation
            )
            await topic_detector.push(chunk_transcript)

        await topic_detector.flush()
        return topics

    async def generate_title(self, topics: list[TitleSummary]):
        if not topics:
            self.logger.warning("No topics for title generation")
            return

        processor = TranscriptFinalTitleProcessor(callback=self.on_title)
        processor.set_pipeline(self.empty_pipeline)

        for topic in topics:
            await processor.push(topic)

        await processor.flush()

    async def generate_summaries(self, topics: list[TitleSummary]):
        if not topics:
            self.logger.warning("No topics for summary generation")
            return

        transcript = await self.get_transcript()
        processor = TranscriptFinalSummaryProcessor(
            transcript=transcript,
            callback=self.on_long_summary,
            on_short_summary=self.on_short_summary,
        )
        processor.set_pipeline(self.empty_pipeline)

        for topic in topics:
            await processor.push(topic)

        await processor.flush()


@shared_task
@asynctask
async def task_pipeline_multitrack_process(
    *, transcript_id: str, bucket_name: str, track_keys: list[str]
):
    pipeline = PipelineMainMultitrack(transcript_id=transcript_id)
    try:
        await pipeline.set_status(transcript_id, "processing")
        await pipeline.process(bucket_name, track_keys)
    except Exception:
        await pipeline.set_status(transcript_id, "error")
        raise

    post_chain = chain(
        task_cleanup_consent.si(transcript_id=transcript_id),
        task_pipeline_post_to_zulip.si(transcript_id=transcript_id),
        task_send_webhook_if_needed.si(transcript_id=transcript_id),
    )
    post_chain.delay()
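For orientation, the entry point of this pipeline is the Celery task at the bottom of the file. Enqueueing it looks roughly like the sketch below (the import path assumes the active, non-backup module; the bucket name and track key are hypothetical placeholders following the Daily.co raw-tracks naming pattern):

```python
# Sketch only - assumes the task is importable from the active pipeline module
from reflector.pipelines.main_multitrack_pipeline import (
    task_pipeline_multitrack_process,
)

task_pipeline_multitrack_process.delay(
    transcript_id="<transcript-id>",
    bucket_name="<daily-recordings-bucket>",
    track_keys=[
        # {rec_start_ts}-{participant_uuid}-cam-audio-{track_start_ts}
        "monadical/test2-<ts>/1760988935484-<participant-uuid>-cam-audio-1760988935922",
    ],
)
```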
@@ -12,6 +12,7 @@ from reflector.asynctask import asynctask
 from reflector.db.transcripts import (
     TranscriptStatus,
     TranscriptText,
+    TranscriptWaveform,
     transcripts_controller,
 )
 from reflector.logger import logger
@@ -27,6 +28,7 @@ from reflector.processors import (
     TranscriptFinalTitleProcessor,
     TranscriptTopicDetectorProcessor,
 )
+from reflector.processors.audio_waveform_processor import AudioWaveformProcessor
 from reflector.processors.file_transcript import FileTranscriptInput
 from reflector.processors.file_transcript_auto import FileTranscriptAutoProcessor
 from reflector.processors.types import TitleSummary
@@ -56,6 +58,145 @@ class PipelineMainMultitrack(PipelineMainBase):
         self.logger = logger.bind(transcript_id=self.transcript_id)
         self.empty_pipeline = EmptyPipeline(logger=self.logger)
 
+    async def pad_track_for_transcription(
+        self,
+        track_data: bytes,
+        track_idx: int,
+        storage,
+    ) -> tuple[bytes, str]:
+        """
+        Pad a single track with silence based on stream metadata start_time.
+        This ensures Whisper timestamps will be relative to recording start.
+        Uses ffmpeg subprocess approach proven to work with python-raw-tracks-align.
+
+        Returns: (padded_data, storage_url)
+        """
+        import json
+        import math
+        import subprocess
+        import tempfile
+
+        if not track_data:
+            return b"", ""
+
+        transcript = await self.get_transcript()
+
+        # Create temp files for ffmpeg processing
+        with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as input_file:
+            input_file.write(track_data)
+            input_file_path = input_file.name
+
+        output_file_path = input_file_path.replace(".webm", "_padded.webm")
+
+        try:
+            # Get stream metadata using ffprobe
+            ffprobe_cmd = [
+                "ffprobe",
+                "-v",
+                "error",
+                "-show_entries",
+                "stream=start_time",
+                "-of",
+                "json",
+                input_file_path,
+            ]
+
+            result = subprocess.run(
+                ffprobe_cmd, capture_output=True, text=True, check=True
+            )
+            metadata = json.loads(result.stdout)
+
+            # Extract start_time from stream metadata
+            start_time_seconds = 0.0
+            if metadata.get("streams") and len(metadata["streams"]) > 0:
+                start_time_str = metadata["streams"][0].get("start_time", "0")
+                start_time_seconds = float(start_time_str)
+
+            self.logger.info(
+                f"Track {track_idx} stream metadata: start_time={start_time_seconds:.3f}s",
+                track_idx=track_idx,
+            )
+
+            # If no padding needed, use original
+            if start_time_seconds <= 0:
+                storage_path = f"file_pipeline/{transcript.id}/tracks/original_track_{track_idx}.webm"
+                await storage.put_file(storage_path, track_data)
+                url = await storage.get_file_url(storage_path)
+                return track_data, url
+
+            # Calculate delay in milliseconds
+            delay_ms = math.floor(start_time_seconds * 1000)
+
+            # Run ffmpeg to pad the audio while maintaining WebM/Opus format for Modal compatibility
+            # ffmpeg quirk: aresample needs to come before adelay in the filter chain
+            ffmpeg_cmd = [
+                "ffmpeg",
+                "-hide_banner",
+                "-loglevel",
+                "error",
+                "-y",  # overwrite output
+                "-i",
+                input_file_path,
+                "-af",
+                f"aresample=async=1,adelay={delay_ms}:all=true",
+                "-c:a",
+                "libopus",  # Keep Opus codec for Modal compatibility
+                "-b:a",
+                "128k",  # Standard bitrate for Opus
+                output_file_path,
+            ]
+
+            self.logger.info(
+                f"Padding track {track_idx} with {delay_ms}ms delay using ffmpeg",
+                track_idx=track_idx,
+                delay_ms=delay_ms,
+                command=" ".join(ffmpeg_cmd),
+            )
+
+            result = subprocess.run(ffmpeg_cmd, capture_output=True, text=True)
+            if result.returncode != 0:
+                self.logger.error(
+                    f"ffmpeg padding failed for track {track_idx}",
+                    track_idx=track_idx,
+                    stderr=result.stderr,
+                    returncode=result.returncode,
+                )
+                raise Exception(f"ffmpeg padding failed: {result.stderr}")
+
+            # Read the padded output
+            with open(output_file_path, "rb") as f:
+                padded_data = f.read()
+
+            # Store padded track
+            storage_path = (
+                f"file_pipeline/{transcript.id}/tracks/padded_track_{track_idx}.webm"
+            )
+            await storage.put_file(storage_path, padded_data)
+            padded_url = await storage.get_file_url(storage_path)
+
+            self.logger.info(
+                f"Successfully padded track {track_idx} with {start_time_seconds:.3f}s offset, stored at {storage_path}",
+                track_idx=track_idx,
+                delay_ms=delay_ms,
+                padded_url=padded_url,
+                padded_size=len(padded_data),
+            )
+
+            return padded_data, padded_url
+
+        finally:
+            # Clean up temp files
+            import os
+
+            try:
+                os.unlink(input_file_path)
+            except:
+                pass
+            try:
+                os.unlink(output_file_path)
+            except:
+                pass
+
     async def mixdown_tracks(
         self,
         track_datas: list[bytes],
@@ -228,6 +369,14 @@ class PipelineMainMultitrack(PipelineMainBase):
         async with self.lock_transaction():
             return await transcripts_controller.set_status(transcript_id, status)
 
+    async def on_waveform(self, data):
+        async with self.transaction():
+            waveform = TranscriptWaveform(waveform=data)
+            transcript = await self.get_transcript()
+            return await transcripts_controller.append_event(
+                transcript=transcript, event="WAVEFORM", data=waveform
+            )
+
     async def process(self, bucket_name: str, track_keys: list[str]):
         transcript = await self.get_transcript()
 
@@ -252,64 +401,90 @@ class PipelineMainMultitrack(PipelineMainBase):
                 )
                 track_datas.append(b"")
 
-        # Estimate offsets from first frame PTS, aligned to track_keys
-        offsets_seconds: list[float] = []
-        for data, key in zip(track_datas, track_keys):
-            off_s = 0.0
-            if data:
-                try:
-                    c = av.open(io.BytesIO(data))
-                    try:
-                        for frame in c.decode(audio=0):
-                            if frame.pts is not None and frame.time_base:
-                                off_s = float(frame.pts * frame.time_base)
-                            break
-                    finally:
-                        c.close()
-                except Exception:
-                    pass
-            offsets_seconds.append(max(0.0, float(off_s)))
+        # PAD TRACKS FIRST - this creates full-length tracks with correct timeline
+        padded_track_datas: list[bytes] = []
+        padded_track_urls: list[str] = []
+        for idx, data in enumerate(track_datas):
+            if not data:
+                padded_track_datas.append(b"")
+                padded_track_urls.append("")
+                continue
 
-        # Mixdown all available tracks into transcript.audio_mp3_filename, preserving sample rate
+            padded_data, padded_url = await self.pad_track_for_transcription(
+                data, idx, storage
+            )
+            padded_track_datas.append(padded_data)
+            padded_track_urls.append(padded_url)
+            self.logger.info(f"Padded track {idx} for transcription: {padded_url}")
+
+        # Mixdown PADDED tracks (already aligned with timeline) into transcript.audio_mp3_filename
         try:
+            # Ensure data directory exists
+            transcript.data_path.mkdir(parents=True, exist_ok=True)
+
             mp3_writer = AudioFileWriterProcessor(
                 path=str(transcript.audio_mp3_filename)
             )
-            await self.mixdown_tracks(track_datas, mp3_writer, offsets_seconds)
+            # Use PADDED tracks with NO additional offsets (already aligned by padding)
+            await self.mixdown_tracks(
+                padded_track_datas, mp3_writer, offsets_seconds=None
+            )
             await mp3_writer.flush()
-        except Exception as e:
-            self.logger.error("Mixdown failed", error=str(e))
+
+            # Upload the mixed audio to S3 for web playback
+            if transcript.audio_mp3_filename.exists():
+                mp3_data = transcript.audio_mp3_filename.read_bytes()
+                storage_path = f"{transcript.id}/audio.mp3"
+                await storage.put_file(storage_path, mp3_data)
+                mp3_url = await storage.get_file_url(storage_path)
+
+                # Update transcript to indicate audio is in storage
+                await transcripts_controller.update(
+                    transcript, {"audio_location": "storage"}
+                )
+
+                self.logger.info(
+                    f"Uploaded mixed audio to storage",
+                    storage_path=storage_path,
+                    size=len(mp3_data),
+                    url=mp3_url,
+                )
+            else:
+                self.logger.warning("Mixdown file does not exist after processing")
+        except Exception as e:
+            self.logger.error("Mixdown failed", error=str(e), exc_info=True)
+
+        # Generate waveform from the mixed audio file
+        if transcript.audio_mp3_filename.exists():
+            try:
+                self.logger.info("Generating waveform from mixed audio")
+                waveform_processor = AudioWaveformProcessor(
+                    audio_path=transcript.audio_mp3_filename,
+                    waveform_path=transcript.audio_waveform_filename,
+                    on_waveform=self.on_waveform,
+                )
+                waveform_processor.set_pipeline(self.empty_pipeline)
+                await waveform_processor.flush()
+                self.logger.info("Waveform generated successfully")
+            except Exception as e:
+                self.logger.error(
+                    "Waveform generation failed", error=str(e), exc_info=True
+                )
+
+        # Transcribe PADDED tracks - timestamps will be automatically correct!
         speaker_transcripts: list[TranscriptType] = []
-        for idx, key in enumerate(track_keys):
-            ext = ".mp4"
-
-            try:
-                obj = s3.get_object(Bucket=bucket_name, Key=key)
-                data = obj["Body"].read()
-            except Exception as e:
-                self.logger.error(
-                    "Skipping track - cannot read S3 object", key=key, error=str(e)
-                )
-                continue
-
-            storage_path = f"file_pipeline/{transcript.id}/tracks/track_{idx}{ext}"
-            try:
-                await storage.put_file(storage_path, data)
-                audio_url = await storage.get_file_url(storage_path)
-            except Exception as e:
-                self.logger.error(
-                    "Skipping track - cannot upload to storage", key=key, error=str(e)
-                )
-                continue
+        for idx, padded_url in enumerate(padded_track_urls):
+            if not padded_url:
+                continue
 
             try:
-                t = await self.transcribe_file(audio_url, transcript.source_language)
+                # Transcribe the PADDED track
+                t = await self.transcribe_file(padded_url, transcript.source_language)
             except Exception as e:
                 self.logger.error(
                     "Transcription via default backend failed, trying local whisper",
-                    key=key,
-                    url=audio_url,
+                    track_idx=idx,
+                    url=padded_url,
                     error=str(e),
                 )
                 try:
@@ -323,7 +498,7 @@ class PipelineMainMultitrack(PipelineMainBase):
                 fallback.on(capture_result)
                 await fallback.push(
                     FileTranscriptInput(
-                        audio_url=audio_url, language=transcript.source_language
+                        audio_url=padded_url, language=transcript.source_language
                     )
                 )
                 await fallback.flush()
@@ -333,34 +508,37 @@ class PipelineMainMultitrack(PipelineMainBase):
             except Exception as e2:
                 self.logger.error(
                     "Skipping track - transcription failed after fallback",
-                    key=key,
-                    url=audio_url,
+                    track_idx=idx,
+                    url=padded_url,
                     error=str(e2),
                 )
                 continue
 
             if not t.words:
                 continue
-            # Shift word timestamps by the track's offset so all are relative to 00:00
-            track_offset = offsets_seconds[idx] if idx < len(offsets_seconds) else 0.0
+
+            # NO OFFSET ADJUSTMENT NEEDED!
+            # Timestamps are already correct because we transcribed padded tracks
+            # Just set speaker ID
             for w in t.words:
-                try:
-                    if hasattr(w, "start") and w.start is not None:
-                        w.start = float(w.start) + track_offset
-                    if hasattr(w, "end") and w.end is not None:
-                        w.end = float(w.end) + track_offset
-                except Exception:
-                    pass
                 w.speaker = idx
+
             speaker_transcripts.append(t)
+            self.logger.info(
+                f"Track {idx} transcribed successfully with {len(t.words)} words",
+                track_idx=idx,
+            )
 
         if not speaker_transcripts:
             raise Exception("No valid track transcriptions")
 
+        # Merge all words and sort by timestamp
         merged_words = []
         for t in speaker_transcripts:
             merged_words.extend(t.words)
-        merged_words.sort(key=lambda w: w.start)
+        merged_words.sort(
+            key=lambda w: w.start if hasattr(w, "start") and w.start is not None else 0
+        )
 
         merged_transcript = TranscriptType(words=merged_words, translation=None)
 
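The None-safe sort key introduced above can be exercised in isolation; a minimal sketch with a stand-in word type (the real word objects come from `reflector.processors.types`):

```python
from dataclasses import dataclass


@dataclass
class Word:  # stand-in for the pipeline's word objects
    text: str
    start: float | None
    speaker: int


words = [
    Word("later", 51.0, 1),
    Word("first", 2.0, 0),
    Word("no-timestamp", None, 1),
]

# Words missing a start timestamp sort to the front instead of raising TypeError
words.sort(key=lambda w: w.start if w.start is not None else 0)
assert [w.text for w in words] == ["no-timestamp", "first", "later"]
```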
server/reflector/pipelines/main_multitrack_pipeline_fixed.py (new file, 629 lines)
@@ -0,0 +1,629 @@
import asyncio
import io
from fractions import Fraction

import av
import boto3
import structlog
from av.audio.resampler import AudioResampler
from celery import chain, shared_task

from reflector.asynctask import asynctask
from reflector.db.transcripts import (
    TranscriptStatus,
    TranscriptText,
    transcripts_controller,
)
from reflector.logger import logger
from reflector.pipelines.main_file_pipeline import task_send_webhook_if_needed
from reflector.pipelines.main_live_pipeline import (
    PipelineMainBase,
    task_cleanup_consent,
    task_pipeline_post_to_zulip,
)
from reflector.processors import (
    AudioFileWriterProcessor,
    TranscriptFinalSummaryProcessor,
    TranscriptFinalTitleProcessor,
    TranscriptTopicDetectorProcessor,
)
from reflector.processors.file_transcript import FileTranscriptInput
from reflector.processors.file_transcript_auto import FileTranscriptAutoProcessor
from reflector.processors.types import TitleSummary
from reflector.processors.types import (
    Transcript as TranscriptType,
)
from reflector.settings import settings
from reflector.storage import get_transcripts_storage


class EmptyPipeline:
    def __init__(self, logger: structlog.BoundLogger):
        self.logger = logger

    def get_pref(self, k, d=None):
        return d

    async def emit(self, event):
        pass


class PipelineMainMultitrack(PipelineMainBase):
    """Process multiple participant tracks: transcribe each track separately,
    mix them into a single audio file for playback."""

    def __init__(self, transcript_id: str):
        super().__init__(transcript_id=transcript_id)
        self.logger = logger.bind(transcript_id=self.transcript_id)
        self.empty_pipeline = EmptyPipeline(logger=self.logger)
    async def pad_track_for_transcription(
        self,
        track_data: bytes,
        track_idx: int,
        storage,
    ) -> tuple[bytes, str]:
        """
        Pad a single track with silence based on stream metadata start_time.
        This ensures Whisper timestamps will be relative to recording start.

        Returns: (padded_data, storage_url)
        """
        if not track_data:
            return b"", ""

        transcript = await self.get_transcript()

        # Get stream metadata start_time using PyAV
        container = av.open(io.BytesIO(track_data))
        try:
            audio_stream = container.streams.audio[0]

            # Extract start_time from stream metadata
            if (
                audio_stream.start_time is not None
                and audio_stream.time_base is not None
            ):
                start_time_seconds = float(
                    audio_stream.start_time * audio_stream.time_base
                )
            else:
                start_time_seconds = 0.0

            sample_rate = audio_stream.sample_rate
            codec_name = audio_stream.codec.name
        finally:
            container.close()

        self.logger.info(
            f"Track {track_idx} stream metadata: start_time={start_time_seconds:.3f}s, sample_rate={sample_rate}",
            track_idx=track_idx,
        )

        # If no padding needed, use original
        if start_time_seconds <= 0:
            storage_path = (
                f"file_pipeline/{transcript.id}/tracks/original_track_{track_idx}.webm"
            )
            await storage.put_file(storage_path, track_data)
            url = await storage.get_file_url(storage_path)
            return track_data, url

        # Create PyAV filter graph for padding
        graph = av.filter.Graph()

        # Input buffer
        in_args = (
            f"time_base=1/{sample_rate}:"
            f"sample_rate={sample_rate}:"
            f"sample_fmt=s16:"
            f"channel_layout=stereo"
        )
        input_buffer = graph.add("abuffer", args=in_args, name="in")

        # Add delay filter for padding
        delay_ms = int(start_time_seconds * 1000)
        delay_filter = graph.add(
            "adelay", args=f"delays={delay_ms}|{delay_ms}:all=1", name="delay"
        )

        # Output sink
        sink = graph.add("abuffersink", name="out")

        # Link filters
        input_buffer.link_to(delay_filter)
        delay_filter.link_to(sink)

        graph.configure()

        # Process audio through filter
        output_bytes = io.BytesIO()
        output_container = av.open(output_bytes, "w", format="webm")
        output_stream = output_container.add_stream("libopus", rate=sample_rate)
        output_stream.channels = 2

        # Reopen input for processing
        input_container = av.open(io.BytesIO(track_data))
        resampler = AudioResampler(format="s16", layout="stereo", rate=sample_rate)

        try:
            # Process frames
            for frame in input_container.decode(audio=0):
                # Resample to match filter requirements
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    resampled_frame.pts = frame.pts
                    resampled_frame.time_base = Fraction(1, sample_rate)
                    input_buffer.push(resampled_frame)

                # Pull from filter and encode
                while True:
                    try:
                        out_frame = sink.pull()
                        out_frame.pts = out_frame.pts if out_frame.pts else 0
                        out_frame.time_base = Fraction(1, sample_rate)
                        for packet in output_stream.encode(out_frame):
                            output_container.mux(packet)
                    except av.BlockingIOError:
                        break

            # Flush
            input_buffer.push(None)
            while True:
                try:
                    out_frame = sink.pull()
                    for packet in output_stream.encode(out_frame):
                        output_container.mux(packet)
                except (av.BlockingIOError, av.EOFError):
                    break

            # Flush encoder
            for packet in output_stream.encode(None):
                output_container.mux(packet)

        finally:
            input_container.close()
            output_container.close()

        padded_data = output_bytes.getvalue()

        # Store padded track
        storage_path = (
            f"file_pipeline/{transcript.id}/tracks/padded_track_{track_idx}.webm"
        )
        await storage.put_file(storage_path, padded_data)
        padded_url = await storage.get_file_url(storage_path)

        self.logger.info(
            f"Padded track {track_idx} with {start_time_seconds:.3f}s offset, stored at {storage_path}",
            track_idx=track_idx,
            delay_ms=delay_ms,
            padded_url=padded_url,
        )

        return padded_data, padded_url
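Editor's note: the method above derives the silence duration from container metadata rather than from filenames. A standalone sketch of that metadata read, assuming a local `track.webm` placeholder file with one audio stream:

```python
import io

import av

# Hypothetical input: raw bytes of one participant's WebM track.
track_data = open("track.webm", "rb").read()  # "track.webm" is a placeholder path

container = av.open(io.BytesIO(track_data))
try:
    stream = container.streams.audio[0]
    # start_time is expressed in time_base units; multiplying converts it to
    # seconds elapsed between recording start and this track's first sample.
    if stream.start_time is not None and stream.time_base is not None:
        start_seconds = float(stream.start_time * stream.time_base)
    else:
        start_seconds = 0.0
    print(f"track starts {start_seconds:.3f}s into the recording")
finally:
    container.close()
```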
    async def mixdown_tracks(
        self,
        track_datas: list[bytes],
        writer: AudioFileWriterProcessor,
        offsets_seconds: list[float] | None = None,
    ) -> None:
        """
        Minimal multi-track mixdown using a PyAV filter graph (amix), no resampling.
        """

        # Discover target sample rate from first decodable frame
        target_sample_rate: int | None = None
        for data in track_datas:
            if not data:
                continue
            try:
                container = av.open(io.BytesIO(data))
                try:
                    for frame in container.decode(audio=0):
                        target_sample_rate = frame.sample_rate
                        break
                finally:
                    container.close()
            except Exception:
                continue
            if target_sample_rate:
                break

        if not target_sample_rate:
            self.logger.warning("Mixdown skipped - no decodable audio frames found")
            return

        # Build PyAV filter graph:
        #   N abuffer (s32/stereo)
        #   -> optional adelay per input (for alignment)
        #   -> amix (s32)
        #   -> aformat(s16)
        #   -> sink
        graph = av.filter.Graph()
        inputs = []
        valid_track_datas = [d for d in track_datas if d]
        # Align offsets list with the filtered inputs (skip empties)
        input_offsets_seconds = None
        if offsets_seconds is not None:
            input_offsets_seconds = [
                offsets_seconds[i] for i, d in enumerate(track_datas) if d
            ]
        for idx, data in enumerate(valid_track_datas):
            args = (
                f"time_base=1/{target_sample_rate}:"
                f"sample_rate={target_sample_rate}:"
                f"sample_fmt=s32:"
                f"channel_layout=stereo"
            )
            in_ctx = graph.add("abuffer", args=args, name=f"in{idx}")
            inputs.append(in_ctx)

        if not inputs:
            self.logger.warning("Mixdown skipped - no valid inputs for graph")
            return

        mixer = graph.add("amix", args=f"inputs={len(inputs)}:normalize=0", name="mix")

        fmt = graph.add(
            "aformat",
            args=(
                f"sample_fmts=s32:channel_layouts=stereo:sample_rates={target_sample_rate}"
            ),
            name="fmt",
        )

        sink = graph.add("abuffersink", name="out")

        # Optional per-input delay before mixing
        delays_ms: list[int] = []
        if input_offsets_seconds is not None:
            base = min(input_offsets_seconds) if input_offsets_seconds else 0.0
            delays_ms = [
                max(0, int(round((o - base) * 1000))) for o in input_offsets_seconds
            ]
        else:
            delays_ms = [0 for _ in inputs]

        for idx, in_ctx in enumerate(inputs):
            delay_ms = delays_ms[idx] if idx < len(delays_ms) else 0
            if delay_ms > 0:
                # adelay requires one value per channel; use same for stereo
                adelay = graph.add(
                    "adelay",
                    args=f"delays={delay_ms}|{delay_ms}:all=1",
                    name=f"delay{idx}",
                )
                in_ctx.link_to(adelay)
                adelay.link_to(mixer, 0, idx)
            else:
                in_ctx.link_to(mixer, 0, idx)
        mixer.link_to(fmt)
        fmt.link_to(sink)
        graph.configure()

        # Open containers for decoding
        containers = []
        for i, d in enumerate(valid_track_datas):
            try:
                c = av.open(io.BytesIO(d))
                containers.append(c)
            except Exception as e:
                self.logger.warning(
                    "Mixdown: failed to open container", input=i, error=str(e)
                )
                containers.append(None)
        # Filter out Nones for decoders
        containers = [c for c in containers if c is not None]
        decoders = [c.decode(audio=0) for c in containers]
        active = [True] * len(decoders)
        # Per-input resamplers to enforce s32/stereo at the same rate (no resample of rate)
        resamplers = [
            AudioResampler(format="s32", layout="stereo", rate=target_sample_rate)
            for _ in decoders
        ]

        try:
            # Round-robin feed frames into graph, pull mixed frames as they become available
            while any(active):
                for i, (dec, is_active) in enumerate(zip(decoders, active)):
                    if not is_active:
                        continue
                    try:
                        frame = next(dec)
                    except StopIteration:
                        active[i] = False
                        continue

                    # Enforce same sample rate; convert format/layout to s32/stereo (no resample)
                    if frame.sample_rate != target_sample_rate:
                        # Skip frames with differing rate
                        continue
                    out_frames = resamplers[i].resample(frame) or []
                    for rf in out_frames:
                        rf.sample_rate = target_sample_rate
                        rf.time_base = Fraction(1, target_sample_rate)
                        inputs[i].push(rf)

                # Drain available mixed frames
                while True:
                    try:
                        mixed = sink.pull()
                    except Exception:
                        break
                    mixed.sample_rate = target_sample_rate
                    mixed.time_base = Fraction(1, target_sample_rate)
                    await writer.push(mixed)

            # Signal EOF to inputs and drain remaining
            for in_ctx in inputs:
                in_ctx.push(None)
            while True:
                try:
                    mixed = sink.pull()
                except Exception:
                    break
                mixed.sample_rate = target_sample_rate
                mixed.time_base = Fraction(1, target_sample_rate)
                await writer.push(mixed)
        finally:
            for c in containers:
                c.close()
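Editor's note: the per-input alignment fed to `adelay` above reduces to a small calculation: each delay is measured against the earliest-starting track, so the first speaker begins at 0 ms. A self-contained sketch with hypothetical offsets:

```python
# Hypothetical stream start times (seconds) read from two tracks' metadata
offsets_seconds = [2.1, 51.4]

# Delays are relative to the earliest track, so the first speaker starts at 0 ms
base = min(offsets_seconds)
delays_ms = [max(0, int(round((o - base) * 1000))) for o in offsets_seconds]

assert delays_ms == [0, 49300]  # second track is held back ~49.3 s before amix
```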
    async def set_status(self, transcript_id: str, status: TranscriptStatus):
        async with self.lock_transaction():
            return await transcripts_controller.set_status(transcript_id, status)

    async def process(self, bucket_name: str, track_keys: list[str]):
        transcript = await self.get_transcript()

        s3 = boto3.client(
            "s3",
            region_name=settings.RECORDING_STORAGE_AWS_REGION,
            aws_access_key_id=settings.RECORDING_STORAGE_AWS_ACCESS_KEY_ID,
            aws_secret_access_key=settings.RECORDING_STORAGE_AWS_SECRET_ACCESS_KEY,
        )

        storage = get_transcripts_storage()

        # Pre-download bytes for all tracks for mixing and transcription
        track_datas: list[bytes] = []
        for key in track_keys:
            try:
                obj = s3.get_object(Bucket=bucket_name, Key=key)
                track_datas.append(obj["Body"].read())
            except Exception as e:
                self.logger.warning(
                    "Skipping track - cannot read S3 object", key=key, error=str(e)
                )
                track_datas.append(b"")

        # REMOVED: Filename offset extraction - not needed anymore!
        # We use stream metadata start_time for padding instead

        # Get stream metadata start_times for mixing (still useful for mixdown)
        stream_start_times: list[float] = []
        for data in track_datas:
            if not data:
                stream_start_times.append(0.0)
                continue

            container = av.open(io.BytesIO(data))
            try:
                audio_stream = container.streams.audio[0]
                if (
                    audio_stream.start_time is not None
                    and audio_stream.time_base is not None
                ):
                    start_time = float(audio_stream.start_time * audio_stream.time_base)
                else:
                    start_time = 0.0
                stream_start_times.append(start_time)
            finally:
                container.close()

        # Mixdown all available tracks into transcript.audio_mp3_filename, using stream metadata offsets
        try:
            mp3_writer = AudioFileWriterProcessor(
                path=str(transcript.audio_mp3_filename)
            )
            await self.mixdown_tracks(track_datas, mp3_writer, stream_start_times)
            await mp3_writer.flush()
        except Exception as e:
            self.logger.error("Mixdown failed", error=str(e))

        # PAD TRACKS BEFORE TRANSCRIPTION - THIS IS THE KEY FIX!
        padded_track_urls: list[str] = []
        for idx, data in enumerate(track_datas):
            if not data:
                padded_track_urls.append("")
                continue

            _, padded_url = await self.pad_track_for_transcription(data, idx, storage)
            padded_track_urls.append(padded_url)
            self.logger.info(f"Padded track {idx} for transcription: {padded_url}")

        # Transcribe PADDED tracks - timestamps will be automatically correct!
        speaker_transcripts: list[TranscriptType] = []
        for idx, padded_url in enumerate(padded_track_urls):
            if not padded_url:
                continue

            try:
                # Transcribe the PADDED track
                t = await self.transcribe_file(padded_url, transcript.source_language)
            except Exception as e:
                self.logger.error(
                    "Transcription via default backend failed, trying local whisper",
                    track_idx=idx,
                    url=padded_url,
                    error=str(e),
                )
                try:
                    fallback = FileTranscriptAutoProcessor(name="whisper")
                    result = None

                    async def capture_result(r):
                        nonlocal result
                        result = r

                    fallback.on(capture_result)
                    await fallback.push(
                        FileTranscriptInput(
                            audio_url=padded_url, language=transcript.source_language
                        )
                    )
                    await fallback.flush()
                    if not result:
                        raise Exception("No transcript captured in fallback")
                    t = result
                except Exception as e2:
                    self.logger.error(
                        "Skipping track - transcription failed after fallback",
                        track_idx=idx,
                        url=padded_url,
                        error=str(e2),
                    )
                    continue

            if not t.words:
                continue

            # NO OFFSET ADJUSTMENT NEEDED!
            # Timestamps are already correct because we transcribed padded tracks
            # Just set speaker ID
            for w in t.words:
                w.speaker = idx

            speaker_transcripts.append(t)
            self.logger.info(
                f"Track {idx} transcribed successfully with {len(t.words)} words",
                track_idx=idx,
            )

        if not speaker_transcripts:
            raise Exception("No valid track transcriptions")

        # Merge all words and sort by timestamp
        merged_words = []
        for t in speaker_transcripts:
            merged_words.extend(t.words)
        merged_words.sort(
            key=lambda w: w.start if hasattr(w, "start") and w.start is not None else 0
        )

        merged_transcript = TranscriptType(words=merged_words, translation=None)

        await transcripts_controller.append_event(
            transcript,
            event="TRANSCRIPT",
            data=TranscriptText(
                text=merged_transcript.text, translation=merged_transcript.translation
            ),
        )

        topics = await self.detect_topics(merged_transcript, transcript.target_language)
        await asyncio.gather(
            self.generate_title(topics),
            self.generate_summaries(topics),
            return_exceptions=False,
        )

        await self.set_status(transcript.id, "ended")
    async def transcribe_file(self, audio_url: str, language: str) -> TranscriptType:
        processor = FileTranscriptAutoProcessor()
        input_data = FileTranscriptInput(audio_url=audio_url, language=language)

        result: TranscriptType | None = None

        async def capture_result(transcript):
            nonlocal result
            result = transcript

        processor.on(capture_result)
        await processor.push(input_data)
        await processor.flush()

        if not result:
            raise ValueError("No transcript captured")

        return result
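Editor's note: `transcribe_file` (and the fallback path in `process` above) follows the same push/flush/callback-capture pattern. A minimal stand-in, with `TinyProcessor` entirely hypothetical, showing the shape of that pattern:

```python
import asyncio


class TinyProcessor:
    """Hypothetical processor: doubles its input and reports via callback."""

    def __init__(self):
        self._cb = None

    def on(self, cb):
        self._cb = cb

    async def push(self, value):
        await self._cb(value * 2)  # pretend processing

    async def flush(self):
        pass  # real processors drain buffered work here


async def main():
    result = None

    async def capture(r):
        nonlocal result
        result = r

    p = TinyProcessor()
    p.on(capture)          # register the callback first
    await p.push(21)       # feed input
    await p.flush()        # ensure everything has been emitted
    assert result == 42    # then read the captured output


asyncio.run(main())
```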
    async def detect_topics(
        self, transcript: TranscriptType, target_language: str
    ) -> list[TitleSummary]:
        chunk_size = 300
        topics: list[TitleSummary] = []

        async def on_topic(topic: TitleSummary):
            topics.append(topic)
            return await self.on_topic(topic)

        topic_detector = TranscriptTopicDetectorProcessor(callback=on_topic)
        topic_detector.set_pipeline(self.empty_pipeline)

        for i in range(0, len(transcript.words), chunk_size):
            chunk_words = transcript.words[i : i + chunk_size]
            if not chunk_words:
                continue

            chunk_transcript = TranscriptType(
                words=chunk_words, translation=transcript.translation
            )
            await topic_detector.push(chunk_transcript)

        await topic_detector.flush()
        return topics

    async def generate_title(self, topics: list[TitleSummary]):
        if not topics:
            self.logger.warning("No topics for title generation")
            return

        processor = TranscriptFinalTitleProcessor(callback=self.on_title)
        processor.set_pipeline(self.empty_pipeline)

        for topic in topics:
            await processor.push(topic)

        await processor.flush()

    async def generate_summaries(self, topics: list[TitleSummary]):
        if not topics:
            self.logger.warning("No topics for summary generation")
            return

        transcript = await self.get_transcript()
        processor = TranscriptFinalSummaryProcessor(
            transcript=transcript,
            callback=self.on_long_summary,
            on_short_summary=self.on_short_summary,
        )
        processor.set_pipeline(self.empty_pipeline)

        for topic in topics:
            await processor.push(topic)

        await processor.flush()


@shared_task
@asynctask
async def task_pipeline_multitrack_process(
    *, transcript_id: str, bucket_name: str, track_keys: list[str]
):
    pipeline = PipelineMainMultitrack(transcript_id=transcript_id)
    try:
        await pipeline.set_status(transcript_id, "processing")
        await pipeline.process(bucket_name, track_keys)
    except Exception:
        await pipeline.set_status(transcript_id, "error")
        raise

    post_chain = chain(
        task_cleanup_consent.si(transcript_id=transcript_id),
        task_pipeline_post_to_zulip.si(transcript_id=transcript_id),
        task_send_webhook_if_needed.si(transcript_id=transcript_id),
    )
    post_chain.delay()
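Editor's note: the post-processing chain uses immutable signatures (`.si`), so each task receives only its own `transcript_id` instead of the previous task's return value. A self-contained sketch (the demo app and `step` task are hypothetical):

```python
from celery import Celery, chain

app = Celery("demo", broker="memory://")


@app.task
def step(name: str) -> str:
    return name


# .si() freezes each task's arguments and ignores the parent's result,
# mirroring how the pipeline chains cleanup -> zulip -> webhook above.
workflow = chain(step.si("cleanup"), step.si("zulip"), step.si("webhook"))
```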
@@ -1,6 +1,6 @@
 from textwrap import dedent

-from pydantic import BaseModel, Field
+from pydantic import BaseModel, ConfigDict, Field

 from reflector.llm import LLM
 from reflector.processors.base import Processor
@@ -34,8 +34,14 @@ TOPIC_PROMPT = dedent(
 class TopicResponse(BaseModel):
     """Structured response for topic detection"""

-    title: str = Field(description="A descriptive title for the topic being discussed")
-    summary: str = Field(description="A concise 1-2 sentence summary of the discussion")
+    model_config = ConfigDict(populate_by_name=True)
+
+    title: str = Field(
+        description="A descriptive title for the topic being discussed", alias="Title"
+    )
+    summary: str = Field(
+        description="A concise 1-2 sentence summary of the discussion", alias="Summary"
+    )


 class TranscriptTopicDetectorProcessor(Processor):
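Editor's note: the aliases let the model validate LLM responses whose JSON keys are capitalized, while `populate_by_name` keeps the lowercase field names usable from code. A minimal check, assuming pydantic v2:

```python
from pydantic import BaseModel, ConfigDict, Field


class TopicResponse(BaseModel):
    model_config = ConfigDict(populate_by_name=True)

    title: str = Field(alias="Title")
    summary: str = Field(alias="Summary")


# Accepts the LLM's capitalized keys (validation uses the alias)...
TopicResponse.model_validate({"Title": "Fish", "Summary": "An argument."})
# ...and, thanks to populate_by_name, the lowercase field names too.
TopicResponse(title="Fish", summary="An argument.")
```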
server/reprocess_transcript.py (new file, 65 lines)
@@ -0,0 +1,65 @@
#!/usr/bin/env python
"""
Reprocess the Daily.co multitrack recording to fix audio mixdown
"""

import asyncio

from reflector.pipelines.main_multitrack_pipeline import (
    task_pipeline_multitrack_process,
)


async def reprocess():
    """Process the multitrack recording with fixed mixdown"""

    bucket_name = "reflector-dailyco-local"
    track_keys = [
        "monadical/daily-20251020193458/1760988935484-52f7f48b-fbab-431f-9a50-87b9abfc8255-cam-audio-1760988935922",
        "monadical/daily-20251020193458/1760988935484-a37c35e3-6f8e-4274-a482-e9d0f102a732-cam-audio-1760988943823",
    ]

    # Create a new transcript with fixed mixdown
    import uuid

    from reflector.db import get_database
    from reflector.db.transcripts import Transcript, transcripts

    db = get_database()
    await db.connect()

    try:
        transcript_id = str(uuid.uuid4())
        transcript = Transcript(
            id=transcript_id,
            name="Daily Multitrack - With Audio Mixdown",
            source_kind="file",
            source_language="en",
            target_language="en",
            status="idle",
            events=[],
            title="",
        )

        query = transcripts.insert().values(**transcript.model_dump())
        await db.execute(query)
        print(f"Created transcript: {transcript_id}")

        # Process with the fixed pipeline
        await task_pipeline_multitrack_process(
            transcript_id=transcript_id, bucket_name=bucket_name, track_keys=track_keys
        )

        print(
            f"Processing complete! Check: http://localhost:3000/transcripts/{transcript_id}"
        )

        return transcript_id
    finally:
        await db.disconnect()


if __name__ == "__main__":
    transcript_id = asyncio.run(reprocess())
    print("\n✅ Reprocessing complete!")
    print(f"📍 View at: http://localhost:3000/transcripts/{transcript_id}")
server/test_multitrack_ffmpeg.py (new file, 124 lines)
@@ -0,0 +1,124 @@
#!/usr/bin/env python
"""
Test script to trigger multitrack recording processing with ffmpeg padding fix
"""

import asyncio

from reflector.pipelines.main_multitrack_pipeline import PipelineMainMultitrack


async def test_processing():
    """Manually trigger multitrack processing for the test recording"""

    # Initialize database connection
    from reflector.db import get_database

    db = get_database()
    await db.connect()

    try:
        # The test recording with known speaker timeline
        bucket_name = "monadical"
        track_keys = [
            "daily-20251020193458/1760988935484-52f7f48b-fbab-431f-9a50-87b9abfc8255-cam-audio-1760988935922.webm",
            "daily-20251020193458/1760988935484-a37c35e3-6f8e-4274-a482-e9d0f102a732-cam-audio-1760988943823.webm",
        ]

        # Create a new transcript ID
        import uuid

        transcript_id = str(uuid.uuid4())

        # Create transcript directly with SQL
        from reflector.db.transcripts import (
            Transcript,
            transcripts,
            transcripts_controller,
        )

        pipeline = PipelineMainMultitrack(transcript_id=transcript_id)

        # Create transcript model
        transcript = Transcript(
            id=transcript_id,
            name="FFMPEG Test - Daily Multitrack Recording",
            source_kind="file",
            source_language="en",
            target_language="en",
            status="idle",
            events=[],
            title="",
        )
        # Insert into database
        query = transcripts.insert().values(**transcript.model_dump())
        await db.execute(query)
        print(f"Created transcript: {transcript_id}")

        # Process the tracks using the pipeline
        print("Processing multitrack recording with ffmpeg padding...")
        print("Track 0: ...935922.webm (expected to start at ~2s)")
        print("Track 1: ...943823.webm (expected to start at ~51s)")

        try:
            await pipeline.set_status(transcript_id, "processing")
            await pipeline.process(bucket_name, track_keys)
            print("Processing complete!")
        except Exception as e:
            await pipeline.set_status(transcript_id, "error")
            print(f"Error during processing: {e}")
            import traceback

            traceback.print_exc()
            raise

        # Check the results
        final_transcript = await transcripts_controller.get(transcript_id)
        print(f"\nTranscript status: {final_transcript.status}")
        print(f"Transcript title: {final_transcript.title}")

        # Extract timeline from events
        if final_transcript.events:
            for event in final_transcript.events:
                if event.get("event") == "TRANSCRIPT":
                    text = event.get("data", {}).get("text", "")
                    # Show first 500 chars to check if speakers are properly separated
                    print("\nTranscript text (first 500 chars):")
                    print(text[:500])

                    # Show last 500 chars too to see if second speaker is at the end
                    print("\nTranscript text (last 500 chars):")
                    print(text[-500:])

                    # Count words per speaker
                    words = text.split()
                    print(f"\nTotal words in transcript: {len(words)}")

                    # Check if text has proper speaker separation
                    # Expected: First ~45% from speaker 0, then ~35% from speaker 1, then ~20% from speaker 0
                    first_third = " ".join(words[: len(words) // 3])
                    middle_third = " ".join(
                        words[len(words) // 3 : 2 * len(words) // 3]
                    )
                    last_third = " ".join(words[2 * len(words) // 3 :])

                    print(f"\nFirst third preview: {first_third[:100]}...")
                    print(f"Middle third preview: {middle_third[:100]}...")
                    print(f"Last third preview: {last_third[:100]}...")
                    break

        return transcript_id
    finally:
        await db.disconnect()


if __name__ == "__main__":
    transcript_id = asyncio.run(test_processing())
    print(f"\n✅ Test complete! Transcript ID: {transcript_id}")
    print("\nExpected timeline:")
    print("  Speaker 0: ~2s to ~49s (first participant speaks)")
    print("  Speaker 1: ~51s to ~70s (second participant speaks)")
    print("  Speaker 0: ~73s to end (first participant speaks again)")
    print(
        "\nIf the text shows proper chronological order (not interleaved), the fix worked!"
    )
server/test_multitrack_ffmpeg_local.py (new file, 162 lines)
@@ -0,0 +1,162 @@
#!/usr/bin/env python
"""
Test script to trigger multitrack recording processing with ffmpeg padding fix
This version loads tracks from local filesystem instead of S3
"""

import asyncio
import os

from reflector.pipelines.main_multitrack_pipeline import PipelineMainMultitrack


async def test_processing():
    """Manually trigger multitrack processing for the test recording"""

    # Initialize database connection
    from reflector.db import get_database

    db = get_database()
    await db.connect()

    try:
        # Create a new transcript ID
        import uuid

        transcript_id = str(uuid.uuid4())

        # Create transcript directly with SQL
        from reflector.db.transcripts import (
            Transcript,
            transcripts,
            transcripts_controller,
        )

        pipeline = PipelineMainMultitrack(transcript_id=transcript_id)

        # Create transcript model
        transcript = Transcript(
            id=transcript_id,
            name="FFMPEG Test - Daily Multitrack Recording",
            source_kind="file",
            source_language="en",
            target_language="en",
            status="idle",
            events=[],
            title="",
        )
        # Insert into database
        query = transcripts.insert().values(**transcript.model_dump())
        await db.execute(query)
        print(f"Created transcript: {transcript_id}")

        # Read track files from local filesystem (in the container they'll be at /app/)
        tracks_dir = "/app"
        track_files = [
            "1760988935484-52f7f48b-fbab-431f-9a50-87b9abfc8255-cam-audio-1760988935922.webm",
            "1760988935484-a37c35e3-6f8e-4274-a482-e9d0f102a732-cam-audio-1760988943823.webm",
        ]

        # Read track data
        track_datas = []
        for track_file in track_files:
            file_path = os.path.join(tracks_dir, track_file)
            if os.path.exists(file_path):
                with open(file_path, "rb") as f:
                    track_datas.append(f.read())
                print(f"Loaded track: {track_file} ({len(track_datas[-1])} bytes)")
            else:
                print(f"Track file not found: {file_path}")
                track_datas.append(b"")

        # Process the tracks using the pipeline
        print("\nProcessing multitrack recording with ffmpeg padding...")
        print("Track 0: ...935922.webm (expected to start at ~2s)")
        print("Track 1: ...943823.webm (expected to start at ~51s)")

        # Call the process method directly with track data
        # We'll need to mock S3 operations and directly work with the data

        try:
            await pipeline.set_status(transcript_id, "processing")

            # Create a mock bucket and keys setup
            bucket_name = "test-bucket"
            track_keys = ["track0.webm", "track1.webm"]

            # Mock S3 client to return our local data
            from unittest.mock import MagicMock, patch

            mock_s3 = MagicMock()

            def mock_get_object(Bucket, Key):
                idx = 0 if "track0" in Key else 1
                return {"Body": MagicMock(read=lambda: track_datas[idx])}

            mock_s3.get_object = mock_get_object

            # Patch boto3.client to return our mock
            with patch("boto3.client", return_value=mock_s3):
                await pipeline.process(bucket_name, track_keys)

            print("Processing complete!")
        except Exception as e:
            await pipeline.set_status(transcript_id, "error")
            print(f"Error during processing: {e}")
            import traceback

            traceback.print_exc()
            raise

        # Check the results
        final_transcript = await transcripts_controller.get(transcript_id)
        print(f"\nTranscript status: {final_transcript.status}")
        print(f"Transcript title: {final_transcript.title}")

        # Extract timeline from events
        if final_transcript.events:
            for event in final_transcript.events:
                if event.get("event") == "TRANSCRIPT":
                    text = event.get("data", {}).get("text", "")
                    # Show first 500 chars to check if speakers are properly separated
                    print("\nTranscript text (first 500 chars):")
                    print(text[:500])

                    # Show last 500 chars too to see if second speaker is at the end
                    print("\nTranscript text (last 500 chars):")
                    print(text[-500:])

                    # Count words per speaker
                    words = text.split()
                    print(f"\nTotal words in transcript: {len(words)}")

                    # Check if text has proper speaker separation
                    # Expected: First ~45% from speaker 0, then ~35% from speaker 1, then ~20% from speaker 0
                    first_third = " ".join(words[: len(words) // 3])
                    middle_third = " ".join(
                        words[len(words) // 3 : 2 * len(words) // 3]
                    )
                    last_third = " ".join(words[2 * len(words) // 3 :])

                    print(f"\nFirst third preview: {first_third[:100]}...")
                    print(f"Middle third preview: {middle_third[:100]}...")
                    print(f"Last third preview: {last_third[:100]}...")
                    break

        return transcript_id
    finally:
        await db.disconnect()


if __name__ == "__main__":
    transcript_id = asyncio.run(test_processing())
    print(f"\n✅ Test complete! Transcript ID: {transcript_id}")
    print("\nExpected timeline:")
    print("  Speaker 0: ~2s to ~49s (first participant speaks)")
    print("  Speaker 1: ~51s to ~70s (second participant speaks)")
    print("  Speaker 0: ~73s to end (first participant speaks again)")
    print(
        "\nIf the text shows proper chronological order (not interleaved), the fix worked!"
    )
server/test_s3_multitrack.py (new file, 66 lines)
@@ -0,0 +1,66 @@
#!/usr/bin/env python
"""
Test multitrack processing with correct S3 bucket configuration
"""

import asyncio
import uuid

from reflector.db import get_database
from reflector.db.transcripts import Transcript, transcripts
from reflector.pipelines.main_multitrack_pipeline import (
    task_pipeline_multitrack_process,
)


async def create_and_process():
    """Create a new transcript and process with correct S3 bucket"""

    # Correct S3 configuration
    bucket_name = "reflector-dailyco-local"
    track_keys = [
        "monadical/daily-20251020193458/1760988935484-52f7f48b-fbab-431f-9a50-87b9abfc8255-cam-audio-1760988935922",
        "monadical/daily-20251020193458/1760988935484-a37c35e3-6f8e-4274-a482-e9d0f102a732-cam-audio-1760988943823",
    ]

    # Create a new transcript
    db = get_database()
    await db.connect()

    try:
        transcript_id = str(uuid.uuid4())
        transcript = Transcript(
            id=transcript_id,
            name="Daily Multitrack - Correct S3 Bucket Test",
            source_kind="file",
            source_language="en",
            target_language="en",
            status="idle",
            events=[],
            title="",
        )

        query = transcripts.insert().values(**transcript.model_dump())
        await db.execute(query)
        print(f"Created transcript: {transcript_id}")

        # Trigger processing with Celery
        result = task_pipeline_multitrack_process.delay(
            transcript_id=transcript_id, bucket_name=bucket_name, track_keys=track_keys
        )

        print(f"Task ID: {result.id}")
        print(
            f"Processing started! Check: http://localhost:3000/transcripts/{transcript_id}"
        )
        print(f"API Status: http://localhost:1250/v1/transcripts/{transcript_id}")

        return transcript_id
    finally:
        await db.disconnect()


if __name__ == "__main__":
    transcript_id = asyncio.run(create_and_process())
    print("\n✅ Task submitted successfully!")
    print(f"📍 Transcript ID: {transcript_id}")
server/trigger_reprocess.py (new file, 23 lines)
@@ -0,0 +1,23 @@
#!/usr/bin/env python
"""
Trigger reprocessing of Daily.co multitrack recording via Celery
"""

from reflector.pipelines.main_multitrack_pipeline import (
    task_pipeline_multitrack_process,
)

# Trigger the Celery task
result = task_pipeline_multitrack_process.delay(
    transcript_id="32fad706-f8cf-434c-94c8-1ee69f7be081",  # The ID that was created
    bucket_name="reflector-dailyco-local",
    track_keys=[
        "monadical/daily-20251020193458/1760988935484-52f7f48b-fbab-431f-9a50-87b9abfc8255-cam-audio-1760988935922",
        "monadical/daily-20251020193458/1760988935484-a37c35e3-6f8e-4274-a482-e9d0f102a732-cam-audio-1760988943823",
    ],
)

print(f"Task ID: {result.id}")
print(
    "Processing started! Check: http://localhost:3000/transcripts/32fad706-f8cf-434c-94c8-1ee69f7be081"
)