Daily.co track merge vibe

This commit is contained in:
Igor Loskutov
2025-10-21 10:30:19 -04:00
parent f844b9fc1f
commit 7d239fe380
12 changed files with 1993 additions and 124 deletions


@@ -1,27 +1,27 @@
# Daily.co Integration Test Plan
## IMPLEMENTATION STATUS: Real Transcription Active
**This test validates Daily.co multitrack recording integration with REAL transcription/diarization.**
The implementation includes a complete audio processing pipeline:
- **Multitrack recordings** from Daily.co S3 (separate audio stream per participant)
- **PyAV-based audio mixdown** with PTS-based track alignment
- **Real transcription** via Modal GPU backend (Whisper)
- **Real diarization** via Modal GPU backend (speaker identification)
- **Per-track transcription** with timestamp synchronization
- **Complete database entities** (recording, transcript, topics, participants, words)
**Processing pipeline** (`PipelineMainMultitrack`):
1. Download all audio tracks from Daily.co S3
2. Align tracks by PTS (presentation timestamp) to handle late joiners (see the sketch after this list)
3. Mix tracks into single audio file for unified playback
4. Transcribe each track individually with proper offset handling
5. Perform diarization on mixed audio
6. Generate topics, summaries, and word-level timestamps
7. Convert audio to MP3 and generate waveform visualization
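**PTS alignment sketch** (illustrative only, not the actual `PipelineMainMultitrack` code; assumes each track's first-frame PTS encodes the participant's join offset within the recording session):
```python
import av  # PyAV, the library the pipeline uses for mixdown

def track_start_offset(path: str) -> float:
    """Seconds from session start to this track's first audio frame.

    Assumption for this sketch: Daily.co raw-tracks files carry PTS
    relative to the recording session, so a late joiner's first frame
    has a correspondingly large PTS. Verify against the actual files.
    """
    with av.open(path) as container:
        stream = container.streams.audio[0]
        for frame in container.decode(stream):
            if frame.pts is None:
                continue
            # pts ticks * time_base (a Fraction) -> seconds
            return float(frame.pts * stream.time_base)
    return 0.0

# Pad each track by its offset before mixing (step 3), and shift each
# track's word timestamps by the same offset after transcription (step 4):
# word.start += offset, word.end += offset
```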
The **file processing pipeline** then:
- Converts WebM to MP3 (for the frontend audio player)
- Generates waveform visualization data (`audio.json`)
- These files enable the frontend transcript page to render properly
**Note:** A stub processor (`process_daily_recording`) exists for testing webhook flow without GPU costs, but the production code path uses `process_multitrack_recording` with full ML pipeline.
---
@@ -29,6 +29,7 @@ The actual audio/video files are recorded to S3, but transcription/diarization i
**1. Environment Variables** (check in `.env.development.local`):
```bash
# Daily.co API Configuration
DAILY_API_KEY=<key>
DAILY_SUBDOMAIN=monadical
DAILY_WEBHOOK_SECRET=<base64-encoded-secret>
@@ -37,25 +38,43 @@ AWS_DAILY_S3_REGION=us-east-1
AWS_DAILY_ROLE_ARN=arn:aws:iam::950402358378:role/DailyCo
DAILY_MIGRATION_ENABLED=true
DAILY_MIGRATION_ROOM_IDS=["552640fd-16f2-4162-9526-8cf40cd2357e"]
# Transcription/Diarization Backend (Required for real processing)
DIARIZATION_BACKEND=modal
DIARIZATION_MODAL_API_KEY=<modal-api-key>
# TRANSCRIPTION_BACKEND is not explicitly set (uses default/modal)
```
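**Signature verification sketch** (for context on why `DAILY_WEBHOOK_SECRET` is base64-encoded; the signing string and header names below are assumptions to check against Daily's webhook docs, not confirmed API details):
```python
import base64
import hashlib
import hmac

def verify_daily_webhook(secret_b64: str, timestamp: str,
                         raw_body: bytes, signature: str) -> bool:
    """Return True if the webhook signature checks out.

    Scheme assumed here: HMAC-SHA256 over "{timestamp}.{raw_body}" using
    the base64-decoded DAILY_WEBHOOK_SECRET, with the digest base64-encoded.
    """
    key = base64.b64decode(secret_b64)
    message = timestamp.encode() + b"." + raw_body
    expected = base64.b64encode(
        hmac.new(key, message, hashlib.sha256).digest()
    ).decode()
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature)
```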
**2. Services Running:**
```bash
docker compose ps # server, postgres, redis, worker, beat should be UP
```
**IMPORTANT:** Worker and beat services MUST be running for transcription processing:
```bash
docker compose up -d worker beat
```
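To confirm the worker actually picked up pipeline tasks (rather than idling):
```bash
# Confirm the worker is alive and consuming tasks
docker compose logs worker --tail=50
# Surface recent errors, if any
docker compose logs worker 2>&1 | grep -iE "error|traceback" | tail -20
```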
**3. ngrok Tunnel for Webhooks:**
```bash
# Start ngrok (if not already running)
ngrok http 1250 --log=stdout > /tmp/ngrok.log 2>&1 &
# Get public URL
curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; data=json.load(sys.stdin); print(data['tunnels'][0]['public_url'])"
```
**Current ngrok URL:** `https://0503947384a3.ngrok-free.app` (as of last registration)
**4. Webhook Created:**
```bash
cd server
uv run python scripts/recreate_daily_webhook.py https://0503947384a3.ngrok-free.app/v1/daily/webhook
# Verify: "Created webhook <uuid> (state: ACTIVE)"
```
**Current webhook status:** ✅ ACTIVE (webhook ID: dad5ad16-ceca-488e-8fc5-dae8650b51d0)
---
## Test 1: Database Configuration
@@ -338,23 +357,25 @@ recorded_at: <recent-timestamp>
**Check transcript created:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT id, title, status, duration, recording_id, meeting_id, room_id
FROM transcript
ORDER BY created_at DESC LIMIT 1;"
```
**Expected (REAL transcription):**
```
id: <transcript-id>
title: <AI-generated title based on actual conversation content>
status: uploaded (audio file processed and available)
duration: <actual meeting duration in seconds>
recording_id: <same-as-recording-id-above>
meeting_id: <meeting-id>
room_id: 552640fd-16f2-4162-9526-8cf40cd2357e
```
**Note:** Title and content will reflect the ACTUAL conversation, not mock data. Processing time depends on recording length and GPU backend availability (Modal).
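Since real transcription can take minutes, polling the latest transcript status is handy. A minimal loop, reusing only commands already shown above (status values other than `uploaded` are assumptions based on the pipeline stages):
```bash
# Poll the latest transcript status every 10s; Ctrl-C once it settles
while true; do
  docker compose exec -T postgres psql -U reflector -d reflector -t -c \
    "SELECT status, title FROM transcript ORDER BY created_at DESC LIMIT 1;"
  sleep 10
done
```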
**Verify audio file exists:**
```bash
ls -lh data/<transcript-id>/upload.webm
@@ -365,12 +386,12 @@ ls -lh data/<transcript-id>/upload.webm
-rw-r--r-- 1 user staff <size scales with recording length> Oct 10 18:48 upload.webm
```
**Check transcript topics (REAL transcription):**
```bash
TRANSCRIPT_ID=$(docker compose exec -T postgres psql -U reflector -d reflector -t -c \
"SELECT id FROM transcript ORDER BY created_at DESC LIMIT 1;")
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT
jsonb_array_length(topics) as num_topics,
jsonb_array_length(participants) as num_participants,
@@ -380,55 +401,52 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
WHERE id = '$TRANSCRIPT_ID';"
```
**Expected (REAL data):**
```
num_topics: <varies based on conversation>
num_participants: <actual number of participants who spoke>
short_summary: <AI-generated summary of actual conversation>
title: <AI-generated title based on content>
```
**Check topics contain actual transcription:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT topics->0->'title', topics->0->'summary', topics->0->'transcript'
FROM transcript
ORDER BY created_at DESC LIMIT 1;" | head -20
```
**Expected output:** Will contain the ACTUAL transcribed conversation from the Daily.co meeting, not mock data.
**Check participants:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT participants FROM transcript ORDER BY created_at DESC LIMIT 1;" \
| python3 -c "import sys, json; data=json.loads(sys.stdin.read()); print(json.dumps(data, indent=2))"
```
**Expected (REAL diarization):**
```json
[
{
"id": "<uuid>",
"speaker": 0,
"name": "Fish Eater"
"name": "Speaker 1"
},
{
"id": "<uuid>",
"speaker": 1,
"name": "Annoying Person"
"name": "Speaker 2"
}
]
```
**Note:** Speaker names will be generic ("Speaker 1", "Speaker 2", etc.) as determined by the diarization backend. Number of participants depends on how many actually spoke during the meeting.
**Check word-level data:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT jsonb_array_length(topics->0->'words') as num_words_first_topic
FROM transcript
ORDER BY created_at DESC LIMIT 1;"
@@ -436,12 +454,12 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
**Expected:**
```
num_words_first_topic: <varies based on actual conversation length and topic chunking>
```
**Verify speaker diarization in words:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT
topics->0->'words'->0->>'text' as first_word,
topics->0->'words'->0->>'speaker' as speaker,
@@ -451,14 +469,16 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
ORDER BY created_at DESC LIMIT 1;"
```
**Expected (REAL transcription):**
```
first_word: <actual first word from transcription>
speaker: 0, 1, 2, ... (actual speaker ID from diarization)
start_time: <actual timestamp in seconds>
end_time: <actual end timestamp>
```
**Note:** All timestamps and speaker IDs are from real transcription/diarization, synchronized across tracks.
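As an additional diarization sanity check, per-speaker word counts for the first topic should look plausible for a real conversation (a sketch reusing the queries above; `jsonb_array_elements` expands the words array):
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
  "SELECT w->>'speaker' AS speaker, count(*) AS num_words
   FROM transcript, jsonb_array_elements(topics->0->'words') AS w
   WHERE id = (SELECT id FROM transcript ORDER BY created_at DESC LIMIT 1)
   GROUP BY 1 ORDER BY 1;"
```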
---
## Test 8: Recording Type Verification
@@ -579,13 +599,15 @@ Recording: raw-tracks
- [x] S3 path: `monadical/test2-{timestamp}/{recording-start-ts}-{participant-uuid}-cam-{audio|video}-{track-start-ts}`
- [x] Database `num_clients` increments/decrements correctly
- [x] **Database recording entry created** with correct S3 path and status `completed`
- [ ] **Database transcript entry created** with status `uploaded`
- [ ] **Audio file downloaded** to `data/{transcript_id}/upload.webm`
- [ ] **Transcript has REAL data**: AI-generated title based on conversation
- [ ] **Transcript has topics** generated from actual content
- [ ] **Transcript has participants** with proper speaker diarization
- [ ] **Topics contain word-level data** with accurate timestamps and speaker IDs
- [ ] **Total duration** matches actual meeting length
- [ ] **MP3 and waveform files generated** by file processing pipeline
- [ ] **Frontend transcript page loads** without "Failed to load audio" error
- [ ] **Audio player functional** with working playback and waveform visualization
- [ ] **Multitrack processing completed** without errors in worker logs
- [ ] **Modal GPU backends accessible** (transcription and diarization)
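**Troubleshooting:** if the new checklist items fail, start with the worker logs. The grep patterns below are assumptions about likely log strings, not guaranteed output; adjust as needed:
```bash
# Multitrack pipeline errors
docker compose logs worker 2>&1 | grep -iE "multitrack|pipeline" | grep -iE "error|fail" | tail -30
# Modal backend issues (transcription/diarization)
docker compose logs worker 2>&1 | grep -iE "modal" | tail -30
```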