Daily.co track merge vibe

This commit is contained in:
Igor Loskutov
2025-10-21 10:30:19 -04:00
parent f844b9fc1f
commit 7d239fe380
12 changed files with 1993 additions and 124 deletions


@@ -1,27 +1,27 @@
# Daily.co Integration Test Plan
## IMPLEMENTATION STATUS: Real Transcription Active
**This test validates Daily.co multitrack recording integration with REAL transcription/diarization.**
The implementation includes a complete audio processing pipeline:
- **Multitrack recordings** from Daily.co S3 (separate audio stream per participant)
- **PyAV-based audio mixdown** with PTS-based track alignment
- **Real transcription** via Modal GPU backend (Whisper)
- **Real diarization** via Modal GPU backend (speaker identification)
- **Per-track transcription** with timestamp synchronization
- **Complete database entities** (recording, transcript, topics, participants, words)
**Processing pipeline** (`PipelineMainMultitrack`):
1. Download all audio tracks from Daily.co S3
2. Align tracks by PTS (presentation timestamp) to handle late joiners (see the sketch after this list)
3. Mix tracks into single audio file for unified playback
4. Transcribe each track individually with proper offset handling
5. Perform diarization on mixed audio
6. Generate topics, summaries, and word-level timestamps
7. Convert audio to MP3 and generate waveform visualization
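**PTS alignment sketch** (illustrative only, not the actual `PipelineMainMultitrack` code; assumes each track's first-frame PTS encodes the participant's join offset within the recording session):
```python
import av  # PyAV, the library the pipeline uses for mixdown

def track_start_offset(path: str) -> float:
    """Seconds from session start to this track's first audio frame.

    Assumption for this sketch: Daily.co raw-tracks files carry PTS
    relative to the recording session, so a late joiner's first frame
    has a correspondingly large PTS. Verify against the actual files.
    """
    with av.open(path) as container:
        stream = container.streams.audio[0]
        for frame in container.decode(stream):
            if frame.pts is None:
                continue
            # pts ticks * time_base (a Fraction) -> seconds
            return float(frame.pts * stream.time_base)
    return 0.0

# Pad each track by its offset before mixing (step 3), and shift each
# track's word timestamps by the same offset after transcription (step 4):
# word.start += offset, word.end += offset
```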
The **file processing pipeline** then:
- Converts WebM to MP3 (for the frontend audio player)
- Generates waveform visualization data (`audio.json`)
- These files enable the frontend transcript page to render properly
**Note:** A stub processor (`process_daily_recording`) exists for testing webhook flow without GPU costs, but the production code path uses `process_multitrack_recording` with full ML pipeline.
---
@@ -29,6 +29,7 @@ The actual audio/video files are recorded to S3, but transcription/diarization i
**1. Environment Variables** (check in `.env.development.local`):
```bash
# Daily.co API Configuration
DAILY_API_KEY=<key>
DAILY_SUBDOMAIN=monadical
DAILY_WEBHOOK_SECRET=<base64-encoded-secret>
@@ -37,25 +38,43 @@ AWS_DAILY_S3_REGION=us-east-1
AWS_DAILY_ROLE_ARN=arn:aws:iam::950402358378:role/DailyCo
DAILY_MIGRATION_ENABLED=true
DAILY_MIGRATION_ROOM_IDS=["552640fd-16f2-4162-9526-8cf40cd2357e"]
# Transcription/Diarization Backend (Required for real processing)
DIARIZATION_BACKEND=modal
DIARIZATION_MODAL_API_KEY=<modal-api-key>
# TRANSCRIPTION_BACKEND is not explicitly set (uses default/modal)
```
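**Signature verification sketch** (for context on why `DAILY_WEBHOOK_SECRET` is base64-encoded; the signing string and header names below are assumptions to check against Daily's webhook docs, not confirmed API details):
```python
import base64
import hashlib
import hmac

def verify_daily_webhook(secret_b64: str, timestamp: str,
                         raw_body: bytes, signature: str) -> bool:
    """Return True if the webhook signature checks out.

    Scheme assumed here: HMAC-SHA256 over "{timestamp}.{raw_body}" using
    the base64-decoded DAILY_WEBHOOK_SECRET, with the digest base64-encoded.
    """
    key = base64.b64decode(secret_b64)
    message = timestamp.encode() + b"." + raw_body
    expected = base64.b64encode(
        hmac.new(key, message, hashlib.sha256).digest()
    ).decode()
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature)
```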
**2. Services Running:**
```bash
docker compose ps # server, postgres, redis, worker, beat should be UP
```
**IMPORTANT:** Worker and beat services MUST be running for transcription processing:
```bash
docker compose up -d worker beat
```
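To confirm the worker actually picked up pipeline tasks (rather than idling):
```bash
# Confirm the worker is alive and consuming tasks
docker compose logs worker --tail=50
# Surface recent errors, if any
docker compose logs worker 2>&1 | grep -iE "error|traceback" | tail -20
```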
**3. ngrok Tunnel for Webhooks:**
```bash
# Start ngrok (if not already running)
ngrok http 1250 --log=stdout > /tmp/ngrok.log 2>&1 &
# Get public URL
curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; data=json.load(sys.stdin); print(data['tunnels'][0]['public_url'])"
```
**Current ngrok URL:** `https://0503947384a3.ngrok-free.app` (as of last registration)
**4. Webhook Created:**
```bash
cd server
uv run python scripts/recreate_daily_webhook.py https://0503947384a3.ngrok-free.app/v1/daily/webhook
# Verify: "Created webhook <uuid> (state: ACTIVE)"
```
**Current webhook status:** ✅ ACTIVE (webhook ID: dad5ad16-ceca-488e-8fc5-dae8650b51d0)
---
## Test 1: Database Configuration
@@ -338,23 +357,25 @@ recorded_at: <recent-timestamp>
**Check transcript created:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT id, title, status, duration, recording_id, meeting_id, room_id
FROM transcript
ORDER BY created_at DESC LIMIT 1;"
```
**Expected (REAL transcription):**
```
id: <transcript-id>
title: <AI-generated title based on actual conversation content>
status: uploaded (audio file processed and available)
duration: <actual meeting duration in seconds>
recording_id: <same-as-recording-id-above>
meeting_id: <meeting-id>
room_id: 552640fd-16f2-4162-9526-8cf40cd2357e
```
**Note:** Title and content will reflect the ACTUAL conversation, not mock data. Processing time depends on recording length and GPU backend availability (Modal).
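Since real transcription can take minutes, polling the latest transcript status is handy. A minimal loop, reusing only commands already shown above (status values other than `uploaded` are assumptions based on the pipeline stages):
```bash
# Poll the latest transcript status every 10s; Ctrl-C once it settles
while true; do
  docker compose exec -T postgres psql -U reflector -d reflector -t -c \
    "SELECT status, title FROM transcript ORDER BY created_at DESC LIMIT 1;"
  sleep 10
done
```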
**Verify audio file exists:**
```bash
ls -lh data/<transcript-id>/upload.webm
@@ -365,12 +386,12 @@ ls -lh data/<transcript-id>/upload.webm
-rw-r--r-- 1 user staff <size scales with recording length> Oct 10 18:48 upload.webm
```
**Check transcript topics (REAL transcription):**
```bash
TRANSCRIPT_ID=$(docker compose exec -T postgres psql -U reflector -d reflector -t -c \
"SELECT id FROM transcript ORDER BY created_at DESC LIMIT 1;")
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT
jsonb_array_length(topics) as num_topics,
jsonb_array_length(participants) as num_participants,
@@ -380,55 +401,52 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
WHERE id = '$TRANSCRIPT_ID';"
```
**Expected (REAL data):**
```
num_topics: <varies based on conversation>
num_participants: <actual number of participants who spoke>
short_summary: <AI-generated summary of actual conversation>
title: <AI-generated title based on content>
```
**Check topics contain actual transcription:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT topics->0->'title', topics->0->'summary', topics->0->'transcript'
FROM transcript
ORDER BY created_at DESC LIMIT 1;" | head -20
```
**Expected output:** Will contain the ACTUAL transcribed conversation from the Daily.co meeting, not mock data.
**Check participants:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT participants FROM transcript ORDER BY created_at DESC LIMIT 1;" \
| python3 -c "import sys, json; data=json.loads(sys.stdin.read()); print(json.dumps(data, indent=2))"
```
**Expected (REAL diarization):**
```json
[
{
"id": "<uuid>",
"speaker": 0,
"name": "Fish Eater"
"name": "Speaker 1"
},
{
"id": "<uuid>",
"speaker": 1,
"name": "Annoying Person"
"name": "Speaker 2"
}
]
```
**Note:** Speaker names will be generic ("Speaker 1", "Speaker 2", etc.) as determined by the diarization backend. Number of participants depends on how many actually spoke during the meeting.
**Check word-level data:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT jsonb_array_length(topics->0->'words') as num_words_first_topic
FROM transcript
ORDER BY created_at DESC LIMIT 1;"
@@ -436,12 +454,12 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
**Expected:**
```
num_words_first_topic: <varies based on actual conversation length and topic chunking>
```
**Verify speaker diarization in words:**
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT
topics->0->'words'->0->>'text' as first_word,
topics->0->'words'->0->>'speaker' as speaker,
@@ -451,14 +469,16 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
ORDER BY created_at DESC LIMIT 1;"
```
**Expected (REAL transcription):**
```
first_word: <actual first word from transcription>
speaker: 0, 1, 2, ... (actual speaker ID from diarization)
start_time: <actual timestamp in seconds>
end_time: <actual end timestamp>
```
**Note:** All timestamps and speaker IDs are from real transcription/diarization, synchronized across tracks.
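As an additional diarization sanity check, per-speaker word counts for the first topic should look plausible for a real conversation (a sketch reusing the queries above; `jsonb_array_elements` expands the words array):
```bash
docker compose exec -T postgres psql -U reflector -d reflector -c \
  "SELECT w->>'speaker' AS speaker, count(*) AS num_words
   FROM transcript, jsonb_array_elements(topics->0->'words') AS w
   WHERE id = (SELECT id FROM transcript ORDER BY created_at DESC LIMIT 1)
   GROUP BY 1 ORDER BY 1;"
```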
---
## Test 8: Recording Type Verification
@@ -579,13 +599,15 @@ Recording: raw-tracks
- [x] S3 path: `monadical/test2-{timestamp}/{recording-start-ts}-{participant-uuid}-cam-{audio|video}-{track-start-ts}`
- [x] Database `num_clients` increments/decrements correctly
- [x] **Database recording entry created** with correct S3 path and status `completed`
- [ ] **Database transcript entry created** with status `uploaded`
- [ ] **Audio file downloaded** to `data/{transcript_id}/upload.webm`
- [ ] **Transcript has REAL data**: AI-generated title based on conversation
- [ ] **Transcript has topics** generated from actual content
- [ ] **Transcript has participants** with proper speaker diarization
- [ ] **Topics contain word-level data** with accurate timestamps and speaker IDs
- [ ] **Total duration** matches actual meeting length
- [ ] **MP3 and waveform files generated** by file processing pipeline
- [ ] **Frontend transcript page loads** without "Failed to load audio" error
- [ ] **Audio player functional** with working playback and waveform visualization
- [ ] **Multitrack processing completed** without errors in worker logs
- [ ] **Modal GPU backends accessible** (transcription and diarization)
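**Troubleshooting:** if the new checklist items fail, start with the worker logs. The grep patterns below are assumptions about likely log strings, not guaranteed output; adjust as needed:
```bash
# Multitrack pipeline errors
docker compose logs worker 2>&1 | grep -iE "multitrack|pipeline" | grep -iE "error|fail" | tail -30
# Modal backend issues (transcription/diarization)
docker compose logs worker 2>&1 | grep -iE "modal" | tail -30
```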