dailico track merge vibe
@@ -1,27 +1,27 @@
# Daily.co Integration Test Plan

-## ⚠️ IMPORTANT: Stub Implementation
+## ✅ IMPLEMENTATION STATUS: Real Transcription Active

-**This test validates Daily.co webhook integration with MOCK transcription data.**
+**This test validates Daily.co multitrack recording integration with REAL transcription/diarization.**

-The actual audio/video files are recorded to S3, but transcription/diarization is NOT performed. Instead:
-- A **stub processor** generates a fake transcript with predetermined text ("The Great Fish Eating Argument")
-- **Audio track is downloaded from Daily.co S3** to local storage for playback in the frontend
-- All database entities (recording, transcript, topics, participants, words) are created with **fake "fish" conversation data**
-- This allows testing the complete webhook → database flow WITHOUT expensive GPU processing
+The implementation includes the complete audio processing pipeline:
+- **Multitrack recordings** from Daily.co S3 (separate audio stream per participant)
+- **PyAV-based audio mixdown** with PTS-based track alignment
+- **Real transcription** via Modal GPU backend (Whisper)
+- **Real diarization** via Modal GPU backend (speaker identification)
+- **Per-track transcription** with timestamp synchronization
+- **Complete database entities** (recording, transcript, topics, participants, words)

-**Expected transcript content:**
-- Title: "The Great Fish Eating Argument"
-- Participants: "Fish Eater" (speaker 0), "Annoying Person" (speaker 1)
-- Transcription: Nonsensical argument about eating fish (see `reflector/worker/daily_stub_data.py`)
-- Audio file: Downloaded WebM from Daily.co S3 (stored in `data/{transcript_id}/upload.webm`)
+**Processing pipeline** (`PipelineMainMultitrack`):
+1. Download all audio tracks from Daily.co S3
+2. Align tracks by PTS (presentation timestamp) to handle late joiners
+3. Mix tracks into a single audio file for unified playback
+4. Transcribe each track individually with proper offset handling
+5. Perform diarization on the mixed audio
+6. Generate topics, summaries, and word-level timestamps
+7. Convert audio to MP3 and generate waveform visualization
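For orientation, the alignment in steps 2 and 4 can be pictured as computing each track's start offset from its first decoded frame's PTS and shifting everything against the earliest-starting track. The sketch below is illustrative only and assumes plain PyAV decoding; the helper names are hypothetical and this is not the actual `PipelineMainMultitrack` code.

```python
# Illustrative sketch (hypothetical helpers, not the actual PipelineMainMultitrack code):
# derive per-track start offsets from the first decoded audio frame's PTS so that
# late joiners are aligned against the earliest-starting track before mixdown.
import av  # PyAV


def track_start_seconds(path: str) -> float:
    """Timestamp (seconds) of the first decoded audio frame in a track file."""
    with av.open(path) as container:
        for frame in container.decode(audio=0):
            if frame.pts is None:
                continue
            return float(frame.pts * frame.time_base)
    return 0.0


def alignment_offsets(track_paths: list[str]) -> dict[str, float]:
    """Offset of each track relative to the earliest-starting track."""
    starts = {path: track_start_seconds(path) for path in track_paths}
    earliest = min(starts.values())
    return {path: start - earliest for path, start in starts.items()}
```

These offsets can then pad the start of each track during mixdown and shift per-track transcription timestamps onto the shared timeline.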
**File processing pipeline** then:
- Converts WebM to MP3 format (for the frontend audio player)
- Generates waveform visualization data (audio.json)
- These files enable proper frontend transcript page display

-**Next implementation step:** Replace stub with real transcription pipeline (merge audio tracks, run Whisper/diarization).
+**Note:** A stub processor (`process_daily_recording`) exists for testing the webhook flow without GPU costs, but the production code path uses `process_multitrack_recording` with the full ML pipeline.

---

@@ -29,6 +29,7 @@ The actual audio/video files are recorded to S3, but transcription/diarization i

**1. Environment Variables** (check in `.env.development.local`):
```bash
# Daily.co API Configuration
DAILY_API_KEY=<key>
DAILY_SUBDOMAIN=monadical
DAILY_WEBHOOK_SECRET=<base64-encoded-secret>
@@ -37,25 +38,43 @@ AWS_DAILY_S3_REGION=us-east-1
AWS_DAILY_ROLE_ARN=arn:aws:iam::950402358378:role/DailyCo
DAILY_MIGRATION_ENABLED=true
DAILY_MIGRATION_ROOM_IDS=["552640fd-16f2-4162-9526-8cf40cd2357e"]

+# Transcription/Diarization Backend (Required for real processing)
+DIARIZATION_BACKEND=modal
+DIARIZATION_MODAL_API_KEY=<modal-api-key>
+# TRANSCRIPTION_BACKEND is not explicitly set (uses default/modal)
```
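Note that `DAILY_MIGRATION_ROOM_IDS` is a JSON array stored in an environment variable. A minimal way to read those two settings looks like the sketch below; this is an illustration, not the Reflector settings code, and what the flag gates is presumed from its name.

```python
# Minimal sketch, not the actual Reflector settings code: read the migration flags
# from the environment, where DAILY_MIGRATION_ROOM_IDS is a JSON-encoded list.
import json
import os

migration_enabled = os.environ.get("DAILY_MIGRATION_ENABLED", "false").lower() == "true"
migration_room_ids = json.loads(os.environ.get("DAILY_MIGRATION_ROOM_IDS", "[]"))

if migration_enabled and "552640fd-16f2-4162-9526-8cf40cd2357e" in migration_room_ids:
    print("test room is enabled for the Daily.co integration")
```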
**2. Services Running:**
```bash
-docker-compose ps # server, postgres, redis should be UP
+docker compose ps # server, postgres, redis, worker, beat should be UP
```

+**IMPORTANT:** Worker and beat services MUST be running for transcription processing:
+```bash
+docker compose up -d worker beat
+```

**3. ngrok Tunnel for Webhooks:**
```bash
-ngrok http 1250 # Note the URL (e.g., https://abc123.ngrok-free.app)
+# Start ngrok (if not already running)
+ngrok http 1250 --log=stdout > /tmp/ngrok.log 2>&1 &
+
+# Get public URL
+curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; data=json.load(sys.stdin); print(data['tunnels'][0]['public_url'])"
```
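Equivalent to the `curl` one-liner above, a small standalone helper can read the public URL from ngrok's local API on port 4040; this is just a convenience sketch using the same endpoint.

```python
# Convenience sketch: fetch the first HTTPS public URL from ngrok's local API
# (http://localhost:4040/api/tunnels, the same endpoint the curl one-liner uses).
import json
from urllib.request import urlopen

with urlopen("http://localhost:4040/api/tunnels") as resp:
    tunnels = json.load(resp)["tunnels"]

https_urls = [t["public_url"] for t in tunnels if t["public_url"].startswith("https://")]
print(https_urls[0] if https_urls else "no https tunnel found")
```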
+**Current ngrok URL:** `https://0503947384a3.ngrok-free.app` (as of last registration)

**4. Webhook Created:**
```bash
cd server
-uv run python scripts/recreate_daily_webhook.py https://abc123.ngrok-free.app/v1/daily/webhook
+uv run python scripts/recreate_daily_webhook.py https://0503947384a3.ngrok-free.app/v1/daily/webhook
# Verify: "Created webhook <uuid> (state: ACTIVE)"
```

+**Current webhook status:** ✅ ACTIVE (webhook ID: dad5ad16-ceca-488e-8fc5-dae8650b51d0)
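If the webhook handshake needs manual debugging, signature checks against a base64-encoded secret usually follow an HMAC-SHA256 pattern like the sketch below. The signed-payload format and any header names are assumptions for illustration; they are not taken from the Daily.co documentation or the Reflector handler.

```python
# Illustrative HMAC-SHA256 check for a base64-encoded webhook secret.
# The timestamp+body payload format is an assumption, not a Daily.co specific.
import base64
import hashlib
import hmac


def signature_matches(secret_b64: str, timestamp: str, body: bytes, signature_hex: str) -> bool:
    key = base64.b64decode(secret_b64)
    message = timestamp.encode() + b"." + body
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```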
---

## Test 1: Database Configuration

@@ -338,23 +357,25 @@ recorded_at: <recent-timestamp>

**Check transcript created:**
```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT id, title, status, duration, recording_id, meeting_id, room_id
FROM transcript
ORDER BY created_at DESC LIMIT 1;"
```

-**Expected:**
+**Expected (REAL transcription):**
```
id: <transcript-id>
-title: The Great Fish Eating Argument
-status: uploaded (audio file downloaded for playback)
-duration: ~200-300 seconds (depends on fish text parsing)
+title: <AI-generated title based on actual conversation content>
+status: uploaded (audio file processed and available)
+duration: <actual meeting duration in seconds>
recording_id: <same-as-recording-id-above>
meeting_id: <meeting-id>
room_id: 552640fd-16f2-4162-9526-8cf40cd2357e
```

+**Note:** Title and content will reflect the ACTUAL conversation, not mock data. Processing time depends on recording length and GPU backend availability (Modal).
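Because real processing can take a while, it can help to poll the status query above until it settles. The rough helper below mirrors the `docker compose exec ... psql` invocation shown; treating any status other than `uploaded` or `error` as still in progress is an assumption.

```python
# Rough polling helper: re-run the status query above until the newest transcript
# leaves processing. Terminal statuses other than "uploaded" are assumptions.
import subprocess
import time

QUERY = "SELECT status FROM transcript ORDER BY created_at DESC LIMIT 1;"
CMD = [
    "docker", "compose", "exec", "-T", "postgres",
    "psql", "-U", "reflector", "-d", "reflector", "-t", "-A", "-c", QUERY,
]

for _ in range(60):  # poll for up to ~10 minutes
    status = subprocess.run(CMD, capture_output=True, text=True).stdout.strip()
    print("transcript status:", status or "<none>")
    if status in {"uploaded", "error"}:
        break
    time.sleep(10)
```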

**Verify audio file exists:**
```bash
ls -lh data/<transcript-id>/upload.webm
@@ -365,12 +386,12 @@ ls -lh data/<transcript-id>/upload.webm
-rw-r--r-- 1 user staff ~100-200K Oct 10 18:48 upload.webm
```

-**Check transcript topics (stub data):**
+**Check transcript topics (REAL transcription):**
```bash
-TRANSCRIPT_ID=$(docker-compose exec -T postgres psql -U reflector -d reflector -t -c \
+TRANSCRIPT_ID=$(docker compose exec -T postgres psql -U reflector -d reflector -t -c \
"SELECT id FROM transcript ORDER BY created_at DESC LIMIT 1;")

-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT
jsonb_array_length(topics) as num_topics,
jsonb_array_length(participants) as num_participants,
@@ -380,55 +401,52 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \
WHERE id = '$TRANSCRIPT_ID';"
```

-**Expected:**
+**Expected (REAL data):**
```
-num_topics: 3
-num_participants: 2
-short_summary: Two people argue about eating fish
-title: The Great Fish Eating Argument
+num_topics: <varies based on conversation>
+num_participants: <actual number of participants who spoke>
+short_summary: <AI-generated summary of actual conversation>
+title: <AI-generated title based on content>
```

-**Check topics contain fish text:**
+**Check topics contain actual transcription:**
```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT topics->0->'title', topics->0->'summary', topics->0->'transcript'
FROM transcript
ORDER BY created_at DESC LIMIT 1;" | head -20
```

-**Expected output should contain:**
-```
-Fish Argument Part 1
-Argument about eating fish continues (part 1)
-Fish for dinner are nothing wrong with you? There's nothing...
-```
+**Expected output:** Will contain the ACTUAL transcribed conversation from the Daily.co meeting, not mock data.

**Check participants:**
```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT participants FROM transcript ORDER BY created_at DESC LIMIT 1;" \
| python3 -c "import sys, json; data=json.loads(sys.stdin.read()); print(json.dumps(data, indent=2))"
```

-**Expected:**
+**Expected (REAL diarization):**
```json
[
  {
    "id": "<uuid>",
    "speaker": 0,
-    "name": "Fish Eater"
+    "name": "Speaker 1"
  },
  {
    "id": "<uuid>",
    "speaker": 1,
-    "name": "Annoying Person"
+    "name": "Speaker 2"
  }
]
```

+**Note:** Speaker names will be generic ("Speaker 1", "Speaker 2", etc.) as determined by the diarization backend. Number of participants depends on how many actually spoke during the meeting.
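Conceptually, those participant entries correspond to the distinct diarization speaker IDs found in the word-level data, each given a generic display name. A hypothetical reconstruction (not the Reflector implementation, shape taken from the JSON above):

```python
# Hypothetical reconstruction (not the Reflector implementation): build the participants
# list from the distinct speaker IDs in the word-level data, with generic display names.
import uuid


def build_participants(words: list[dict]) -> list[dict]:
    speakers = sorted({word["speaker"] for word in words})
    return [
        {"id": str(uuid.uuid4()), "speaker": speaker, "name": f"Speaker {index + 1}"}
        for index, speaker in enumerate(speakers)
    ]
```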
**Check word-level data:**
```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT jsonb_array_length(topics->0->'words') as num_words_first_topic
FROM transcript
ORDER BY created_at DESC LIMIT 1;"
@@ -436,12 +454,12 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \

**Expected:**
```
-num_words_first_topic: ~100-150 (varies based on topic chunking)
+num_words_first_topic: <varies based on actual conversation length and topic chunking>
```

**Verify speaker diarization in words:**
```bash
-docker-compose exec -T postgres psql -U reflector -d reflector -c \
+docker compose exec -T postgres psql -U reflector -d reflector -c \
"SELECT
topics->0->'words'->0->>'text' as first_word,
topics->0->'words'->0->>'speaker' as speaker,
@@ -451,14 +469,16 @@ docker-compose exec -T postgres psql -U reflector -d reflector -c \

ORDER BY created_at DESC LIMIT 1;"
```

-**Expected:**
+**Expected (REAL transcription):**
```
-first_word: Fish
-speaker: 0 or 1 (depends on parsing)
-start_time: 0.0
-end_time: 0.35 (approximate)
+first_word: <actual first word from transcription>
+speaker: 0, 1, 2, ... (actual speaker ID from diarization)
+start_time: <actual timestamp in seconds>
+end_time: <actual end timestamp>
```

+**Note:** All timestamps and speaker IDs are from real transcription/diarization, synchronized across tracks.
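Cross-track synchronization amounts to shifting each track's word timestamps by that track's alignment offset so all tracks share the mixed recording's timeline. A minimal sketch, with hypothetical helper and word keys (`start`/`end` are assumed, not confirmed field names):

```python
# Hypothetical helper: shift one track's word timestamps by its alignment offset so
# all tracks share the mixed recording's timeline. Word keys "start"/"end" are assumed.
def shift_words(words: list[dict], offset_seconds: float) -> list[dict]:
    return [
        {**word, "start": word["start"] + offset_seconds, "end": word["end"] + offset_seconds}
        for word in words
    ]
```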
---

## Test 8: Recording Type Verification

@@ -579,13 +599,15 @@ Recording: raw-tracks
- [x] S3 path: `monadical/test2-{timestamp}/{recording-start-ts}-{participant-uuid}-cam-{audio|video}-{track-start-ts}`
- [x] Database `num_clients` increments/decrements correctly
- [x] **Database recording entry created** with correct S3 path and status `completed`
-- [x] **Database transcript entry created** with status `uploaded`
-- [x] **Audio file downloaded** to `data/{transcript_id}/upload.webm` (~100-200KB)
-- [x] **Transcript has stub data**: title "The Great Fish Eating Argument"
-- [x] **Transcript has 3 topics** about fish argument
-- [x] **Transcript has 2 participants**: "Fish Eater" (speaker 0) and "Annoying Person" (speaker 1)
-- [x] **Topics contain word-level data** with timestamps and speaker IDs
-- [x] **Total duration** ~200-300 seconds based on fish text parsing
-- [x] **MP3 and waveform files generated** by file processing pipeline
-- [x] **Frontend transcript page loads** without "Failed to load audio" error
-- [x] **Audio player functional** with working playback and waveform visualization
+- [ ] **Database transcript entry created** with status `uploaded`
+- [ ] **Audio file downloaded** to `data/{transcript_id}/upload.webm`
+- [ ] **Transcript has REAL data**: AI-generated title based on conversation
+- [ ] **Transcript has topics** generated from actual content
+- [ ] **Transcript has participants** with proper speaker diarization
+- [ ] **Topics contain word-level data** with accurate timestamps and speaker IDs
+- [ ] **Total duration** matches actual meeting length
+- [ ] **MP3 and waveform files generated** by file processing pipeline
+- [ ] **Frontend transcript page loads** without "Failed to load audio" error
+- [ ] **Audio player functional** with working playback and waveform visualization
+- [ ] **Multitrack processing completed** without errors in worker logs
+- [ ] **Modal GPU backends accessible** (transcription and diarization)