reflector/server/docs/DAILY_REFLECTOR_DATA_MODEL.md
Igor Monadical 6c175a11d8 feat: brady bunch (#816)
Co-authored-by: Igor Loskutov <igor.loskutoff@gmail.com>
2026-01-23 12:33:06 -05:00

# Daily.co and Reflector Data Model
This document explains the data model relationships between Daily.co's API concepts and Reflector's database schema, clarifying common sources of confusion.
---
## Table of Contents
1. [Core Entities Overview](#core-entities-overview)
2. [Daily.co vs Reflector Terminology](#dailyco-vs-reflector-terminology)
3. [Entity Relationships](#entity-relationships)
4. [Recording Multiplicity](#recording-multiplicity)
5. [Session Identifiers Explained](#session-identifiers-explained)
6. [Time-Based Matching](#time-based-matching)
7. [Multitrack Recording Details](#multitrack-recording-details)
8. [Verified Example](#verified-example)
---
## Core Entities Overview
### Reflector's Four Primary Entities
```
┌─────────────────────────────────────────────────────────────────┐
│ Room (Reflector) │
│ - Persistent meeting template │
│ - User-created configuration │
│ - Example: "team-standup" │
└────────────────────┬────────────────────────────────────────────┘
│ 1:N
┌─────────────────────────────────────────────────────────────────┐
│ Meeting (Reflector) │
│ - Single session instance │
│ - Creates NEW Daily.co room with timestamp │
│ - Example: "team-standup-20260115120000" │
└────────────────────┬────────────────────────────────────────────┘
│ 1:N
┌─────────────────────────────────────────────────────────────────┐
│ Recording (Reflector + Daily.co) │
│ - One segment of audio/video │
│ - New recording created on stop/restart │
│ - track_keys: JSON array of S3 file paths │
└────────────────────┬────────────────────────────────────────────┘
│ 1:1
┌─────────────────────────────────────────────────────────────────┐
│ Transcript (Reflector) │
│ - Processed audio with transcription │
│ - Diarization, summaries, topics │
│ - One transcript per recording │
└─────────────────────────────────────────────────────────────────┘
```
---
## Daily.co vs Reflector Terminology
### Room
| Aspect | Daily.co | Reflector |
|--------|----------|-----------|
| **Definition** | Virtual meeting space on Daily.co platform | User-created meeting template/configuration |
| **Lifetime** | Configurable expiration | Persistent until user deletes |
| **Creation** | API call for each meeting | Pre-created by user once |
| **Reuse** | Can host multiple sessions | Generates new Daily.co room per meeting |
| **Name Format** | `room-name` (reusable) | `room-name` (base identifier) |
| **Timestamping** | Not required | Meeting adds timestamp: `{name}-YYYYMMDDHHMMSS` |
**Example:**
```
Reflector Room: "daily-private-igor" (persistent config)
↓ starts meeting
Daily.co Room: "daily-private-igor-20260110042117"
```
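The naming convention above can be sketched in Python. This is a minimal illustration, assuming the timestamp is the meeting start time expressed in UTC (which is consistent with the verified example in this document); `daily_room_name` is a hypothetical helper, not a function from the Reflector codebase.

```python
from datetime import datetime, timezone


def daily_room_name(base_name: str, started_at: datetime) -> str:
    """Build a timestamped Daily.co room name: {name}-YYYYMMDDHHMMSS (UTC assumed)."""
    stamp = started_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base_name}-{stamp}"


name = daily_room_name(
    "daily-private-igor",
    datetime(2026, 1, 10, 4, 21, 17, tzinfo=timezone.utc),
)
print(name)  # daily-private-igor-20260110042117
```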
### Meeting
| Aspect | Daily.co | Reflector |
|--------|----------|-----------|
| **Definition** | Session that starts when first participant joins | Explicit database record of a session |
| **Identifier** | `mtgSessionId` (generated by Daily.co) | `meeting.id` (UUID, generated by Reflector) |
| **Creation** | Implicit (first participant join) | Explicit API call before participants join |
| **Purpose** | Tracks active session state | Links recordings, transcripts, participants |
| **Scope** | Per room instance | Per Reflector room + timestamp |
**Critical Limitation:** Daily.co's recordings API often does NOT return `mtgSessionId`, requiring time-based matching (see [Time-Based Matching](#time-based-matching)).
### Recording
| Aspect | Daily.co | Reflector |
|--------|----------|-----------|
| **Definition** | Audio/video files on S3 | Metadata + processing status |
| **Types** | `cloud` (composed video), `raw-tracks` (multitrack) | Stores references + `track_keys` array |
| **Multiplicity** | One recording object per start/stop cycle | One DB row per Daily.co recording object |
| **Identifier** | Daily.co `recording_id` | Same `recording_id` (stored in DB) |
| **Multitrack** | Array of `.webm` files (one per participant) | `track_keys` JSON array with S3 paths |
| **Linkage** | Via `room_name` + `start_ts` | FK `meeting_id` (set via time-based match) |
**Critical Behavior:** Recording **stops/restarts** create **separate recording objects** with unique IDs.
---
## Entity Relationships
### Database Schema Relationships
```sql
-- Simplified schema showing key relationships
CREATE TABLE room (
    id VARCHAR PRIMARY KEY,
    name VARCHAR UNIQUE,
    platform VARCHAR -- 'whereby' | 'daily'
);

CREATE TABLE meeting (
    id VARCHAR PRIMARY KEY,
    room_id VARCHAR REFERENCES room(id) ON DELETE CASCADE, -- nullable
    room_name VARCHAR, -- Daily.co room name (timestamped)
    start_date TIMESTAMP,
    platform VARCHAR
);

CREATE TABLE recording (
    id VARCHAR PRIMARY KEY, -- Daily.co recording_id
    meeting_id VARCHAR, -- FK to meeting (set via time-based match)
    bucket_name VARCHAR,
    object_key VARCHAR, -- S3 prefix
    track_keys JSON, -- Array of S3 keys for multitrack
    recorded_at TIMESTAMP
);

CREATE TABLE transcript (
    id VARCHAR PRIMARY KEY,
    recording_id VARCHAR, -- nullable FK
    meeting_id VARCHAR, -- nullable FK
    room_id VARCHAR, -- nullable FK
    participants JSON, -- [{id, speaker, name, user_id}, ...]
    title VARCHAR,
    long_summary VARCHAR,
    webvtt TEXT
);
```
**Relationship Cardinalities:**
```
1 Room → N Meetings
1 Meeting → N Recordings (common: 1-21 recordings per meeting)
1 Recording → 1 Transcript
1 Meeting → N Transcripts (via recordings)
```
---
## Recording Multiplicity
### Why Multiple Recordings Per Meeting?
Daily.co creates a **new recording object** (new ID, new files) whenever recording stops and restarts. This happens due to:
1. **Manual stop/start** - User clicks stop, then start recording again
2. **Network reconnection** - Participant drops, reconnects → triggers restart
3. **Participant rejoin** - Last participant leaves, new one joins → new session
---
## Session Identifiers Explained
### The Hidden Entity: Daily.co Meeting Session
Daily.co has an **implicit ephemeral entity** that sits between Room and Recording:
```
Daily.co Room: "daily-private-igor-20260110042117"
├─ Daily.co Meeting Session #1 (mtgSessionId: c04334de...)
│ └─ Recording #3 (f4a50f94) - 4s, 1 track
└─ Daily.co Meeting Session #2 (mtgSessionId: 4cdae3c0...)
├─ Recording #2 (b0fa94da) - 80s, 2 tracks ← recording stopped
└─ Recording #1 (05edf519) - 62s, 1 track ← then restarted
```
**Daily.co Meeting Session:**
- **Lifecycle:** Starts when first participant joins, ends when last participant leaves
- **Identifier:** `mtgSessionId` (generated by Daily.co)
- **Persistence:** Ephemeral - new ID if everyone leaves and someone rejoins
- **Relationship:** 1 Session → N Recordings (if recording stops/restarts during session)
**Key Insight:** Multiple recordings can share the same `mtgSessionId` if recording was stopped and restarted while participants remained connected.
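To make the 1 Session → N Recordings relationship concrete, the sketch below groups recording IDs by `mtgSessionId`. The ID pairs are taken from the verified example in this document; `group_by_session` is a hypothetical helper, and recordings for which the API omits `mtgSessionId` would simply land under the `None` key.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (recording_id, mtgSessionId) pairs from the verified example
recordings = [
    ("f4a50f94-053c-4f9d-bda6-78ad051fbc36", "c04334de-42a0-4c2a-96be-a49b068dca85"),
    ("b0fa94da-73b5-4f95-9239-5216a682a505", "4cdae3c0-86cb-4578-8a6d-3a228bb48345"),
    ("05edf519-9048-4b49-9a75-73e9826fd950", "4cdae3c0-86cb-4578-8a6d-3a228bb48345"),
]


def group_by_session(
    pairs: Iterable[Tuple[str, Optional[str]]],
) -> Dict[Optional[str], List[str]]:
    """Map each mtgSessionId to the recording IDs that share it."""
    sessions: Dict[Optional[str], List[str]] = defaultdict(list)
    for recording_id, mtg_session_id in pairs:
        sessions[mtg_session_id].append(recording_id)
    return dict(sessions)


sessions = group_by_session(recordings)
for session_id, recording_ids in sessions.items():
    print(session_id, len(recording_ids))
```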
### mtgSessionId (Meeting Session Identifier)
`mtgSessionId` identifies a **Daily.co meeting session** (not individual participants, not a room).
### session_id (Per-Participant)
**Different concept:** `session_id` is a per-participant connection identifier delivered in webhooks. It is unrelated to `mtgSessionId`.
**Reflector Tracking:** `daily_participant_session` table
```sql
CREATE TABLE daily_participant_session (
    id VARCHAR PRIMARY KEY, -- {meeting_id}:{user_id}:{joined_at_ms}
    meeting_id VARCHAR,
    session_id VARCHAR, -- From webhook (per-participant)
    user_id VARCHAR,
    user_name VARCHAR,
    joined_at TIMESTAMP,
    left_at TIMESTAMP
);
```
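The composite primary key format `{meeting_id}:{user_id}:{joined_at_ms}` could be built as follows. This is a sketch under the assumption that `joined_at_ms` is the join time in Unix epoch milliseconds; `participant_session_id` is a hypothetical helper and the join time used here is illustrative. Integer `timedelta` division is used instead of `timestamp() * 1000` to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def participant_session_id(meeting_id: str, user_id: str, joined_at: datetime) -> str:
    """Compose {meeting_id}:{user_id}:{joined_at_ms}; joined_at must be timezone-aware."""
    joined_at_ms = (joined_at - EPOCH) // timedelta(milliseconds=1)
    return f"{meeting_id}:{user_id}:{joined_at_ms}"


row_id = participant_session_id(
    "034804b8-cee2-4fb4-94d7-122f6f068a61",
    "907f2cc1-eaab-435f-8ee2-09185f416b22",
    datetime(2026, 1, 10, 4, 21, 36, 877000, tzinfo=timezone.utc),  # illustrative
)
print(row_id)
```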
---
## Time-Based Matching
### Problem Statement
Daily.co's recordings API does not reliably return `mtgSessionId`, making it impossible to directly link recordings to meetings via Daily.co's identifiers.
**Example API response:**
```json
{
"id": "recording-uuid",
"room_name": "daily-private-igor-20260110042117",
"start_ts": 1768018896,
"mtgSessionId": null  // ← Missing!
}
```
### Solution: Time-Based Matching
**Implementation:** `reflector/db/meetings.py:get_by_room_name_and_time()`
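The body of `get_by_room_name_and_time()` is not reproduced in this document, so the following is only a sketch of the general idea, not the actual implementation. It assumes that matching means: among meetings with the same `room_name`, pick the one whose `start_date` is closest to the recording's `start_ts`, within a tolerance window. `MeetingRow`, `match_meeting`, and the 24-hour default window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class MeetingRow:  # hypothetical stand-in for a row of the meeting table
    id: str
    room_name: str
    start_date: datetime  # timezone-aware


def match_meeting(
    meetings: Sequence[MeetingRow],
    room_name: str,
    start_ts: int,
    window: timedelta = timedelta(hours=24),  # assumed tolerance
) -> Optional[MeetingRow]:
    """Return the same-room meeting closest in time to the recording start, if any."""
    recorded_at = datetime.fromtimestamp(start_ts, tz=timezone.utc)
    candidates = [
        m
        for m in meetings
        if m.room_name == room_name and abs(recorded_at - m.start_date) <= window
    ]
    return min(candidates, key=lambda m: abs(recorded_at - m.start_date), default=None)


meeting = MeetingRow(
    id="034804b8-cee2-4fb4-94d7-122f6f068a61",
    room_name="daily-private-igor-20260110042117",
    start_date=datetime(2026, 1, 10, 4, 21, 17, tzinfo=timezone.utc),
)
matched = match_meeting([meeting], "daily-private-igor-20260110042117", 1768018896)
print(matched.id if matched else None)
```

With the values from the verified example, the recording that started at `1768018896` resolves to the meeting row for the same timestamped room.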
---
## Multitrack Recording Details
### track_keys JSON Array
**Schema:** `recording.track_keys` (JSON, nullable)
Example recording with 2 audio tracks:
```json
{
"id": "b0fa94da-73b5-4f95-9239-5216a682a505",
"track_keys": [
"igormonadical/daily-private-igor-20260110042117/1768018896877-890c0eae-e186-4534-a7bd-7c794b7d6d7f-cam-audio-1768018914565",
"igormonadical/daily-private-igor-20260110042117/1768018896877-9660e8e9-4297-4f17-951d-0b2bf2401803-cam-audio-1768018899286"
]
}
```
**Semantics:**
- `track_keys = null` → Not multitrack (cloud recording)
- `track_keys = []` → Multitrack recording with no audio captured (silence/muted)
- `track_keys = [...]` → Multitrack with N audio tracks
**Property:** `recording.is_multitrack` (Python)
```python
@property
def is_multitrack(self) -> bool:
    # True only when at least one track key is present
    return self.track_keys is not None and len(self.track_keys) > 0
```
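The three `track_keys` states listed above can be told apart with a small helper. This is an illustrative sketch; `describe_track_keys` is a hypothetical name and the returned labels are not used anywhere in Reflector.

```python
from typing import Optional, Sequence


def describe_track_keys(track_keys: Optional[Sequence[str]]) -> str:
    """Classify a recording by the state of its track_keys column."""
    if track_keys is None:
        return "cloud recording (not multitrack)"
    if len(track_keys) == 0:
        return "multitrack, no audio captured"
    return f"multitrack, {len(track_keys)} audio track(s)"


print(describe_track_keys(None))
print(describe_track_keys([]))
print(describe_track_keys(["key-a", "key-b"]))
```

Note that the `is_multitrack` property as shown returns `False` for both `None` and `[]`, so code that needs to distinguish the two cases has to test `track_keys is None` directly.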
### Track Filename Format
Daily.co multitrack filenames encode timing and participant information:
**Format:** `{recording_start_ts}-{participant_id}-cam-audio-{track_start_ts}`
**Example:** `1768018896877-890c0eae-e186-4534-a7bd-7c794b7d6d7f-cam-audio-1768018914565`
**Parsed Components:**
```python
# reflector/utils/daily.py:25-60
from typing import NamedTuple


class DailyRecordingFilename(NamedTuple):
    recording_start_ts: int  # 1768018896877 (milliseconds)
    participant_id: str  # 890c0eae-e186-4534-a7bd-7c794b7d6d7f
    track_start_ts: int  # 1768018914565 (milliseconds)
```
**Note:** Browser downloads from S3 add a `.webm` extension because of the MIME headers, but the S3 object keys themselves have no extension.
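A parser for this filename format might look like the sketch below. It is not the code from `reflector/utils/daily.py`; `ParsedTrackName` and `parse_track_key` are hypothetical names, and the regular expression assumes that `participant_id` is always a lowercase UUID and that only `cam-audio` tracks are parsed. An optional `.webm` suffix is tolerated for browser-downloaded files.

```python
import re
from typing import NamedTuple


class ParsedTrackName(NamedTuple):  # hypothetical, mirrors the fields described above
    recording_start_ts: int  # milliseconds
    participant_id: str
    track_start_ts: int  # milliseconds


_UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
_TRACK_RE = re.compile(rf"^(\d+)-({_UUID})-cam-audio-(\d+)(?:\.webm)?$")


def parse_track_key(s3_key: str) -> ParsedTrackName:
    """Parse the last path segment of an S3 key into its three components."""
    filename = s3_key.rsplit("/", 1)[-1]
    match = _TRACK_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized track filename: {filename!r}")
    recording_start, participant_id, track_start = match.groups()
    return ParsedTrackName(int(recording_start), participant_id, int(track_start))


parsed = parse_track_key(
    "igormonadical/daily-private-igor-20260110042117/"
    "1768018896877-890c0eae-e186-4534-a7bd-7c794b7d6d7f-cam-audio-1768018914565"
)
print(parsed)
```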
### Video Track Filtering
Daily.co API returns both audio and video tracks, but Reflector only processes audio.
**Filtering Logic:** `reflector/worker/process.py:660`
```python
track_keys = [t.s3Key for t in recording.tracks if t.type == "audio"]
```
**Example API Response:**
```json
{
"tracks": [
{"type": "audio", "s3Key": "...cam-audio-1768018914565"},
{"type": "audio", "s3Key": "...cam-audio-1768018899286"},
{"type": "video", "s3Key": "...cam-video-1768018897095"}  // ← Filtered out
]
}
```
**Result:** Only 2 audio tracks stored in `recording.track_keys`, video track discarded.
**Rationale:** Reflector is an audio transcription system; video is not needed for processing.
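The same filter can be tried against the example response. This sketch uses plain dictionaries rather than the typed track objects of the real worker code, and `audio_track_keys` is a hypothetical helper; the elided `...` prefixes are kept exactly as shown above.

```python
from typing import Dict, List

# Tracks from the example API response above (S3 keys elided as in the source)
tracks: List[Dict[str, str]] = [
    {"type": "audio", "s3Key": "...cam-audio-1768018914565"},
    {"type": "audio", "s3Key": "...cam-audio-1768018899286"},
    {"type": "video", "s3Key": "...cam-video-1768018897095"},
]


def audio_track_keys(tracks: List[Dict[str, str]]) -> List[str]:
    """Keep only the S3 keys of audio tracks, preserving their order."""
    return [t["s3Key"] for t in tracks if t["type"] == "audio"]


track_keys = audio_track_keys(tracks)
print(track_keys)
```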
### Track-to-Participant Mapping
**Flow:**
1. Daily.co webhook/polling provides `track_keys` array
2. Each track filename contains `participant_id`
3. Reflector queries Daily.co API: `GET /meetings/{mtgSessionId}/participants`
4. Maps `participant_id` → `user_name`
5. Stores in `transcript.participants` JSON:
```json
[
{
"id": "890c0eae-e186-4534-a7bd-7c794b7d6d7f",
"speaker": 0,
"name": "test2",
"user_id": "907f2cc1-eaab-435f-8ee2-09185f416b22"
},
{
"id": "9660e8e9-4297-4f17-951d-0b2bf2401803",
"speaker": 1,
"name": "test",
"user_id": "907f2cc1-eaab-435f-8ee2-09185f416b22"
}
]
```
**Diarization:** Multitrack recordings do not need AI speaker diarization, because each participant is recorded on a separate audio track and speaker identity follows from the track itself.
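Steps 2, 4, and 5 of the flow can be sketched as below. The sketch assumes the speaker index is the position of the track in `track_keys` (which matches the example data) and that participant names have already been fetched into a plain dictionary. `build_participants` and the fallback name are hypothetical, and `user_id` is omitted for brevity.

```python
import re
from typing import Dict, List

_UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
_PARTICIPANT_RE = re.compile(rf"^\d+-({_UUID})-cam-audio-\d+")


def build_participants(
    track_keys: List[str], names_by_participant_id: Dict[str, str]
) -> List[dict]:
    """Build transcript.participants entries, one per audio track."""
    participants = []
    for speaker, key in enumerate(track_keys):
        filename = key.rsplit("/", 1)[-1]
        match = _PARTICIPANT_RE.match(filename)
        if match is None:
            raise ValueError(f"unrecognized track filename: {filename!r}")
        participant_id = match.group(1)
        # Fallback name is an assumption for participants missing from the lookup
        name = names_by_participant_id.get(participant_id, f"Speaker {speaker}")
        participants.append({"id": participant_id, "speaker": speaker, "name": name})
    return participants


prefix = "igormonadical/daily-private-igor-20260110042117/"
participants = build_participants(
    [
        prefix + "1768018896877-890c0eae-e186-4534-a7bd-7c794b7d6d7f-cam-audio-1768018914565",
        prefix + "1768018896877-9660e8e9-4297-4f17-951d-0b2bf2401803-cam-audio-1768018899286",
    ],
    {
        "890c0eae-e186-4534-a7bd-7c794b7d6d7f": "test2",
        "9660e8e9-4297-4f17-951d-0b2bf2401803": "test",
    },
)
print(participants)
```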
---
## Verified Example
### Meeting: daily-private-igor-20260110042117
**Context:** A user conducted a test recording with start/stop cycles, producing 3 recordings.
#### Database State
```sql
-- Meeting
id: 034804b8-cee2-4fb4-94d7-122f6f068a61
room_name: daily-private-igor-20260110042117
start_date: 2026-01-10 04:21:17+00
```
#### Daily.co API Response
```json
[
{
"id": "f4a50f94-053c-4f9d-bda6-78ad051fbc36",
"room_name": "daily-private-igor-20260110042117",
"start_ts": 1768018885,
"duration": 4,
"status": "finished",
"mtgSessionId": "c04334de-42a0-4c2a-96be-a49b068dca85",
"tracks": [
{"type": "audio", "s3Key": "...62e8f3ae...cam-audio-1768018885417"}
]
},
{
"id": "b0fa94da-73b5-4f95-9239-5216a682a505",
"room_name": "daily-private-igor-20260110042117",
"start_ts": 1768018896,
"duration": 80,
"status": "finished",
"mtgSessionId": "4cdae3c0-86cb-4578-8a6d-3a228bb48345",
"tracks": [
{"type": "audio", "s3Key": "...890c0eae...cam-audio-1768018914565"},
{"type": "audio", "s3Key": "...9660e8e9...cam-audio-1768018899286"},
{"type": "video", "s3Key": "...9660e8e9...cam-video-1768018897095"}
]
},
{
"id": "05edf519-9048-4b49-9a75-73e9826fd950",
"room_name": "daily-private-igor-20260110042117",
"start_ts": 1768018914,
"duration": 62,
"status": "finished",
"mtgSessionId": "4cdae3c0-86cb-4578-8a6d-3a228bb48345",
"tracks": [
{"type": "audio", "s3Key": "...890c0eae...cam-audio-1768018914948"}
]
}
]
```
**Key Observations:**
- 3 recording objects returned by Daily.co
- 2 different `mtgSessionId` values (2 different Daily.co meeting sessions)
- Recording #2 has 3 tracks (2 audio + 1 video)
- Timestamps: 1768018885 → 1768018896 (+11s) → 1768018914 (+18s)
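The gaps between recording starts can be checked directly from the `start_ts` values in the response above (Unix seconds):

```python
from datetime import datetime, timezone

# start_ts values of the three recordings, in the order returned above
start_ts = [1768018885, 1768018896, 1768018914]

# Seconds between consecutive recording starts
gaps = [later - earlier for earlier, later in zip(start_ts, start_ts[1:])]
print(gaps)  # [11, 18]

# First recording start as a UTC timestamp
first_start = datetime.fromtimestamp(start_ts[0], tz=timezone.utc)
print(first_start.isoformat())  # 2026-01-10T04:21:25+00:00
```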
#### Reflector Database
**Recordings:**
```
┌──────────────────────────────────────┬──────────────┬────────────┬──────────────────────────────────────┐
│ id │ track_count │ duration │ mtgSessionId │
├──────────────────────────────────────┼──────────────┼────────────┼──────────────────────────────────────┤
│ f4a50f94-053c-4f9d-bda6-78ad051fbc36 │ 1 │ 4s │ c04334de-42a0-4c2a-96be-a49b068dca85 │
│ b0fa94da-73b5-4f95-9239-5216a682a505 │ 2 (video=0) │ 80s │ 4cdae3c0-86cb-4578-8a6d-3a228bb48345 │
│ 05edf519-9048-4b49-9a75-73e9826fd950 │ 1 │ 62s │ 4cdae3c0-86cb-4578-8a6d-3a228bb48345 │
└──────────────────────────────────────┴──────────────┴────────────┴──────────────────────────────────────┘
```
**Note:** Recording #2 has 2 audio tracks (video filtered out), not 3.
**Transcripts:**
```
┌──────────────────────────────────────┬──────────────────────────────────────┬──────────────┬──────────────────────────────────────────────┐
│ id │ recording_id │ participants │ title │
├──────────────────────────────────────┼──────────────────────────────────────┼──────────────┼──────────────────────────────────────────────┤
│ 17149b1f-546c-4837-80a0-f8140bd16592 │ f4a50f94-053c-4f9d-bda6-78ad051fbc36 │ 1 (test) │ (empty - no speech) │
│ 49801332-3222-4c11-bdb2-375479fc87f2 │ b0fa94da-73b5-4f95-9239-5216a682a505 │ 2 (test, │ "Examination and Validation Procedures │
│ │ │ test2) │ Review" │
│ e5271e12-20fb-42d2-b5a8-21438abadef9 │ 05edf519-9048-4b49-9a75-73e9826fd950 │ 1 (test2) │ "Technical Sound Check Procedure Review" │
└──────────────────────────────────────┴──────────────────────────────────────┴──────────────┴──────────────────────────────────────────────┘
```
**Transcript Content:**
*Transcript #1* (17149b1f): Empty WebVTT (no audio captured)
*Transcript #2* (49801332):
```webvtt
WEBVTT

00:00:03.109 --> 00:00:05.589
<v Speaker1>Test, test, test. Test, test, test, test, test.

00:00:19.829 --> 00:00:22.710
<v Speaker0>Test test test test test test test test test test test.
```
**AI-Generated Summary:**
> "The meeting focused on the critical importance of rigorous testing for ensuring reliability and quality, with test and test2 emphasizing the need for a structured testing framework and meticulous documentation..."
*Transcript #3* (e5271e12):
```webvtt
WEBVTT

00:00:02.029 --> 00:00:04.910
<v Speaker0>Test, test, test, test, test, test, test, test, test, test, test.
```
#### Validation: track_keys → participants
**Recording #2 (b0fa94da) tracks:**
```json
[
".../890c0eae-e186-4534-a7bd-7c794b7d6d7f-cam-audio-...",
".../9660e8e9-4297-4f17-951d-0b2bf2401803-cam-audio-..."
]
```
**Transcript #2 (49801332) participants:**
```json
[
{"id": "890c0eae-e186-4534-a7bd-7c794b7d6d7f", "speaker": 0, "name": "test2"},
{"id": "9660e8e9-4297-4f17-951d-0b2bf2401803", "speaker": 1, "name": "test"}
]
```
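The validation amounts to comparing the participant IDs embedded in the track keys with the IDs stored on the transcript. A minimal sketch, using the elided keys exactly as shown above; `participant_ids_from_keys` is a hypothetical helper:

```python
import re
from typing import Iterable, Set

_PARTICIPANT_RE = re.compile(
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-cam-audio-"
)

track_keys = [
    ".../890c0eae-e186-4534-a7bd-7c794b7d6d7f-cam-audio-...",
    ".../9660e8e9-4297-4f17-951d-0b2bf2401803-cam-audio-...",
]
participants = [
    {"id": "890c0eae-e186-4534-a7bd-7c794b7d6d7f", "speaker": 0, "name": "test2"},
    {"id": "9660e8e9-4297-4f17-951d-0b2bf2401803", "speaker": 1, "name": "test"},
]


def participant_ids_from_keys(keys: Iterable[str]) -> Set[str]:
    """Collect the participant UUID that precedes '-cam-audio-' in each key."""
    ids = set()
    for key in keys:
        match = _PARTICIPANT_RE.search(key)
        if match is not None:
            ids.add(match.group(1))
    return ids


track_ids = participant_ids_from_keys(track_keys)
transcript_ids = {p["id"] for p in participants}
print(track_ids == transcript_ids)  # True
```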
### Data Flow
```
Daily.co API: 3 recordings
Polling: _poll_raw_tracks_recordings()
Worker: process_multitrack_recording.delay() × 3
DB: 3 recording rows created
Pipeline: Audio processing + transcription × 3
DB: 3 transcript rows created (1:1 with recordings)
UI: User sees 3 separate transcripts
```
**Result:** ✅ 1:1 Recording → Transcript relationship maintained.
---
**Document Version:** 1.0
**Last Verified:** 2026-01-15
**Data Source:** Production database + Daily.co API inspection