# Conductor Migration Requirements: Daily.co Multitrack Pipeline

## Executive Summary

Migrate the Daily.co multitrack diarization pipeline from a monolithic Celery task to a decomposed Conductor workflow, enabling visual progress tracking, granular retries, and operational observability.

---

## Business Value

### 1. Visibility: Where Are We Now? (UX, DevEx)

**Current State**: Users see only three states: `idle` → `processing` → `ended/error`. A 10-minute pipeline appears frozen with no feedback.

**Target State**: Real-time visibility into which step is executing:
- "Transcribing track 2 of 3"
- "Generating summary (step 8 of 9)"
- Visual DAG in admin UI showing completed/in-progress/pending nodes

**Business Impact**:
- Reduced support tickets ("is it stuck?")
- Engineers can instantly identify bottlenecks
- Users have confidence the system is working

### 2. Progress Tracking: What's Left? (UX, DevEx)

**Current State**: No indication of remaining work. A failure at step 8 produces the same error as a failure at step 1.

**Target State**:
- Progress percentage based on completed steps
- Clear step enumeration (e.g., "Step 5/9: Transcription")
- Frontend receives structured progress events with step metadata

**Business Impact**:
- Users can estimate completion time
- Frontend can render meaningful progress bars
- Error messages include context ("Failed during summary generation")

### 3. Audit Trail & Profiling (DevEx, Ops)

**Current State**: Logs are scattered across Celery workers. There is no unified view of a single recording's journey, and per-step resource consumption is unknown.

**Target State**:
- Single workflow ID traces the entire recording lifecycle
- Per-step execution times recorded
- Resource consumption (GPU seconds, LLM tokens) attributable to specific steps
- Conductor UI provides complete audit history

**Business Impact**:
- Debugging: "Recording X failed at step Y after Z seconds"
- Cost attribution: "Transcription costs $X, summarization costs $Y"
- Performance optimization: identify slowest steps

### 4. Clear Event Dictionary (DevEx)

**Current State**: Frontend receives WebSocket events (`TRANSCRIPT`, `TOPIC`, `FINAL_TITLE`, etc.), but the mapping to pipeline phases is implicit. Adding new events requires tracing through Python code.

**Target State**:
- Each Conductor task explicitly defines its output events
- Event schema documented alongside task definition
- Frontend developers can reference the task→event mapping directly

**Business Impact**:
- Faster frontend development
- Reduced miscommunication between backend/frontend teams
- Self-documenting pipeline

### 5. Restart Without Reprocessing (UX, DevEx)

**Current State**: Any failure restarts the entire pipeline. A timeout during summary generation re-runs transcription (wasting GPU costs).

**Target State**:
- Failures resume from the last successful step
- Completed work is checkpointed (e.g., transcription results stored before summary)
- Manual retry triggers only the failed step, not the entire workflow

**Business Impact**:
- Reduced GPU/LLM costs on retries
- Faster recovery from transient failures
- Users don't wait for re-transcription on summary failures

### 6. Per-Step Timeouts (UX, DevEx)

**Current State**: A single task timeout covers the entire pipeline. A hung GPU call blocks everything. Killing the task loses all progress.

**Target State**:
- Each step has an independent timeout (e.g., transcription: 5min, LLM: 30s)
- A timeout kills only the hung step
- Pipeline can retry just that step or fail gracefully

**Business Impact**:
- Faster detection of stuck external services
- Reduced blast radius from hung calls
- More granular SLAs per operation type

### 7. Native Retries with Backoff (DevEx, UX)

**Current State**: Celery retry logic is per-task, not per-external-call. Custom retry wrappers are needed for each API call.

**Target State**:
- Conductor provides native retry policies per task
- Exponential backoff configured declaratively
- Retry state visible in UI (attempt 2/5)

**Business Impact**:
- Reduced boilerplate code
- Consistent retry behavior across all external calls
- Visibility into retry attempts for debugging

---

## Current Architecture

### Daily.co Multitrack Pipeline Flow

```
Daily webhook (recording.ready-to-download)       Polling (every 3 min)
                     │                                      │
                     ▼                                      ▼
         _handle_recording_ready()               poll_daily_recordings()
                     │                                      │
                     └──────────────────┬───────────────────┘
                                        ▼
                      process_multitrack_recording.delay()    ← Celery task #1
                                        │
                                        ├── Daily API: GET /recordings/{id}
                                        ├── Daily API: GET /meetings/{mtgSessionId}/participants
                                        ├── DB: Create recording + transcript
                                        │
                                        ▼
                    task_pipeline_multitrack_process.delay()  ← Celery task #2 (MONOLITH)
                                        │
                                        │  ┌─────────────────────────────────────────────────┐
                                        │  │ pipeline.process() - ALL PHASES INSIDE HERE     │
                                        │  │                                                 │
                                        │  │ Phase 2: Track Padding (N tracks, sequential)   │
                                        │  │ Phase 3: Mixdown → S3 upload                    │
                                        │  │ Phase 4: Waveform generation                    │
                                        │  │ Phase 5: Transcription (N GPU calls, serial!)   │
                                        │  │ Phase 6: Topic Detection (C LLM calls)          │
                                        │  │ Phase 7a: Title Generation (1 LLM call)         │
                                        │  │ Phase 7b: Summary Generation (2+2M LLM calls)   │
                                        │  │ Phase 8: Finalize status                        │
                                        │  └─────────────────────────────────────────────────┘
                                        │
                                        ▼
                    chain(cleanup → zulip → webhook).delay()  ← Celery chain (3 tasks)
```

### Problem: Monolithic `pipeline.process()`

The heavy lifting happens inside a single Python function call. Celery only sees:
- Task started
- Task succeeded/failed

It cannot see or control the 8 internal phases.

---

## Target Architecture

### Decomposed Conductor Workflow

```
                  ┌─────────────────────┐
                  │ get_recording       │   ← Daily API
                  │ get_participants    │
                  └──────────┬──────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
     ┌──────────┐       ┌──────────┐       ┌──────────┐
     │ pad_tk_0 │       │ pad_tk_1 │       │ pad_tk_N │   ← FORK (parallel)
     └────┬─────┘       └────┬─────┘       └────┬─────┘
          └──────────────────┼──────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ mixdown_tracks      │   ← PyAV → S3
                  └──────────┬──────────┘
                             │
          ┌──────────────────┤
          ▼                  │
  ┌───────────────┐          │
  │ generate_wave │          │   ← waveform runs in parallel with transcription setup
  └───────────────┘          │
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
    ┌────────────┐     ┌────────────┐     ┌────────────┐
    │transcribe_0│     │transcribe_1│     │transcribe_N│   ← FORK (parallel GPU!)
    └─────┬──────┘     └─────┬──────┘     └─────┬──────┘
          └──────────────────┼──────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ merge_transcripts   │
                  └──────────┬──────────┘
                             ▼
                  ┌─────────────────────┐
                  │ detect_topics       │   ← topic detection
                  └──────────┬──────────┘
                             │
                ┌────────────┴────────────┐
                ▼                         ▼
        ┌───────────────┐         ┌───────────────┐
        │generate_title │         │ gen_summary   │   ← FORK (parallel LLM)
        └───────┬───────┘         └───────┬───────┘
                └────────────┬────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ finalize            │
                  └──────────┬──────────┘
                             │
          ┌──────────────────┘
          ▼
    ┌──────────┐       ┌──────────┐       ┌──────────┐
    │ consent  │──────▶│  zulip   │──────▶│ webhook  │   ← sequential chain
    └──────────┘       └──────────┘       └──────────┘
```

### Key Improvements

| Aspect | Current (Celery) | Target (Conductor) |
|--------|------------------|--------------------|
| Transcription parallelism | Serial (N × 30s) | Parallel (max 30s) |
| Failure granularity | Restart all | Retry failed step only |
| Progress visibility | None | Per-step status in UI |
| Timeout control | Entire pipeline | Per-step timeouts |
| Audit trail | Scattered logs | Unified workflow history |

---

## Scope of Work

### Module 1: Conductor Infrastructure Setup

**Files to Create/Modify:**
- `docker-compose.yml` - Add Conductor server container
- `server/reflector/conductor/` - New module for Conductor client
- Environment configuration for Conductor URL

**Tasks:**
- [ ] Add `conductoross/conductor-standalone:3.15.0` to docker-compose
- [ ] Create Conductor client wrapper (Python `conductor-python` SDK; see the sketch below)
- [ ] Configure health checks and service dependencies
- [ ] Document Conductor UI access (port 8127)

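A minimal sketch of what the client wrapper could look like, assuming the `conductor-python` SDK's `Configuration` and `WorkflowExecutor` classes; the module path and server URL are illustrative, not the final layout:

```python
# server/reflector/conductor/client.py (illustrative sketch)
from conductor.client.configuration.configuration import Configuration
from conductor.client.workflow.executor.workflow_executor import WorkflowExecutor

# Assumption: the Conductor API is reachable at this URL via docker-compose.
CONDUCTOR_SERVER_URL = "http://conductor:8080/api"


def get_conductor_config() -> Configuration:
    """Build the SDK configuration from environment/settings."""
    return Configuration(server_api_url=CONDUCTOR_SERVER_URL, debug=False)


def get_workflow_executor() -> WorkflowExecutor:
    """Entry point used by trigger code to start and inspect workflows."""
    return WorkflowExecutor(configuration=get_conductor_config())
```
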
### Module 2: Task Decomposition - Worker Definitions

**Files to Create:**
- `server/reflector/conductor/workers/` directory with:
  - `get_recording.py` - Daily API recording fetch
  - `get_participants.py` - Daily API participant fetch
  - `pad_track.py` - Single track padding (PyAV)
  - `mixdown_tracks.py` - Multi-track mixdown
  - `generate_waveform.py` - Waveform generation
  - `transcribe_track.py` - Single track GPU transcription
  - `merge_transcripts.py` - Combine transcriptions
  - `detect_topics.py` - LLM topic detection
  - `generate_title.py` - LLM title generation
  - `generate_summary.py` - LLM summary generation
  - `finalize.py` - Status update and cleanup
  - `cleanup_consent.py` - Consent check
  - `post_zulip.py` - Zulip notification
  - `send_webhook.py` - External webhook
  - `generate_dynamic_fork_tasks.py` - Helper for FORK_JOIN_DYNAMIC task generation

**Reference Files (Current Implementation):**
- `server/reflector/pipelines/main_multitrack_pipeline.py`
- `server/reflector/worker/process.py`
- `server/reflector/worker/webhook.py`

**Key Considerations:**
- Each worker receives input from the previous step via Conductor
- Workers must be idempotent (same input → same output)
- State serialization between steps (JSON-compatible types); see the worker sketch below

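A sketch of one worker, assuming the `conductor-python` `@worker_task` decorator; `transcribe_with_gpu` is a hypothetical stand-in for the existing Modal.com transcription call:

```python
# server/reflector/conductor/workers/transcribe_track.py (illustrative sketch)
from conductor.client.worker.worker_task import worker_task

from reflector.transcription import transcribe_with_gpu  # hypothetical helper


@worker_task(task_definition_name="transcribe_track")
def transcribe_track(track_index: int, audio_url: str) -> dict:
    """Transcribe one padded track; inputs and outputs stay JSON-serializable."""
    words = transcribe_with_gpu(audio_url)  # POST to the GPU transcription service
    return {"track_index": track_index, "words": words}
```

With this pattern, the SDK's `TaskHandler` can poll for all annotated workers from a single worker process.
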
### Module 3: Workflow Definition

**Files to Create:**
- `server/reflector/conductor/workflows/diarization_pipeline.json`
- `server/reflector/conductor/workflows/register.py` - Registration script

**Workflow Structure:**
```json
{
  "name": "daily_diarization_pipeline",
  "version": 1,
  "tasks": [
    {"name": "get_recording", "type": "SIMPLE"},
    {"name": "get_participants", "type": "SIMPLE"},
    {
      "name": "fork_padding",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "track_keys"
    },
    {"name": "mixdown_tracks", "type": "SIMPLE"},
    {"name": "generate_waveform", "type": "SIMPLE"},
    {
      "name": "fork_transcription",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "padded_urls"
    },
    {"name": "merge_transcripts", "type": "SIMPLE"},
    {"name": "detect_topics", "type": "SIMPLE"},
    {
      "name": "fork_generation",
      "type": "FORK_JOIN",
      "forkTasks": [["generate_title"], ["generate_summary"]]
    },
    {"name": "finalize", "type": "SIMPLE"},
    {"name": "cleanup_consent", "type": "SIMPLE"},
    {"name": "post_zulip", "type": "SIMPLE"},
    {"name": "send_webhook", "type": "SIMPLE"}
  ]
}
```

**Key Considerations:**
- Dynamic FORK for a variable number of tracks (N); the fork branches are produced by `generate_dynamic_fork_tasks` (see the sketch below)
- Timeout configuration per task type
- Retry policies with exponential backoff

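A sketch of the dynamic-fork helper: Conductor's `FORK_JOIN_DYNAMIC` expects one output listing the forked task definitions and another mapping reference names to their inputs, wired in via `dynamicForkTasksParam` / `dynamicForkTasksInputParamName`. The field names below are illustrative:

```python
# server/reflector/conductor/workers/generate_dynamic_fork_tasks.py (illustrative sketch)
from conductor.client.worker.worker_task import worker_task


@worker_task(task_definition_name="generate_dynamic_fork_tasks")
def generate_dynamic_fork_tasks(track_keys: list[str]) -> dict:
    """Build one transcribe_track (or pad_track) branch per recorded track."""
    dynamic_tasks = [
        {
            "name": "transcribe_track",
            "taskReferenceName": f"transcribe_track_{i}",
            "type": "SIMPLE",
        }
        for i, _ in enumerate(track_keys)
    ]
    dynamic_tasks_input = {
        f"transcribe_track_{i}": {"track_index": i, "audio_url": key}
        for i, key in enumerate(track_keys)
    }
    # The FORK_JOIN_DYNAMIC task references these two outputs by name.
    return {"dynamicTasks": dynamic_tasks, "dynamicTasksInput": dynamic_tasks_input}
```
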
### Module 4: Pipeline Trigger Migration

**Files to Modify:**
- `server/reflector/worker/process.py`

**Changes:**
- Replace `task_pipeline_multitrack_process.delay()` with a Conductor workflow start (see the sketch below)
- Store the workflow ID on Recording for status tracking
- Handle Conductor API errors
- Keep `process_multitrack_recording` as-is (creates DB entities before the workflow)

**Note:** Both webhook AND polling entry points converge at `process_multitrack_recording`,
which then calls `task_pipeline_multitrack_process.delay()`. By modifying this single call site,
we capture both entry paths without duplicating integration logic.

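A sketch of the new trigger call, assuming the client wrapper from Module 1 and the SDK's `StartWorkflowRequest` model; the function name and where the workflow ID is persisted are illustrative:

```python
# server/reflector/worker/process.py (illustrative sketch of the new trigger)
from conductor.client.http.models.start_workflow_request import StartWorkflowRequest

from reflector.conductor.client import get_workflow_executor  # from the Module 1 sketch


def start_diarization_workflow(recording_id: str, transcript_id: str) -> str:
    """Start the Conductor pipeline and return its workflow ID."""
    executor = get_workflow_executor()
    request = StartWorkflowRequest(
        name="daily_diarization_pipeline",
        version=1,
        input={"recording_id": recording_id, "transcript_id": transcript_id},
    )
    workflow_id = executor.start_workflow(start_workflow_request=request)
    # Persist workflow_id on the Recording so its status can be traced later.
    return workflow_id
```
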
### Module 5: Task Definition Registration

**Files to Create:**
- `server/reflector/conductor/tasks/definitions.py`

**Task Definitions with Timeouts** (a registration sketch follows the table):

| Task | Timeout | Response Timeout | Retry Count |
|------|---------|------------------|-------------|
| get_recording | 60s | 30s | 3 |
| get_participants | 60s | 30s | 3 |
| pad_track | 300s | 120s | 3 |
| mixdown_tracks | 600s | 300s | 3 |
| generate_waveform | 120s | 60s | 3 |
| transcribe_track | 1800s | 900s | 3 |
| merge_transcripts | 60s | 30s | 3 |
| detect_topics | 300s | 120s | 3 |
| generate_title | 60s | 30s | 3 |
| generate_summary | 300s | 120s | 3 |
| finalize | 60s | 30s | 3 |
| cleanup_consent | 60s | 30s | 3 |
| post_zulip | 60s | 30s | 5 |
| send_webhook | 60s | 30s | 30 |
| generate_dynamic_fork_tasks | 30s | 15s | 3 |

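A sketch of how the table could be registered, assuming the SDK's `TaskDef` model and metadata client (only two rows shown; the exact client class and method names should be verified against the SDK version in use):

```python
# server/reflector/conductor/tasks/definitions.py (illustrative sketch)
from conductor.client.configuration.configuration import Configuration
from conductor.client.http.models.task_def import TaskDef
from conductor.client.orkes.orkes_metadata_client import OrkesMetadataClient

TASK_DEFS = [
    TaskDef(
        name="transcribe_track",
        timeout_seconds=1800,
        response_timeout_seconds=900,
        retry_count=3,
        retry_logic="EXPONENTIAL_BACKOFF",
    ),
    TaskDef(
        name="send_webhook",
        timeout_seconds=60,
        response_timeout_seconds=30,
        retry_count=30,
        retry_logic="EXPONENTIAL_BACKOFF",
    ),
]


def register(config: Configuration) -> None:
    """Register (or update) every task definition from the table above."""
    client = OrkesMetadataClient(config)
    for task_def in TASK_DEFS:
        client.register_task_def(task_def)
```
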
### Module 6: Frontend Integration

**WebSocket Events (Already Defined):**

Events continue to be broadcast as they are today. No change to event structure.

| Event | Triggered By Task | Payload |
|-------|-------------------|---------|
| STATUS | finalize | `{value: "processing"\|"ended"\|"error"}` |
| DURATION | mixdown_tracks | `{duration: float}` |
| WAVEFORM | generate_waveform | `{waveform: float[]}` |
| TRANSCRIPT | merge_transcripts | `{text: string, translation: string\|null}` |
| TOPIC | detect_topics | `{id, title, summary, timestamp, duration}` |
| FINAL_TITLE | generate_title | `{title: string}` |
| FINAL_LONG_SUMMARY | generate_summary | `{long_summary: string}` |
| FINAL_SHORT_SUMMARY | generate_summary | `{short_summary: string}` |

**New: Progress Events**

Add a new event type for granular progress:

```python
# PipelineProgressEvent (schema sketch)
{
    "event": "PIPELINE_PROGRESS",
    "data": {
        "workflow_id": str,
        "current_step": str,
        "step_index": int,
        "total_steps": int,
        "step_status": "pending" | "in_progress" | "completed" | "failed",
    },
}
```

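A sketch of deriving this event from Conductor's workflow execution state; the task list shape (`referenceTaskName`, `status`) follows Conductor's execution JSON, and broadcasting is left to the existing WebSocket layer:

```python
# Illustrative: map Conductor task statuses onto PIPELINE_PROGRESS events.
STATUS_MAP = {
    "SCHEDULED": "pending",
    "IN_PROGRESS": "in_progress",
    "COMPLETED": "completed",
    "FAILED": "failed",
}


def build_progress_event(workflow_id: str, tasks: list[dict]) -> dict:
    """tasks: non-empty [{"referenceTaskName": str, "status": str}, ...] from Conductor."""
    # First task that has not completed yet; fall back to the last one.
    current = next((t for t in tasks if t["status"] != "COMPLETED"), tasks[-1])
    return {
        "event": "PIPELINE_PROGRESS",
        "data": {
            "workflow_id": workflow_id,
            "current_step": current["referenceTaskName"],
            "step_index": tasks.index(current),
            "total_steps": len(tasks),
            "step_status": STATUS_MAP.get(current["status"], "pending"),
        },
    }
```
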
### Module 7: State Management & Checkpointing

**Current State Storage:**
- `transcript.status` - High-level status
- `transcript.events[]` - Append-only event log
- `transcript.topics[]` - Topic results
- `transcript.title`, `transcript.long_summary`, etc.

**Conductor State Storage:**
- Workflow execution state in Conductor database
- Per-task input/output in Conductor

**Checkpointing Strategy** (sketched below):
1. Each task reads required state from the DB (not from previous task output, for large data)
2. Each task writes results to the DB before returning
3. Task output contains references (IDs, URLs), not large payloads
4. On retry, a task can check the DB for existing results (idempotency)

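A sketch of this pattern in a single worker; `transcripts_db` and `run_topic_detection` are hypothetical stand-ins for the existing persistence and LLM code:

```python
# Illustrative checkpointed worker: write results to the DB, return only references.
from conductor.client.worker.worker_task import worker_task

from reflector.db import transcripts_db          # hypothetical persistence helper
from reflector.llm import run_topic_detection    # hypothetical LLM helper


@worker_task(task_definition_name="detect_topics")
def detect_topics(transcript_id: str) -> dict:
    # Idempotency: a retry after a partial failure reuses stored results.
    existing = transcripts_db.get_topics(transcript_id)
    if existing:
        return {"transcript_id": transcript_id, "topic_ids": [t.id for t in existing]}

    words = transcripts_db.get_words(transcript_id)     # 1. read state from DB
    topics = run_topic_detection(words)                 # 2. external LLM calls
    transcripts_db.save_topics(transcript_id, topics)   # 3. checkpoint before returning
    return {"transcript_id": transcript_id, "topic_ids": [t.id for t in topics]}
```
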
---

## Data Flow Between Tasks

### Input/Output Contracts

```
get_recording
  Input:  { recording_id: string }
  Output: { id, mtg_session_id, room_name, duration }

get_participants
  Input:  { mtg_session_id: string }
  Output: { participants: [{participant_id, user_name}] }

pad_track
  Input:  { track_index: number, s3_key: string }
  Output: { padded_url: string, size: number }

mixdown_tracks
  Input:  { padded_urls: string[] }
  Output: { audio_key: string, duration: number }

generate_waveform
  Input:  { audio_key: string }
  Output: { waveform: number[] }

transcribe_track
  Input:  { track_index: number, audio_url: string }
  Output: { words: Word[] }

merge_transcripts
  Input:  { transcripts: Word[][] }
  Output: { all_words: Word[], word_count: number }

detect_topics
  Input:  { words: Word[] }
  Output: { topics: Topic[] }

generate_title
  Input:  { topics: Topic[] }
  Output: { title: string }

generate_summary
  Input:  { words: Word[], topics: Topic[] }
  Output: { summary: string, short_summary: string }

finalize
  Input:  { recording_id, title, summary, duration }
  Output: { status: "COMPLETED" }
```

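These contracts could be pinned down as typed structures shared by the workers; a sketch using `TypedDict`, where the `Word` fields are assumptions for illustration rather than the existing model:

```python
from typing import TypedDict


class Word(TypedDict):
    text: str
    start: float
    end: float
    speaker: int


class TranscribeTrackInput(TypedDict):
    track_index: int
    audio_url: str


class TranscribeTrackOutput(TypedDict):
    words: list[Word]
```
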
---

## External API Calls Summary

### Per-Step External Dependencies

| Task | External Service | Calls | Notes |
|------|------------------|-------|-------|
| get_recording | Daily.co API | 1 | GET /recordings/{id} |
| get_participants | Daily.co API | 1 | GET /meetings/{id}/participants |
| pad_track | S3 | 2 | presign read + PUT padded |
| mixdown_tracks | S3 | 1 | PUT audio.mp3 |
| transcribe_track | Modal.com GPU | 1 | POST /transcriptions |
| detect_topics | LLM (OpenAI) | C | C = ceil(words/300) |
| generate_title | LLM (OpenAI) | 1 | - |
| generate_summary | LLM (OpenAI) | 2+2M | M = subjects (max 6) |
| post_zulip | Zulip API | 1 | POST or PATCH |
| send_webhook | External | 1 | Customer webhook URL |

### Cost Attribution Enabled

With decomposed tasks, costs can be attributed per step (see the aggregation sketch below):
- **GPU costs**: Sum of `transcribe_track` durations
- **LLM costs**: Sum of `detect_topics` + `generate_title` + `generate_summary` token usage
- **S3 costs**: Bytes uploaded by `pad_track` + `mixdown_tracks`

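A sketch of GPU-cost attribution from a workflow execution record; `taskDefName`, `startTime`, and `endTime` (milliseconds) follow Conductor's task execution JSON, and the price constant is a placeholder:

```python
# Illustrative: sum transcribe_track execution time for one workflow execution.
GPU_SECOND_USD = 0.0011  # placeholder rate, not a real price


def gpu_cost_usd(tasks: list[dict]) -> float:
    """tasks: workflow execution tasks with taskDefName/startTime/endTime in ms."""
    seconds = sum(
        (t["endTime"] - t["startTime"]) / 1000.0
        for t in tasks
        if t.get("taskDefName") == "transcribe_track" and t.get("endTime")
    )
    return seconds * GPU_SECOND_USD
```
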
---

## Idempotency Requirements

### By Task

| Task | Idempotent? | Strategy |
|------|-------------|----------|
| get_recording | ✅ | Read-only API call |
| get_participants | ✅ | Read-only API call |
| pad_track | ⚠️ | Overwrite same S3 key |
| mixdown_tracks | ⚠️ | Overwrite same S3 key |
| generate_waveform | ✅ | Deterministic from audio |
| transcribe_track | ❌ | Cache by hash(audio_url) |
| detect_topics | ❌ | Cache by hash(words) |
| generate_title | ❌ | Cache by hash(topic_titles) |
| generate_summary | ❌ | Cache by hash(words+topics) |
| finalize | ✅ | Upsert status |
| cleanup_consent | ✅ | Idempotent deletes |
| post_zulip | ⚠️ | Use message_id for updates |
| send_webhook | ⚠️ | Receiver's responsibility |

### Caching Strategy for LLM/GPU Calls

```python
from typing import Optional

# "Output" stands for the task-specific result type (words, topics, summary, ...).
class TaskCache:
    async def get(self, input_hash: str) -> Optional["Output"]: ...
    async def set(self, input_hash: str, output: "Output") -> None: ...

# Before calling the external service:
cached = await cache.get(input_hash)
if cached:
    return cached

result = await external_service.call(task_input)
await cache.set(input_hash, result)
return result
```

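Python's built-in `hash()` is not stable across processes or runs, so the cache key would need a content hash of the task input; a minimal sketch:

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Stable cache key for JSON-serializable task inputs."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example: key for a transcription call
key = content_hash({"task": "transcribe_track", "audio_url": "s3://bucket/track0.webm"})
```
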
---

## Migration Strategy

### Phase 1: Infrastructure (No Behavior Change)
- Add Conductor container to docker-compose
- Create Conductor client library
- Verify Conductor UI is accessible

### Phase 2: Parallel Implementation
- Implement all worker tasks
- Register workflow definition
- Test with synthetic recordings

### Phase 3: Shadow Mode
- Trigger both Celery and Conductor pipelines
- Compare results for consistency
- Monitor Conductor execution in UI

### Phase 4: Cutover
- Disable Celery pipeline trigger
- Enable Conductor-only execution
- Monitor error rates and performance

### Phase 5: Cleanup
- Remove Celery task definitions
- Remove old pipeline code
- Update documentation

---

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Conductor server downtime | Health checks, failover to Celery (Phase 3) |
| Worker serialization issues | Extensive testing with real data |
| Performance regression | Benchmark parallel vs serial transcription |
| Data loss on migration | Shadow mode comparison (Phase 3) |
| Learning curve for team | Documentation, Conductor UI training |

---

## Success Metrics

| Metric | Current | Target |
|--------|---------|--------|
| Pipeline visibility | 3 states | 14+ steps visible |
| Transcription latency (N tracks) | N × 30s | ~30s (parallel) |
| Retry granularity | Entire pipeline | Single step |
| Cost attribution | None | Per-step breakdown |
| Debug time for failures | ~30 min | ~5 min (UI trace) |

---

## Appendix: Conductor Mock Implementation

A working Python mock demonstrating the target workflow structure is available at:
`docs/conductor-pipeline-mock/`

To run:
```bash
cd docs/conductor-pipeline-mock
docker compose up --build
./test_workflow.sh
```

UI: http://localhost:8127

This mock validates:
- Workflow definition structure
- FORK_JOIN parallelism
- Worker task patterns
- Conductor SDK usage

---

## References

- Diarization Pipeline Diagram: `DIARIZATION_PIPELINE_DIAGRAM.md`
- Current Celery Implementation: `server/reflector/pipelines/main_multitrack_pipeline.py`
- Conductor OSS Documentation: https://conductor-oss.github.io/conductor/
- Conductor Python SDK: https://github.com/conductor-sdk/conductor-python