Conductor Migration Requirements: Daily.co Multitrack Pipeline
Executive Summary
Migrate the Daily.co multitrack diarization pipeline from a monolithic Celery task to a decomposed Conductor workflow, enabling visual progress tracking, granular retries, and operational observability.
Business Value
1. Visibility: Where Are We Now? (UX, DevEx)
Current State: Users see only three states: idle → processing → ended/error. A 10-minute pipeline appears frozen with no feedback.
Target State: Real-time visibility into which step is executing:
- "Transcribing track 2 of 3"
- "Generating summary (step 8 of 9)"
- Visual DAG in admin UI showing completed/in-progress/pending nodes
Business Impact:
- Reduced support tickets ("is it stuck?")
- Engineers can instantly identify bottlenecks
- Users have confidence the system is working
2. Progress Tracking: What's Left? (UX, DevEx)
Current State: No indication of remaining work. A failure at step 8 gives the same error as a failure at step 1.
Target State:
- Progress percentage based on completed steps
- Clear step enumeration (e.g., "Step 5/9: Transcription")
- Frontend receives structured progress events with step metadata
Business Impact:
- Users can estimate completion time
- Frontend can render meaningful progress bars
- Error messages include context ("Failed during summary generation")
3. Audit Trail & Profiling (DevEx, Ops)
Current State: Logs scattered across Celery workers. No unified view of a single recording's journey. Resource consumption unknown per step.
Target State:
- Single workflow ID traces entire recording lifecycle
- Per-step execution times recorded
- Resource consumption (GPU seconds, LLM tokens) attributable to specific steps
- Conductor UI provides complete audit history
Business Impact:
- Debugging: "Recording X failed at step Y after Z seconds"
- Cost attribution: "Transcription costs $X, summarization costs $Y"
- Performance optimization: identify slowest steps
4. Clear Event Dictionary (DevEx)
Current State: Frontend receives WebSocket events (TRANSCRIPT, TOPIC, FINAL_TITLE, etc.), but the mapping to pipeline phases is implicit. Adding new events requires tracing through Python code.
Target State:
- Each Conductor task explicitly defines its output events
- Event schema documented alongside task definition
- Frontend developers can reference task→event mapping directly
Business Impact:
- Faster frontend development
- Reduced miscommunication between backend/frontend teams
- Self-documenting pipeline
5. Restart Without Reprocessing (UX, DevEx)
Current State: Any failure restarts the entire pipeline. A timeout during summary generation re-runs transcription from scratch, wasting GPU spend.
Target State:
- Failures resume from last successful step
- Completed work is checkpointed (e.g., transcription results stored before summary)
- Manual retry triggers only failed step, not entire workflow
Business Impact:
- Reduced GPU/LLM costs on retries
- Faster recovery from transient failures
- Users don't wait for re-transcription on summary failures
6. Per-Step Timeouts (UX, DevEx)
Current State: Single task timeout for entire pipeline. A hung GPU call blocks everything. Killing the task loses all progress.
Target State:
- Each step has independent timeout (e.g., transcription: 5min, LLM: 30s)
- Timeout kills only the hung step
- Pipeline can retry just that step or fail gracefully
Business Impact:
- Faster detection of stuck external services
- Reduced blast radius from hung calls
- More granular SLAs per operation type
7. Native Retries with Backoff (DevEx, UX)
Current State: Celery retry logic is per-task, not per-external-call. Custom retry wrappers needed for each API call.
Target State:
- Conductor provides native retry policies per task
- Exponential backoff configured declaratively
- Retry state visible in UI (attempt 2/5)
Business Impact:
- Reduced boilerplate code
- Consistent retry behavior across all external calls
- Visibility into retry attempts for debugging
Current Architecture
Daily.co Multitrack Pipeline Flow
```
Daily webhook (recording.ready-to-download)      Polling (every 3 min)
              │                                         │
              ▼                                         ▼
  _handle_recording_ready()                  poll_daily_recordings()
              │                                         │
              └────────────────────┬────────────────────┘
                                   ▼
              process_multitrack_recording.delay()      ← Celery task #1
                                   │
                                   ├── Daily API: GET /recordings/{id}
                                   ├── Daily API: GET /meetings/{mtgSessionId}/participants
                                   ├── DB: Create recording + transcript
                                   │
                                   ▼
              task_pipeline_multitrack_process.delay()  ← Celery task #2 (MONOLITH)
                                   │
                                   │  ┌─────────────────────────────────────────────────┐
                                   │  │ pipeline.process() - ALL PHASES INSIDE HERE     │
                                   │  │                                                 │
                                   │  │ Phase 2: Track Padding (N tracks, sequential)   │
                                   │  │ Phase 3: Mixdown → S3 upload                    │
                                   │  │ Phase 4: Waveform generation                    │
                                   │  │ Phase 5: Transcription (N GPU calls, serial!)   │
                                   │  │ Phase 6: Topic Detection (C LLM calls)          │
                                   │  │ Phase 7a: Title Generation (1 LLM call)         │
                                   │  │ Phase 7b: Summary Generation (2+2M LLM calls)   │
                                   │  └─────────────────────────────────────────────────┘
                                   │
                                   ▼
              chain(cleanup → zulip → webhook).delay()  ← Celery chain (3 tasks)
```
Problem: Monolithic pipeline.process()
The heavy lifting happens inside a single Python function call. Celery only sees:
- Task started
- Task succeeded/failed
It cannot see or control the 8 internal phases.
Target Architecture
Decomposed Conductor Workflow
```
                  ┌─────────────────────┐
                  │ get_recording       │ ← Daily API
                  │ get_participants    │
                  └──────────┬──────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
    ┌──────────┐       ┌──────────┐       ┌──────────┐
    │ pad_tk_0 │       │ pad_tk_1 │       │ pad_tk_N │      ← FORK (parallel)
    └────┬─────┘       └────┬─────┘       └────┬─────┘
         └──────────────────┼──────────────────┘
                            ▼
                  ┌─────────────────────┐
                  │   mixdown_tracks    │ ← PyAV → S3
                  └──────────┬──────────┘
                             │
                  ┌──────────┴──────────┐
                  ▼                     ▼
          ┌───────────────┐     ┌───────────────┐
          │ generate_wave │     │  (continue)   │      ← waveform parallel with transcription setup
          └───────────────┘     └───────┬───────┘
                                        │
          ┌──────────────────┬──────────┴─────────┐
          ▼                  ▼                    ▼
    ┌────────────┐     ┌────────────┐      ┌────────────┐
    │transcribe_0│     │transcribe_1│      │transcribe_N│   ← FORK (parallel GPU!)
    └─────┬──────┘     └─────┬──────┘      └─────┬──────┘
          └──────────────────┼───────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │  merge_transcripts  │
                  └──────────┬──────────┘
                             │
                             ▼
                  ┌─────────────────────┐
                  │    detect_topics    │ ← topic detection
                  └──────────┬──────────┘
                             │
             ┌───────────────┴───────────────┐
             ▼                               ▼
     ┌───────────────┐               ┌───────────────┐
     │generate_title │               │  gen_summary  │     ← FORK (parallel LLM)
     └───────┬───────┘               └───────┬───────┘
             └───────────────┬───────────────┘
                             ▼
                  ┌─────────────────────┐
                  │      finalize       │
                  └──────────┬──────────┘
                             │
                             ▼
    ┌──────────┐       ┌──────────┐       ┌──────────┐
    │ consent  │──────▶│  zulip   │──────▶│ webhook  │     ← sequential chain
    └──────────┘       └──────────┘       └──────────┘
```
Key Improvements
| Aspect | Current (Celery) | Target (Conductor) |
|---|---|---|
| Transcription parallelism | Serial (N × 30s) | Parallel (max 30s) |
| Failure granularity | Restart all | Retry failed step only |
| Progress visibility | None | Per-step status in UI |
| Timeout control | Entire pipeline | Per-step timeouts |
| Audit trail | Scattered logs | Unified workflow history |
Scope of Work
Module 1: Conductor Infrastructure Setup
Files to Create/Modify:
- `docker-compose.yml` - Add Conductor server container
- `server/reflector/conductor/` - New module for Conductor client
- Environment configuration for Conductor URL
Tasks:
- Add `conductoross/conductor-standalone:3.15.0` to docker-compose
- Create Conductor client wrapper (Python `conductor-python` SDK)
- Configure health checks and service dependencies
- Document Conductor UI access (port 8127)
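A minimal compose sketch under stated assumptions: the standalone image serves API and UI on container port 8080 and exposes a `/health` endpoint (verify both against the Conductor OSS docs); the worker service name is illustrative:

```yaml
services:
  conductor:
    image: conductoross/conductor-standalone:3.15.0
    ports:
      - "8127:8080"  # assumption: host 8127 -> container API/UI port
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 12

  reflector-worker:  # illustrative name for the existing server container
    environment:
      CONDUCTOR_SERVER_URL: http://conductor:8080/api
    depends_on:
      conductor:
        condition: service_healthy
```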
Module 2: Task Decomposition - Worker Definitions
Files to Create:
`server/reflector/conductor/workers/` directory with:
- `get_recording.py` - Daily API recording fetch
- `get_participants.py` - Daily API participant fetch
- `pad_track.py` - Single track padding (PyAV)
- `mixdown_tracks.py` - Multi-track mixdown
- `generate_waveform.py` - Waveform generation
- `transcribe_track.py` - Single track GPU transcription
- `merge_transcripts.py` - Combine transcriptions
- `detect_topics.py` - LLM topic detection
- `generate_title.py` - LLM title generation
- `generate_summary.py` - LLM summary generation
- `finalize.py` - Status update and cleanup
- `cleanup_consent.py` - Consent check
- `post_zulip.py` - Zulip notification
- `send_webhook.py` - External webhook
- `generate_dynamic_fork_tasks.py` - Helper for FORK_JOIN_DYNAMIC task generation
Reference Files (Current Implementation):
- `server/reflector/pipelines/main_multitrack_pipeline.py`
- `server/reflector/worker/process.py`
- `server/reflector/worker/webhook.py`
Key Considerations:
- Each worker receives input from previous step via Conductor
- Workers must be idempotent (same input → same output)
- State serialization between steps (JSON-compatible types)
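A sketch of one worker under these constraints, using the `conductor-python` SDK's `@worker_task` decorator (the `daily_api` helper is hypothetical; confirm decorator and handler names against the SDK version you pin):

```python
from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration
from conductor.client.worker.worker_task import worker_task

@worker_task(task_definition_name="get_recording")
def get_recording(recording_id: str) -> dict:
    # Read-only Daily API call; idempotent by nature (see Module 7)
    rec = daily_api.get_recording(recording_id)  # hypothetical helper
    # JSON-compatible output only: small references, not large payloads
    return {
        "id": rec["id"],
        "mtg_session_id": rec["mtgSessionId"],
        "room_name": rec["room_name"],
        "duration": rec["duration"],
    }

if __name__ == "__main__":
    config = Configuration(server_api_url="http://localhost:8127/api")
    # Polls Conductor for scheduled tasks and dispatches to decorated workers
    TaskHandler(configuration=config).start_processes()
```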
Module 3: Workflow Definition
Files to Create:
- `server/reflector/conductor/workflows/diarization_pipeline.json`
- `server/reflector/conductor/workflows/register.py` - Registration script
Workflow Structure:
```json
{
  "name": "daily_diarization_pipeline",
  "version": 1,
  "tasks": [
    {"name": "get_recording", "type": "SIMPLE"},
    {"name": "get_participants", "type": "SIMPLE"},
    {
      "name": "fork_padding",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "track_keys"
    },
    {"name": "mixdown_tracks", "type": "SIMPLE"},
    {"name": "generate_waveform", "type": "SIMPLE"},
    {
      "name": "fork_transcription",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "padded_urls"
    },
    {"name": "merge_transcripts", "type": "SIMPLE"},
    {"name": "detect_topics", "type": "SIMPLE"},
    {
      "name": "fork_generation",
      "type": "FORK_JOIN",
      "forkTasks": [["generate_title"], ["generate_summary"]]
    },
    {"name": "finalize", "type": "SIMPLE"},
    {"name": "cleanup_consent", "type": "SIMPLE"},
    {"name": "post_zulip", "type": "SIMPLE"},
    {"name": "send_webhook", "type": "SIMPLE"}
  ]
}
```
Key Considerations:
- Dynamic FORK for variable number of tracks (N)
- Timeout configuration per task type
- Retry policies with exponential backoff
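The dynamic forks hinge on a generator task that runs just before each FORK_JOIN_DYNAMIC. A sketch of `generate_dynamic_fork_tasks`, assuming the fork also declares `dynamicForkTasksInputParamName` (Conductor requires both parameters; output key names below are illustrative):

```python
from conductor.client.worker.worker_task import worker_task

@worker_task(task_definition_name="generate_dynamic_fork_tasks")
def generate_dynamic_fork_tasks(track_keys: list[str]) -> dict:
    # One pad_track task per audio track, fanned out by FORK_JOIN_DYNAMIC
    tasks = [
        {"name": "pad_track", "taskReferenceName": f"pad_track_{i}", "type": "SIMPLE"}
        for i in range(len(track_keys))
    ]
    inputs = {
        f"pad_track_{i}": {"track_index": i, "s3_key": key}
        for i, key in enumerate(track_keys)
    }
    # The fork reads these via dynamicForkTasksParam /
    # dynamicForkTasksInputParamName
    return {"forked_tasks": tasks, "forked_tasks_inputs": inputs}
```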
Module 4: Pipeline Trigger Migration
Files to Modify:
- `server/reflector/worker/process.py`
Changes:
- Replace `task_pipeline_multitrack_process.delay()` with a Conductor workflow start
- Store workflow ID on Recording for status tracking
- Handle Conductor API errors
- Keep `process_multitrack_recording` as-is (creates DB entities before workflow)
Note: Both the webhook AND polling entry points converge at `process_multitrack_recording`, which then calls `task_pipeline_multitrack_process.delay()`. By modifying this single call site, we capture both entry paths without duplicating integration logic.
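A sketch of the swapped call site, assuming Conductor's REST route `POST /api/workflow/{name}`, which returns the new workflow ID as plain text; the controller call persisting the ID is illustrative:

```python
import httpx

CONDUCTOR_URL = "http://conductor:8080/api"  # from environment in practice

async def start_diarization_workflow(recording_id: str) -> str:
    async with httpx.AsyncClient(base_url=CONDUCTOR_URL) as client:
        resp = await client.post(
            "/workflow/daily_diarization_pipeline",
            json={"recording_id": recording_id},
        )
        resp.raise_for_status()
    workflow_id = resp.text  # Conductor returns the ID as plain text
    # Persist for status tracking (illustrative controller/column name)
    await recordings_controller.update(recording_id, workflow_id=workflow_id)
    return workflow_id
```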
Module 5: Task Definition Registration
Files to Create:
- `server/reflector/conductor/tasks/definitions.py`
Task Definitions with Timeouts:
| Task | Timeout | Response Timeout | Retry Count |
|---|---|---|---|
| get_recording | 60s | 30s | 3 |
| get_participants | 60s | 30s | 3 |
| pad_track | 300s | 120s | 3 |
| mixdown_tracks | 600s | 300s | 3 |
| generate_waveform | 120s | 60s | 3 |
| transcribe_track | 1800s | 900s | 3 |
| merge_transcripts | 60s | 30s | 3 |
| detect_topics | 300s | 120s | 3 |
| generate_title | 60s | 30s | 3 |
| generate_summary | 300s | 120s | 3 |
| finalize | 60s | 30s | 3 |
| cleanup_consent | 60s | 30s | 3 |
| post_zulip | 60s | 30s | 5 |
| send_webhook | 60s | 30s | 30 |
| generate_dynamic_fork_tasks | 30s | 15s | 3 |
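A sketch of `definitions.py` registering these values via `POST /api/metadata/taskdefs`, with declarative exponential backoff (field names follow Conductor's task-definition schema; `retryDelaySeconds` is an assumed starting value):

```python
import httpx

TASK_DEFS = [
    # (name, timeoutSeconds, responseTimeoutSeconds, retryCount)
    ("get_recording", 60, 30, 3),
    ("transcribe_track", 1800, 900, 3),
    ("send_webhook", 60, 30, 30),
    # ... remaining tasks from the table above
]

def register_task_definitions(base_url: str) -> None:
    payload = [
        {
            "name": name,
            "timeoutSeconds": timeout,
            "responseTimeoutSeconds": response_timeout,
            "retryCount": retries,
            "retryLogic": "EXPONENTIAL_BACKOFF",
            "retryDelaySeconds": 5,
            "timeoutPolicy": "RETRY",
        }
        for name, timeout, response_timeout, retries in TASK_DEFS
    ]
    httpx.post(f"{base_url}/metadata/taskdefs", json=payload).raise_for_status()
```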
Module 6: Frontend Integration
WebSocket Events (Already Defined):
Events continue to be broadcast exactly as they are today; the event structure does not change.
| Event | Triggered By Task | Payload |
|---|---|---|
| STATUS | finalize | {value: "processing"|"ended"|"error"} |
| DURATION | mixdown_tracks | {duration: float} |
| WAVEFORM | generate_waveform | {waveform: float[]} |
| TRANSCRIPT | merge_transcripts | {text: string, translation: string|null} |
| TOPIC | detect_topics | {id, title, summary, timestamp, duration} |
| FINAL_TITLE | generate_title | {title: string} |
| FINAL_LONG_SUMMARY | generate_summary | {long_summary: string} |
| FINAL_SHORT_SUMMARY | generate_summary | {short_summary: string} |
New: Progress Events
Add a new event type for granular progress:

```python
# PipelineProgressEvent
{
    "event": "PIPELINE_PROGRESS",
    "data": {
        "workflow_id": str,
        "current_step": str,
        "step_index": int,
        "total_steps": int,
        "step_status": "pending" | "in_progress" | "completed" | "failed"
    }
}
```
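One way to emit these, sketched by polling `GET /api/workflow/{id}?includeTasks=true` and mapping Conductor task statuses onto the schema above (`broadcast_event` stands in for the existing WebSocket broadcast helper; `CONDUCTOR_URL` as in Module 4):

```python
import httpx

STATUS_MAP = {
    "SCHEDULED": "pending",
    "IN_PROGRESS": "in_progress",
    "COMPLETED": "completed",
    "FAILED": "failed",
}

async def broadcast_progress(workflow_id: str) -> None:
    async with httpx.AsyncClient(base_url=CONDUCTOR_URL) as client:
        resp = await client.get(
            f"/workflow/{workflow_id}", params={"includeTasks": "true"}
        )
        execution = resp.json()
    tasks = execution.get("tasks", [])
    for index, task in enumerate(tasks):
        if task["status"] != "COMPLETED":
            # The first non-completed task is the current step
            await broadcast_event("PIPELINE_PROGRESS", {
                "workflow_id": workflow_id,
                "current_step": task["referenceTaskName"],
                "step_index": index + 1,
                "total_steps": len(tasks),
                "step_status": STATUS_MAP.get(task["status"], "in_progress"),
            })
            break
```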
Module 7: State Management & Checkpointing
Current State Storage:
- `transcript.status` - High-level status
- `transcript.events[]` - Append-only event log
- `transcript.topics[]` - Topic results
- `transcript.title`, `transcript.long_summary`, etc.
Conductor State Storage:
- Workflow execution state in Conductor database
- Per-task input/output in Conductor
Checkpointing Strategy:
- Each task reads required state from DB (not previous task output for large data)
- Each task writes results to DB before returning
- Task output contains references (IDs, URLs) not large payloads
- On retry, task can check DB for existing results (idempotency)
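A sketch of that pattern applied to `transcribe_track`; the DB helpers and Modal client are illustrative, and the worker is written async to match the current pipeline code (confirm async worker support in the pinned SDK version):

```python
from conductor.client.worker.worker_task import worker_task

@worker_task(task_definition_name="transcribe_track")
async def transcribe_track(transcript_id: str, track_index: int, audio_url: str) -> dict:
    # Idempotency: a retry after a late timeout may find results already saved
    existing = await transcripts_db.get_track_words(transcript_id, track_index)
    if existing:
        return {"track_index": track_index, "word_count": len(existing)}
    words = await modal_client.transcribe(audio_url)  # external GPU call
    # Checkpoint to the DB before returning
    await transcripts_db.save_track_words(transcript_id, track_index, words)
    # Output carries references/counts, not the full word list
    return {"track_index": track_index, "word_count": len(words)}
```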
Data Flow Between Tasks
Input/Output Contracts
get_recording
Input: { recording_id: string }
Output: { id, mtg_session_id, room_name, duration }
get_participants
Input: { mtg_session_id: string }
Output: { participants: [{participant_id, user_name}] }
pad_track
Input: { track_index: number, s3_key: string }
Output: { padded_url: string, size: number }
mixdown_tracks
Input: { padded_urls: string[] }
Output: { audio_key: string, duration: number }
generate_waveform
Input: { audio_key: string }
Output: { waveform: number[] }
transcribe_track
Input: { track_index: number, audio_url: string }
Output: { words: Word[] }
merge_transcripts
Input: { transcripts: Word[][] }
Output: { all_words: Word[], word_count: number }
detect_topics
Input: { words: Word[] }
Output: { topics: Topic[] }
generate_title
Input: { topics: Topic[] }
Output: { title: string }
generate_summary
Input: { words: Word[], topics: Topic[] }
Output: { summary: string, short_summary: string }
finalize
Input: { recording_id, title, summary, duration }
Output: { status: "COMPLETED" }
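These contracts can be encoded once as `TypedDict`s so workers, the cache, and tests share one source of truth; a sketch for two of them (names mirror the contracts above):

```python
from typing import TypedDict

class PadTrackInput(TypedDict):
    track_index: int
    s3_key: str

class PadTrackOutput(TypedDict):
    padded_url: str
    size: int

class MixdownInput(TypedDict):
    padded_urls: list[str]

class MixdownOutput(TypedDict):
    audio_key: str
    duration: float
```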
External API Calls Summary
Per-Step External Dependencies
| Task | External Service | Calls | Notes |
|---|---|---|---|
| get_recording | Daily.co API | 1 | GET /recordings/{id} |
| get_participants | Daily.co API | 1 | GET /meetings/{id}/participants |
| pad_track | S3 | 2 | presign read + PUT padded |
| mixdown_tracks | S3 | 1 | PUT audio.mp3 |
| transcribe_track | Modal.com GPU | 1 | POST /transcriptions |
| detect_topics | LLM (OpenAI) | C | C = ceil(words/300) |
| generate_title | LLM (OpenAI) | 1 | - |
| generate_summary | LLM (OpenAI) | 2+2M | M = subjects (max 6) |
| post_zulip | Zulip API | 1 | POST or PATCH |
| send_webhook | External | 1 | Customer webhook URL |
Cost Attribution Enabled
With decomposed tasks, costs can be attributed:
- GPU costs: sum of `transcribe_track` durations
- LLM costs: sum of `detect_topics` + `generate_title` + `generate_summary` token usage
- S3 costs: bytes uploaded by `pad_track` + `mixdown_tracks`
Idempotency Requirements
By Task
| Task | Idempotent? | Strategy |
|---|---|---|
| get_recording | ✅ | Read-only API call |
| get_participants | ✅ | Read-only API call |
| pad_track | ⚠️ | Overwrite same S3 key |
| mixdown_tracks | ⚠️ | Overwrite same S3 key |
| generate_waveform | ✅ | Deterministic from audio |
| transcribe_track | ❌ | Cache by hash(audio_url) |
| detect_topics | ❌ | Cache by hash(words) |
| generate_title | ❌ | Cache by hash(topic_titles) |
| generate_summary | ❌ | Cache by hash(words+topics) |
| finalize | ✅ | Upsert status |
| cleanup_consent | ✅ | Idempotent deletes |
| post_zulip | ⚠️ | Use message_id for updates |
| send_webhook | ⚠️ | Receiver's responsibility |
Caching Strategy for LLM/GPU Calls
```python
import hashlib
import json
from typing import Any, Optional, Protocol

def input_hash(payload: Any) -> str:
    # Stable across worker processes; builtin hash() is randomized per run
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

class TaskCache(Protocol):
    async def get(self, input_hash: str) -> Optional[dict]: ...
    async def set(self, input_hash: str, output: dict) -> None: ...

# Before calling the external service:
key = input_hash(task_input)
cached = await cache.get(key)
if cached is not None:
    return cached
result = await external_service.call(task_input)
await cache.set(key, result)
return result
```
Migration Strategy
Phase 1: Infrastructure (No Behavior Change)
- Add Conductor container to docker-compose
- Create Conductor client library
- Verify Conductor UI accessible
Phase 2: Parallel Implementation
- Implement all worker tasks
- Register workflow definition
- Test with synthetic recordings
Phase 3: Shadow Mode
- Trigger both Celery and Conductor pipelines
- Compare results for consistency
- Monitor Conductor execution in UI
Phase 4: Cutover
- Disable Celery pipeline trigger
- Enable Conductor-only execution
- Monitor error rates and performance
Phase 5: Cleanup
- Remove Celery task definitions
- Remove old pipeline code
- Update documentation
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Conductor server downtime | Health checks, failover to Celery (Phase 3) |
| Worker serialization issues | Extensive testing with real data |
| Performance regression | Benchmark parallel vs serial transcription |
| Data loss on migration | Shadow mode comparison (Phase 3) |
| Learning curve for team | Documentation, Conductor UI training |
Success Metrics
| Metric | Current | Target |
|---|---|---|
| Pipeline visibility | 3 states | 14+ steps visible |
| Transcription latency (N tracks) | N × 30s | ~30s (parallel) |
| Retry granularity | Entire pipeline | Single step |
| Cost attribution | None | Per-step breakdown |
| Debug time for failures | ~30 min | ~5 min (UI trace) |
Appendix: Conductor Mock Implementation
A working Python mock demonstrating the target workflow structure is available at:
`docs/conductor-pipeline-mock/`
To run:
```bash
cd docs/conductor-pipeline-mock
docker compose up --build
./test_workflow.sh
```
This mock validates:
- Workflow definition structure
- FORK_JOIN parallelism
- Worker task patterns
- Conductor SDK usage
References
- Diarization Pipeline Diagram: `DIARIZATION_PIPELINE_DIAGRAM.md`
- Current Celery Implementation: `server/reflector/pipelines/main_multitrack_pipeline.py`
- Conductor OSS Documentation: https://conductor-oss.github.io/conductor/
- Conductor Python SDK: https://github.com/conductor-sdk/conductor-python