Conductor Migration Requirements: Daily.co Multitrack Pipeline

Executive Summary

Migrate the Daily.co multitrack diarization pipeline from a monolithic Celery task to a decomposed Conductor workflow, enabling visual progress tracking, granular retries, and operational observability.


Business Value

1. Visibility: Where Are We Now? (UX, DevEx)

Current State: Users see only three states: idle → processing → ended/error. A 10-minute pipeline appears frozen with no feedback.

Target State: Real-time visibility into which step is executing:

  • "Transcribing track 2 of 3"
  • "Generating summary (step 8 of 9)"
  • Visual DAG in admin UI showing completed/in-progress/pending nodes

Business Impact:

  • Reduced support tickets ("is it stuck?")
  • Engineers can instantly identify bottlenecks
  • Users have confidence the system is working

2. Progress Tracking: What's Left? (UX, DevEx)

Current State: No indication of remaining work. A failure at step 8 gives the same error as a failure at step 1.

Target State:

  • Progress percentage based on completed steps
  • Clear step enumeration (e.g., "Step 5/9: Transcription")
  • Frontend receives structured progress events with step metadata

Business Impact:

  • Users can estimate completion time
  • Frontend can render meaningful progress bars
  • Error messages include context ("Failed during summary generation")

3. Audit Trail & Profiling (DevEx, Ops)

Current State: Logs scattered across Celery workers. No unified view of a single recording's journey. Resource consumption unknown per step.

Target State:

  • Single workflow ID traces entire recording lifecycle
  • Per-step execution times recorded
  • Resource consumption (GPU seconds, LLM tokens) attributable to specific steps
  • Conductor UI provides complete audit history

Business Impact:

  • Debugging: "Recording X failed at step Y after Z seconds"
  • Cost attribution: "Transcription costs $X, summarization costs $Y"
  • Performance optimization: identify slowest steps

4. Clear Event Dictionary (DevEx)

Current State: Frontend receives WebSocket events (TRANSCRIPT, TOPIC, FINAL_TITLE, etc.) but mapping to pipeline phases is implicit. Adding new events requires tracing through Python code.

Target State:

  • Each Conductor task explicitly defines its output events
  • Event schema documented alongside task definition
  • Frontend developers can reference task→event mapping directly

Business Impact:

  • Faster frontend development
  • Reduced miscommunication between backend/frontend teams
  • Self-documenting pipeline

5. Restart Without Reprocessing (UX, DevEx)

Current State: Any failure restarts the entire pipeline. A timeout during summary generation re-runs transcription, wasting GPU spend.

Target State:

  • Failures resume from last successful step
  • Completed work is checkpointed (e.g., transcription results stored before summary)
  • Manual retry triggers only failed step, not entire workflow

Business Impact:

  • Reduced GPU/LLM costs on retries
  • Faster recovery from transient failures
  • Users don't wait for re-transcription on summary failures

6. Per-Step Timeouts (UX, DevEx)

Current State: Single task timeout for entire pipeline. A hung GPU call blocks everything. Killing the task loses all progress.

Target State:

  • Each step has independent timeout (e.g., transcription: 5min, LLM: 30s)
  • Timeout kills only the hung step
  • Pipeline can retry just that step or fail gracefully

Business Impact:

  • Faster detection of stuck external services
  • Reduced blast radius from hung calls
  • More granular SLAs per operation type

7. Native Retries with Backoff (DevEx, UX)

Current State: Celery retry logic is per-task, not per-external-call. Custom retry wrappers are needed for each API call.

Target State:

  • Conductor provides native retry policies per task
  • Exponential backoff configured declaratively
  • Retry state visible in UI (attempt 2/5)

Business Impact:

  • Reduced boilerplate code
  • Consistent retry behavior across all external calls
  • Visibility into retry attempts for debugging

Current Architecture

Daily.co Multitrack Pipeline Flow

Daily webhook (recording.ready-to-download)     Polling (every 3 min)
              │                                        │
              ▼                                        ▼
         _handle_recording_ready()              poll_daily_recordings()
              │                                        │
              └──────────────┬─────────────────────────┘
                             ▼
              process_multitrack_recording.delay()     ← Celery task #1
                             │
                  ├── Daily API: GET /recordings/{id}
                  ├── Daily API: GET /meetings/{mtgSessionId}/participants
                  ├── DB: Create recording + transcript
                             │
                             ▼
              task_pipeline_multitrack_process.delay() ← Celery task #2 (MONOLITH)
                             │
                             │   ┌─────────────────────────────────────────────────┐
                             │   │  pipeline.process() - ALL PHASES INSIDE HERE   │
                             │   │                                                 │
                             │   │  Phase 2: Track Padding (N tracks, sequential) │
                             │   │  Phase 3: Mixdown → S3 upload                  │
                             │   │  Phase 4: Waveform generation                  │
                             │   │  Phase 5: Transcription (N GPU calls, serial!) │
                             │   │  Phase 6: Topic Detection (C LLM calls)        │
                             │   │  Phase 7a: Title Generation (1 LLM call)       │
                             │   │  Phase 7b: Summary Generation (2+2M LLM calls) │
                             │   │  Phase 8: Finalize status                      │
                             │   └─────────────────────────────────────────────────┘
                             │
                             ▼
              chain(cleanup → zulip → webhook).delay() ← Celery chain (3 tasks)

Problem: Monolithic pipeline.process()

The heavy lifting happens inside a single Python function call. Celery only sees:

  • Task started
  • Task succeeded/failed

It cannot see or control the 8 internal phases.


Target Architecture

Decomposed Conductor Workflow

                        ┌─────────────────────┐
                        │   get_recording     │  ← Daily API
                        │   get_participants  │
                        └──────────┬──────────┘
                                   │
                ┌──────────────────┼──────────────────┐
                ▼                  ▼                  ▼
          ┌──────────┐       ┌──────────┐       ┌──────────┐
          │ pad_tk_0 │       │ pad_tk_1 │       │ pad_tk_N │  ← FORK (parallel)
          └────┬─────┘       └────┬─────┘       └────┬─────┘
                └──────────────────┼──────────────────┘
                                   ▼
                        ┌─────────────────────┐
                        │   mixdown_tracks    │  ← PyAV → S3
                        └──────────┬──────────┘
                                   │
                        ┌──────────┴──────────┐
                        ▼                     ▼
                ┌───────────────┐     ┌───────────────┐
                │generate_wave  │     │  (continue)   │  ← waveform parallel with transcription setup
                └───────────────┘     └───────────────┘
                                   │
                ┌──────────────────┼──────────────────┐
                ▼                  ▼                  ▼
          ┌────────────┐    ┌────────────┐    ┌────────────┐
          │transcribe_0│    │transcribe_1│    │transcribe_N│  ← FORK (parallel GPU!)
          └─────┬──────┘    └─────┬──────┘    └─────┬──────┘
                └──────────────────┼──────────────────┘
                                   ▼
                        ┌─────────────────────┐
                        │  merge_transcripts  │
                        └──────────┬──────────┘
                                   │
                        ┌──────────┴──────────┐
                        ▼                     ▼
                ┌───────────────┐     ┌───────────────┐
                │detect_topics  │     │     (or)      │  ← topic detection
                └───────┬───────┘     └───────────────┘
                        │
         ┌──────────────┴──────────────┐
         ▼                             ▼
    ┌──────────────┐             ┌─────────────┐
    │generate_title│             │ gen_summary │  ← FORK (parallel LLM)
    └──────┬───────┘             └──────┬──────┘
          └──────────────┬─────────────┘
                         ▼
                ┌─────────────────────┐
                │      finalize       │
                └──────────┬──────────┘
                           │
            ┌──────────────┼──────────────┐
            ▼              ▼              ▼
     ┌──────────┐   ┌──────────┐   ┌──────────┐
     │ consent  │──▶│  zulip   │──▶│ webhook  │  ← sequential chain
     └──────────┘   └──────────┘   └──────────┘

Key Improvements

Aspect                     Current (Celery)   Target (Conductor)
Transcription parallelism  Serial (N × 30s)   Parallel (max 30s)
Failure granularity        Restart all        Retry failed step only
Progress visibility        None               Per-step status in UI
Timeout control            Entire pipeline    Per-step timeouts
Audit trail                Scattered logs     Unified workflow history

Scope of Work

Module 1: Conductor Infrastructure Setup

Files to Create/Modify:

  • docker-compose.yml - Add Conductor server container
  • server/reflector/conductor/ - New module for Conductor client
  • Environment configuration for Conductor URL

Tasks:

  • Add conductoross/conductor-standalone:3.15.0 to docker-compose
  • Create Conductor client wrapper (Python conductor-python SDK)
  • Configure health checks and service dependencies
  • Document Conductor UI access (port 8127)
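
A minimal sketch of the client wrapper, assuming the conductor-python SDK's Configuration and TaskHandler entry points; the module path and the CONDUCTOR_API_URL setting name are placeholders, not existing code:

# server/reflector/conductor/client.py (sketch)
from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration

from reflector.settings import settings  # hypothetical settings object


def get_conductor_configuration() -> Configuration:
    # e.g. CONDUCTOR_API_URL = "http://conductor:8080/api"
    return Configuration(server_api_url=settings.CONDUCTOR_API_URL)


def start_worker_processes() -> TaskHandler:
    # TaskHandler discovers @worker_task-decorated functions and polls
    # Conductor for work in background processes.
    handler = TaskHandler(
        configuration=get_conductor_configuration(),
        scan_for_annotated_workers=True,
    )
    handler.start_processes()
    return handler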

Module 2: Task Decomposition - Worker Definitions

Files to Create:

  • server/reflector/conductor/workers/ directory with:
    • get_recording.py - Daily API recording fetch
    • get_participants.py - Daily API participant fetch
    • pad_track.py - Single track padding (PyAV)
    • mixdown_tracks.py - Multi-track mixdown
    • generate_waveform.py - Waveform generation
    • transcribe_track.py - Single track GPU transcription
    • merge_transcripts.py - Combine transcriptions
    • detect_topics.py - LLM topic detection
    • generate_title.py - LLM title generation
    • generate_summary.py - LLM summary generation
    • finalize.py - Status update and cleanup
    • cleanup_consent.py - Consent check
    • post_zulip.py - Zulip notification
    • send_webhook.py - External webhook
    • generate_dynamic_fork_tasks.py - Helper for FORK_JOIN_DYNAMIC task generation

Reference Files (Current Implementation):

  • server/reflector/pipelines/main_multitrack_pipeline.py
  • server/reflector/worker/process.py
  • server/reflector/worker/webhook.py

Key Considerations:

  • Each worker receives input from previous step via Conductor
  • Workers must be idempotent (same input → same output)
  • State serialization between steps (JSON-compatible types)
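
A minimal sketch of one worker (merge_transcripts), assuming the conductor-python @worker_task decorator and JSON-serialized word dicts carrying a start timestamp (field names are assumptions):

# server/reflector/conductor/workers/merge_transcripts.py (sketch)
from conductor.client.worker.worker_task import worker_task


@worker_task(task_definition_name="merge_transcripts")
def merge_transcripts(transcripts: list[list[dict]]) -> dict:
    # Conductor passes JSON between tasks, so each word arrives as a plain
    # dict (e.g. {"text", "start", "end", "speaker"}); sort across tracks
    # by start time to produce one combined transcript.
    all_words = sorted(
        (word for track in transcripts for word in track),
        key=lambda word: word.get("start", 0.0),
    )
    return {"all_words": all_words, "word_count": len(all_words)}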

Module 3: Workflow Definition

Files to Create:

  • server/reflector/conductor/workflows/diarization_pipeline.json
  • server/reflector/conductor/workflows/register.py - Registration script

Workflow Structure:

{
  "name": "daily_diarization_pipeline",
  "version": 1,
  "tasks": [
    {"name": "get_recording", "type": "SIMPLE"},
    {"name": "get_participants", "type": "SIMPLE"},
    {
      "name": "fork_padding",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "track_keys"
    },
    {"name": "mixdown_tracks", "type": "SIMPLE"},
    {"name": "generate_waveform", "type": "SIMPLE"},
    {
      "name": "fork_transcription",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "padded_urls"
    },
    {"name": "merge_transcripts", "type": "SIMPLE"},
    {"name": "detect_topics", "type": "SIMPLE"},
    {
      "name": "fork_generation",
      "type": "FORK_JOIN",
      "forkTasks": [["generate_title"], ["generate_summary"]]
    },
    {"name": "finalize", "type": "SIMPLE"},
    {"name": "cleanup_consent", "type": "SIMPLE"},
    {"name": "post_zulip", "type": "SIMPLE"},
    {"name": "send_webhook", "type": "SIMPLE"}
  ]
}

Key Considerations:

  • Dynamic FORK for variable number of tracks (N)
  • Timeout configuration per task type
  • Retry policies with exponential backoff
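
A sketch of the generate_dynamic_fork_tasks helper listed in Module 2: a FORK_JOIN_DYNAMIC task consumes a list of task definitions plus a per-reference input map produced by a preceding task, and the output keys used here would have to match the dynamicForkTasksParam / dynamicForkTasksInputParamName declared in the workflow definition:

# server/reflector/conductor/workers/generate_dynamic_fork_tasks.py (sketch)
from conductor.client.worker.worker_task import worker_task


@worker_task(task_definition_name="generate_dynamic_fork_tasks")
def generate_dynamic_fork_tasks(task_name: str, items: list[dict]) -> dict:
    # One forked task per item (track), each with a unique reference name.
    dynamic_tasks = []
    dynamic_tasks_input = {}
    for i, item in enumerate(items):
        ref = f"{task_name}_{i}"
        dynamic_tasks.append(
            {"name": task_name, "taskReferenceName": ref, "type": "SIMPLE"}
        )
        dynamic_tasks_input[ref] = item  # e.g. {"track_index": i, "s3_key": ...}
    return {
        "dynamic_tasks": dynamic_tasks,
        "dynamic_tasks_input": dynamic_tasks_input,
    }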

Module 4: Pipeline Trigger Migration

Files to Modify:

  • server/reflector/worker/process.py

Changes:

  • Replace task_pipeline_multitrack_process.delay() with Conductor workflow start
  • Store workflow ID on Recording for status tracking
  • Handle Conductor API errors
  • Keep process_multitrack_recording as-is (creates DB entities before workflow)

Note: Both webhook AND polling entry points converge at process_multitrack_recording, which then calls task_pipeline_multitrack_process.delay(). By modifying this single call site, we capture both entry paths without duplicating integration logic.
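
A sketch of the replacement call site, starting the workflow through Conductor's REST API (POST /workflow/{name} returns the new workflow ID as plain text); the settings lookup and the Recording update wiring are illustrative, not existing code:

import httpx


async def start_diarization_workflow(recording_id: str, conductor_api_url: str) -> str:
    # Start the registered workflow and return its ID for status tracking.
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{conductor_api_url}/workflow/daily_diarization_pipeline",
            params={"version": 1},
            json={"recording_id": recording_id},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text.strip().strip('"')


# In process_multitrack_recording (hypothetical wiring):
#   workflow_id = await start_diarization_workflow(recording.id, settings.CONDUCTOR_API_URL)
#   store workflow_id on the Recording row for later status lookups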

Module 5: Task Definition Registration

Files to Create:

  • server/reflector/conductor/tasks/definitions.py

Task Definitions with Timeouts:

Task                         Timeout   Response Timeout   Retry Count
get_recording                60s       30s                3
get_participants             60s       30s                3
pad_track                    300s      120s               3
mixdown_tracks               600s      300s               3
generate_waveform            120s      60s                3
transcribe_track             1800s     900s               3
merge_transcripts            60s       30s                3
detect_topics                300s      120s               3
generate_title               60s       30s                3
generate_summary             300s      120s               3
finalize                     60s       30s                3
cleanup_consent              60s       30s                3
post_zulip                   60s       30s                5
send_webhook                 60s       30s                30
generate_dynamic_fork_tasks  30s       15s                3
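
A sketch of how these values could be registered through Conductor's metadata API (POST /metadata/taskdefs accepts a list of task definitions); retryDelaySeconds, retryLogic, and ownerEmail below are assumptions beyond the table, and only two tasks are shown:

# server/reflector/conductor/tasks/definitions.py (sketch, two of the tasks)
import httpx

TASK_DEFS = [
    {
        "name": "transcribe_track",
        "timeoutSeconds": 1800,
        "responseTimeoutSeconds": 900,
        "retryCount": 3,
        "retryLogic": "EXPONENTIAL_BACKOFF",
        "retryDelaySeconds": 5,
        "timeoutPolicy": "TIME_OUT_WF",
        "ownerEmail": "reflector@example.com",  # placeholder; may be required by Conductor
    },
    {
        "name": "send_webhook",
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "retryCount": 30,
        "retryLogic": "EXPONENTIAL_BACKOFF",
        "retryDelaySeconds": 60,
        "timeoutPolicy": "TIME_OUT_WF",
        "ownerEmail": "reflector@example.com",
    },
]


def register_task_definitions(conductor_api_url: str) -> None:
    resp = httpx.post(
        f"{conductor_api_url}/metadata/taskdefs", json=TASK_DEFS, timeout=30
    )
    resp.raise_for_status()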

Module 6: Frontend Integration

WebSocket Events (Already Defined):

Events continue to be broadcast exactly as they are today; the event structure does not change.

Event                Triggered By Task   Payload
STATUS               finalize            {value: "processing"|"ended"|"error"}
DURATION             mixdown_tracks      {duration: float}
WAVEFORM             generate_waveform   {waveform: float[]}
TRANSCRIPT           merge_transcripts   {text: string, translation: string|null}
TOPIC                detect_topics       {id, title, summary, timestamp, duration}
FINAL_TITLE          generate_title      {title: string}
FINAL_LONG_SUMMARY   generate_summary    {long_summary: string}
FINAL_SHORT_SUMMARY  generate_summary    {short_summary: string}

New: Progress Events

Add new event type for granular progress:

# PipelineProgressEvent
{
    "event": "PIPELINE_PROGRESS",
    "data": {
        "workflow_id": str,
        "current_step": str,
        "step_index": int,
        "total_steps": int,
        "step_status": "pending" | "in_progress" | "completed" | "failed"
    }
}
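
A sketch of how a Conductor task status change could be mapped onto this event; the step list and status mapping below are assumptions to be finalized alongside the workflow definition:

PIPELINE_STEPS = [
    "get_recording", "get_participants", "pad_track", "mixdown_tracks",
    "generate_waveform", "transcribe_track", "merge_transcripts",
    "detect_topics", "generate_title", "generate_summary", "finalize",
    "cleanup_consent", "post_zulip", "send_webhook",
]

CONDUCTOR_TO_STEP_STATUS = {
    "SCHEDULED": "pending",
    "IN_PROGRESS": "in_progress",
    "COMPLETED": "completed",
    "FAILED": "failed",
}


def build_progress_event(workflow_id: str, task_name: str, task_status: str) -> dict:
    # Translate Conductor's task state into the PIPELINE_PROGRESS payload.
    return {
        "event": "PIPELINE_PROGRESS",
        "data": {
            "workflow_id": workflow_id,
            "current_step": task_name,
            "step_index": PIPELINE_STEPS.index(task_name) + 1,
            "total_steps": len(PIPELINE_STEPS),
            "step_status": CONDUCTOR_TO_STEP_STATUS.get(task_status, "in_progress"),
        },
    }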

Module 7: State Management & Checkpointing

Current State Storage:

  • transcript.status - High-level status
  • transcript.events[] - Append-only event log
  • transcript.topics[] - Topic results
  • transcript.title, transcript.long_summary, etc.

Conductor State Storage:

  • Workflow execution state in Conductor database
  • Per-task input/output in Conductor

Checkpointing Strategy:

  1. Each task reads required state from DB (not previous task output for large data)
  2. Each task writes results to DB before returning
  3. Task output contains references (IDs, URLs) not large payloads
  4. On retry, task can check DB for existing results (idempotency)
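
A sketch of the checkpoint-then-skip pattern above, applied to a single step; transcripts_controller and llm_generate_title are hypothetical stand-ins for the existing data layer and LLM wrappers:

async def generate_title_step(transcript_id: str, topics: list[dict]) -> dict:
    transcript = await transcripts_controller.get_by_id(transcript_id)  # hypothetical

    # Idempotency: a previous attempt may already have written the title.
    if transcript.title:
        return {"title": transcript.title}

    title = await llm_generate_title(topics)  # hypothetical LLM wrapper
    await transcripts_controller.update(transcript, {"title": title})

    # Return a small JSON-serializable value; large data stays in the DB.
    return {"title": title}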

Data Flow Between Tasks

Input/Output Contracts

get_recording
  Input:  { recording_id: string }
  Output: { id, mtg_session_id, room_name, duration }

get_participants
  Input:  { mtg_session_id: string }
  Output: { participants: [{participant_id, user_name}] }

pad_track
  Input:  { track_index: number, s3_key: string }
  Output: { padded_url: string, size: number }

mixdown_tracks
  Input:  { padded_urls: string[] }
  Output: { audio_key: string, duration: number }

generate_waveform
  Input:  { audio_key: string }
  Output: { waveform: number[] }

transcribe_track
  Input:  { track_index: number, audio_url: string }
  Output: { words: Word[] }

merge_transcripts
  Input:  { transcripts: Word[][] }
  Output: { all_words: Word[], word_count: number }

detect_topics
  Input:  { words: Word[] }
  Output: { topics: Topic[] }

generate_title
  Input:  { topics: Topic[] }
  Output: { title: string }

generate_summary
  Input:  { words: Word[], topics: Topic[] }
  Output: { summary: string, short_summary: string }

finalize
  Input:  { recording_id, title, summary, duration }
  Output: { status: "COMPLETED" }
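
These contracts could be pinned down as shared typed structures for the worker layer; a sketch with assumed Word fields (the real field set lives in the existing pipeline models):

from typing import TypedDict


class Word(TypedDict):
    text: str
    start: float
    end: float
    speaker: int


class TranscribeTrackInput(TypedDict):
    track_index: int
    audio_url: str


class TranscribeTrackOutput(TypedDict):
    words: list[Word]


class MixdownTracksOutput(TypedDict):
    audio_key: str
    duration: float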

External API Calls Summary

Per-Step External Dependencies

Task              External Service   Calls   Notes
get_recording     Daily.co API       1       GET /recordings/{id}
get_participants  Daily.co API       1       GET /meetings/{id}/participants
pad_track         S3                 2       presign read + PUT padded
mixdown_tracks    S3                 1       PUT audio.mp3
transcribe_track  Modal.com GPU      1       POST /transcriptions
detect_topics     LLM (OpenAI)       C       C = ceil(words/300)
generate_title    LLM (OpenAI)       1       -
generate_summary  LLM (OpenAI)       2+2M    M = subjects (max 6)
post_zulip        Zulip API          1       POST or PATCH
send_webhook      External           1       Customer webhook URL

Cost Attribution Enabled

With decomposed tasks, costs can be attributed:

  • GPU costs: Sum of transcribe_track durations
  • LLM costs: Sum of detect_topics + generate_title + generate_summary token usage
  • S3 costs: Bytes uploaded by pad_track + mixdown_tracks
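
A sketch of pulling a per-step duration breakdown from Conductor's execution record (GET /workflow/{id}?includeTasks=true returns each executed task with start/end timestamps in milliseconds):

from collections import defaultdict

import httpx


def step_durations(conductor_api_url: str, workflow_id: str) -> dict[str, float]:
    resp = httpx.get(
        f"{conductor_api_url}/workflow/{workflow_id}",
        params={"includeTasks": "true"},
        timeout=30,
    )
    resp.raise_for_status()
    totals: dict[str, float] = defaultdict(float)
    for task in resp.json().get("tasks", []):
        # Sum seconds per task definition name (e.g. all transcribe_track forks).
        duration_ms = task.get("endTime", 0) - task.get("startTime", 0)
        totals[task["taskDefName"]] += duration_ms / 1000.0
    return dict(totals)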

Idempotency Requirements

By Task

Task               Idempotent?   Strategy
get_recording      ✅            Read-only API call
get_participants   ✅            Read-only API call
pad_track          ⚠️            Overwrite same S3 key
mixdown_tracks     ⚠️            Overwrite same S3 key
generate_waveform  ✅            Deterministic from audio
transcribe_track   ✅            Cache by hash(audio_url)
detect_topics      ✅            Cache by hash(words)
generate_title     ✅            Cache by hash(topic_titles)
generate_summary   ✅            Cache by hash(words+topics)
finalize           ✅            Upsert status
cleanup_consent    ✅            Idempotent deletes
post_zulip         ⚠️            Use message_id for updates
send_webhook       ⚠️            Receiver's responsibility

Caching Strategy for LLM/GPU Calls

import hashlib
import json
from typing import Any, Optional


def input_hash(payload: Any) -> str:
    # Stable content hash (Python's built-in hash() is not stable across runs)
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


class TaskCache:
    async def get(self, input_hash: str) -> Optional[Any]: ...
    async def set(self, input_hash: str, output: Any) -> None: ...


async def call_with_cache(cache: TaskCache, external_service, payload: Any) -> Any:
    # Check the cache before calling the external (GPU/LLM) service
    key = input_hash(payload)
    cached = await cache.get(key)
    if cached is not None:
        return cached
    result = await external_service.call(payload)
    await cache.set(key, result)
    return result

Migration Strategy

Phase 1: Infrastructure (No Behavior Change)

  • Add Conductor container to docker-compose
  • Create Conductor client library
  • Verify Conductor UI accessible

Phase 2: Parallel Implementation

  • Implement all worker tasks
  • Register workflow definition
  • Test with synthetic recordings

Phase 3: Shadow Mode

  • Trigger both Celery and Conductor pipelines
  • Compare results for consistency
  • Monitor Conductor execution in UI

Phase 4: Cutover

  • Disable Celery pipeline trigger
  • Enable Conductor-only execution
  • Monitor error rates and performance

Phase 5: Cleanup

  • Remove Celery task definitions
  • Remove old pipeline code
  • Update documentation

Risks & Mitigations

Risk                         Mitigation
Conductor server downtime    Health checks, failover to Celery (Phase 3)
Worker serialization issues  Extensive testing with real data
Performance regression       Benchmark parallel vs serial transcription
Data loss on migration       Shadow mode comparison (Phase 3)
Learning curve for team      Documentation, Conductor UI training

Success Metrics

Metric                            Current           Target
Pipeline visibility               3 states          14+ steps visible
Transcription latency (N tracks)  N × 30s           ~30s (parallel)
Retry granularity                 Entire pipeline   Single step
Cost attribution                  None              Per-step breakdown
Debug time for failures           ~30 min           ~5 min (UI trace)

Appendix: Conductor Mock Implementation

A working Python mock demonstrating the target workflow structure is available at: docs/conductor-pipeline-mock/

To run:

cd docs/conductor-pipeline-mock
docker compose up --build
./test_workflow.sh

UI: http://localhost:8127

This mock validates:

  • Workflow definition structure
  • FORK_JOIN parallelism
  • Worker task patterns
  • Conductor SDK usage

References