# Conductor Migration Requirements: Daily.co Multitrack Pipeline

## Executive Summary

Migrate the Daily.co multitrack diarization pipeline from a monolithic Celery task to a decomposed Conductor workflow, enabling visual progress tracking, granular retries, and operational observability.

---

## Business Value

### 1. Visibility: Where Are We Now? (UX, DevEx)

**Current State**: Users see only three states: `idle` → `processing` → `ended/error`. A 10-minute pipeline appears frozen with no feedback.

**Target State**: Real-time visibility into which step is executing:
- "Transcribing track 2 of 3"
- "Generating summary (step 8 of 9)"
- Visual DAG in admin UI showing completed/in-progress/pending nodes

**Business Impact**:
- Reduced support tickets ("is it stuck?")
- Engineers can instantly identify bottlenecks
- Users have confidence the system is working

### 2. Progress Tracking: What's Left? (UX, DevEx)

**Current State**: No indication of remaining work. A failure at step 8 produces the same error as a failure at step 1.

**Target State**:
- Progress percentage based on completed steps
- Clear step enumeration (e.g., "Step 5/9: Transcription")
- Frontend receives structured progress events with step metadata

**Business Impact**:
- Users can estimate completion time
- Frontend can render meaningful progress bars
- Error messages include context ("Failed during summary generation")

### 3. Audit Trail & Profiling (DevEx, Ops)

**Current State**: Logs are scattered across Celery workers. There is no unified view of a single recording's journey, and per-step resource consumption is unknown.

**Target State**:
- Single workflow ID traces the entire recording lifecycle
- Per-step execution times recorded
- Resource consumption (GPU seconds, LLM tokens) attributable to specific steps
- Conductor UI provides complete audit history

**Business Impact**:
- Debugging: "Recording X failed at step Y after Z seconds"
- Cost attribution: "Transcription costs $X, summarization costs $Y"
- Performance optimization: identify slowest steps

### 4. Clear Event Dictionary (DevEx)

**Current State**: Frontend receives WebSocket events (`TRANSCRIPT`, `TOPIC`, `FINAL_TITLE`, etc.), but the mapping to pipeline phases is implicit. Adding new events requires tracing through Python code.

**Target State**:
- Each Conductor task explicitly defines its output events
- Event schema documented alongside task definition
- Frontend developers can reference the task→event mapping directly

**Business Impact**:
- Faster frontend development
- Reduced miscommunication between backend/frontend teams
- Self-documenting pipeline

### 5. Restart Without Reprocessing (UX, DevEx)

**Current State**: Any failure restarts the entire pipeline. A timeout during summary generation re-runs transcription (wasting GPU costs).

**Target State**:
- Failures resume from the last successful step
- Completed work is checkpointed (e.g., transcription results stored before summary)
- Manual retry triggers only the failed step, not the entire workflow

**Business Impact**:
- Reduced GPU/LLM costs on retries
- Faster recovery from transient failures
- Users don't wait for re-transcription on summary failures

### 6. Per-Step Timeouts (UX, DevEx)

**Current State**: A single task timeout covers the entire pipeline. A hung GPU call blocks everything. Killing the task loses all progress.

**Target State**:
- Each step has an independent timeout (e.g., transcription: 5min, LLM: 30s)
- A timeout kills only the hung step
- Pipeline can retry just that step or fail gracefully

**Business Impact**:
- Faster detection of stuck external services
- Reduced blast radius from hung calls
- More granular SLAs per operation type

### 7. Native Retries with Backoff (DevEx, UX)

**Current State**: Celery retry logic is per-task, not per-external-call. Custom retry wrappers are needed for each API call.

**Target State**:
- Conductor provides native retry policies per task
- Exponential backoff configured declaratively
- Retry state visible in UI (attempt 2/5)

**Business Impact**:
- Reduced boilerplate code
- Consistent retry behavior across all external calls
- Visibility into retry attempts for debugging

---

## Current Architecture

### Daily.co Multitrack Pipeline Flow

```
Daily webhook (recording.ready-to-download)       Polling (every 3 min)
                     │                                      │
                     ▼                                      ▼
         _handle_recording_ready()               poll_daily_recordings()
                     │                                      │
                     └──────────────────┬───────────────────┘
                                        ▼
                      process_multitrack_recording.delay()    ← Celery task #1
                                        │
                                        ├── Daily API: GET /recordings/{id}
                                        ├── Daily API: GET /meetings/{mtgSessionId}/participants
                                        ├── DB: Create recording + transcript
                                        │
                                        ▼
                    task_pipeline_multitrack_process.delay()  ← Celery task #2 (MONOLITH)
                                        │
                                        │  ┌─────────────────────────────────────────────────┐
                                        │  │ pipeline.process() - ALL PHASES INSIDE HERE     │
                                        │  │                                                 │
                                        │  │ Phase 2: Track Padding (N tracks, sequential)   │
                                        │  │ Phase 3: Mixdown → S3 upload                    │
                                        │  │ Phase 4: Waveform generation                    │
                                        │  │ Phase 5: Transcription (N GPU calls, serial!)   │
                                        │  │ Phase 6: Topic Detection (C LLM calls)          │
                                        │  │ Phase 7a: Title Generation (1 LLM call)         │
                                        │  │ Phase 7b: Summary Generation (2+2M LLM calls)   │
                                        │  │ Phase 8: Finalize status                        │
                                        │  └─────────────────────────────────────────────────┘
                                        │
                                        ▼
                    chain(cleanup → zulip → webhook).delay()  ← Celery chain (3 tasks)
```

### Problem: Monolithic `pipeline.process()`

The heavy lifting happens inside a single Python function call. Celery only sees:
- Task started
- Task succeeded/failed

It cannot see or control the 8 internal phases.

---

## Target Architecture

### Decomposed Conductor Workflow

```
                  ┌─────────────────────┐
                  │ get_recording       │   ← Daily API
                  │ get_participants    │
                  └──────────┬──────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
     ┌──────────┐       ┌──────────┐       ┌──────────┐
     │ pad_tk_0 │       │ pad_tk_1 │       │ pad_tk_N │   ← FORK (parallel)
     └────┬─────┘       └────┬─────┘       └────┬─────┘
          └──────────────────┼──────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ mixdown_tracks      │   ← PyAV → S3
                  └──────────┬──────────┘
                             │
          ┌──────────────────┤
          ▼                  │
  ┌───────────────┐          │
  │ generate_wave │          │   ← waveform runs in parallel with transcription setup
  └───────────────┘          │
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
    ┌────────────┐     ┌────────────┐     ┌────────────┐
    │transcribe_0│     │transcribe_1│     │transcribe_N│   ← FORK (parallel GPU!)
    └─────┬──────┘     └─────┬──────┘     └─────┬──────┘
          └──────────────────┼──────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ merge_transcripts   │
                  └──────────┬──────────┘
                             ▼
                  ┌─────────────────────┐
                  │ detect_topics       │   ← topic detection
                  └──────────┬──────────┘
                             │
                ┌────────────┴────────────┐
                ▼                         ▼
        ┌───────────────┐         ┌───────────────┐
        │generate_title │         │ gen_summary   │   ← FORK (parallel LLM)
        └───────┬───────┘         └───────┬───────┘
                └────────────┬────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ finalize            │
                  └──────────┬──────────┘
                             │
          ┌──────────────────┘
          ▼
    ┌──────────┐       ┌──────────┐       ┌──────────┐
    │ consent  │──────▶│  zulip   │──────▶│ webhook  │   ← sequential chain
    └──────────┘       └──────────┘       └──────────┘
```

### Key Improvements

| Aspect | Current (Celery) | Target (Conductor) |
|--------|------------------|--------------------|
| Transcription parallelism | Serial (N × 30s) | Parallel (max 30s) |
| Failure granularity | Restart all | Retry failed step only |
| Progress visibility | None | Per-step status in UI |
| Timeout control | Entire pipeline | Per-step timeouts |
| Audit trail | Scattered logs | Unified workflow history |

---

## Scope of Work

### Module 1: Conductor Infrastructure Setup

**Files to Create/Modify:**
- `docker-compose.yml` - Add Conductor server container
- `server/reflector/conductor/` - New module for Conductor client
- Environment configuration for Conductor URL

**Tasks:**
- [ ] Add `conductoross/conductor-standalone:3.15.0` to docker-compose
- [ ] Create Conductor client wrapper (Python `conductor-python` SDK; see the sketch below)
- [ ] Configure health checks and service dependencies
- [ ] Document Conductor UI access (port 8127)

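A minimal sketch of what the client wrapper could look like, assuming the `conductor-python` SDK's `Configuration` and `WorkflowExecutor` classes; the module path and server URL are illustrative, not the final layout:

```python
# server/reflector/conductor/client.py (illustrative sketch)
from conductor.client.configuration.configuration import Configuration
from conductor.client.workflow.executor.workflow_executor import WorkflowExecutor

# Assumption: the Conductor API is reachable at this URL via docker-compose.
CONDUCTOR_SERVER_URL = "http://conductor:8080/api"


def get_conductor_config() -> Configuration:
    """Build the SDK configuration from environment/settings."""
    return Configuration(server_api_url=CONDUCTOR_SERVER_URL, debug=False)


def get_workflow_executor() -> WorkflowExecutor:
    """Entry point used by trigger code to start and inspect workflows."""
    return WorkflowExecutor(configuration=get_conductor_config())
```
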
### Module 2: Task Decomposition - Worker Definitions

**Files to Create:**
- `server/reflector/conductor/workers/` directory with:
  - `get_recording.py` - Daily API recording fetch
  - `get_participants.py` - Daily API participant fetch
  - `pad_track.py` - Single track padding (PyAV)
  - `mixdown_tracks.py` - Multi-track mixdown
  - `generate_waveform.py` - Waveform generation
  - `transcribe_track.py` - Single track GPU transcription
  - `merge_transcripts.py` - Combine transcriptions
  - `detect_topics.py` - LLM topic detection
  - `generate_title.py` - LLM title generation
  - `generate_summary.py` - LLM summary generation
  - `finalize.py` - Status update and cleanup
  - `cleanup_consent.py` - Consent check
  - `post_zulip.py` - Zulip notification
  - `send_webhook.py` - External webhook
  - `generate_dynamic_fork_tasks.py` - Helper for FORK_JOIN_DYNAMIC task generation

**Reference Files (Current Implementation):**
- `server/reflector/pipelines/main_multitrack_pipeline.py`
- `server/reflector/worker/process.py`
- `server/reflector/worker/webhook.py`

**Key Considerations:**
- Each worker receives input from the previous step via Conductor
- Workers must be idempotent (same input → same output)
- State serialization between steps (JSON-compatible types); see the worker sketch below

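A sketch of one worker, assuming the `conductor-python` `@worker_task` decorator; `transcribe_with_gpu` is a hypothetical stand-in for the existing Modal.com transcription call:

```python
# server/reflector/conductor/workers/transcribe_track.py (illustrative sketch)
from conductor.client.worker.worker_task import worker_task

from reflector.transcription import transcribe_with_gpu  # hypothetical helper


@worker_task(task_definition_name="transcribe_track")
def transcribe_track(track_index: int, audio_url: str) -> dict:
    """Transcribe one padded track; inputs and outputs stay JSON-serializable."""
    words = transcribe_with_gpu(audio_url)  # POST to the GPU transcription service
    return {"track_index": track_index, "words": words}
```

With this pattern, the SDK's `TaskHandler` can poll for all annotated workers from a single worker process.
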
### Module 3: Workflow Definition

**Files to Create:**
- `server/reflector/conductor/workflows/diarization_pipeline.json`
- `server/reflector/conductor/workflows/register.py` - Registration script

**Workflow Structure:**
```json
{
  "name": "daily_diarization_pipeline",
  "version": 1,
  "tasks": [
    {"name": "get_recording", "type": "SIMPLE"},
    {"name": "get_participants", "type": "SIMPLE"},
    {
      "name": "fork_padding",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "track_keys"
    },
    {"name": "mixdown_tracks", "type": "SIMPLE"},
    {"name": "generate_waveform", "type": "SIMPLE"},
    {
      "name": "fork_transcription",
      "type": "FORK_JOIN_DYNAMIC",
      "dynamicForkTasksParam": "padded_urls"
    },
    {"name": "merge_transcripts", "type": "SIMPLE"},
    {"name": "detect_topics", "type": "SIMPLE"},
    {
      "name": "fork_generation",
      "type": "FORK_JOIN",
      "forkTasks": [["generate_title"], ["generate_summary"]]
    },
    {"name": "finalize", "type": "SIMPLE"},
    {"name": "cleanup_consent", "type": "SIMPLE"},
    {"name": "post_zulip", "type": "SIMPLE"},
    {"name": "send_webhook", "type": "SIMPLE"}
  ]
}
```

**Key Considerations:**
- Dynamic FORK for a variable number of tracks (N); the fork branches are produced by `generate_dynamic_fork_tasks` (see the sketch below)
- Timeout configuration per task type
- Retry policies with exponential backoff

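A sketch of the dynamic-fork helper: Conductor's `FORK_JOIN_DYNAMIC` expects one output listing the forked task definitions and another mapping reference names to their inputs, wired in via `dynamicForkTasksParam` / `dynamicForkTasksInputParamName`. The field names below are illustrative:

```python
# server/reflector/conductor/workers/generate_dynamic_fork_tasks.py (illustrative sketch)
from conductor.client.worker.worker_task import worker_task


@worker_task(task_definition_name="generate_dynamic_fork_tasks")
def generate_dynamic_fork_tasks(track_keys: list[str]) -> dict:
    """Build one transcribe_track (or pad_track) branch per recorded track."""
    dynamic_tasks = [
        {
            "name": "transcribe_track",
            "taskReferenceName": f"transcribe_track_{i}",
            "type": "SIMPLE",
        }
        for i, _ in enumerate(track_keys)
    ]
    dynamic_tasks_input = {
        f"transcribe_track_{i}": {"track_index": i, "audio_url": key}
        for i, key in enumerate(track_keys)
    }
    # The FORK_JOIN_DYNAMIC task references these two outputs by name.
    return {"dynamicTasks": dynamic_tasks, "dynamicTasksInput": dynamic_tasks_input}
```
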
### Module 4: Pipeline Trigger Migration

**Files to Modify:**
- `server/reflector/worker/process.py`

**Changes:**
- Replace `task_pipeline_multitrack_process.delay()` with a Conductor workflow start (see the sketch below)
- Store the workflow ID on Recording for status tracking
- Handle Conductor API errors
- Keep `process_multitrack_recording` as-is (creates DB entities before the workflow)

**Note:** Both webhook AND polling entry points converge at `process_multitrack_recording`,
which then calls `task_pipeline_multitrack_process.delay()`. By modifying this single call site,
we capture both entry paths without duplicating integration logic.

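A sketch of the new trigger call, assuming the client wrapper from Module 1 and the SDK's `StartWorkflowRequest` model; the function name and where the workflow ID is persisted are illustrative:

```python
# server/reflector/worker/process.py (illustrative sketch of the new trigger)
from conductor.client.http.models.start_workflow_request import StartWorkflowRequest

from reflector.conductor.client import get_workflow_executor  # from the Module 1 sketch


def start_diarization_workflow(recording_id: str, transcript_id: str) -> str:
    """Start the Conductor pipeline and return its workflow ID."""
    executor = get_workflow_executor()
    request = StartWorkflowRequest(
        name="daily_diarization_pipeline",
        version=1,
        input={"recording_id": recording_id, "transcript_id": transcript_id},
    )
    workflow_id = executor.start_workflow(start_workflow_request=request)
    # Persist workflow_id on the Recording so its status can be traced later.
    return workflow_id
```
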
### Module 5: Task Definition Registration

**Files to Create:**
- `server/reflector/conductor/tasks/definitions.py`

**Task Definitions with Timeouts** (a registration sketch follows the table):

| Task | Timeout | Response Timeout | Retry Count |
|------|---------|------------------|-------------|
| get_recording | 60s | 30s | 3 |
| get_participants | 60s | 30s | 3 |
| pad_track | 300s | 120s | 3 |
| mixdown_tracks | 600s | 300s | 3 |
| generate_waveform | 120s | 60s | 3 |
| transcribe_track | 1800s | 900s | 3 |
| merge_transcripts | 60s | 30s | 3 |
| detect_topics | 300s | 120s | 3 |
| generate_title | 60s | 30s | 3 |
| generate_summary | 300s | 120s | 3 |
| finalize | 60s | 30s | 3 |
| cleanup_consent | 60s | 30s | 3 |
| post_zulip | 60s | 30s | 5 |
| send_webhook | 60s | 30s | 30 |
| generate_dynamic_fork_tasks | 30s | 15s | 3 |

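A sketch of how the table could be registered, assuming the SDK's `TaskDef` model and metadata client (only two rows shown; the exact client class and method names should be verified against the SDK version in use):

```python
# server/reflector/conductor/tasks/definitions.py (illustrative sketch)
from conductor.client.configuration.configuration import Configuration
from conductor.client.http.models.task_def import TaskDef
from conductor.client.orkes.orkes_metadata_client import OrkesMetadataClient

TASK_DEFS = [
    TaskDef(
        name="transcribe_track",
        timeout_seconds=1800,
        response_timeout_seconds=900,
        retry_count=3,
        retry_logic="EXPONENTIAL_BACKOFF",
    ),
    TaskDef(
        name="send_webhook",
        timeout_seconds=60,
        response_timeout_seconds=30,
        retry_count=30,
        retry_logic="EXPONENTIAL_BACKOFF",
    ),
]


def register(config: Configuration) -> None:
    """Register (or update) every task definition from the table above."""
    client = OrkesMetadataClient(config)
    for task_def in TASK_DEFS:
        client.register_task_def(task_def)
```
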
### Module 6: Frontend Integration

**WebSocket Events (Already Defined):**

Events continue to be broadcast as they are today. No change to event structure.

| Event | Triggered By Task | Payload |
|-------|-------------------|---------|
| STATUS | finalize | `{value: "processing"\|"ended"\|"error"}` |
| DURATION | mixdown_tracks | `{duration: float}` |
| WAVEFORM | generate_waveform | `{waveform: float[]}` |
| TRANSCRIPT | merge_transcripts | `{text: string, translation: string\|null}` |
| TOPIC | detect_topics | `{id, title, summary, timestamp, duration}` |
| FINAL_TITLE | generate_title | `{title: string}` |
| FINAL_LONG_SUMMARY | generate_summary | `{long_summary: string}` |
| FINAL_SHORT_SUMMARY | generate_summary | `{short_summary: string}` |

**New: Progress Events**

Add a new event type for granular progress:

```python
# PipelineProgressEvent (schema sketch)
{
    "event": "PIPELINE_PROGRESS",
    "data": {
        "workflow_id": str,
        "current_step": str,
        "step_index": int,
        "total_steps": int,
        "step_status": "pending" | "in_progress" | "completed" | "failed",
    },
}
```

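A sketch of deriving this event from Conductor's workflow execution state; the task list shape (`referenceTaskName`, `status`) follows Conductor's execution JSON, and broadcasting is left to the existing WebSocket layer:

```python
# Illustrative: map Conductor task statuses onto PIPELINE_PROGRESS events.
STATUS_MAP = {
    "SCHEDULED": "pending",
    "IN_PROGRESS": "in_progress",
    "COMPLETED": "completed",
    "FAILED": "failed",
}


def build_progress_event(workflow_id: str, tasks: list[dict]) -> dict:
    """tasks: non-empty [{"referenceTaskName": str, "status": str}, ...] from Conductor."""
    # First task that has not completed yet; fall back to the last one.
    current = next((t for t in tasks if t["status"] != "COMPLETED"), tasks[-1])
    return {
        "event": "PIPELINE_PROGRESS",
        "data": {
            "workflow_id": workflow_id,
            "current_step": current["referenceTaskName"],
            "step_index": tasks.index(current),
            "total_steps": len(tasks),
            "step_status": STATUS_MAP.get(current["status"], "pending"),
        },
    }
```
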
### Module 7: State Management & Checkpointing

**Current State Storage:**
- `transcript.status` - High-level status
- `transcript.events[]` - Append-only event log
- `transcript.topics[]` - Topic results
- `transcript.title`, `transcript.long_summary`, etc.

**Conductor State Storage:**
- Workflow execution state in Conductor database
- Per-task input/output in Conductor

**Checkpointing Strategy** (sketched below):
1. Each task reads required state from the DB (not from previous task output, for large data)
2. Each task writes results to the DB before returning
3. Task output contains references (IDs, URLs), not large payloads
4. On retry, a task can check the DB for existing results (idempotency)

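A sketch of this pattern in a single worker; `transcripts_db` and `run_topic_detection` are hypothetical stand-ins for the existing persistence and LLM code:

```python
# Illustrative checkpointed worker: write results to the DB, return only references.
from conductor.client.worker.worker_task import worker_task

from reflector.db import transcripts_db          # hypothetical persistence helper
from reflector.llm import run_topic_detection    # hypothetical LLM helper


@worker_task(task_definition_name="detect_topics")
def detect_topics(transcript_id: str) -> dict:
    # Idempotency: a retry after a partial failure reuses stored results.
    existing = transcripts_db.get_topics(transcript_id)
    if existing:
        return {"transcript_id": transcript_id, "topic_ids": [t.id for t in existing]}

    words = transcripts_db.get_words(transcript_id)     # 1. read state from DB
    topics = run_topic_detection(words)                 # 2. external LLM calls
    transcripts_db.save_topics(transcript_id, topics)   # 3. checkpoint before returning
    return {"transcript_id": transcript_id, "topic_ids": [t.id for t in topics]}
```
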
---

## Data Flow Between Tasks

### Input/Output Contracts

```
get_recording
  Input:  { recording_id: string }
  Output: { id, mtg_session_id, room_name, duration }

get_participants
  Input:  { mtg_session_id: string }
  Output: { participants: [{participant_id, user_name}] }

pad_track
  Input:  { track_index: number, s3_key: string }
  Output: { padded_url: string, size: number }

mixdown_tracks
  Input:  { padded_urls: string[] }
  Output: { audio_key: string, duration: number }

generate_waveform
  Input:  { audio_key: string }
  Output: { waveform: number[] }

transcribe_track
  Input:  { track_index: number, audio_url: string }
  Output: { words: Word[] }

merge_transcripts
  Input:  { transcripts: Word[][] }
  Output: { all_words: Word[], word_count: number }

detect_topics
  Input:  { words: Word[] }
  Output: { topics: Topic[] }

generate_title
  Input:  { topics: Topic[] }
  Output: { title: string }

generate_summary
  Input:  { words: Word[], topics: Topic[] }
  Output: { summary: string, short_summary: string }

finalize
  Input:  { recording_id, title, summary, duration }
  Output: { status: "COMPLETED" }
```

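These contracts could be pinned down as typed structures shared by the workers; a sketch using `TypedDict`, where the `Word` fields are assumptions for illustration rather than the existing model:

```python
from typing import TypedDict


class Word(TypedDict):
    text: str
    start: float
    end: float
    speaker: int


class TranscribeTrackInput(TypedDict):
    track_index: int
    audio_url: str


class TranscribeTrackOutput(TypedDict):
    words: list[Word]
```
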
---

## External API Calls Summary

### Per-Step External Dependencies

| Task | External Service | Calls | Notes |
|------|------------------|-------|-------|
| get_recording | Daily.co API | 1 | GET /recordings/{id} |
| get_participants | Daily.co API | 1 | GET /meetings/{id}/participants |
| pad_track | S3 | 2 | presign read + PUT padded |
| mixdown_tracks | S3 | 1 | PUT audio.mp3 |
| transcribe_track | Modal.com GPU | 1 | POST /transcriptions |
| detect_topics | LLM (OpenAI) | C | C = ceil(words/300) |
| generate_title | LLM (OpenAI) | 1 | - |
| generate_summary | LLM (OpenAI) | 2+2M | M = subjects (max 6) |
| post_zulip | Zulip API | 1 | POST or PATCH |
| send_webhook | External | 1 | Customer webhook URL |

### Cost Attribution Enabled

With decomposed tasks, costs can be attributed per step (see the aggregation sketch below):
- **GPU costs**: Sum of `transcribe_track` durations
- **LLM costs**: Sum of `detect_topics` + `generate_title` + `generate_summary` token usage
- **S3 costs**: Bytes uploaded by `pad_track` + `mixdown_tracks`

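A sketch of GPU-cost attribution from a workflow execution record; `taskDefName`, `startTime`, and `endTime` (milliseconds) follow Conductor's task execution JSON, and the price constant is a placeholder:

```python
# Illustrative: sum transcribe_track execution time for one workflow execution.
GPU_SECOND_USD = 0.0011  # placeholder rate, not a real price


def gpu_cost_usd(tasks: list[dict]) -> float:
    """tasks: workflow execution tasks with taskDefName/startTime/endTime in ms."""
    seconds = sum(
        (t["endTime"] - t["startTime"]) / 1000.0
        for t in tasks
        if t.get("taskDefName") == "transcribe_track" and t.get("endTime")
    )
    return seconds * GPU_SECOND_USD
```
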
---

## Idempotency Requirements

### By Task

| Task | Idempotent? | Strategy |
|------|-------------|----------|
| get_recording | ✅ | Read-only API call |
| get_participants | ✅ | Read-only API call |
| pad_track | ⚠️ | Overwrite same S3 key |
| mixdown_tracks | ⚠️ | Overwrite same S3 key |
| generate_waveform | ✅ | Deterministic from audio |
| transcribe_track | ❌ | Cache by hash(audio_url) |
| detect_topics | ❌ | Cache by hash(words) |
| generate_title | ❌ | Cache by hash(topic_titles) |
| generate_summary | ❌ | Cache by hash(words+topics) |
| finalize | ✅ | Upsert status |
| cleanup_consent | ✅ | Idempotent deletes |
| post_zulip | ⚠️ | Use message_id for updates |
| send_webhook | ⚠️ | Receiver's responsibility |

### Caching Strategy for LLM/GPU Calls

```python
from typing import Optional

# "Output" stands for the task-specific result type (words, topics, summary, ...).
class TaskCache:
    async def get(self, input_hash: str) -> Optional["Output"]: ...
    async def set(self, input_hash: str, output: "Output") -> None: ...

# Before calling the external service:
cached = await cache.get(input_hash)
if cached:
    return cached

result = await external_service.call(task_input)
await cache.set(input_hash, result)
return result
```

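Python's built-in `hash()` is not stable across processes or runs, so the cache key would need a content hash of the task input; a minimal sketch:

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Stable cache key for JSON-serializable task inputs."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example: key for a transcription call
key = content_hash({"task": "transcribe_track", "audio_url": "s3://bucket/track0.webm"})
```
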
---

## Migration Strategy

### Phase 1: Infrastructure (No Behavior Change)
- Add Conductor container to docker-compose
- Create Conductor client library
- Verify Conductor UI is accessible

### Phase 2: Parallel Implementation
- Implement all worker tasks
- Register workflow definition
- Test with synthetic recordings

### Phase 3: Shadow Mode
- Trigger both Celery and Conductor pipelines
- Compare results for consistency
- Monitor Conductor execution in UI

### Phase 4: Cutover
- Disable Celery pipeline trigger
- Enable Conductor-only execution
- Monitor error rates and performance

### Phase 5: Cleanup
- Remove Celery task definitions
- Remove old pipeline code
- Update documentation

---

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Conductor server downtime | Health checks, failover to Celery (Phase 3) |
| Worker serialization issues | Extensive testing with real data |
| Performance regression | Benchmark parallel vs serial transcription |
| Data loss on migration | Shadow mode comparison (Phase 3) |
| Learning curve for team | Documentation, Conductor UI training |

---

## Success Metrics

| Metric | Current | Target |
|--------|---------|--------|
| Pipeline visibility | 3 states | 14+ steps visible |
| Transcription latency (N tracks) | N × 30s | ~30s (parallel) |
| Retry granularity | Entire pipeline | Single step |
| Cost attribution | None | Per-step breakdown |
| Debug time for failures | ~30 min | ~5 min (UI trace) |

---

## Appendix: Conductor Mock Implementation

A working Python mock demonstrating the target workflow structure is available at:
`docs/conductor-pipeline-mock/`

To run:
```bash
cd docs/conductor-pipeline-mock
docker compose up --build
./test_workflow.sh
```

UI: http://localhost:8127

This mock validates:
- Workflow definition structure
- FORK_JOIN parallelism
- Worker task patterns
- Conductor SDK usage

---

## References

- Diarization Pipeline Diagram: `DIARIZATION_PIPELINE_DIAGRAM.md`
- Current Celery Implementation: `server/reflector/pipelines/main_multitrack_pipeline.py`
- Conductor OSS Documentation: https://conductor-oss.github.io/conductor/
- Conductor Python SDK: https://github.com/conductor-sdk/conductor-python