# Conductor Migration Tasks

This document defines atomic, isolated work items for migrating the Daily.co multitrack diarization pipeline from Celery to Conductor. Each task is self-contained with clear dependencies, acceptance criteria, and references to the codebase.

---

## Task Index

| ID | Title | Phase | Dependencies | Complexity |
|----|-------|-------|--------------|------------|
| INFRA-001 | Add Conductor container to docker-compose | 1 | None | Low |
| INFRA-002 | Create Conductor Python client wrapper | 1 | INFRA-001 | Medium |
| INFRA-003 | Add Conductor environment configuration | 1 | INFRA-001 | Low |
| INFRA-004 | Create health check endpoint for Conductor | 1 | INFRA-002 | Low |
| TASK-001 | Create task definitions registry module | 2 | INFRA-002 | Medium |
| TASK-002 | Implement get_recording worker | 2 | TASK-001 | Low |
| TASK-003 | Implement get_participants worker | 2 | TASK-001 | Low |
| TASK-004a | Implement pad_track: extract stream metadata | 2 | TASK-001 | Medium |
| TASK-004b | Implement pad_track: PyAV padding filter | 2 | TASK-004a | Medium |
| TASK-004c | Implement pad_track: S3 upload padded file | 2 | TASK-004b | Low |
| TASK-005a | Implement mixdown_tracks: build filter graph | 2 | TASK-001 | Medium |
| TASK-005b | Implement mixdown_tracks: S3 streaming + upload | 2 | TASK-005a | Medium |
| TASK-006 | Implement generate_waveform worker | 2 | TASK-001 | Medium |
| TASK-007 | Implement transcribe_track worker | 2 | TASK-001 | Medium |
| TASK-008 | Implement merge_transcripts worker | 2 | TASK-001 | Medium |
| TASK-009 | Implement detect_topics worker | 2 | TASK-001 | Medium |
| TASK-010 | Implement generate_title worker | 2 | TASK-001 | Low |
| TASK-011 | Implement generate_summary worker | 2 | TASK-001 | Medium |
| TASK-012 | Implement finalize worker | 2 | TASK-001 | Medium |
| TASK-013 | Implement cleanup_consent worker | 2 | TASK-001 | Low |
| TASK-014 | Implement post_zulip worker | 2 | TASK-001 | Low |
| TASK-015 | Implement send_webhook worker | 2 | TASK-001 | Low |
| TASK-016 | Implement generate_dynamic_fork_tasks helper | 2 | TASK-001 | Low |
| STATE-001 | Add workflow_id to Recording model | 2 | INFRA-002 | Low |
| WFLOW-001 | Create workflow definition JSON with FORK_JOIN_DYNAMIC | 3 | TASK-002..015 | High |
| WFLOW-002 | Implement workflow registration script | 3 | WFLOW-001 | Medium |
| EVENT-001 | Add PIPELINE_PROGRESS WebSocket event (requires frontend ticket) | 2 | None | Medium |
| EVENT-002 | Emit progress events from workers (requires frontend ticket) | 2 | EVENT-001, TASK-002..015 | Medium |
| INTEG-001 | Modify pipeline trigger to start Conductor workflow | 4 | WFLOW-002, STATE-001 | Medium |
| SHADOW-001 | Implement shadow mode toggle | 4 | INTEG-001 | Medium |
| SHADOW-002 | Add result comparison: content fields | 4 | SHADOW-001 | Medium |
| CUTOVER-001 | Create feature flag for Conductor-only mode | 5 | SHADOW-001 | Low |
| CUTOVER-002 | Add fallback to Celery on Conductor failure | 5 | CUTOVER-001 | Medium |
| CLEANUP-001 | Remove deprecated Celery task code | 6 | CUTOVER-001 | Medium |
| CLEANUP-002 | Update documentation | 6 | CLEANUP-001 | Low |
| TEST-001a | Integration tests: API workers (defer to human if complex) | 2 | TASK-002, TASK-003 | Low |
| TEST-001b | Integration tests: audio workers (defer to human if complex) | 2 | TASK-004c, TASK-005b, TASK-006 | Medium |
| TEST-001c | Integration tests: transcription workers (defer to human if complex) | 2 | TASK-007, TASK-008 | Medium |
| TEST-001d | Integration tests: LLM workers (defer to human if complex) | 2 | TASK-009..011 | Medium |
| TEST-001e | Integration tests: finalization workers (defer to human if complex) | 2 | TASK-012..015 | Low |
| TEST-002 | E2E test for complete workflow (defer to human if complex) | 3 | WFLOW-002 | High |
| TEST-003 | Shadow mode comparison tests (defer to human tester if too complex) | 4 | SHADOW-002 | Medium |

---

## Phase 1: Infrastructure Setup

### INFRA-001: Add Conductor Container to docker-compose

**Description:**
Add the Conductor OSS standalone container to the docker-compose configuration.

**Files to Modify:**
- `docker-compose.yml`

**Implementation Details:**
```yaml
conductor:
  image: conductoross/conductor-standalone:3.15.0
  ports:
    - 8127:8080
    - 5001:5000
  environment:
    - conductor.db.type=memory # Use postgres in production
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
    interval: 30s
    timeout: 10s
    retries: 5
```

**Acceptance Criteria:**
- [ ] Conductor UI accessible at http://localhost:8127
- [ ] Swagger docs available at http://localhost:8127/swagger-ui/index.html
- [ ] Health endpoint returns 200

**Dependencies:** None

**Reference Files:**
- `docs/conductor-pipeline-mock/docker-compose.yml`

---

### INFRA-002: Create Conductor Python Client Wrapper

**Description:**
Create a reusable client wrapper module for interacting with the Conductor server using the `conductor-python` SDK.

**Files to Create:**
- `server/reflector/conductor/__init__.py`
- `server/reflector/conductor/client.py`

**Implementation Details:**
```python
# server/reflector/conductor/client.py
from conductor.client.configuration.configuration import Configuration
from conductor.client.orkes_clients import OrkesClients
from conductor.client.workflow_client import WorkflowClient
from reflector.settings import settings


class ConductorClientManager:
    _instance = None

    @classmethod
    def get_client(cls) -> WorkflowClient:
        if cls._instance is None:
            config = Configuration(
                server_api_url=settings.CONDUCTOR_SERVER_URL,
                debug=settings.CONDUCTOR_DEBUG,
            )
            cls._instance = OrkesClients(config)
        return cls._instance.get_workflow_client()

    @classmethod
    def start_workflow(cls, name: str, version: int, input_data: dict) -> str:
        """Start a workflow and return the workflow ID."""
        client = cls.get_client()
        return client.start_workflow_by_name(name, input_data, version=version)

    @classmethod
    def get_workflow_status(cls, workflow_id: str) -> dict:
        """Get the current status of a workflow."""
        client = cls.get_client()
        return client.get_workflow(workflow_id, include_tasks=True)
```
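
A minimal usage sketch of the wrapper (the `diarization_pipeline` name matches the workflow registered in WFLOW-002; the input values here are illustrative only):

```python
# Usage sketch: start a workflow and poll its status via the wrapper above.
from reflector.conductor.client import ConductorClientManager

workflow_id = ConductorClientManager.start_workflow(
    name="diarization_pipeline",
    version=1,
    input_data={"recording_id": "rec_123", "transcript_id": "tr_456"},  # illustrative values
)
status = ConductorClientManager.get_workflow_status(workflow_id)
```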

**Acceptance Criteria:**
- [ ] Can connect to Conductor server
- [ ] Can start a workflow
- [ ] Can retrieve workflow status
- [ ] Proper error handling for connection failures

**Dependencies:** INFRA-001

**Reference Files:**
- `docs/conductor-pipeline-mock/src/main.py`
- `docs/conductor-pipeline-mock/src/register_workflow.py`

---

### INFRA-003: Add Conductor Environment Configuration

**Description:**
Add environment variables for Conductor configuration to the settings module.

**Files to Modify:**
- `server/reflector/settings.py`
- `server/.env_template`

**Implementation Details:**
```python
# Add to settings.py
CONDUCTOR_SERVER_URL: str = "http://conductor:8080/api"
CONDUCTOR_DEBUG: bool = False
CONDUCTOR_ENABLED: bool = False  # Feature flag
CONDUCTOR_SHADOW_MODE: bool = False  # Run both Celery and Conductor
```
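
A matching sketch for `server/.env_template`; the variable names mirror the settings above and the values are local-development defaults (assumed, not prescribed):

```bash
# Conductor (template entries mirroring settings.py)
CONDUCTOR_SERVER_URL=http://conductor:8080/api
CONDUCTOR_DEBUG=false
CONDUCTOR_ENABLED=false
CONDUCTOR_SHADOW_MODE=false
```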

**Acceptance Criteria:**
- [ ] Settings load from environment variables
- [ ] Default values work for local development
- [ ] Docker container uses internal hostname

**Dependencies:** INFRA-001

**Reference Files:**
- `server/reflector/settings.py`

---

### INFRA-004: Create Health Check Endpoint for Conductor

**Description:**
Add an endpoint to check Conductor server connectivity and status.

**Files to Create:**
- `server/reflector/views/conductor.py`

**Files to Modify:**
- `server/reflector/app.py` (register router)

**Implementation Details:**
```python
from fastapi import APIRouter
from reflector.conductor.client import ConductorClientManager

router = APIRouter(prefix="/conductor", tags=["conductor"])


@router.get("/health")
async def conductor_health():
    try:
        client = ConductorClientManager.get_client()
        # Conductor SDK health check
        return {"status": "healthy", "connected": True}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}
```
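
A sketch of the router registration in `server/reflector/app.py`, assuming the module exposes a FastAPI instance named `app`:

```python
# server/reflector/app.py (sketch; assumes an existing FastAPI instance named `app`)
from reflector.views.conductor import router as conductor_router

app.include_router(conductor_router)
```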

**Acceptance Criteria:**
- [ ] Endpoint returns healthy when Conductor is up
- [ ] Endpoint returns unhealthy with error when Conductor is down
- [ ] Does not block on slow responses

**Dependencies:** INFRA-002

---

## Phase 2: Task Decomposition - Worker Definitions

### TASK-001: Create Task Definitions Registry Module

**Description:**
Create a module that registers all task definitions with the Conductor server on startup.

**Files to Create:**
- `server/reflector/conductor/tasks/__init__.py`
- `server/reflector/conductor/tasks/definitions.py`
- `server/reflector/conductor/tasks/register.py`

**Implementation Details:**

Task definition schema:
```python
TASK_DEFINITIONS = [
    {
        "name": "get_recording",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["recording_id"],
        "outputKeys": ["id", "mtg_session_id", "room_name", "duration"],
        "ownerEmail": "reflector@example.com",
    },
    # ... all other tasks
]
```
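
A minimal sketch for `register.py`, assuming Conductor's standard bulk endpoint `POST /metadata/taskdefs` (the same REST approach WFLOW-002 uses for workflow registration):

```python
# server/reflector/conductor/tasks/register.py (sketch)
import requests

from reflector.conductor.tasks.definitions import TASK_DEFINITIONS
from reflector.settings import settings


def register_task_definitions() -> None:
    """Register (or update) all task definitions with the Conductor server."""
    resp = requests.post(
        f"{settings.CONDUCTOR_SERVER_URL}/metadata/taskdefs",
        json=TASK_DEFINITIONS,
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    register_task_definitions()
```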

**Acceptance Criteria:**
- [ ] All 16 task types defined with correct timeouts
- [ ] Registration script runs successfully
- [ ] Tasks visible in Conductor UI

**Dependencies:** INFRA-002

**Reference Files:**
- `docs/conductor-pipeline-mock/src/register_workflow.py` (lines 10-112)
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (Module 5 section)

---

### TASK-002: Implement get_recording Worker

**Description:**
Create a Conductor worker that fetches recording metadata from the Daily.co API.

**Files to Create:**
- `server/reflector/conductor/workers/__init__.py`
- `server/reflector/conductor/workers/get_recording.py`

**Implementation Details:**
```python
from conductor.client.worker.worker_task import worker_task
from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from reflector.video_platforms.factory import create_platform_client


@worker_task(task_definition_name="get_recording")
async def get_recording(task: Task) -> TaskResult:
    recording_id = task.input_data.get("recording_id")

    async with create_platform_client("daily") as client:
        recording = await client.get_recording(recording_id)

    result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )
    result.status = TaskResultStatus.COMPLETED
    result.output_data = {
        "id": recording.id,
        "mtg_session_id": recording.mtgSessionId,
        "room_name": recording.roomName,
        "duration": recording.duration,
    }
    return result
```
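
The acceptance criteria below call for graceful handling of Daily.co API errors. One option (a sketch, not prescribed by the pipeline code) is to report failures explicitly so Conductor's retry policy takes over:

```python
# Sketch: explicit failure reporting so Conductor applies the task's retry policy.
# This would wrap the API call inside get_recording() above.
try:
    async with create_platform_client("daily") as client:
        recording = await client.get_recording(recording_id)
except Exception as exc:
    result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )
    result.status = TaskResultStatus.FAILED
    result.reason_for_incompletion = str(exc)  # reported back to Conductor
    return result
```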

**Input Contract:**
```json
{"recording_id": "string"}
```

**Output Contract:**
```json
{"id": "string", "mtg_session_id": "string", "room_name": "string", "duration": "number"}
```

**Acceptance Criteria:**
- [ ] Worker polls for tasks correctly
- [ ] Handles Daily.co API errors gracefully
- [ ] Returns correct output schema
- [ ] Timeout: 60s, Response timeout: 30s, Retries: 3

**Dependencies:** TASK-001

**Reference Files:**
- `server/reflector/worker/process.py` (lines 218-294)
- `docs/conductor-pipeline-mock/src/workers.py` (lines 13-26)

---

### TASK-003: Implement get_participants Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that fetches meeting participants from the Daily.co API.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/get_participants.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="get_participants")
|
|
async def get_participants(task: Task) -> TaskResult:
|
|
mtg_session_id = task.input_data.get("mtg_session_id")
|
|
|
|
async with create_platform_client("daily") as client:
|
|
payload = await client.get_meeting_participants(mtg_session_id)
|
|
|
|
participants = [
|
|
{"participant_id": p.participant_id, "user_name": p.user_name, "user_id": p.user_id}
|
|
for p in payload.data
|
|
]
|
|
|
|
result = TaskResult(...)
|
|
result.output_data = {"participants": participants}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"mtg_session_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"participants": [{"participant_id": "string", "user_name": "string", "user_id": "string|null"}]}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Fetches participants from Daily.co API
|
|
- [ ] Maps participant IDs to names correctly
|
|
- [ ] Handles missing mtg_session_id
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 513-596)
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 29-42)
|
|
|
|
---
|
|
|
|
### TASK-004a: Implement pad_track - Extract Stream Metadata
|
|
|
|
**Description:**
|
|
Extract stream.start_time from WebM container metadata for timestamp alignment.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/pad_track.py` (partial - metadata extraction)
|
|
|
|
**Implementation Details:**
|
|
```python
import av


def _extract_stream_start_time_from_container(source_url: str) -> float:
    """Extract start_time from WebM stream metadata using PyAV."""
    container = av.open(source_url, options={
        "reconnect": "1",
        "reconnect_streamed": "1",
        "reconnect_delay_max": "30",
    })
    try:
        audio_stream = container.streams.audio[0]
        # Missing/invalid start_time: fall back to 0 (no padding required)
        if audio_stream.start_time is None:
            return 0.0
        return float(audio_stream.start_time * audio_stream.time_base)
    finally:
        container.close()
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Opens WebM container from S3 presigned URL
|
|
- [ ] Extracts start_time from audio stream metadata
|
|
- [ ] Handles missing/invalid start_time (returns 0)
|
|
- [ ] Closes container properly
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 56-85)
|
|
- `_extract_stream_start_time_from_container()` method
|
|
|
|
---
|
|
|
|
### TASK-004b: Implement pad_track - PyAV Padding Filter
|
|
|
|
**Description:**
|
|
Apply adelay filter using PyAV filter graph to pad audio with silence.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/conductor/workers/pad_track.py` (add filter logic)
|
|
|
|
**Implementation Details:**
|
|
```python
import math

import av


def _apply_audio_padding_to_file(in_container, output_path: str, start_time_seconds: float):
    """Apply adelay filter to pad audio with silence."""
    delay_ms = math.floor(start_time_seconds * 1000)

    # abuffer args are derived from the input stream (time base, rate, format, layout)
    audio_stream = in_container.streams.audio[0]
    ctx = audio_stream.codec_context
    abuf_args = (
        f"time_base={audio_stream.time_base}"
        f":sample_rate={ctx.sample_rate}"
        f":sample_fmt={ctx.format.name}"
        f":channel_layout={ctx.layout.name}"
    )

    graph = av.filter.Graph()
    src = graph.add("abuffer", args=abuf_args, name="src")
    aresample_f = graph.add("aresample", args="async=1", name="ares")
    delays_arg = f"{delay_ms}|{delay_ms}"
    adelay_f = graph.add("adelay", args=f"delays={delays_arg}:all=1", name="delay")
    sink = graph.add("abuffersink", name="sink")

    src.link_to(aresample_f)
    aresample_f.link_to(adelay_f)
    adelay_f.link_to(sink)
    graph.configure()

    # Process frames through filter graph
    # Write to output file
```
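
The elided frame-processing step could look like the sketch below: decode from `in_container`, push frames through the graph, and encode to `output_path`. The Opus/WebM encoder setup and the PyAV blocking/EOF handling are assumptions, not the pipeline's actual code; a format conversion stage may also be needed depending on the encoder's accepted sample formats.

```python
# Sketch: push decoded frames through `graph` and encode the padded audio to output_path.
out_container = av.open(output_path, mode="w")
out_stream = out_container.add_stream("libopus", rate=48000)

for frame in in_container.decode(audio=0):
    graph.push(frame)
    while True:
        try:
            out_frame = graph.pull()
        except (BlockingIOError, EOFError):
            break  # graph needs more input (or is drained)
        for packet in out_stream.encode(out_frame):
            out_container.mux(packet)

# Flush the encoder and close the output container
for packet in out_stream.encode(None):
    out_container.mux(packet)
out_container.close()
```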
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Constructs correct filter graph chain
|
|
- [ ] Calculates delay_ms correctly (start_time * 1000)
|
|
- [ ] Handles stereo audio (delay per channel)
|
|
- [ ] Edge case: skip if start_time <= 0
|
|
|
|
**Dependencies:** TASK-004a
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 87-188)
|
|
- `_apply_audio_padding_to_file()` method
|
|
|
|
**Technical Notes:**
|
|
- Filter chain: `abuffer` -> `aresample` -> `adelay` -> `abuffersink`
|
|
- adelay format: `delays={ms}|{ms}:all=1`
|
|
|
|
---
|
|
|
|
### TASK-004c: Implement pad_track - S3 Upload
|
|
|
|
**Description:**
|
|
Complete the pad_track worker by uploading padded file to S3 and returning presigned URL.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/conductor/workers/pad_track.py` (complete worker)
|
|
|
|
**Implementation Details:**
|
|
```python
@worker_task(task_definition_name="pad_track")
async def pad_track(task: Task) -> TaskResult:
    track_index = task.input_data.get("track_index")
    s3_key = task.input_data.get("s3_key")
    bucket_name = task.input_data.get("bucket_name")
    transcript_id = task.input_data.get("transcript_id")

    storage = get_transcripts_storage()
    source_url = await storage.get_file_url(s3_key, expires_in=7200, bucket=bucket_name)

    # Use helpers from TASK-004a and TASK-004b (signature as defined in TASK-004b)
    start_time = _extract_stream_start_time_from_container(source_url)
    padded_path = f"/tmp/{transcript_id}_padded_track_{track_index}.webm"  # illustrative temp path
    with av.open(source_url, options={"reconnect": "1", "reconnect_streamed": "1"}) as in_container:
        _apply_audio_padding_to_file(in_container, padded_path, start_time)
    file_size = os.path.getsize(padded_path)

    # Upload to S3
    storage_key = f"{transcript_id}/padded_track_{track_index}.webm"
    await storage.put_file(storage_key, padded_path)
    padded_url = await storage.get_file_url(storage_key, expires_in=7200)

    result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )
    result.status = TaskResultStatus.COMPLETED
    result.output_data = {"padded_url": padded_url, "size": file_size, "track_index": track_index}
    return result
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"track_index": "number", "s3_key": "string", "bucket_name": "string", "transcript_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"padded_url": "string", "size": "number", "track_index": "number"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Uploads padded file to S3
|
|
- [ ] Returns presigned URL (7200s expiry)
|
|
- [ ] Timeout: 300s, Response timeout: 120s, Retries: 3
|
|
|
|
**Dependencies:** TASK-004b
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 190-210)
|
|
|
|
---
|
|
|
|
### TASK-005a: Implement mixdown_tracks - Build Filter Graph
|
|
|
|
**Description:**
|
|
Build PyAV filter graph for mixing N audio tracks with amix filter.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/mixdown_tracks.py` (partial - filter graph)
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
def _build_mixdown_filter_graph(containers: list, out_stream) -> av.filter.Graph:
|
|
"""Build filter graph: N abuffer -> amix -> aformat -> sink."""
|
|
graph = av.filter.Graph()
|
|
|
|
# Create abuffer for each input
|
|
abuffers = []
|
|
for i, container in enumerate(containers):
|
|
audio_stream = container.streams.audio[0]
|
|
abuf_args = f"time_base={...}:sample_rate=48000:sample_fmt=fltp:channel_layout=stereo"
|
|
abuffers.append(graph.add("abuffer", args=abuf_args, name=f"src{i}"))
|
|
|
|
# amix with normalize=0 to prevent volume reduction
|
|
amix = graph.add("amix", args=f"inputs={len(containers)}:normalize=0", name="amix")
|
|
aformat = graph.add("aformat", args="sample_fmts=s16:channel_layouts=stereo", name="aformat")
|
|
sink = graph.add("abuffersink", name="sink")
|
|
|
|
# Link all sources to amix
|
|
for abuf in abuffers:
|
|
abuf.link_to(amix)
|
|
amix.link_to(aformat)
|
|
aformat.link_to(sink)
|
|
graph.configure()
|
|
return graph
|
|
```
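
One caveat about the linking loop above: PyAV's `FilterContext.link_to` defaults to input pad 0, so each abuffer should be wired to its own `amix` input pad. A hedged correction of that portion:

```python
# Sketch: link each source to a distinct amix input pad (link_to defaults to pad 0).
for i, abuf in enumerate(abuffers):
    abuf.link_to(amix, 0, i)  # (output_idx=0, input_idx=i)
amix.link_to(aformat)
aformat.link_to(sink)
graph.configure()
```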
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Creates abuffer per input track
|
|
- [ ] Uses amix with normalize=0
|
|
- [ ] Outputs stereo s16 format
|
|
- [ ] Handles variable number of inputs (1-N tracks)
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 324-420)
|
|
|
|
**Technical Notes:**
|
|
- amix normalize=0 prevents volume reduction when mixing
|
|
- Output format: stereo, s16 for MP3 encoding
|
|
|
|
---
|
|
|
|
### TASK-005b: Implement mixdown_tracks - S3 Streaming and Upload
|
|
|
|
**Description:**
|
|
Complete mixdown worker with S3 streaming input and upload output.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/conductor/workers/mixdown_tracks.py` (complete worker)
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="mixdown_tracks")
|
|
async def mixdown_tracks(task: Task) -> TaskResult:
|
|
padded_urls = task.input_data.get("padded_urls", [])
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
# Open containers with reconnect options for S3 streaming
|
|
containers = []
|
|
for url in padded_urls:
|
|
containers.append(av.open(url, options={
|
|
"reconnect": "1", "reconnect_streamed": "1", "reconnect_delay_max": "30"
|
|
}))
|
|
|
|
# Build filter graph and process
|
|
graph = _build_mixdown_filter_graph(containers, ...)
|
|
# Encode to MP3 and upload
|
|
|
|
storage = get_transcripts_storage()
|
|
storage_path = f"{transcript_id}/audio.mp3"
|
|
await storage.put_file(storage_path, mp3_file)
|
|
|
|
result.output_data = {"audio_key": storage_path, "duration": duration, "size": file_size}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"padded_urls": ["string"], "transcript_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"audio_key": "string", "duration": "number", "size": "number"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Opens all padded tracks via presigned URLs
|
|
- [ ] Handles S3 streaming with reconnect options
|
|
- [ ] Encodes to MP3 format
|
|
- [ ] Uploads to `{transcript_id}/audio.mp3`
|
|
- [ ] Returns duration for broadcast
|
|
- [ ] Timeout: 600s, Response timeout: 300s, Retries: 3
|
|
|
|
**Dependencies:** TASK-005a
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 420-498)
|
|
|
|
---
|
|
|
|
### TASK-006: Implement generate_waveform Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that generates waveform visualization data from the mixed audio.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/generate_waveform.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="generate_waveform")
|
|
async def generate_waveform(task: Task) -> TaskResult:
|
|
audio_key = task.input_data.get("audio_key")
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
# Use AudioWaveformProcessor to generate peaks
|
|
# This processor uses librosa/scipy internally
|
|
|
|
result.output_data = {"waveform": waveform_peaks}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"audio_key": "string", "transcript_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"waveform": ["number"]}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Generates waveform peaks array
|
|
- [ ] Broadcasts WAVEFORM event to WebSocket
|
|
- [ ] Stores waveform JSON locally
|
|
- [ ] Timeout: 120s, Response timeout: 60s, Retries: 3
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 670-678)
|
|
- `server/reflector/processors/audio_waveform_processor.py`
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 79-92)
|
|
|
|
---
|
|
|
|
### TASK-007: Implement transcribe_track Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that transcribes a single audio track using GPU (Modal.com) or local Whisper.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/transcribe_track.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="transcribe_track")
|
|
async def transcribe_track(task: Task) -> TaskResult:
|
|
track_index = task.input_data.get("track_index")
|
|
audio_url = task.input_data.get("audio_url")
|
|
language = task.input_data.get("language", "en")
|
|
|
|
transcript = await transcribe_file_with_processor(audio_url, language)
|
|
|
|
# Tag all words with speaker index
|
|
for word in transcript.words:
|
|
word.speaker = track_index
|
|
|
|
result.output_data = {
|
|
"words": [w.model_dump() for w in transcript.words],
|
|
"track_index": track_index,
|
|
}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{
|
|
"track_index": "number",
|
|
"audio_url": "string",
|
|
"language": "string"
|
|
}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{
|
|
"words": [{"word": "string", "start": "number", "end": "number", "speaker": "number"}],
|
|
"track_index": "number"
|
|
}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Calls Modal.com GPU transcription service
|
|
- [ ] Tags words with correct speaker index
|
|
- [ ] Handles empty transcription results
|
|
- [ ] Timeout: 1800s, Response timeout: 900s, Retries: 3
|
|
|
|
**Dependencies:** TASK-001, CACHE-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 747-748)
|
|
- `server/reflector/pipelines/transcription_helpers.py`
|
|
- `server/reflector/processors/file_transcript_auto.py`
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 95-109)
|
|
|
|
**Technical Notes:**
|
|
- This is the most expensive operation (GPU time)
|
|
- Should implement caching to avoid re-transcription on retries (see CACHE-002)
|
|
- Environment variable: `TRANSCRIPT_MODAL_API_KEY`
|
|
|
|
---
|
|
|
|
### TASK-008: Implement merge_transcripts Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that merges multiple track transcriptions into a single timeline sorted by timestamp.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/merge_transcripts.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="merge_transcripts")
|
|
async def merge_transcripts(task: Task) -> TaskResult:
|
|
transcripts = task.input_data.get("transcripts", [])
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
all_words = []
|
|
for t in transcripts:
|
|
if isinstance(t, dict) and "words" in t:
|
|
all_words.extend(t["words"])
|
|
|
|
# Sort by start timestamp
|
|
all_words.sort(key=lambda w: w.get("start", 0))
|
|
|
|
# Broadcast TRANSCRIPT event
|
|
await broadcast_transcript_event(transcript_id, all_words)
|
|
|
|
result.output_data = {
|
|
"all_words": all_words,
|
|
"word_count": len(all_words),
|
|
}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{
|
|
"transcripts": [{"words": [...]}],
|
|
"transcript_id": "string"
|
|
}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"all_words": [...], "word_count": "number"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Merges words from all tracks
|
|
- [ ] Sorts by start timestamp
|
|
- [ ] Preserves speaker attribution
|
|
- [ ] Broadcasts TRANSCRIPT event
|
|
- [ ] Updates transcript.events in DB
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 727-736)
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 112-131)
|
|
|
|
---
|
|
|
|
### TASK-009: Implement detect_topics Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that detects topics using LLM calls.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/detect_topics.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="detect_topics")
|
|
async def detect_topics(task: Task) -> TaskResult:
|
|
words = task.input_data.get("words", [])
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
target_language = task.input_data.get("target_language", "en")
|
|
|
|
# Uses TranscriptTopicDetectorProcessor
|
|
# Chunks words into groups of 300, calls LLM per chunk
|
|
topics = await topic_processing.detect_topics(
|
|
TranscriptType(words=words),
|
|
target_language,
|
|
on_topic_callback=lambda t: broadcast_topic_event(transcript_id, t),
|
|
empty_pipeline=EmptyPipeline(logger),
|
|
)
|
|
|
|
result.output_data = {
|
|
"topics": [t.model_dump() for t in topics]
|
|
}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{
|
|
"words": [...],
|
|
"transcript_id": "string",
|
|
"target_language": "string"
|
|
}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"topics": [{"id": "string", "title": "string", "summary": "string", "timestamp": "number", "duration": "number"}]}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Chunks words in groups of 300
|
|
- [ ] Calls LLM for each chunk
|
|
- [ ] Broadcasts TOPIC event for each detected topic
|
|
- [ ] Returns complete topics list
|
|
- [ ] Timeout: 300s, Response timeout: 120s, Retries: 3
|
|
|
|
**Dependencies:** TASK-001, CACHE-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/topic_processing.py` (lines 34-63)
|
|
- `server/reflector/processors/transcript_topic_detector.py`
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 134-147)
|
|
|
|
**Technical Notes:**
|
|
- Number of LLM calls: `ceil(word_count / 300)`
|
|
- Uses `TranscriptTopicDetectorProcessor`
|
|
|
|
---
|
|
|
|
### TASK-010: Implement generate_title Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that generates a meeting title from detected topics using LLM.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/generate_title.py`
|
|
|
|
**Implementation Details:**
|
|
```python
@worker_task(task_definition_name="generate_title")
async def generate_title(task: Task) -> TaskResult:
    topics = task.input_data.get("topics", [])
    transcript_id = task.input_data.get("transcript_id")

    # Build the TaskResult up front (same pattern as get_recording)
    result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )
    result.status = TaskResultStatus.COMPLETED

    if not topics:
        result.output_data = {"title": "Untitled Meeting"}
        return result

    # Uses TranscriptFinalTitleProcessor
    title = await topic_processing.generate_title(
        topics,
        on_title_callback=lambda t: broadcast_title_event(transcript_id, t),
        empty_pipeline=EmptyPipeline(logger),
        logger=logger,
    )

    result.output_data = {"title": title}
    return result
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"topics": [...], "transcript_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"title": "string"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Generates title from topic summaries
|
|
- [ ] Broadcasts FINAL_TITLE event
|
|
- [ ] Updates transcript.title in DB
|
|
- [ ] Handles empty topics list
|
|
- [ ] Timeout: 60s, Response timeout: 30s, Retries: 3
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/topic_processing.py` (lines 66-84)
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 760-766)
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 150-163)
|
|
|
|
---
|
|
|
|
### TASK-011: Implement generate_summary Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that generates long and short summaries from topics and words using LLM.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/generate_summary.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="generate_summary")
|
|
async def generate_summary(task: Task) -> TaskResult:
|
|
words = task.input_data.get("words", [])
|
|
topics = task.input_data.get("topics", [])
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
transcript = await transcripts_controller.get_by_id(transcript_id)
|
|
|
|
# Uses TranscriptFinalSummaryProcessor
|
|
await topic_processing.generate_summaries(
|
|
topics, transcript,
|
|
on_long_summary_callback=lambda s: broadcast_long_summary_event(transcript_id, s),
|
|
on_short_summary_callback=lambda s: broadcast_short_summary_event(transcript_id, s),
|
|
empty_pipeline=EmptyPipeline(logger),
|
|
logger=logger,
|
|
)
|
|
|
|
result.output_data = {
|
|
"summary": long_summary,
|
|
"short_summary": short_summary,
|
|
}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{
|
|
"words": [...],
|
|
"topics": [...],
|
|
"transcript_id": "string"
|
|
}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"summary": "string", "short_summary": "string"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Generates long summary
|
|
- [ ] Generates short summary
|
|
- [ ] Broadcasts FINAL_LONG_SUMMARY event
|
|
- [ ] Broadcasts FINAL_SHORT_SUMMARY event
|
|
- [ ] Updates transcript.long_summary and transcript.short_summary in DB
|
|
- [ ] Timeout: 300s, Response timeout: 120s, Retries: 3
|
|
|
|
**Dependencies:** TASK-001, CACHE-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/topic_processing.py` (lines 86-109)
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 768-777)
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 166-180)
|
|
|
|
**Technical Notes:**
|
|
- LLM calls: 2 + 2*M where M = number of subjects (max 6)
|
|
|
|
---
|
|
|
|
### TASK-012: Implement finalize Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that finalizes the transcript status and updates the database.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/finalize.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="finalize")
|
|
async def finalize(task: Task) -> TaskResult:
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
title = task.input_data.get("title")
|
|
summary = task.input_data.get("summary")
|
|
short_summary = task.input_data.get("short_summary")
|
|
duration = task.input_data.get("duration")
|
|
|
|
transcript = await transcripts_controller.get_by_id(transcript_id)
|
|
await transcripts_controller.update(transcript, {
|
|
"status": "ended",
|
|
"title": title,
|
|
"long_summary": summary,
|
|
"short_summary": short_summary,
|
|
"duration": duration,
|
|
})
|
|
|
|
# Broadcast STATUS event
|
|
await broadcast_status_event(transcript_id, "ended")
|
|
|
|
result.output_data = {"status": "COMPLETED"}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{
|
|
"transcript_id": "string",
|
|
"title": "string",
|
|
"summary": "string",
|
|
"short_summary": "string",
|
|
"duration": "number"
|
|
}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"status": "string"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Updates transcript status to "ended"
|
|
- [ ] Persists title, summaries, duration
|
|
- [ ] Broadcasts STATUS event with "ended"
|
|
- [ ] Idempotent (can be retried safely)
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (lines 745, 787-791)
|
|
- `docs/conductor-pipeline-mock/src/workers.py` (lines 183-196)
|
|
|
|
---
|
|
|
|
### TASK-013: Implement cleanup_consent Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that checks participant consent and deletes audio if denied.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/cleanup_consent.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="cleanup_consent")
|
|
async def cleanup_consent(task: Task) -> TaskResult:
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
# Check if any participant denied consent
|
|
# Delete audio from S3 if so
|
|
# Implementation mirrors task_cleanup_consent from main_live_pipeline
|
|
|
|
result.output_data = {
|
|
"audio_deleted": deleted,
|
|
"reason": reason,
|
|
}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"transcript_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"audio_deleted": "boolean", "reason": "string|null"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Checks all participant consent statuses
|
|
- [ ] Deletes audio from S3 if any denied
|
|
- [ ] Updates transcript.audio_deleted flag
|
|
- [ ] Idempotent deletes
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_live_pipeline.py` - `task_cleanup_consent`
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (line 794)
|
|
|
|
---
|
|
|
|
### TASK-014: Implement post_zulip Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that posts or updates a Zulip message with the transcript summary.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/post_zulip.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="post_zulip")
|
|
async def post_zulip(task: Task) -> TaskResult:
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
# Uses existing Zulip integration
|
|
# Post new message or update existing using message_id
|
|
|
|
result.output_data = {"message_id": message_id}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"transcript_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"message_id": "string|null"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Posts to configured Zulip channel
|
|
- [ ] Updates existing message if message_id exists
|
|
- [ ] Handles Zulip API errors gracefully
|
|
- [ ] Timeout: 60s, Response timeout: 30s, Retries: 5
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/pipelines/main_live_pipeline.py` - `task_pipeline_post_to_zulip`
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (line 795)
|
|
- `server/reflector/zulip.py`
|
|
|
|
---
|
|
|
|
### TASK-015: Implement send_webhook Worker
|
|
|
|
**Description:**
|
|
Create a Conductor worker that sends the transcript completion webhook to the configured URL.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/send_webhook.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="send_webhook")
|
|
async def send_webhook(task: Task) -> TaskResult:
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
room_id = task.input_data.get("room_id")
|
|
|
|
# Uses existing webhook logic from webhook.py
|
|
# Includes HMAC signature if secret configured
|
|
|
|
result.output_data = {
|
|
"sent": success,
|
|
"status_code": status_code,
|
|
}
|
|
return result
|
|
```
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{"transcript_id": "string", "room_id": "string"}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{"sent": "boolean", "status_code": "number|null"}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Sends webhook with correct payload schema
|
|
- [ ] Includes HMAC signature
|
|
- [ ] Retries on 5xx, not on 4xx
|
|
- [ ] Timeout: 60s, Response timeout: 30s, Retries: 30
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/worker/webhook.py`
|
|
- `server/reflector/pipelines/main_file_pipeline.py` - `task_send_webhook_if_needed`
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` (line 796)
|
|
|
|
---
|
|
|
|
### TASK-016: Implement generate_dynamic_fork_tasks Helper
|
|
|
|
**Description:**
|
|
Create a helper worker that generates dynamic task definitions for FORK_JOIN_DYNAMIC. This is required because Conductor's FORK_JOIN_DYNAMIC needs pre-computed task lists and input maps.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workers/generate_dynamic_fork_tasks.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
@worker_task(task_definition_name="generate_dynamic_fork_tasks")
|
|
def generate_dynamic_fork_tasks(task: Task) -> TaskResult:
|
|
tracks = task.input_data.get("tracks", [])
|
|
task_type = task.input_data.get("task_type") # "pad_track" or "transcribe_track"
|
|
transcript_id = task.input_data.get("transcript_id")
|
|
|
|
tasks = []
|
|
inputs = {}
|
|
for idx, track in enumerate(tracks):
|
|
ref_name = f"{task_type}_{idx}"
|
|
tasks.append({
|
|
"name": task_type,
|
|
"taskReferenceName": ref_name,
|
|
"type": "SIMPLE"
|
|
})
|
|
inputs[ref_name] = {
|
|
"track_index": idx,
|
|
"transcript_id": transcript_id,
|
|
# Additional task-specific inputs based on task_type
|
|
}
|
|
|
|
result.output_data = {"tasks": tasks, "inputs": inputs}
|
|
return result
|
|
```
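
For illustration, two tracks with `task_type="pad_track"` would yield output shaped like this (values illustrative):

```json
{
  "tasks": [
    {"name": "pad_track", "taskReferenceName": "pad_track_0", "type": "SIMPLE"},
    {"name": "pad_track", "taskReferenceName": "pad_track_1", "type": "SIMPLE"}
  ],
  "inputs": {
    "pad_track_0": {"track_index": 0, "transcript_id": "tr_123"},
    "pad_track_1": {"track_index": 1, "transcript_id": "tr_123"}
  }
}
```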
|
|
|
|
**Input Contract:**
|
|
```json
|
|
{
|
|
"tracks": [{"s3_key": "string"}],
|
|
"task_type": "pad_track" | "transcribe_track",
|
|
"transcript_id": "string",
|
|
"bucket_name": "string"
|
|
}
|
|
```
|
|
|
|
**Output Contract:**
|
|
```json
|
|
{
|
|
"tasks": [{"name": "string", "taskReferenceName": "string", "type": "SIMPLE"}],
|
|
"inputs": {"ref_name": {...input_data...}}
|
|
}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Generates correct task list for variable track counts (1, 2, ... N)
|
|
- [ ] Generates correct input map with task-specific parameters
|
|
- [ ] Supports both pad_track and transcribe_track task types
|
|
- [ ] Timeout: 30s, Response timeout: 15s, Retries: 3
|
|
|
|
**Dependencies:** TASK-001
|
|
|
|
**Technical Notes:**
|
|
- This helper is required because FORK_JOIN_DYNAMIC expects `dynamicTasks` and `dynamicTasksInput` parameters
|
|
- The workflow uses this helper twice: once for padding, once for transcription
|
|
- Each invocation has different task_type and additional inputs
|
|
|
|
---
|
|
|
|
## Phase 2 (Continued): State Management
|
|
|
|
### STATE-001: Add workflow_id to Recording Model
|
|
|
|
**Description:**
|
|
Add a `workflow_id` field to the Recording model to track the Conductor workflow associated with each recording.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/db/recordings.py`
|
|
- Create migration file
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
# In Recording model
|
|
workflow_id: Optional[str] = Column(String, nullable=True, index=True)
|
|
```
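
A sketch of the corresponding migration, assuming the project's migrations are managed with Alembic and the table is named `recordings` (both assumptions):

```python
# Migration sketch (assumed Alembic tooling): nullable, indexed workflow_id column.
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    op.add_column("recordings", sa.Column("workflow_id", sa.String(), nullable=True))
    op.create_index("ix_recordings_workflow_id", "recordings", ["workflow_id"])


def downgrade() -> None:
    op.drop_index("ix_recordings_workflow_id", table_name="recordings")
    op.drop_column("recordings", "workflow_id")
```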
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Migration adds nullable workflow_id column
|
|
- [ ] Index created for workflow_id lookups
|
|
- [ ] Recording can be queried by workflow_id
|
|
|
|
**Dependencies:** INFRA-002
|
|
|
|
**Reference Files:**
|
|
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (Module 7: State Management)
|
|
|
|
---
|
|
|
|
## Phase 3: Workflow Definition
|
|
|
|
### WFLOW-001: Create Workflow Definition JSON with FORK_JOIN_DYNAMIC
|
|
|
|
**Description:**
|
|
Define the complete workflow DAG in Conductor's workflow definition format, including dynamic forking for variable track counts.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workflows/diarization_pipeline.json`
|
|
|
|
**Implementation Details:**
|
|
|
|
The workflow must include:
|
|
1. Sequential: get_recording -> get_participants
|
|
2. FORK_JOIN_DYNAMIC: pad_track for each track
|
|
3. Sequential: mixdown_tracks -> generate_waveform
|
|
4. FORK_JOIN_DYNAMIC: transcribe_track for each track (parallel!)
|
|
5. Sequential: merge_transcripts -> detect_topics
|
|
6. FORK_JOIN: generate_title || generate_summary
|
|
7. Sequential: finalize -> cleanup_consent -> post_zulip -> send_webhook
|
|
|
|
**FORK_JOIN_DYNAMIC Pattern:**
|
|
```json
|
|
{
|
|
"name": "fork_track_padding",
|
|
"taskReferenceName": "fork_track_padding",
|
|
"type": "FORK_JOIN_DYNAMIC",
|
|
"inputParameters": {
|
|
"dynamicTasks": "${generate_padding_tasks.output.tasks}",
|
|
"dynamicTasksInput": "${generate_padding_tasks.output.inputs}"
|
|
},
|
|
"dynamicForkTasksParam": "dynamicTasks",
|
|
"dynamicForkTasksInputParamName": "dynamicTasksInput"
|
|
}
|
|
```
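
Each FORK_JOIN_DYNAMIC must be followed by a JOIN task that collects the parallel results (see the acceptance criteria below); a minimal sketch of the matching JOIN (reference names are illustrative):

```json
{
  "name": "join_track_padding",
  "taskReferenceName": "join_track_padding",
  "type": "JOIN"
}
```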
|
|
|
|
This requires a helper task that generates the dynamic fork structure based on track count.
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Valid Conductor workflow schema
|
|
- [ ] All task references match registered task definitions
|
|
- [ ] Input/output parameter mappings correct
|
|
- [ ] FORK_JOIN_DYNAMIC works with 1, 2, ... N tracks
|
|
- [ ] JOIN correctly collects all parallel results
|
|
- [ ] DAG renders correctly in Conductor UI
|
|
|
|
**Dependencies:** TASK-002 through TASK-015
|
|
|
|
**Reference Files:**
|
|
- `docs/conductor-pipeline-mock/src/register_workflow.py` (lines 125-304)
|
|
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (Module 3 section, Target Architecture diagram)
|
|
|
|
---
|
|
|
|
### WFLOW-002: Implement Workflow Registration Script
|
|
|
|
**Description:**
|
|
Create a script that registers the workflow definition with the Conductor server.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/workflows/register.py`
|
|
|
|
**Implementation Details:**
|
|
```python
import json

import requests

from reflector.settings import settings


def register_workflow():
    with open("diarization_pipeline.json") as f:
        workflow = json.load(f)

    resp = requests.put(
        f"{settings.CONDUCTOR_SERVER_URL}/metadata/workflow",
        json=[workflow],
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Workflow visible in Conductor UI
|
|
- [ ] Can start workflow via API
|
|
- [ ] DAG renders correctly in UI
|
|
|
|
**Dependencies:** WFLOW-001
|
|
|
|
**Reference Files:**
|
|
- `docs/conductor-pipeline-mock/src/register_workflow.py` (lines 317-327)
|
|
|
|
---
|
|
|
|
## Phase 2 (Continued): WebSocket Events
|
|
|
|
### EVENT-001: Add PIPELINE_PROGRESS WebSocket Event
|
|
|
|
**Description:**
|
|
Define a new WebSocket event type for granular pipeline progress tracking.
|
|
|
|
**⚠️ Note:** Requires separate frontend ticket to add UI consumer for this event.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/db/transcripts.py` (add event type)
|
|
- `server/reflector/ws_manager.py` (ensure broadcast support)
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
# New event schema
|
|
class PipelineProgressEvent(BaseModel):
|
|
event: str = "PIPELINE_PROGRESS"
|
|
data: PipelineProgressData
|
|
|
|
class PipelineProgressData(BaseModel):
|
|
workflow_id: str
|
|
current_step: str
|
|
step_index: int
|
|
total_steps: int
|
|
step_status: Literal["pending", "in_progress", "completed", "failed"]
|
|
```
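
An example payload as it would appear on the WebSocket, matching the schema above (values illustrative; step numbers mirror the EVENT-002 example):

```json
{
  "event": "PIPELINE_PROGRESS",
  "data": {
    "workflow_id": "wf_abc123",
    "current_step": "transcribe_track",
    "step_index": 6,
    "total_steps": 14,
    "step_status": "in_progress"
  }
}
```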
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Event schema defined
|
|
- [ ] Works with existing WebSocket infrastructure
|
|
- [ ] Frontend ticket created for progress UI consumer
|
|
|
|
**Dependencies:** None
|
|
|
|
**Reference Files:**
|
|
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (Module 6 section)
|
|
- `server/reflector/pipelines/main_live_pipeline.py` (broadcast_to_sockets decorator)
|
|
|
|
---
|
|
|
|
### EVENT-002: Emit Progress Events from Workers
|
|
|
|
**Description:**
|
|
Modify workers to emit PIPELINE_PROGRESS events at start and completion of each task.
|
|
|
|
**⚠️ Note:** Requires separate frontend ticket to add UI consumer (see EVENT-001).
|
|
|
|
**Files to Modify:**
|
|
- All worker files in `server/reflector/conductor/workers/`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
async def emit_progress(transcript_id: str, step: str, status: str, index: int, total: int):
|
|
ws_manager = get_ws_manager()
|
|
await ws_manager.send_json(
|
|
room_id=f"ts:{transcript_id}",
|
|
message={
|
|
"event": "PIPELINE_PROGRESS",
|
|
"data": {
|
|
"current_step": step,
|
|
"step_index": index,
|
|
"total_steps": total,
|
|
"step_status": status,
|
|
}
|
|
}
|
|
)
|
|
|
|
@worker_task(task_definition_name="transcribe_track")
|
|
async def transcribe_track(task: Task) -> TaskResult:
|
|
await emit_progress(transcript_id, "transcribe_track", "in_progress", 6, 14)
|
|
# ... processing ...
|
|
await emit_progress(transcript_id, "transcribe_track", "completed", 6, 14)
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Progress emitted at task start
|
|
- [ ] Progress emitted at task completion
|
|
|
|
**Dependencies:** EVENT-001, TASK-002 through TASK-015
|
|
|
|
---
|
|
|
|
## Phase 4: Integration
|
|
|
|
### INTEG-001: Modify Pipeline Trigger to Start Conductor Workflow
|
|
|
|
**Description:**
|
|
Replace `task_pipeline_multitrack_process.delay()` with Conductor workflow start in `process_multitrack_recording`.
|
|
This single change captures BOTH webhook AND polling entry paths, since both converge at this function.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/worker/process.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
# In _process_multitrack_recording_inner(), around line 289
|
|
# Replace:
|
|
# task_pipeline_multitrack_process.delay(
|
|
# transcript_id=transcript.id,
|
|
# bucket_name=bucket_name,
|
|
# track_keys=filter_cam_audio_tracks(track_keys),
|
|
# )
|
|
# With:
|
|
|
|
if settings.CONDUCTOR_ENABLED:
|
|
from reflector.conductor.client import ConductorClientManager
|
|
from reflector.db.recordings import recordings_controller
|
|
|
|
workflow_id = ConductorClientManager.start_workflow(
|
|
name="diarization_pipeline",
|
|
version=1,
|
|
input_data={
|
|
"recording_id": recording_id,
|
|
"room_name": daily_room_name,
|
|
"tracks": [{"s3_key": k} for k in filter_cam_audio_tracks(track_keys)],
|
|
"bucket_name": bucket_name,
|
|
"transcript_id": transcript.id,
|
|
"room_id": room.id,
|
|
}
|
|
)
|
|
logger.info("Started Conductor workflow", workflow_id=workflow_id, transcript_id=transcript.id)
|
|
|
|
# Store workflow_id on recording for status tracking
|
|
await recordings_controller.update(recording, {"workflow_id": workflow_id})
|
|
|
|
if not settings.CONDUCTOR_SHADOW_MODE:
|
|
return # Don't trigger Celery
|
|
|
|
# Existing Celery trigger (runs in shadow mode or when Conductor disabled)
|
|
task_pipeline_multitrack_process.delay(
|
|
transcript_id=transcript.id,
|
|
bucket_name=bucket_name,
|
|
track_keys=filter_cam_audio_tracks(track_keys),
|
|
)
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Conductor workflow started from process_multitrack_recording
|
|
- [ ] Workflow ID stored on Recording model
|
|
- [ ] Both webhook and polling paths covered (single integration point)
|
|
- [ ] Celery still triggered in shadow mode
|
|
|
|
**Dependencies:** WFLOW-002, STATE-001
|
|
|
|
**Reference Files:**
|
|
- `server/reflector/worker/process.py` (lines 172-293)
|
|
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (Module 4 section)
|
|
|
|
---
|
|
|
|
### SHADOW-001: Implement Shadow Mode Toggle
|
|
|
|
**Description:**
|
|
Add configuration and logic to run both Celery and Conductor pipelines simultaneously for comparison.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/settings.py` (already has CONDUCTOR_SHADOW_MODE from INFRA-003)
|
|
- `server/reflector/worker/process.py` (INTEG-001 already implements shadow mode logic)
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
# settings.py (already done in INFRA-003)
|
|
CONDUCTOR_SHADOW_MODE: bool = False
|
|
|
|
# worker/process.py (in _process_multitrack_recording_inner)
|
|
if settings.CONDUCTOR_ENABLED:
|
|
workflow_id = ConductorClientManager.start_workflow(...)
|
|
await recordings_controller.update(recording, {"workflow_id": workflow_id})
|
|
|
|
if not settings.CONDUCTOR_SHADOW_MODE:
|
|
return # Conductor only - skip Celery
|
|
# If shadow mode, fall through to Celery trigger below
|
|
|
|
# Celery trigger (runs when Conductor disabled OR in shadow mode)
|
|
task_pipeline_multitrack_process.delay(...)
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Both pipelines triggered when CONDUCTOR_SHADOW_MODE=True
|
|
- [ ] Only Conductor triggered when CONDUCTOR_ENABLED=True and SHADOW_MODE=False
|
|
- [ ] Only Celery triggered when CONDUCTOR_ENABLED=False
|
|
- [ ] workflow_id stored on Recording model for comparison
|
|
|
|
**Dependencies:** INTEG-001
|
|
|
|
**Note:** INTEG-001 already implements the shadow mode toggle logic. This task verifies
|
|
the implementation and adds any missing comparison/monitoring infrastructure.
|
|
|
|
**Reference Files:**
|
|
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (Phase 3: Shadow Mode)
|
|
|
|
---
|
|
|
|
### SHADOW-002: Add Result Comparison - Content Fields
|
|
|
|
**Description:**
|
|
Compare content fields (title, summaries, topics, word counts) between Celery and Conductor outputs.
|
|
|
|
**Files to Create:**
|
|
- `server/reflector/conductor/shadow_compare.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
async def compare_content_results(recording_id: str, workflow_id: str) -> dict:
|
|
"""Compare content results from Celery and Conductor pipelines."""
|
|
celery_transcript = await transcripts_controller.get_by_recording_id(recording_id)
|
|
workflow_status = ConductorClientManager.get_workflow_status(workflow_id)
|
|
|
|
differences = []
|
|
# Compare title
|
|
if celery_transcript.title != workflow_status.output.get("title"):
|
|
differences.append({"field": "title", ...})
|
|
# Compare summaries, topics, word_count
|
|
...
|
|
|
|
return {"match": len(differences) == 0, "differences": differences}
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Compares title, long_summary, short_summary
|
|
- [ ] Compares topic count and content
|
|
- [ ] Compares word_count
|
|
- [ ] Logs differences for debugging
|
|
|
|
**Dependencies:** SHADOW-001
|
|
|
|
---
|
|
|
|
## Phase 5: Cutover
|
|
|
|
### CUTOVER-001: Create Feature Flag for Conductor-Only Mode
|
|
|
|
**Description:**
|
|
Enable Conductor-only mode by setting environment variables. No code changes required.
|
|
|
|
**Files to Modify:**
|
|
- `.env` or environment configuration
|
|
|
|
**Implementation Details:**
|
|
```bash
|
|
# .env (production)
|
|
CONDUCTOR_ENABLED=true # Enable Conductor
|
|
CONDUCTOR_SHADOW_MODE=false # Disable shadow mode (Conductor only)
|
|
```
|
|
|
|
The logic is already implemented in INTEG-001:
|
|
```python
|
|
# worker/process.py (_process_multitrack_recording_inner)
|
|
if settings.CONDUCTOR_ENABLED:
|
|
workflow_id = ConductorClientManager.start_workflow(...)
|
|
if not settings.CONDUCTOR_SHADOW_MODE:
|
|
return # Conductor only - Celery not triggered
|
|
# Celery only reached if Conductor disabled or shadow mode enabled
|
|
task_pipeline_multitrack_process.delay(...)
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Set CONDUCTOR_ENABLED=true in production environment
|
|
- [ ] Set CONDUCTOR_SHADOW_MODE=false
|
|
- [ ] Verify Celery not triggered (check logs for "Started Conductor workflow")
|
|
- [ ] Can toggle back via environment variables without code changes
|
|
|
|
**Dependencies:** SHADOW-001
|
|
|
|
**Note:** This is primarily a configuration change. The code logic is already in place from INTEG-001.
|
|
|
|
---
|
|
|
|
### CUTOVER-002: Add Fallback to Celery on Conductor Failure
|
|
|
|
**Description:**
|
|
Implement automatic fallback to Celery pipeline if Conductor fails to start or process a workflow.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/worker/process.py`
|
|
- `server/reflector/conductor/client.py`
|
|
|
|
**Implementation Details:**
|
|
```python
|
|
# In _process_multitrack_recording_inner()
|
|
if settings.CONDUCTOR_ENABLED:
|
|
try:
|
|
workflow_id = ConductorClientManager.start_workflow(
|
|
name="diarization_pipeline",
|
|
version=1,
|
|
input_data={...}
|
|
)
|
|
logger.info("Conductor workflow started", workflow_id=workflow_id, transcript_id=transcript.id)
|
|
await recordings_controller.update(recording, {"workflow_id": workflow_id})
|
|
|
|
if not settings.CONDUCTOR_SHADOW_MODE:
|
|
return # Success - don't trigger Celery
|
|
except Exception as e:
|
|
logger.error(
|
|
"Conductor workflow start failed, falling back to Celery",
|
|
error=str(e),
|
|
transcript_id=transcript.id,
|
|
exc_info=True,
|
|
)
|
|
# Fall through to Celery trigger below
|
|
|
|
# Celery fallback (runs on Conductor failure, or when disabled, or in shadow mode)
|
|
task_pipeline_multitrack_process.delay(
|
|
transcript_id=transcript.id,
|
|
bucket_name=bucket_name,
|
|
track_keys=filter_cam_audio_tracks(track_keys),
|
|
)
|
|
```
|
|
|
|
**Acceptance Criteria:**
|
|
- [ ] Celery triggered on Conductor connection failure
|
|
- [ ] Celery triggered on workflow start failure
|
|
- [ ] Errors logged with full context for debugging
|
|
- [ ] workflow_id still stored if partially successful
|
|
|
|
**Dependencies:** CUTOVER-001
|
|
|
|
---
|
|
|
|
## Phase 6: Cleanup
|
|
|
|
### CLEANUP-001: Remove Deprecated Celery Task Code
|
|
|
|
**Description:**
|
|
After successful migration, remove the old Celery-based pipeline code.
|
|
|
|
**Files to Modify:**
|
|
- `server/reflector/pipelines/main_multitrack_pipeline.py` - Remove entire file
|
|
- `server/reflector/worker/process.py` - Remove `task_pipeline_multitrack_process.delay()` call
|
|
- `server/reflector/pipelines/main_live_pipeline.py` - Remove shared utilities if unused
|
|
|
|
**Implementation Details:**
|
|
```python
# worker/process.py - Remove Celery fallback entirely
if settings.CONDUCTOR_ENABLED:
    workflow_id = ConductorClientManager.start_workflow(...)
    await recordings_controller.update(recording, {"workflow_id": workflow_id})
    return  # No Celery fallback

# Delete this:
# task_pipeline_multitrack_process.delay(...)
```

**Acceptance Criteria:**
- [ ] `main_multitrack_pipeline.py` deleted
- [ ] Celery trigger removed from `worker/process.py`
- [ ] Old task imports removed
- [ ] No new recordings processed via Celery
- [ ] Code removed after stability period (1-2 weeks)

**Dependencies:** CUTOVER-001

---

### CLEANUP-002: Update Documentation

**Description:**
Update all documentation to reflect the new Conductor-based architecture.

**Files to Modify:**
- `CLAUDE.md`
- `README.md`
- `docs/` (if applicable)

**Files to Archive:**
- `CONDUCTOR_MIGRATION_REQUIREMENTS.md` (move to docs/archive/)

**Acceptance Criteria:**
- [ ] Architecture diagrams updated
- [ ] API documentation reflects new endpoints
- [ ] Runbooks updated for Conductor operations

**Dependencies:** CLEANUP-001

---

## Testing Tasks

**⚠️ Note:** All test tasks should be deferred to a human tester if automated testing proves too complex or time-consuming.

### TEST-001a: Integration Tests - API Workers

**Description:**
Write integration tests for get_recording and get_participants workers.

**Files to Create:**
- `server/tests/conductor/test_workers_api.py`

**Implementation Details:**
```python
@pytest.mark.asyncio
async def test_get_recording_worker():
    with patch("reflector.conductor.workers.get_recording.create_platform_client") as mock:
        mock.return_value.__aenter__.return_value.get_recording.return_value = MockRecording()

        task = Task(input_data={"recording_id": "rec_123"})
        result = await get_recording(task)

        assert result.status == TaskResultStatus.COMPLETED
        assert result.output_data["id"] == "rec_123"
```

**Acceptance Criteria:**
- [ ] get_recording worker tested with mock Daily.co API
- [ ] get_participants worker tested with mock response
- [ ] Error handling tested (API failures)

**Dependencies:** TASK-002, TASK-003

---

### TEST-001b: Integration Tests - Audio Processing Workers

**Description:**
Write integration tests for pad_track, mixdown_tracks, and generate_waveform workers.

**Files to Create:**
- `server/tests/conductor/test_workers_audio.py`

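No implementation sketch was included for this task; a minimal starting point for the pad_track case could look like the snippet below. It follows the same conventions as the TEST-001a example (`Task`, `TaskResultStatus`, and the worker function imported the same way); the `download_track`/`upload_padded_track` helpers and the `padded_key` output field are assumptions, not names taken from the codebase.

```python
@pytest.mark.asyncio
async def test_pad_track_worker_uploads_padded_file():
    # Both helpers below are assumed names - patch whatever the real worker
    # uses for S3 download/upload so no network or PyAV work runs in the test.
    with patch("reflector.conductor.workers.pad_track.download_track") as mock_dl, \
         patch("reflector.conductor.workers.pad_track.upload_padded_track") as mock_ul:
        mock_dl.return_value = "/tmp/raw_track_0.webm"
        mock_ul.return_value = "padded/track_0.webm"

        task = Task(input_data={
            "recording_id": "rec_123",
            "bucket_name": "test-bucket",
            "track_key": "raw/track_0.webm",
        })
        result = await pad_track(task)

        assert result.status == TaskResultStatus.COMPLETED
        assert result.output_data["padded_key"] == "padded/track_0.webm"
```
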
**Acceptance Criteria:**
- [ ] pad_track worker tested with mock S3 and sample WebM
- [ ] mixdown_tracks worker tested with mock audio streams
- [ ] generate_waveform worker tested
- [ ] PyAV filter graph execution verified

**Dependencies:** TASK-004c, TASK-005b, TASK-006

---

### TEST-001c: Integration Tests - Transcription Workers

**Description:**
Write integration tests for transcribe_track and merge_transcripts workers.

**Files to Create:**
- `server/tests/conductor/test_workers_transcription.py`

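A possible starting point for the word-ordering check, following the TEST-001a conventions; the per-track input shape and the `words` output field are assumptions about the merge_transcripts worker:

```python
@pytest.mark.asyncio
async def test_merge_transcripts_sorts_words_by_timestamp():
    # Two single-word track transcripts arriving out of order.
    task = Task(input_data={
        "transcripts": [
            {"speaker": 0, "words": [{"text": "world", "start": 1.2, "end": 1.5}]},
            {"speaker": 1, "words": [{"text": "hello", "start": 0.3, "end": 0.6}]},
        ]
    })
    result = await merge_transcripts(task)

    assert result.status == TaskResultStatus.COMPLETED
    merged = result.output_data["words"]  # assumed output field
    assert [w["text"] for w in merged] == ["hello", "world"]
```
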
**Acceptance Criteria:**
- [ ] transcribe_track worker tested with mock Modal.com response
- [ ] merge_transcripts worker tested with multiple track inputs
- [ ] Word sorting by timestamp verified

**Dependencies:** TASK-007, TASK-008

---

### TEST-001d: Integration Tests - LLM Workers

**Description:**
Write integration tests for detect_topics, generate_title, and generate_summary workers.

**Files to Create:**
- `server/tests/conductor/test_workers_llm.py`

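A sketch of the mocking approach for the detect_topics case, using the same conventions as TEST-001a; the patched helper name and the `topics` output field are assumptions about how the worker calls the LLM:

```python
@pytest.mark.asyncio
async def test_detect_topics_worker_with_mock_llm():
    # Patch whatever helper the worker uses to call the LLM (name assumed here).
    with patch("reflector.conductor.workers.detect_topics.detect_topics_llm") as mock_llm:
        mock_llm.return_value = [{"title": "Project planning", "timestamp": 0.0}]

        task = Task(input_data={
            "transcript_id": "tr_123",
            "words": [{"text": "hello", "start": 0.0, "end": 0.4}],
        })
        result = await detect_topics(task)

        assert result.status == TaskResultStatus.COMPLETED
        assert result.output_data["topics"][0]["title"] == "Project planning"
```
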
**Acceptance Criteria:**
- [ ] detect_topics worker tested with mock LLM response
- [ ] generate_title worker tested
- [ ] generate_summary worker tested
- [ ] WebSocket event broadcasting verified

**Dependencies:** TASK-009, TASK-010, TASK-011

---

### TEST-001e: Integration Tests - Finalization Workers

**Description:**
Write integration tests for finalize, cleanup_consent, post_zulip, and send_webhook workers.

**Files to Create:**
- `server/tests/conductor/test_workers_finalization.py`

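For the HMAC criterion, one possible shape is sketched below (same conventions as TEST-001a). The use of httpx, the `content`/`headers` call kwargs, and the `X-Signature` header name are assumptions about how send_webhook posts the payload; adjust them to match the real worker.

```python
import hashlib
import hmac
from unittest.mock import AsyncMock, patch


@pytest.mark.asyncio
async def test_send_webhook_signs_payload():
    with patch("httpx.AsyncClient.post", new_callable=AsyncMock) as mock_post:
        task = Task(input_data={
            "webhook_url": "https://example.com/hook",
            "webhook_secret": "s3cret",
            "payload": {"transcript_id": "tr_123"},
        })
        result = await send_webhook(task)

        assert result.status == TaskResultStatus.COMPLETED
        _, kwargs = mock_post.call_args
        expected = hmac.new(b"s3cret", kwargs["content"], hashlib.sha256).hexdigest()
        assert kwargs["headers"]["X-Signature"] == expected  # header name assumed
```
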
**Acceptance Criteria:**
- [ ] finalize worker tested (DB update)
- [ ] cleanup_consent worker tested (S3 deletion)
- [ ] post_zulip worker tested with mock API
- [ ] send_webhook worker tested with HMAC verification

**Dependencies:** TASK-012, TASK-013, TASK-014, TASK-015

---

### TEST-002: E2E Test for Complete Workflow

**Description:**
Create an end-to-end test that runs the complete Conductor workflow with mock services.

**Files to Create:**
- `server/tests/conductor/test_workflow_e2e.py`

**Implementation Details:**
```python
@pytest.mark.asyncio
async def test_complete_diarization_workflow():
    # Start Conductor in test mode
    workflow_id = ConductorClientManager.start_workflow(
        "diarization_pipeline", 1,
        {"recording_id": "test_123", "tracks": [...]}
    )

    # Wait for completion
    status = await wait_for_workflow(workflow_id, timeout=60)

    assert status.status == "COMPLETED"
    assert status.output["title"] is not None
```
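
The `wait_for_workflow` helper above is not part of the Conductor SDK; a minimal polling sketch follows, assuming the ConductorClientManager wrapper (INFRA-002) exposes a `get_workflow(workflow_id)` call returning an object with a `.status` attribute:

```python
import asyncio


async def wait_for_workflow(workflow_id: str, timeout: float = 60, poll_interval: float = 2.0):
    """Poll workflow status until it reaches a terminal state or the timeout expires."""
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        status = ConductorClientManager.get_workflow(workflow_id)  # assumed wrapper method
        if status.status in ("COMPLETED", "FAILED", "TERMINATED", "TIMED_OUT"):
            return status
        if asyncio.get_running_loop().time() > deadline:
            raise TimeoutError(f"workflow {workflow_id} still running after {timeout}s")
        await asyncio.sleep(poll_interval)
```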

**Acceptance Criteria:**
- [ ] Complete workflow runs successfully
- [ ] All tasks execute in correct order
- [ ] FORK_JOIN_DYNAMIC parallelism works
- [ ] Output matches expected schema

**Dependencies:** WFLOW-002

---

### TEST-003: Shadow Mode Comparison Tests

**Description:**
Write tests that verify Celery and Conductor produce equivalent results.

**Files to Create:**
- `server/tests/conductor/test_shadow_compare.py`

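A small comparison helper these tests could be built around, reusing the content-field idea from SHADOW-002; the field names are assumptions about the transcript output schema:

```python
CONTENT_FIELDS = ("title", "short_summary", "long_summary", "topics")  # assumed field names


def diff_content_fields(celery_result: dict, conductor_result: dict) -> dict:
    """Return {field: (celery_value, conductor_value)} for every field that differs."""
    return {
        field: (celery_result.get(field), conductor_result.get(field))
        for field in CONTENT_FIELDS
        if celery_result.get(field) != conductor_result.get(field)
    }


def test_identical_outputs_produce_empty_diff():
    result = {"title": "Weekly sync", "topics": ["Planning"]}
    assert diff_content_fields(result, dict(result)) == {}
```
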
**Acceptance Criteria:**
- [ ] Same input produces same output
- [ ] Timing differences documented
- [ ] Edge cases handled

**Dependencies:** SHADOW-002b

---

## Appendix: Task Timeout Reference

| Task | Timeout (s) | Response Timeout (s) | Retry Count |
|------|-------------|----------------------|-------------|
| get_recording | 60 | 30 | 3 |
| get_participants | 60 | 30 | 3 |
| pad_track | 300 | 120 | 3 |
| mixdown_tracks | 600 | 300 | 3 |
| generate_waveform | 120 | 60 | 3 |
| transcribe_track | 1800 | 900 | 3 |
| merge_transcripts | 60 | 30 | 3 |
| detect_topics | 300 | 120 | 3 |
| generate_title | 60 | 30 | 3 |
| generate_summary | 300 | 120 | 3 |
| finalize | 60 | 30 | 3 |
| cleanup_consent | 60 | 30 | 3 |
| post_zulip | 60 | 30 | 5 |
| send_webhook | 60 | 30 | 30 |

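For reference, a sketch of how one row of this table might translate into a task definition dict (e.g. in `conductor/tasks/definitions.py` from TASK-001). Only the three numeric columns come from the table; `retryLogic` and `timeoutPolicy` are illustrative defaults, not values specified in this document:

```python
# Sketch: mapping the transcribe_track row onto Conductor's task definition schema.
TRANSCRIBE_TRACK = {
    "name": "transcribe_track",
    "timeoutSeconds": 1800,         # Timeout (s)
    "responseTimeoutSeconds": 900,  # Response Timeout (s)
    "retryCount": 3,                # Retry Count
    "retryLogic": "FIXED",          # assumed default
    "timeoutPolicy": "TIME_OUT_WF", # assumed default
}
```
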
---

## Appendix: File Structure After Migration

```
server/reflector/
├── conductor/
│   ├── __init__.py
│   ├── client.py               # Conductor SDK wrapper
│   ├── cache.py                # Idempotency cache
│   ├── shadow_compare.py       # Shadow mode comparison
│   ├── tasks/
│   │   ├── __init__.py
│   │   ├── definitions.py      # Task definitions with timeouts
│   │   └── register.py         # Registration script
│   ├── workers/
│   │   ├── __init__.py
│   │   ├── get_recording.py
│   │   ├── get_participants.py
│   │   ├── pad_track.py
│   │   ├── mixdown_tracks.py
│   │   ├── generate_waveform.py
│   │   ├── transcribe_track.py
│   │   ├── merge_transcripts.py
│   │   ├── detect_topics.py
│   │   ├── generate_title.py
│   │   ├── generate_summary.py
│   │   ├── finalize.py
│   │   ├── cleanup_consent.py
│   │   ├── post_zulip.py
│   │   └── send_webhook.py
│   └── workflows/
│       ├── diarization_pipeline.json
│       └── register.py
├── views/
│   └── conductor.py            # Health & status endpoints
└── ...existing files...
```