Mirror of https://github.com/Monadical-SAS/reflector.git, synced 2025-12-20 20:29:06 +00:00

Commit: durable (no-mistakes)

This commit contains:

CONDUCTOR_MIGRATION_REQUIREMENTS.md (new file, 617 lines)
@@ -0,0 +1,617 @@
# Conductor Migration Requirements: Daily.co Multitrack Pipeline

## Executive Summary

Migrate the Daily.co multitrack diarization pipeline from a monolithic Celery task to a decomposed Conductor workflow, enabling visual progress tracking, granular retries, and operational observability.

---

## Business Value

### 1. Visibility: Where Are We Now? (UX, DevEx)
|
||||||
|
|
||||||
|
**Current State**: Users see only three states: `idle` → `processing` → `ended/error`. A 10-minute pipeline appears frozen with no feedback.
|
||||||
|
|
||||||
|
**Target State**: Real-time visibility into which step is executing:
|
||||||
|
- "Transcribing track 2 of 3"
|
||||||
|
- "Generating summary (step 8 of 9)"
|
||||||
|
- Visual DAG in admin UI showing completed/in-progress/pending nodes
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Reduced support tickets ("is it stuck?")
|
||||||
|
- Engineers can instantly identify bottlenecks
|
||||||
|
- Users have confidence the system is working
|
||||||
|
|
||||||
|
### 2. Progress Tracking: What's Left? (UX, DevEx)
|
||||||
|
|
||||||
|
**Current State**: No indication of remaining work. A failure at step 8 produces the same error as a failure at step 1.
|
||||||
|
|
||||||
|
**Target State**:
|
||||||
|
- Progress percentage based on completed steps
|
||||||
|
- Clear step enumeration (e.g., "Step 5/9: Transcription")
|
||||||
|
- Frontend receives structured progress events with step metadata
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Users can estimate completion time
|
||||||
|
- Frontend can render meaningful progress bars
|
||||||
|
- Error messages include context ("Failed during summary generation")
|
||||||
|
|
||||||
|
### 3. Audit Trail & Profiling (DevEx, Ops)
|
||||||
|
|
||||||
|
**Current State**: Logs scattered across Celery workers. No unified view of a single recording's journey. Resource consumption unknown per step.
|
||||||
|
|
||||||
|
**Target State**:
|
||||||
|
- Single workflow ID traces entire recording lifecycle
|
||||||
|
- Per-step execution times recorded
|
||||||
|
- Resource consumption (GPU seconds, LLM tokens) attributable to specific steps
|
||||||
|
- Conductor UI provides complete audit history
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Debugging: "Recording X failed at step Y after Z seconds"
|
||||||
|
- Cost attribution: "Transcription costs $X, summarization costs $Y"
|
||||||
|
- Performance optimization: identify slowest steps
|
||||||
|
|
||||||
|
### 4. Clear Event Dictionary (DevEx)
|
||||||
|
|
||||||
|
**Current State**: Frontend receives WebSocket events (`TRANSCRIPT`, `TOPIC`, `FINAL_TITLE`, etc.) but mapping to pipeline phases is implicit. Adding new events requires tracing through Python code.
|
||||||
|
|
||||||
|
**Target State**:
|
||||||
|
- Each Conductor task explicitly defines its output events
|
||||||
|
- Event schema documented alongside task definition
|
||||||
|
- Frontend developers can reference task→event mapping directly
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Faster frontend development
|
||||||
|
- Reduced miscommunication between backend/frontend teams
|
||||||
|
- Self-documenting pipeline
|
||||||
|
|
||||||
|
### 5. Restart Without Reprocessing (UX, DevEx)
|
||||||
|
|
||||||
|
**Current State**: Any failure restarts the entire pipeline. A timeout during summary generation re-runs transcription (wasting GPU costs).
|
||||||
|
|
||||||
|
**Target State**:
|
||||||
|
- Failures resume from last successful step
|
||||||
|
- Completed work is checkpointed (e.g., transcription results stored before summary)
|
||||||
|
- Manual retry triggers only failed step, not entire workflow
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Reduced GPU/LLM costs on retries
|
||||||
|
- Faster recovery from transient failures
|
||||||
|
- Users don't wait for re-transcription on summary failures
|
||||||
|
|
||||||
|
### 6. Per-Step Timeouts (UX, DevEx)
|
||||||
|
|
||||||
|
**Current State**: Single task timeout for entire pipeline. A hung GPU call blocks everything. Killing the task loses all progress.
|
||||||
|
|
||||||
|
**Target State**:
|
||||||
|
- Each step has independent timeout (e.g., transcription: 5min, LLM: 30s)
|
||||||
|
- Timeout kills only the hung step
|
||||||
|
- Pipeline can retry just that step or fail gracefully
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Faster detection of stuck external services
|
||||||
|
- Reduced blast radius from hung calls
|
||||||
|
- More granular SLAs per operation type
|
||||||
|
|
||||||
|
### 7. Native Retries with Backoff (DevEx, UX)
|
||||||
|
|
||||||
|
**Current State**: Celery retry logic is per-task, not per-external-call. Custom retry wrappers needed for each API call.
|
||||||
|
|
||||||
|
**Target State**:
|
||||||
|
- Conductor provides native retry policies per task
|
||||||
|
- Exponential backoff configured declaratively
|
||||||
|
- Retry state visible in UI (attempt 2/5)
|
||||||
|
|
||||||
|
**Business Impact**:
|
||||||
|
- Reduced boilerplate code
|
||||||
|
- Consistent retry behavior across all external calls
|
||||||
|
- Visibility into retry attempts for debugging
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Current Architecture
|
||||||
|
|
||||||
|
### Daily.co Multitrack Pipeline Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
Daily webhook (recording.ready-to-download) Polling (every 3 min)
|
||||||
|
│ │
|
||||||
|
▼ ▼
|
||||||
|
_handle_recording_ready() poll_daily_recordings()
|
||||||
|
│ │
|
||||||
|
└──────────────┬─────────────────────────┘
|
||||||
|
▼
|
||||||
|
process_multitrack_recording.delay() ← Celery task #1
|
||||||
|
│
|
||||||
|
├── Daily API: GET /recordings/{id}
|
||||||
|
├── Daily API: GET /meetings/{mtgSessionId}/participants
|
||||||
|
├── DB: Create recording + transcript
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
task_pipeline_multitrack_process.delay() ← Celery task #2 (MONOLITH)
|
||||||
|
│
|
||||||
|
│ ┌─────────────────────────────────────────────────┐
|
||||||
|
│ │ pipeline.process() - ALL PHASES INSIDE HERE │
|
||||||
|
│ │ │
|
||||||
|
│ │ Phase 2: Track Padding (N tracks, sequential) │
|
||||||
|
│ │ Phase 3: Mixdown → S3 upload │
|
||||||
|
│ │ Phase 4: Waveform generation │
|
||||||
|
│ │ Phase 5: Transcription (N GPU calls, serial!) │
|
||||||
|
│ │ Phase 6: Topic Detection (C LLM calls) │
|
||||||
|
│ │ Phase 7a: Title Generation (1 LLM call) │
|
||||||
|
│ │ Phase 7b: Summary Generation (2+2M LLM calls) │
|
||||||
|
│ │ Phase 8: Finalize status │
|
||||||
|
│ └─────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
chain(cleanup → zulip → webhook).delay() ← Celery chain (3 tasks)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Problem: Monolithic `pipeline.process()`
|
||||||
|
|
||||||
|
The heavy lifting happens inside a single Python function call. Celery only sees:
|
||||||
|
- Task started
|
||||||
|
- Task succeeded/failed
|
||||||
|
|
||||||
|
It cannot see or control the 8 internal phases.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Target Architecture
|
||||||
|
|
||||||
|
### Decomposed Conductor Workflow
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ get_recording │ ← Daily API
|
||||||
|
│ get_participants │
|
||||||
|
└──────────┬──────────┘
|
||||||
|
│
|
||||||
|
┌──────────────────┼──────────────────┐
|
||||||
|
▼ ▼ ▼
|
||||||
|
┌──────────┐ ┌──────────┐ ┌──────────┐
|
||||||
|
│ pad_tk_0 │ │ pad_tk_1 │ │ pad_tk_N │ ← FORK (parallel)
|
||||||
|
└────┬─────┘ └────┬─────┘ └────┬─────┘
|
||||||
|
└──────────────────┼──────────────────┘
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ mixdown_tracks │ ← PyAV → S3
|
||||||
|
└──────────┬──────────┘
|
||||||
|
│
|
||||||
|
┌──────────┴──────────┐
|
||||||
|
▼ ▼
|
||||||
|
┌───────────────┐ ┌───────────────┐
|
||||||
|
│generate_wave │ │ (continue) │ ← waveform parallel with transcription setup
|
||||||
|
└───────────────┘ └───────────────┘
|
||||||
|
│
|
||||||
|
┌──────────────────┼──────────────────┐
|
||||||
|
▼ ▼ ▼
|
||||||
|
┌────────────┐ ┌────────────┐ ┌────────────┐
|
||||||
|
│transcribe_0│ │transcribe_1│ │transcribe_N│ ← FORK (parallel GPU!)
|
||||||
|
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
|
||||||
|
└──────────────────┼──────────────────┘
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ merge_transcripts │
|
||||||
|
└──────────┬──────────┘
|
||||||
|
│
|
||||||
|
┌──────────┴──────────┐
|
||||||
|
▼ ▼
|
||||||
|
┌───────────────┐ ┌───────────────┐
|
||||||
|
│detect_topics │ │ (or) │ ← topic detection
|
||||||
|
└───────┬───────┘ └───────────────┘
|
||||||
|
│
|
||||||
|
┌──────────────┴──────────────┐
|
||||||
|
▼ ▼
|
||||||
|
┌─────────────┐ ┌─────────────┐
|
||||||
|
│generate_title│ │gen_summary │ ← FORK (parallel LLM)
|
||||||
|
└──────┬──────┘ └──────┬──────┘
|
||||||
|
└──────────────┬─────────────┘
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ finalize │
|
||||||
|
└──────────┬──────────┘
|
||||||
|
│
|
||||||
|
┌──────────────┼──────────────┐
|
||||||
|
▼ ▼ ▼
|
||||||
|
┌──────────┐ ┌──────────┐ ┌──────────┐
|
||||||
|
│ consent │──▶│ zulip │──▶│ webhook │ ← sequential chain
|
||||||
|
└──────────┘ └──────────┘ └──────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key Improvements
|
||||||
|
|
||||||
|
| Aspect | Current (Celery) | Target (Conductor) |
|
||||||
|
|--------|------------------|-------------------|
|
||||||
|
| Transcription parallelism | Serial (N × 30s) | Parallel (max 30s) |
|
||||||
|
| Failure granularity | Restart all | Retry failed step only |
|
||||||
|
| Progress visibility | None | Per-step status in UI |
|
||||||
|
| Timeout control | Entire pipeline | Per-step timeouts |
|
||||||
|
| Audit trail | Scattered logs | Unified workflow history |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Scope of Work
|
||||||
|
|
||||||
|
### Module 1: Conductor Infrastructure Setup
|
||||||
|
|
||||||
|
**Files to Create/Modify:**
|
||||||
|
- `docker-compose.yml` - Add Conductor server container
|
||||||
|
- `server/reflector/conductor/` - New module for Conductor client
|
||||||
|
- Environment configuration for Conductor URL
|
||||||
|
|
||||||
|
**Tasks:**
|
||||||
|
- [ ] Add `conductoross/conductor-standalone:3.15.0` to docker-compose
|
||||||
|
- [ ] Create Conductor client wrapper (Python `conductor-python` SDK)
|
||||||
|
- [ ] Configure health checks and service dependencies
|
||||||
|
- [ ] Document Conductor UI access (port 8127)
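As a smoke test for this module, a short script can confirm the standalone server is reachable before any workers are registered. This is a minimal sketch, assuming the `8180:8080` host port mapping used by the compose service added later in this commit and `httpx` as the HTTP client; it is not part of the planned code.

```python
# Minimal sketch: verify the Conductor standalone container is reachable.
# Assumes host port 8180 maps to the server's 8080 (an assumption taken from
# the docker-compose changes in this commit).
import httpx

CONDUCTOR_BASE = "http://localhost:8180"


def conductor_is_healthy() -> bool:
    try:
        # Same endpoint the compose healthcheck curls inside the container
        resp = httpx.get(f"{CONDUCTOR_BASE}/health", timeout=5.0)
        return resp.status_code == 200
    except httpx.HTTPError:
        return False


if __name__ == "__main__":
    print("Conductor healthy:", conductor_is_healthy())
```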
|
||||||
|
|
||||||
|
### Module 2: Task Decomposition - Worker Definitions
|
||||||
|
|
||||||
|
**Files to Create:**
|
||||||
|
- `server/reflector/conductor/workers/` directory with:
|
||||||
|
- `get_recording.py` - Daily API recording fetch
|
||||||
|
- `get_participants.py` - Daily API participant fetch
|
||||||
|
- `pad_track.py` - Single track padding (PyAV)
|
||||||
|
- `mixdown_tracks.py` - Multi-track mixdown
|
||||||
|
- `generate_waveform.py` - Waveform generation
|
||||||
|
- `transcribe_track.py` - Single track GPU transcription
|
||||||
|
- `merge_transcripts.py` - Combine transcriptions
|
||||||
|
- `detect_topics.py` - LLM topic detection
|
||||||
|
- `generate_title.py` - LLM title generation
|
||||||
|
- `generate_summary.py` - LLM summary generation
|
||||||
|
- `finalize.py` - Status update and cleanup
|
||||||
|
- `cleanup_consent.py` - Consent check
|
||||||
|
- `post_zulip.py` - Zulip notification
|
||||||
|
- `send_webhook.py` - External webhook
|
||||||
|
- `generate_dynamic_fork_tasks.py` - Helper for FORK_JOIN_DYNAMIC task generation
|
||||||
|
|
||||||
|
**Reference Files (Current Implementation):**
|
||||||
|
- `server/reflector/pipelines/main_multitrack_pipeline.py`
|
||||||
|
- `server/reflector/worker/process.py`
|
||||||
|
- `server/reflector/worker/webhook.py`
|
||||||
|
|
||||||
|
**Key Considerations:**
|
||||||
|
- Each worker receives input from previous step via Conductor
|
||||||
|
- Workers must be idempotent (same input → same output)
|
||||||
|
- State serialization between steps (JSON-compatible types)
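To make the considerations above concrete, here is a minimal sketch of one worker in the `Task` → `TaskResult` shape this commit uses elsewhere (see the debugging notes in `CONDUCTOR_LLM_OBSERVATIONS.md`). The Daily API call is elided, the output fields are illustrative, and the exact `TaskResult` helper methods should be verified against the `conductor-python` version pinned in `pyproject.toml`.

```python
# Sketch of a decomposed worker: reads JSON input, returns JSON-compatible output.
from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus


def get_recording(task: Task) -> TaskResult:
    result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id="get_recording",
    )
    recording_id = task.input_data.get("recording_id")
    if not recording_id:
        result.status = TaskResultStatus.FAILED
        result.reason_for_incompletion = "missing recording_id"
        return result

    # ... call the Daily API here; keep only small, JSON-serializable fields ...
    result.add_output_data("id", recording_id)
    result.status = TaskResultStatus.COMPLETED
    return result
```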
|
||||||
|
|
||||||
|
### Module 3: Workflow Definition
|
||||||
|
|
||||||
|
**Files to Create:**
|
||||||
|
- `server/reflector/conductor/workflows/diarization_pipeline.json`
|
||||||
|
- `server/reflector/conductor/workflows/register.py` - Registration script
|
||||||
|
|
||||||
|
**Workflow Structure:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"name": "daily_diarization_pipeline",
|
||||||
|
"version": 1,
|
||||||
|
"tasks": [
|
||||||
|
{"name": "get_recording", "type": "SIMPLE"},
|
||||||
|
{"name": "get_participants", "type": "SIMPLE"},
|
||||||
|
{
|
||||||
|
"name": "fork_padding",
|
||||||
|
"type": "FORK_JOIN_DYNAMIC",
|
||||||
|
"dynamicForkTasksParam": "track_keys"
|
||||||
|
},
|
||||||
|
{"name": "mixdown_tracks", "type": "SIMPLE"},
|
||||||
|
{"name": "generate_waveform", "type": "SIMPLE"},
|
||||||
|
{
|
||||||
|
"name": "fork_transcription",
|
||||||
|
"type": "FORK_JOIN_DYNAMIC",
|
||||||
|
"dynamicForkTasksParam": "padded_urls"
|
||||||
|
},
|
||||||
|
{"name": "merge_transcripts", "type": "SIMPLE"},
|
||||||
|
{"name": "detect_topics", "type": "SIMPLE"},
|
||||||
|
{
|
||||||
|
"name": "fork_generation",
|
||||||
|
"type": "FORK_JOIN",
|
||||||
|
"forkTasks": [["generate_title"], ["generate_summary"]]
|
||||||
|
},
|
||||||
|
{"name": "finalize", "type": "SIMPLE"},
|
||||||
|
{"name": "cleanup_consent", "type": "SIMPLE"},
|
||||||
|
{"name": "post_zulip", "type": "SIMPLE"},
|
||||||
|
{"name": "send_webhook", "type": "SIMPLE"}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Key Considerations:**
|
||||||
|
- Dynamic FORK for variable number of tracks (N)
|
||||||
|
- Timeout configuration per task type
|
||||||
|
- Retry policies with exponential backoff
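For the dynamic fork above, Conductor expects the preceding helper task (`generate_dynamic_fork_tasks`) to emit both a list of task definitions and an input map keyed by task reference name. The sketch below shows one plausible output shape; the output key names and their wiring into `dynamicForkTasksParam` / `dynamicForkTasksInputParamName` in the workflow JSON are assumptions to confirm during implementation.

```python
# Sketch: build the dynamic-fork payload for N pad_track tasks.
def build_dynamic_fork(track_keys: list[str]) -> dict:
    tasks, inputs = [], {}
    for i, s3_key in enumerate(track_keys):
        ref = f"pad_track_{i}"
        tasks.append({"name": "pad_track", "taskReferenceName": ref, "type": "SIMPLE"})
        inputs[ref] = {"track_index": i, "s3_key": s3_key}
    # The workflow references these two output keys when expanding the fork
    return {"dynamic_tasks": tasks, "dynamic_tasks_input": inputs}
```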
|
||||||
|
|
||||||
|
### Module 4: Pipeline Trigger Migration
|
||||||
|
|
||||||
|
**Files to Modify:**
|
||||||
|
- `server/reflector/worker/process.py`
|
||||||
|
|
||||||
|
**Changes:**
|
||||||
|
- Replace `task_pipeline_multitrack_process.delay()` with Conductor workflow start
|
||||||
|
- Store workflow ID on Recording for status tracking
|
||||||
|
- Handle Conductor API errors
|
||||||
|
- Keep `process_multitrack_recording` as-is (creates DB entities before workflow)
|
||||||
|
|
||||||
|
**Note:** Both webhook AND polling entry points converge at `process_multitrack_recording`,
|
||||||
|
which then calls `task_pipeline_multitrack_process.delay()`. By modifying this single call site,
|
||||||
|
we capture both entry paths without duplicating integration logic.
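A minimal sketch of the swapped trigger, using the `ConductorClientManager` added by this commit; the workflow name matches the Module 3 definition, while the input keys shown are illustrative and must match whatever the registered workflow actually expects.

```python
from reflector.conductor.client import ConductorClientManager


def start_diarization_workflow(
    recording_id: str, transcript_id: str, track_keys: list[str]
) -> str:
    """Start the Conductor workflow and return its ID for storage on Recording."""
    return ConductorClientManager.start_workflow(
        name="daily_diarization_pipeline",
        version=1,
        input_data={
            "recording_id": recording_id,
            "transcript_id": transcript_id,
            "track_keys": track_keys,
        },
    )
```

The returned ID would then be persisted in the `workflow_id` column added by the migration in this commit, so later status queries can trace a recording back to its workflow.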
|
||||||
|
|
||||||
|
### Module 5: Task Definition Registration
|
||||||
|
|
||||||
|
**Files to Create:**
|
||||||
|
- `server/reflector/conductor/tasks/definitions.py`
|
||||||
|
|
||||||
|
**Task Definitions with Timeouts:**
|
||||||
|
|
||||||
|
| Task | Timeout | Response Timeout | Retry Count |
|
||||||
|
|------|---------|------------------|-------------|
|
||||||
|
| get_recording | 60s | 30s | 3 |
|
||||||
|
| get_participants | 60s | 30s | 3 |
|
||||||
|
| pad_track | 300s | 120s | 3 |
|
||||||
|
| mixdown_tracks | 600s | 300s | 3 |
|
||||||
|
| generate_waveform | 120s | 60s | 3 |
|
||||||
|
| transcribe_track | 1800s | 900s | 3 |
|
||||||
|
| merge_transcripts | 60s | 30s | 3 |
|
||||||
|
| detect_topics | 300s | 120s | 3 |
|
||||||
|
| generate_title | 60s | 30s | 3 |
|
||||||
|
| generate_summary | 300s | 120s | 3 |
|
||||||
|
| finalize | 60s | 30s | 3 |
|
||||||
|
| cleanup_consent | 60s | 30s | 3 |
|
||||||
|
| post_zulip | 60s | 30s | 5 |
|
||||||
|
| send_webhook | 60s | 30s | 30 |
|
||||||
|
| generate_dynamic_fork_tasks | 30s | 15s | 3 |
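One way to register the table above is through Conductor's metadata REST API; a sketch follows. The field names are standard Conductor task-definition fields, while the base URL, the 5-second retry delay, and the blanket `EXPONENTIAL_BACKOFF` policy are assumptions for illustration.

```python
# Sketch: push the task definitions above to POST /api/metadata/taskdefs.
import httpx

CONDUCTOR_API = "http://localhost:8180/api"

TASK_DEFS = [
    # (name, timeout_s, response_timeout_s, retry_count) -- rows from the table
    ("get_recording", 60, 30, 3),
    ("transcribe_track", 1800, 900, 3),
    ("send_webhook", 60, 30, 30),
    # ... remaining rows ...
]


def register_task_defs() -> None:
    payload = [
        {
            "name": name,
            "timeoutSeconds": timeout,
            "responseTimeoutSeconds": response_timeout,
            "retryCount": retries,
            "retryLogic": "EXPONENTIAL_BACKOFF",
            "retryDelaySeconds": 5,
            "timeoutPolicy": "TIME_OUT_WF",
        }
        for name, timeout, response_timeout, retries in TASK_DEFS
    ]
    httpx.post(f"{CONDUCTOR_API}/metadata/taskdefs", json=payload).raise_for_status()
```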
|
||||||
|
|
||||||
|
### Module 6: Frontend Integration
|
||||||
|
|
||||||
|
**WebSocket Events (Already Defined):**
|
||||||
|
|
||||||
|
Events continue to be broadcast exactly as they are today; the event structure does not change.
|
||||||
|
|
||||||
|
| Event | Triggered By Task | Payload |
|
||||||
|
|-------|-------------------|---------|
|
||||||
|
| STATUS | finalize | `{value: "processing"\|"ended"\|"error"}` |
|
||||||
|
| DURATION | mixdown_tracks | `{duration: float}` |
|
||||||
|
| WAVEFORM | generate_waveform | `{waveform: float[]}` |
|
||||||
|
| TRANSCRIPT | merge_transcripts | `{text: string, translation: string\|null}` |
|
||||||
|
| TOPIC | detect_topics | `{id, title, summary, timestamp, duration}` |
|
||||||
|
| FINAL_TITLE | generate_title | `{title: string}` |
|
||||||
|
| FINAL_LONG_SUMMARY | generate_summary | `{long_summary: string}` |
|
||||||
|
| FINAL_SHORT_SUMMARY | generate_summary | `{short_summary: string}` |
|
||||||
|
|
||||||
|
**New: Progress Events**
|
||||||
|
|
||||||
|
Add a new event type for granular progress:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# PipelineProgressEvent
|
||||||
|
{
|
||||||
|
"event": "PIPELINE_PROGRESS",
|
||||||
|
"data": {
|
||||||
|
"workflow_id": str,
|
||||||
|
"current_step": str,
|
||||||
|
"step_index": int,
|
||||||
|
"total_steps": int,
|
||||||
|
"step_status": "pending" | "in_progress" | "completed" | "failed"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
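A sketch of the payload as a Pydantic model, purely to pin down the shape the frontend should expect; this commit adds a similar `PipelineProgressData` model, so treat the class below as illustrative rather than the final definition.

```python
from typing import Literal

from pydantic import BaseModel


class PipelineProgress(BaseModel):
    workflow_id: str | None = None
    current_step: str
    step_index: int
    total_steps: int
    step_status: Literal["pending", "in_progress", "completed", "failed"]


# Example event as it would appear on the transcript WebSocket
event = {
    "event": "PIPELINE_PROGRESS",
    "data": PipelineProgress(
        current_step="transcribe_track",
        step_index=6,
        total_steps=13,
        step_status="in_progress",
    ).model_dump(),
}
```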
|
||||||
|
|
||||||
|
### Module 7: State Management & Checkpointing
|
||||||
|
|
||||||
|
**Current State Storage:**
|
||||||
|
- `transcript.status` - High-level status
|
||||||
|
- `transcript.events[]` - Append-only event log
|
||||||
|
- `transcript.topics[]` - Topic results
|
||||||
|
- `transcript.title`, `transcript.long_summary`, etc.
|
||||||
|
|
||||||
|
**Conductor State Storage:**
|
||||||
|
- Workflow execution state in Conductor database
|
||||||
|
- Per-task input/output in Conductor
|
||||||
|
|
||||||
|
**Checkpointing Strategy:**
|
||||||
|
1. Each task reads required state from DB (not previous task output for large data)
|
||||||
|
2. Each task writes results to DB before returning
|
||||||
|
3. Task output contains references (IDs, URLs) not large payloads
|
||||||
|
4. On retry, task can check DB for existing results (idempotency)
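A sketch of this checkpointing pattern for the most expensive step. The controller import exists in this codebase, but `track_words`, `save_track_words`, and `gpu_transcribe` are hypothetical names used only to illustrate the read-check-write flow.

```python
from reflector.db.transcripts import transcripts_controller


async def gpu_transcribe(audio_url: str) -> list[dict]:
    """Hypothetical stand-in for the Modal GPU transcription call."""
    ...


async def transcribe_track_step(transcript_id: str, track_index: int, audio_url: str) -> dict:
    transcript = await transcripts_controller.get_by_id(transcript_id)

    # Idempotency: if a previous attempt already stored words for this track,
    # return the checkpoint instead of paying for another GPU call.
    existing = (getattr(transcript, "track_words", None) or {}).get(str(track_index))
    if existing is not None:
        return {"track_index": track_index, "word_count": len(existing)}

    words = await gpu_transcribe(audio_url)
    # Hypothetical controller method: persist results before returning
    await transcripts_controller.save_track_words(transcript_id, track_index, words)

    # Keep the Conductor task output small; the words themselves live in the DB.
    return {"track_index": track_index, "word_count": len(words)}
```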
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Data Flow Between Tasks
|
||||||
|
|
||||||
|
### Input/Output Contracts
|
||||||
|
|
||||||
|
```
|
||||||
|
get_recording
|
||||||
|
Input: { recording_id: string }
|
||||||
|
Output: { id, mtg_session_id, room_name, duration }
|
||||||
|
|
||||||
|
get_participants
|
||||||
|
Input: { mtg_session_id: string }
|
||||||
|
Output: { participants: [{participant_id, user_name}] }
|
||||||
|
|
||||||
|
pad_track
|
||||||
|
Input: { track_index: number, s3_key: string }
|
||||||
|
Output: { padded_url: string, size: number }
|
||||||
|
|
||||||
|
mixdown_tracks
|
||||||
|
Input: { padded_urls: string[] }
|
||||||
|
Output: { audio_key: string, duration: number }
|
||||||
|
|
||||||
|
generate_waveform
|
||||||
|
Input: { audio_key: string }
|
||||||
|
Output: { waveform: number[] }
|
||||||
|
|
||||||
|
transcribe_track
|
||||||
|
Input: { track_index: number, audio_url: string }
|
||||||
|
Output: { words: Word[] }
|
||||||
|
|
||||||
|
merge_transcripts
|
||||||
|
Input: { transcripts: Word[][] }
|
||||||
|
Output: { all_words: Word[], word_count: number }
|
||||||
|
|
||||||
|
detect_topics
|
||||||
|
Input: { words: Word[] }
|
||||||
|
Output: { topics: Topic[] }
|
||||||
|
|
||||||
|
generate_title
|
||||||
|
Input: { topics: Topic[] }
|
||||||
|
Output: { title: string }
|
||||||
|
|
||||||
|
generate_summary
|
||||||
|
Input: { words: Word[], topics: Topic[] }
|
||||||
|
Output: { summary: string, short_summary: string }
|
||||||
|
|
||||||
|
finalize
|
||||||
|
Input: { recording_id, title, summary, duration }
|
||||||
|
Output: { status: "COMPLETED" }
|
||||||
|
```
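Where it helps type-checking, the same contracts can be written down as `TypedDict`s; the two pairs shown below are direct transcriptions of the entries above and are illustrative only.

```python
from typing import TypedDict


class PadTrackInput(TypedDict):
    track_index: int
    s3_key: str


class PadTrackOutput(TypedDict):
    padded_url: str
    size: int


class MixdownInput(TypedDict):
    padded_urls: list[str]


class MixdownOutput(TypedDict):
    audio_key: str
    duration: float
```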
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## External API Calls Summary
|
||||||
|
|
||||||
|
### Per-Step External Dependencies
|
||||||
|
|
||||||
|
| Task | External Service | Calls | Notes |
|
||||||
|
|------|------------------|-------|-------|
|
||||||
|
| get_recording | Daily.co API | 1 | GET /recordings/{id} |
|
||||||
|
| get_participants | Daily.co API | 1 | GET /meetings/{id}/participants |
|
||||||
|
| pad_track | S3 | 2 | presign read + PUT padded |
|
||||||
|
| mixdown_tracks | S3 | 1 | PUT audio.mp3 |
|
||||||
|
| transcribe_track | Modal.com GPU | 1 | POST /transcriptions |
|
||||||
|
| detect_topics | LLM (OpenAI) | C | C = ceil(words/300) |
|
||||||
|
| generate_title | LLM (OpenAI) | 1 | - |
|
||||||
|
| generate_summary | LLM (OpenAI) | 2+2M | M = subjects (max 6) |
|
||||||
|
| post_zulip | Zulip API | 1 | POST or PATCH |
|
||||||
|
| send_webhook | External | 1 | Customer webhook URL |
|
||||||
|
|
||||||
|
### Cost Attribution Enabled
|
||||||
|
|
||||||
|
With decomposed tasks, costs can be attributed:
|
||||||
|
- **GPU costs**: Sum of `transcribe_track` durations
|
||||||
|
- **LLM costs**: Sum of `detect_topics` + `generate_title` + `generate_summary` token usage
|
||||||
|
- **S3 costs**: Bytes uploaded by `pad_track` + `mixdown_tracks`
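As a sketch of how GPU attribution could work once the fork is in place: sum the durations of the `transcribe_track_*` tasks from the workflow execution record. The port and the treatment of wall-clock task duration as billable GPU time are assumptions.

```python
import httpx


def gpu_seconds(workflow_id: str) -> float:
    # Task start/end times in the execution record are epoch milliseconds
    wf = httpx.get(f"http://localhost:8180/api/workflow/{workflow_id}").json()
    return sum(
        (t["endTime"] - t["startTime"]) / 1000.0
        for t in wf.get("tasks", [])
        if t.get("referenceTaskName", "").startswith("transcribe_track")
        and t.get("endTime")
    )
```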
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Idempotency Requirements
|
||||||
|
|
||||||
|
### By Task
|
||||||
|
|
||||||
|
| Task | Idempotent? | Strategy |
|
||||||
|
|------|-------------|----------|
|
||||||
|
| get_recording | ✅ | Read-only API call |
|
||||||
|
| get_participants | ✅ | Read-only API call |
|
||||||
|
| pad_track | ⚠️ | Overwrite same S3 key |
|
||||||
|
| mixdown_tracks | ⚠️ | Overwrite same S3 key |
|
||||||
|
| generate_waveform | ✅ | Deterministic from audio |
|
||||||
|
| transcribe_track | ❌ | Cache by hash(audio_url) |
|
||||||
|
| detect_topics | ❌ | Cache by hash(words) |
|
||||||
|
| generate_title | ❌ | Cache by hash(topic_titles) |
|
||||||
|
| generate_summary | ❌ | Cache by hash(words+topics) |
|
||||||
|
| finalize | ✅ | Upsert status |
|
||||||
|
| cleanup_consent | ✅ | Idempotent deletes |
|
||||||
|
| post_zulip | ⚠️ | Use message_id for updates |
|
||||||
|
| send_webhook | ⚠️ | Receiver's responsibility |
|
||||||
|
|
||||||
|
### Caching Strategy for LLM/GPU Calls
|
||||||
|
|
||||||
|
```python
import hashlib
import json
from typing import Any, Optional


def compute_input_hash(payload: Any) -> str:
    # Builtin hash() is salted per process; use a stable content hash instead
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True, default=str).encode()
    ).hexdigest()


class TaskCache:
    async def get(self, input_hash: str) -> Optional[dict]: ...
    async def set(self, input_hash: str, output: dict) -> None: ...


# Before calling the external service:
key = compute_input_hash(task_input)
cached = await cache.get(key)
if cached:
    return cached

result = await external_service.call(task_input)
await cache.set(key, result)
return result
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Migration Strategy
|
||||||
|
|
||||||
|
### Phase 1: Infrastructure (No Behavior Change)
|
||||||
|
- Add Conductor container to docker-compose
|
||||||
|
- Create Conductor client library
|
||||||
|
- Verify Conductor UI accessible
|
||||||
|
|
||||||
|
### Phase 2: Parallel Implementation
|
||||||
|
- Implement all worker tasks
|
||||||
|
- Register workflow definition
|
||||||
|
- Test with synthetic recordings
|
||||||
|
|
||||||
|
### Phase 3: Shadow Mode
|
||||||
|
- Trigger both Celery and Conductor pipelines
|
||||||
|
- Compare results for consistency
|
||||||
|
- Monitor Conductor execution in UI
|
||||||
|
|
||||||
|
### Phase 4: Cutover
|
||||||
|
- Disable Celery pipeline trigger
|
||||||
|
- Enable Conductor-only execution
|
||||||
|
- Monitor error rates and performance
|
||||||
|
|
||||||
|
### Phase 5: Cleanup
|
||||||
|
- Remove Celery task definitions
|
||||||
|
- Remove old pipeline code
|
||||||
|
- Update documentation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Risks & Mitigations
|
||||||
|
|
||||||
|
| Risk | Mitigation |
|
||||||
|
|------|------------|
|
||||||
|
| Conductor server downtime | Health checks, failover to Celery (Phase 3) |
|
||||||
|
| Worker serialization issues | Extensive testing with real data |
|
||||||
|
| Performance regression | Benchmark parallel vs serial transcription |
|
||||||
|
| Data loss on migration | Shadow mode comparison (Phase 3) |
|
||||||
|
| Learning curve for team | Documentation, Conductor UI training |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Metrics
|
||||||
|
|
||||||
|
| Metric | Current | Target |
|
||||||
|
|--------|---------|--------|
|
||||||
|
| Pipeline visibility | 3 states | 14+ steps visible |
|
||||||
|
| Transcription latency (N tracks) | N × 30s | ~30s (parallel) |
|
||||||
|
| Retry granularity | Entire pipeline | Single step |
|
||||||
|
| Cost attribution | None | Per-step breakdown |
|
||||||
|
| Debug time for failures | ~30 min | ~5 min (UI trace) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Appendix: Conductor Mock Implementation
|
||||||
|
|
||||||
|
A working Python mock demonstrating the target workflow structure is available at:
|
||||||
|
`docs/conductor-pipeline-mock/`
|
||||||
|
|
||||||
|
To run:
|
||||||
|
```bash
|
||||||
|
cd docs/conductor-pipeline-mock
|
||||||
|
docker compose up --build
|
||||||
|
./test_workflow.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
UI: http://localhost:8127
|
||||||
|
|
||||||
|
This mock validates:
|
||||||
|
- Workflow definition structure
|
||||||
|
- FORK_JOIN parallelism
|
||||||
|
- Worker task patterns
|
||||||
|
- Conductor SDK usage
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- Diarization Pipeline Diagram: `DIARIZATION_PIPELINE_DIAGRAM.md`
|
||||||
|
- Current Celery Implementation: `server/reflector/pipelines/main_multitrack_pipeline.py`
|
||||||
|
- Conductor OSS Documentation: https://conductor-oss.github.io/conductor/
|
||||||
|
- Conductor Python SDK: https://github.com/conductor-sdk/conductor-python
|
||||||
@@ -34,6 +34,20 @@ services:
     environment:
       ENTRYPOINT: beat

+  conductor-worker:
+    build:
+      context: server
+    volumes:
+      - ./server/:/app/
+      - /app/.venv
+    env_file:
+      - ./server/.env
+    environment:
+      ENTRYPOINT: conductor-worker
+    depends_on:
+      conductor:
+        condition: service_healthy
+
   redis:
     image: redis:7.2
     ports:
@@ -64,6 +78,19 @@ services:
     volumes:
       - ./data/postgres:/var/lib/postgresql/data

+  conductor:
+    image: conductoross/conductor-standalone:3.15.0
+    ports:
+      - 8180:8080
+      - 5001:5000
+    environment:
+      - conductor.db.type=memory
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 5
+
 networks:
   default:
     attachable: true

server/CONDUCTOR_LLM_OBSERVATIONS.md (new file, 345 lines)
@@ -0,0 +1,345 @@
|
|||||||
|
# Conductor OSS Migration - LLM Debugging Observations
|
||||||
|
|
||||||
|
This document captures hard-won debugging insights from migrating the multitrack diarization pipeline from Celery to Conductor OSS. These observations are particularly relevant for LLM assistants working on this codebase.
|
||||||
|
|
||||||
|
## Architecture Context
|
||||||
|
|
||||||
|
- **Conductor Python SDK** uses multiprocessing: 1 parent process spawns 15 `TaskRunner` subprocesses
|
||||||
|
- Each task type gets its own subprocess that polls Conductor server
|
||||||
|
- Workers are identified by container hostname (e.g., `595f5ddc9711`)
|
||||||
|
- Shadow mode (`CONDUCTOR_SHADOW_MODE=true`) runs both Celery and Conductor in parallel
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Challenge 1: Ghost Workers - Multiple Containers Polling Same Tasks
|
||||||
|
|
||||||
|
### Symptoms
|
||||||
|
- Tasks complete but with wrong/empty output
|
||||||
|
- Worker logs show no execution for a task that the API shows as COMPLETED
|
||||||
|
- `workerId` in Conductor API doesn't match expected container
|
||||||
|
|
||||||
|
### Root Cause
|
||||||
|
Multiple containers may be running Conductor workers:
|
||||||
|
- `reflector-conductor-worker-1` (dedicated worker)
|
||||||
|
- `reflector-server-1` (if shadow mode enabled or worker code imported)
|
||||||
|
|
||||||
|
### Debugging Steps
|
||||||
|
```bash
|
||||||
|
# 1. Get the mystery worker ID from Conductor API
|
||||||
|
curl -s "http://localhost:8180/api/workflow/{id}" | jq '.tasks[] | {ref: .referenceTaskName, workerId}'
|
||||||
|
|
||||||
|
# 2. Find which container has that hostname
|
||||||
|
docker ps -a | grep {workerId}
|
||||||
|
# or
|
||||||
|
docker ps -a --format "{{.ID}} {{.Names}}" | grep {first-12-chars}
|
||||||
|
|
||||||
|
# 3. Check that container's code version
|
||||||
|
docker exec {container} cat /app/reflector/conductor/workers/{worker}.py | head -50
|
||||||
|
```
|
||||||
|
|
||||||
|
### Resolution
|
||||||
|
Restart ALL containers that might be polling Conductor tasks:
|
||||||
|
```bash
|
||||||
|
docker compose restart conductor-worker server
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key Insight
|
||||||
|
**Always verify `workerId` matches your expected container.** In distributed worker setups, know ALL containers that poll for tasks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Challenge 2: Multiprocessing + AsyncIO + Database Conflicts
|
||||||
|
|
||||||
|
### Symptoms
|
||||||
|
```
|
||||||
|
InterfaceError: cannot perform operation: another operation is in progress
|
||||||
|
RuntimeError: Task <Task pending...> running at /app/.../worker.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Root Cause
|
||||||
|
The Conductor Python SDK forks subprocesses. When a subprocess calls `asyncio.run()`:
1. A new event loop is created
2. But `get_database()` returns a cached connection from the parent process context
3. The parent's connection is incompatible with the child's event loop
|
||||||
|
|
||||||
|
### Resolution
|
||||||
|
Reset context and create fresh connection in each subprocess:
|
||||||
|
```python
|
||||||
|
async def _process():
|
||||||
|
import databases
|
||||||
|
from reflector.db import _database_context
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
# Reset context var - don't inherit from parent
|
||||||
|
_database_context.set(None)
|
||||||
|
db = databases.Database(settings.DATABASE_URL)
|
||||||
|
_database_context.set(db)
|
||||||
|
await db.connect()
|
||||||
|
|
||||||
|
# ... rest of async code
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key Insight
|
||||||
|
**Any singleton/cached resource (DB connections, S3 clients, HTTP sessions) must be recreated AFTER fork.** Never trust inherited state in multiprocessing workers.
|
||||||
|
|
||||||
|
### TODO: The Real Problem with get_database()
|
||||||
|
|
||||||
|
**Current solution is a hack.** The issue runs deeper than multiprocessing fork:
|
||||||
|
|
||||||
|
#### What's Actually Happening
|
||||||
|
1. Each Conductor subprocess calls `asyncio.run(_process())` repeatedly for each task
|
||||||
|
2. First `asyncio.run()`: creates DB connection, stores in ContextVar
|
||||||
|
3. First task completes, `asyncio.run()` exits, **event loop destroyed**
|
||||||
|
4. **But**: ContextVar still holds the connection reference (ContextVars persist across `asyncio.run()` calls)
|
||||||
|
5. Second `asyncio.run()`: Creates a **new event loop**
|
||||||
|
6. Code tries to use the **old connection** (from ContextVar) with the **new event loop**
|
||||||
|
7. Error: "another operation is in progress"
|
||||||
|
|
||||||
|
**Root issue**: `get_database()` as a global singleton is incompatible with repeated `asyncio.run()` calls in the same process.
|
||||||
|
|
||||||
|
#### Option 1: Explicit Connection Lifecycle (cleanest)
|
||||||
|
```python
|
||||||
|
async def _process():
|
||||||
|
import databases
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
# Don't use get_database() - create explicit connection
|
||||||
|
db = databases.Database(settings.DATABASE_URL)
|
||||||
|
|
||||||
|
try:
|
||||||
|
await db.connect()
|
||||||
|
|
||||||
|
# Problem: transcripts_controller.get_by_id() uses get_database() internally
|
||||||
|
# Would need to refactor controllers to accept db parameter
|
||||||
|
# e.g., await transcripts_controller.get_by_id(transcript_id, db=db)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
await db.disconnect()
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros**: Clean separation, explicit lifecycle
|
||||||
|
**Cons**: Requires refactoring all controller methods to accept `db` parameter
|
||||||
|
|
||||||
|
#### Option 2: Reset ContextVar Properly (pragmatic)
|
||||||
|
```python
|
||||||
|
async def _process():
|
||||||
|
from reflector.db import _database_context, get_database
|
||||||
|
|
||||||
|
# Ensure fresh connection per task
|
||||||
|
old_db = _database_context.get()
|
||||||
|
if old_db and old_db.is_connected:
|
||||||
|
await old_db.disconnect()
|
||||||
|
_database_context.set(None)
|
||||||
|
|
||||||
|
# Now get_database() will create fresh connection
|
||||||
|
db = get_database()
|
||||||
|
await db.connect()
|
||||||
|
|
||||||
|
try:
|
||||||
|
# ... work ...
|
||||||
|
finally:
|
||||||
|
await db.disconnect()
|
||||||
|
_database_context.set(None)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros**: Works with existing controller code
|
||||||
|
**Cons**: Still manipulating globals, cleanup needed in every worker
|
||||||
|
|
||||||
|
#### Option 3: Fix get_database() Itself (best long-term)
|
||||||
|
```python
|
||||||
|
# In reflector/db/__init__.py
|
||||||
|
def get_database() -> databases.Database:
|
||||||
|
"""Get database instance for current event loop"""
|
||||||
|
import asyncio
|
||||||
|
|
||||||
|
db = _database_context.get()
|
||||||
|
|
||||||
|
# Check if connection is valid for current event loop
|
||||||
|
if db is not None:
|
||||||
|
try:
|
||||||
|
loop = asyncio.get_running_loop()
|
||||||
|
# If connection's event loop differs, it's stale
|
||||||
|
if db._connection and hasattr(db._connection, '_loop'):
|
||||||
|
if db._connection._loop != loop:
|
||||||
|
# Stale connection from old event loop
|
||||||
|
db = None
|
||||||
|
except RuntimeError:
|
||||||
|
# No running loop
|
||||||
|
pass
|
||||||
|
|
||||||
|
if db is None:
|
||||||
|
db = databases.Database(settings.DATABASE_URL)
|
||||||
|
_database_context.set(db)
|
||||||
|
|
||||||
|
return db
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros**: Fixes root cause, no changes needed in workers
|
||||||
|
**Cons**: Relies on implementation details of `databases` library
|
||||||
|
|
||||||
|
#### Recommendation
|
||||||
|
- **Short-term**: Option 2 (explicit cleanup in workers that need DB)
|
||||||
|
- **Long-term**: Option 1 (refactor to dependency injection) is the only architecturally clean solution
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Challenge 3: Type Mismatches Across Serialization Boundary
|
||||||
|
|
||||||
|
### Symptoms
|
||||||
|
```
|
||||||
|
ValidationError: 1 validation error for TranscriptTopic
|
||||||
|
transcript
|
||||||
|
Input should be a valid string [type=string_type, input_value={'translation': None, 'words': [...]}]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Root Cause
|
||||||
|
Conductor JSON-serializes all task inputs/outputs. Complex Pydantic models get serialized to dicts:
|
||||||
|
- `TitleSummary.transcript: Transcript` becomes `{"translation": null, "words": [...]}`
|
||||||
|
- Next task expects `TranscriptTopic.transcript: str`
|
||||||
|
|
||||||
|
### Resolution
|
||||||
|
Explicitly reconstruct types when deserializing:
|
||||||
|
```python
|
||||||
|
from reflector.processors.types import TitleSummary, Transcript as TranscriptType, Word
|
||||||
|
|
||||||
|
def normalize_topic(t):
|
||||||
|
topic = dict(t)
|
||||||
|
transcript_data = topic.get("transcript")
|
||||||
|
if isinstance(transcript_data, dict):
|
||||||
|
words_list = transcript_data.get("words", [])
|
||||||
|
word_objects = [Word(**w) for w in words_list]
|
||||||
|
topic["transcript"] = TranscriptType(
|
||||||
|
words=word_objects,
|
||||||
|
translation=transcript_data.get("translation")
|
||||||
|
)
|
||||||
|
return topic
|
||||||
|
|
||||||
|
topic_objects = [TitleSummary(**normalize_topic(t)) for t in topics]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key Insight
|
||||||
|
**Conductor task I/O is always JSON.** Design workers to handle dict inputs and reconstruct domain objects explicitly.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Challenge 4: Conductor Health Check Failures
|
||||||
|
|
||||||
|
### Symptoms
|
||||||
|
```
|
||||||
|
dependency failed to start: container reflector-conductor-1 is unhealthy
|
||||||
|
```
|
||||||
|
|
||||||
|
### Root Cause
|
||||||
|
The Conductor OSS standalone container's health endpoint can be slow or flaky, especially during startup or under load.
|
||||||
|
|
||||||
|
### Resolution
|
||||||
|
Bypass docker-compose health check dependency:
|
||||||
|
```bash
|
||||||
|
# Instead of: docker compose up -d conductor-worker
|
||||||
|
docker start reflector-conductor-worker-1
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key Insight
|
||||||
|
For development, consider removing `depends_on.condition: service_healthy` or increasing health check timeout.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Challenge 5: JOIN Task Output Format
|
||||||
|
|
||||||
|
### Symptoms
|
||||||
|
`merge_transcripts` receives data but outputs `word_count: 0`
|
||||||
|
|
||||||
|
### Root Cause
|
||||||
|
FORK_JOIN_DYNAMIC's JOIN task outputs a **dict keyed by task reference names**, not an array:
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"transcribe_track_0": {"words": [...], "track_index": 0},
|
||||||
|
"transcribe_track_1": {"words": [...], "track_index": 1}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Resolution
|
||||||
|
Handle both dict and array inputs:
|
||||||
|
```python
|
||||||
|
transcripts = task.input_data.get("transcripts", [])
|
||||||
|
|
||||||
|
# Handle JOIN output (dict with task refs as keys)
|
||||||
|
if isinstance(transcripts, dict):
|
||||||
|
transcripts = list(transcripts.values())
|
||||||
|
|
||||||
|
for t in transcripts:
|
||||||
|
if isinstance(t, dict) and "words" in t:
|
||||||
|
all_words.extend(t["words"])
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key Insight
|
||||||
|
**JOIN task output structure differs from FORK input.** Always log input types during debugging.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Debugging Workflow
|
||||||
|
|
||||||
|
### 1. Add DEBUG Prints with Flush
|
||||||
|
Multiprocessing buffers stdout. Force immediate output:
|
||||||
|
```python
|
||||||
|
import sys
|
||||||
|
print("[DEBUG] worker entered", flush=True)
|
||||||
|
sys.stdout.flush()
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Test Worker Functions Directly
|
||||||
|
Bypass Conductor entirely to verify logic:
|
||||||
|
```bash
|
||||||
|
docker compose exec conductor-worker uv run python -c "
|
||||||
|
from reflector.conductor.workers.merge_transcripts import merge_transcripts
|
||||||
|
from conductor.client.http.models import Task
|
||||||
|
|
||||||
|
mock_task = Task()
|
||||||
|
mock_task.input_data = {'transcripts': {...}, 'transcript_id': 'test'}
|
||||||
|
result = merge_transcripts(mock_task)
|
||||||
|
print(result.output_data)
|
||||||
|
"
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Check Task Timing
|
||||||
|
Suspiciously fast completion (e.g., 10ms) indicates:
|
||||||
|
- Cached result from previous run
|
||||||
|
- Wrong worker processed it
|
||||||
|
- Task completed without actual execution
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -s "http://localhost:8180/api/workflow/{id}" | \
|
||||||
|
jq '.tasks[] | {ref: .referenceTaskName, duration: (.endTime - .startTime)}'
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Verify Container Code Version
|
||||||
|
```bash
|
||||||
|
docker compose exec conductor-worker cat /app/reflector/conductor/workers/{file}.py | head -50
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5. Use Conductor Retry API
|
||||||
|
Retry from specific failed task without re-running entire workflow:
|
||||||
|
```bash
|
||||||
|
curl -X POST "http://localhost:8180/api/workflow/{id}/retry"
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common Gotchas Summary
|
||||||
|
|
||||||
|
| Issue | Signal | Fix |
|
||||||
|
|-------|--------|-----|
|
||||||
|
| Wrong worker | `workerId` mismatch | Restart all worker containers |
|
||||||
|
| DB conflict | "another operation in progress" | Fresh DB connection per subprocess |
|
||||||
|
| Type mismatch | Pydantic validation error | Reconstruct objects from dicts |
|
||||||
|
| No logs | Task completes but no output | Check if different container processed |
|
||||||
|
| 0 results | JOIN output format | Convert dict.values() to list |
|
||||||
|
| Health check | Compose dependency fails | Use `docker start` directly |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Files Most Likely to Need Conductor-Specific Handling
|
||||||
|
|
||||||
|
- `server/reflector/conductor/workers/*.py` - All workers need multiprocessing-safe patterns
|
||||||
|
- `server/reflector/db/__init__.py` - Database singleton, needs context reset
|
||||||
|
- `server/reflector/conductor/workflows/*.json` - Workflow definitions, check input/output mappings
|
||||||
@@ -0,0 +1,32 @@
|
|||||||
|
"""add workflow_id to recording
|
||||||
|
|
||||||
|
Revision ID: a326252ac554
|
||||||
|
Revises: bbafedfa510c
|
||||||
|
Create Date: 2025-12-14 11:34:22.137910
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
from typing import Sequence, Union
|
||||||
|
|
||||||
|
import sqlalchemy as sa
|
||||||
|
from alembic import op
|
||||||
|
|
||||||
|
# revision identifiers, used by Alembic.
|
||||||
|
revision: str = "a326252ac554"
|
||||||
|
down_revision: Union[str, None] = "bbafedfa510c"
|
||||||
|
branch_labels: Union[str, Sequence[str], None] = None
|
||||||
|
depends_on: Union[str, Sequence[str], None] = None
|
||||||
|
|
||||||
|
|
||||||
|
def upgrade() -> None:
|
||||||
|
with op.batch_alter_table("recording", schema=None) as batch_op:
|
||||||
|
batch_op.add_column(sa.Column("workflow_id", sa.String(), nullable=True))
|
||||||
|
batch_op.create_index(
|
||||||
|
"idx_recording_workflow_id", ["workflow_id"], unique=False
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def downgrade() -> None:
|
||||||
|
with op.batch_alter_table("recording", schema=None) as batch_op:
|
||||||
|
batch_op.drop_index("idx_recording_workflow_id")
|
||||||
|
batch_op.drop_column("workflow_id")
|
||||||
@@ -39,6 +39,7 @@ dependencies = [
     "pytest-env>=1.1.5",
     "webvtt-py>=0.5.0",
     "icalendar>=6.0.0",
+    "conductor-python>=1.2.3",
 ]

 [dependency-groups]
|||||||
@@ -12,6 +12,7 @@ from reflector.events import subscribers_shutdown, subscribers_startup
 from reflector.logger import logger
 from reflector.metrics import metrics_init
 from reflector.settings import settings
+from reflector.views.conductor import router as conductor_router
 from reflector.views.daily import router as daily_router
 from reflector.views.meetings import router as meetings_router
 from reflector.views.rooms import router as rooms_router
@@ -98,6 +99,7 @@ app.include_router(user_ws_router, prefix="/v1")
 app.include_router(zulip_router, prefix="/v1")
 app.include_router(whereby_router, prefix="/v1")
 app.include_router(daily_router, prefix="/v1/daily")
+app.include_router(conductor_router, prefix="/v1")
 add_pagination(app)

 # prepare celery
|||||||
server/reflector/conductor/__init__.py (new file, 5 lines)
@@ -0,0 +1,5 @@
|
|||||||
|
"""Conductor workflow orchestration module."""
|
||||||
|
|
||||||
|
from reflector.conductor.client import ConductorClientManager
|
||||||
|
|
||||||
|
__all__ = ["ConductorClientManager"]
|
||||||
server/reflector/conductor/client.py (new file, 40 lines)
@@ -0,0 +1,40 @@
|
|||||||
|
"""Conductor Python client wrapper."""
|
||||||
|
|
||||||
|
from conductor.client.configuration.configuration import Configuration
|
||||||
|
from conductor.client.orkes_clients import OrkesClients
|
||||||
|
from conductor.client.workflow_client import WorkflowClient
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
|
||||||
|
class ConductorClientManager:
|
||||||
|
"""Singleton manager for Conductor client connections."""
|
||||||
|
|
||||||
|
_instance: OrkesClients | None = None
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_client(cls) -> WorkflowClient:
|
||||||
|
"""Get or create the workflow client."""
|
||||||
|
if cls._instance is None:
|
||||||
|
config = Configuration(
|
||||||
|
server_api_url=settings.CONDUCTOR_SERVER_URL,
|
||||||
|
debug=settings.CONDUCTOR_DEBUG,
|
||||||
|
)
|
||||||
|
cls._instance = OrkesClients(config)
|
||||||
|
return cls._instance.get_workflow_client()
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def start_workflow(cls, name: str, version: int, input_data: dict) -> str:
|
||||||
|
"""Start a workflow and return the workflow ID."""
|
||||||
|
client = cls.get_client()
|
||||||
|
return client.start_workflow_by_name(name, input_data, version=version)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_workflow_status(cls, workflow_id: str) -> dict:
|
||||||
|
"""Get the current status of a workflow."""
|
||||||
|
client = cls.get_client()
|
||||||
|
return client.get_workflow(workflow_id, include_tasks=True)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def reset(cls) -> None:
|
||||||
|
"""Reset the client instance (for testing)."""
|
||||||
|
cls._instance = None
|
||||||
server/reflector/conductor/progress.py (new file, 103 lines)
@@ -0,0 +1,103 @@
|
|||||||
|
"""Progress event emission for Conductor workers."""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
from typing import Literal
|
||||||
|
|
||||||
|
from reflector.db.transcripts import PipelineProgressData
|
||||||
|
from reflector.logger import logger
|
||||||
|
from reflector.ws_manager import get_ws_manager
|
||||||
|
|
||||||
|
# Step mapping for progress tracking
|
||||||
|
# Maps task names to their index in the pipeline
|
||||||
|
PIPELINE_STEPS = {
|
||||||
|
"get_recording": 1,
|
||||||
|
"get_participants": 2,
|
||||||
|
"pad_track": 3, # Fork tasks share same step
|
||||||
|
"mixdown_tracks": 4,
|
||||||
|
"generate_waveform": 5,
|
||||||
|
"transcribe_track": 6, # Fork tasks share same step
|
||||||
|
"merge_transcripts": 7,
|
||||||
|
"detect_topics": 8,
|
||||||
|
"generate_title": 9, # Fork tasks share same step
|
||||||
|
"generate_summary": 9, # Fork tasks share same step
|
||||||
|
"finalize": 10,
|
||||||
|
"cleanup_consent": 11,
|
||||||
|
"post_zulip": 12,
|
||||||
|
"send_webhook": 13,
|
||||||
|
}
|
||||||
|
|
||||||
|
TOTAL_STEPS = 13
|
||||||
|
|
||||||
|
|
||||||
|
async def _emit_progress_async(
|
||||||
|
transcript_id: str,
|
||||||
|
step: str,
|
||||||
|
status: Literal["pending", "in_progress", "completed", "failed"],
|
||||||
|
workflow_id: str | None = None,
|
||||||
|
) -> None:
|
||||||
|
"""Async implementation of progress emission."""
|
||||||
|
ws_manager = get_ws_manager()
|
||||||
|
step_index = PIPELINE_STEPS.get(step, 0)
|
||||||
|
|
||||||
|
data = PipelineProgressData(
|
||||||
|
workflow_id=workflow_id,
|
||||||
|
current_step=step,
|
||||||
|
step_index=step_index,
|
||||||
|
total_steps=TOTAL_STEPS,
|
||||||
|
step_status=status,
|
||||||
|
)
|
||||||
|
|
||||||
|
await ws_manager.send_json(
|
||||||
|
room_id=f"ts:{transcript_id}",
|
||||||
|
message={
|
||||||
|
"event": "PIPELINE_PROGRESS",
|
||||||
|
"data": data.model_dump(),
|
||||||
|
},
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.debug(
|
||||||
|
"[Progress] Emitted",
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
step=step,
|
||||||
|
status=status,
|
||||||
|
step_index=step_index,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def emit_progress(
|
||||||
|
transcript_id: str,
|
||||||
|
step: str,
|
||||||
|
status: Literal["pending", "in_progress", "completed", "failed"],
|
||||||
|
workflow_id: str | None = None,
|
||||||
|
) -> None:
|
||||||
|
"""Emit a pipeline progress event (sync wrapper for Conductor workers).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
transcript_id: The transcript ID to emit progress for
|
||||||
|
step: The current step name (e.g., "transcribe_track")
|
||||||
|
status: The step status
|
||||||
|
workflow_id: Optional workflow ID
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Get or create event loop for sync context
|
||||||
|
try:
|
||||||
|
loop = asyncio.get_running_loop()
|
||||||
|
except RuntimeError:
|
||||||
|
loop = None
|
||||||
|
|
||||||
|
if loop is not None and loop.is_running():
|
||||||
|
# Already in async context, schedule the coroutine
|
||||||
|
asyncio.create_task(
|
||||||
|
_emit_progress_async(transcript_id, step, status, workflow_id)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
# Not in async context, run synchronously
|
||||||
|
asyncio.run(_emit_progress_async(transcript_id, step, status, workflow_id))
|
||||||
|
except Exception as e:
|
||||||
|
# Progress emission should never break the pipeline
|
||||||
|
logger.warning(
|
||||||
|
"[Progress] Failed to emit progress event",
|
||||||
|
error=str(e),
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
step=step,
|
||||||
|
)
|
||||||
server/reflector/conductor/run_workers.py (new file, 58 lines)
@@ -0,0 +1,58 @@
|
|||||||
|
"""
|
||||||
|
Run Conductor workers for the diarization pipeline.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
uv run -m reflector.conductor.run_workers
|
||||||
|
|
||||||
|
# Or via docker:
|
||||||
|
docker compose exec server uv run -m reflector.conductor.run_workers
|
||||||
|
"""
|
||||||
|
|
||||||
|
import signal
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
|
||||||
|
from conductor.client.automator.task_handler import TaskHandler
|
||||||
|
from conductor.client.configuration.configuration import Configuration
|
||||||
|
from reflector.conductor import workers # noqa: F401 - registers workers via decorators
|
||||||
|
from reflector.logger import logger
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> None:
|
||||||
|
"""Start Conductor worker polling."""
|
||||||
|
if not settings.CONDUCTOR_ENABLED:
|
||||||
|
logger.error("CONDUCTOR_ENABLED is False, not starting workers")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"Starting Conductor workers",
|
||||||
|
server_url=settings.CONDUCTOR_SERVER_URL,
|
||||||
|
)
|
||||||
|
|
||||||
|
config = Configuration(
|
||||||
|
server_api_url=settings.CONDUCTOR_SERVER_URL,
|
||||||
|
debug=settings.CONDUCTOR_DEBUG,
|
||||||
|
)
|
||||||
|
|
||||||
|
task_handler = TaskHandler(configuration=config)
|
||||||
|
|
||||||
|
# Handle graceful shutdown
|
||||||
|
def shutdown_handler(signum: int, frame) -> None:
|
||||||
|
logger.info("Received shutdown signal, stopping workers...")
|
||||||
|
task_handler.stop_processes()
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
|
signal.signal(signal.SIGINT, shutdown_handler)
|
||||||
|
signal.signal(signal.SIGTERM, shutdown_handler)
|
||||||
|
|
||||||
|
logger.info("Starting task polling...")
|
||||||
|
task_handler.start_processes()
|
||||||
|
|
||||||
|
# Keep alive
|
||||||
|
while True:
|
||||||
|
time.sleep(1)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
server/reflector/conductor/shadow_compare.py (new file, 207 lines)
@@ -0,0 +1,207 @@
|
|||||||
|
"""Shadow mode comparison for Celery vs Conductor pipeline results."""
|
||||||
|
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
from reflector.conductor.client import ConductorClientManager
|
||||||
|
from reflector.db.transcripts import Transcript, transcripts_controller
|
||||||
|
from reflector.logger import logger


@dataclass
class FieldDifference:
    """A difference between Celery and Conductor field values."""

    field: str
    celery_value: Any
    conductor_value: Any


@dataclass
class ComparisonResult:
    """Result of comparing Celery and Conductor outputs."""

    match: bool
    differences: list[FieldDifference]
    celery_status: str
    conductor_status: str
    error: str | None = None


async def compare_content_results(
    recording_id: str, workflow_id: str
) -> ComparisonResult:
    """
    Compare content results from Celery and Conductor pipelines.

    Args:
        recording_id: Recording ID to look up Celery transcript
        workflow_id: Conductor workflow ID to get workflow output

    Returns:
        ComparisonResult with match status and any differences
    """
    try:
        # Get Celery result from DB
        celery_transcript = await transcripts_controller.get_by_recording_id(
            recording_id
        )
        if not celery_transcript:
            return ComparisonResult(
                match=False,
                differences=[],
                celery_status="not_found",
                conductor_status="unknown",
                error=f"No transcript found for recording_id={recording_id}",
            )

        # Get Conductor workflow status
        workflow_status = ConductorClientManager.get_workflow_status(workflow_id)
        conductor_status = workflow_status.status if workflow_status else "unknown"

        # If workflow not completed, can't compare
        if conductor_status != "COMPLETED":
            return ComparisonResult(
                match=False,
                differences=[],
                celery_status=celery_transcript.status,
                conductor_status=conductor_status,
                error=f"Conductor workflow not completed: {conductor_status}",
            )

        # Extract output from workflow
        workflow_output = (
            workflow_status.output if hasattr(workflow_status, "output") else {}
        )

        differences = _compare_fields(celery_transcript, workflow_output)

        result = ComparisonResult(
            match=len(differences) == 0,
            differences=differences,
            celery_status=celery_transcript.status,
            conductor_status=conductor_status,
        )

        # Log comparison result
        if result.match:
            logger.info(
                "Shadow mode comparison: MATCH",
                recording_id=recording_id,
                workflow_id=workflow_id,
            )
        else:
            logger.warning(
                "Shadow mode comparison: MISMATCH",
                recording_id=recording_id,
                workflow_id=workflow_id,
                differences=[
                    {
                        "field": d.field,
                        "celery": d.celery_value,
                        "conductor": d.conductor_value,
                    }
                    for d in differences
                ],
            )

        return result

    except Exception as e:
        logger.error(
            "Shadow mode comparison failed",
            recording_id=recording_id,
            workflow_id=workflow_id,
            error=str(e),
            exc_info=True,
        )
        return ComparisonResult(
            match=False,
            differences=[],
            celery_status="unknown",
            conductor_status="unknown",
            error=str(e),
        )


def _compare_fields(
    celery_transcript: Transcript, workflow_output: dict
) -> list[FieldDifference]:
    """Compare specific content fields between Celery and Conductor."""
    differences = []

    # Compare title
    conductor_title = workflow_output.get("title")
    if celery_transcript.title != conductor_title:
        differences.append(
            FieldDifference(
                field="title",
                celery_value=celery_transcript.title,
                conductor_value=conductor_title,
            )
        )

    # Compare short_summary
    conductor_short_summary = workflow_output.get("short_summary")
    if celery_transcript.short_summary != conductor_short_summary:
        differences.append(
            FieldDifference(
                field="short_summary",
                celery_value=celery_transcript.short_summary,
                conductor_value=conductor_short_summary,
            )
        )

    # Compare long_summary
    conductor_long_summary = workflow_output.get("summary")
    if celery_transcript.long_summary != conductor_long_summary:
        differences.append(
            FieldDifference(
                field="long_summary",
                celery_value=celery_transcript.long_summary,
                conductor_value=conductor_long_summary,
            )
        )

    # Compare topic count
    celery_topics = celery_transcript.topics or []
    conductor_topics = workflow_output.get("topics", [])
    if len(celery_topics) != len(conductor_topics):
        differences.append(
            FieldDifference(
                field="topic_count",
                celery_value=len(celery_topics),
                conductor_value=len(conductor_topics),
            )
        )

    # Compare word count from events
    celery_events = celery_transcript.events or {}
    celery_words = (
        celery_events.get("words", []) if isinstance(celery_events, dict) else []
    )
    conductor_words = workflow_output.get("all_words", [])
    if len(celery_words) != len(conductor_words):
        differences.append(
            FieldDifference(
                field="word_count",
                celery_value=len(celery_words),
                conductor_value=len(conductor_words),
            )
        )

    # Compare duration
    conductor_duration = workflow_output.get("duration")
    if (
        conductor_duration is not None
        and celery_transcript.duration != conductor_duration
    ):
        differences.append(
            FieldDifference(
                field="duration",
                celery_value=celery_transcript.duration,
                conductor_value=conductor_duration,
            )
        )

    return differences
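
For context, a minimal sketch of how this comparison might be invoked once both pipelines have run for the same recording. The call site and the import path are assumptions for illustration, not part of this diff:

    # Hypothetical call site: run after the Celery pipeline finishes and a shadow
    # Conductor workflow was started for the same recording.
    from reflector.conductor.shadow import compare_content_results  # assumed module path

    async def report_shadow_result(recording_id: str, workflow_id: str) -> None:
        result = await compare_content_results(recording_id, workflow_id)
        if not result.match:
            # Differences are already logged above; surface a compact summary as well.
            print(f"{len(result.differences)} field(s) differ for {recording_id}")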

6
server/reflector/conductor/tasks/__init__.py
Normal file
@@ -0,0 +1,6 @@
"""Conductor task definitions module."""

from reflector.conductor.tasks.definitions import TASK_DEFINITIONS
from reflector.conductor.tasks.register import register_task_definitions

__all__ = ["TASK_DEFINITIONS", "register_task_definitions"]

161
server/reflector/conductor/tasks/definitions.py
Normal file
@@ -0,0 +1,161 @@
"""Task definitions for Conductor workflow orchestration.

Timeout reference (from CONDUCTOR_MIGRATION_REQUIREMENTS.md):
| Task              | Timeout (s) | Response Timeout (s) | Retry Count |
|-------------------|-------------|----------------------|-------------|
| get_recording     | 60          | 30                   | 3           |
| get_participants  | 60          | 30                   | 3           |
| pad_track         | 300         | 120                  | 3           |
| mixdown_tracks    | 600         | 300                  | 3           |
| generate_waveform | 120         | 60                   | 3           |
| transcribe_track  | 1800        | 900                  | 3           |
| merge_transcripts | 60          | 30                   | 3           |
| detect_topics     | 300         | 120                  | 3           |
| generate_title    | 60          | 30                   | 3           |
| generate_summary  | 300         | 120                  | 3           |
| finalize          | 60          | 30                   | 3           |
| cleanup_consent   | 60          | 30                   | 3           |
| post_zulip        | 60          | 30                   | 5           |
| send_webhook      | 60          | 30                   | 30          |
"""

OWNER_EMAIL = "reflector@example.com"

TASK_DEFINITIONS = [
    {
        "name": "get_recording",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["recording_id"],
        "outputKeys": ["id", "mtg_session_id", "room_name", "duration"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "get_participants",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["mtg_session_id"],
        "outputKeys": ["participants"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "pad_track",
        "retryCount": 3,
        "timeoutSeconds": 300,
        "responseTimeoutSeconds": 120,
        "inputKeys": ["track_index", "s3_key", "bucket_name", "transcript_id"],
        "outputKeys": ["padded_url", "size", "track_index"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "mixdown_tracks",
        "retryCount": 3,
        "timeoutSeconds": 600,
        "responseTimeoutSeconds": 300,
        "inputKeys": ["padded_urls", "transcript_id"],
        "outputKeys": ["audio_key", "duration", "size"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "generate_waveform",
        "retryCount": 3,
        "timeoutSeconds": 120,
        "responseTimeoutSeconds": 60,
        "inputKeys": ["audio_key", "transcript_id"],
        "outputKeys": ["waveform"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "transcribe_track",
        "retryCount": 3,
        "timeoutSeconds": 1800,
        "responseTimeoutSeconds": 900,
        "inputKeys": ["track_index", "audio_url", "language"],
        "outputKeys": ["words", "track_index"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "merge_transcripts",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["transcripts", "transcript_id"],
        "outputKeys": ["all_words", "word_count"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "detect_topics",
        "retryCount": 3,
        "timeoutSeconds": 300,
        "responseTimeoutSeconds": 120,
        "inputKeys": ["words", "transcript_id", "target_language"],
        "outputKeys": ["topics"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "generate_title",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["topics", "transcript_id"],
        "outputKeys": ["title"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "generate_summary",
        "retryCount": 3,
        "timeoutSeconds": 300,
        "responseTimeoutSeconds": 120,
        "inputKeys": ["words", "topics", "transcript_id"],
        "outputKeys": ["summary", "short_summary"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "finalize",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["transcript_id", "title", "summary", "short_summary", "duration"],
        "outputKeys": ["status"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "cleanup_consent",
        "retryCount": 3,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["transcript_id"],
        "outputKeys": ["audio_deleted", "reason"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "post_zulip",
        "retryCount": 5,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["transcript_id"],
        "outputKeys": ["message_id"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "send_webhook",
        "retryCount": 30,
        "timeoutSeconds": 60,
        "responseTimeoutSeconds": 30,
        "inputKeys": ["transcript_id", "room_id"],
        "outputKeys": ["sent", "status_code"],
        "ownerEmail": OWNER_EMAIL,
    },
    {
        "name": "generate_dynamic_fork_tasks",
        "retryCount": 3,
        "timeoutSeconds": 30,
        "responseTimeoutSeconds": 15,
        "inputKeys": ["tracks", "task_type", "transcript_id", "bucket_name"],
        "outputKeys": ["tasks", "inputs"],
        "ownerEmail": OWNER_EMAIL,
        "description": "Helper task to generate dynamic fork structure for variable track counts",
    },
]

60
server/reflector/conductor/tasks/register.py
Normal file
@@ -0,0 +1,60 @@
"""Register task definitions with Conductor server."""

import httpx

from reflector.conductor.tasks.definitions import TASK_DEFINITIONS
from reflector.logger import logger
from reflector.settings import settings


def register_task_definitions() -> None:
    """Register all task definitions with Conductor server.

    Raises:
        httpx.HTTPStatusError: If registration fails.
    """
    base_url = settings.CONDUCTOR_SERVER_URL.rstrip("/")
    url = f"{base_url}/metadata/taskdefs"

    logger.info(
        "Registering task definitions",
        count=len(TASK_DEFINITIONS),
        url=url,
    )

    with httpx.Client(timeout=30.0) as client:
        resp = client.post(
            url,
            json=TASK_DEFINITIONS,
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

    logger.info("Task definitions registered successfully")


async def register_task_definitions_async() -> None:
    """Async version of register_task_definitions."""
    base_url = settings.CONDUCTOR_SERVER_URL.rstrip("/")
    url = f"{base_url}/metadata/taskdefs"

    logger.info(
        "Registering task definitions",
        count=len(TASK_DEFINITIONS),
        url=url,
    )

    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            url,
            json=TASK_DEFINITIONS,
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

    logger.info("Task definitions registered successfully")


if __name__ == "__main__":
    register_task_definitions()
    print(f"Registered {len(TASK_DEFINITIONS)} task definitions")
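
To sanity-check a deployment, a registered definition could be read back from the metadata API after running the script above. This is a sketch, not part of the diff; it assumes the standard Conductor GET /metadata/taskdefs/{name} endpoint is reachable at the same base URL used by the registration code:

    import httpx

    from reflector.settings import settings

    def verify_task_registered(name: str = "get_recording") -> dict:
        # Fetch a single task definition back from Conductor's metadata API.
        base_url = settings.CONDUCTOR_SERVER_URL.rstrip("/")
        resp = httpx.get(f"{base_url}/metadata/taskdefs/{name}", timeout=10.0)
        resp.raise_for_status()
        return resp.json()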

37
server/reflector/conductor/workers/__init__.py
Normal file
@@ -0,0 +1,37 @@
"""Conductor workers for the diarization pipeline."""

from reflector.conductor.workers.cleanup_consent import cleanup_consent
from reflector.conductor.workers.detect_topics import detect_topics
from reflector.conductor.workers.finalize import finalize
from reflector.conductor.workers.generate_dynamic_fork_tasks import (
    generate_dynamic_fork_tasks,
)
from reflector.conductor.workers.generate_summary import generate_summary
from reflector.conductor.workers.generate_title import generate_title
from reflector.conductor.workers.generate_waveform import generate_waveform
from reflector.conductor.workers.get_participants import get_participants
from reflector.conductor.workers.get_recording import get_recording
from reflector.conductor.workers.merge_transcripts import merge_transcripts
from reflector.conductor.workers.mixdown_tracks import mixdown_tracks
from reflector.conductor.workers.pad_track import pad_track
from reflector.conductor.workers.post_zulip import post_zulip
from reflector.conductor.workers.send_webhook import send_webhook
from reflector.conductor.workers.transcribe_track import transcribe_track

__all__ = [
    "get_recording",
    "get_participants",
    "pad_track",
    "mixdown_tracks",
    "generate_waveform",
    "transcribe_track",
    "merge_transcripts",
    "detect_topics",
    "generate_title",
    "generate_summary",
    "finalize",
    "cleanup_consent",
    "post_zulip",
    "send_webhook",
    "generate_dynamic_fork_tasks",
]

126
server/reflector/conductor/workers/cleanup_consent.py
Normal file
@@ -0,0 +1,126 @@
"""Conductor worker: cleanup_consent - Check consent and delete audio if denied."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger


@worker_task(task_definition_name="cleanup_consent")
def cleanup_consent(task: Task) -> TaskResult:
    """Check participant consent and delete audio if denied.

    Input:
        transcript_id: str - Transcript ID

    Output:
        audio_deleted: bool - Whether audio was deleted
        reason: str | None - Reason for deletion
    """
    transcript_id = task.input_data.get("transcript_id")

    logger.info("[Worker] cleanup_consent", transcript_id=transcript_id)

    if transcript_id:
        emit_progress(
            transcript_id, "cleanup_consent", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not transcript_id:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "Missing transcript_id"
        return task_result

    import asyncio

    async def _process():
        import databases

        from reflector.db import _database_context
        from reflector.db.transcripts import transcripts_controller
        from reflector.settings import settings
        from reflector.storage import get_transcripts_storage

        # Create fresh database connection for subprocess (not shared from parent)
        _database_context.set(None)
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)
        await db.connect()

        try:
            transcript = await transcripts_controller.get_by_id(transcript_id)
            if transcript is None:
                raise ValueError(f"Transcript {transcript_id} not found in database")

            # Check if any participant denied consent
            # This mirrors the logic from main_live_pipeline.task_cleanup_consent
            audio_deleted = False
            reason = None

            if transcript.participants:
                for p in transcript.participants:
                    if hasattr(p, "consent") and p.consent == "denied":
                        audio_deleted = True
                        reason = f"Participant {p.name or p.id} denied consent"
                        break

            if audio_deleted:
                storage = get_transcripts_storage()
                audio_key = f"{transcript_id}/audio.mp3"
                try:
                    await storage.delete_file(audio_key)
                    await transcripts_controller.update(
                        transcript, {"audio_deleted": True}
                    )
                    logger.info(
                        "[Worker] cleanup_consent: audio deleted",
                        transcript_id=transcript_id,
                        reason=reason,
                    )
                except Exception as e:
                    logger.warning(
                        "[Worker] cleanup_consent: failed to delete audio",
                        error=str(e),
                    )

            return audio_deleted, reason
        finally:
            await db.disconnect()
            _database_context.set(None)

    try:
        audio_deleted, reason = asyncio.run(_process())
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {
            "audio_deleted": audio_deleted,
            "reason": reason,
        }

        logger.info(
            "[Worker] cleanup_consent complete",
            transcript_id=transcript_id,
            audio_deleted=audio_deleted,
        )

        if transcript_id:
            emit_progress(
                transcript_id, "cleanup_consent", "completed", task.workflow_instance_id
            )

    except Exception as e:
        logger.error("[Worker] cleanup_consent failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "cleanup_consent", "failed", task.workflow_instance_id
            )

    return task_result

93
server/reflector/conductor/workers/detect_topics.py
Normal file
@@ -0,0 +1,93 @@
"""Conductor worker: detect_topics - Detect topics using LLM."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger


@worker_task(task_definition_name="detect_topics")
def detect_topics(task: Task) -> TaskResult:
    """Detect topics using LLM.

    Input:
        words: list[dict] - Transcribed words
        transcript_id: str - Transcript ID
        target_language: str - Target language code (default: "en")

    Output:
        topics: list[dict] - Detected topics
    """
    words = task.input_data.get("words", [])
    transcript_id = task.input_data.get("transcript_id")
    target_language = task.input_data.get("target_language", "en")

    logger.info(
        "[Worker] detect_topics",
        word_count=len(words),
        transcript_id=transcript_id,
    )

    if transcript_id:
        emit_progress(
            transcript_id, "detect_topics", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    import asyncio

    async def _process():
        from reflector.pipelines import topic_processing
        from reflector.processors.types import Transcript as TranscriptType
        from reflector.processors.types import Word

        # Convert word dicts to Word objects
        word_objects = [Word(**w) for w in words]
        transcript = TranscriptType(words=word_objects)

        empty_pipeline = topic_processing.EmptyPipeline(logger=logger)

        async def noop_callback(t):
            pass

        topics = await topic_processing.detect_topics(
            transcript,
            target_language,
            on_topic_callback=noop_callback,
            empty_pipeline=empty_pipeline,
        )

        return [t.model_dump() for t in topics]

    try:
        topics = asyncio.run(_process())
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {"topics": topics}

        logger.info(
            "[Worker] detect_topics complete",
            transcript_id=transcript_id,
            topic_count=len(topics),
        )

        if transcript_id:
            emit_progress(
                transcript_id, "detect_topics", "completed", task.workflow_instance_id
            )

    except Exception as e:
        logger.error("[Worker] detect_topics failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "detect_topics", "failed", task.workflow_instance_id
            )

    return task_result

111
server/reflector/conductor/workers/finalize.py
Normal file
@@ -0,0 +1,111 @@
"""Conductor worker: finalize - Finalize transcript status and update database."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger


@worker_task(task_definition_name="finalize")
def finalize(task: Task) -> TaskResult:
    """Finalize the transcript status and update the database.

    Input:
        transcript_id: str - Transcript ID
        title: str - Generated title
        summary: str - Long summary
        short_summary: str - Short summary
        duration: float - Audio duration

    Output:
        status: str - "COMPLETED"
    """
    transcript_id = task.input_data.get("transcript_id")
    title = task.input_data.get("title", "")
    summary = task.input_data.get("summary", "")
    short_summary = task.input_data.get("short_summary", "")
    duration = task.input_data.get("duration", 0)

    logger.info(
        "[Worker] finalize",
        transcript_id=transcript_id,
        title=title,
    )

    if transcript_id:
        emit_progress(
            transcript_id, "finalize", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not transcript_id:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "Missing transcript_id"
        return task_result

    import asyncio

    async def _process():
        import databases

        from reflector.db import _database_context
        from reflector.db.transcripts import transcripts_controller
        from reflector.settings import settings

        # Create fresh database connection for subprocess (not shared from parent)
        _database_context.set(None)
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)
        await db.connect()

        try:
            transcript = await transcripts_controller.get_by_id(transcript_id)
            if transcript is None:
                raise ValueError(f"Transcript {transcript_id} not found in database")

            await transcripts_controller.update(
                transcript,
                {
                    "status": "ended",
                    "title": title,
                    "long_summary": summary,
                    "short_summary": short_summary,
                    "duration": duration,
                },
            )
            return True
        finally:
            await db.disconnect()
            _database_context.set(None)

    try:
        asyncio.run(_process())
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {"status": "COMPLETED"}

        logger.info(
            "[Worker] finalize complete",
            transcript_id=transcript_id,
        )

        if transcript_id:
            emit_progress(
                transcript_id, "finalize", "completed", task.workflow_instance_id
            )

    except Exception as e:
        logger.error("[Worker] finalize failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "finalize", "failed", task.workflow_instance_id
            )

    return task_result

110
server/reflector/conductor/workers/generate_dynamic_fork_tasks.py
Normal file
@@ -0,0 +1,110 @@
"""Conductor worker: generate_dynamic_fork_tasks - Helper for FORK_JOIN_DYNAMIC."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.logger import logger


@worker_task(task_definition_name="generate_dynamic_fork_tasks")
def generate_dynamic_fork_tasks(task: Task) -> TaskResult:
    """Generate dynamic fork task structure for variable track counts.

    This helper task generates the task definitions and inputs needed for
    FORK_JOIN_DYNAMIC to process N tracks in parallel.

    Input:
        tracks: list[dict] - List of track info with s3_key
        task_type: str - Either "pad_track" or "transcribe_track"
        transcript_id: str - Transcript ID
        bucket_name: str - S3 bucket name (for pad_track)
        padded_urls: list[dict] - Padded track outputs (for transcribe_track)

    Output:
        tasks: list[dict] - Task definitions for dynamic fork
        inputs: dict - Input parameters keyed by task reference name
    """
    tracks = task.input_data.get("tracks", [])
    task_type = task.input_data.get("task_type")
    transcript_id = task.input_data.get("transcript_id")
    bucket_name = task.input_data.get("bucket_name")
    padded_urls = task.input_data.get("padded_urls", {})

    logger.info(
        "[Worker] generate_dynamic_fork_tasks",
        task_type=task_type,
        track_count=len(tracks),
    )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not tracks or not task_type:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "Missing tracks or task_type"
        return task_result

    try:
        tasks = []
        inputs = {}

        for idx, track in enumerate(tracks):
            ref_name = f"{task_type}_{idx}"

            # Task definition
            tasks.append(
                {
                    "name": task_type,
                    "taskReferenceName": ref_name,
                    "type": "SIMPLE",
                }
            )

            # Task input based on type
            if task_type == "pad_track":
                inputs[ref_name] = {
                    "track_index": idx,
                    "s3_key": track.get("s3_key"),
                    "bucket_name": bucket_name,
                    "transcript_id": transcript_id,
                }
            elif task_type == "transcribe_track":
                # Get padded URL from previous fork join output
                padded_url = None
                if isinstance(padded_urls, dict):
                    # Try to get from join output structure
                    pad_ref = f"pad_track_{idx}"
                    if pad_ref in padded_urls:
                        padded_url = padded_urls[pad_ref].get("padded_url")
                    elif "padded_url" in padded_urls:
                        # Single track case
                        padded_url = padded_urls.get("padded_url")

                inputs[ref_name] = {
                    "track_index": idx,
                    "audio_url": padded_url,
                    "language": "en",
                    "transcript_id": transcript_id,
                }

        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {
            "tasks": tasks,
            "inputs": inputs,
        }

        logger.info(
            "[Worker] generate_dynamic_fork_tasks complete",
            task_type=task_type,
            task_count=len(tasks),
        )

    except Exception as e:
        logger.error("[Worker] generate_dynamic_fork_tasks failed", error=str(e))
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)

    return task_result
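
For orientation, this is roughly how the tasks/inputs output above would be wired into a FORK_JOIN_DYNAMIC step in the workflow definition. It is a sketch using Conductor's standard dynamicForkTasksParam / dynamicForkTasksInputParamName fields; the reference names and input expressions are illustrative, not taken from this diff:

    # Hypothetical fragment of the workflow definition, expressed as Python dicts.
    dynamic_fork_step = [
        {
            "name": "generate_dynamic_fork_tasks",
            "taskReferenceName": "gen_pad_forks",
            "type": "SIMPLE",
            "inputParameters": {
                "tracks": "${workflow.input.tracks}",
                "task_type": "pad_track",
                "transcript_id": "${workflow.input.transcript_id}",
                "bucket_name": "${workflow.input.bucket_name}",
            },
        },
        {
            "name": "pad_tracks_fork",
            "taskReferenceName": "pad_tracks_fork",
            "type": "FORK_JOIN_DYNAMIC",
            "dynamicForkTasksParam": "dynamicTasks",
            "dynamicForkTasksInputParamName": "dynamicTasksInput",
            "inputParameters": {
                "dynamicTasks": "${gen_pad_forks.output.tasks}",
                "dynamicTasksInput": "${gen_pad_forks.output.inputs}",
            },
        },
        {"name": "pad_tracks_join", "taskReferenceName": "pad_tracks_join", "type": "JOIN"},
    ]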

150
server/reflector/conductor/workers/generate_summary.py
Normal file
@@ -0,0 +1,150 @@
"""Conductor worker: generate_summary - Generate meeting summaries using LLM."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger


@worker_task(task_definition_name="generate_summary")
def generate_summary(task: Task) -> TaskResult:
    """Generate long and short summaries from topics and words using LLM.

    Input:
        words: list[dict] - Transcribed words
        topics: list[dict] - Detected topics
        transcript_id: str - Transcript ID

    Output:
        summary: str - Long summary
        short_summary: str - Short summary
    """
    words = task.input_data.get("words", [])
    topics = task.input_data.get("topics", [])
    transcript_id = task.input_data.get("transcript_id")

    logger.info(
        "[Worker] generate_summary",
        word_count=len(words),
        topic_count=len(topics),
        transcript_id=transcript_id,
    )

    if transcript_id:
        emit_progress(
            transcript_id, "generate_summary", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    import asyncio

    async def _process():
        import databases

        from reflector.db import _database_context
        from reflector.db.transcripts import transcripts_controller
        from reflector.pipelines import topic_processing
        from reflector.processors.types import TitleSummary, Word
        from reflector.processors.types import Transcript as TranscriptType
        from reflector.settings import settings

        # Create fresh database connection for subprocess (not shared from parent)
        # Reset context var to ensure we get a fresh connection
        _database_context.set(None)
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)
        await db.connect()

        try:
            # detect_topics returns TitleSummary objects (with transcript: Transcript)
            # When serialized, transcript becomes {translation, words} dict
            # We need to reconstruct TitleSummary objects with proper Transcript
            def normalize_topic(t):
                topic = dict(t)
                transcript_data = topic.get("transcript")
                if isinstance(transcript_data, dict):
                    # Reconstruct Transcript object from serialized dict
                    words_list = transcript_data.get("words", [])
                    word_objects = [
                        Word(**w) if isinstance(w, dict) else w for w in words_list
                    ]
                    topic["transcript"] = TranscriptType(
                        words=word_objects,
                        translation=transcript_data.get("translation"),
                    )
                elif transcript_data is None:
                    topic["transcript"] = TranscriptType(words=[])
                return topic

            topic_objects = [TitleSummary(**normalize_topic(t)) for t in topics]
            empty_pipeline = topic_processing.EmptyPipeline(logger=logger)

            transcript = await transcripts_controller.get_by_id(transcript_id)

            long_summary = ""
            short_summary = ""

            async def on_long(s):
                nonlocal long_summary
                # s is FinalLongSummary object
                long_summary = s.long_summary if hasattr(s, "long_summary") else str(s)

            async def on_short(s):
                nonlocal short_summary
                # s is FinalShortSummary object
                short_summary = (
                    s.short_summary if hasattr(s, "short_summary") else str(s)
                )

            await topic_processing.generate_summaries(
                topic_objects,
                transcript,
                on_long_summary_callback=on_long,
                on_short_summary_callback=on_short,
                empty_pipeline=empty_pipeline,
                logger=logger,
            )

            return long_summary, short_summary
        finally:
            await db.disconnect()
            _database_context.set(None)

    try:
        summary, short_summary = asyncio.run(_process())
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {
            "summary": summary,
            "short_summary": short_summary,
        }

        logger.info(
            "[Worker] generate_summary complete",
            transcript_id=transcript_id,
            summary_len=len(summary) if summary else 0,
        )

        if transcript_id:
            emit_progress(
                transcript_id,
                "generate_summary",
                "completed",
                task.workflow_instance_id,
            )

    except Exception as e:
        logger.error("[Worker] generate_summary failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "generate_summary", "failed", task.workflow_instance_id
            )

    return task_result

111
server/reflector/conductor/workers/generate_title.py
Normal file
@@ -0,0 +1,111 @@
"""Conductor worker: generate_title - Generate meeting title using LLM."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger


@worker_task(task_definition_name="generate_title")
def generate_title(task: Task) -> TaskResult:
    """Generate meeting title from detected topics using LLM.

    Input:
        topics: list[dict] - Detected topics
        transcript_id: str - Transcript ID

    Output:
        title: str - Generated title
    """
    topics = task.input_data.get("topics", [])
    transcript_id = task.input_data.get("transcript_id")

    logger.info(
        "[Worker] generate_title",
        topic_count=len(topics),
        transcript_id=transcript_id,
    )

    if transcript_id:
        emit_progress(
            transcript_id, "generate_title", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not topics:
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {"title": "Untitled Meeting"}
        return task_result

    import asyncio

    async def _process():
        from reflector.pipelines import topic_processing
        from reflector.processors.types import TitleSummary, Word
        from reflector.processors.types import Transcript as TranscriptType

        # detect_topics returns TitleSummary objects (with transcript: Transcript)
        # When serialized, transcript becomes {translation, words} dict
        # We need to reconstruct TitleSummary objects with proper Transcript
        def normalize_topic(t):
            topic = dict(t)
            transcript_data = topic.get("transcript")
            if isinstance(transcript_data, dict):
                # Reconstruct Transcript object from serialized dict
                words_list = transcript_data.get("words", [])
                word_objects = [
                    Word(**w) if isinstance(w, dict) else w for w in words_list
                ]
                topic["transcript"] = TranscriptType(
                    words=word_objects, translation=transcript_data.get("translation")
                )
            elif transcript_data is None:
                topic["transcript"] = TranscriptType(words=[])
            return topic

        topic_objects = [TitleSummary(**normalize_topic(t)) for t in topics]
        empty_pipeline = topic_processing.EmptyPipeline(logger=logger)

        async def noop_callback(t):
            pass

        title = await topic_processing.generate_title(
            topic_objects,
            on_title_callback=noop_callback,
            empty_pipeline=empty_pipeline,
            logger=logger,
        )
        return title

    try:
        title = asyncio.run(_process())
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {"title": title}

        logger.info(
            "[Worker] generate_title complete",
            transcript_id=transcript_id,
            title=title,
        )

        if transcript_id:
            emit_progress(
                transcript_id, "generate_title", "completed", task.workflow_instance_id
            )

    except Exception as e:
        logger.error("[Worker] generate_title failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "generate_title", "failed", task.workflow_instance_id
            )

    return task_result

106
server/reflector/conductor/workers/generate_waveform.py
Normal file
@@ -0,0 +1,106 @@
"""Conductor worker: generate_waveform - Generate waveform visualization data."""

import tempfile
from pathlib import Path

import httpx

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger
from reflector.storage import get_transcripts_storage
from reflector.utils.audio_waveform import get_audio_waveform

PRESIGNED_URL_EXPIRATION_SECONDS = 7200


@worker_task(task_definition_name="generate_waveform")
def generate_waveform(task: Task) -> TaskResult:
    """Generate waveform visualization data from mixed audio.

    Input:
        audio_key: str - S3 key of the audio file
        transcript_id: str - Transcript ID

    Output:
        waveform: list[float] - Waveform peaks array
    """
    audio_key = task.input_data.get("audio_key")
    transcript_id = task.input_data.get("transcript_id")

    logger.info(
        "[Worker] generate_waveform", audio_key=audio_key, transcript_id=transcript_id
    )

    if transcript_id:
        emit_progress(
            transcript_id, "generate_waveform", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not audio_key or not transcript_id:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "Missing audio_key or transcript_id"
        return task_result

    import asyncio

    async def _process():
        storage = get_transcripts_storage()
        audio_url = await storage.get_file_url(
            audio_key,
            operation="get_object",
            expires_in=PRESIGNED_URL_EXPIRATION_SECONDS,
        )

        # Download audio to temp file
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp_path = Path(tmp.name)
            async with httpx.AsyncClient() as client:
                resp = await client.get(audio_url)
                resp.raise_for_status()
                tmp.write(resp.content)

        try:
            waveform = get_audio_waveform(tmp_path, segments_count=255)
        finally:
            tmp_path.unlink(missing_ok=True)

        return waveform

    try:
        waveform = asyncio.run(_process())
        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {"waveform": waveform}

        logger.info(
            "[Worker] generate_waveform complete",
            transcript_id=transcript_id,
            peaks_count=len(waveform) if waveform else 0,
        )

        if transcript_id:
            emit_progress(
                transcript_id,
                "generate_waveform",
                "completed",
                task.workflow_instance_id,
            )

    except Exception as e:
        logger.error("[Worker] generate_waveform failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "generate_waveform", "failed", task.workflow_instance_id
            )

    return task_result

96
server/reflector/conductor/workers/get_participants.py
Normal file
@@ -0,0 +1,96 @@
"""Conductor worker: get_participants - Fetch meeting participants from Daily.co API."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.dailyco_api.client import DailyApiClient
from reflector.logger import logger
from reflector.settings import settings


@worker_task(task_definition_name="get_participants")
def get_participants(task: Task) -> TaskResult:
    """Fetch meeting participants from Daily.co API.

    Input:
        mtg_session_id: str - Daily.co meeting session identifier
        transcript_id: str - Transcript ID for progress tracking

    Output:
        participants: list[dict] - List of participant info
            - participant_id: str
            - user_name: str | None
            - user_id: str | None
    """
    mtg_session_id = task.input_data.get("mtg_session_id")
    transcript_id = task.input_data.get("transcript_id")

    logger.info("[Worker] get_participants", mtg_session_id=mtg_session_id)

    if transcript_id:
        emit_progress(
            transcript_id, "get_participants", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not mtg_session_id:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "Missing mtg_session_id"
        return task_result

    if not settings.DAILY_API_KEY:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "DAILY_API_KEY not configured"
        return task_result

    import asyncio

    async def _fetch():
        async with DailyApiClient(api_key=settings.DAILY_API_KEY) as client:
            return await client.get_meeting_participants(mtg_session_id)

    try:
        response = asyncio.run(_fetch())

        participants = [
            {
                "participant_id": p.participant_id,
                "user_name": p.user_name,
                "user_id": p.user_id,
            }
            for p in response.data
        ]

        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {"participants": participants}

        logger.info(
            "[Worker] get_participants complete",
            mtg_session_id=mtg_session_id,
            count=len(participants),
        )

        if transcript_id:
            emit_progress(
                transcript_id,
                "get_participants",
                "completed",
                task.workflow_instance_id,
            )

    except Exception as e:
        logger.error("[Worker] get_participants failed", error=str(e))
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "get_participants", "failed", task.workflow_instance_id
            )

    return task_result

90
server/reflector/conductor/workers/get_recording.py
Normal file
@@ -0,0 +1,90 @@
"""Conductor worker: get_recording - Fetch recording metadata from Daily.co API."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.dailyco_api.client import DailyApiClient
from reflector.logger import logger
from reflector.settings import settings


@worker_task(task_definition_name="get_recording")
def get_recording(task: Task) -> TaskResult:
    """Fetch recording metadata from Daily.co API.

    Input:
        recording_id: str - Daily.co recording identifier
        transcript_id: str - Transcript ID for progress tracking

    Output:
        id: str - Recording ID
        mtg_session_id: str - Meeting session ID
        room_name: str - Room name
        duration: int - Recording duration in seconds
    """
    recording_id = task.input_data.get("recording_id")
    transcript_id = task.input_data.get("transcript_id")

    logger.info("[Worker] get_recording", recording_id=recording_id)

    if transcript_id:
        emit_progress(
            transcript_id, "get_recording", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    if not recording_id:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "Missing recording_id"
        return task_result

    if not settings.DAILY_API_KEY:
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = "DAILY_API_KEY not configured"
        return task_result

    import asyncio

    async def _fetch():
        async with DailyApiClient(api_key=settings.DAILY_API_KEY) as client:
            return await client.get_recording(recording_id)

    try:
        recording = asyncio.run(_fetch())

        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {
            "id": recording.id,
            "mtg_session_id": recording.mtgSessionId,
            "room_name": recording.room_name,
            "duration": recording.duration,
        }

        logger.info(
            "[Worker] get_recording complete",
            recording_id=recording_id,
            room_name=recording.room_name,
            duration=recording.duration,
        )

        if transcript_id:
            emit_progress(
                transcript_id, "get_recording", "completed", task.workflow_instance_id
            )

    except Exception as e:
        logger.error("[Worker] get_recording failed", error=str(e))
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "get_recording", "failed", task.workflow_instance_id
            )

    return task_result

89
server/reflector/conductor/workers/merge_transcripts.py
Normal file
@@ -0,0 +1,89 @@
"""Conductor worker: merge_transcripts - Merge multiple track transcriptions."""

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus
from conductor.client.worker.worker_task import worker_task
from reflector.conductor.progress import emit_progress
from reflector.logger import logger


@worker_task(task_definition_name="merge_transcripts")
def merge_transcripts(task: Task) -> TaskResult:
    """Merge multiple track transcriptions into single timeline sorted by timestamp.

    Input:
        transcripts: list[dict] - List of transcription results with words
        transcript_id: str - Transcript ID

    Output:
        all_words: list[dict] - Merged and sorted words
        word_count: int - Total word count
    """
    transcripts = task.input_data.get("transcripts", [])
    transcript_id = task.input_data.get("transcript_id")

    logger.info(
        "[Worker] merge_transcripts",
        transcript_count=len(transcripts)
        if isinstance(transcripts, (list, dict))
        else 0,
        transcript_id=transcript_id,
    )

    if transcript_id:
        emit_progress(
            transcript_id, "merge_transcripts", "in_progress", task.workflow_instance_id
        )

    task_result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )

    try:
        all_words = []

        # Handle JOIN output (dict with task refs as keys)
        if isinstance(transcripts, dict):
            transcripts = list(transcripts.values())

        for t in transcripts:
            if isinstance(t, list):
                all_words.extend(t)
            elif isinstance(t, dict) and "words" in t:
                all_words.extend(t["words"])

        # Sort by start timestamp
        all_words.sort(key=lambda w: w.get("start", 0))

        task_result.status = TaskResultStatus.COMPLETED
        task_result.output_data = {
            "all_words": all_words,
            "word_count": len(all_words),
        }

        logger.info(
            "[Worker] merge_transcripts complete",
            transcript_id=transcript_id,
            word_count=len(all_words),
        )

        if transcript_id:
            emit_progress(
                transcript_id,
                "merge_transcripts",
                "completed",
                task.workflow_instance_id,
            )

    except Exception as e:
        logger.error("[Worker] merge_transcripts failed", error=str(e), exc_info=True)
        task_result.status = TaskResultStatus.FAILED
        task_result.reason_for_incompletion = str(e)
        if transcript_id:
            emit_progress(
                transcript_id, "merge_transcripts", "failed", task.workflow_instance_id
            )

    return task_result
278
server/reflector/conductor/workers/mixdown_tracks.py
Normal file
278
server/reflector/conductor/workers/mixdown_tracks.py
Normal file
@@ -0,0 +1,278 @@
|
|||||||
|
"""Conductor worker: mixdown_tracks - Mix multiple audio tracks into single file.
|
||||||
|
|
||||||
|
Builds PyAV filter graph with amix filter to combine N padded tracks into
|
||||||
|
a single stereo MP3 file.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import tempfile
|
||||||
|
from fractions import Fraction
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import av
|
||||||
|
from av.audio.resampler import AudioResampler
|
||||||
|
|
||||||
|
from conductor.client.http.models import Task, TaskResult
|
||||||
|
from conductor.client.http.models.task_result_status import TaskResultStatus
|
||||||
|
from conductor.client.worker.worker_task import worker_task
|
||||||
|
from reflector.conductor.progress import emit_progress
|
||||||
|
from reflector.logger import logger
|
||||||
|
from reflector.storage import get_transcripts_storage
|
||||||
|
|
||||||
|
PRESIGNED_URL_EXPIRATION_SECONDS = 7200
|
||||||
|
MP3_BITRATE = 192000
|
||||||
|
|
||||||
|
|
||||||
|
def _build_mixdown_filter_graph(containers: list, target_sample_rate: int):
|
||||||
|
"""Build PyAV filter graph: N abuffer -> amix -> aformat -> sink.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
containers: List of PyAV containers for input tracks
|
||||||
|
target_sample_rate: Output sample rate
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (graph, inputs list, sink)
|
||||||
|
"""
|
||||||
|
graph = av.filter.Graph()
|
||||||
|
inputs = []
|
||||||
|
|
||||||
|
for idx in range(len(containers)):
|
||||||
|
args = (
|
||||||
|
f"time_base=1/{target_sample_rate}:"
|
||||||
|
f"sample_rate={target_sample_rate}:"
|
||||||
|
f"sample_fmt=s32:"
|
||||||
|
f"channel_layout=stereo"
|
||||||
|
)
|
||||||
|
in_ctx = graph.add("abuffer", args=args, name=f"in{idx}")
|
||||||
|
inputs.append(in_ctx)
|
||||||
|
|
||||||
|
# amix with normalize=0 to prevent volume reduction
|
||||||
|
mixer = graph.add("amix", args=f"inputs={len(containers)}:normalize=0", name="mix")
|
||||||
|
fmt = graph.add(
|
||||||
|
"aformat",
|
||||||
|
args=f"sample_fmts=s16:channel_layouts=stereo:sample_rates={target_sample_rate}",
|
||||||
|
name="fmt",
|
||||||
|
)
|
||||||
|
sink = graph.add("abuffersink", name="out")
|
||||||
|
|
||||||
|
for idx, in_ctx in enumerate(inputs):
|
||||||
|
in_ctx.link_to(mixer, 0, idx)
|
||||||
|
mixer.link_to(fmt)
|
||||||
|
fmt.link_to(sink)
|
||||||
|
graph.configure()
|
||||||
|
|
||||||
|
return graph, inputs, sink
|
||||||
|
|
||||||
|
|
||||||
|
@worker_task(task_definition_name="mixdown_tracks")
|
||||||
|
def mixdown_tracks(task: Task) -> TaskResult:
|
||||||
|
"""Mix multiple audio tracks into single stereo file.
|
||||||
|
|
||||||
|
Input:
|
||||||
|
padded_urls: list[str] - Presigned URLs of padded tracks
|
||||||
|
transcript_id: str - Transcript ID for storage path
|
||||||
|
|
||||||
|
Output:
|
||||||
|
audio_key: str - S3 key of mixed audio file
|
||||||
|
duration: float - Audio duration in seconds
|
||||||
|
size: int - File size in bytes
|
||||||
|
"""
|
||||||
|
padded_urls = task.input_data.get("padded_urls", [])
|
||||||
|
transcript_id = task.input_data.get("transcript_id")
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] mixdown_tracks",
|
||||||
|
track_count=len(padded_urls),
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "mixdown_tracks", "in_progress", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
task_result = TaskResult(
|
||||||
|
task_id=task.task_id,
|
||||||
|
workflow_instance_id=task.workflow_instance_id,
|
||||||
|
worker_id=task.worker_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not padded_urls or not transcript_id:
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = "Missing padded_urls or transcript_id"
|
||||||
|
return task_result
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
|
||||||
|
async def _process():
|
||||||
|
storage = get_transcripts_storage()
|
||||||
|
|
||||||
|
# Determine target sample rate from first track
|
||||||
|
target_sample_rate = None
|
||||||
|
for url in padded_urls:
|
||||||
|
if not url:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
with av.open(url) as container:
|
||||||
|
for frame in container.decode(audio=0):
|
||||||
|
target_sample_rate = frame.sample_rate
|
||||||
|
break
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if target_sample_rate:
|
||||||
|
break
|
||||||
|
|
||||||
|
if not target_sample_rate:
|
||||||
|
raise Exception("Mixdown failed: No decodable audio frames in any track")
|
||||||
|
|
||||||
|
# Open all containers with reconnect options for S3 streaming
|
||||||
|
containers = []
|
||||||
|
valid_urls = [url for url in padded_urls if url]
|
||||||
|
|
||||||
|
for url in valid_urls:
|
||||||
|
try:
|
||||||
|
c = av.open(
|
||||||
|
url,
|
||||||
|
options={
|
||||||
|
"reconnect": "1",
|
||||||
|
"reconnect_streamed": "1",
|
||||||
|
"reconnect_delay_max": "5",
|
||||||
|
},
|
||||||
|
)
|
||||||
|
containers.append(c)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(
|
||||||
|
"Mixdown: failed to open container", url=url[:50], error=str(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
if not containers:
|
||||||
|
raise Exception("Mixdown failed: Could not open any track containers")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Build filter graph
|
||||||
|
graph, inputs, sink = _build_mixdown_filter_graph(
|
||||||
|
containers, target_sample_rate
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create temp file for output
|
||||||
|
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as temp_file:
|
||||||
|
temp_path = temp_file.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Open output container for MP3
|
||||||
|
with av.open(temp_path, "w", format="mp3") as out_container:
|
||||||
|
out_stream = out_container.add_stream(
|
||||||
|
"libmp3lame", rate=target_sample_rate
|
||||||
|
)
|
||||||
|
out_stream.bit_rate = MP3_BITRATE
|
||||||
|
|
||||||
|
decoders = [c.decode(audio=0) for c in containers]
|
||||||
|
active = [True] * len(decoders)
|
||||||
|
resamplers = [
|
||||||
|
AudioResampler(
|
||||||
|
format="s32", layout="stereo", rate=target_sample_rate
|
||||||
|
)
|
||||||
|
for _ in decoders
|
||||||
|
]
|
||||||
|
|
||||||
|
duration_samples = 0
|
||||||
|
|
||||||
|
while any(active):
|
||||||
|
for i, (dec, is_active) in enumerate(zip(decoders, active)):
|
||||||
|
if not is_active:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
frame = next(dec)
|
||||||
|
except StopIteration:
|
||||||
|
active[i] = False
|
||||||
|
inputs[i].push(None)
|
||||||
|
continue
|
||||||
|
|
||||||
|
if frame.sample_rate != target_sample_rate:
|
||||||
|
continue
|
||||||
|
|
||||||
|
out_frames = resamplers[i].resample(frame) or []
|
||||||
|
for rf in out_frames:
|
||||||
|
rf.sample_rate = target_sample_rate
|
||||||
|
rf.time_base = Fraction(1, target_sample_rate)
|
||||||
|
inputs[i].push(rf)
|
||||||
|
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
mixed = sink.pull()
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
mixed.sample_rate = target_sample_rate
|
||||||
|
mixed.time_base = Fraction(1, target_sample_rate)
|
||||||
|
duration_samples += mixed.samples
|
||||||
|
for packet in out_stream.encode(mixed):
|
||||||
|
out_container.mux(packet)
|
||||||
|
|
||||||
|
# Flush remaining
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
mixed = sink.pull()
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
mixed.sample_rate = target_sample_rate
|
||||||
|
mixed.time_base = Fraction(1, target_sample_rate)
|
||||||
|
duration_samples += mixed.samples
|
||||||
|
for packet in out_stream.encode(mixed):
|
||||||
|
out_container.mux(packet)
|
||||||
|
|
||||||
|
for packet in out_stream.encode(None):
|
||||||
|
out_container.mux(packet)
|
||||||
|
|
||||||
|
# Get file size and duration
|
||||||
|
file_size = Path(temp_path).stat().st_size
|
||||||
|
duration = (
|
||||||
|
duration_samples / target_sample_rate if target_sample_rate else 0
|
||||||
|
)
|
||||||
|
|
||||||
|
# Upload to S3
|
||||||
|
storage_path = f"{transcript_id}/audio.mp3"
|
||||||
|
with open(temp_path, "rb") as mp3_file:
|
||||||
|
await storage.put_file(storage_path, mp3_file)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
Path(temp_path).unlink(missing_ok=True)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
for c in containers:
|
||||||
|
try:
|
||||||
|
c.close()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
return {
|
||||||
|
"audio_key": storage_path,
|
||||||
|
"duration": duration,
|
||||||
|
"size": file_size,
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = asyncio.run(_process())
|
||||||
|
task_result.status = TaskResultStatus.COMPLETED
|
||||||
|
task_result.output_data = result
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] mixdown_tracks complete",
|
||||||
|
audio_key=result["audio_key"],
|
||||||
|
duration=result["duration"],
|
||||||
|
size=result["size"],
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "mixdown_tracks", "completed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("[Worker] mixdown_tracks failed", error=str(e), exc_info=True)
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = str(e)
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "mixdown_tracks", "failed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
return task_result
|
||||||

server/reflector/conductor/workers/pad_track.py (new file)
@@ -0,0 +1,322 @@
"""Conductor worker: pad_track - Pad audio track with silence for alignment.
|
||||||
|
|
||||||
|
This worker extracts stream.start_time from WebM container metadata and applies
|
||||||
|
silence padding using PyAV filter graph (adelay). The padded audio is uploaded
|
||||||
|
to S3 and a presigned URL is returned.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import math
|
||||||
|
import tempfile
|
||||||
|
from fractions import Fraction
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import av
|
||||||
|
from av.audio.resampler import AudioResampler
|
||||||
|
|
||||||
|
from conductor.client.http.models import Task, TaskResult
|
||||||
|
from conductor.client.http.models.task_result_status import TaskResultStatus
|
||||||
|
from conductor.client.worker.worker_task import worker_task
|
||||||
|
from reflector.conductor.progress import emit_progress
|
||||||
|
from reflector.logger import logger
|
||||||
|
|
||||||
|
# Audio constants matching existing pipeline
|
||||||
|
OPUS_STANDARD_SAMPLE_RATE = 48000
|
||||||
|
OPUS_DEFAULT_BIT_RATE = 64000
|
||||||
|
PRESIGNED_URL_EXPIRATION_SECONDS = 7200
|
||||||
|
|
||||||
|
|
||||||
|
def _extract_stream_start_time_from_container(container, track_idx: int) -> float:
|
||||||
|
"""Extract meeting-relative start time from WebM stream metadata.
|
||||||
|
|
||||||
|
Uses PyAV to read stream.start_time from WebM container.
|
||||||
|
More accurate than filename timestamps by ~209ms due to network/encoding delays.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
container: PyAV container object
|
||||||
|
track_idx: Track index for logging
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Start time in seconds (0.0 if not found)
|
||||||
|
"""
|
||||||
|
start_time_seconds = 0.0
|
||||||
|
try:
|
||||||
|
audio_streams = [s for s in container.streams if s.type == "audio"]
|
||||||
|
stream = audio_streams[0] if audio_streams else container.streams[0]
|
||||||
|
|
||||||
|
# 1) Try stream-level start_time (most reliable for Daily.co tracks)
|
||||||
|
if stream.start_time is not None and stream.time_base is not None:
|
||||||
|
start_time_seconds = float(stream.start_time * stream.time_base)
|
||||||
|
|
||||||
|
# 2) Fallback to container-level start_time (in av.time_base units)
|
||||||
|
if (start_time_seconds <= 0) and (container.start_time is not None):
|
||||||
|
start_time_seconds = float(container.start_time * av.time_base)
|
||||||
|
|
||||||
|
# 3) Fallback to first packet DTS in stream.time_base
|
||||||
|
if start_time_seconds <= 0:
|
||||||
|
for packet in container.demux(stream):
|
||||||
|
if packet.dts is not None:
|
||||||
|
start_time_seconds = float(packet.dts * stream.time_base)
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(
|
||||||
|
"PyAV metadata read failed; assuming 0 start_time",
|
||||||
|
track_idx=track_idx,
|
||||||
|
error=str(e),
|
||||||
|
)
|
||||||
|
start_time_seconds = 0.0
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
f"Track {track_idx} stream metadata: start_time={start_time_seconds:.3f}s",
|
||||||
|
track_idx=track_idx,
|
||||||
|
)
|
||||||
|
return start_time_seconds
|
||||||
|
|
||||||
|
|
||||||
|
def _apply_audio_padding_to_file(
|
||||||
|
in_container,
|
||||||
|
output_path: str,
|
||||||
|
start_time_seconds: float,
|
||||||
|
track_idx: int,
|
||||||
|
) -> None:
|
||||||
|
"""Apply silence padding to audio track using PyAV filter graph.
|
||||||
|
|
||||||
|
Filter chain: abuffer -> aresample -> adelay -> abuffersink
|
||||||
|
|
||||||
|
Args:
|
||||||
|
in_container: PyAV input container
|
||||||
|
output_path: Path to write padded output
|
||||||
|
start_time_seconds: Amount of silence to prepend
|
||||||
|
track_idx: Track index for logging
|
||||||
|
"""
|
||||||
|
delay_ms = math.floor(start_time_seconds * 1000)
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
f"Padding track {track_idx} with {delay_ms}ms delay using PyAV",
|
||||||
|
track_idx=track_idx,
|
||||||
|
delay_ms=delay_ms,
|
||||||
|
)
|
||||||
|
|
||||||
|
with av.open(output_path, "w", format="webm") as out_container:
|
||||||
|
in_stream = next((s for s in in_container.streams if s.type == "audio"), None)
|
||||||
|
if in_stream is None:
|
||||||
|
raise Exception("No audio stream in input")
|
||||||
|
|
||||||
|
out_stream = out_container.add_stream("libopus", rate=OPUS_STANDARD_SAMPLE_RATE)
|
||||||
|
out_stream.bit_rate = OPUS_DEFAULT_BIT_RATE
|
||||||
|
graph = av.filter.Graph()
|
||||||
|
|
||||||
|
abuf_args = (
|
||||||
|
f"time_base=1/{OPUS_STANDARD_SAMPLE_RATE}:"
|
||||||
|
f"sample_rate={OPUS_STANDARD_SAMPLE_RATE}:"
|
||||||
|
f"sample_fmt=s16:"
|
||||||
|
f"channel_layout=stereo"
|
||||||
|
)
|
||||||
|
src = graph.add("abuffer", args=abuf_args, name="src")
|
||||||
|
aresample_f = graph.add("aresample", args="async=1", name="ares")
|
||||||
|
# adelay requires one delay value per channel separated by '|'
|
||||||
|
delays_arg = f"{delay_ms}|{delay_ms}"
|
||||||
|
adelay_f = graph.add("adelay", args=f"delays={delays_arg}:all=1", name="delay")
|
||||||
|
sink = graph.add("abuffersink", name="sink")
|
||||||
|
|
||||||
|
src.link_to(aresample_f)
|
||||||
|
aresample_f.link_to(adelay_f)
|
||||||
|
adelay_f.link_to(sink)
|
||||||
|
graph.configure()
|
||||||
|
|
||||||
|
resampler = AudioResampler(
|
||||||
|
format="s16", layout="stereo", rate=OPUS_STANDARD_SAMPLE_RATE
|
||||||
|
)
|
||||||
|
|
||||||
|
# Decode -> resample -> push through graph -> encode Opus
|
||||||
|
for frame in in_container.decode(in_stream):
|
||||||
|
out_frames = resampler.resample(frame) or []
|
||||||
|
for rframe in out_frames:
|
||||||
|
rframe.sample_rate = OPUS_STANDARD_SAMPLE_RATE
|
||||||
|
rframe.time_base = Fraction(1, OPUS_STANDARD_SAMPLE_RATE)
|
||||||
|
src.push(rframe)
|
||||||
|
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
f_out = sink.pull()
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
f_out.sample_rate = OPUS_STANDARD_SAMPLE_RATE
|
||||||
|
f_out.time_base = Fraction(1, OPUS_STANDARD_SAMPLE_RATE)
|
||||||
|
for packet in out_stream.encode(f_out):
|
||||||
|
out_container.mux(packet)
|
||||||
|
|
||||||
|
# Flush remaining frames
|
||||||
|
src.push(None)
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
f_out = sink.pull()
|
||||||
|
except Exception:
|
||||||
|
break
|
||||||
|
f_out.sample_rate = OPUS_STANDARD_SAMPLE_RATE
|
||||||
|
f_out.time_base = Fraction(1, OPUS_STANDARD_SAMPLE_RATE)
|
||||||
|
for packet in out_stream.encode(f_out):
|
||||||
|
out_container.mux(packet)
|
||||||
|
|
||||||
|
for packet in out_stream.encode(None):
|
||||||
|
out_container.mux(packet)
|
||||||
|
|
||||||
|
|
||||||
|
@worker_task(task_definition_name="pad_track")
|
||||||
|
def pad_track(task: Task) -> TaskResult:
|
||||||
|
"""Pad audio track with silence for alignment.
|
||||||
|
|
||||||
|
Input:
|
||||||
|
track_index: int - Index of the track
|
||||||
|
s3_key: str - S3 key of the source audio file
|
||||||
|
bucket_name: str - S3 bucket name
|
||||||
|
transcript_id: str - Transcript ID for storage path
|
||||||
|
|
||||||
|
Output:
|
||||||
|
padded_url: str - Presigned URL of padded track
|
||||||
|
size: int - File size in bytes
|
||||||
|
track_index: int - Track index (echoed back)
|
||||||
|
"""
|
||||||
|
track_index = task.input_data.get("track_index", 0)
|
||||||
|
s3_key = task.input_data.get("s3_key")
|
||||||
|
bucket_name = task.input_data.get("bucket_name")
|
||||||
|
transcript_id = task.input_data.get("transcript_id")
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] pad_track",
|
||||||
|
track_index=track_index,
|
||||||
|
s3_key=s3_key,
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "pad_track", "in_progress", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
task_result = TaskResult(
|
||||||
|
task_id=task.task_id,
|
||||||
|
workflow_instance_id=task.workflow_instance_id,
|
||||||
|
worker_id=task.worker_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not s3_key or not transcript_id:
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = "Missing s3_key or transcript_id"
|
||||||
|
return task_result
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
|
||||||
|
async def _process():
|
||||||
|
# Create fresh storage instance to avoid aioboto3 fork issues
|
||||||
|
from reflector.settings import settings
|
||||||
|
from reflector.storage.storage_aws import AwsStorage
|
||||||
|
|
||||||
|
storage = AwsStorage(
|
||||||
|
aws_bucket_name=settings.TRANSCRIPT_STORAGE_AWS_BUCKET_NAME,
|
||||||
|
aws_region=settings.TRANSCRIPT_STORAGE_AWS_REGION,
|
||||||
|
aws_access_key_id=settings.TRANSCRIPT_STORAGE_AWS_ACCESS_KEY_ID,
|
||||||
|
aws_secret_access_key=settings.TRANSCRIPT_STORAGE_AWS_SECRET_ACCESS_KEY,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Get presigned URL for source file
|
||||||
|
source_url = await storage.get_file_url(
|
||||||
|
s3_key,
|
||||||
|
operation="get_object",
|
||||||
|
expires_in=PRESIGNED_URL_EXPIRATION_SECONDS,
|
||||||
|
bucket=bucket_name,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Open container and extract start time
|
||||||
|
with av.open(source_url) as in_container:
|
||||||
|
start_time_seconds = _extract_stream_start_time_from_container(
|
||||||
|
in_container, track_index
|
||||||
|
)
|
||||||
|
|
||||||
|
# If no padding needed, return original URL
|
||||||
|
if start_time_seconds <= 0:
|
||||||
|
logger.info(
|
||||||
|
f"Track {track_index} requires no padding",
|
||||||
|
track_index=track_index,
|
||||||
|
)
|
||||||
|
return {
|
||||||
|
"padded_url": source_url,
|
||||||
|
"size": 0,
|
||||||
|
"track_index": track_index,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Create temp file for padded output
|
||||||
|
with tempfile.NamedTemporaryFile(suffix=".webm", delete=False) as temp_file:
|
||||||
|
temp_path = temp_file.name
|
||||||
|
|
||||||
|
try:
|
||||||
|
_apply_audio_padding_to_file(
|
||||||
|
in_container, temp_path, start_time_seconds, track_index
|
||||||
|
)
|
||||||
|
|
||||||
|
# Get file size
|
||||||
|
file_size = Path(temp_path).stat().st_size
|
||||||
|
|
||||||
|
# Upload using storage layer (use separate path in shadow mode to avoid conflicts)
|
||||||
|
storage_path = f"file_pipeline_conductor/{transcript_id}/tracks/padded_{track_index}.webm"
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
f"About to upload padded track",
|
||||||
|
key=storage_path,
|
||||||
|
size=file_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
with open(temp_path, "rb") as padded_file:
|
||||||
|
upload_result = await storage.put_file(storage_path, padded_file)
|
||||||
|
logger.info(
|
||||||
|
f"storage.put_file returned",
|
||||||
|
result=str(upload_result),
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
f"Uploaded padded track to S3",
|
||||||
|
key=storage_path,
|
||||||
|
size=file_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
Path(temp_path).unlink(missing_ok=True)
|
||||||
|
|
||||||
|
# Get presigned URL for padded file
|
||||||
|
padded_url = await storage.get_file_url(
|
||||||
|
storage_path,
|
||||||
|
operation="get_object",
|
||||||
|
expires_in=PRESIGNED_URL_EXPIRATION_SECONDS,
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"padded_url": padded_url,
|
||||||
|
"size": file_size,
|
||||||
|
"track_index": track_index,
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = asyncio.run(_process())
|
||||||
|
task_result.status = TaskResultStatus.COMPLETED
|
||||||
|
task_result.output_data = result
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] pad_track complete",
|
||||||
|
track_index=track_index,
|
||||||
|
padded_url=result["padded_url"][:50] + "...",
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "pad_track", "completed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("[Worker] pad_track failed", error=str(e), exc_info=True)
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = str(e)
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "pad_track", "failed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
return task_result
|
||||||

server/reflector/conductor/workers/post_zulip.py (new file)
@@ -0,0 +1,101 @@
"""Conductor worker: post_zulip - Post or update Zulip message with transcript summary."""
|
||||||
|
|
||||||
|
from conductor.client.http.models import Task, TaskResult
|
||||||
|
from conductor.client.http.models.task_result_status import TaskResultStatus
|
||||||
|
from conductor.client.worker.worker_task import worker_task
|
||||||
|
from reflector.conductor.progress import emit_progress
|
||||||
|
from reflector.logger import logger
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
|
||||||
|
@worker_task(task_definition_name="post_zulip")
|
||||||
|
def post_zulip(task: Task) -> TaskResult:
|
||||||
|
"""Post or update a Zulip message with the transcript summary.
|
||||||
|
|
||||||
|
Input:
|
||||||
|
transcript_id: str - Transcript ID
|
||||||
|
|
||||||
|
Output:
|
||||||
|
message_id: str | None - Zulip message ID
|
||||||
|
"""
|
||||||
|
transcript_id = task.input_data.get("transcript_id")
|
||||||
|
|
||||||
|
logger.info("[Worker] post_zulip", transcript_id=transcript_id)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "post_zulip", "in_progress", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
task_result = TaskResult(
|
||||||
|
task_id=task.task_id,
|
||||||
|
workflow_instance_id=task.workflow_instance_id,
|
||||||
|
worker_id=task.worker_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not transcript_id:
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = "Missing transcript_id"
|
||||||
|
return task_result
|
||||||
|
|
||||||
|
# Check if Zulip is configured
|
||||||
|
if not settings.ZULIP_REALM or not settings.ZULIP_API_KEY:
|
||||||
|
logger.info("[Worker] post_zulip: Zulip not configured, skipping")
|
||||||
|
task_result.status = TaskResultStatus.COMPLETED
|
||||||
|
task_result.output_data = {"message_id": None}
|
||||||
|
return task_result
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
|
||||||
|
async def _process():
|
||||||
|
import databases
|
||||||
|
|
||||||
|
from reflector.db import _database_context
|
||||||
|
from reflector.db.transcripts import transcripts_controller
|
||||||
|
from reflector.settings import settings as app_settings
|
||||||
|
from reflector.zulip import post_transcript_to_zulip
|
||||||
|
|
||||||
|
# Create fresh database connection for subprocess (not shared from parent)
|
||||||
|
_database_context.set(None)
|
||||||
|
db = databases.Database(app_settings.DATABASE_URL)
|
||||||
|
_database_context.set(db)
|
||||||
|
await db.connect()
|
||||||
|
|
||||||
|
try:
|
||||||
|
transcript = await transcripts_controller.get_by_id(transcript_id)
|
||||||
|
if transcript is None:
|
||||||
|
raise ValueError(f"Transcript {transcript_id} not found in database")
|
||||||
|
message_id = await post_transcript_to_zulip(transcript)
|
||||||
|
return message_id
|
||||||
|
finally:
|
||||||
|
await db.disconnect()
|
||||||
|
_database_context.set(None)
|
||||||
|
|
||||||
|
try:
|
||||||
|
message_id = asyncio.run(_process())
|
||||||
|
task_result.status = TaskResultStatus.COMPLETED
|
||||||
|
task_result.output_data = {
|
||||||
|
"message_id": str(message_id) if message_id else None
|
||||||
|
}
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] post_zulip complete",
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
message_id=message_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "post_zulip", "completed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("[Worker] post_zulip failed", error=str(e), exc_info=True)
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = str(e)
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "post_zulip", "failed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
return task_result
|
||||||

server/reflector/conductor/workers/send_webhook.py (new file)
@@ -0,0 +1,115 @@
"""Conductor worker: send_webhook - Send transcript completion webhook."""
|
||||||
|
|
||||||
|
from conductor.client.http.models import Task, TaskResult
|
||||||
|
from conductor.client.http.models.task_result_status import TaskResultStatus
|
||||||
|
from conductor.client.worker.worker_task import worker_task
|
||||||
|
from reflector.conductor.progress import emit_progress
|
||||||
|
from reflector.logger import logger
|
||||||
|
|
||||||
|
|
||||||
|
@worker_task(task_definition_name="send_webhook")
|
||||||
|
def send_webhook(task: Task) -> TaskResult:
|
||||||
|
"""Send the transcript completion webhook to the configured URL.
|
||||||
|
|
||||||
|
Input:
|
||||||
|
transcript_id: str - Transcript ID
|
||||||
|
room_id: str - Room ID
|
||||||
|
|
||||||
|
Output:
|
||||||
|
sent: bool - Whether webhook was sent
|
||||||
|
status_code: int | None - HTTP status code
|
||||||
|
"""
|
||||||
|
transcript_id = task.input_data.get("transcript_id")
|
||||||
|
room_id = task.input_data.get("room_id")
|
||||||
|
|
||||||
|
logger.info("[Worker] send_webhook", transcript_id=transcript_id, room_id=room_id)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "send_webhook", "in_progress", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
task_result = TaskResult(
|
||||||
|
task_id=task.task_id,
|
||||||
|
workflow_instance_id=task.workflow_instance_id,
|
||||||
|
worker_id=task.worker_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not transcript_id:
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = "Missing transcript_id"
|
||||||
|
return task_result
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
|
||||||
|
async def _process():
|
||||||
|
import databases
|
||||||
|
|
||||||
|
from reflector.db import _database_context
|
||||||
|
from reflector.db.rooms import rooms_controller
|
||||||
|
from reflector.db.transcripts import transcripts_controller
|
||||||
|
from reflector.settings import settings
|
||||||
|
from reflector.worker.webhook import send_transcript_webhook
|
||||||
|
|
||||||
|
# Create fresh database connection for subprocess (not shared from parent)
|
||||||
|
_database_context.set(None)
|
||||||
|
db = databases.Database(settings.DATABASE_URL)
|
||||||
|
_database_context.set(db)
|
||||||
|
await db.connect()
|
||||||
|
|
||||||
|
try:
|
||||||
|
transcript = await transcripts_controller.get_by_id(transcript_id)
|
||||||
|
if transcript is None:
|
||||||
|
raise ValueError(f"Transcript {transcript_id} not found in database")
|
||||||
|
|
||||||
|
# Get room for webhook URL
|
||||||
|
room = None
|
||||||
|
if room_id:
|
||||||
|
try:
|
||||||
|
room = await rooms_controller.get_by_id(room_id)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
if not room or not room.webhook_url:
|
||||||
|
logger.info(
|
||||||
|
"[Worker] send_webhook: No webhook URL configured",
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
)
|
||||||
|
return False, None
|
||||||
|
|
||||||
|
status_code = await send_transcript_webhook(transcript, room)
|
||||||
|
return True, status_code
|
||||||
|
finally:
|
||||||
|
await db.disconnect()
|
||||||
|
_database_context.set(None)
|
||||||
|
|
||||||
|
try:
|
||||||
|
sent, status_code = asyncio.run(_process())
|
||||||
|
task_result.status = TaskResultStatus.COMPLETED
|
||||||
|
task_result.output_data = {
|
||||||
|
"sent": sent,
|
||||||
|
"status_code": status_code,
|
||||||
|
}
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] send_webhook complete",
|
||||||
|
transcript_id=transcript_id,
|
||||||
|
sent=sent,
|
||||||
|
status_code=status_code,
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "send_webhook", "completed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("[Worker] send_webhook failed", error=str(e), exc_info=True)
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = str(e)
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "send_webhook", "failed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
return task_result
|
||||||

server/reflector/conductor/workers/transcribe_track.py (new file)
@@ -0,0 +1,96 @@
"""Conductor worker: transcribe_track - Transcribe audio track using GPU service."""
|
||||||
|
|
||||||
|
from conductor.client.http.models import Task, TaskResult
|
||||||
|
from conductor.client.http.models.task_result_status import TaskResultStatus
|
||||||
|
from conductor.client.worker.worker_task import worker_task
|
||||||
|
from reflector.conductor.progress import emit_progress
|
||||||
|
from reflector.logger import logger
|
||||||
|
|
||||||
|
|
||||||
|
@worker_task(task_definition_name="transcribe_track")
|
||||||
|
def transcribe_track(task: Task) -> TaskResult:
|
||||||
|
"""Transcribe audio track using GPU (Modal.com) or local Whisper.
|
||||||
|
|
||||||
|
Input:
|
||||||
|
track_index: int - Index of the track
|
||||||
|
audio_url: str - Presigned URL of the audio file
|
||||||
|
language: str - Language code (default: "en")
|
||||||
|
transcript_id: str - Transcript ID for progress tracking
|
||||||
|
|
||||||
|
Output:
|
||||||
|
words: list[dict] - List of transcribed words with timestamps and speaker
|
||||||
|
track_index: int - Track index (echoed back)
|
||||||
|
"""
|
||||||
|
track_index = task.input_data.get("track_index", 0)
|
||||||
|
audio_url = task.input_data.get("audio_url")
|
||||||
|
language = task.input_data.get("language", "en")
|
||||||
|
transcript_id = task.input_data.get("transcript_id")
|
||||||
|
|
||||||
|
logger.info("[Worker] transcribe_track", track_index=track_index, language=language)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "transcribe_track", "in_progress", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
task_result = TaskResult(
|
||||||
|
task_id=task.task_id,
|
||||||
|
workflow_instance_id=task.workflow_instance_id,
|
||||||
|
worker_id=task.worker_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not audio_url:
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = "Missing audio_url"
|
||||||
|
return task_result
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
|
||||||
|
async def _process():
|
||||||
|
from reflector.pipelines.transcription_helpers import (
|
||||||
|
transcribe_file_with_processor,
|
||||||
|
)
|
||||||
|
|
||||||
|
transcript = await transcribe_file_with_processor(audio_url, language)
|
||||||
|
|
||||||
|
# Tag all words with speaker index
|
||||||
|
words = []
|
||||||
|
for word in transcript.words:
|
||||||
|
word_dict = word.model_dump()
|
||||||
|
word_dict["speaker"] = track_index
|
||||||
|
words.append(word_dict)
|
||||||
|
|
||||||
|
return words
|
||||||
|
|
||||||
|
try:
|
||||||
|
words = asyncio.run(_process())
|
||||||
|
task_result.status = TaskResultStatus.COMPLETED
|
||||||
|
task_result.output_data = {
|
||||||
|
"words": words,
|
||||||
|
"track_index": track_index,
|
||||||
|
}
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"[Worker] transcribe_track complete",
|
||||||
|
track_index=track_index,
|
||||||
|
word_count=len(words),
|
||||||
|
)
|
||||||
|
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id,
|
||||||
|
"transcribe_track",
|
||||||
|
"completed",
|
||||||
|
task.workflow_instance_id,
|
||||||
|
)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error("[Worker] transcribe_track failed", error=str(e), exc_info=True)
|
||||||
|
task_result.status = TaskResultStatus.FAILED
|
||||||
|
task_result.reason_for_incompletion = str(e)
|
||||||
|
if transcript_id:
|
||||||
|
emit_progress(
|
||||||
|
transcript_id, "transcribe_track", "failed", task.workflow_instance_id
|
||||||
|
)
|
||||||
|
|
||||||
|
return task_result
|
||||||

server/reflector/conductor/workflows/diarization_pipeline.json (new file)
@@ -0,0 +1,205 @@
{
  "name": "diarization_pipeline",
  "description": "Reflector multitrack diarization pipeline",
  "version": 1,
  "schemaVersion": 2,
  "inputParameters": [
    "recording_id",
    "room_name",
    "tracks",
    "bucket_name",
    "transcript_id",
    "room_id"
  ],
  "tasks": [
    {
      "name": "get_recording",
      "taskReferenceName": "get_recording",
      "type": "SIMPLE",
      "inputParameters": {
        "recording_id": "${workflow.input.recording_id}",
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "get_participants",
      "taskReferenceName": "get_participants",
      "type": "SIMPLE",
      "inputParameters": {
        "mtg_session_id": "${get_recording.output.mtg_session_id}",
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "generate_dynamic_fork_tasks",
      "taskReferenceName": "generate_padding_tasks",
      "type": "SIMPLE",
      "inputParameters": {
        "tracks": "${workflow.input.tracks}",
        "task_type": "pad_track",
        "transcript_id": "${workflow.input.transcript_id}",
        "bucket_name": "${workflow.input.bucket_name}"
      }
    },
    {
      "name": "fork_track_padding",
      "taskReferenceName": "fork_track_padding",
      "type": "FORK_JOIN_DYNAMIC",
      "inputParameters": {
        "dynamicTasks": "${generate_padding_tasks.output.tasks}",
        "dynamicTasksInput": "${generate_padding_tasks.output.inputs}"
      },
      "dynamicForkTasksParam": "dynamicTasks",
      "dynamicForkTasksInputParamName": "dynamicTasksInput"
    },
    {
      "name": "join_padding",
      "taskReferenceName": "join_padding",
      "type": "JOIN"
    },
    {
      "name": "mixdown_tracks",
      "taskReferenceName": "mixdown_tracks",
      "type": "SIMPLE",
      "inputParameters": {
        "padded_urls": "${join_padding.output..padded_url}",
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "generate_waveform",
      "taskReferenceName": "generate_waveform",
      "type": "SIMPLE",
      "inputParameters": {
        "audio_key": "${mixdown_tracks.output.audio_key}",
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "generate_dynamic_fork_tasks",
      "taskReferenceName": "generate_transcription_tasks",
      "type": "SIMPLE",
      "inputParameters": {
        "tracks": "${workflow.input.tracks}",
        "task_type": "transcribe_track",
        "transcript_id": "${workflow.input.transcript_id}",
        "padded_urls": "${join_padding.output}"
      }
    },
    {
      "name": "fork_transcription",
      "taskReferenceName": "fork_transcription",
      "type": "FORK_JOIN_DYNAMIC",
      "inputParameters": {
        "dynamicTasks": "${generate_transcription_tasks.output.tasks}",
        "dynamicTasksInput": "${generate_transcription_tasks.output.inputs}"
      },
      "dynamicForkTasksParam": "dynamicTasks",
      "dynamicForkTasksInputParamName": "dynamicTasksInput"
    },
    {
      "name": "join_transcription",
      "taskReferenceName": "join_transcription",
      "type": "JOIN"
    },
    {
      "name": "merge_transcripts",
      "taskReferenceName": "merge_transcripts",
      "type": "SIMPLE",
      "inputParameters": {
        "transcripts": "${join_transcription.output}",
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "detect_topics",
      "taskReferenceName": "detect_topics",
      "type": "SIMPLE",
      "inputParameters": {
        "words": "${merge_transcripts.output.all_words}",
        "transcript_id": "${workflow.input.transcript_id}",
        "target_language": "en"
      }
    },
    {
      "name": "fork_generation",
      "taskReferenceName": "fork_generation",
      "type": "FORK_JOIN",
      "forkTasks": [
        [
          {
            "name": "generate_title",
            "taskReferenceName": "generate_title",
            "type": "SIMPLE",
            "inputParameters": {
              "topics": "${detect_topics.output.topics}",
              "transcript_id": "${workflow.input.transcript_id}"
            }
          }
        ],
        [
          {
            "name": "generate_summary",
            "taskReferenceName": "generate_summary",
            "type": "SIMPLE",
            "inputParameters": {
              "words": "${merge_transcripts.output.all_words}",
              "topics": "${detect_topics.output.topics}",
              "transcript_id": "${workflow.input.transcript_id}"
            }
          }
        ]
      ]
    },
    {
      "name": "join_generation",
      "taskReferenceName": "join_generation",
      "type": "JOIN",
      "joinOn": ["generate_title", "generate_summary"]
    },
    {
      "name": "finalize",
      "taskReferenceName": "finalize",
      "type": "SIMPLE",
      "inputParameters": {
        "transcript_id": "${workflow.input.transcript_id}",
        "title": "${generate_title.output.title}",
        "summary": "${generate_summary.output.summary}",
        "short_summary": "${generate_summary.output.short_summary}",
        "duration": "${mixdown_tracks.output.duration}"
      }
    },
    {
      "name": "cleanup_consent",
      "taskReferenceName": "cleanup_consent",
      "type": "SIMPLE",
      "inputParameters": {
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "post_zulip",
      "taskReferenceName": "post_zulip",
      "type": "SIMPLE",
      "inputParameters": {
        "transcript_id": "${workflow.input.transcript_id}"
      }
    },
    {
      "name": "send_webhook",
      "taskReferenceName": "send_webhook",
      "type": "SIMPLE",
      "inputParameters": {
        "transcript_id": "${workflow.input.transcript_id}",
        "room_id": "${workflow.input.room_id}"
      }
    }
  ],
  "outputParameters": {
    "transcript_id": "${workflow.input.transcript_id}",
    "title": "${generate_title.output.title}",
    "summary": "${generate_summary.output.summary}",
    "duration": "${mixdown_tracks.output.duration}",
    "word_count": "${merge_transcripts.output.word_count}"
  }
}
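
The workflow definition above is purely declarative; nothing runs until an execution is started with inputs matching `inputParameters`. As a rough sketch (not part of this commit), a start call mirroring the dispatch code later in this change could look like the following, where every literal value is a placeholder:

```python
from reflector.conductor.client import ConductorClientManager

# Placeholder values; "tracks" carries one {"s3_key": ...} entry per raw
# Daily.co audio track, matching the dispatch code elsewhere in this commit.
workflow_id = ConductorClientManager.start_workflow(
    name="diarization_pipeline",
    version=1,
    input_data={
        "recording_id": "rec_example",
        "room_name": "daily-room-example",
        "tracks": [{"s3_key": "recordings/track-0.webm"}],
        "bucket_name": "daily-recordings-bucket",
        "transcript_id": "tr_example",
        "room_id": "room_example",
    },
)
# The returned workflow_id is what gets stored on the Recording row and used
# to locate the execution in the Conductor UI.
```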

server/reflector/conductor/workflows/register.py (new file)
@@ -0,0 +1,74 @@
"""Register workflow definition with Conductor server."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
from reflector.logger import logger
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
|
||||||
|
def register_workflow() -> None:
|
||||||
|
"""Register the diarization pipeline workflow with Conductor server.
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
httpx.HTTPStatusError: If registration fails.
|
||||||
|
"""
|
||||||
|
workflow_path = Path(__file__).parent / "diarization_pipeline.json"
|
||||||
|
|
||||||
|
with open(workflow_path) as f:
|
||||||
|
workflow = json.load(f)
|
||||||
|
|
||||||
|
base_url = settings.CONDUCTOR_SERVER_URL.rstrip("/")
|
||||||
|
url = f"{base_url}/metadata/workflow"
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"Registering workflow",
|
||||||
|
name=workflow["name"],
|
||||||
|
version=workflow["version"],
|
||||||
|
url=url,
|
||||||
|
)
|
||||||
|
|
||||||
|
with httpx.Client(timeout=30.0) as client:
|
||||||
|
resp = client.put(
|
||||||
|
url,
|
||||||
|
json=[workflow],
|
||||||
|
headers={"Content-Type": "application/json"},
|
||||||
|
)
|
||||||
|
resp.raise_for_status()
|
||||||
|
|
||||||
|
logger.info("Workflow registered successfully", name=workflow["name"])
|
||||||
|
|
||||||
|
|
||||||
|
async def register_workflow_async() -> None:
|
||||||
|
"""Async version of register_workflow."""
|
||||||
|
workflow_path = Path(__file__).parent / "diarization_pipeline.json"
|
||||||
|
|
||||||
|
with open(workflow_path) as f:
|
||||||
|
workflow = json.load(f)
|
||||||
|
|
||||||
|
base_url = settings.CONDUCTOR_SERVER_URL.rstrip("/")
|
||||||
|
url = f"{base_url}/metadata/workflow"
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"Registering workflow",
|
||||||
|
name=workflow["name"],
|
||||||
|
version=workflow["version"],
|
||||||
|
url=url,
|
||||||
|
)
|
||||||
|
|
||||||
|
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||||
|
resp = await client.put(
|
||||||
|
url,
|
||||||
|
json=[workflow],
|
||||||
|
headers={"Content-Type": "application/json"},
|
||||||
|
)
|
||||||
|
resp.raise_for_status()
|
||||||
|
|
||||||
|
logger.info("Workflow registered successfully", name=workflow["name"])
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
register_workflow()
|
||||||
|
print("Workflow registration complete!")
|
||||||
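
Note (not part of the diff): the module can be re-run after editing `diarization_pipeline.json`, for example with `uv run python -m reflector.conductor.workflows.register` (module path assumed from the file location above); the `PUT {CONDUCTOR_SERVER_URL}/metadata/workflow` call it makes should act as an update-or-create on the Conductor side, so repeated registration is expected to be safe.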

@@ -22,7 +22,9 @@ recordings = sa.Table(
     ),
     sa.Column("meeting_id", sa.String),
     sa.Column("track_keys", sa.JSON, nullable=True),
+    sa.Column("workflow_id", sa.String, nullable=True),
     sa.Index("idx_recording_meeting_id", "meeting_id"),
+    sa.Index("idx_recording_workflow_id", "workflow_id"),
 )

@@ -38,6 +40,8 @@ class Recording(BaseModel):
     # track_keys can be empty list [] if recording finished but no audio was captured (silence/muted)
     # None means not a multitrack recording, [] means multitrack with no tracks
     track_keys: list[str] | None = None
+    # Conductor workflow ID for tracking pipeline execution
+    workflow_id: str | None = None
 
     @property
     def is_multitrack(self) -> bool:

@@ -181,6 +181,16 @@ class TranscriptEvent(BaseModel):
     data: dict
 
 
+class PipelineProgressData(BaseModel):
+    """Data payload for PIPELINE_PROGRESS WebSocket events."""
+
+    workflow_id: str | None = None
+    current_step: str
+    step_index: int
+    total_steps: int
+    step_status: Literal["pending", "in_progress", "completed", "failed"]
+
+
 class TranscriptParticipant(BaseModel):
     model_config = ConfigDict(from_attributes=True)
     id: str = Field(default_factory=generate_uuid4)
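
For illustration only (not part of the diff), this is the shape of a payload a frontend would receive inside a PIPELINE_PROGRESS event, matching the "step 5 of 9" style of progress described earlier; the step name and counts are made-up examples, and serialization assumes the pydantic v2 API already used in this codebase:

```python
progress = PipelineProgressData(
    workflow_id="wf-abc123",          # hypothetical workflow instance ID
    current_step="transcribe_track",  # task reference name from the workflow
    step_index=5,
    total_steps=9,
    step_status="in_progress",
)
# JSON body a frontend would receive for this event
print(progress.model_dump_json())
```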

@@ -12,12 +12,15 @@ from typing import Literal, Union, assert_never
 import celery
 from celery.result import AsyncResult
 
+from reflector.conductor.client import ConductorClientManager
 from reflector.db.recordings import recordings_controller
 from reflector.db.transcripts import Transcript
+from reflector.logger import logger
 from reflector.pipelines.main_file_pipeline import task_pipeline_file_process
 from reflector.pipelines.main_multitrack_pipeline import (
     task_pipeline_multitrack_process,
 )
+from reflector.settings import settings
 from reflector.utils.string import NonEmptyString

@@ -37,6 +40,8 @@ class MultitrackProcessingConfig:
     transcript_id: NonEmptyString
     bucket_name: NonEmptyString
     track_keys: list[str]
+    recording_id: NonEmptyString | None = None
+    room_id: NonEmptyString | None = None
     mode: Literal["multitrack"] = "multitrack"

@@ -110,12 +115,15 @@ async def validate_transcript_for_processing(
     )
 
 
-async def prepare_transcript_processing(validation: ValidationOk) -> PrepareResult:
+async def prepare_transcript_processing(
+    validation: ValidationOk, room_id: str | None = None
+) -> PrepareResult:
     """
     Determine processing mode from transcript/recording data.
     """
     bucket_name: str | None = None
     track_keys: list[str] | None = None
+    recording_id: str | None = validation.recording_id
 
     if validation.recording_id:
         recording = await recordings_controller.get_by_id(validation.recording_id)

@@ -137,6 +145,8 @@ async def prepare_transcript_processing(validation: ValidationOk) -> PrepareResu
             bucket_name=bucket_name,  # type: ignore (validated above)
             track_keys=track_keys,
             transcript_id=validation.transcript_id,
+            recording_id=recording_id,
+            room_id=room_id,
         )
 
     return FileProcessingConfig(

@@ -144,8 +154,32 @@ async def prepare_transcript_processing(validation: ValidationOk) -> PrepareResu
     )
 
 
-def dispatch_transcript_processing(config: ProcessingConfig) -> AsyncResult:
+def dispatch_transcript_processing(config: ProcessingConfig) -> AsyncResult | None:
     if isinstance(config, MultitrackProcessingConfig):
+        # Start Conductor workflow if enabled
+        if settings.CONDUCTOR_ENABLED:
+            workflow_id = ConductorClientManager.start_workflow(
+                name="diarization_pipeline",
+                version=1,
+                input_data={
+                    "recording_id": config.recording_id,
+                    "room_name": None,  # Not available in reprocess path
+                    "tracks": [{"s3_key": k} for k in config.track_keys],
+                    "bucket_name": config.bucket_name,
+                    "transcript_id": config.transcript_id,
+                    "room_id": config.room_id,
+                },
+            )
+            logger.info(
+                "Started Conductor workflow (reprocess)",
+                workflow_id=workflow_id,
+                transcript_id=config.transcript_id,
+            )
+
+            if not settings.CONDUCTOR_SHADOW_MODE:
+                return None  # Conductor-only, no Celery result
+
+        # Celery pipeline (shadow mode or Conductor disabled)
         return task_pipeline_multitrack_process.delay(
             transcript_id=config.transcript_id,
             bucket_name=config.bucket_name,

@@ -150,5 +150,11 @@ class Settings(BaseSettings):
     ZULIP_API_KEY: str | None = None
     ZULIP_BOT_EMAIL: str | None = None
 
+    # Conductor workflow orchestration
+    CONDUCTOR_SERVER_URL: str = "http://conductor:8080/api"
+    CONDUCTOR_DEBUG: bool = False
+    CONDUCTOR_ENABLED: bool = False
+    CONDUCTOR_SHADOW_MODE: bool = False
+
 
 settings = Settings()
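
Taken together with the dispatch changes in this commit, these two flags give three rollout modes:

- `CONDUCTOR_ENABLED=false` (default): Celery-only, current behavior.
- `CONDUCTOR_ENABLED=true` with `CONDUCTOR_SHADOW_MODE=true`: the Conductor workflow is started and the Celery pipeline still runs, so the two paths can be compared side by side.
- `CONDUCTOR_ENABLED=true` with `CONDUCTOR_SHADOW_MODE=false`: Conductor only; the Celery dispatch is skipped.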

@@ -37,7 +37,7 @@ async def process_transcript_inner(
 ) -> AsyncResult:
     validation = await validate_transcript_for_processing(transcript)
     on_validation(validation)
-    config = await prepare_transcript_processing(validation)
+    config = await prepare_transcript_processing(validation, room_id=transcript.room_id)
     on_preprocess(config)
     return dispatch_transcript_processing(config)

server/reflector/views/conductor.py (new file)
@@ -0,0 +1,45 @@
"""Conductor health and status endpoints."""
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from fastapi import APIRouter
|
||||||
|
|
||||||
|
from reflector.settings import settings
|
||||||
|
|
||||||
|
router = APIRouter(prefix="/conductor", tags=["conductor"])
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/health")
|
||||||
|
async def conductor_health():
|
||||||
|
"""Check Conductor server connectivity and status."""
|
||||||
|
if not settings.CONDUCTOR_ENABLED:
|
||||||
|
return {"status": "disabled", "connected": False}
|
||||||
|
|
||||||
|
# Extract base URL (remove /api suffix for health check)
|
||||||
|
base_url = settings.CONDUCTOR_SERVER_URL.rstrip("/api").rstrip("/")
|
||||||
|
health_url = f"{base_url}/health"
|
||||||
|
|
||||||
|
try:
|
||||||
|
async with httpx.AsyncClient(timeout=5.0) as client:
|
||||||
|
resp = await client.get(health_url)
|
||||||
|
if resp.status_code == 200:
|
||||||
|
return {"status": "healthy", "connected": True}
|
||||||
|
else:
|
||||||
|
return {
|
||||||
|
"status": "unhealthy",
|
||||||
|
"connected": True,
|
||||||
|
"error": f"Health check returned {resp.status_code}",
|
||||||
|
}
|
||||||
|
except httpx.TimeoutException:
|
||||||
|
return {
|
||||||
|
"status": "unhealthy",
|
||||||
|
"connected": False,
|
||||||
|
"error": "Connection timeout",
|
||||||
|
}
|
||||||
|
except httpx.ConnectError as e:
|
||||||
|
return {
|
||||||
|
"status": "unhealthy",
|
||||||
|
"connected": False,
|
||||||
|
"error": f"Connection failed: {e}",
|
||||||
|
}
|
||||||
|
except Exception as e:
|
||||||
|
return {"status": "unhealthy", "connected": False, "error": str(e)}
|
||||||
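
A quick way to exercise the endpoint, for illustration only (the URL prefix depends on how this router is mounted in the FastAPI app, which is outside this diff):

```python
import httpx

# Assumes the router is mounted at the API root of a locally running server.
resp = httpx.get("http://localhost:8000/conductor/health", timeout=5.0)
print(resp.json())
# e.g. {"status": "disabled", "connected": False} when CONDUCTOR_ENABLED is off
```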

@@ -45,7 +45,7 @@ async def transcript_process(
     else:
         assert_never(validation)
 
-    config = await prepare_transcript_processing(validation)
+    config = await prepare_transcript_processing(validation, room_id=transcript.room_id)
 
     if isinstance(config, ProcessError):
         raise HTTPException(status_code=500, detail=config.detail)

@@ -286,6 +286,35 @@ async def _process_multitrack_recording_inner(
         room_id=room.id,
     )
 
+    # Start Conductor workflow if enabled
+    if settings.CONDUCTOR_ENABLED:
+        from reflector.conductor.client import ConductorClientManager  # noqa: PLC0415
+
+        workflow_id = ConductorClientManager.start_workflow(
+            name="diarization_pipeline",
+            version=1,
+            input_data={
+                "recording_id": recording_id,
+                "room_name": daily_room_name,
+                "tracks": [{"s3_key": k} for k in filter_cam_audio_tracks(track_keys)],
+                "bucket_name": bucket_name,
+                "transcript_id": transcript.id,
+                "room_id": room.id,
+            },
+        )
+        logger.info(
+            "Started Conductor workflow",
+            workflow_id=workflow_id,
+            transcript_id=transcript.id,
+        )
+
+        # Store workflow_id on recording for status tracking
+        await recordings_controller.update(recording, {"workflow_id": workflow_id})
+
+        if not settings.CONDUCTOR_SHADOW_MODE:
+            return  # Don't trigger Celery
+
+    # Celery pipeline (runs when Conductor disabled OR in shadow mode)
     task_pipeline_multitrack_process.delay(
         transcript_id=transcript.id,
         bucket_name=bucket_name,
@@ -7,6 +7,8 @@ elif [ "${ENTRYPOINT}" = "worker" ]; then
|
|||||||
uv run celery -A reflector.worker.app worker --loglevel=info
|
uv run celery -A reflector.worker.app worker --loglevel=info
|
||||||
elif [ "${ENTRYPOINT}" = "beat" ]; then
|
elif [ "${ENTRYPOINT}" = "beat" ]; then
|
||||||
uv run celery -A reflector.worker.app beat --loglevel=info
|
uv run celery -A reflector.worker.app beat --loglevel=info
|
||||||
|
elif [ "${ENTRYPOINT}" = "conductor-worker" ]; then
|
||||||
|
uv run python -m reflector.conductor.run_workers
|
||||||
else
|
else
|
||||||
echo "Unknown command"
|
echo "Unknown command"
|
||||||
fi
|
fi
|
||||||
|
|||||||
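
With this change, a deployment can run a dedicated container with `ENTRYPOINT=conductor-worker` that polls Conductor task queues via `reflector.conductor.run_workers`, alongside the existing `worker` (Celery) and `beat` containers.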

server/uv.lock (generated, 3379 lines): diff suppressed because it is too large.