# Conductor OSS Migration - LLM Debugging Observations

This document captures hard-won debugging insights from migrating the multitrack diarization pipeline from Celery to Conductor OSS. These observations are particularly relevant for LLM assistants working on this codebase.

## Architecture Context

- **Conductor Python SDK** uses multiprocessing: 1 parent process spawns 15 `TaskRunner` subprocesses (startup pattern sketched below)
- Each task type gets its own subprocess that polls the Conductor server
- Workers are identified by container hostname (e.g., `595f5ddc9711`)
- Shadow mode (`CONDUCTOR_SHADOW_MODE=true`) runs both Celery and Conductor in parallel
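
For orientation, here is a minimal sketch of that entrypoint pattern. The server URL (taken from the curl examples below) and the `scan_for_annotated_workers` wiring are assumptions for illustration, not this repo's actual startup code:

```python
# Minimal sketch (assumed wiring): the TaskHandler is what forks one
# TaskRunner subprocess per registered worker/task type.
from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration

config = Configuration(server_api_url="http://localhost:8180/api")

# After start_processes(), each registered worker polls the Conductor
# server from its own subprocess.
task_handler = TaskHandler(configuration=config, scan_for_annotated_workers=True)
task_handler.start_processes()
task_handler.join_processes()
```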
---

## Challenge 1: Ghost Workers - Multiple Containers Polling Same Tasks

### Symptoms
- Tasks complete but with wrong/empty output
- Worker logs show no execution for a task that API shows as COMPLETED
- `workerId` in Conductor API doesn't match expected container

### Root Cause
Multiple containers may be running Conductor workers:
- `reflector-conductor-worker-1` (dedicated worker)
- `reflector-server-1` (if shadow mode enabled or worker code imported)

### Debugging Steps
```bash
# 1. Get the mystery worker ID from Conductor API
curl -s "http://localhost:8180/api/workflow/{id}" | jq '.tasks[] | {ref: .referenceTaskName, workerId}'

# 2. Find which container has that hostname
docker ps -a | grep {workerId}
# or
docker ps -a --format "{{.ID}} {{.Names}}" | grep {first-12-chars}

# 3. Check that container's code version
docker exec {container} cat /app/reflector/conductor/workers/{worker}.py | head -50
```

### Resolution
Restart ALL containers that might be polling Conductor tasks:
```bash
docker compose restart conductor-worker server
```

### Key Insight
**Always verify `workerId` matches your expected container.** In distributed worker setups, know ALL containers that poll for tasks.
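
A cheap guard: have each worker process log the hostname it will report as `workerId` (inside Docker this defaults to the container ID), so a mismatch with the API output is obvious in the logs. A minimal sketch using the flush-print pattern from the Debugging Workflow section:

```python
# Log the identity this process will report to Conductor as workerId.
# In a Docker container, the hostname defaults to the container ID.
import socket

print(f"[DEBUG] polling as workerId={socket.gethostname()}", flush=True)
```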
---

## Challenge 2: Multiprocessing + AsyncIO + Database Conflicts

### Symptoms
```
InterfaceError: cannot perform operation: another operation is in progress
RuntimeError: Task <Task pending...> running at /app/.../worker.py
```

### Root Cause
The Conductor Python SDK forks subprocesses. When a subprocess calls `asyncio.run()`:
1. A new event loop is created
2. But `get_database()` returns the cached connection from the parent process context
3. The parent's connection is incompatible with the child's event loop

### Resolution
Reset the context and create a fresh connection in each subprocess:
```python
async def _process():
    import databases
    from reflector.db import _database_context
    from reflector.settings import settings

    # Reset context var - don't inherit from parent
    _database_context.set(None)
    db = databases.Database(settings.DATABASE_URL)
    _database_context.set(db)
    await db.connect()

    # ... rest of async code
```

### Key Insight
**Any singleton/cached resource (DB connections, S3 clients, HTTP sessions) must be recreated AFTER fork.** Never trust inherited state in multiprocessing workers.
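
A generic way to enforce this when the fork start method is in use, sketched with a hypothetical `_cached_clients` registry standing in for whatever singletons a process caches:

```python
import os

# Hypothetical registry standing in for cached singletons (S3 clients,
# HTTP sessions, DB handles) that must not survive a fork.
_cached_clients: dict = {}


def _reset_singletons_after_fork() -> None:
    # Runs in the child immediately after fork: drop anything inherited from
    # the parent so the next access rebuilds it against the child's own
    # event loop and sockets.
    _cached_clients.clear()


os.register_at_fork(after_in_child=_reset_singletons_after_fork)
```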
### TODO: The Real Problem with get_database()

**Current solution is a hack.** The issue runs deeper than the multiprocessing fork:

#### What's Actually Happening
1. Each Conductor subprocess calls `asyncio.run(_process())` repeatedly, once per task
2. First `asyncio.run()`: creates a DB connection, stores it in a ContextVar
3. First task completes, `asyncio.run()` exits, **event loop destroyed**
4. **But**: the ContextVar still holds the connection reference (ContextVars persist across `asyncio.run()` calls)
5. Second `asyncio.run()`: creates a **new event loop**
6. Code tries to use the **old connection** (from the ContextVar) with the **new event loop**
7. Error: "another operation is in progress"

**Root issue**: `get_database()` as a global singleton is incompatible with repeated `asyncio.run()` calls in the same process.

#### Option 1: Explicit Connection Lifecycle (cleanest)
```python
async def _process():
    import databases
    from reflector.settings import settings

    # Don't use get_database() - create explicit connection
    db = databases.Database(settings.DATABASE_URL)

    try:
        await db.connect()

        # Problem: transcripts_controller.get_by_id() uses get_database() internally
        # Would need to refactor controllers to accept db parameter
        # e.g., await transcripts_controller.get_by_id(transcript_id, db=db)

    finally:
        await db.disconnect()
```

**Pros**: Clean separation, explicit lifecycle
**Cons**: Requires refactoring all controller methods to accept a `db` parameter

#### Option 2: Reset ContextVar Properly (pragmatic)
```python
async def _process():
    from reflector.db import _database_context, get_database

    # Ensure fresh connection per task
    old_db = _database_context.get()
    if old_db and old_db.is_connected:
        await old_db.disconnect()
    _database_context.set(None)

    # Now get_database() will create fresh connection
    db = get_database()
    await db.connect()

    try:
        ...  # work ...
    finally:
        await db.disconnect()
        _database_context.set(None)
```

**Pros**: Works with existing controller code
**Cons**: Still manipulating globals, cleanup needed in every worker
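
The per-worker cleanup can be centralized in a small helper so each worker only wraps its body. This is a sketch assembled from the Option 2 code above, not an existing utility in the repo:

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def fresh_database():
    """Yield a DB connection guaranteed to belong to the current event loop."""
    from reflector.db import _database_context, get_database

    # Same steps as Option 2: drop any inherited/stale connection first
    old_db = _database_context.get()
    if old_db and old_db.is_connected:
        await old_db.disconnect()
    _database_context.set(None)

    db = get_database()
    await db.connect()
    try:
        yield db
    finally:
        await db.disconnect()
        _database_context.set(None)


# Usage inside a worker's _process():
#     async with fresh_database():
#         transcript = await transcripts_controller.get_by_id(transcript_id)
```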
#### Option 3: Fix get_database() Itself (best long-term)
```python
# In reflector/db/__init__.py
def get_database() -> databases.Database:
    """Get database instance for current event loop"""
    import asyncio

    db = _database_context.get()

    # Check if connection is valid for current event loop
    if db is not None:
        try:
            loop = asyncio.get_running_loop()
            # If connection's event loop differs, it's stale
            if db._connection and hasattr(db._connection, '_loop'):
                if db._connection._loop != loop:
                    # Stale connection from old event loop
                    db = None
        except RuntimeError:
            # No running loop
            pass

    if db is None:
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)

    return db
```

**Pros**: Fixes root cause, no changes needed in workers
**Cons**: Relies on implementation details of the `databases` library

#### Recommendation
- **Short-term**: Option 2 (explicit cleanup in workers that need DB)
- **Long-term**: Option 1 (refactor to dependency injection) is the only architecturally clean solution

---

## Challenge 3: Type Mismatches Across Serialization Boundary

### Symptoms
```
ValidationError: 1 validation error for TranscriptTopic
transcript
  Input should be a valid string [type=string_type, input_value={'translation': None, 'words': [...]}]
```

### Root Cause
Conductor JSON-serializes all task inputs/outputs. Complex Pydantic models get serialized to dicts:
- `TitleSummary.transcript: Transcript` becomes `{"translation": null, "words": [...]}`
- Next task expects `TranscriptTopic.transcript: str`

### Resolution
Explicitly reconstruct types when deserializing:
```python
from reflector.processors.types import TitleSummary, Transcript as TranscriptType, Word

def normalize_topic(t):
    topic = dict(t)
    transcript_data = topic.get("transcript")
    if isinstance(transcript_data, dict):
        words_list = transcript_data.get("words", [])
        word_objects = [Word(**w) for w in words_list]
        topic["transcript"] = TranscriptType(
            words=word_objects,
            translation=transcript_data.get("translation")
        )
    return topic

topic_objects = [TitleSummary(**normalize_topic(t)) for t in topics]
```

### Key Insight
**Conductor task I/O is always JSON.** Design workers to handle dict inputs and reconstruct domain objects explicitly.
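
The same rule applies on the producing side: return plain JSON-safe dicts from workers, not model instances. A minimal sketch, assuming these are Pydantic v2 models (the `string_type` error format above suggests they are) and using an illustrative helper name:

```python
from reflector.processors.types import TitleSummary


def to_task_output(summary: TitleSummary) -> dict:
    # mode="json" converts nested models (Transcript, Word) and other
    # non-JSON types into plain values that survive Conductor's round-trip.
    return summary.model_dump(mode="json")
```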
---

## Challenge 4: Conductor Health Check Failures

### Symptoms
```
dependency failed to start: container reflector-conductor-1 is unhealthy
```

### Root Cause
The Conductor OSS standalone container's health endpoint can be slow/flaky, especially during startup or under load.

### Resolution
Bypass the docker-compose health check dependency:
```bash
# Instead of: docker compose up -d conductor-worker
docker start reflector-conductor-worker-1
```

### Key Insight
For development, consider removing `depends_on.condition: service_healthy` or increasing the health check timeout.

---

## Challenge 5: JOIN Task Output Format

### Symptoms
`merge_transcripts` receives data but outputs `word_count: 0`

### Root Cause
FORK_JOIN_DYNAMIC's JOIN task outputs a **dict keyed by task reference names**, not an array:
```json
{
  "transcribe_track_0": {"words": [...], "track_index": 0},
  "transcribe_track_1": {"words": [...], "track_index": 1}
}
```

### Resolution
Handle both dict and array inputs:
```python
transcripts = task.input_data.get("transcripts", [])

# Handle JOIN output (dict with task refs as keys)
if isinstance(transcripts, dict):
    transcripts = list(transcripts.values())

all_words = []
for t in transcripts:
    if isinstance(t, dict) and "words" in t:
        all_words.extend(t["words"])
```

### Key Insight
**JOIN task output structure differs from FORK input.** Always log input types during debugging.
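
For example, a one-line type probe (using the flush-print pattern from the Debugging Workflow below) makes the dict-vs-list confusion visible immediately; `transcripts` here is the variable from the Resolution snippet above:

```python
# Shows whether the JOIN handed us a dict (keyed by task ref) or a list.
print(f"[DEBUG] transcripts type={type(transcripts).__name__}", flush=True)
```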
---

## Debugging Workflow

### 1. Add DEBUG Prints with Flush
Multiprocessing buffers stdout. Force immediate output:
```python
import sys

print("[DEBUG] worker entered", flush=True)
sys.stdout.flush()
```

### 2. Test Worker Functions Directly
Bypass Conductor entirely to verify logic:
```bash
docker compose exec conductor-worker uv run python -c "
from reflector.conductor.workers.merge_transcripts import merge_transcripts
from conductor.client.http.models import Task

mock_task = Task()
mock_task.input_data = {'transcripts': {...}, 'transcript_id': 'test'}
result = merge_transcripts(mock_task)
print(result.output_data)
"
```

### 3. Check Task Timing
Suspiciously fast completion (e.g., 10ms) indicates:
- Cached result from previous run
- Wrong worker processed it
- Task completed without actual execution

```bash
curl -s "http://localhost:8180/api/workflow/{id}" | \
  jq '.tasks[] | {ref: .referenceTaskName, duration: (.endTime - .startTime)}'
```

### 4. Verify Container Code Version
```bash
docker compose exec conductor-worker cat /app/reflector/conductor/workers/{file}.py | head -50
```

### 5. Use Conductor Retry API
Retry from the failed task without re-running the entire workflow:
```bash
curl -X POST "http://localhost:8180/api/workflow/{id}/retry"
```

---

## Common Gotchas Summary

| Issue | Signal | Fix |
|-------|--------|-----|
| Wrong worker | `workerId` mismatch | Restart all worker containers |
| DB conflict | "another operation in progress" | Fresh DB connection per subprocess |
| Type mismatch | Pydantic validation error | Reconstruct objects from dicts |
| No logs | Task completes but no output | Check if different container processed |
| 0 results | JOIN output format | Convert dict.values() to list |
| Health check | Compose dependency fails | Use `docker start` directly |

---

## Files Most Likely to Need Conductor-Specific Handling

- `server/reflector/conductor/workers/*.py` - All workers need multiprocessing-safe patterns (a combined skeleton is sketched below)
- `server/reflector/db/__init__.py` - Database singleton, needs context reset
- `server/reflector/conductor/workflows/*.json` - Workflow definitions, check input/output mappings
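
Tying these together, a hypothetical worker skeleton (not an existing file). The `Task`/`TaskResult` wiring follows the SDK models used in the Debugging Workflow example above; field and module names should be checked against the installed SDK version:

```python
import asyncio

from conductor.client.http.models import Task, TaskResult
from conductor.client.http.models.task_result_status import TaskResultStatus


def example_worker(task: Task) -> TaskResult:
    # Flush so output is visible despite multiprocessing buffering
    print("[DEBUG] example_worker entered", flush=True)

    async def _process() -> dict:
        import databases
        from reflector.db import _database_context
        from reflector.settings import settings

        # Fresh connection per task - never reuse the parent's (Challenge 2)
        _database_context.set(None)
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)
        await db.connect()
        try:
            payload = task.input_data.get("transcripts", [])
            if isinstance(payload, dict):
                # JOIN output is a dict keyed by task reference names (Challenge 5)
                payload = list(payload.values())
            return {"track_count": len(payload)}
        finally:
            await db.disconnect()
            _database_context.set(None)

    output = asyncio.run(_process())

    result = TaskResult(
        task_id=task.task_id,
        workflow_instance_id=task.workflow_instance_id,
        worker_id=task.worker_id,
    )
    result.status = TaskResultStatus.COMPLETED
    result.output_data = output
    return result
```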