Conductor OSS Migration - LLM Debugging Observations

This document captures hard-won debugging insights from migrating the multitrack diarization pipeline from Celery to Conductor OSS. These observations are particularly relevant for LLM assistants working on this codebase.

Architecture Context

  • Conductor Python SDK uses multiprocessing: 1 parent process spawns 15 TaskRunner subprocesses
  • Each task type gets its own subprocess that polls Conductor server
  • Workers are identified by container hostname (e.g., 595f5ddc9711)
  • Shadow mode (CONDUCTOR_SHADOW_MODE=true) runs both Celery and Conductor in parallel
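
For orientation, worker startup with the Conductor Python SDK looks roughly like this (a minimal sketch; the server URL is illustrative, and only one of the 15 task types is shown):

from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration
from conductor.client.worker.worker import Worker

from reflector.conductor.workers.merge_transcripts import merge_transcripts

config = Configuration(server_api_url="http://localhost:8180/api")

# One Worker per task type; TaskHandler spawns one polling subprocess per worker
workers = [
    Worker(task_definition_name="merge_transcripts", execute_function=merge_transcripts),
    # ... one entry per remaining task type (15 in this pipeline)
]

task_handler = TaskHandler(workers=workers, configuration=config)
task_handler.start_processes()  # forks the subprocesses that poll Conductor
task_handler.join_processes()   # block until shutdown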

Challenge 1: Ghost Workers - Multiple Containers Polling Same Tasks

Symptoms

  • Tasks complete but with wrong/empty output
  • Worker logs show no execution for a task that the API shows as COMPLETED
  • workerId in Conductor API doesn't match expected container

Root Cause

Multiple containers may be running Conductor workers:

  • reflector-conductor-worker-1 (dedicated worker)
  • reflector-server-1 (if shadow mode enabled or worker code imported)

Debugging Steps

# 1. Get the mystery worker ID from Conductor API
curl -s "http://localhost:8180/api/workflow/{id}" | jq '.tasks[] | {ref: .referenceTaskName, workerId}'

# 2. Find which container has that hostname
docker ps -a | grep {workerId}
# or
docker ps -a --format "{{.ID}} {{.Names}}" | grep {first-12-chars}

# 3. Check that container's code version
docker exec {container} cat /app/reflector/conductor/workers/{worker}.py | head -50

Resolution

Restart ALL containers that might be polling Conductor tasks:

docker compose restart conductor-worker server

Key Insight

Always verify workerId matches your expected container. In distributed worker setups, know ALL containers that poll for tasks.


Challenge 2: Multiprocessing + AsyncIO + Database Conflicts

Symptoms

InterfaceError: cannot perform operation: another operation is in progress
RuntimeError: Task <Task pending...> running at /app/.../worker.py

Root Cause

Conductor Python SDK forks subprocesses. When a subprocess calls asyncio.run():

  1. A new event loop is created
  2. But get_database() returns the cached connection from the parent process context
  3. The parent's connection is incompatible with the child's event loop

Resolution

Reset context and create fresh connection in each subprocess:

async def _process():
    import databases
    from reflector.db import _database_context
    from reflector.settings import settings

    # Reset context var - don't inherit from parent
    _database_context.set(None)
    db = databases.Database(settings.DATABASE_URL)
    _database_context.set(db)
    await db.connect()

    # ... rest of async code

Key Insight

Any singleton/cached resource (DB connections, S3 clients, HTTP sessions) must be recreated AFTER fork. Never trust inherited state in multiprocessing workers.
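
A fork-safe variant of the lazy-singleton pattern checks the PID before trusting the cache. This is an illustrative sketch, not current reflector code; it uses boto3, but the same applies to any cached client:

import os

_s3_client = None
_s3_client_pid = None

def get_s3_client():
    global _s3_client, _s3_client_pid
    pid = os.getpid()
    # A PID mismatch means we are in a forked child: the cached client
    # belongs to the parent process and must be rebuilt.
    if _s3_client is None or _s3_client_pid != pid:
        import boto3
        _s3_client = boto3.client("s3")
        _s3_client_pid = pid
    return _s3_client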

TODO: The Real Problem with get_database()

The current solution is a hack. The issue runs deeper than the multiprocessing fork:

What's Actually Happening

  1. Each Conductor subprocess calls asyncio.run(_process()) anew for each task
  2. First asyncio.run(): creates DB connection, stores in ContextVar
  3. First task completes, asyncio.run() exits, event loop destroyed
  4. But: ContextVar still holds the connection reference (ContextVars persist across asyncio.run() calls)
  5. Second asyncio.run(): Creates a new event loop
  6. Code tries to use the old connection (from ContextVar) with the new event loop
  7. Error: "another operation is in progress"

Root issue: get_database() as a global singleton is incompatible with repeated asyncio.run() calls in the same process.
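
A minimal repro of the same failure mode, using asyncio.Lock as a stand-in for the cached connection (requires Python 3.10+; illustrative, not reflector code):

import asyncio

# A module-level cache survives across asyncio.run() calls, like the ContextVar does
cache = {}

async def use_resource():
    if "lock" not in cache:
        cache["lock"] = asyncio.Lock()  # binds to the running loop on first use
    async with cache["lock"]:
        pass

asyncio.run(use_resource())  # OK: lock bound to loop #1
asyncio.run(use_resource())  # RuntimeError: ... is bound to a different event loop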

Option 1: Explicit Connection Lifecycle (cleanest)

async def _process():
    import databases
    from reflector.settings import settings

    # Don't use get_database() - create explicit connection
    db = databases.Database(settings.DATABASE_URL)

    try:
        await db.connect()

        # Problem: transcripts_controller.get_by_id() uses get_database() internally
        # Would need to refactor controllers to accept db parameter
        # e.g., await transcripts_controller.get_by_id(transcript_id, db=db)

    finally:
        await db.disconnect()

Pros: Clean separation, explicit lifecycle.
Cons: Requires refactoring all controller methods to accept a db parameter.
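
What that refactor might look like at the controller level (hypothetical signature and table name, not the current API):

import databases

async def get_by_id(transcript_id: str, db: databases.Database):
    # The connection is passed in explicitly; no get_database() singleton involved
    return await db.fetch_one(
        "SELECT * FROM transcript WHERE id = :id", values={"id": transcript_id}
    )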

Option 2: Reset ContextVar Properly (pragmatic)

async def _process():
    from reflector.db import _database_context, get_database

    # Ensure fresh connection per task
    old_db = _database_context.get()
    if old_db and old_db.is_connected:
        await old_db.disconnect()
    _database_context.set(None)

    # Now get_database() will create fresh connection
    db = get_database()
    await db.connect()

    try:
        # ... work ...
    finally:
        await db.disconnect()
        _database_context.set(None)

Pros: Works with existing controller code.
Cons: Still manipulating globals; cleanup needed in every worker.

Option 3: Fix get_database() Itself (best long-term)

# In reflector/db/__init__.py
def get_database() -> databases.Database:
    """Get database instance for current event loop"""
    import asyncio

    db = _database_context.get()

    # Check if connection is valid for current event loop
    if db is not None:
        try:
            loop = asyncio.get_running_loop()
            # If connection's event loop differs, it's stale
            if db._connection and hasattr(db._connection, '_loop'):
                if db._connection._loop != loop:
                    # Stale connection from old event loop
                    db = None
        except RuntimeError:
            # No running loop
            pass

    if db is None:
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)

    return db

Pros: Fixes the root cause; no changes needed in workers.
Cons: Relies on implementation details of the databases library.

Recommendation

  • Short-term: Option 2 (explicit cleanup in workers that need DB)
  • Long-term: Option 1 (refactor to dependency injection) is the only architecturally clean solution

Challenge 3: Type Mismatches Across Serialization Boundary

Symptoms

ValidationError: 1 validation error for TranscriptTopic
transcript
  Input should be a valid string [type=string_type, input_value={'translation': None, 'words': [...]}]

Root Cause

Conductor JSON-serializes all task inputs/outputs. Complex Pydantic models get serialized to dicts:

  • TitleSummary.transcript: Transcript becomes {"translation": null, "words": [...]}
  • Next task expects TranscriptTopic.transcript: str

Resolution

Explicitly reconstruct types when deserializing:

from reflector.processors.types import TitleSummary, Transcript as TranscriptType, Word

def normalize_topic(t):
    topic = dict(t)
    transcript_data = topic.get("transcript")
    if isinstance(transcript_data, dict):
        words_list = transcript_data.get("words", [])
        word_objects = [Word(**w) for w in words_list]
        topic["transcript"] = TranscriptType(
            words=word_objects,
            translation=transcript_data.get("translation")
        )
    return topic

topic_objects = [TitleSummary(**normalize_topic(t)) for t in topics]

Key Insight

Conductor task I/O is always JSON. Design workers to handle dict inputs and reconstruct domain objects explicitly.
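
Where the target model's declared field types already match the serialized shape, Pydantic v2 can rebuild nested models in one call (a sketch with assumed field names, and assuming the project is on Pydantic v2):

from pydantic import BaseModel

class Word(BaseModel):
    text: str
    start: float
    end: float

class Transcript(BaseModel):
    words: list[Word]
    translation: str | None = None

# Nested dicts are validated back into model instances automatically
t = Transcript.model_validate(
    {"words": [{"text": "hi", "start": 0.0, "end": 0.4}], "translation": None}
)

The manual normalize_topic() above is still needed where the declared type (str) differs from the serialized payload (dict).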


Challenge 4: Conductor Health Check Failures

Symptoms

dependency failed to start: container reflector-conductor-1 is unhealthy

Root Cause

The Conductor OSS standalone container's health endpoint can be slow or flaky, especially during startup or under load.

Resolution

Bypass docker-compose health check dependency:

# Instead of: docker compose up -d conductor-worker
docker start reflector-conductor-worker-1

Key Insight

For development, consider removing depends_on.condition: service_healthy or increasing health check timeout.
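
For example, a loosened health check in docker-compose might look like this (values and endpoint are illustrative, not the repo's actual configuration):

services:
  conductor:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 12
      start_period: 120s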


Challenge 5: JOIN Task Output Format

Symptoms

merge_transcripts receives data but outputs word_count: 0

Root Cause

FORK_JOIN_DYNAMIC's JOIN task outputs a dict keyed by task reference names, not an array:

{
  "transcribe_track_0": {"words": [...], "track_index": 0},
  "transcribe_track_1": {"words": [...], "track_index": 1}
}

Resolution

Handle both dict and array inputs:

transcripts = task.input_data.get("transcripts", [])

# Handle JOIN output (dict with task refs as keys)
if isinstance(transcripts, dict):
    transcripts = list(transcripts.values())

all_words = []
for t in transcripts:
    if isinstance(t, dict) and "words" in t:
        all_words.extend(t["words"])

Key Insight

JOIN task output structure differs from FORK input. Always log input types during debugging.
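
A cheap way to do that at the top of a worker (sketch; the key name matches merge_transcripts above):

data = task.input_data.get("transcripts")
print(
    f"[DEBUG] transcripts type={type(data).__name__} "
    f"keys={list(data) if isinstance(data, dict) else 'n/a'}",
    flush=True,
)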


Debugging Workflow

1. Add DEBUG Prints with Flush

Multiprocessing buffers stdout. Force immediate output:

import sys
print("[DEBUG] worker entered", flush=True)
sys.stdout.flush()

2. Test Worker Functions Directly

Bypass Conductor entirely to verify logic:

docker compose exec conductor-worker uv run python -c "
from reflector.conductor.workers.merge_transcripts import merge_transcripts
from conductor.client.http.models import Task

mock_task = Task()
mock_task.input_data = {'transcripts': {...}, 'transcript_id': 'test'}
result = merge_transcripts(mock_task)
print(result.output_data)
"

3. Check Task Timing

Suspiciously fast completion (e.g., 10ms) indicates:

  • Cached result from previous run
  • Wrong worker processed it
  • Task completed without actual execution

Check per-task durations via the API:

curl -s "http://localhost:8180/api/workflow/{id}" | \
  jq '.tasks[] | {ref: .referenceTaskName, duration: (.endTime - .startTime)}'

4. Verify Container Code Version

docker compose exec conductor-worker cat /app/reflector/conductor/workers/{file}.py | head -50

5. Use Conductor Retry API

Retry from specific failed task without re-running entire workflow:

curl -X POST "http://localhost:8180/api/workflow/{id}/retry"

Common Gotchas Summary

| Issue | Signal | Fix |
| --- | --- | --- |
| Wrong worker | workerId mismatch | Restart all worker containers |
| DB conflict | "another operation in progress" | Fresh DB connection per subprocess |
| Type mismatch | Pydantic validation error | Reconstruct objects from dicts |
| No logs | Task completes but no output | Check if a different container processed it |
| 0 results | JOIN output format | Convert dict .values() to list |
| Health check | Compose dependency fails | Use docker start directly |

Files Most Likely to Need Conductor-Specific Handling

  • server/reflector/conductor/workers/*.py - All workers need multiprocessing-safe patterns
  • server/reflector/db/__init__.py - Database singleton, needs context reset
  • server/reflector/conductor/workflows/*.json - Workflow definitions, check input/output mappings