Conductor OSS Migration - LLM Debugging Observations

This document captures hard-won debugging insights from migrating the multitrack diarization pipeline from Celery to Conductor OSS. These observations are particularly relevant for LLM assistants working on this codebase.

Architecture Context

  • Conductor Python SDK uses multiprocessing: 1 parent process spawns 15 TaskRunner subprocesses
  • Each task type gets its own subprocess that polls Conductor server
  • Workers are identified by container hostname (e.g., 595f5ddc9711)
  • Shadow mode (CONDUCTOR_SHADOW_MODE=true) runs both Celery and Conductor in parallel
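
For orientation, worker startup with the Conductor Python SDK looks roughly like this (a minimal sketch; the server URL is illustrative, and only one of the 15 task types is shown):

from conductor.client.automator.task_handler import TaskHandler
from conductor.client.configuration.configuration import Configuration
from conductor.client.worker.worker import Worker

from reflector.conductor.workers.merge_transcripts import merge_transcripts

config = Configuration(server_api_url="http://localhost:8180/api")

# One Worker per task type; TaskHandler spawns one polling subprocess per worker
workers = [
    Worker(task_definition_name="merge_transcripts", execute_function=merge_transcripts),
    # ... one entry per remaining task type (15 in this pipeline)
]

task_handler = TaskHandler(workers=workers, configuration=config)
task_handler.start_processes()  # forks the subprocesses that poll Conductor
task_handler.join_processes()   # block until shutdown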

Challenge 1: Ghost Workers - Multiple Containers Polling Same Tasks

Symptoms

  • Tasks complete but with wrong/empty output
  • Worker logs show no execution for a task that the API shows as COMPLETED
  • workerId in Conductor API doesn't match expected container

Root Cause

Multiple containers may be running Conductor workers:

  • reflector-conductor-worker-1 (dedicated worker)
  • reflector-server-1 (if shadow mode enabled or worker code imported)

Debugging Steps

# 1. Get the mystery worker ID from Conductor API
curl -s "http://localhost:8180/api/workflow/{id}" | jq '.tasks[] | {ref: .referenceTaskName, workerId}'

# 2. Find which container has that hostname
docker ps -a | grep {workerId}
# or
docker ps -a --format "{{.ID}} {{.Names}}" | grep {first-12-chars}

# 3. Check that container's code version
docker exec {container} cat /app/reflector/conductor/workers/{worker}.py | head -50

Resolution

Restart ALL containers that might be polling Conductor tasks:

docker compose restart conductor-worker server

Key Insight

Always verify workerId matches your expected container. In distributed worker setups, know ALL containers that poll for tasks.


Challenge 2: Multiprocessing + AsyncIO + Database Conflicts

Symptoms

InterfaceError: cannot perform operation: another operation is in progress
RuntimeError: Task <Task pending...> running at /app/.../worker.py

Root Cause

Conductor Python SDK forks subprocesses. When a subprocess calls asyncio.run():

  1. A new event loop is created
  2. But get_database() returns the cached connection from the parent process context
  3. The parent's connection is incompatible with the child's event loop

Resolution

Reset context and create fresh connection in each subprocess:

async def _process():
    import databases
    from reflector.db import _database_context
    from reflector.settings import settings

    # Reset context var - don't inherit from parent
    _database_context.set(None)
    db = databases.Database(settings.DATABASE_URL)
    _database_context.set(db)
    await db.connect()

    # ... rest of async code

Key Insight

Any singleton/cached resource (DB connections, S3 clients, HTTP sessions) must be recreated AFTER fork. Never trust inherited state in multiprocessing workers.
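
A fork-safe variant of the lazy-singleton pattern checks the PID before trusting the cache. This is an illustrative sketch, not current reflector code; it uses boto3, but the same applies to any cached client:

import os

_s3_client = None
_s3_client_pid = None

def get_s3_client():
    global _s3_client, _s3_client_pid
    pid = os.getpid()
    # A PID mismatch means we are in a forked child: the cached client
    # belongs to the parent process and must be rebuilt.
    if _s3_client is None or _s3_client_pid != pid:
        import boto3
        _s3_client = boto3.client("s3")
        _s3_client_pid = pid
    return _s3_client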

TODO: The Real Problem with get_database()

The current solution is a hack. The issue runs deeper than the multiprocessing fork:

What's Actually Happening

  1. Each Conductor subprocess calls asyncio.run(_process()) anew for each task
  2. First asyncio.run(): creates DB connection, stores in ContextVar
  3. First task completes, asyncio.run() exits, event loop destroyed
  4. But: ContextVar still holds the connection reference (ContextVars persist across asyncio.run() calls)
  5. Second asyncio.run(): Creates a new event loop
  6. Code tries to use the old connection (from ContextVar) with the new event loop
  7. Error: "another operation is in progress"

Root issue: get_database() as a global singleton is incompatible with repeated asyncio.run() calls in the same process.
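
A minimal repro of the same failure mode, using asyncio.Lock as a stand-in for the cached connection (requires Python 3.10+; illustrative, not reflector code):

import asyncio

# A module-level cache survives across asyncio.run() calls, like the ContextVar does
cache = {}

async def use_resource():
    if "lock" not in cache:
        cache["lock"] = asyncio.Lock()  # binds to the running loop on first use
    async with cache["lock"]:
        pass

asyncio.run(use_resource())  # OK: lock bound to loop #1
asyncio.run(use_resource())  # RuntimeError: ... is bound to a different event loop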

Option 1: Explicit Connection Lifecycle (cleanest)

async def _process():
    import databases
    from reflector.settings import settings

    # Don't use get_database() - create explicit connection
    db = databases.Database(settings.DATABASE_URL)

    try:
        await db.connect()

        # Problem: transcripts_controller.get_by_id() uses get_database() internally
        # Would need to refactor controllers to accept db parameter
        # e.g., await transcripts_controller.get_by_id(transcript_id, db=db)

    finally:
        await db.disconnect()

Pros: Clean separation, explicit lifecycle.
Cons: Requires refactoring all controller methods to accept a db parameter.
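
What that refactor might look like at the controller level (hypothetical signature and table name, not the current API):

import databases

async def get_by_id(transcript_id: str, db: databases.Database):
    # The connection is passed in explicitly; no get_database() singleton involved
    return await db.fetch_one(
        "SELECT * FROM transcript WHERE id = :id", values={"id": transcript_id}
    )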

Option 2: Reset ContextVar Properly (pragmatic)

async def _process():
    from reflector.db import _database_context, get_database

    # Ensure fresh connection per task
    old_db = _database_context.get()
    if old_db and old_db.is_connected:
        await old_db.disconnect()
    _database_context.set(None)

    # Now get_database() will create fresh connection
    db = get_database()
    await db.connect()

    try:
        # ... work ...
    finally:
        await db.disconnect()
        _database_context.set(None)

Pros: Works with existing controller code.
Cons: Still manipulating globals; cleanup needed in every worker.

Option 3: Fix get_database() Itself (best long-term)

# In reflector/db/__init__.py
def get_database() -> databases.Database:
    """Get database instance for current event loop"""
    import asyncio

    db = _database_context.get()

    # Check if connection is valid for current event loop
    if db is not None:
        try:
            loop = asyncio.get_running_loop()
            # If connection's event loop differs, it's stale
            if db._connection and hasattr(db._connection, '_loop'):
                if db._connection._loop != loop:
                    # Stale connection from old event loop
                    db = None
        except RuntimeError:
            # No running loop
            pass

    if db is None:
        db = databases.Database(settings.DATABASE_URL)
        _database_context.set(db)

    return db

Pros: Fixes the root cause; no changes needed in workers.
Cons: Relies on implementation details of the databases library.

Recommendation

  • Short-term: Option 2 (explicit cleanup in workers that need DB)
  • Long-term: Option 1 (refactor to dependency injection) is the only architecturally clean solution

Challenge 3: Type Mismatches Across Serialization Boundary

Symptoms

ValidationError: 1 validation error for TranscriptTopic
transcript
  Input should be a valid string [type=string_type, input_value={'translation': None, 'words': [...]}]

Root Cause

Conductor JSON-serializes all task inputs/outputs. Complex Pydantic models get serialized to dicts:

  • TitleSummary.transcript: Transcript becomes {"translation": null, "words": [...]}
  • Next task expects TranscriptTopic.transcript: str

Resolution

Explicitly reconstruct types when deserializing:

from reflector.processors.types import TitleSummary, Transcript as TranscriptType, Word

def normalize_topic(t):
    topic = dict(t)
    transcript_data = topic.get("transcript")
    if isinstance(transcript_data, dict):
        words_list = transcript_data.get("words", [])
        word_objects = [Word(**w) for w in words_list]
        topic["transcript"] = TranscriptType(
            words=word_objects,
            translation=transcript_data.get("translation")
        )
    return topic

topic_objects = [TitleSummary(**normalize_topic(t)) for t in topics]

Key Insight

Conductor task I/O is always JSON. Design workers to handle dict inputs and reconstruct domain objects explicitly.
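
Where the target model's declared field types already match the serialized shape, Pydantic v2 can rebuild nested models in one call (a sketch with assumed field names, and assuming the project is on Pydantic v2):

from pydantic import BaseModel

class Word(BaseModel):
    text: str
    start: float
    end: float

class Transcript(BaseModel):
    words: list[Word]
    translation: str | None = None

# Nested dicts are validated back into model instances automatically
t = Transcript.model_validate(
    {"words": [{"text": "hi", "start": 0.0, "end": 0.4}], "translation": None}
)

The manual normalize_topic() above is still needed where the declared type (str) differs from the serialized payload (dict).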


Challenge 4: Conductor Health Check Failures

Symptoms

dependency failed to start: container reflector-conductor-1 is unhealthy

Root Cause

The Conductor OSS standalone container's health endpoint can be slow or flaky, especially during startup or under load.

Resolution

Bypass docker-compose health check dependency:

# Instead of: docker compose up -d conductor-worker
docker start reflector-conductor-worker-1

Key Insight

For development, consider removing depends_on.condition: service_healthy or increasing health check timeout.
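
For example, a loosened health check in docker-compose might look like this (values and endpoint are illustrative, not the repo's actual configuration):

services:
  conductor:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 12
      start_period: 120s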


Challenge 5: JOIN Task Output Format

Symptoms

merge_transcripts receives data but outputs word_count: 0

Root Cause

FORK_JOIN_DYNAMIC's JOIN task outputs a dict keyed by task reference names, not an array:

{
  "transcribe_track_0": {"words": [...], "track_index": 0},
  "transcribe_track_1": {"words": [...], "track_index": 1}
}

Resolution

Handle both dict and array inputs:

transcripts = task.input_data.get("transcripts", [])

# Handle JOIN output (dict with task refs as keys)
if isinstance(transcripts, dict):
    transcripts = list(transcripts.values())

all_words = []
for t in transcripts:
    if isinstance(t, dict) and "words" in t:
        all_words.extend(t["words"])

Key Insight

JOIN task output structure differs from FORK input. Always log input types during debugging.
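
A cheap way to do that at the top of a worker (sketch; the key name matches merge_transcripts above):

data = task.input_data.get("transcripts")
print(
    f"[DEBUG] transcripts type={type(data).__name__} "
    f"keys={list(data) if isinstance(data, dict) else 'n/a'}",
    flush=True,
)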


Debugging Workflow

1. Add DEBUG Prints with Flush

Multiprocessing buffers stdout. Force immediate output:

import sys
print("[DEBUG] worker entered", flush=True)
sys.stdout.flush()

2. Test Worker Functions Directly

Bypass Conductor entirely to verify logic:

docker compose exec conductor-worker uv run python -c "
from reflector.conductor.workers.merge_transcripts import merge_transcripts
from conductor.client.http.models import Task

mock_task = Task()
mock_task.input_data = {'transcripts': {...}, 'transcript_id': 'test'}
result = merge_transcripts(mock_task)
print(result.output_data)
"

3. Check Task Timing

Suspiciously fast completion (e.g., 10ms) indicates:

  • Cached result from previous run
  • Wrong worker processed it
  • Task completed without actual execution

Check per-task durations via the API:

curl -s "http://localhost:8180/api/workflow/{id}" | \
  jq '.tasks[] | {ref: .referenceTaskName, duration: (.endTime - .startTime)}'

4. Verify Container Code Version

docker compose exec conductor-worker cat /app/reflector/conductor/workers/{file}.py | head -50

5. Use Conductor Retry API

Retry from specific failed task without re-running entire workflow:

curl -X POST "http://localhost:8180/api/workflow/{id}/retry"

Common Gotchas Summary

| Issue | Signal | Fix |
| --- | --- | --- |
| Wrong worker | workerId mismatch | Restart all worker containers |
| DB conflict | "another operation in progress" | Fresh DB connection per subprocess |
| Type mismatch | Pydantic validation error | Reconstruct objects from dicts |
| No logs | Task completes but no output | Check if a different container processed it |
| 0 results | JOIN output format | Convert dict .values() to list |
| Health check | Compose dependency fails | Use docker start directly |

Files Most Likely to Need Conductor-Specific Handling

  • server/reflector/conductor/workers/*.py - All workers need multiprocessing-safe patterns
  • server/reflector/db/__init__.py - Database singleton, needs context reset
  • server/reflector/conductor/workflows/*.json - Workflow definitions, check input/output mappings