# Hatchet Migration - LLM Debugging Observations

This document captures hard-won debugging insights from implementing the multitrack diarization pipeline with Hatchet. These observations are particularly relevant for LLM assistants working on this codebase.

## Architecture Context

- **Hatchet SDK v1.21+** uses async workers with gRPC for task polling
- Workers connect to the Hatchet server via gRPC (port 7077) and trigger workflows via REST (port 8888)
- The `hatchet-lite` image bundles server, engine, and database in one container
- Tasks are decorated with `@workflow.task()` (not `@hatchet.step()` as in older examples)
- Workflow input is validated via Pydantic models with the `input_validator=` parameter (a minimal definition is sketched below)

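Putting these pieces together, a minimal workflow definition might look like the following. This is an illustration assembled from the fragments in this document against SDK v1.21+, not code copied from the repo; the field set on `PipelineInput` is abbreviated:

```python
from hatchet_sdk import Context, Hatchet
from pydantic import BaseModel

hatchet = Hatchet()  # picks up HATCHET_CLIENT_* settings from the environment


class PipelineInput(BaseModel):
    transcript_id: str
    room_name: str


# The name here is the exact, case-sensitive trigger name (see Challenge 4)
diarization_pipeline = hatchet.workflow(
    name="DiarizationPipeline",
    input_validator=PipelineInput,
)


@diarization_pipeline.task()
async def get_recording(input: PipelineInput, ctx: Context) -> dict:
    # Task bodies receive the validated Pydantic input, not a raw dict
    return {"transcript_id": input.transcript_id}
```
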
---
## Challenge 1: SDK Version API Breaking Changes

### Symptoms

```
AttributeError: 'V1WorkflowRunDetails' object has no attribute 'workflow_run_id'
```

### Root Cause

Hatchet SDK v1.21+ changed the response structure for workflow creation. Old examples show:

```python
result = await client.runs.aio_create(workflow_name, input_data)
return result.workflow_run_id  # OLD - doesn't work
```

### Resolution

Access the run ID through the new nested structure:

```python
result = await client.runs.aio_create(workflow_name, input_data)
return result.run.metadata.id  # NEW - SDK v1.21+
```

### Key Insight

**Don't trust documentation or examples.** Read the SDK source code or use IDE autocomplete to discover actual attribute names. The SDK evolves faster than docs.
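
If code must tolerate both SDK generations during a migration, a small compatibility accessor can bridge the gap. This is a sketch assuming only the two response shapes shown above; `extract_run_id` is a hypothetical helper, not part of the SDK:

```python
def extract_run_id(result) -> str:
    # Hypothetical helper; assumes only the two shapes shown above.
    # Pre-v1.21 SDKs expose `workflow_run_id` directly; v1.21+ nests
    # the ID under `run.metadata.id`.
    run_id = getattr(result, "workflow_run_id", None)
    if run_id is not None:
        return run_id
    return result.run.metadata.id
```
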
---
## Challenge 2: Worker Appears Hung at "starting runner..."

### Symptoms

```
[INFO] Starting Hatchet workers
[INFO] Starting Hatchet worker polling...
[INFO] STARTING HATCHET...
[INFO] starting runner...
# ... nothing else, appears stuck
```

### Root Cause

Without debug mode, the Hatchet SDK doesn't log:

- Workflow registration
- gRPC connection status
- Heartbeat activity
- Action listener acquisition

The worker IS working; you just can't see it.

### Resolution

Always enable debug mode during development:

```bash
HATCHET_DEBUG=true
```

With debug enabled, you'll see the actual activity:

```
[DEBUG] 'worker-name' waiting for ['workflow:task1', 'workflow:task2']
[DEBUG] starting action listener: worker-name
[DEBUG] acquired action listener: 562d00a8-8895-42a1-b65b-46f905c902f9
[DEBUG] sending heartbeat
```

### Key Insight

**Start every Hatchet debugging session with `HATCHET_DEBUG=true`.** Silent workers waste hours of debugging time.
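
With debug on, a quick liveness check is to grep the worker logs for heartbeat and listener activity. The container name below is hypothetical; substitute your compose project's worker container:

```bash
# Container name is hypothetical - adjust to your compose project
docker logs reflector-hatchet-worker-1 2>&1 | grep -E "action listener|heartbeat|waiting for"
```
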
---
## Challenge 3: Docker Networking + JWT Token URL Conflicts

### Symptoms

```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
```

### Root Cause

The Hatchet API token embeds URLs:

```json
{
  "aud": "http://localhost:8889",
  "grpc_broadcast_address": "localhost:7077",
  "server_url": "http://localhost:8889"
}
```

Inside Docker containers, `localhost` refers to the container itself, not the Hatchet server.

### Resolution

Override the token-embedded URLs with environment variables:

```bash
# In .env or docker-compose environment
HATCHET_CLIENT_HOST_PORT=hatchet:7077
HATCHET_CLIENT_SERVER_URL=http://hatchet:8888
HATCHET_CLIENT_TLS_STRATEGY=none
```

### Key Insight

**The JWT token is not the final word on connection settings.** Environment variables override token-embedded URLs, which is essential for Docker networking.
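
To verify which URLs a given token actually embeds, decode its payload locally. A JWT payload is just base64url-encoded JSON; this sketch assumes the token is available in the `HATCHET_CLIENT_TOKEN` environment variable:

```python
import base64
import json
import os

token = os.environ["HATCHET_CLIENT_TOKEN"]
payload = token.split(".")[1]          # header.payload.signature
payload += "=" * (-len(payload) % 4)   # restore stripped base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload))
# Inspect aud / server_url / grpc_broadcast_address for localhost values
print(json.dumps(claims, indent=2))
```
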
---
## Challenge 4: Workflow Name Case Sensitivity

### Symptoms

```
BadRequestException: (400)
HTTP response body: errors=[APIError(description='workflow names not found: diarizationpipeline')]
```

### Root Cause

Hatchet uses the exact workflow name you define for triggering:

```python
diarization_pipeline = hatchet.workflow(
    name="DiarizationPipeline",  # Use THIS exact name to trigger
    input_validator=PipelineInput,
)
```

Internally, task identifiers are lowercased (`diarizationpipeline:get_recording`), but workflow triggers must match the defined name.

### Resolution

```python
# Correct
await client.start_workflow('DiarizationPipeline', input_data)

# Wrong
await client.start_workflow('diarizationpipeline', input_data)
```

### Key Insight

**Workflow names are case-sensitive for triggering, but task refs are lowercase.** Don't conflate the two.
---
## Challenge 5: Pydantic Response Object Iteration

### Symptoms

```
AttributeError: 'tuple' object has no attribute 'participant_id'
```

### Root Cause

When API responses return Pydantic models with list fields:

```python
class MeetingParticipantsResponse(BaseModel):
    data: List[MeetingParticipant]
```

Iterating the response object directly is wrong:

```python
for p in participants:  # WRONG - iterates over model fields as tuples
    ...
```

### Resolution

Access the `.data` attribute explicitly:

```python
for p in participants.data:  # CORRECT - iterates over list items
    print(p.participant_id)
```

### Key Insight

**Pydantic models with list fields require explicit `.data` access.** Iterating the model itself yields its fields as `(name, value)` tuples, not the list items.
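
The failure mode is easy to reproduce in isolation (model names reused from above purely for illustration):

```python
from typing import List

from pydantic import BaseModel


class MeetingParticipant(BaseModel):
    participant_id: str


class MeetingParticipantsResponse(BaseModel):
    data: List[MeetingParticipant]


resp = MeetingParticipantsResponse(
    data=[MeetingParticipant(participant_id="p1")]
)

for item in resp:  # iterating the model yields (field_name, value) tuples
    print(item)    # ('data', [MeetingParticipant(participant_id='p1')])

for p in resp.data:  # iterating .data yields the actual list items
    print(p.participant_id)  # p1
```
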
---
## Challenge 6: Database Connections in Async Workers

### Symptoms

```
InterfaceError: cannot perform operation: another operation is in progress
```

### Root Cause

Similar to Conductor, Hatchet workers may inherit stale database connections. Each task runs in an async context that may not share the same event loop as cached connections.

### Resolution

Create fresh database connections per task:

```python
async def _get_fresh_db_connection():
    """Create fresh database connection for worker task."""
    import databases

    from reflector.db import _database_context
    from reflector.settings import settings

    _database_context.set(None)
    db = databases.Database(settings.DATABASE_URL)
    _database_context.set(db)
    await db.connect()
    return db


async def _close_db_connection(db):
    await db.disconnect()
    _database_context.set(None)
```

### Key Insight

**Cached singletons (DB, HTTP clients) are unsafe in workflow workers.** Always create fresh connections.
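
In a task body, these helpers pair naturally with `try`/`finally` so the connection is released even when the task fails. A sketch reusing this pipeline's names; the query is a placeholder:

```python
@diarization_pipeline.task()
async def get_recording(input: PipelineInput, ctx: Context) -> dict:
    db = await _get_fresh_db_connection()
    try:
        # Placeholder query - real tasks run their own statements here
        row = await db.fetch_one("SELECT 1 AS ok")
        return {"ok": bool(row["ok"]) if row else False}
    finally:
        # Release the per-task connection even on failure
        await _close_db_connection(db)
```
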
---
## Challenge 7: Child Workflow Fan-out Pattern

### Symptoms

Child workflows spawn but the parent doesn't wait for completion, or results aren't collected.

### Root Cause

Hatchet child workflows need explicit spawning and result collection:

```python
# Spawning children
child_runs = await asyncio.gather(*[
    child_workflow.aio_run(child_input)
    for child_input in inputs
])

# Results are returned directly from aio_run()
```

### Resolution

Use `aio_run()` for child workflows and `asyncio.gather()` for parallelism:

```python
@parent_workflow.task(parents=[setup_task])
async def process_tracks(input: ParentInput, ctx: Context) -> dict:
    child_coroutines = [
        track_workflow.aio_run(TrackInput(track_index=i, ...))
        for i in range(len(input.tracks))
    ]

    results = await asyncio.gather(*child_coroutines, return_exceptions=True)

    # Handle failures
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            logger.error(f"Track {i} failed: {result}")

    return {"track_results": [r for r in results if not isinstance(r, Exception)]}
```

### Key Insight

**Child workflows in Hatchet return results directly.** No need to poll for completion like in Conductor.
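
The resolution above silently drops failed tracks from the returned results. If partial results are unacceptable, a stricter variant (a sketch; `raise_on_failures` is a hypothetical helper) re-raises so Hatchet marks the parent run as failed:

```python
def raise_on_failures(results: list) -> None:
    """Hypothetical helper: fail fast instead of dropping failed tracks."""
    failures = [r for r in results if isinstance(r, Exception)]
    if failures:
        # Raising inside the task body fails the parent workflow run
        raise RuntimeError(f"{len(failures)}/{len(results)} track workflows failed")
```
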
---
## Debugging Workflow

### 1. Enable Debug Mode First

```bash
HATCHET_DEBUG=true
```

### 2. Verify Worker Registration

Look for this in debug logs:

```
[DEBUG] 'worker-name' waiting for ['workflow:task1', 'workflow:task2', ...]
[DEBUG] acquired action listener: {uuid}
```

### 3. Test Workflow Trigger Separately

```bash
docker exec server uv run python -c "
from reflector.hatchet.client import HatchetClientManager
from reflector.hatchet.workflows.diarization_pipeline import PipelineInput
import asyncio

async def test():
    input_data = PipelineInput(
        transcript_id='test',
        recording_id=None,
        room_name='test-room',
        bucket_name='bucket',
        tracks=[],
    )
    run_id = await HatchetClientManager.start_workflow(
        'DiarizationPipeline',
        input_data.model_dump()
    )
    print(f'Triggered: {run_id}')

asyncio.run(test())
"
```

### 4. Check Hatchet Server Logs

```bash
docker logs reflector-hatchet-1 --tail 50
```

Look for `WRN` entries indicating API errors or connection issues.

### 5. Verify gRPC Connectivity

```bash
docker exec worker python -c "
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('hatchet', 7077))
print(f'gRPC port 7077: {\"reachable\" if result == 0 else \"blocked\"}')"
```

### 6. Force Container Rebuild

Volume mounts may cache old bytecode:

```bash
docker compose up -d --build --force-recreate hatchet-worker
```
---
## Common Gotchas Summary

| Issue | Signal | Fix |
|-------|--------|-----|
| SDK API changed | `AttributeError` on result | Check SDK source for actual attributes |
| Worker appears stuck | Only "starting runner..." | Enable `HATCHET_DEBUG=true` |
| Can't connect from Docker | gRPC unavailable | Set `HATCHET_CLIENT_HOST_PORT` and `_SERVER_URL` |
| Workflow not found | 400 Bad Request | Use exact case-sensitive workflow name |
| Tuple iteration error | `'tuple' has no attribute` | Access `.data` on Pydantic response models |
| DB conflicts | "another operation in progress" | Fresh DB connection per task |
| Old code running | Fixed code but same error | Force rebuild container, clear `__pycache__` |
---
## Files Most Likely to Need Hatchet-Specific Handling

- `server/reflector/hatchet/workflows/*.py` - Workflow and task definitions
- `server/reflector/hatchet/client.py` - Client wrapper, SDK version compatibility
- `server/reflector/hatchet/run_workers.py` - Worker startup and registration
- `server/reflector/hatchet/progress.py` - Progress emission for UI updates
- `docker-compose.yml` - Hatchet infrastructure services