# Hatchet Migration - LLM Debugging Observations

This document captures hard-won debugging insights from implementing the multitrack diarization pipeline with Hatchet. These observations are particularly relevant for LLM assistants working on this codebase.

## Architecture Context

- **Hatchet SDK v1.21+** uses async workers with gRPC for task polling
- Workers connect to the Hatchet server via gRPC (port 7077) and trigger workflows via REST (port 8888)
- The `hatchet-lite` image bundles server, engine, and database in one container
- Tasks are decorated with `@workflow.task()` (not `@hatchet.step()` as in older examples); see the sketch below
- Workflow input is validated via Pydantic models passed through the `input_validator=` parameter
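
Putting those pieces together, a minimal sketch of how a workflow and its first task are declared under this setup. It follows the patterns shown in the challenges below; the names `ExampleInput`, `example_workflow`, and `first_task` are illustrative, not taken from this codebase.

```python
from hatchet_sdk import Context, Hatchet
from pydantic import BaseModel

hatchet = Hatchet()


# Illustrative input model; real workflows use e.g. PipelineInput.
class ExampleInput(BaseModel):
    transcript_id: str


# Input is validated against the Pydantic model passed as input_validator=.
example_workflow = hatchet.workflow(
    name="ExampleWorkflow",
    input_validator=ExampleInput,
)


# Tasks use @workflow.task(), not the older @hatchet.step().
@example_workflow.task()
async def first_task(input: ExampleInput, ctx: Context) -> dict:
    return {"transcript_id": input.transcript_id}
```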

---

## Challenge 1: SDK Version API Breaking Changes

### Symptoms
```
AttributeError: 'V1WorkflowRunDetails' object has no attribute 'workflow_run_id'
```

### Root Cause
Hatchet SDK v1.21+ changed the response structure for workflow creation. Old examples show:
```python
result = await client.runs.aio_create(workflow_name, input_data)
return result.workflow_run_id  # OLD - doesn't work
```

### Resolution
Access the run ID through the new nested structure:
```python
result = await client.runs.aio_create(workflow_name, input_data)
return result.run.metadata.id  # NEW - SDK v1.21+
```

### Key Insight
**Don't trust documentation or examples.** Read the SDK source code or use IDE autocomplete to discover actual attribute names. The SDK evolves faster than docs.
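
One quick way to do that from a REPL or a throwaway script is to introspect the object you actually get back; a sketch (the `model_dump()` call assumes the response object is a Pydantic model, which is an assumption rather than something stated here):

```python
result = await client.runs.aio_create(workflow_name, input_data)

print(type(result))  # e.g. V1WorkflowRunDetails
print([attr for attr in dir(result) if not attr.startswith("_")])

# If the response is a Pydantic model, this shows the full nested structure:
print(result.model_dump())
```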

---

## Challenge 2: Worker Appears Hung at "starting runner..."

### Symptoms
```
[INFO] Starting Hatchet workers
[INFO] Starting Hatchet worker polling...
[INFO] STARTING HATCHET...
[INFO] starting runner...
# ... nothing else, appears stuck
```

### Root Cause
Without debug mode, the Hatchet SDK doesn't log:
- Workflow registration
- gRPC connection status
- Heartbeat activity
- Action listener acquisition

The worker IS working; you just can't see it.

### Resolution
Always enable debug mode during development:
```bash
HATCHET_DEBUG=true
```

With debug enabled, you'll see the actual activity:
```
[DEBUG] 'worker-name' waiting for ['workflow:task1', 'workflow:task2']
[DEBUG] starting action listener: worker-name
[DEBUG] acquired action listener: 562d00a8-8895-42a1-b65b-46f905c902f9
[DEBUG] sending heartbeat
```

### Key Insight
**Start every Hatchet debugging session with `HATCHET_DEBUG=true`.** Silent workers waste hours of debugging time.

---

## Challenge 3: Docker Networking + JWT Token URL Conflicts

### Symptoms
```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
```

### Root Cause
The Hatchet API token embeds URLs:
```json
{
  "aud": "http://localhost:8889",
  "grpc_broadcast_address": "localhost:7077",
  "server_url": "http://localhost:8889"
}
```

Inside Docker containers, `localhost` refers to the container itself, not the Hatchet server.
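
To check what a given token embeds, decoding the JWT payload locally is enough. A sketch that only base64-decodes the middle segment and does not verify the signature; reading the token from `HATCHET_CLIENT_TOKEN` is an assumption about where it is configured:

```python
import base64
import json
import os

# The environment variable name is an assumption; read the token from wherever it lives.
token = os.environ["HATCHET_CLIENT_TOKEN"]

# JWT = header.payload.signature; the payload is base64url-encoded JSON.
payload_b64 = token.split(".")[1]
payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
payload = json.loads(base64.urlsafe_b64decode(payload_b64))

print(json.dumps(payload, indent=2))  # inspect aud, grpc_broadcast_address, server_url
```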

### Resolution
Override the token-embedded URLs with environment variables:
```bash
# In .env or docker-compose environment
HATCHET_CLIENT_HOST_PORT=hatchet:7077
HATCHET_CLIENT_SERVER_URL=http://hatchet:8888
HATCHET_CLIENT_TLS_STRATEGY=none
```

### Key Insight
**The JWT token is not the final word on connection settings.** Environment variables override token-embedded URLs, which is essential for Docker networking.

---

## Challenge 4: Workflow Name Case Sensitivity

### Symptoms
```
BadRequestException: (400)
HTTP response body: errors=[APIError(description='workflow names not found: diarizationpipeline')]
```

### Root Cause
Hatchet uses the exact workflow name you define for triggering:
```python
diarization_pipeline = hatchet.workflow(
    name="DiarizationPipeline",  # Use THIS exact name to trigger
    input_validator=PipelineInput,
)
```

Internally, task identifiers are lowercased (`diarizationpipeline:get_recording`), but workflow triggers must match the defined name.

### Resolution
```python
# Correct
await client.start_workflow('DiarizationPipeline', input_data)

# Wrong
await client.start_workflow('diarizationpipeline', input_data)
```

### Key Insight
**Workflow names are case-sensitive for triggering, but task refs are lowercase.** Don't conflate the two.

---

## Challenge 5: Pydantic Response Object Iteration

### Symptoms
```
AttributeError: 'tuple' object has no attribute 'participant_id'
```

### Root Cause
When an API response is a Pydantic model with a list field:
```python
class MeetingParticipantsResponse(BaseModel):
    data: List[MeetingParticipant]
```

Iterating the response object directly is wrong:
```python
for p in participants:  # WRONG - iterates over model fields as (name, value) tuples
    print(p.participant_id)  # AttributeError: 'tuple' object has no attribute 'participant_id'
```

### Resolution
Access the `.data` attribute explicitly:
```python
for p in participants.data:  # CORRECT - iterates over list items
    print(p.participant_id)
```

### Key Insight
**Pydantic models with list fields require explicit `.data` access.** The model itself is not iterable in the expected way.
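
A self-contained reproduction of the pitfall, reusing the model names from above (assumes only that Pydantic is installed):

```python
from typing import List

from pydantic import BaseModel


class MeetingParticipant(BaseModel):
    participant_id: str


class MeetingParticipantsResponse(BaseModel):
    data: List[MeetingParticipant]


resp = MeetingParticipantsResponse(data=[MeetingParticipant(participant_id="p1")])

# Iterating the model yields (field_name, value) tuples, not list items.
for item in resp:
    print(item)  # ('data', [MeetingParticipant(participant_id='p1')])

# Iterating .data yields the actual participants.
for p in resp.data:
    print(p.participant_id)  # p1
```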

---

## Challenge 6: Database Connections in Async Workers

### Symptoms
```
InterfaceError: cannot perform operation: another operation is in progress
```

### Root Cause
Similar to Conductor, Hatchet workers may inherit stale database connections. Each task runs in an async context that may not share the same event loop as cached connections.

### Resolution
Create fresh database connections per task:
```python
async def _get_fresh_db_connection():
    """Create fresh database connection for worker task."""
    import databases

    from reflector.db import _database_context
    from reflector.settings import settings

    _database_context.set(None)
    db = databases.Database(settings.DATABASE_URL)
    _database_context.set(db)
    await db.connect()
    return db


async def _close_db_connection(db):
    await db.disconnect()
    _database_context.set(None)
```
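
A sketch of how a task might use these helpers so the fresh connection is always released, even when the task raises; the decorator, task name, and signature here are illustrative rather than taken from the codebase:

```python
@diarization_pipeline.task()
async def get_recording(input: PipelineInput, ctx: Context) -> dict:
    # Fresh connection per task; never reuse a cached singleton here.
    db = await _get_fresh_db_connection()
    try:
        ...  # queries using the fresh connection
        return {"status": "ok"}
    finally:
        await _close_db_connection(db)
```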

### Key Insight
**Cached singletons (DB, HTTP clients) are unsafe in workflow workers.** Always create fresh connections.

---

## Challenge 7: Child Workflow Fan-out Pattern

### Symptoms
Child workflows spawn, but the parent doesn't wait for completion, or results aren't collected.

### Root Cause
Hatchet child workflows need explicit spawning and result collection:
```python
# Spawning children
child_runs = await asyncio.gather(*[
    child_workflow.aio_run(child_input)
    for child_input in inputs
])

# Results are returned directly from aio_run()
```

### Resolution
Use `aio_run()` for child workflows and `asyncio.gather()` for parallelism:
```python
@parent_workflow.task(parents=[setup_task])
async def process_tracks(input: ParentInput, ctx: Context) -> dict:
    child_coroutines = [
        track_workflow.aio_run(TrackInput(track_index=i, ...))
        for i in range(len(input.tracks))
    ]

    results = await asyncio.gather(*child_coroutines, return_exceptions=True)

    # Handle failures
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            logger.error(f"Track {i} failed: {result}")

    return {"track_results": [r for r in results if not isinstance(r, Exception)]}
```

### Key Insight
**Child workflows in Hatchet return results directly.** No need to poll for completion like in Conductor.

---

## Debugging Workflow

### 1. Enable Debug Mode First
```bash
HATCHET_DEBUG=true
```

### 2. Verify Worker Registration
Look for this in debug logs:
```
[DEBUG] 'worker-name' waiting for ['workflow:task1', 'workflow:task2', ...]
[DEBUG] acquired action listener: {uuid}
```
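
If the worker logs are noisy, filtering for just these lines helps; a sketch, where `<worker-container>` is a placeholder for the actual container name:

```bash
# Placeholder container name; substitute the real worker container.
docker logs <worker-container> 2>&1 | grep -E "waiting for|action listener|heartbeat"
```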

### 3. Test Workflow Trigger Separately
```bash
docker exec server uv run python -c "
from reflector.hatchet.client import HatchetClientManager
from reflector.hatchet.workflows.diarization_pipeline import PipelineInput
import asyncio

async def test():
    input_data = PipelineInput(
        transcript_id='test',
        recording_id=None,
        room_name='test-room',
        bucket_name='bucket',
        tracks=[],
    )
    run_id = await HatchetClientManager.start_workflow(
        'DiarizationPipeline',
        input_data.model_dump()
    )
    print(f'Triggered: {run_id}')

asyncio.run(test())
"
```

### 4. Check Hatchet Server Logs
```bash
docker logs reflector-hatchet-1 --tail 50
```

Look for `WRN` entries indicating API errors or connection issues.

### 5. Verify gRPC Connectivity
```bash
docker exec worker python -c "
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('hatchet', 7077))
print(f'gRPC port 7077: {\"reachable\" if result == 0 else \"blocked\"}')"
```

### 6. Force Container Rebuild
Volume mounts may cache old bytecode:
```bash
docker compose up -d --build --force-recreate hatchet-worker
```
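
If stale bytecode still survives the rebuild, clearing compiled caches in the mounted source is a reasonable extra step; a sketch, assuming it is run from the repository's `server/` directory:

```bash
# Remove compiled bytecode caches, then force-recreate the worker as above.
find . -type d -name "__pycache__" -exec rm -rf {} +
```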

---

## Common Gotchas Summary

| Issue | Signal | Fix |
|-------|--------|-----|
| SDK API changed | `AttributeError` on result | Check SDK source for actual attributes |
| Worker appears stuck | Only "starting runner..." | Enable `HATCHET_DEBUG=true` |
| Can't connect from Docker | gRPC unavailable | Set `HATCHET_CLIENT_HOST_PORT` and `_SERVER_URL` |
| Workflow not found | 400 Bad Request | Use exact case-sensitive workflow name |
| Tuple iteration error | `'tuple' has no attribute` | Access `.data` on Pydantic response models |
| DB conflicts | "another operation in progress" | Fresh DB connection per task |
| Old code running | Fixed code but same error | Force rebuild container, clear `__pycache__` |

---

## Files Most Likely to Need Hatchet-Specific Handling

- `server/reflector/hatchet/workflows/*.py` - Workflow and task definitions
- `server/reflector/hatchet/client.py` - Client wrapper, SDK version compatibility
- `server/reflector/hatchet/run_workers.py` - Worker startup and registration
- `server/reflector/hatchet/progress.py` - Progress emission for UI updates
- `docker-compose.yml` - Hatchet infrastructure services