hatchet no-mistake

This commit is contained in:
Igor Loskutov
2025-12-16 00:48:30 -05:00
parent 243ff2177c
commit c5498d26bf
18 changed files with 2189 additions and 1952 deletions

# Hatchet Migration - LLM Debugging Observations
This document captures hard-won debugging insights from implementing the multitrack diarization pipeline with Hatchet. These observations are particularly relevant for LLM assistants working on this codebase.
## Architecture Context
- **Hatchet SDK v1.21+** uses async workers with gRPC for task polling
- Workers connect to Hatchet server via gRPC (port 7077) and trigger workflows via REST (port 8888)
- `hatchet-lite` image bundles server, engine, and database in one container
- Tasks are decorated with `@workflow.task()` (not `@hatchet.step()` as in older examples)
- Workflow input is validated via Pydantic models with the `input_validator=` parameter (see the sketch after this list)
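A minimal definition in this style, as a sketch based on the patterns used in this codebase (the `PipelineInput` field shown is illustrative):
```python
from hatchet_sdk import Context, Hatchet
from pydantic import BaseModel

hatchet = Hatchet()  # reads HATCHET_CLIENT_* settings from the environment

class PipelineInput(BaseModel):
    transcript_id: str

# The name given here is the one used to trigger the workflow (see Challenge 4).
pipeline = hatchet.workflow(name="DiarizationPipeline", input_validator=PipelineInput)

@pipeline.task()
async def get_recording(input: PipelineInput, ctx: Context) -> dict:
    # Internally this task is addressed as "diarizationpipeline:get_recording".
    return {"transcript_id": input.transcript_id}
```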
---
## Challenge 1: SDK Version API Breaking Changes
### Symptoms
```
AttributeError: 'V1WorkflowRunDetails' object has no attribute 'workflow_run_id'
```
### Root Cause
Hatchet SDK v1.21+ changed the response structure for workflow creation. Old examples show:
```python
result = await client.runs.aio_create(workflow_name, input_data)
return result.workflow_run_id # OLD - doesn't work
```
### Resolution
Access the run ID through the new nested structure:
```python
result = await client.runs.aio_create(workflow_name, input_data)
return result.run.metadata.id # NEW - SDK v1.21+
```
### Key Insight
**Don't trust documentation or examples.** Read the SDK source code or use IDE autocomplete to discover actual attribute names. The SDK evolves faster than docs.
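If code must tolerate both shapes during an upgrade window, a hypothetical helper like this avoids hard-coding either attribute path:
```python
def extract_run_id(result) -> str:
    """Hypothetical helper: pull the run ID out of either SDK response shape."""
    run = getattr(result, "run", None)
    if run is not None:
        return run.metadata.id  # SDK v1.21+ nested structure
    return result.workflow_run_id  # pre-v1.21 flat attribute
```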
---
## Challenge 2: Worker Appears Hung at "starting runner..."
### Symptoms
```
[INFO] Starting Hatchet workers
[INFO] Starting Hatchet worker polling...
[INFO] STARTING HATCHET...
[INFO] starting runner...
# ... nothing else, appears stuck
```
### Root Cause
Without debug mode, the Hatchet SDK doesn't log:
- Workflow registration
- gRPC connection status
- Heartbeat activity
- Action listener acquisition
The worker IS working; you just can't see it.
### Resolution
Always enable debug mode during development:
```bash
HATCHET_DEBUG=true
```
With debug enabled, you'll see the actual activity:
```
[DEBUG] 'worker-name' waiting for ['workflow:task1', 'workflow:task2']
[DEBUG] starting action listener: worker-name
[DEBUG] acquired action listener: 562d00a8-8895-42a1-b65b-46f905c902f9
[DEBUG] sending heartbeat
```
### Key Insight
**Start every Hatchet debugging session with `HATCHET_DEBUG=true`.** Silent workers waste hours of debugging time.
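Recent SDK versions also accept a `debug` flag at client construction (verify against the pinned SDK version), which has the same effect for that process:
```python
from hatchet_sdk import Hatchet

# Equivalent to exporting HATCHET_DEBUG=true for this process.
hatchet = Hatchet(debug=True)
```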
---
## Challenge 3: Docker Networking + JWT Token URL Conflicts
### Symptoms
```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
```
### Root Cause
The Hatchet API token embeds URLs:
```json
{
"aud": "http://localhost:8889",
"grpc_broadcast_address": "localhost:7077",
"server_url": "http://localhost:8889"
}
```
Inside Docker containers, `localhost` refers to the container itself, not the Hatchet server.
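To confirm what a given token embeds, decode its payload without verifying the signature (a minimal sketch; assumes the token is exported as `HATCHET_CLIENT_TOKEN`):
```python
import base64
import json
import os

token = os.environ["HATCHET_CLIENT_TOKEN"]
payload = token.split(".")[1]
payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload))
print(claims["grpc_broadcast_address"], claims["server_url"])
```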
### Resolution
Override the token-embedded URLs with environment variables:
```bash
# In .env or docker-compose environment
HATCHET_CLIENT_HOST_PORT=hatchet:7077
HATCHET_CLIENT_SERVER_URL=http://hatchet:8888
HATCHET_CLIENT_TLS_STRATEGY=none
```
### Key Insight
**The JWT token is not the final word on connection settings.** Environment variables override token-embedded URLs, which is essential for Docker networking.
---
## Challenge 4: Workflow Name Case Sensitivity
### Symptoms
```
BadRequestException: (400)
HTTP response body: errors=[APIError(description='workflow names not found: diarizationpipeline')]
```
### Root Cause
Hatchet uses the exact workflow name you define for triggering:
```python
diarization_pipeline = hatchet.workflow(
name="DiarizationPipeline", # Use THIS exact name to trigger
input_validator=PipelineInput
)
```
Internally, task identifiers are lowercased (`diarizationpipeline:get_recording`), but workflow triggers must match the defined name.
### Resolution
```python
# Correct
await client.start_workflow('DiarizationPipeline', input_data)
# Wrong
await client.start_workflow('diarizationpipeline', input_data)
```
### Key Insight
**Workflow names are case-sensitive for triggering, but task identifiers are lowercased.** Don't conflate the two.
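One way to avoid the mismatch (a sketch; the constant is illustrative) is to define the name once and reuse it at both the definition and the trigger site:
```python
# Single source of truth for the case-sensitive workflow name.
DIARIZATION_PIPELINE_NAME = "DiarizationPipeline"

diarization_pipeline = hatchet.workflow(
    name=DIARIZATION_PIPELINE_NAME,
    input_validator=PipelineInput,
)

# Later, at the trigger site:
await client.start_workflow(DIARIZATION_PIPELINE_NAME, input_data)
```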
---
## Challenge 5: Pydantic Response Object Iteration
### Symptoms
```
AttributeError: 'tuple' object has no attribute 'participant_id'
```
### Root Cause
When an API client returns a Pydantic model with a list field:
```python
class MeetingParticipantsResponse(BaseModel):
data: List[MeetingParticipant]
```
Iterating the response object directly is wrong:
```python
for p in participants: # WRONG - iterates over model fields as tuples
```
### Resolution
Access the `.data` attribute explicitly:
```python
for p in participants.data: # CORRECT - iterates over list items
print(p.participant_id)
```
### Key Insight
**Pydantic models with list fields require explicit `.data` access.** Iterating the model itself yields `(field_name, value)` tuples, not the list items.
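A self-contained repro of both behaviors (model names mirror the example above):
```python
from typing import List

from pydantic import BaseModel

class MeetingParticipant(BaseModel):
    participant_id: str

class MeetingParticipantsResponse(BaseModel):
    data: List[MeetingParticipant]

resp = MeetingParticipantsResponse(data=[MeetingParticipant(participant_id="p1")])
print(list(resp))       # [('data', [...])] -- (field_name, value) tuples
print(list(resp.data))  # [MeetingParticipant(participant_id='p1')] -- actual items
```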
---
## Challenge 6: Database Connections in Async Workers
### Symptoms
```
InterfaceError: cannot perform operation: another operation is in progress
```
### Root Cause
Similar to Conductor, Hatchet workers may inherit stale database connections. Each task runs in an async context that may not share the same event loop as cached connections.
### Resolution
Create fresh database connections per task:
```python
async def _get_fresh_db_connection():
    """Create a fresh database connection for a worker task."""
    import databases

    from reflector.db import _database_context
    from reflector.settings import settings

    # Clear any connection cached from another event loop before replacing it.
    _database_context.set(None)
    db = databases.Database(settings.DATABASE_URL)
    _database_context.set(db)
    await db.connect()
    return db


async def _close_db_connection(db):
    """Disconnect and clear the context so a closed connection is never reused."""
    await db.disconnect()
    _database_context.set(None)
```
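Each task then brackets its database work with these helpers (a sketch; the task name and query are illustrative):
```python
@pipeline.task()
async def load_recording(input: PipelineInput, ctx: Context) -> dict:
    db = await _get_fresh_db_connection()
    try:
        row = await db.fetch_one("SELECT id FROM recordings LIMIT 1")
        return {"recording_id": row["id"] if row else None}
    finally:
        # Always release the connection, even if the task body raises.
        await _close_db_connection(db)
```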
### Key Insight
**Cached singletons (DB, HTTP clients) are unsafe in workflow workers.** Always create fresh connections.
---
## Challenge 7: Child Workflow Fan-out Pattern
### Symptoms
Child workflows spawn but parent doesn't wait for completion, or results aren't collected.
### Root Cause
Hatchet child workflows need explicit spawning and result collection:
```python
# Spawning children
child_runs = await asyncio.gather(*[
child_workflow.aio_run(child_input)
for child_input in inputs
])
# Results are returned directly from aio_run()
```
### Resolution
Use `aio_run()` for child workflows and `asyncio.gather()` for parallelism:
```python
@parent_workflow.task(parents=[setup_task])
async def process_tracks(input: ParentInput, ctx: Context) -> dict:
child_coroutines = [
track_workflow.aio_run(TrackInput(track_index=i, ...))
for i in range(len(input.tracks))
]
results = await asyncio.gather(*child_coroutines, return_exceptions=True)
# Handle failures
for i, result in enumerate(results):
if isinstance(result, Exception):
logger.error(f"Track {i} failed: {result}")
return {"track_results": [r for r in results if not isinstance(r, Exception)]}
```
### Key Insight
**Child workflows in Hatchet return results directly.** No need to poll for completion like in Conductor.
---
## Debugging Workflow
### 1. Enable Debug Mode First
```bash
HATCHET_DEBUG=true
```
### 2. Verify Worker Registration
Look for this in debug logs:
```
[DEBUG] 'worker-name' waiting for ['workflow:task1', 'workflow:task2', ...]
[DEBUG] acquired action listener: {uuid}
```
### 3. Test Workflow Trigger Separately
```python
docker exec server uv run python -c "
from reflector.hatchet.client import HatchetClientManager
from reflector.hatchet.workflows.diarization_pipeline import PipelineInput
import asyncio
async def test():
input_data = PipelineInput(
transcript_id='test',
recording_id=None,
room_name='test-room',
bucket_name='bucket',
tracks=[],
)
run_id = await HatchetClientManager.start_workflow(
'DiarizationPipeline',
input_data.model_dump()
)
print(f'Triggered: {run_id}')
asyncio.run(test())
"
```
### 4. Check Hatchet Server Logs
```bash
docker logs reflector-hatchet-1 --tail 50
```
Look for `WRN` entries indicating API errors or connection issues.
### 5. Verify gRPC Connectivity
```python
docker exec worker python -c "
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('hatchet', 7077))
print(f'gRPC port 7077: {\"reachable\" if result == 0 else \"blocked\"}')"
```
### 6. Force Container Rebuild
Volume mounts may cache old bytecode:
```bash
docker compose up -d --build --force-recreate hatchet-worker
```
---
## Common Gotchas Summary
| Issue | Signal | Fix |
|-------|--------|-----|
| SDK API changed | `AttributeError` on result | Check SDK source for actual attributes |
| Worker appears stuck | Only "starting runner..." | Enable `HATCHET_DEBUG=true` |
| Can't connect from Docker | gRPC unavailable | Set `HATCHET_CLIENT_HOST_PORT` and `_SERVER_URL` |
| Workflow not found | 400 Bad Request | Use exact case-sensitive workflow name |
| Tuple iteration error | `'tuple' has no attribute` | Access `.data` on Pydantic response models |
| DB conflicts | "another operation in progress" | Fresh DB connection per task |
| Old code running | Fixed code but same error | Force rebuild container, clear `__pycache__` |
---
## Files Most Likely to Need Hatchet-Specific Handling
- `server/reflector/hatchet/workflows/*.py` - Workflow and task definitions
- `server/reflector/hatchet/client.py` - Client wrapper, SDK version compatibility
- `server/reflector/hatchet/run_workers.py` - Worker startup and registration
- `server/reflector/hatchet/progress.py` - Progress emission for UI updates
- `docker-compose.yml` - Hatchet infrastructure services