feat: local LLM via Ollama + structured output response_format

- Add setup script (scripts/setup-local-llm.sh) for one-command Ollama setup Mac: native Metal GPU, Linux: containerized via docker-compose profiles - Add ollama-gpu and ollama-cpu docker-compose profiles for Linux - Add extra_hosts to server/hatchet-worker-llm for host.docker.internal - Pass response_format JSON schema in StructuredOutputWorkflow.extract() enabling grammar-based constrained decoding on Ollama/llama.cpp/vLLM/OpenAI - Update .env.example with Ollama as default LLM option - Add Ollama PRD and local dev setup docs
2026-05-13 14:35:18 +00:00 · 2026-02-10 15:55:21 -05:00
parent cd2255cfbc
commit 663345ece6
7 changed files with 653 additions and 7 deletions
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -11,6 +11,8 @@ services:
      - ./server/.env
    environment:
      ENTRYPOINT: server
+    extra_hosts:
+      - "host.docker.internal:host-gateway"

  worker:
    build:
@@ -57,6 +59,8 @@ services:
      - ./server/.env
    environment:
      ENTRYPOINT: hatchet-worker-llm
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
    depends_on:
      hatchet:
        condition: service_healthy
@@ -128,6 +132,44 @@ services:
      retries: 5
      start_period: 30s

+  ollama:
+    image: ollama/ollama:latest
+    profiles: ["ollama-gpu"]
+    ports:
+      - "11434:11434"
+    volumes:
+      - ollama_data:/root/.ollama
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+
+  ollama-cpu:
+    image: ollama/ollama:latest
+    profiles: ["ollama-cpu"]
+    ports:
+      - "11434:11434"
+    volumes:
+      - ollama_data:/root/.ollama
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+
+volumes:
+  ollama_data:
+
 networks:
  default:
    attachable: true
--- a/docs/01_ollama.prd.md
+++ b/docs/01_ollama.prd.md
@@ -0,0 +1,306 @@
+# PRD: Local LLM Inference for Reflector
+
+## Business Context
+
+Reflector currently uses a remote LLM endpoint (configurable via `LLM_URL`) for all post-transcription intelligence: topic detection, title generation, subject extraction, summarization, action item identification. The default model is `microsoft/phi-4`.
+
+**Goal**: Run all LLM inference locally on developer machines (and optionally in self-hosted production), eliminating dependence on external LLM API providers. Zero cloud LLM costs, full data privacy, offline-capable development. One setup script, then `docker compose up` works.
+
+---
+
+## Current Architecture
+
+### Single abstraction layer: `server/reflector/llm.py`
+
+All LLM calls go through one `LLM` class wrapping LlamaIndex's `OpenAILike` client.
+
+**Env variables** (`server/reflector/settings.py:73-84`):
+
+| Variable | Default | Purpose |
+|---|---|---|
+| `LLM_MODEL` | `microsoft/phi-4` | Model name |
+| `LLM_URL` | `None` (falls back to OpenAI) | Endpoint URL |
+| `LLM_API_KEY` | required | Auth key |
+| `LLM_CONTEXT_WINDOW` | `16000` | Token limit |
+| `LLM_PARSE_MAX_RETRIES` | `3` | JSON validation retries |
+| `LLM_STRUCTURED_RESPONSE_TIMEOUT` | `300` | Timeout (seconds) |
+
+### Call flow
+
+```
+Hatchet workflows / Legacy processors
+  -> LLM.get_response() or LLM.get_structured_response()
+    -> LlamaIndex TreeSummarize + StructuredOutputWorkflow
+      -> OpenAILike client (is_chat_model=True, is_function_calling_model=False)
+        -> LLM_URL endpoint (OpenAI-compatible API)
+```
+
+### LLM call inventory (per transcript, ~9-15 calls)
+
+| Task | Method | Pydantic Model | Input Size | Temp |
+|---|---|---|---|---|
+| Topic detection (per chunk) | `get_structured_response` | `TopicResponse` | ~500 words/chunk | 0.9 |
+| Title generation | `get_response` | plain string | topic titles list | 0.5 |
+| Subject extraction | `get_structured_response` | `SubjectsResponse` | full transcript | 0.4 |
+| Detailed summary (per subject) | `get_response` | plain string | full transcript | 0.4 |
+| Paragraph summary (per subject) | `get_response` | plain string | detailed summary | 0.4 |
+| Recap | `get_response` | plain string | combined summaries | 0.4 |
+| Action items | `get_structured_response` | `ActionItemsResponse` | full transcript | 0.4 |
+| Participants (optional) | `get_structured_response` | `ParticipantsResponse` | full transcript | 0.4 |
+| Transcription type (optional) | `get_structured_response` | `TranscriptionTypeResponse` | full transcript | 0.4 |
+
+### Structured output mechanism
+
+Two-step process in `StructuredOutputWorkflow`:
+1. `TreeSummarize.aget_response()` -- hierarchical summarization of long text
+2. `Settings.llm.acomplete()` -- formats analysis as JSON matching Pydantic schema
+
+Validation retry: on Pydantic parse failure, error message fed back to LLM, up to 3 retries. No function calling used -- pure JSON text parsing.
+
+### Key dependencies
+
+- `llama-index>=0.12.52`
+- `llama-index-llms-openai-like>=0.4.0`
+- No embeddings, no streaming, no vision
+
+### Concurrency
+
+- Hatchet rate limit: 10 concurrent LLM calls/sec (`hatchet/constants.py`)
+- LLM worker pool: 10 slots (`run_workers_llm.py`)
+- Fan-out: up to 20 concurrent topic chunk workflows
+
+---
+
+## Requirements
+
+### Must Have
+- Local LLM inference on developer Mac (M-series Apple Silicon) with Metal GPU
+- Local LLM inference on Linux with NVIDIA GPU
+- OpenAI-compatible API endpoint (drop-in for `LLM_URL`)
+- Reliable JSON structured output (Pydantic schema compliance)
+- 16K+ context window
+- Works with existing `LLM` class -- config change only, no code rewrite
+- Model persistence across restarts
+- **No Docker Desktop dependency** -- must work with OrbStack, plain Docker Engine
+- **Single setup script** -- developer runs one command, then `docker compose up` works
+
+### Should Have
+- Docker Compose profile for Linux NVIDIA GPU (containerized Ollama)
+- Reasonable inference speed (>10 tok/s for chosen model)
+- Auto-pull model on first setup
+
+### Nice to Have
+- CPU-only fallback for CI/testing
+- Docker Compose profile for CPU-only Ollama
+
+---
+
+## Critical Mac Constraint
+
+**Docker containers on macOS cannot access Apple Silicon GPU.** This applies to Docker Desktop, OrbStack, and all other Mac container runtimes. Ollama in Docker on Mac is CPU-only (~5-6x slower than native Metal).
+
+**Docker Model Runner (DMR)** bypasses this by running llama.cpp as a native host process, but it **requires Docker Desktop 4.41+** -- not available in OrbStack or plain Docker Engine. DMR is not a viable option for this project.
+
+**Solution**: Run Ollama natively on Mac (Metal GPU), run it containerized on Linux (NVIDIA GPU). A setup script handles the difference.
+
+### Performance (approximate, Q4_K_M quantization)
+
+| Model | Mac Native (Metal) | Docker on Mac (CPU) | Linux + RTX 4090 |
+|---|---|---|---|
+| 7B | 25-40 tok/s | 8-12 tok/s | 60-70 tok/s |
+| 14B | 25-40 tok/s (M3/M4 Pro) | 4-7 tok/s | 40-60 tok/s |
+
+---
+
+## Inference Engine: Ollama
+
+Ollama wins over alternatives for this project:
+- Built-in model management (`ollama pull`)
+- OpenAI-compatible API at `/v1/chat/completions` (drop-in for `LLM_URL`)
+- Native Mac Metal GPU support
+- Official Docker image with NVIDIA GPU support on Linux
+- `json_schema` response format support (grammar-based constrained decoding via llama.cpp)
+- MIT license, mature, widely adopted
+
+Other engines (vLLM, llama.cpp direct, LocalAI) either lack Mac GPU support in Docker, require manual model management, or add unnecessary complexity. The `LLM_URL` env var already accepts any OpenAI-compatible endpoint -- developers who prefer another engine can point at it manually.
+
+---
+
+## Model Comparison (for Structured Output)
+
+| Model | Params | RAM (Q4) | JSON Quality | Notes |
+|---|---|---|---|---|
+| **Qwen 2.5 14B** | 14B | ~10 GB | Excellent | Explicitly optimized for JSON. Best open-source at this size. |
+| **Qwen 3 8B** | 8B | ~7 GB | Excellent | Outperforms Qwen 2.5 14B on 15 benchmarks. Lighter. |
+| **Qwen 2.5 7B** | 7B | ~6 GB | Very good | Good if RAM constrained. |
+| Phi-4 | 14B | ~10 GB | Good | Current default. Not optimized for JSON specifically. |
+| Llama 3.1 8B | 8B | ~6 GB | Good | Higher JSON parser errors than Qwen. |
+| Mistral Small 3 | 24B | ~16 GB | Very good | Apache 2.0. Needs 32GB+ machine. |
+
+**Recommendation**: **Qwen 2.5 14B** (quality) or **Qwen 3 8B** (lighter, nearly same quality). Both outperform the current `phi-4` default for structured output tasks.
+
+---
+
+## Proposed Architecture
+
+### Hybrid: Native Ollama on Mac, Containerized Ollama on Linux
+
+```
+Mac developer:
+  ┌────────────────────┐
+  │ Native Ollama      │ ◄── Metal GPU, :11434
+  │ (host process)     │
+  └────────┬───────────┘
+           │ host.docker.internal:11434
+  ┌────────┴───────────────────────────────────┐
+  │ Docker (OrbStack / Docker Engine)          │
+  │ postgres, redis, hatchet, server,          │
+  │ hatchet-worker-cpu, hatchet-worker-llm     │
+  │   LLM_URL=http://host.docker.internal:11434/v1  │
+  └────────────────────────────────────────────┘
+
+Linux server (--profile ollama-gpu):
+  ┌────────────────────────────────────────────┐
+  │ Docker Engine                              │
+  │ ┌───────────────┐                          │
+  │ │ ollama        │ ◄── NVIDIA GPU, :11434   │
+  │ │ (container)   │                          │
+  │ └───────────────┘                          │
+  │ postgres, redis, hatchet, server,          │
+  │ hatchet-worker-cpu, hatchet-worker-llm     │
+  │   LLM_URL=http://ollama:11434/v1          │
+  └────────────────────────────────────────────┘
+```
+
+### How it works
+
+1. **Setup script** (`scripts/setup-local-llm.sh`): detects OS, installs/starts Ollama, pulls model, writes `.env` vars
+2. **Docker Compose profiles**: `ollama-gpu` (Linux+NVIDIA), `ollama-cpu` (Linux CPU-only). No profile on Mac (native Ollama).
+3. **`extra_hosts`** on `hatchet-worker-llm`: maps `host.docker.internal` so containers can reach host Ollama on Mac
+4. **.env**: `LLM_URL` defaults to `http://host.docker.internal:11434/v1` (works on Mac); overridden to `http://ollama:11434/v1` on Linux with profile
+
+### .env changes
+
+```bash
+# Local LLM via Ollama
+# Setup: ./scripts/setup-local-llm.sh
+LLM_URL=http://host.docker.internal:11434/v1
+LLM_MODEL=qwen2.5:14b
+LLM_API_KEY=not-needed
+LLM_CONTEXT_WINDOW=16000
+```
+
+### Docker Compose additions
+
+```yaml
+services:
+  ollama:
+    image: ollama/ollama:latest
+    profiles: ["ollama-gpu"]
+    ports:
+      - "11434:11434"
+    volumes:
+      - ollama_data:/root/.ollama
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+
+  ollama-cpu:
+    image: ollama/ollama:latest
+    profiles: ["ollama-cpu"]
+    ports:
+      - "11434:11434"
+    volumes:
+      - ollama_data:/root/.ollama
+    restart: unless-stopped
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+
+  hatchet-worker-llm:
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
+
+volumes:
+  ollama_data:
+```
+
+### Known gotchas
+
+1. **OrbStack `host.docker.internal`**: OrbStack uses `host.internal` by default, but also supports `host.docker.internal` with `extra_hosts: host-gateway`.
+2. **Linux `host.docker.internal`**: requires `extra_hosts: - "host.docker.internal:host-gateway"` since Docker Engine doesn't add it automatically.
+3. **Ollama binding on Linux**: if running natively (not in container), must use `OLLAMA_HOST=0.0.0.0` so containers can reach it via bridge IP.
+4. **Cold start**: Ollama loads model on first request (~5-10s). Unloads after 5min idle. Set `OLLAMA_KEEP_ALIVE=-1` to keep loaded.
+5. **Concurrent requests**: Ollama queues requests to single llama.cpp instance. With 10 Hatchet LLM worker slots, expect heavy queuing. Reduce for local dev.
+
+---
+
+## Risk Assessment
+
+### High risk: Structured output reliability
+
+Local models may produce malformed JSON more often. Current retry mechanism (3 attempts) assumes the model can self-correct.
+
+**Mitigation**: Qwen 2.5 is explicitly optimized for JSON. Ollama supports `response_format: {type: "json_schema"}` for grammar-based constrained decoding, forcing valid JSON at the token level. `response_format` is now passed in `StructuredOutputWorkflow.extract()` (Task 2, already implemented). Retry mechanism still functions as fallback.
+
+**Resolved**: `OpenAILike.acomplete()` does pass `response_format` through to the HTTP request (verified via code inspection and tests).
+
+### Medium risk: Performance for fan-out workflows
+
+~18 LLM calls per transcript at ~3-5s each locally = ~60-90s total (vs ~10-20s cloud). Acceptable for background processing.
+
+**Mitigation**: Reduce Hatchet concurrency for local dev. Use smaller model (Qwen 2.5 7B or Qwen 3 8B) for faster iteration.
+
+### Low risk: Model quality degradation
+
+Qwen 2.5 14B benchmarks competitively with GPT-4o-mini for summarization/extraction. Sufficient for meeting transcript analysis.
+
+---
+
+## Open Questions
+
+1. **Model choice: Qwen 2.5 14B vs Qwen 3 8B.** Qwen 3 8B reportedly outperforms Qwen 2.5 14B on many benchmarks and needs less RAM. Need to test structured output quality on our specific prompts.
+
+2. **RAM allocation on Mac.** 14B Q4 = ~10 GB for weights + KV cache. On 16GB Mac, limited headroom for Docker VM + services. 32GB+ recommended. 7B/8B model may be necessary for 16GB machines.
+
+3. **Ollama concurrent request handling.** With 10 Hatchet LLM worker slots making parallel requests, expect heavy queuing. Need to benchmark and likely reduce `LLM_RATE_LIMIT_PER_SECOND` and worker slots for local dev.
+
+4. **TreeSummarize behavior with local models.** Multi-step hierarchical reduction may be significantly slower with local inference. Need to measure.
+
+---
+
+## Implementation Phases
+
+### Phase 1: Setup script + Docker Compose integration
+- Create `scripts/setup-local-llm.sh` that detects OS, ensures Ollama, pulls model, writes env vars
+- Add Ollama services to `docker-compose.yml` with profiles (`ollama-gpu`, `ollama-cpu`)
+- Add `extra_hosts` to `hatchet-worker-llm` for host Ollama access
+- Update `server/.env.example` with Ollama defaults
+
+### Phase 2: Grammar-based structured output (DONE)
+- Pass `response_format` with Pydantic JSON schema in `StructuredOutputWorkflow.extract()`
+- Verified: `OpenAILike.acomplete()` passes `response_format` through
+- Tests added and passing
+
+### Phase 3: Validate end-to-end
+- Process test transcript against local Ollama
+- Verify structured output (topics, summaries, titles, participants)
+- Measure latency per LLM call type
+- Compare quality with remote endpoint
+
+### Phase 4: Tune for local performance
+- Adjust Hatchet rate limits / worker slots for local inference speed
+- Benchmark and document expected processing times
+- Test with different model sizes (7B vs 14B)
--- a/docs/docs/installation/local-dev-setup.md
+++ b/docs/docs/installation/local-dev-setup.md
@@ -0,0 +1,94 @@
+---
+sidebar_position: 2
+title: Local Development Setup
+---
+
+# Local Development Setup
+
+**The goal**: a clueless user clones the repo, runs one script, and has a working Reflector instance locally. No cloud accounts, no API keys, no manual env file editing.
+
+```bash
+git clone https://github.com/monadical-sas/reflector.git
+cd reflector
+./scripts/setup-local-dev.sh
+```
+
+The script is idempotent — safe to re-run at any time. It detects what's already set up and skips completed steps.
+
+## Prerequisites
+
+- Docker / OrbStack / Docker Desktop (any)
+- Mac (Apple Silicon) or Linux
+- 16GB+ RAM (32GB recommended for 14B LLM models)
+
+## What the script does
+
+### 1. LLM inference via Ollama (implemented)
+
+**Mac**: starts Ollama natively (Metal GPU acceleration). Pulls the LLM model. Docker containers reach it via `host.docker.internal:11434`.
+
+**Linux**: starts containerized Ollama via docker-compose profile (`ollama-gpu` with NVIDIA, `ollama-cpu` without). Pulls model inside the container.
+
+Configures `server/.env`:
+```
+LLM_URL=http://host.docker.internal:11434/v1
+LLM_MODEL=qwen2.5:14b
+LLM_API_KEY=not-needed
+```
+
+The current standalone script for this step is `scripts/setup-local-llm.sh`. It will be folded into the unified `setup-local-dev.sh` once the other steps are implemented.
+
+See [Ollama PRD](../../01_ollama.prd.md) for architecture, why Ollama over Docker Model Runner, and model comparison.
+
+### 2. Environment files
+
+The script would copy `.env` templates if not present and fill defaults suitable for local dev (localhost postgres, redis, no auth, etc.).
+
+> The exact set of env defaults and whether the script patches an existing `.env` or only creates from template has not been decided yet. A follow-up research pass can determine what's safe to auto-fill vs. what needs user input.
+
+### 3. Transcript storage
+
+Production uses AWS S3. Local dev needs an alternative.
+
+> Options include MinIO in docker-compose (S3-compatible, zero config), a filesystem-backed storage backend (if one exists in the codebase), or skipping storage for dev if the pipeline can function without it. This depends on what `TRANSCRIPT_STORAGE_BACKEND` supports beyond `aws` — needs investigation.
+
+### 4. Transcription and diarization
+
+Production uses Modal.com (cloud GPU) or self-hosted GPU servers.
+
+> The codebase has a `TRANSCRIPT_BACKEND=whisper` option for local Whisper. Whether this runs acceptably on CPU for short dev recordings, and whether diarization has a local fallback, is unknown. For a minimal local setup, it may be sufficient to skip transcription and only test the LLM pipeline against already-transcribed data.
+
+### 5. Docker services
+
+```bash
+docker compose up -d postgres redis server hatchet hatchet-worker-cpu hatchet-worker-llm web
+```
+
+Frontend included in compose (`web` service). Everything comes up in one command.
+
+### 6. Database migrations
+
+```bash
+docker compose exec server uv run alembic upgrade head
+```
+
+Idempotent (alembic tracks applied migrations).
+
+### 7. Health check
+
+Verifies:
+- Server responds at `http://localhost:1250/health`
+- LLM endpoint reachable from inside containers
+- Frontend serves at `http://localhost:3000`
+
+## What's NOT covered
+
+These require external accounts and infrastructure that can't be scripted:
+
+- **Live meeting rooms** — requires Daily.co account, S3 bucket, IAM roles
+- **Authentication** — requires Authentik deployment and OAuth configuration
+- **Production deployment** — see [Deployment Guide](./overview)
+
+## Current status
+
+Step 1 (Ollama/LLM) is implemented and tested. Steps 2-7 need a separate research and implementation pass each.
--- a/scripts/setup-local-llm.sh
+++ b/scripts/setup-local-llm.sh
@@ -0,0 +1,100 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+MODEL="${LLM_MODEL:-qwen2.5:14b}"
+OLLAMA_PORT="${OLLAMA_PORT:-11434}"
+
+wait_for_ollama() {
+    local url="$1"
+    local retries=30
+    for i in $(seq 1 "$retries"); do
+        if curl -sf "$url/api/tags" > /dev/null 2>&1; then
+            return 0
+        fi
+        echo "  Waiting for Ollama... ($i/$retries)"
+        sleep 2
+    done
+    echo "ERROR: Ollama not responding at $url after $retries attempts"
+    return 1
+}
+
+OS="$(uname -s)"
+
+case "$OS" in
+    Darwin)
+        echo "macOS detected -- Ollama must run natively for Metal GPU acceleration."
+        echo ""
+
+        if ! command -v ollama &> /dev/null; then
+            echo "Ollama not found. Install it first:"
+            echo "  brew install ollama"
+            echo "  # or download from https://ollama.com/download"
+            exit 1
+        fi
+
+        # Start Ollama if not already running
+        if ! curl -sf "http://localhost:$OLLAMA_PORT/api/tags" > /dev/null 2>&1; then
+            echo "Starting Ollama..."
+            ollama serve &
+            disown
+        else
+            echo "Ollama already running."
+        fi
+
+        wait_for_ollama "http://localhost:$OLLAMA_PORT"
+
+        echo "Pulling model $MODEL..."
+        ollama pull "$MODEL"
+
+        echo ""
+        echo "Done. Add to server/.env:"
+        echo "  LLM_URL=http://host.docker.internal:$OLLAMA_PORT/v1"
+        echo "  LLM_MODEL=$MODEL"
+        echo "  LLM_API_KEY=not-needed"
+        echo ""
+        echo "Then: docker compose up -d"
+        ;;
+
+    Linux)
+        echo "Linux detected."
+        echo ""
+
+        if command -v nvidia-smi &> /dev/null && nvidia-smi > /dev/null 2>&1; then
+            echo "NVIDIA GPU detected -- using ollama-gpu profile."
+            PROFILE="ollama-gpu"
+            LLM_URL="http://ollama:$OLLAMA_PORT/v1"
+        else
+            echo "No NVIDIA GPU -- using ollama-cpu profile."
+            PROFILE="ollama-cpu"
+            LLM_URL="http://ollama-cpu:$OLLAMA_PORT/v1"
+        fi
+
+        echo "Starting Ollama container..."
+        docker compose --profile "$PROFILE" up -d
+
+        # Determine container name
+        if [ "$PROFILE" = "ollama-gpu" ]; then
+            SVC="ollama"
+        else
+            SVC="ollama-cpu"
+        fi
+
+        wait_for_ollama "http://localhost:$OLLAMA_PORT"
+
+        echo "Pulling model $MODEL..."
+        docker compose exec "$SVC" ollama pull "$MODEL"
+
+        echo ""
+        echo "Done. Add to server/.env:"
+        echo "  LLM_URL=$LLM_URL"
+        echo "  LLM_MODEL=$MODEL"
+        echo "  LLM_API_KEY=not-needed"
+        echo ""
+        echo "Then: docker compose --profile $PROFILE up -d"
+        ;;
+
+    *)
+        echo "Unsupported OS: $OS"
+        exit 1
+        ;;
+esac
--- a/server/.env.example
+++ b/server/.env.example
@@ -66,15 +66,22 @@ TRANSLATE_URL=https://monadical-sas--reflector-translator-web.modal.run
 ## LLM backend (Required)
 ##
 ## Responsible for generating titles, summaries, and topic detection
-## Requires OpenAI API key
+## Supports any OpenAI-compatible endpoint.
 ## =======================================================

-## OpenAI API key - get from https://platform.openai.com/account/api-keys
-LLM_API_KEY=sk-your-openai-api-key
-LLM_MODEL=gpt-4o-mini
+## --- Option A: Local LLM via Ollama (recommended for dev) ---
+## Setup: ./scripts/setup-local-llm.sh
+## Mac: Ollama runs natively (Metal GPU). Containers reach it via host.docker.internal.
+## Linux: docker compose --profile ollama-gpu up -d (or ollama-cpu for no GPU)
+LLM_URL=http://host.docker.internal:11434/v1
+LLM_MODEL=qwen2.5:14b
+LLM_API_KEY=not-needed
+## Linux with containerized Ollama: LLM_URL=http://ollama:11434/v1

-## Optional: Custom endpoint (defaults to OpenAI)
-# LLM_URL=https://api.openai.com/v1
+## --- Option B: Remote/cloud LLM ---
+#LLM_API_KEY=sk-your-openai-api-key
+#LLM_MODEL=gpt-4o-mini
+## LLM_URL defaults to OpenAI when unset

 ## Context size for summary generation (tokens)
 LLM_CONTEXT_WINDOW=16000
--- a/server/reflector/llm.py
+++ b/server/reflector/llm.py
@@ -144,7 +144,18 @@ class StructuredOutputWorkflow(Workflow, Generic[OutputT]):
        )

        # Network retries handled by OpenAILike (max_retries=3)
-        response = await Settings.llm.acomplete(json_prompt)
+        # response_format enables grammar-based constrained decoding on backends
+        # that support it (DMR/llama.cpp, vLLM, Ollama, OpenAI).
+        response = await Settings.llm.acomplete(
+            json_prompt,
+            response_format={
+                "type": "json_schema",
+                "json_schema": {
+                    "name": self.output_cls.__name__,
+                    "schema": self.output_cls.model_json_schema(),
+                },
+            },
+        )
        return ExtractionDone(output=response.text)

    @step
--- a/server/tests/test_llm_retry.py
+++ b/server/tests/test_llm_retry.py
@@ -286,6 +286,92 @@ class TestStructuredOutputWorkflow:
            assert mock_settings.llm.acomplete.call_count == 2


+class TestResponseFormat:
+    """Test that response_format with JSON schema is passed to acomplete"""
+
+    @pytest.mark.asyncio
+    async def test_acomplete_called_with_response_format(self):
+        """acomplete() should receive response_format containing Pydantic JSON schema"""
+        workflow = StructuredOutputWorkflow(
+            output_cls=TestResponse,
+            max_retries=3,
+            timeout=30,
+        )
+
+        with (
+            patch("reflector.llm.TreeSummarize") as mock_summarize,
+            patch("reflector.llm.Settings") as mock_settings,
+        ):
+            mock_summarizer = MagicMock()
+            mock_summarize.return_value = mock_summarizer
+            mock_summarizer.aget_response = AsyncMock(return_value="Some analysis")
+
+            mock_settings.llm.acomplete = AsyncMock(
+                return_value=make_completion_response(
+                    '{"title": "Test", "summary": "Summary", "confidence": 0.95}'
+                )
+            )
+
+            result = await workflow.run(
+                prompt="Extract data",
+                texts=["Some text"],
+                tone_name=None,
+            )
+
+            assert "success" in result
+
+            # Verify response_format was passed
+            call_kwargs = mock_settings.llm.acomplete.call_args
+            assert "response_format" in call_kwargs.kwargs
+            rf = call_kwargs.kwargs["response_format"]
+            assert rf["type"] == "json_schema"
+            assert rf["json_schema"]["name"] == "TestResponse"
+            assert rf["json_schema"]["schema"] == TestResponse.model_json_schema()
+
+    @pytest.mark.asyncio
+    async def test_response_format_present_on_retry(self):
+        """response_format should be passed on retry attempts too"""
+        workflow = StructuredOutputWorkflow(
+            output_cls=TestResponse,
+            max_retries=3,
+            timeout=30,
+        )
+
+        with (
+            patch("reflector.llm.TreeSummarize") as mock_summarize,
+            patch("reflector.llm.Settings") as mock_settings,
+        ):
+            mock_summarizer = MagicMock()
+            mock_summarize.return_value = mock_summarizer
+            mock_summarizer.aget_response = AsyncMock(return_value="Some analysis")
+
+            call_count = {"count": 0}
+
+            async def acomplete_handler(*args, **kwargs):
+                call_count["count"] += 1
+                if call_count["count"] == 1:
+                    return make_completion_response('{"title": "Only title"}')
+                return make_completion_response(
+                    '{"title": "Test", "summary": "Summary", "confidence": 0.9}'
+                )
+
+            mock_settings.llm.acomplete = AsyncMock(side_effect=acomplete_handler)
+
+            result = await workflow.run(
+                prompt="Extract data",
+                texts=["Some text"],
+                tone_name=None,
+            )
+
+            assert "success" in result
+            assert call_count["count"] == 2
+
+            # Both calls should have response_format
+            for call in mock_settings.llm.acomplete.call_args_list:
+                assert "response_format" in call.kwargs
+                assert call.kwargs["response_format"]["type"] == "json_schema"
+
+
 class TestNetworkErrorRetries:
    """Test that network error retries are handled by OpenAILike, not Workflow"""