Update docs and fix LLM JSON parsing

- Use load_dotenv(".env") explicitly in all doc examples
- Move pydantic imports (BaseModel, Field) to setup cell in all examples
- Add separate display cell pattern for DataFrame inspection
- Fix LLM control character error: sanitize JSON before Pydantic parsing
- Remove debug print from llm.py
2026-02-10 19:45:04 -06:00
parent 8eb1fb87a7
commit 46dfebd05f
2 changed files with 50 additions and 14 deletions


@@ -137,13 +137,12 @@ If you write `await` in a non-async cell, marimo cannot parse the cell and saves
If a cell defines Pydantic models (or any class) that other cells need, it **must** return them: If a cell defines Pydantic models (or any class) that other cells need, it **must** return them:
```python ```python
# BaseModel and Field are imported in the setup cell and received as parameters
@app.cell @app.cell
def models(): def models(BaseModel, Field):
from pydantic import BaseModel
class MeetingSentiment(BaseModel): class MeetingSentiment(BaseModel):
overall_sentiment: str overall_sentiment: str
sentiment_score: int sentiment_score: int = Field(description="Score from -10 to +10")
class FrustrationExtraction(BaseModel): class FrustrationExtraction(BaseModel):
has_frustrations: bool has_frustrations: bool
@@ -291,14 +290,15 @@ def config():
@app.cell @app.cell
def setup(): def setup():
from dotenv import load_dotenv from dotenv import load_dotenv
load_dotenv() # Load .env from the project root load_dotenv(".env") # Load .env from the project root
import asyncio # All imports go here — never import inside other cells import asyncio # All imports go here — never import inside other cells
import httpx import httpx
import marimo as mo import marimo as mo
import polars as pl import polars as pl
from pydantic import BaseModel, Field
client = httpx.Client(timeout=30) client = httpx.Client(timeout=30)
return (asyncio, client, mo, pl,) return (asyncio, client, mo, pl, BaseModel, Field,)
# --- your IN / ETL / OUT cells here --- # --- your IN / ETL / OUT cells here ---
@@ -306,7 +306,7 @@ if __name__ == "__main__":
app.run() app.run()
``` ```
> **`load_dotenv()`** reads the `.env` file from the project root (walks up from the notebook's directory). This makes `LLM_API_KEY` and other env vars available to `os.getenv()` calls in `lib/llm.py` without requiring the shell to have them pre-set. Always include `python-dotenv` in PEP 723 dependencies and call `load_dotenv()` early in the setup cell. > **`load_dotenv(".env")`** reads the `.env` file explicitly by name. This makes `LLM_API_KEY` and other env vars available to `os.getenv()` calls in `lib/llm.py` without requiring the shell to have them pre-set. Always include `python-dotenv` in PEP 723 dependencies and call `load_dotenv(".env")` early in the setup cell.
**The `params` cell must always be the first cell** after `app = marimo.App()`. It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values. This way the user can tweak the workflow by editing a single cell at the top — no need to hunt through the code for hardcoded values. **The `params` cell must always be the first cell** after `app = marimo.App()`. It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values. This way the user can tweak the workflow by editing a single cell at the top — no need to hunt through the code for hardcoded values.
@@ -581,15 +581,16 @@ When you need to classify, score, or extract structured information from each en
@app.cell @app.cell
def setup(): def setup():
from dotenv import load_dotenv from dotenv import load_dotenv
load_dotenv() # Makes LLM_API_KEY available to lib/llm.py load_dotenv(".env") # Makes LLM_API_KEY available to lib/llm.py
import asyncio import asyncio
import httpx import httpx
import marimo as mo import marimo as mo
import polars as pl import polars as pl
from pydantic import BaseModel, Field
from lib.llm import llm_call from lib.llm import llm_call
client = httpx.Client(timeout=30) client = httpx.Client(timeout=30)
return (asyncio, client, llm_call, mo, pl,) return (asyncio, client, llm_call, mo, pl, BaseModel, Field,)
``` ```
### Define a response model ### Define a response model
@@ -598,8 +599,7 @@ Create a Pydantic model that describes the structured output you want from the L
```python ```python
@app.cell @app.cell
def models(): def models(BaseModel, Field):
from pydantic import BaseModel
class RelevanceScore(BaseModel): class RelevanceScore(BaseModel):
relevant: bool relevant: bool
@@ -653,11 +653,12 @@ When generating marimo notebooks, follow these rules strictly. Violations cause
- **Return only values other cells need** — everything else should be `_`-prefixed and stays private to the cell. - **Return only values other cells need** — everything else should be `_`-prefixed and stays private to the cell.
- **Import stdlib modules in `setup` too** — even `from datetime import datetime` creates a top-level name. If two cells both import `datetime`, marimo errors. Import it once in `setup` and receive it as a parameter, or use it inside a `_`-prefixed helper function where it's naturally scoped. - **Import stdlib modules in `setup` too** — even `from datetime import datetime` creates a top-level name. If two cells both import `datetime`, marimo errors. Import it once in `setup` and receive it as a parameter, or use it inside a `_`-prefixed helper function where it's naturally scoped.
- **Every non-utility cell must show a preview** — see the "Cell Output Previews" section below. - **Every non-utility cell must show a preview** — see the "Cell Output Previews" section below.
- **Use separate display cells for DataFrames** — the build cell returns the DataFrame and shows a `mo.md()` count/heading; a standalone display cell (e.g., `def show_table(df): df`) renders it as an interactive table the user can sort and filter.
- **Keep cell output expressions at the top level** — if a cell conditionally displays a DataFrame, initialize `_output = None` before the `if`/`else`, assign inside the branches, then put `_output` as the last top-level expression. Expressions inside `if`/`else`/`for` blocks are silently ignored by marimo. - **Keep cell output expressions at the top level** — if a cell conditionally displays a DataFrame, initialize `_output = None` before the `if`/`else`, assign inside the branches, then put `_output` as the last top-level expression. Expressions inside `if`/`else`/`for` blocks are silently ignored by marimo.
- **Put all user parameters in a `params` cell as the first cell** — date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook. - **Put all user parameters in a `params` cell as the first cell** — date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook.
- **Declare cells as `async def` when using `await`** — `@app.cell` followed by `async def cell_name(...)`. This includes cells using `asyncio.gather`, `await llm_call(...)`, or any async API. - **Declare cells as `async def` when using `await`** — `@app.cell` followed by `async def cell_name(...)`. This includes cells using `asyncio.gather`, `await llm_call(...)`, or any async API.
- **Return classes/models from cells that define them** — if a cell defines `class MyModel(BaseModel)`, return it so other cells can use it as a parameter: `return (MyModel,)`. - **Return classes/models from cells that define them** — if a cell defines `class MyModel(BaseModel)`, return it so other cells can use it as a parameter: `return (MyModel,)`.
- **Use `python-dotenv` to load `.env`** — add `python-dotenv` to PEP 723 dependencies and call `load_dotenv()` early in the setup cell (before importing `lib.llm`). This ensures `LLM_API_KEY` and other env vars are available without requiring them to be pre-set in the shell. - **Use `python-dotenv` to load `.env`** — add `python-dotenv` to PEP 723 dependencies and call `load_dotenv(".env")` early in the setup cell (before importing `lib.llm`). This ensures `LLM_API_KEY` and other env vars are available without requiring them to be pre-set in the shell.
### Don't ### Don't
@@ -724,7 +725,7 @@ def build_table(meetings, pl):
return (meeting_df,) return (meeting_df,)
``` ```
**Good** - DataFrame is the last expression, so marimo renders it as an interactive table: **Good** - the build cell shows a `mo.md()` count, and a **separate display cell** renders the DataFrame as an interactive table:
```python ```python
@app.cell @app.cell
@@ -739,6 +740,27 @@ def show_meeting_table(meeting_df):
meeting_df # Renders as interactive sortable table meeting_df # Renders as interactive sortable table
``` ```
### Separate display cells for DataFrames
When a cell builds a DataFrame, use **two cells**: one that builds and returns it (with a `mo.md()` summary), and a standalone display cell that renders it as a table. This keeps the build logic clean and gives the user an interactive table they can sort and filter in the marimo UI.
```python
# Cell 1: build and return the DataFrame, show a count
@app.cell
def build_sentiment_table(analyzed_meetings, pl, mo):
_rows = [...]
sentiment_df = pl.DataFrame(_rows).sort("date", descending=True)
mo.md(f"### Sentiment Analysis ({len(sentiment_df)} meetings)")
return (sentiment_df,)
# Cell 2: standalone display — just the DataFrame, nothing else
@app.cell
def show_sentiment_table(sentiment_df):
sentiment_df
```
This pattern makes every result inspectable. The `mo.md()` cell gives a quick count/heading; the display cell lets the user explore the full data interactively.
### Utility cells (no preview needed) ### Utility cells (no preview needed)
Config, setup, and helper cells that only define constants or functions don't need previews: Config, setup, and helper cells that only define constants or functions don't need previews:


@@ -1,6 +1,7 @@
"""Simple LLM helper for workbooks using Mirascope v2.""" """Simple LLM helper for workbooks using Mirascope v2."""
import os import os
import re
from typing import TypeVar from typing import TypeVar
from mirascope import llm from mirascope import llm
@@ -23,6 +24,15 @@ llm.register_provider(
) )
def _sanitize_json(text: str) -> str:
"""Strip control characters (U+0000 to U+001F) that break JSON parsing.
Some LLMs emit literal newlines/tabs inside JSON string values,
which is invalid per the JSON spec. Replace them with spaces.
"""
return re.sub(r"[\x00-\x1f]+", " ", text)
async def llm_call( async def llm_call(
prompt: str, prompt: str,
response_model: type[T], response_model: type[T],
@@ -47,4 +57,8 @@ async def llm_call(
return f"{system_prompt}\n\n{prompt}" return f"{system_prompt}\n\n{prompt}"
response = await _call() response = await _call()
return response.parse() try:
return response.parse()
except Exception:
# Fallback: sanitize control characters and parse manually
return response_model.model_validate_json(_sanitize_json(response.content))
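
The fallback above can be illustrated with a minimal, standalone sketch. It uses the standard-library `json` module as a stand-in for the Pydantic parse, on the assumption that both reject raw control characters inside string values. The `raw` payload is a hypothetical LLM response, and `sanitize_json` mirrors the `_sanitize_json` helper added in this commit.

```python
import json
import re


def sanitize_json(text: str) -> str:
    """Replace each run of control characters (U+0000 to U+001F) with one space."""
    return re.sub(r"[\x00-\x1f]+", " ", text)


# Hypothetical LLM output: a literal newline sits inside the string value,
# which strict JSON parsers reject.
raw = '{"summary": "line one\nline two", "score": 3}'

try:
    json.loads(raw)
    strict_parse_ok = True
except json.JSONDecodeError:
    strict_parse_ok = False

# After sanitizing, the same payload parses; the newline has become a space.
parsed = json.loads(sanitize_json(raw))

# Properly escaped sequences (a backslash followed by "n") are not control
# characters, so the sanitizer leaves them untouched.
escaped = '{"summary": "line one\\nline two"}'
unchanged = sanitize_json(escaped) == escaped
```

One trade-off of this approach: line breaks that the model intended inside a string value are flattened to spaces, so multi-line text fields lose their formatting when the fallback path is taken.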