---
name: notebook-patterns
description: Marimo notebook patterns for InternalAI data analysis. Use when creating or editing marimo notebooks — covers cell scoping, async cells, pagination helpers, analysis patterns, and do/don't rules.
user-invocable: false
---

# Marimo Notebook Patterns

This guide covers how to create [marimo](https://marimo.io) notebooks for data analysis against the InternalAI platform APIs. Marimo notebooks are plain `.py` files with reactive cells — no `.ipynb` format, no Jupyter dependency.

## Marimo Basics

A marimo notebook is a Python file with `@app.cell` decorated functions. Each cell returns values as a tuple, and other cells receive them as function parameters — marimo builds a reactive DAG automatically.

```python
import marimo
app = marimo.App()

@app.cell
def cell_one():
    x = 42
    return (x,)

@app.cell
def cell_two(x):
    # Re-runs automatically when x changes
    result = x * 2
    return (result,)
```

**Key rules:**
- Cells declare dependencies via function parameters
- Cells return values as tuples: `return (var1, var2,)`
- The **last expression at the top level** of a cell is displayed as rich output in the marimo UI (dataframes render as tables, dicts as collapsible trees). Expressions inside `if`/`else`/`for` blocks do **not** count — see [Cell Output Must Be at the Top Level](#cell-output-must-be-at-the-top-level) below
- Use `mo.md("# heading")` for formatted markdown output (import `mo` once in setup — see below)
- No manual execution order; the DAG determines it
- **Variable names must be unique across cells.** Every variable assigned at the top level of a cell is tracked by marimo's DAG. If two cells both define `resp`, marimo raises `MultipleDefinitionError` and refuses to run. Prefix cell-local variables with `_` (e.g., `_resp`, `_rows`, `_data`) to make them **private** to that cell — marimo ignores `_`-prefixed names.
- **All imports must go in the `setup` cell.** Every `import` statement creates a top-level variable (e.g., `import asyncio` defines `asyncio`). If two cells both `import asyncio`, marimo raises `MultipleDefinitionError`. Place **all** imports in a single setup cell and pass them as cell parameters. Do NOT `import marimo as mo` or `import asyncio` in multiple cells — import once in `setup`, then receive via `def my_cell(mo, asyncio):`.

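To build intuition for how parameters and return tuples can determine execution order, here is a toy scheduler in plain Python. It is only a sketch and not marimo's implementation: the `CELLS` registry and `run_cells` are hypothetical names, and the names each cell produces are declared by hand rather than discovered automatically.

```python
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def cell_one():
    x = 42
    return (x,)


def cell_two(x):
    result = x * 2
    return (result,)


# Hypothetical registry: each cell function and the names its return tuple defines.
# The order is deliberately "wrong" to show that the graph decides execution order.
CELLS = {cell_two: ("result",), cell_one: ("x",)}


def run_cells(cells):
    """Run cells in dependency order and return the resulting namespace."""
    producer = {name: fn for fn, names in cells.items() for name in names}
    # A cell depends on whichever cells produce its parameter names.
    graph = {
        fn: {producer[param] for param in inspect.signature(fn).parameters}
        for fn in cells
    }
    namespace = {}
    for fn in TopologicalSorter(graph).static_order():
        args = [namespace[param] for param in inspect.signature(fn).parameters]
        namespace.update(zip(cells[fn], fn(*args)))
    return namespace


namespace = run_cells(CELLS)
print(namespace)
```

Because every name has exactly one producer in this model, two cells defining the same name would make the graph ambiguous, which is the intuition behind the unique-name rule.
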
### Cell Variable Scoping — Example

This is the **most common mistake**. Any variable assigned at the top level of a cell (not inside a `def` or comprehension) is tracked by marimo. If two cells assign the same name, the notebook refuses to run.

**BROKEN** — `resp` is defined at top level in both cells:

```python
# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    resp = client.post(f"{DATAINDEX}/search", json={...})  # defines 'resp'
    resp.raise_for_status()
    results = resp.json()["results"]
    return (results,)

# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # also defines 'resp' → ERROR
    meeting = resp.json()
    return (meeting,)
```

> **Error:** `MultipleDefinitionError: variable 'resp' is defined in multiple cells`

**FIXED** — prefix cell-local variables with `_`:

```python
# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    _resp = client.post(f"{DATAINDEX}/search", json={...})  # _resp is cell-private
    _resp.raise_for_status()
    results = _resp.json()["results"]
    return (results,)

# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    _resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # _resp is cell-private, no conflict
    meeting = _resp.json()
    return (meeting,)
```

**Rule of thumb:** if a variable is only used within the cell to compute a return value, prefix it with `_`. Only leave names unprefixed if another cell needs to receive them.

> **Note:** Variables inside nested `def` functions are naturally local and don't need `_` prefixes — e.g., `resp` inside a `def fetch_all(...)` helper is fine because it's scoped to the function, not the cell.

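The collision rule can be approximated with a small static check. The sketch below is an assumption-laden illustration, not how marimo or `marimo check` works internally: `top_level_names` and `find_collisions` are hypothetical helpers, they take the raw body of each cell as a string, and they only look at direct top-level statements (names assigned inside a top-level `if` or `for` are not covered).

```python
import ast
from collections import defaultdict


def top_level_names(cell_source):
    """Return the public names defined by the top-level statements of a cell body."""
    names = set()
    for stmt in ast.parse(cell_source).body:
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            # `import a.b` binds `a`; `import x as y` binds `y`.
            names.update((alias.asname or alias.name).split(".")[0] for alias in stmt.names)
        elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(stmt.name)
        elif isinstance(stmt, (ast.Assign, ast.AnnAssign, ast.AugAssign)):
            targets = stmt.targets if isinstance(stmt, ast.Assign) else [stmt.target]
            for target in targets:
                names.update(
                    node.id
                    for node in ast.walk(target)
                    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
                )
    # `_`-prefixed names are treated as cell-private and never collide.
    return {name for name in names if not name.startswith("_")}


def find_collisions(cells):
    """Map each public name defined in more than one cell to the cells defining it."""
    owners = defaultdict(list)
    for cell_name, source in cells.items():
        for name in sorted(top_level_names(source)):
            owners[name].append(cell_name)
    return {name: who for name, who in owners.items() if len(who) > 1}


cells = {
    "search_meetings": 'resp = client.post(url)\nresults = resp.json()["results"]',
    "fetch_details": "resp = client.get(url)\nmeeting = resp.json()",
}
# Apply the fix from the example above: rename the cell-local variable to `_resp`.
fixed_cells = {name: src.replace("resp", "_resp") for name, src in cells.items()}

collisions = find_collisions(cells)
collisions_fixed = find_collisions(fixed_cells)
print(collisions, collisions_fixed)
```
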
### Cell Output Must Be at the Top Level

Marimo only renders the **last expression at the top level** of a cell as rich output. An expression buried inside an `if`/`else`, `for`, `try`, or any other block is **not** displayed — it's silently discarded.

**BROKEN** — `_df` inside the `if` branch is never rendered, and `mo.md()` inside `if`/`else` is also discarded:

```python
@app.cell
def show_results(results, mo, pl):
    if results:
        _df = pl.DataFrame(results)
        mo.md(f"**Found {len(results)} results**")
        _df  # Inside an if block — marimo does NOT display this
    else:
        mo.md("**No results found**")  # Also inside a block — NOT displayed
    return
```

**FIXED** — split into separate cells. Each cell displays exactly **one thing** at the top level:

```python
# Cell 1: build the data, return it
@app.cell
def build_results(results, pl):
    results_df = pl.DataFrame(results) if results else None
    return (results_df,)

# Cell 2: heading — mo.md() is the top-level expression (use ternary for conditional text)
@app.cell
def show_results_heading(results_df, mo):
    mo.md(f"**Found {len(results_df)} results**" if results_df is not None else "**No results found**")

# Cell 3: table — DataFrame is the top-level expression
@app.cell
def show_results_table(results_df):
    results_df  # Top-level expression — marimo renders this as interactive table
```

**Rules:**
- Each cell should display **one thing** — either `mo.md()` OR a DataFrame, never both
- `mo.md()` must be a **top-level expression**, not inside `if`/`else`/`for`/`try` blocks
- Build conditional text using variables or ternary expressions, then call `mo.md(_text)` at the top level
- For DataFrames, use a standalone display cell: `def show_table(df): df`

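One way to reason about the rule is to look at a cell body's syntax tree: output is only possible when the final top-level statement is a bare expression. The sketch below illustrates that idea with the standard `ast` module. `displayed_expression` is a hypothetical helper, the two cell bodies are only parsed (never executed), and marimo's actual detection may differ. The second body also shows the "assign in the branch, display at the top level" pattern using an `_output` variable.

```python
import ast


def displayed_expression(cell_source):
    """Return the source of the last top-level expression, or None if there is none."""
    body = ast.parse(cell_source).body
    if body and isinstance(body[-1], ast.Expr):
        return ast.unparse(body[-1])
    return None


# The final top-level statement is an `if`, so nothing would be displayed.
broken = '''
if results:
    _df = pl.DataFrame(results)
    _df
else:
    mo.md("**No results found**")
'''

# Assign inside the branch, then end the cell with a bare top-level expression.
fixed = '''
_output = mo.md("**No results found**")
if results:
    _output = pl.DataFrame(results)
_output
'''

print(displayed_expression(broken))
print(displayed_expression(fixed))
```
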
### Async Cells

When a cell uses `await` (e.g., for `llm_call` or `asyncio.gather`), you **must** declare it as `async def`:

```python
@app.cell
async def analyze(meetings, llm_call, ResponseModel, asyncio):
    async def _score(meeting):
        return await llm_call(prompt=..., response_model=ResponseModel)

    results = await asyncio.gather(*[_score(_m) for _m in meetings])
    return (results,)
```

Note that `asyncio` is imported in the `setup` cell and received here as a parameter — never `import asyncio` inside individual cells.

If you write `await` in a non-async cell, marimo cannot parse the cell and saves it as an `_unparsable_cell` string literal — the cell won't run, and you'll see `SyntaxError: 'return' outside function` or similar errors. See [Fixing `_unparsable_cell`](#fixing-_unparsable_cell) below.

### Cells That Define Classes Must Return Them

If a cell defines Pydantic models (or any class) that other cells need, it **must** return them:

```python
# BaseModel and Field are imported in the setup cell and received as parameters
@app.cell
def models(BaseModel, Field):
    class MeetingSentiment(BaseModel):
        overall_sentiment: str
        sentiment_score: int = Field(description="Score from -10 to +10")

    class FrustrationExtraction(BaseModel):
        has_frustrations: bool
        frustrations: list[dict]

    return MeetingSentiment, FrustrationExtraction  # Other cells receive these as parameters
```

A bare `return` (or no return) means those classes are invisible to the rest of the notebook.

### Fixing `_unparsable_cell`

When marimo can't parse a cell into a proper `@app.cell` function, it saves the raw code as `app._unparsable_cell("...", name="cell_name")`. These cells **won't run** and show errors like `SyntaxError: 'return' outside function`.

**Common causes:**
1. Using `await` without making the cell `async def`
2. Using `return` in code that marimo failed to wrap into a function (usually a side effect of cause 1)

**How to fix:** Convert the `_unparsable_cell` string back into a proper `@app.cell` decorated function:

```python
# BROKEN — saved as _unparsable_cell because of top-level await
app._unparsable_cell("""
results = await asyncio.gather(...)
return results
""", name="my_cell")

# FIXED — proper async cell function (asyncio imported in setup, received as parameter)
@app.cell
async def my_cell(some_dependency, asyncio):
    results = await asyncio.gather(...)
    return (results,)
```

**Key differences to note when converting:**
- Wrap the code in an `async def` function (if it uses `await`)
- Add cell dependencies as function parameters (including imports like `asyncio`)
- Return values as tuples: `return (var,)` not `return var`
- Prefix cell-local variables with `_`
- Never add `import` statements inside the cell — all imports belong in `setup`

### Inline Dependencies with PEP 723

Use PEP 723 `/// script` metadata so `uv run` auto-installs dependencies:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope[openai]",
#     "pydantic",
#     "python-dotenv",
# ]
# ///
```

### Checking Notebooks Before Running

Always run `marimo check` before opening or running a notebook. It catches common issues — duplicate variable definitions, `_unparsable_cell` blocks, branch expressions that won't display, and more — without needing to start the full editor:

```bash
uvx marimo check notebook.py          # Check a single notebook
uvx marimo check workflows/           # Check all notebooks in a directory
uvx marimo check --fix notebook.py    # Auto-fix fixable issues
```

**Run this after every edit.** A clean `marimo check` (no output, exit code 0) means the notebook is structurally valid. Any errors must be fixed before running.

### Running Notebooks

```bash
uvx marimo edit notebook.py    # Interactive editor (best for development)
uvx marimo run notebook.py     # Read-only web app
uv run notebook.py             # Script mode (terminal output)
```

### Inspecting Cell Outputs

In `marimo edit`, the last top-level expression of every cell is displayed as rich output below the cell. This is the primary way to introspect API responses:

- **Dicts/lists** render as collapsible JSON trees — click to expand nested fields
- **Polars/Pandas DataFrames** render as interactive sortable tables
- **Strings** render as plain text

To inspect a raw API response, just make it the last expression:

```python
@app.cell
def inspect_response(client, DATAINDEX):
    _resp = client.get(f"{DATAINDEX}/query", params={
        "entity_types": "meeting", "limit": 2,
    })
    _resp.json()  # This gets displayed as a collapsible JSON tree
```

To inspect an intermediate value alongside other output, wrap it in `mo.accordion` and combine the pieces with `mo.vstack` so the cell still ends in a single top-level expression:

```python
@app.cell
def debug_meetings(meetings, mo):
    # Only the last top-level expression is rendered, so stack both pieces into one output
    mo.vstack([
        mo.md(f"**Count:** {len(meetings)}"),
        # Show first item structure for inspection
        mo.accordion({"First meeting raw": mo.json(meetings[0])}) if meetings else mo.md("_No meetings_"),
    ])
```

## Notebook Skeleton

Every notebook against InternalAI follows this structure:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope[openai]",
#     "pydantic",
#     "python-dotenv",
# ]
# ///

import marimo
app = marimo.App()

@app.cell
def params():
    """User parameters — edit these to change the workflow's behavior."""
    SEARCH_TERMS = ["greyhaven"]
    DATE_FROM = "2026-01-01T00:00:00Z"
    DATE_TO = "2026-02-01T00:00:00Z"
    TARGET_PERSON = None  # Set to a name like "Alice" to filter by person, or None for all
    return DATE_FROM, DATE_TO, SEARCH_TERMS, TARGET_PERSON

@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return (CONTACTDB, DATAINDEX,)

@app.cell
def setup():
    from dotenv import load_dotenv
    load_dotenv(".env")  # Load .env from the project root

    import asyncio  # All imports go here — never import inside other cells
    import httpx
    import marimo as mo
    import polars as pl
    from pydantic import BaseModel, Field
    client = httpx.Client(timeout=30)
    return (asyncio, client, mo, pl, BaseModel, Field,)

# --- your IN / ETL / OUT cells here ---

if __name__ == "__main__":
    app.run()
```

> **`load_dotenv(".env")`** reads the `.env` file explicitly by name. This makes `LLM_API_KEY` and other env vars available to `os.getenv()` calls in `lib/llm.py` without requiring the shell to have them pre-set. Always include `python-dotenv` in PEP 723 dependencies and call `load_dotenv(".env")` early in the setup cell.

**The `params` cell must always be the first cell** after `app = marimo.App()`. It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values. This way the user can tweak the workflow by editing a single cell at the top — no need to hunt through the code for hardcoded values.

## Pagination Helper

The DataIndex `GET /query` endpoint paginates with `limit` and `offset`. Always paginate — result sets can be large.

```python
@app.cell
def helpers(client):
    def fetch_all(url, params):
        """Fetch all pages from a paginated DataIndex endpoint."""
        all_items = []
        limit = params.get("limit", 50)
        params = {**params, "limit": limit, "offset": 0}
        while True:
            resp = client.get(url, params=params)
            resp.raise_for_status()
            data = resp.json()
            all_items.extend(data["items"])
            if params["offset"] + limit >= data["total"]:
                break
            params["offset"] += limit
        return all_items

    def resolve_contact(name, contactdb_url):
        """Find a contact by name, return their ID."""
        resp = client.get(f"{contactdb_url}/api/contacts", params={"search": name})
        resp.raise_for_status()
        contacts = resp.json()["contacts"]
        if not contacts:
            raise ValueError(f"No contact found for '{name}'")
        return contacts[0]

    return (fetch_all, resolve_contact,)
```

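To see how the `limit`/`offset` loop terminates without a running server, the sketch below runs the same loop against an in-memory stand-in. `FakeClient` and `FakeResponse` are hypothetical test doubles, not part of httpx or the platform, and this `fetch_all` takes the client as an explicit argument so the block is self-contained. The `items`/`total` response shape is taken from the helper above.

```python
class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # the fake never fails

    def json(self):
        return self._payload


class FakeClient:
    """Hypothetical stand-in for an HTTP client that serves slices of a fixed list."""

    def __init__(self, items):
        self.items = items
        self.calls = []

    def get(self, url, params):
        self.calls.append(dict(params))  # copy, because the caller mutates params
        start = params["offset"]
        page = self.items[start:start + params["limit"]]
        return FakeResponse({"items": page, "total": len(self.items)})


def fetch_all(client, url, params):
    """Same loop as the notebook helper, with the client passed in explicitly."""
    all_items = []
    limit = params.get("limit", 50)
    params = {**params, "limit": limit, "offset": 0}
    while True:
        resp = client.get(url, params=params)
        resp.raise_for_status()
        data = resp.json()
        all_items.extend(data["items"])
        if params["offset"] + limit >= data["total"]:
            break
        params["offset"] += limit
    return all_items


client = FakeClient(list(range(120)))
items = fetch_all(client, "http://example.invalid/query", {"entity_types": "meeting"})
offsets = [call["offset"] for call in client.calls]
print(len(items), offsets)
```

With 120 items and the default page size of 50, the loop issues three requests and stops once `offset + limit` reaches `total`.
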
## Pattern 1: Emails Involving a Specific Person

Emails have `from_contact_id`, `to_contact_ids`, and `cc_contact_ids`. The query API's `contact_ids` filter matches entities where the contact appears in **any** of these roles.

```python
@app.cell
def find_person(resolve_contact, CONTACTDB):
    target = resolve_contact("Alice", CONTACTDB)
    target_id = target["id"]
    target_name = target["name"]
    return (target_id, target_name,)

@app.cell
def fetch_emails(fetch_all, DATAINDEX, target_id):
    emails = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "email",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_order": "desc",
    })
    return (emails,)

@app.cell
def email_table(emails, target_id, pl):
    email_df = pl.DataFrame([{
        "date": e["timestamp"][:10],
        "subject": e.get("title", "(no subject)"),
        # "sent" when the target is the sender; otherwise they were a to/cc recipient
        "direction": (
            "sent" if str(target_id) == str(e.get("from_contact_id"))
            else "received"
        ),
        "snippet": (e.get("snippet") or e.get("text_content") or "")[:100],
    } for e in emails])
    return (email_df,)

@app.cell
def show_emails(email_df, target_name, mo):
    mo.md(f"## Emails involving {target_name} ({len(email_df)} total)")

@app.cell
def display_email_table(email_df):
    email_df  # Renders as interactive table in marimo edit
```

## Pattern 2: Meetings with a Specific Participant

Meetings have a `participants` list where each entry may or may not have a resolved `contact_id`. The query API's `contact_ids` filter only matches **resolved** participants.

**Strategy:** Query by `contact_ids` to get meetings with resolved participants, then optionally do a client-side check on `participants[].display_name` or `transcript` for unresolved ones.

> **Always include `room_name` in meeting tables.** The `room_name` field contains the virtual room name (e.g., `standup-office-bogota`) and often indicates where the meeting took place. It's useful context when `title` is generic or missing — include it as a column alongside `title`.

```python
@app.cell
def fetch_meetings(fetch_all, DATAINDEX, target_id):
    # Get meetings where the target appears in contact_ids
    resolved_meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (resolved_meetings,)

@app.cell
def meeting_table(resolved_meetings, pl):
    _rows = []
    for _m in resolved_meetings:
        _participants = _m.get("participants") or []  # tolerate a missing or null list
        _names = [_p["display_name"] for _p in _participants]
        _rows.append({
            "date": (_m.get("start_time") or _m["timestamp"])[:10],
            "title": _m.get("title", "Untitled"),
            "room_name": _m.get("room_name", ""),
            "participants": ", ".join(_names),
            "has_transcript": _m.get("transcript") is not None,
            "has_summary": _m.get("summary") is not None,
        })
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)
```

To also find meetings where the person was present but **not resolved** (guest), search the transcript:

```python
@app.cell
def search_unresolved(client, DATAINDEX, target_name):
    # Semantic search for the person's name in meeting transcripts
    _resp = client.post(f"{DATAINDEX}/search", json={
        "search_text": target_name,
        "entity_types": ["meeting"],
        "limit": 50,
    })
    _resp.raise_for_status()
    transcript_hits = _resp.json()["results"]
    return (transcript_hits,)
```

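The client-side check on `participants[].display_name` mentioned in the strategy could look roughly like the sketch below. It is an illustration under assumptions: `mentions_person` is a hypothetical helper, unresolved participants are assumed to carry a missing or null `contact_id`, and matching is a plain case-insensitive substring test, which may produce false positives for short names.

```python
def mentions_person(meeting, name):
    """True if `name` appears in the display name of an unresolved participant."""
    needle = name.casefold()
    for participant in meeting.get("participants") or []:
        if participant.get("contact_id") is not None:
            continue  # resolved participants are already covered by the contact_ids filter
        if needle in (participant.get("display_name") or "").casefold():
            return True
    return False


# Hypothetical sample data shaped like the meeting entities described above.
sample_meetings = [
    {"title": "Standup", "participants": [{"display_name": "Alice Smith", "contact_id": None}]},
    {"title": "Planning", "participants": [{"display_name": "Alice Smith", "contact_id": 7}]},
    {"title": "Retro", "participants": None},
]

guest_hits = [m["title"] for m in sample_meetings if mentions_person(m, "alice")]
print(guest_hits)
```

Only "Standup" matches here: in "Planning" the participant is already resolved, and "Retro" has no participant list.
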
## Pattern 3: Calendar Events → Meeting Correlation

Calendar events and meetings are separate entities from different connectors. To find which calendar events had a corresponding recorded meeting, match on start time: the example treats a meeting as belonging to an event when the two start within 15 minutes of each other.

```python
@app.cell
def fetch_calendar_and_meetings(fetch_all, DATAINDEX, my_id):
    events = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "calendar_event",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "asc",
    })
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (events, meetings,)

@app.cell
def correlate(events, meetings, pl):
    def _parse_dt(s):
        # Imported inside the helper so the cell defines no top-level `datetime` name
        from datetime import datetime
        if not s:
            return None
        return datetime.fromisoformat(s.replace("Z", "+00:00"))

    # Index meetings by start_time for matching
    _meeting_by_time = {}
    for _m in meetings:
        _start = _parse_dt(_m.get("start_time"))
        if _start:
            _meeting_by_time[_start] = _m

    _rows = []
    for _ev in events:
        _ev_start = _parse_dt(_ev.get("start_time"))
        if not _ev_start:
            continue

        # Find meeting within 15-min window of calendar event start
        _matched = None
        for _m_start, _m in _meeting_by_time.items():
            if abs((_m_start - _ev_start).total_seconds()) < 900:
                _matched = _m
                break

        _rows.append({
            "date": _ev_start.strftime("%Y-%m-%d"),
            "time": _ev_start.strftime("%H:%M"),
            "event_title": _ev.get("title", "(untitled)"),
            "has_recording": _matched is not None,
            "meeting_title": _matched.get("title", "") if _matched else "",
            "attendee_count": len(_ev.get("attendees", [])),
        })

    calendar_df = pl.DataFrame(_rows)
    return (calendar_df,)
```

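A start-time window misses recordings that begin late in a long event. If that matters, an interval-overlap test is one alternative. The sketch below is a variation rather than the method used in the cell above, and it assumes that meetings also expose an `end_time` field, which the source does not confirm. `parse_dt` and `overlaps` are hypothetical helper names.

```python
from datetime import datetime


def parse_dt(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap when each one starts before the other ends."""
    return a_start < b_end and b_start < a_end


# Hypothetical sample entities.
event = {"start_time": "2025-03-01T10:00:00Z", "end_time": "2025-03-01T11:00:00Z"}
late_meeting = {"start_time": "2025-03-01T10:20:00Z", "end_time": "2025-03-01T10:55:00Z"}
next_meeting = {"start_time": "2025-03-01T11:00:00Z", "end_time": "2025-03-01T11:30:00Z"}


def matches(ev, meeting):
    return overlaps(
        parse_dt(ev["start_time"]), parse_dt(ev["end_time"]),
        parse_dt(meeting["start_time"]), parse_dt(meeting["end_time"]),
    )


# The late meeting starts 20 minutes after the event, outside a 15-minute start window.
start_gap = abs(
    (parse_dt(late_meeting["start_time"]) - parse_dt(event["start_time"])).total_seconds()
)
print(start_gap, matches(event, late_meeting), matches(event, next_meeting))
```

Back-to-back meetings do not count as overlapping because the intervals are treated as half-open.
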
## Pattern 4: Full Interaction Timeline for a Person

Combine emails, meetings, and Zulip messages into a single chronological view.

```python
@app.cell
def fetch_all_interactions(fetch_all, DATAINDEX, target_id):
    all_entities = fetch_all(f"{DATAINDEX}/query", {
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "desc",
    })
    return (all_entities,)

@app.cell
def interaction_timeline(all_entities, pl):
    _rows = []
    for _e in all_entities:
        _etype = _e["entity_type"]
        _summary = ""
        if _etype == "email":
            _summary = _e.get("snippet") or _e.get("title") or ""
        elif _etype == "meeting":
            _summary = _e.get("summary") or _e.get("title") or ""
        elif _etype == "conversation_message":
            _summary = (_e.get("message") or "")[:120]
        elif _etype == "threaded_conversation":
            _summary = _e.get("title") or ""
        elif _etype == "calendar_event":
            _summary = _e.get("title") or ""
        else:
            # Unknown entity types fall back to the title, then the type name
            _summary = _e.get("title") or _etype

        _rows.append({
            "date": _e["timestamp"][:10],
            "type": _etype,
            "source": _e["connector_id"],
            "summary": _summary[:120],
        })

    timeline_df = pl.DataFrame(_rows)
    return (timeline_df,)

@app.cell
def show_timeline(timeline_df, target_name, mo):
    mo.md(f"## Interaction Timeline: {target_name} ({len(timeline_df)} events)")

@app.cell
def display_timeline(timeline_df):
    timeline_df
```

## Pattern 5: LLM Filtering with `lib.llm`

When you need to classify, score, or extract structured information from each entity (e.g. "is this meeting about project X?", "rate the relevance of this email"), use the `llm_call` helper from `workflows/lib`. It sends each item to an LLM and parses the response into a typed Pydantic model.

**Prerequisites:** Copy `.env.example` to `.env` and fill in your `LLM_API_KEY`. Add `mirascope`, `pydantic`, and `python-dotenv` to the notebook's PEP 723 dependencies.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope[openai]",
#     "pydantic",
#     "python-dotenv",
# ]
# ///
```

### Setup cell — load `.env` and import `llm_call`

```python
@app.cell
def setup():
    from dotenv import load_dotenv
    load_dotenv(".env")  # Makes LLM_API_KEY available to lib/llm.py

    import asyncio
    import httpx
    import marimo as mo
    import polars as pl
    from pydantic import BaseModel, Field
    from lib.llm import llm_call
    client = httpx.Client(timeout=30)
    return (asyncio, client, llm_call, mo, pl, BaseModel, Field,)
```

### Define a response model

Create a Pydantic model that describes the structured output you want from the LLM:

```python
@app.cell
def models(BaseModel, Field):
    class RelevanceScore(BaseModel):
        relevant: bool
        reason: str
        score: int  # 0-10

    return (RelevanceScore,)
```

### Filter entities through the LLM

Iterate over fetched entities and call `llm_call` for each one. Since `llm_call` is async, use `asyncio.gather` to process items concurrently:

```python
@app.cell
async def llm_filter(meetings, llm_call, RelevanceScore, pl, mo, asyncio):
    _topic = "Greyhaven"

    async def _score(meeting):
        _text = meeting.get("summary") or meeting.get("title") or ""
        _result = await llm_call(
            prompt=f"Is this meeting about '{_topic}'?\n\nMeeting: {_text}",
            response_model=RelevanceScore,
            system_prompt="Score the relevance of this meeting to the given topic. Set relevant=true if score >= 5.",
        )
        return {**meeting, "llm_relevant": _result.relevant, "llm_reason": _result.reason, "llm_score": _result.score}

    scored_meetings = await asyncio.gather(*[_score(_m) for _m in meetings])
    relevant_meetings = [_m for _m in scored_meetings if _m["llm_relevant"]]

    mo.md(f"**LLM filter:** {len(relevant_meetings)}/{len(meetings)} meetings relevant to '{_topic}'")
    return (relevant_meetings,)
```

### Tips for LLM filtering

- **Keep prompts short** — only include the fields the LLM needs (title, summary, snippet), not the entire raw entity.
- **Use structured output** — always pass a `response_model` so you get typed fields back, not free-text.
- **Batch wisely** — `asyncio.gather` sends all requests concurrently. For large datasets (100+ items), process in chunks to avoid rate limits.
- **Cache results** — LLM calls are slow and cost money. If iterating on a notebook, consider storing scored results in a cell variable so you don't re-score on every edit.

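One way to keep large batches under a rate limit is to cap how many calls are in flight at once. The sketch below uses `asyncio.Semaphore` for bounded concurrency, which is a related technique rather than fixed-size chunking. `gather_limited` is a hypothetical helper, `fake_llm_call` stands in for the real `llm_call`, and the limit of 3 is arbitrary.

```python
import asyncio


async def gather_limited(coro_fn, items, max_concurrency=5):
    """Run coro_fn over items with at most max_concurrency calls in flight.

    Results come back in input order, as with a plain asyncio.gather.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _guarded(item):
        async with semaphore:
            return await coro_fn(item)

    return await asyncio.gather(*[_guarded(item) for item in items])


# Bookkeeping used only to observe the concurrency level in this demo.
stats = {"in_flight": 0, "peak": 0}


async def fake_llm_call(item):
    """Hypothetical stand-in for llm_call: sleeps briefly, then returns a value."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return item * 2


scored = asyncio.run(gather_limited(fake_llm_call, list(range(12)), max_concurrency=3))
print(scored, stats["peak"])
```

Inside an `async def` cell, the call would be `await gather_limited(...)` rather than `asyncio.run(...)`, since an event loop is already running there.
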
## Do / Don't — Quick Reference for LLM Agents

When generating marimo notebooks, follow these rules strictly. Violations typically surface as `MultipleDefinitionError`, as an `_unparsable_cell`, or as output that silently fails to render.

### Do

- **Prefix cell-local variables with `_`** — `_resp`, `_rows`, `_m`, `_data`, `_chunk`. Marimo ignores `_`-prefixed names so they won't clash across cells.
- **Put all imports in the `setup` cell** and pass them as cell parameters: `def my_cell(client, mo, pl, asyncio):`. Never `import` inside other cells — even `import asyncio` in two async cells causes `MultipleDefinitionError`.
- **Give returned DataFrames unique names** — `email_df`, `meeting_df`, `timeline_df`. Never use a bare `df` that might collide with another cell.
- **Return only values other cells need** — everything else should be `_`-prefixed and stays private to the cell.
- **Import stdlib modules in `setup` too** — even `from datetime import datetime` creates a top-level name. If two cells both import `datetime`, marimo errors. Import it once in `setup` and receive it as a parameter, or use it inside a `_`-prefixed helper function where it's naturally scoped.
- **Every non-utility cell must show a preview** — see the "Cell Output Previews" section below.
- **Use separate display cells for DataFrames** — the build cell returns the DataFrame and shows a `mo.md()` count/heading; a standalone display cell (e.g., `def show_table(df): df`) renders it as an interactive table the user can sort and filter.
- **Include `room_name` when listing meetings** — the virtual room name provides useful context about where the meeting took place (e.g., `standup-office-bogota`). Show it as a column alongside `title`.
- **Keep cell output expressions at the top level** — if a cell conditionally displays a DataFrame, initialize `_output = None` before the `if`/`else`, assign inside the branches, then put `_output` as the last top-level expression. Expressions inside `if`/`else`/`for` blocks are silently ignored by marimo.
- **Put all user parameters in a `params` cell as the first cell** — date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook.
- **Declare cells as `async def` when using `await`** — `@app.cell` followed by `async def cell_name(...)`. This includes cells using `asyncio.gather`, `await llm_call(...)`, or any async API.
- **Return classes/models from cells that define them** — if a cell defines `class MyModel(BaseModel)`, return it so other cells can use it as a parameter: `return (MyModel,)`.
- **Use `python-dotenv` to load `.env`** — add `python-dotenv` to PEP 723 dependencies and call `load_dotenv(".env")` early in the setup cell (before importing `lib.llm`). This ensures `LLM_API_KEY` and other env vars are available without requiring them to be pre-set in the shell.

### Don't

- **Don't define the same variable name in two cells** — even `resp = ...` in cell A and `resp = ...` in cell B is a fatal error.
- **Don't `import` inside non-setup cells** — every `import X` defines a top-level variable `X`. If two cells both `import asyncio`, marimo raises `MultipleDefinitionError` and refuses to run. Put all imports in the `setup` cell and receive them as function parameters.
- **Don't use generic top-level names** like `df`, `rows`, `resp`, `data`, `result` — either prefix with `_` or give them a unique descriptive name.
- **Don't return temporary variables** — if `_rows` is only used to build a DataFrame, keep it `_`-prefixed and only return the DataFrame.
- **Don't use `await` in a non-async cell** — this causes marimo to save the cell as `_unparsable_cell` (a string literal that won't execute). Always use `async def` for cells that call async functions.
- **Don't define classes in a cell without returning them** — a bare `return` or no return makes classes invisible to the DAG. Other cells can't receive them as parameters.
- **Don't put display expressions inside `if`/`else`/`for` blocks** — marimo only renders the last top-level expression. A DataFrame inside an `if` branch is silently discarded. Use the `_output = None` pattern from the Do list above, or split the work into separate cells (see [Cell Output Must Be at the Top Level](#cell-output-must-be-at-the-top-level)).

## Cell Output Previews

Every cell that fetches, transforms, or produces data **must display a preview** so the user can validate results at each step. The only exceptions are **utility cells** (config, setup, helpers) that only define constants or functions.

Think from the user's perspective: when they open the notebook in `marimo edit`, each cell should tell them something useful — a count, a sample, a summary. Silent cells that do work but show nothing are hard to debug and validate.

### What to show

| Cell type | What to preview |
|-----------|----------------|
| API fetch (list of items) | `mo.md(f"**Fetched {len(items)} meetings**")` |
| DataFrame build | The DataFrame itself as last expression (renders as interactive table) |
| Scalar result | `mo.md(f"**Contact:** {name} (id={contact_id})")` |
| Search / filter | `mo.md(f"**{len(hits)} results** matching '{term}'")` |
| Final output | Full DataFrame or `mo.md()` summary as last expression |

### Example: fetch cell with preview

**Bad** — cell runs silently, user sees nothing:

```python
@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    return (meetings,)
```

**Good** — cell shows a count so the user knows it worked:

```python
@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id, mo):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    mo.md(f"**Fetched {len(meetings)} meetings**")
    return (meetings,)
```

### Example: transform cell with table preview

**Bad** — builds DataFrame but doesn't display it:

```python
@app.cell
def build_table(meetings, pl):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)
```

**Good** — the build cell shows a `mo.md()` count, and a **separate display cell** renders the DataFrame as an interactive table:

```python
@app.cell
def build_table(meetings, pl, mo):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows).sort("date")
    mo.md(f"### Meetings ({len(meeting_df)} results)")
    return (meeting_df,)

@app.cell
def show_meeting_table(meeting_df):
    meeting_df  # Renders as interactive sortable table
```

### Separate display cells for DataFrames

When a cell builds a DataFrame, use **two cells**: one that builds and returns it (with a `mo.md()` summary), and a standalone display cell that renders it as a table. This keeps the build logic clean and gives the user an interactive table they can sort and filter in the marimo UI.

```python
# Cell 1: build and return the DataFrame, show a count
@app.cell
def build_sentiment_table(analyzed_meetings, pl, mo):
    _rows = [...]
    sentiment_df = pl.DataFrame(_rows).sort("date", descending=True)
    mo.md(f"### Sentiment Analysis ({len(sentiment_df)} meetings)")
    return (sentiment_df,)

# Cell 2: standalone display — just the DataFrame, nothing else
@app.cell
def show_sentiment_table(sentiment_df):
    sentiment_df
```

This pattern makes every result inspectable. The `mo.md()` cell gives a quick count/heading; the display cell lets the user explore the full data interactively.

### Utility cells (no preview needed)

Config, setup, and helper cells that only define constants or functions don't need previews:

```python
@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return CONTACTDB, DATAINDEX

@app.cell
def helpers(client):
    def fetch_all(url, params):
        ...
    return (fetch_all,)
```

## Tips

- Use `marimo edit` during development to see cell outputs interactively
- Make raw API responses the last expression in a cell to inspect their structure
- Use `polars` over `pandas` for better performance and type safety
- Set `timeout=30` on httpx clients — some queries over large date ranges are slow
- Name cells descriptively — function names appear in the marimo sidebar