# Marimo Notebook Patterns

This guide covers how to create [marimo](https://marimo.io) notebooks for data analysis against the InternalAI platform APIs. Marimo notebooks are plain `.py` files with reactive cells: no `.ipynb` format, no Jupyter dependency.

## Marimo Basics

A marimo notebook is a Python file with `@app.cell` decorated functions. Each cell returns values as a tuple, and other cells receive them as function parameters, so marimo builds a reactive DAG automatically.

```python
import marimo
app = marimo.App()


@app.cell
def cell_one():
    x = 42
    return (x,)


@app.cell
def cell_two(x):
    # Re-runs automatically when x changes
    result = x * 2
    return (result,)
```

**Key rules:**

- Cells declare dependencies via function parameters
- Cells return values as tuples: `return (var1, var2,)`
- The **last expression** in a cell is displayed as rich output in the marimo UI (dataframes render as tables, dicts as collapsible trees)
- Use `mo.md("# heading")` for formatted markdown output (import `mo` once in setup, see below)
- No manual execution order; the DAG determines it
- **Variable names must be unique across cells.** Every variable assigned at the top level of a cell is tracked by marimo's DAG. If two cells both define `resp`, marimo raises `MultipleDefinitionError` and refuses to run. Prefix cell-local variables with `_` (e.g., `_resp`, `_rows`, `_data`) to make them **private** to that cell; marimo ignores `_`-prefixed names.
- **Import shared modules once** in a single setup cell and pass them as cell parameters. Do NOT `import marimo as mo` in multiple cells, because that defines `mo` twice. Instead, import it once in `setup` and receive it via `def my_cell(mo):`.

### Cell Variable Scoping: Example

This is the **most common mistake**. Any variable assigned at the top level of a cell (not inside a `def` or comprehension) is tracked by marimo. If two cells assign the same name, the notebook refuses to run.

**BROKEN**: `resp` is defined at top level in both cells:

```python
# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    resp = client.post(f"{DATAINDEX}/search", json={...})  # defines 'resp'
    resp.raise_for_status()
    results = resp.json()["results"]
    return (results,)


# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # also defines 'resp' → ERROR
    meeting = resp.json()
    return (meeting,)
```

> **Error:** `MultipleDefinitionError: variable 'resp' is defined in multiple cells`

**FIXED**: prefix cell-local variables with `_`:

```python
# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    _resp = client.post(f"{DATAINDEX}/search", json={...})  # _resp is cell-private
    _resp.raise_for_status()
    results = _resp.json()["results"]
    return (results,)


# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    _resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # _resp is cell-private, no conflict
    meeting = _resp.json()
    return (meeting,)
```

**Rule of thumb:** if a variable is only used within the cell to compute a return value, prefix it with `_`. Only leave names unprefixed if another cell needs to receive them.

> **Note:** Variables inside nested `def` functions are naturally local and don't need `_` prefixes. For example, `resp` inside a `def fetch_all(...)` helper is fine because it's scoped to the function, not the cell.

### Inline Dependencies with PEP 723

Use PEP 723 `/// script` metadata so `uv run` auto-installs dependencies:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
# ]
# ///
```

### Running Notebooks

```bash
uvx marimo edit notebook.py   # Interactive editor (best for development)
uvx marimo run notebook.py    # Read-only web app
uv run notebook.py            # Script mode (terminal output)
```

### Inspecting Cell Outputs

In `marimo edit`, the last expression of every cell is displayed as rich output alongside the cell.
This is the primary way to introspect API responses:

- **Dicts/lists** render as collapsible JSON trees; click to expand nested fields
- **Polars/Pandas DataFrames** render as interactive sortable tables
- **Strings** render as plain text

To inspect a raw API response, just make it the last expression:

```python
@app.cell
def inspect_response(client, DATAINDEX):
    _resp = client.get(f"{DATAINDEX}/query", params={
        "entity_types": "meeting",
        "limit": 2,
    })
    _resp.json()  # This gets displayed as a collapsible JSON tree
```

To inspect an intermediate value alongside other work, use `mo.accordion` or return it. Only the last expression renders, so combine several outputs with `mo.vstack`:

```python
@app.cell
def debug_meetings(meetings, mo):
    # Only the last expression is displayed, so stack both outputs into one
    mo.vstack([
        mo.md(f"**Count:** {len(meetings)}"),
        # Show first item structure for inspection
        mo.accordion({"First meeting raw": mo.json(meetings[0])}) if meetings else mo.md("_No meetings_"),
    ])
```

## Notebook Skeleton

Every notebook against InternalAI follows this structure:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
# ]
# ///

import marimo
app = marimo.App()


@app.cell
def params():
    """User parameters: edit these to change the workflow's behavior."""
    SEARCH_TERMS = ["greyhaven"]
    DATE_FROM = "2026-01-01T00:00:00Z"
    DATE_TO = "2026-02-01T00:00:00Z"
    TARGET_PERSON = None  # Set to a name like "Alice" to filter by person, or None for all
    return DATE_FROM, DATE_TO, SEARCH_TERMS, TARGET_PERSON


@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return (CONTACTDB, DATAINDEX,)


@app.cell
def setup():
    import httpx
    import marimo as mo
    import polars as pl
    client = httpx.Client(timeout=30)
    return (client, mo, pl,)


# --- your IN / ETL / OUT cells here ---

if __name__ == "__main__":
    app.run()
```

**The `params` cell must always be the first cell** after `app = marimo.App()`. It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values.
This way the user can tweak the workflow by editing a single cell at the top, with no need to hunt through the code for hardcoded values.

## Pagination Helper

The DataIndex `GET /query` endpoint paginates with `limit` and `offset`. Always paginate, because result sets can be large.

```python
@app.cell
def helpers(client):
    def fetch_all(url, params):
        """Fetch all pages from a paginated DataIndex endpoint."""
        all_items = []
        limit = params.get("limit", 50)
        # Copy so the caller's dict is not mutated while paging
        params = {**params, "limit": limit, "offset": 0}
        while True:
            resp = client.get(url, params=params)
            resp.raise_for_status()
            data = resp.json()
            all_items.extend(data["items"])
            if params["offset"] + limit >= data["total"]:
                break
            params["offset"] += limit
        return all_items

    def resolve_contact(name, contactdb_url):
        """Find a contact by name and return the first matching contact record."""
        resp = client.get(f"{contactdb_url}/api/contacts", params={"search": name})
        resp.raise_for_status()
        contacts = resp.json()["contacts"]
        if not contacts:
            raise ValueError(f"No contact found for '{name}'")
        return contacts[0]

    return (fetch_all, resolve_contact,)
```

## Pattern 1: Emails Involving a Specific Person

Emails have `from_contact_id`, `to_contact_ids`, and `cc_contact_ids`. The query API's `contact_ids` filter matches entities where the contact appears in **any** of these roles.
```python
@app.cell
def find_person(resolve_contact, CONTACTDB):
    target = resolve_contact("Alice", CONTACTDB)
    target_id = target["id"]
    target_name = target["name"]
    return (target_id, target_name,)


@app.cell
def fetch_emails(fetch_all, DATAINDEX, target_id):
    emails = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "email",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_order": "desc",
    })
    return (emails,)


@app.cell
def email_table(emails, target_id, target_name, pl):
    email_df = pl.DataFrame([{
        "date": e["timestamp"][:10],
        "subject": e.get("title", "(no subject)"),
        "direction": (
            "sent" if str(target_id) == str(e.get("from_contact_id"))
            else "received"
        ),
        "snippet": (e.get("snippet") or e.get("text_content") or "")[:100],
    } for e in emails])
    return (email_df,)


@app.cell
def show_emails(email_df, target_name, mo):
    mo.md(f"## Emails involving {target_name} ({len(email_df)} total)")


@app.cell
def display_email_table(email_df):
    email_df  # Renders as interactive table in marimo edit
```

## Pattern 2: Meetings with a Specific Participant

Meetings have a `participants` list where each entry may or may not have a resolved `contact_id`. The query API's `contact_ids` filter only matches **resolved** participants.

**Strategy:** Query by `contact_ids` to get meetings with resolved participants, then optionally do a client-side check on `participants[].display_name` or `transcript` for unresolved ones.
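
The optional client-side check on `participants[].display_name` could look like the sketch below. It assumes each participant is a dict and that an unresolved participant has a missing or null `contact_id`; `mentions_by_name` is a hypothetical helper name.

```python
def mentions_by_name(meeting: dict, name: str) -> bool:
    """True if an unresolved participant's display name contains `name`."""
    needle = name.casefold()
    for participant in meeting.get("participants") or []:
        if participant.get("contact_id") is not None:
            continue  # resolved participants are already covered by contact_ids
        display_name = participant.get("display_name") or ""
        if needle in display_name.casefold():
            return True
    return False
```

Substring matching on names is loose, so hits are best treated as candidates to review. The cells below cover the `contact_ids` query itself.
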
```python
@app.cell
def fetch_meetings(fetch_all, DATAINDEX, target_id):
    # Get meetings where the target appears in contact_ids
    resolved_meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (resolved_meetings,)


@app.cell
def meeting_table(resolved_meetings, target_name, pl):
    _rows = []
    for _m in resolved_meetings:
        _participants = _m.get("participants", [])
        _names = [_p["display_name"] for _p in _participants]
        _rows.append({
            "date": (_m.get("start_time") or _m["timestamp"])[:10],
            "title": _m.get("title", _m.get("room_name", "Untitled")),
            "participants": ", ".join(_names),
            "has_transcript": _m.get("transcript") is not None,
            "has_summary": _m.get("summary") is not None,
        })
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)
```

To also find meetings where the person was present but **not resolved** (guest), search the transcript:

```python
@app.cell
def search_unresolved(client, DATAINDEX, target_name):
    # Semantic search for the person's name in meeting transcripts
    _resp = client.post(f"{DATAINDEX}/search", json={
        "search_text": target_name,
        "entity_types": ["meeting"],
        "limit": 50,
    })
    _resp.raise_for_status()
    transcript_hits = _resp.json()["results"]
    return (transcript_hits,)
```

## Pattern 3: Calendar Events → Meeting Correlation

Calendar events and meetings are separate entities from different connectors. To find which calendar events had a corresponding recorded meeting, match by time overlap.
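
The matching rule can be isolated as a pure function, which makes it easy to test outside the notebook. This is a sketch under assumptions: `start_time` is an ISO 8601 string with a timezone, the 900-second window is a tunable guess, and `parse_dt` and `closest_meeting` are hypothetical names. Unlike the cell below, which takes the first meeting inside the window, this variant picks the nearest one.

```python
from datetime import datetime


def parse_dt(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z'; None if empty."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def closest_meeting(event, meetings, window_seconds=900):
    """Return the meeting whose start is nearest the event start, within the window."""
    ev_start = parse_dt(event.get("start_time"))
    if ev_start is None:
        return None
    best, best_gap = None, None
    for meeting in meetings:
        m_start = parse_dt(meeting.get("start_time"))
        if m_start is None:
            continue
        gap = abs((m_start - ev_start).total_seconds())
        if gap < window_seconds and (best_gap is None or gap < best_gap):
            best, best_gap = meeting, gap
    return best
```

Choosing the nearest start avoids pairing an event with an earlier, unrelated meeting when two recordings fall inside the same window. The notebook cells below apply the same idea inline.
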
```python
@app.cell
def fetch_calendar_and_meetings(fetch_all, DATAINDEX, my_id):
    events = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "calendar_event",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "asc",
    })
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (events, meetings,)


@app.cell
def correlate(events, meetings, pl):
    from datetime import datetime

    def _parse_dt(s):
        if not s:
            return None
        return datetime.fromisoformat(s.replace("Z", "+00:00"))

    # Index meetings by start_time for matching
    _meeting_by_time = {}
    for _m in meetings:
        _start = _parse_dt(_m.get("start_time"))
        if _start:
            _meeting_by_time[_start] = _m

    _rows = []
    for _ev in events:
        _ev_start = _parse_dt(_ev.get("start_time"))
        if not _ev_start:
            continue

        # Find meeting within 15-min window of calendar event start
        _matched = None
        for _m_start, _m in _meeting_by_time.items():
            if abs((_m_start - _ev_start).total_seconds()) < 900:
                _matched = _m
                break

        _rows.append({
            "date": _ev_start.strftime("%Y-%m-%d"),
            "time": _ev_start.strftime("%H:%M"),
            "event_title": _ev.get("title", "(untitled)"),
            "has_recording": _matched is not None,
            "meeting_title": _matched.get("title", "") if _matched else "",
            "attendee_count": len(_ev.get("attendees", [])),
        })

    calendar_df = pl.DataFrame(_rows)
    return (calendar_df,)
```

## Pattern 4: Full Interaction Timeline for a Person

Combine emails, meetings, and Zulip messages into a single chronological view.
```python
@app.cell
def fetch_all_interactions(fetch_all, DATAINDEX, target_id):
    all_entities = fetch_all(f"{DATAINDEX}/query", {
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "desc",
    })
    return (all_entities,)


@app.cell
def interaction_timeline(all_entities, target_name, pl):
    _rows = []
    for _e in all_entities:
        _etype = _e["entity_type"]
        _summary = ""
        if _etype == "email":
            _summary = _e.get("snippet") or _e.get("title") or ""
        elif _etype == "meeting":
            _summary = _e.get("summary") or _e.get("title") or ""
        elif _etype == "conversation_message":
            _summary = (_e.get("message") or "")[:120]
        elif _etype == "threaded_conversation":
            _summary = _e.get("title") or ""
        elif _etype == "calendar_event":
            _summary = _e.get("title") or ""
        else:
            _summary = _e.get("title") or _e["entity_type"]

        _rows.append({
            "date": _e["timestamp"][:10],
            "type": _etype,
            "source": _e["connector_id"],
            "summary": _summary[:120],
        })

    timeline_df = pl.DataFrame(_rows)
    return (timeline_df,)


@app.cell
def show_timeline(timeline_df, target_name, mo):
    mo.md(f"## Interaction Timeline: {target_name} ({len(timeline_df)} events)")


@app.cell
def display_timeline(timeline_df):
    timeline_df
```

## Pattern 5: LLM Filtering with `lib.llm`

When you need to classify, score, or extract structured information from each entity (e.g. "is this meeting about project X?", "rate the relevance of this email"), use the `llm_call` helper from `workflows/lib`. It sends each item to an LLM and parses the response into a typed Pydantic model.

**Prerequisites:** Copy `.env.example` to `.env` and fill in your `LLM_API_KEY`. Add `mirascope` and `pydantic` to the notebook's PEP 723 dependencies.
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope",
#     "pydantic",
# ]
# ///
```

### Setup cell: import `llm_call`

```python
@app.cell
def setup():
    import httpx
    import marimo as mo
    import polars as pl
    from lib.llm import llm_call
    client = httpx.Client(timeout=30)
    return (client, llm_call, mo, pl,)
```

### Define a response model

Create a Pydantic model that describes the structured output you want from the LLM:

```python
@app.cell
def models():
    from pydantic import BaseModel

    class RelevanceScore(BaseModel):
        relevant: bool
        reason: str
        score: int  # 0-10

    return (RelevanceScore,)
```

### Filter entities through the LLM

Iterate over fetched entities and call `llm_call` for each one. Since `llm_call` is async, use `asyncio.gather` to process items concurrently:

```python
@app.cell
async def llm_filter(meetings, llm_call, RelevanceScore, pl, mo):
    import asyncio

    _topic = "Greyhaven"

    async def _score(meeting):
        _text = meeting.get("summary") or meeting.get("title") or ""
        _result = await llm_call(
            prompt=f"Is this meeting about '{_topic}'?\n\nMeeting: {_text}",
            response_model=RelevanceScore,
            system_prompt="Score the relevance of this meeting to the given topic. Set relevant=true if score >= 5.",
        )
        return {**meeting, "llm_relevant": _result.relevant, "llm_reason": _result.reason, "llm_score": _result.score}

    scored_meetings = await asyncio.gather(*[_score(_m) for _m in meetings])
    relevant_meetings = [_m for _m in scored_meetings if _m["llm_relevant"]]

    mo.md(f"**LLM filter:** {len(relevant_meetings)}/{len(meetings)} meetings relevant to '{_topic}'")
    return (relevant_meetings,)
```

### Tips for LLM filtering

- **Keep prompts short**: only include the fields the LLM needs (title, summary, snippet), not the entire raw entity.
- **Use structured output**: always pass a `response_model` so you get typed fields back, not free-text.
- **Batch wisely**: `asyncio.gather` sends all requests concurrently.
  For large datasets (100+ items), process in chunks to avoid rate limits.
- **Cache results**: LLM calls are slow and cost money. If iterating on a notebook, consider storing scored results in a cell variable so you don't re-score on every edit.

## Do / Don't: Quick Reference for LLM Agents

When generating marimo notebooks, follow these rules strictly. Violations of the naming rules cause `MultipleDefinitionError` at runtime.

### Do

- **Prefix cell-local variables with `_`**: `_resp`, `_rows`, `_m`, `_data`, `_chunk`. Marimo ignores `_`-prefixed names so they won't clash across cells.
- **Import shared modules once in `setup`** and pass them as cell parameters: `def my_cell(client, mo, pl):`.
- **Give returned DataFrames unique names**: `email_df`, `meeting_df`, `timeline_df`. Never use a bare `df` that might collide with another cell.
- **Return only values other cells need**: everything else should be `_`-prefixed and stays private to the cell.
- **Use `from datetime import datetime` inside the one cell that needs it.** An import at the top level of a cell still defines a tracked name, so this is only safe while a single cell imports it. If a second cell needs the same module, move the import to `setup` or alias it to a `_`-prefixed name.
- **Every non-utility cell must show a preview**: see the "Cell Output Previews" section below.
- **Put all user parameters in a `params` cell as the first cell**: date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook.

### Don't

- **Don't define the same variable name in two cells**: even `resp = ...` in cell A and `resp = ...` in cell B is a fatal error.
- **Don't `import marimo as mo` in multiple cells**: this defines `mo` twice. Import it once in `setup`, then receive it via `def my_cell(mo):`.
- **Don't use generic top-level names** like `df`, `rows`, `resp`, `data`, `result`: either prefix with `_` or give them a unique descriptive name.
- **Don't return temporary variables**: if `_rows` is only used to build a DataFrame, keep it `_`-prefixed and only return the DataFrame.
- **Don't use `import X` at the top level of multiple cells** for the same module, because the module variable name would be duplicated. Import once in `setup` or use `_`-prefixed local imports (`import json as _json`).

## Cell Output Previews

Every cell that fetches, transforms, or produces data **must display a preview** so the user can validate results at each step. The only exceptions are **utility cells** (config, setup, helpers) that only define constants or functions.

Think from the user's perspective: when they open the notebook in `marimo edit`, each cell should tell them something useful, such as a count, a sample, or a summary. Silent cells that do work but show nothing are hard to debug and validate.

### What to show

| Cell type | What to preview |
|-----------|-----------------|
| API fetch (list of items) | `mo.md(f"**Fetched {len(items)} meetings**")` |
| DataFrame build | The DataFrame itself as last expression (renders as interactive table) |
| Scalar result | `mo.md(f"**Contact:** {name} (id={contact_id})")` |
| Search / filter | `mo.md(f"**{len(hits)} results** matching '{term}'")` |
| Final output | Full DataFrame or `mo.md()` summary as last expression |

### Example: fetch cell with preview

**Bad**: cell runs silently, user sees nothing:

```python
@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    return (meetings,)
```

**Good**: cell shows a count so the user knows it worked:

```python
@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id, mo):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    mo.md(f"**Fetched {len(meetings)} meetings**")
    return (meetings,)
```

### Example: transform cell with table preview

**Bad**: builds DataFrame but doesn't display
it:

```python
@app.cell
def build_table(meetings, pl):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)
```

**Good**: the build cell shows a summary, and a separate cell has the DataFrame as its last expression, so marimo renders it as an interactive table:

```python
@app.cell
def build_table(meetings, pl, mo):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows).sort("date")
    mo.md(f"### Meetings ({len(meeting_df)} results)")
    return (meeting_df,)


@app.cell
def show_meeting_table(meeting_df):
    meeting_df  # Renders as interactive sortable table
```

### Utility cells (no preview needed)

Config, setup, and helper cells that only define constants or functions don't need previews:

```python
@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return CONTACTDB, DATAINDEX


@app.cell
def helpers(client):
    def fetch_all(url, params):
        ...
    return (fetch_all,)
```

## Tips

- Use `marimo edit` during development to see cell outputs interactively
- Make raw API responses the last expression in a cell to inspect their structure
- Use `polars` over `pandas` for better performance and type safety
- Set `timeout=30` on httpx clients, because some queries over large date ranges are slow
- Name cells descriptively, since function names appear in the marimo sidebar
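
For the "Batch wisely" tip in the LLM filtering section, chunked processing could be sketched as follows. This is an illustration under assumptions: `gather_in_chunks` and `fake_score` are hypothetical names, the chunk size and pause are guesses to tune against the provider's rate limits, and `fake_score` stands in for a coroutine built on `llm_call`.

```python
import asyncio


async def gather_in_chunks(items, worker, chunk_size=20, pause_seconds=0.0):
    """Run `worker(item)` concurrently within each chunk, chunks one after another."""
    results = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        # Items inside a chunk run concurrently; result order matches input order
        results.extend(await asyncio.gather(*(worker(item) for item in chunk)))
        # Optional pause between chunks, skipped after the final one
        if pause_seconds and start + chunk_size < len(items):
            await asyncio.sleep(pause_seconds)
    return results


async def fake_score(n):
    # Stand-in for an `llm_call`-based scoring coroutine (hypothetical)
    await asyncio.sleep(0)
    return n * 2
```

Inside an `async def` marimo cell this would be awaited directly, for example `scored_meetings = await gather_in_chunks(meetings, _score)`; in a plain script, wrap the call in `asyncio.run(...)`.
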