Mathieu Virbel b18ee3b564 feat: first commit

2026-02-10 18:19:30 -06:00

Marimo Notebook Patterns

This guide covers how to create marimo notebooks for data analysis against the InternalAI platform APIs. Marimo notebooks are plain .py files with reactive cells — no .ipynb format, no Jupyter dependency.

Marimo Basics

A marimo notebook is a Python file with @app.cell decorated functions. Each cell returns values as a tuple, and other cells receive them as function parameters — marimo builds a reactive DAG automatically.

import marimo
app = marimo.App()

@app.cell
def cell_one():
    x = 42
    return (x,)

@app.cell
def cell_two(x):
    # Re-runs automatically when x changes
    result = x * 2
    return (result,)

Key rules:

Cells declare dependencies via function parameters
Cells return values as tuples: return (var1, var2,)
The last expression in a cell is displayed as rich output in the marimo UI (dataframes render as tables, dicts as collapsible trees)
Use mo.md("# heading") for formatted markdown output (import mo once in setup — see below)
No manual execution order; the DAG determines it
Variable names must be unique across cells. Every variable assigned at the top level of a cell is tracked by marimo's DAG. If two cells both define resp, marimo raises MultipleDefinitionError and refuses to run. Prefix cell-local variables with _ (e.g., _resp, _rows, _data) to make them private to that cell — marimo ignores _-prefixed names.
Import shared modules once in a single setup cell and pass them as cell parameters. Do NOT import marimo as mo in multiple cells — that defines mo twice. Instead, import it once in setup and receive it via def my_cell(mo):.

Cell Variable Scoping — Example

Reusing a variable name across cells is the most common mistake. Any variable assigned at the top level of a cell (not inside a def or comprehension) is tracked by marimo. If two cells assign the same name, the notebook refuses to run.

BROKEN — resp is defined at top level in both cells:

# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    resp = client.post(f"{DATAINDEX}/search", json={...})  # defines 'resp'
    resp.raise_for_status()
    results = resp.json()["results"]
    return (results,)

# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # also defines 'resp' → ERROR
    meeting = resp.json()
    return (meeting,)

Error: MultipleDefinitionError: variable 'resp' is defined in multiple cells

FIXED — prefix cell-local variables with _:

# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    _resp = client.post(f"{DATAINDEX}/search", json={...})  # _resp is cell-private
    _resp.raise_for_status()
    results = _resp.json()["results"]
    return (results,)

# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    _resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # _resp is cell-private, no conflict
    meeting = _resp.json()
    return (meeting,)

Rule of thumb: if a variable is only used within the cell to compute a return value, prefix it with _. Only leave names unprefixed if another cell needs to receive them.

Note: Variables inside nested def functions are naturally local and don't need _ prefixes — e.g., resp inside a def fetch_all(...) helper is fine because it's scoped to the function, not the cell.
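The scoping rule can be approximated mechanically before opening the notebook. The sketch below is not part of marimo: `find_duplicate_definitions` is a hypothetical helper built on the standard `ast` module. It collects names bound directly in the body of each `@app.cell` function, skips `_`-prefixed names, and reports any name bound in more than one cell. It only looks at assignments, imports, and nested def/class statements, so treat it as a rough pre-flight check and not a replacement for marimo's own analysis.

```python
import ast
from collections import defaultdict


def _is_cell(node):
    """True for functions decorated with @app.cell or @app.cell(...)."""
    if not isinstance(node, ast.FunctionDef):
        return False
    for dec in node.decorator_list:
        target = dec.func if isinstance(dec, ast.Call) else dec
        if isinstance(target, ast.Attribute) and target.attr == "cell":
            return True
    return False


def _top_level_names(cell):
    """Public names bound by statements directly in the cell body."""
    names = set()
    for stmt in cell.body:
        if isinstance(stmt, (ast.Assign, ast.AnnAssign, ast.AugAssign)):
            targets = stmt.targets if isinstance(stmt, ast.Assign) else [stmt.target]
            for target in targets:
                for sub in ast.walk(target):
                    # Only names being stored to count as definitions
                    if isinstance(sub, ast.Name) and isinstance(sub.ctx, ast.Store):
                        names.add(sub.id)
        elif isinstance(stmt, (ast.Import, ast.ImportFrom)):
            for alias in stmt.names:
                names.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(stmt, (ast.FunctionDef, ast.ClassDef)):
            names.add(stmt.name)
    return {n for n in names if not n.startswith("_")}


def find_duplicate_definitions(source):
    """Map each public name defined in more than one cell to those cells."""
    owners = defaultdict(list)
    for node in ast.parse(source).body:
        if _is_cell(node):
            for name in sorted(_top_level_names(node)):
                owners[name].append(node.name)
    return {name: cells for name, cells in owners.items() if len(cells) > 1}


BROKEN = '''
@app.cell
def search_meetings(client):
    resp = client.post("/search")
    results = resp.json()
    return (results,)

@app.cell
def fetch_details(client, results):
    resp = client.get("/entities")
    meeting = resp.json()
    return (meeting,)
'''

print(find_duplicate_definitions(BROKEN))
# {'resp': ['search_meetings', 'fetch_details']}
print(find_duplicate_definitions(BROKEN.replace("resp", "_resp")))
# {}
```

The source is only parsed, never executed, so the check can run on a notebook file without marimo or any of its dependencies installed.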

Inline Dependencies with PEP 723

Use PEP 723 /// script metadata so uv run auto-installs dependencies:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
# ]
# ///

Running Notebooks

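# --sandbox makes marimo install the PEP 723 dependencies into an isolated environment
uvx marimo edit --sandbox notebook.py   # Interactive editor (best for development)
uvx marimo run --sandbox notebook.py    # Read-only web app
uv run notebook.py                      # Script mode (terminal output)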

Inspecting Cell Outputs

In marimo edit, the last expression of every cell is displayed as rich output below the cell. This is the primary way to introspect API responses:

Dicts/lists render as collapsible JSON trees — click to expand nested fields
Polars/Pandas DataFrames render as interactive sortable tables
Strings render as plain text

To inspect a raw API response, just make it the last expression:

@app.cell
def inspect_response(client, DATAINDEX):
    _resp = client.get(f"{DATAINDEX}/query", params={
        "entity_types": "meeting", "limit": 2,
    })
    _resp.json()  # This gets displayed as a collapsible JSON tree

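To inspect an intermediate value alongside other work, use mo.accordion. Only the last expression of a cell is displayed, so combine several outputs with mo.vstack:

@app.cell
def debug_meetings(meetings, mo):
    _count = mo.md(f"**Count:** {len(meetings)}")
    # Show the first item's structure for inspection, if there is one
    _first = mo.accordion({"First meeting raw": mo.json(meetings[0])}) if meetings else mo.md("_No meetings_")
    mo.vstack([_count, _first])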

Notebook Skeleton

Every notebook against InternalAI follows this structure:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
# ]
# ///

import marimo
app = marimo.App()

@app.cell
def params():
    """User parameters — edit these to change the workflow's behavior."""
    SEARCH_TERMS = ["greyhaven"]
    DATE_FROM = "2026-01-01T00:00:00Z"
    DATE_TO = "2026-02-01T00:00:00Z"
    TARGET_PERSON = None  # Set to a name like "Alice" to filter by person, or None for all
    return DATE_FROM, DATE_TO, SEARCH_TERMS, TARGET_PERSON

@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return (CONTACTDB, DATAINDEX,)

@app.cell
def setup():
    import httpx
    import marimo as mo
    import polars as pl
    client = httpx.Client(timeout=30)
    return (client, mo, pl,)

# --- your IN / ETL / OUT cells here ---

if __name__ == "__main__":
    app.run()

The params cell must always be the first cell after app = marimo.App(). It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values. This way the user can tweak the workflow by editing a single cell at the top — no need to hunt through the code for hardcoded values.

Pagination Helper

The DataIndex GET /query endpoint paginates with limit and offset. Always paginate — result sets can be large.

@app.cell
def helpers(client):
    def fetch_all(url, params):
        """Fetch all pages from a paginated DataIndex endpoint."""
        all_items = []
        limit = params.get("limit", 50)
        params = {**params, "limit": limit, "offset": 0}
        while True:
            resp = client.get(url, params=params)
            resp.raise_for_status()
            data = resp.json()
            all_items.extend(data["items"])
            if params["offset"] + limit >= data["total"]:
                break
            params["offset"] += limit
        return all_items

    def resolve_contact(name, contactdb_url):
        """Find a contact by name, return the first matching contact record."""
        resp = client.get(f"{contactdb_url}/api/contacts", params={"search": name})
        resp.raise_for_status()
        contacts = resp.json()["contacts"]
        if not contacts:
            raise ValueError(f"No contact found for '{name}'")
        return contacts[0]

    return (fetch_all, resolve_contact,)
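To see how the offset arithmetic in fetch_all terminates, the loop can be exercised without a server. The sketch below repeats the same loop against a stand-in client. `FakeClient` and `FakeResponse` are hypothetical test doubles, and the `items`/`total` response shape is taken from the helper above, not from an API reference.

```python
class FakeResponse:
    """Minimal stand-in for an HTTP response (hypothetical test double)."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        return None

    def json(self):
        return self._payload


class FakeClient:
    """Serves slices of a fixed list using limit/offset and logs each call."""

    def __init__(self, records):
        self.records = records
        self.calls = []

    def get(self, url, params):
        offset, limit = params["offset"], params["limit"]
        self.calls.append((offset, limit))
        return FakeResponse({
            "items": self.records[offset:offset + limit],
            "total": len(self.records),
        })


def fetch_all(client, url, params):
    """Same loop as the helper above, with the client passed explicitly."""
    all_items = []
    limit = params.get("limit", 50)
    params = {**params, "limit": limit, "offset": 0}
    while True:
        resp = client.get(url, params=params)
        resp.raise_for_status()
        data = resp.json()
        all_items.extend(data["items"])
        if params["offset"] + limit >= data["total"]:
            break
        params["offset"] += limit
    return all_items


fake = FakeClient(list(range(120)))
items = fetch_all(fake, "/query", {"entity_types": "meeting"})
print(len(items))   # 120
print(fake.calls)   # [(0, 50), (50, 50), (100, 50)]
```

With 120 records and the default limit of 50, the loop issues three requests and stops once offset + limit reaches the reported total.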

Pattern 1: Emails Involving a Specific Person

Emails have from_contact_id, to_contact_ids, and cc_contact_ids. The query API's contact_ids filter matches entities where the contact appears in any of these roles.

@app.cell
def find_person(resolve_contact, CONTACTDB):
    target = resolve_contact("Alice", CONTACTDB)
    target_id = target["id"]
    target_name = target["name"]
    return (target_id, target_name,)

@app.cell
def fetch_emails(fetch_all, DATAINDEX, target_id):
    emails = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "email",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_order": "desc",
    })
    return (emails,)

@app.cell
def email_table(emails, target_id, pl):
    email_df = pl.DataFrame([{
        "date": e["timestamp"][:10],
        "subject": e.get("title", "(no subject)"),
        "direction": (
            "sent" if str(target_id) == str(e.get("from_contact_id"))
            else "received"
        ),
        "snippet": (e.get("snippet") or e.get("text_content") or "")[:100],
    } for e in emails])
    return (email_df,)

@app.cell
def show_emails(email_df, target_name, mo):
    mo.md(f"## Emails involving {target_name} ({len(email_df)} total)")

@app.cell
def display_email_table(email_df):
    email_df  # Renders as interactive table in marimo edit
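The table above labels every email not sent by the target as "received", which also covers emails where they were only copied. If that distinction matters, the three ID fields can be checked separately. A minimal sketch, assuming `to_contact_ids` and `cc_contact_ids` are lists of IDs (their exact shape is not shown in this guide); `contact_role` is a hypothetical helper, and the sample records are made up.

```python
def contact_role(email, contact_id):
    """Return 'sent', 'to', 'cc', or 'unknown' for one contact on one email."""
    cid = str(contact_id)
    if str(email.get("from_contact_id")) == cid:
        return "sent"
    # Compare as strings so integer and string IDs match, as in the table above
    if cid in {str(i) for i in email.get("to_contact_ids") or []}:
        return "to"
    if cid in {str(i) for i in email.get("cc_contact_ids") or []}:
        return "cc"
    return "unknown"


sample = [
    {"from_contact_id": 7, "to_contact_ids": [3], "cc_contact_ids": []},
    {"from_contact_id": 3, "to_contact_ids": [7, 9], "cc_contact_ids": None},
    {"from_contact_id": 3, "to_contact_ids": [9], "cc_contact_ids": ["7"]},
]
print([contact_role(e, 7) for e in sample])  # ['sent', 'to', 'cc']
```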

Pattern 2: Meetings with a Specific Participant

Meetings have a participants list where each entry may or may not have a resolved contact_id. The query API's contact_ids filter only matches resolved participants.

Strategy: Query by contact_ids to get meetings with resolved participants, then optionally do a client-side check on participants[].display_name or transcript for unresolved ones.

@app.cell
def fetch_meetings(fetch_all, DATAINDEX, target_id):
    # Get meetings where the target appears in contact_ids
    resolved_meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (resolved_meetings,)

@app.cell
def meeting_table(resolved_meetings, pl):
    _rows = []
    for _m in resolved_meetings:
        _participants = _m.get("participants", [])
        _names = [_p["display_name"] for _p in _participants]
        _rows.append({
            "date": (_m.get("start_time") or _m["timestamp"])[:10],
            "title": _m.get("title", _m.get("room_name", "Untitled")),
            "participants": ", ".join(_names),
            "has_transcript": _m.get("transcript") is not None,
            "has_summary": _m.get("summary") is not None,
        })
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)

To also find meetings where the person was present but not resolved (guest), search the transcript:

@app.cell
def search_unresolved(client, DATAINDEX, target_name):
    # Semantic search for the person's name in meeting transcripts
    _resp = client.post(f"{DATAINDEX}/search", json={
        "search_text": target_name,
        "entity_types": ["meeting"],
        "limit": 50,
    })
    _resp.raise_for_status()
    transcript_hits = _resp.json()["results"]
    return (transcript_hits,)
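The client-side check on participants[].display_name mentioned in the strategy could look like the following. This is a sketch, assuming unresolved participants either lack a `contact_id` or carry `None`, and that a case-insensitive substring match on the name is good enough. `mentions_unresolved` is a hypothetical helper, and the sample meetings are made up.

```python
def mentions_unresolved(meeting, name):
    """True if an unresolved participant's display_name contains `name`."""
    needle = name.casefold()
    for participant in meeting.get("participants") or []:
        if participant.get("contact_id") is not None:
            continue  # resolved participants are already covered by contact_ids
        if needle in (participant.get("display_name") or "").casefold():
            return True
    return False


candidates = [
    {"title": "Sync", "participants": [{"display_name": "Alice (guest)"}]},
    {"title": "Review", "participants": [{"display_name": "Alice", "contact_id": 12}]},
    {"title": "Standup", "participants": [{"display_name": "Bob", "contact_id": None}]},
]
guest_hits = [m["title"] for m in candidates if mentions_unresolved(m, "alice")]
print(guest_hits)  # ['Sync']
```

In a notebook this would need to run over a broader set of meetings than resolved_meetings, for example a date-range query without contact_ids, because the contact_ids filter only returns resolved matches.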

Pattern 3: Calendar Events → Meeting Correlation

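Calendar events and meetings are separate entities from different connectors. To find which calendar events had a corresponding recorded meeting, match on start time: the example treats a meeting that starts within 15 minutes of the event start as the recording of that event. The cells assume my_id, your own contact ID, is returned by an earlier cell that this guide does not show.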

@app.cell
def fetch_calendar_and_meetings(fetch_all, DATAINDEX, my_id):
    events = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "calendar_event",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "asc",
    })
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (events, meetings,)

@app.cell
def correlate(events, meetings, pl):
    # Import datetime in this cell only: a top-level import defines the name notebook-wide
    from datetime import datetime

    def _parse_dt(s):
        if not s:
            return None
        return datetime.fromisoformat(s.replace("Z", "+00:00"))

    # Index meetings by start_time for matching
    _meeting_by_time = {}
    for _m in meetings:
        _start = _parse_dt(_m.get("start_time"))
        if _start:
            _meeting_by_time[_start] = _m

    _rows = []
    for _ev in events:
        _ev_start = _parse_dt(_ev.get("start_time"))
        _ev_end = _parse_dt(_ev.get("end_time"))
        if not _ev_start:
            continue

        # Find meeting within 15-min window of calendar event start
        _matched = None
        for _m_start, _m in _meeting_by_time.items():
            if abs((_m_start - _ev_start).total_seconds()) < 900:
                _matched = _m
                break

        _rows.append({
            "date": _ev_start.strftime("%Y-%m-%d"),
            "time": _ev_start.strftime("%H:%M"),
            "event_title": _ev.get("title", "(untitled)"),
            "has_recording": _matched is not None,
            "meeting_title": _matched.get("title", "") if _matched else "",
            "attendee_count": len(_ev.get("attendees", [])),
        })

    calendar_df = pl.DataFrame(_rows)
    return (calendar_df,)
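The loop above stops at the first meeting inside the 15-minute window, which is not necessarily the closest one when several meetings start close together. A variant that keeps the nearest match is sketched below using only the standard library. `closest_meeting` is a hypothetical helper, and the sample meetings are made up.

```python
from datetime import datetime, timedelta


def parse_dt(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")) if value else None


def closest_meeting(event_start, meetings, window=timedelta(minutes=15)):
    """Return the meeting whose start is nearest to event_start, within window."""
    best, best_gap = None, window
    for meeting in meetings:
        start = parse_dt(meeting.get("start_time"))
        if start is None:
            continue
        gap = abs(start - event_start)
        if gap < best_gap:  # strictly inside the window, and closer than any so far
            best, best_gap = meeting, gap
    return best


meetings = [
    {"title": "Early call", "start_time": "2026-01-05T09:48:00Z"},
    {"title": "Planning", "start_time": "2026-01-05T10:02:00Z"},
    {"title": "No timestamp", "start_time": None},
]
event_start = parse_dt("2026-01-05T10:00:00Z")
print(closest_meeting(event_start, meetings)["title"])              # Planning
print(closest_meeting(parse_dt("2026-01-05T14:00:00Z"), meetings))  # None
```

Both meetings fall inside the window for the 10:00 event; a first-match loop would pick "Early call" (12 minutes away), while the nearest-match version picks "Planning" (2 minutes away).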

Pattern 4: Full Interaction Timeline for a Person

Combine emails, meetings, and Zulip messages into a single chronological view.

@app.cell
def fetch_all_interactions(fetch_all, DATAINDEX, target_id):
    all_entities = fetch_all(f"{DATAINDEX}/query", {
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "desc",
    })
    return (all_entities,)

@app.cell
def interaction_timeline(all_entities, pl):
    _rows = []
    for _e in all_entities:
        _etype = _e["entity_type"]
        _summary = ""
        if _etype == "email":
            _summary = _e.get("snippet") or _e.get("title") or ""
        elif _etype == "meeting":
            _summary = _e.get("summary") or _e.get("title") or ""
        elif _etype == "conversation_message":
            _summary = (_e.get("message") or "")[:120]
        elif _etype == "threaded_conversation":
            _summary = _e.get("title") or ""
        elif _etype == "calendar_event":
            _summary = _e.get("title") or ""
        else:
            _summary = _e.get("title") or _e["entity_type"]

        _rows.append({
            "date": _e["timestamp"][:10],
            "type": _etype,
            "source": _e["connector_id"],
            "summary": _summary[:120],
        })

    timeline_df = pl.DataFrame(_rows)
    return (timeline_df,)

@app.cell
def show_timeline(timeline_df, target_name, mo):
    mo.md(f"## Interaction Timeline: {target_name} ({len(timeline_df)} events)")

@app.cell
def display_timeline(timeline_df):
    timeline_df
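A per-type count is a quick sanity check on the timeline before reading it row by row. A minimal sketch using only the standard library, with made-up rows in the same shape as the `_rows` list built above (the `source` values are placeholders, not real connector IDs):

```python
from collections import Counter

rows = [
    {"date": "2026-01-12", "type": "email", "source": "mail", "summary": "Re: budget"},
    {"date": "2026-01-12", "type": "meeting", "source": "rooms", "summary": "Weekly sync"},
    {"date": "2026-01-10", "type": "email", "source": "mail", "summary": "Budget draft"},
]

# Count how many timeline rows each entity type contributes
type_counts = Counter(row["type"] for row in rows)
print(type_counts.most_common())  # [('email', 2), ('meeting', 1)]
```

A type that is missing or unexpectedly dominant usually points at a filter problem in the fetch cell, not in the timeline itself.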

Do / Don't — Quick Reference for LLM Agents

When generating marimo notebooks, follow these rules strictly. Breaking the naming rules causes MultipleDefinitionError, and the notebook refuses to run.

Do

Prefix cell-local variables with _ — _resp, _rows, _m, _data, _chunk. Marimo ignores _-prefixed names so they won't clash across cells.
Import shared modules once in setup and pass them as cell parameters: def my_cell(client, mo, pl):.
Give returned DataFrames unique names — email_df, meeting_df, timeline_df. Never use a bare df that might collide with another cell.
Return only values other cells need — everything else should be _-prefixed and stays private to the cell.
Import stdlib names such as from datetime import datetime in the one cell that needs them. A top-level import defines a name just like an assignment, so if two cells need the same module, import it once in setup or alias it to a _-prefixed name (from datetime import datetime as _datetime).
Every non-utility cell must show a preview — see the "Cell Output Previews" section below.
Put all user parameters in a params cell as the first cell — date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook.

Don't

Don't define the same variable name in two cells — even resp = ... in cell A and resp = ... in cell B is a fatal error.
Don't import marimo as mo in multiple cells — this defines mo twice. Import it once in setup, then receive it via def my_cell(mo):.
Don't use generic top-level names like df, rows, resp, data, result — either prefix with _ or give them a unique descriptive name.
Don't return temporary variables — if _rows is only used to build a DataFrame, keep it _-prefixed and only return the DataFrame.
Don't use import X at the top level of multiple cells for the same module, because the module name would be defined twice. Import once in setup, or alias the import to a _-prefixed name (import json as _json).

Cell Output Previews

Every cell that fetches, transforms, or produces data must display a preview so the user can validate results at each step. The only exceptions are utility cells (config, setup, helpers) that only define constants or functions.

Think from the user's perspective: when they open the notebook in marimo edit, each cell should tell them something useful — a count, a sample, a summary. Silent cells that do work but show nothing are hard to debug and validate.

What to show

Cell type	What to preview
API fetch (list of items)	`mo.md(f"Fetched {len(items)} meetings")`
DataFrame build	The DataFrame itself as last expression (renders as interactive table)
Scalar result	`mo.md(f"Contact: {name} (id={contact_id})")`
Search / filter	`mo.md(f"{len(hits)} results matching '{term}'")`
Final output	Full DataFrame or `mo.md()` summary as last expression

Example: fetch cell with preview

Bad — cell runs silently, user sees nothing:

@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    return (meetings,)

Good — cell shows a count so the user knows it worked:

@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id, mo):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    mo.md(f"**Fetched {len(meetings)} meetings**")
    return (meetings,)

Example: transform cell with table preview

Bad — builds DataFrame but doesn't display it:

@app.cell
def build_table(meetings, pl):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)

Good: the build cell shows a heading with the result count, and a separate cell makes the DataFrame its last expression so marimo renders it as an interactive table:

@app.cell
def build_table(meetings, pl, mo):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows).sort("date")
    mo.md(f"### Meetings ({len(meeting_df)} results)")
    return (meeting_df,)

@app.cell
def show_meeting_table(meeting_df):
    meeting_df  # Renders as interactive sortable table

Utility cells (no preview needed)

Config, setup, and helper cells that only define constants or functions don't need previews:

@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return CONTACTDB, DATAINDEX

@app.cell
def helpers(client):
    def fetch_all(url, params):
        ...
    return (fetch_all,)

Tips

Use marimo edit during development to see cell outputs interactively
Make raw API responses the last expression in a cell to inspect their structure
Prefer polars over pandas for DataFrames: it is generally faster on large frames and enforces one strict type per column
Set timeout=30 on httpx clients — some queries over large date ranges are slow
Name cells descriptively — function names appear in the marimo sidebar
