Mathieu Virbel eefac81e57 feat: migrate to skills-based approach

2026-02-19 11:36:32 -06:00

name	description	user-invocable
notebook-patterns	Marimo notebook patterns for InternalAI data analysis. Use when creating or editing marimo notebooks — covers cell scoping, async cells, pagination helpers, analysis patterns, and do/don't rules.	false

Marimo Notebook Patterns

This guide covers how to create marimo notebooks for data analysis against the InternalAI platform APIs. Marimo notebooks are plain .py files with reactive cells — no .ipynb format, no Jupyter dependency.

Marimo Basics

A marimo notebook is a Python file with @app.cell decorated functions. Each cell returns values as a tuple, and other cells receive them as function parameters — marimo builds a reactive DAG automatically.

import marimo
app = marimo.App()

@app.cell
def cell_one():
    x = 42
    return (x,)

@app.cell
def cell_two(x):
    # Re-runs automatically when x changes
    result = x * 2
    return (result,)

Key rules:

Cells declare dependencies via function parameters
Cells return values as tuples: return (var1, var2,)
The last expression at the top level of a cell is displayed as rich output in the marimo UI (dataframes render as tables, dicts as collapsible trees). Expressions inside if/else/for blocks do not count — see Cell Output Must Be at the Top Level below
Use mo.md("# heading") for formatted markdown output (import mo once in setup — see below)
No manual execution order; the DAG determines it
Variable names must be unique across cells. Every variable assigned at the top level of a cell is tracked by marimo's DAG. If two cells both define resp, marimo raises MultipleDefinitionError and refuses to run. Prefix cell-local variables with _ (e.g., _resp, _rows, _data) to make them private to that cell — marimo ignores _-prefixed names.
All imports must go in the setup cell. Every import statement creates a top-level variable (e.g., import asyncio defines asyncio). If two cells both import asyncio, marimo raises MultipleDefinitionError. Place all imports in a single setup cell and pass them as cell parameters. Do NOT import marimo as mo or import asyncio in multiple cells — import once in setup, then receive via def my_cell(mo, asyncio):.

Cell Variable Scoping — Example

This is the most common mistake. Any variable assigned at the top level of a cell (not inside a def or comprehension) is tracked by marimo. If two cells assign the same name, the notebook refuses to run.

BROKEN — resp is defined at top level in both cells:

# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    resp = client.post(f"{DATAINDEX}/search", json={...})  # defines 'resp'
    resp.raise_for_status()
    results = resp.json()["results"]
    return (results,)

# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # also defines 'resp' → ERROR
    meeting = resp.json()
    return (meeting,)

Error: MultipleDefinitionError: variable 'resp' is defined in multiple cells

FIXED — prefix cell-local variables with _:

# Cell A
@app.cell
def search_meetings(client, DATAINDEX):
    _resp = client.post(f"{DATAINDEX}/search", json={...})  # _resp is cell-private
    _resp.raise_for_status()
    results = _resp.json()["results"]
    return (results,)

# Cell B
@app.cell
def fetch_details(client, DATAINDEX, results):
    _resp = client.get(f"{DATAINDEX}/entities/{results[0]}")  # _resp is cell-private, no conflict
    meeting = _resp.json()
    return (meeting,)

Rule of thumb: if a variable is only used within the cell to compute a return value, prefix it with _. Only leave names unprefixed if another cell needs to receive them.

Note: Variables inside nested def functions are naturally local and don't need _ prefixes — e.g., resp inside a def fetch_all(...) helper is fine because it's scoped to the function, not the cell.
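
The scoping rule can be approximated with Python's ast module. The sketch below is not marimo's implementation and makes simplifying assumptions: it only looks at assignments, imports, for-loop targets, and def/class names, and it ignores with/except targets. The names cell_definitions and find_collisions are hypothetical helpers for illustration, and each cell is represented by its body source only.

```python
import ast
from collections import defaultdict


def _target_names(target):
    """Names bound by an assignment target; attribute/subscript targets bind nothing new."""
    if isinstance(target, ast.Name):
        return [target.id]
    if isinstance(target, (ast.Tuple, ast.List)):
        return [name for elt in target.elts for name in _target_names(elt)]
    return []


def _cell_level_statements(stmts):
    """Yield cell-level statements, descending into if/for/while/try/with blocks
    but not into def or class bodies (those have their own scope)."""
    for stmt in stmts:
        yield stmt
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        for field in ("body", "orelse", "finalbody"):
            yield from _cell_level_statements(getattr(stmt, field, []))
        for handler in getattr(stmt, "handlers", []):
            yield from _cell_level_statements(handler.body)


def cell_definitions(cell_body):
    """Approximate the set of public (non-underscore) names a cell body defines."""
    names = set()
    for stmt in _cell_level_statements(ast.parse(cell_body).body):
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(stmt.name)
        elif isinstance(stmt, (ast.Import, ast.ImportFrom)):
            names.update((a.asname or a.name).split(".")[0] for a in stmt.names)
        elif isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                names.update(_target_names(target))
        elif isinstance(stmt, (ast.AnnAssign, ast.AugAssign, ast.For, ast.AsyncFor)):
            names.update(_target_names(stmt.target))
    return {name for name in names if not name.startswith("_")}


def find_collisions(cells):
    """Map each public name defined by more than one cell to the cells defining it."""
    owners = defaultdict(list)
    for cell_name, body in cells.items():
        for name in sorted(cell_definitions(body)):
            owners[name].append(cell_name)
    return {name: cell_names for name, cell_names in owners.items() if len(cell_names) > 1}


broken = {
    "search_meetings": 'resp = client.post(url)\nresults = resp.json()["results"]',
    "fetch_details": "resp = client.get(url)\nmeeting = resp.json()",
}
fixed = {
    "search_meetings": '_resp = client.post(url)\nresults = _resp.json()["results"]',
    "fetch_details": "_resp = client.get(url)\nmeeting = _resp.json()",
}
print(find_collisions(broken))  # {'resp': ['search_meetings', 'fetch_details']}
print(find_collisions(fixed))   # {}
```

The check only parses the source, so client and url do not need to exist. For real notebooks, marimo check remains the authoritative tool.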

Cell Output Must Be at the Top Level

Marimo only renders the last expression at the top level of a cell as rich output. An expression buried inside an if/else, for, try, or any other block is not displayed — it's silently discarded.

BROKEN — _df inside the if branch is never rendered, and mo.md() inside if/else is also discarded:

@app.cell
def show_results(results, mo, pl):
    if results:
        _df = pl.DataFrame(results)
        mo.md(f"**Found {len(results)} results**")
        _df  # Inside an if block — marimo does NOT display this
    else:
        mo.md("**No results found**")  # Also inside a block — NOT displayed
    return

FIXED — split into separate cells. Each cell displays exactly one thing at the top level:

# Cell 1: build the data, return it
@app.cell
def build_results(results, pl):
    results_df = pl.DataFrame(results) if results else None
    return (results_df,)

# Cell 2: heading — mo.md() is the top-level expression (use ternary for conditional text)
@app.cell
def show_results_heading(results_df, mo):
    mo.md(f"**Found {len(results_df)} results**" if results_df is not None else "**No results found**")

# Cell 3: table — DataFrame is the top-level expression
@app.cell
def show_results_table(results_df):
    results_df  # Top-level expression — marimo renders this as interactive table

Rules:

Each cell should display one thing — either mo.md() OR a DataFrame, never both
mo.md() must be a top-level expression, not inside if/else/for/try blocks
Build conditional text using variables or ternary expressions, then call mo.md(_text) at the top level
For DataFrames, use a standalone display cell: def show_table(df): df
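
What "top level" means here can be checked with Python's ast module. The sketch below is an approximation of the rule as described above, not marimo's own code: it reports the final top-level statement of a cell body only when that statement is a bare expression. last_top_level_expression and the two sample bodies are hypothetical, and the bodies omit the cell's trailing return. The second body shows one way to keep conditional output visible: assign inside the branches, then end with a bare expression.

```python
import ast
from typing import Optional


def last_top_level_expression(cell_body: str) -> Optional[str]:
    """Return the source of the final top-level statement if it is a bare expression."""
    statements = ast.parse(cell_body).body
    if not statements:
        return None
    last = statements[-1]
    return ast.unparse(last) if isinstance(last, ast.Expr) else None


# Display expressions are nested inside if/else, so the last top-level
# statement is the `if` itself and nothing qualifies as output.
BROKEN = '''
if results:
    _df = pl.DataFrame(results)
    _df
else:
    mo.md("**No results found**")
'''

# Assign inside the branches, then finish with a bare top-level expression.
FIXED = '''
_output = None
if results:
    _output = pl.DataFrame(results)
else:
    _output = mo.md("**No results found**")
_output
'''

print(last_top_level_expression(BROKEN))  # None
print(last_top_level_expression(FIXED))   # _output
```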

Async Cells

When a cell uses await (e.g., for llm_call or asyncio.gather), you must declare it as async def:

@app.cell
async def analyze(meetings, llm_call, ResponseModel, asyncio):
    async def _score(meeting):
        return await llm_call(prompt=..., response_model=ResponseModel)

    results = await asyncio.gather(*[_score(_m) for _m in meetings])
    return (results,)

Note that asyncio is imported in the setup cell and received here as a parameter — never import asyncio inside individual cells.

If you write await in a non-async cell, marimo cannot parse the cell and saves it as an _unparsable_cell string literal — the cell won't run, and you'll see SyntaxError: 'return' outside function or similar errors. See Fixing _unparsable_cell below.
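
The underlying constraint is plain Python, independent of marimo: await is only valid inside an async def. A minimal sketch using the built-in compile() shows the difference; the cell sources are made-up strings and compiles is a hypothetical helper.

```python
def compiles(source: str) -> bool:
    """Return True if Python can compile the source, False on SyntaxError."""
    try:
        compile(source, "<cell>", "exec")
        return True
    except SyntaxError:
        return False


SYNC_CELL = (
    "def analyze(meetings, llm_call):\n"
    "    results = await llm_call(meetings)\n"
    "    return (results,)\n"
)
ASYNC_CELL = "async " + SYNC_CELL  # same body, declared with async def

print(compiles(SYNC_CELL), compiles(ASYNC_CELL))  # False True
```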

Cells That Define Classes Must Return Them

If a cell defines Pydantic models (or any class) that other cells need, it must return them:

# BaseModel and Field are imported in the setup cell and received as parameters
@app.cell
def models(BaseModel, Field):
    class MeetingSentiment(BaseModel):
        overall_sentiment: str
        sentiment_score: int = Field(description="Score from -10 to +10")

    class FrustrationExtraction(BaseModel):
        has_frustrations: bool
        frustrations: list[dict]

    return MeetingSentiment, FrustrationExtraction  # Other cells receive these as parameters

A bare return (or no return) means those classes are invisible to the rest of the notebook.

Fixing `_unparsable_cell`

When marimo can't parse a cell into a proper @app.cell function, it saves the raw code as app._unparsable_cell("...", name="cell_name"). These cells won't run and show errors like SyntaxError: 'return' outside function.

Common causes:

Using await without making the cell async def
Using return in code that marimo failed to wrap into a function (usually a side effect of the first cause)

How to fix: Convert the _unparsable_cell string back into a proper @app.cell decorated function:

# BROKEN — saved as _unparsable_cell because of top-level await
app._unparsable_cell("""
results = await asyncio.gather(...)
return results
""", name="my_cell")

# FIXED — proper async cell function (asyncio imported in setup, received as parameter)
@app.cell
async def my_cell(some_dependency, asyncio):
    results = await asyncio.gather(...)
    return (results,)

Key differences to note when converting:

Wrap the code in an async def function (if it uses await)
Add cell dependencies as function parameters (including imports like asyncio)
Return values as tuples: return (var,) not return var
Prefix cell-local variables with _
Never add import statements inside the cell — all imports belong in setup

Inline Dependencies with PEP 723

Use PEP 723 /// script metadata so uv run auto-installs dependencies:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope[openai]",
#     "pydantic",
#     "python-dotenv",
# ]
# ///

Checking Notebooks Before Running

Always run marimo check before opening or running a notebook. It catches common issues — duplicate variable definitions, _unparsable_cell blocks, branch expressions that won't display, and more — without needing to start the full editor:

uvx marimo check notebook.py           # Check a single notebook
uvx marimo check workflows/            # Check all notebooks in a directory
uvx marimo check --fix notebook.py     # Auto-fix fixable issues

Run this after every edit. A clean marimo check (no output, exit code 0) means the notebook is structurally valid. Any errors must be fixed before running.

Running Notebooks

uvx marimo edit notebook.py   # Interactive editor (best for development)
uvx marimo run notebook.py    # Read-only web app
uv run notebook.py            # Script mode (terminal output)

Inspecting Cell Outputs

In marimo edit, the last top-level expression of each cell is displayed as rich output below the cell. This is the primary way to introspect API responses:

Dicts/lists render as collapsible JSON trees — click to expand nested fields
Polars/Pandas DataFrames render as interactive sortable tables
Strings render as plain text

To inspect a raw API response, just make it the last expression:

@app.cell
def inspect_response(client, DATAINDEX):
    _resp = client.get(f"{DATAINDEX}/query", params={
        "entity_types": "meeting", "limit": 2,
    })
    _resp.json()  # This gets displayed as a collapsible JSON tree

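To inspect an intermediate value alongside other work, wrap it in mo.accordion and make that the last top-level expression. Only one expression per cell is rendered, so fold the count into the accordion label instead of adding a separate mo.md() line above it:

@app.cell
def debug_meetings(meetings, mo):
    # Show the count and the first item's structure in a single rendered output
    mo.accordion({f"First meeting raw ({len(meetings)} total)": mo.json(meetings[0])}) if meetings else None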

Notebook Skeleton

Every notebook against InternalAI follows this structure:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope[openai]",
#     "pydantic",
#     "python-dotenv",
# ]
# ///

import marimo
app = marimo.App()

@app.cell
def params():
    """User parameters — edit these to change the workflow's behavior."""
    SEARCH_TERMS = ["greyhaven"]
    DATE_FROM = "2026-01-01T00:00:00Z"
    DATE_TO = "2026-02-01T00:00:00Z"
    TARGET_PERSON = None  # Set to a name like "Alice" to filter by person, or None for all
    return DATE_FROM, DATE_TO, SEARCH_TERMS, TARGET_PERSON

@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return (CONTACTDB, DATAINDEX,)

@app.cell
def setup():
    from dotenv import load_dotenv
    load_dotenv(".env")  # Load .env from the project root

    import asyncio  # All imports go here — never import inside other cells
    import httpx
    import marimo as mo
    import polars as pl
    from pydantic import BaseModel, Field
    client = httpx.Client(timeout=30)
    return (asyncio, client, mo, pl, BaseModel, Field,)

# --- your IN / ETL / OUT cells here ---

if __name__ == "__main__":
    app.run()

load_dotenv(".env") reads the .env file explicitly by name. This makes LLM_API_KEY and other env vars available to os.getenv() calls in lib/llm.py without requiring the shell to have them pre-set. Always include python-dotenv in PEP 723 dependencies and call load_dotenv(".env") early in the setup cell.

The params cell must always be the first cell after app = marimo.App(). It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values. This way the user can tweak the workflow by editing a single cell at the top — no need to hunt through the code for hardcoded values.

Pagination Helper

The DataIndex GET /query endpoint paginates with limit and offset. Always paginate — result sets can be large.

@app.cell
def helpers(client):
    def fetch_all(url, params):
        """Fetch all pages from a paginated DataIndex endpoint."""
        all_items = []
        limit = params.get("limit", 50)
        params = {**params, "limit": limit, "offset": 0}
        while True:
            resp = client.get(url, params=params)
            resp.raise_for_status()
            data = resp.json()
            all_items.extend(data["items"])
            if params["offset"] + limit >= data["total"]:
                break
            params["offset"] += limit
        return all_items

    def resolve_contact(name, contactdb_url):
        """Find a contact by name, return their ID."""
        resp = client.get(f"{contactdb_url}/api/contacts", params={"search": name})
        resp.raise_for_status()
        contacts = resp.json()["contacts"]
        if not contacts:
            raise ValueError(f"No contact found for '{name}'")
        return contacts[0]

    return (fetch_all, resolve_contact,)
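
To see the offset arithmetic without a running DataIndex instance, the same loop can be exercised against an in-memory stand-in. This is a sketch under assumptions: FakeClient and FakeResponse are hypothetical test doubles, the URL is a placeholder, and fetch_all takes the client as an explicit argument here instead of closing over it. The response shape (items, total) follows the helper above.

```python
class FakeResponse:
    """Minimal stand-in for an httpx response."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # the fake never fails

    def json(self):
        return self._payload


class FakeClient:
    """Stand-in for httpx.Client that serves slices of an in-memory list."""

    def __init__(self, records):
        self.records = records
        self.calls = []  # copy of the params sent on each request

    def get(self, url, params):
        self.calls.append(dict(params))
        start = params["offset"]
        page = self.records[start:start + params["limit"]]
        return FakeResponse({"items": page, "total": len(self.records)})


def fetch_all(client, url, params):
    """Fetch all pages from a limit/offset paginated endpoint."""
    all_items = []
    limit = params.get("limit", 50)
    params = {**params, "limit": limit, "offset": 0}
    while True:
        resp = client.get(url, params=params)
        resp.raise_for_status()
        data = resp.json()
        all_items.extend(data["items"])
        if params["offset"] + limit >= data["total"]:
            break
        params["offset"] += limit
    return all_items


client = FakeClient([{"id": i} for i in range(120)])
items = fetch_all(client, "http://example.invalid/query", {"entity_types": "meeting", "limit": 50})
print(len(items), [call["offset"] for call in client.calls])  # 120 [0, 50, 100]
```

With 120 records and a limit of 50, the loop issues three requests and stops once offset + limit reaches the reported total.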

Pattern 1: Emails Involving a Specific Person

Emails have from_contact_id, to_contact_ids, and cc_contact_ids. The query API's contact_ids filter matches entities where the contact appears in any of these roles.

@app.cell
def find_person(resolve_contact, CONTACTDB):
    target = resolve_contact("Alice", CONTACTDB)
    target_id = target["id"]
    target_name = target["name"]
    return (target_id, target_name,)

@app.cell
def fetch_emails(fetch_all, DATAINDEX, target_id):
    emails = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "email",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_order": "desc",
    })
    return (emails,)

@app.cell
def email_table(emails, target_id, target_name, pl):
    email_df = pl.DataFrame([{
        "date": e["timestamp"][:10],
        "subject": e.get("title", "(no subject)"),
        "direction": (
            "sent" if str(target_id) == str(e.get("from_contact_id"))
            else "received"
        ),
        "snippet": (e.get("snippet") or e.get("text_content") or "")[:100],
    } for e in emails])
    return (email_df,)

@app.cell
def show_emails(email_df, target_name, mo):
    mo.md(f"## Emails involving {target_name} ({len(email_df)} total)")

@app.cell
def display_email_table(email_df):
    email_df  # Renders as interactive table in marimo edit

Pattern 2: Meetings with a Specific Participant

Meetings have a participants list where each entry may or may not have a resolved contact_id. The query API's contact_ids filter only matches resolved participants.

Strategy: Query by contact_ids to get meetings with resolved participants, then optionally do a client-side check on participants[].display_name or transcript for unresolved ones.

Always include room_name in meeting tables. The room_name field contains the virtual room name (e.g., standup-office-bogota) and often indicates where the meeting took place. It's useful context when title is generic or missing — include it as a column alongside title.

@app.cell
def fetch_meetings(fetch_all, DATAINDEX, target_id):
    # Get meetings where the target appears in contact_ids
    resolved_meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (resolved_meetings,)

@app.cell
def meeting_table(resolved_meetings, target_name, pl):
    _rows = []
    for _m in resolved_meetings:
        _participants = _m.get("participants", [])
        _names = [_p["display_name"] for _p in _participants]
        _rows.append({
            "date": (_m.get("start_time") or _m["timestamp"])[:10],
            "title": _m.get("title", "Untitled"),
            "room_name": _m.get("room_name", ""),
            "participants": ", ".join(_names),
            "has_transcript": _m.get("transcript") is not None,
            "has_summary": _m.get("summary") is not None,
        })
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)

To also find meetings where the person was present but not resolved (guest), search the transcript:

@app.cell
def search_unresolved(client, DATAINDEX, target_name):
    # Semantic search for the person's name in meeting transcripts
    _resp = client.post(f"{DATAINDEX}/search", json={
        "search_text": target_name,
        "entity_types": ["meeting"],
        "limit": 50,
    })
    _resp.raise_for_status()
    transcript_hits = _resp.json()["results"]
    return (transcript_hits,)
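
The client-side check on participants[].display_name mentioned in the strategy could look like the sketch below. It assumes the input is a list of meetings fetched without the contact_ids filter, that unresolved participants have a missing or null contact_id, and that a case-insensitive substring match on the name is good enough. find_unresolved_mentions and the sample data are hypothetical.

```python
def find_unresolved_mentions(meetings, target_name):
    """Return meetings where a participant without a contact_id has a matching display_name."""
    needle = target_name.casefold()
    hits = []
    for meeting in meetings:
        for participant in meeting.get("participants", []):
            name = (participant.get("display_name") or "").casefold()
            if participant.get("contact_id") is None and needle in name:
                hits.append(meeting)
                break  # one matching participant is enough for this meeting
    return hits


sample_meetings = [
    {"title": "Standup", "participants": [{"display_name": "Alice Martin", "contact_id": None}]},
    {"title": "Planning", "participants": [{"display_name": "Alice Martin", "contact_id": 7}]},
    {"title": "Retro", "participants": [{"display_name": "Bob", "contact_id": None}]},
]
print([m["title"] for m in find_unresolved_mentions(sample_meetings, "alice")])  # ['Standup']
```

Substring matching on names can produce false positives, so treat the result as candidates to review, not as confirmed attendance.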

Pattern 3: Calendar Events → Meeting Correlation

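Calendar events and meetings are separate entities from different connectors. To find which calendar events had a corresponding recorded meeting, match them by start-time proximity: the example below treats a meeting as a match when it starts within 15 minutes of the calendar event. The cells assume that my_id (your own contact ID) is returned by an earlier cell.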

@app.cell
def fetch_calendar_and_meetings(fetch_all, DATAINDEX, my_id):
    events = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "calendar_event",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "asc",
    })
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
        "date_from": "2025-01-01T00:00:00Z",
    })
    return (events, meetings,)

@app.cell
def correlate(events, meetings, pl):
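    def _parse_dt(s):
        # Import inside the helper so datetime stays function-local instead of
        # becoming a cell-level definition that could clash with another cell
        from datetime import datetime
        if not s:
            return None
        return datetime.fromisoformat(s.replace("Z", "+00:00"))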

    # Index meetings by start_time for matching
    _meeting_by_time = {}
    for _m in meetings:
        _start = _parse_dt(_m.get("start_time"))
        if _start:
            _meeting_by_time[_start] = _m

    _rows = []
    for _ev in events:
        _ev_start = _parse_dt(_ev.get("start_time"))
        _ev_end = _parse_dt(_ev.get("end_time"))
        if not _ev_start:
            continue

        # Find meeting within 15-min window of calendar event start
        _matched = None
        for _m_start, _m in _meeting_by_time.items():
            if abs((_m_start - _ev_start).total_seconds()) < 900:
                _matched = _m
                break

        _rows.append({
            "date": _ev_start.strftime("%Y-%m-%d"),
            "time": _ev_start.strftime("%H:%M"),
            "event_title": _ev.get("title", "(untitled)"),
            "has_recording": _matched is not None,
            "meeting_title": _matched.get("title", "") if _matched else "",
            "attendee_count": len(_ev.get("attendees", [])),
        })

    calendar_df = pl.DataFrame(_rows)
    return (calendar_df,)
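
The loop above takes the first meeting that falls inside the window. When several meetings start close together, picking the nearest one may be preferable. The sketch below is one possible variation, not part of the original workflow: closest_meeting and parse_dt are hypothetical helpers, and the sample records only carry the title and start_time fields.

```python
from datetime import datetime, timedelta


def parse_dt(value):
    """Parse an ISO 8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")) if value else None


def closest_meeting(event_start, meetings, window=timedelta(minutes=15)):
    """Return the meeting whose start_time is nearest to event_start within the window."""
    candidates = []
    for meeting in meetings:
        start = parse_dt(meeting.get("start_time"))
        if start is not None and abs(start - event_start) < window:
            candidates.append((abs(start - event_start), meeting))
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]


sample = [
    {"title": "A", "start_time": "2026-01-05T10:12:00Z"},
    {"title": "B", "start_time": "2026-01-05T10:02:00Z"},
    {"title": "C", "start_time": "2026-01-05T14:00:00Z"},
]
match = closest_meeting(parse_dt("2026-01-05T10:00:00Z"), sample)
print(match["title"])  # B
print(closest_meeting(parse_dt("2026-01-05T12:00:00Z"), sample))  # None
```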

Pattern 4: Full Interaction Timeline for a Person

Combine emails, meetings, and Zulip messages into a single chronological view.

@app.cell
def fetch_all_interactions(fetch_all, DATAINDEX, target_id):
    all_entities = fetch_all(f"{DATAINDEX}/query", {
        "contact_ids": str(target_id),
        "date_from": "2025-01-01T00:00:00Z",
        "sort_by": "timestamp",
        "sort_order": "desc",
    })
    return (all_entities,)

@app.cell
def interaction_timeline(all_entities, target_name, pl):
    _rows = []
    for _e in all_entities:
        _etype = _e["entity_type"]
        _summary = ""
        if _etype == "email":
            _summary = _e.get("snippet") or _e.get("title") or ""
        elif _etype == "meeting":
            _summary = _e.get("summary") or _e.get("title") or ""
        elif _etype == "conversation_message":
            _summary = (_e.get("message") or "")[:120]
        elif _etype == "threaded_conversation":
            _summary = _e.get("title") or ""
        elif _etype == "calendar_event":
            _summary = _e.get("title") or ""
        else:
            _summary = _e.get("title") or _e["entity_type"]

        _rows.append({
            "date": _e["timestamp"][:10],
            "type": _etype,
            "source": _e["connector_id"],
            "summary": _summary[:120],
        })

    timeline_df = pl.DataFrame(_rows)
    return (timeline_df,)

@app.cell
def show_timeline(timeline_df, target_name, mo):
    mo.md(f"## Interaction Timeline: {target_name} ({len(timeline_df)} events)")

@app.cell
def display_timeline(timeline_df):
    timeline_df

Pattern 5: LLM Filtering with `lib.llm`

When you need to classify, score, or extract structured information from each entity (e.g. "is this meeting about project X?", "rate the relevance of this email"), use the llm_call helper from workflows/lib. It sends each item to an LLM and parses the response into a typed Pydantic model.

Prerequisites: Copy .env.example to .env and fill in your LLM_API_KEY. Add mirascope, pydantic, and python-dotenv to the notebook's PEP 723 dependencies.

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "httpx",
#     "polars",
#     "mirascope[openai]",
#     "pydantic",
#     "python-dotenv",
# ]
# ///

Setup cell — load `.env` and import `llm_call`

@app.cell
def setup():
    from dotenv import load_dotenv
    load_dotenv(".env")  # Makes LLM_API_KEY available to lib/llm.py

    import asyncio
    import httpx
    import marimo as mo
    import polars as pl
    from pydantic import BaseModel, Field
    from lib.llm import llm_call
    client = httpx.Client(timeout=30)
    return (asyncio, client, llm_call, mo, pl, BaseModel, Field,)

Define a response model

Create a Pydantic model that describes the structured output you want from the LLM:

@app.cell
def models(BaseModel, Field):

    class RelevanceScore(BaseModel):
        relevant: bool
        reason: str
        score: int  # 0-10

    return (RelevanceScore,)

Filter entities through the LLM

Iterate over fetched entities and call llm_call for each one. Since llm_call is async, use asyncio.gather to process items concurrently:

@app.cell
async def llm_filter(meetings, llm_call, RelevanceScore, pl, mo, asyncio):
    _topic = "Greyhaven"

    async def _score(meeting):
        _text = meeting.get("summary") or meeting.get("title") or ""
        _result = await llm_call(
            prompt=f"Is this meeting about '{_topic}'?\n\nMeeting: {_text}",
            response_model=RelevanceScore,
            system_prompt="Score the relevance of this meeting to the given topic. Set relevant=true if score >= 5.",
        )
        return {**meeting, "llm_relevant": _result.relevant, "llm_reason": _result.reason, "llm_score": _result.score}

    _scored = await asyncio.gather(*[_score(_m) for _m in meetings])  # cell-private: not returned
    relevant_meetings = [_m for _m in _scored if _m["llm_relevant"]]

    mo.md(f"**LLM filter:** {len(relevant_meetings)}/{len(meetings)} meetings relevant to '{_topic}'")
    return (relevant_meetings,)

Tips for LLM filtering

Keep prompts short — only include the fields the LLM needs (title, summary, snippet), not the entire raw entity.
Use structured output — always pass a response_model so you get typed fields back, not free-text.
Batch wisely — asyncio.gather sends all requests concurrently. For large datasets (100+ items), process in chunks to avoid rate limits.
Cache results — LLM calls are slow and cost money. If iterating on a notebook, consider storing scored results in a cell variable so you don't re-score on every edit.
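
Chunked processing can be sketched as follows. This is a minimal illustration, not the project's implementation: gather_in_chunks is a hypothetical helper, fake_llm_score stands in for a real llm_call-based scorer, and the chunk size of 20 is an arbitrary choice to tune against the provider's rate limits.

```python
import asyncio


async def gather_in_chunks(items, worker, chunk_size=20):
    """Run worker(item) concurrently within each chunk, one chunk at a time."""
    results = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        # At most chunk_size requests are in flight at once
        results.extend(await asyncio.gather(*(worker(item) for item in chunk)))
    return results


async def fake_llm_score(meeting):
    """Hypothetical stand-in for a scorer that awaits llm_call."""
    await asyncio.sleep(0)
    return {**meeting, "llm_score": len(meeting["title"])}


sample_meetings = [{"title": "m" * n} for n in range(1, 46)]
scored = asyncio.run(gather_in_chunks(sample_meetings, fake_llm_score, chunk_size=20))
print(len(scored), scored[0]["llm_score"], scored[-1]["llm_score"])  # 45 1 45
```

Inside an async marimo cell, await gather_in_chunks(...) directly instead of calling asyncio.run. Results come back in input order because asyncio.gather preserves the order of its awaitables.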

Do / Don't — Quick Reference for LLM Agents

When generating marimo notebooks, follow these rules strictly. Violations typically surface as MultipleDefinitionError, an _unparsable_cell block, or output that silently fails to render.

Do

Prefix cell-local variables with _ — _resp, _rows, _m, _data, _chunk. Marimo ignores _-prefixed names so they won't clash across cells.
Put all imports in the setup cell and pass them as cell parameters: def my_cell(client, mo, pl, asyncio):. Never import inside other cells — even import asyncio in two async cells causes MultipleDefinitionError.
Give returned DataFrames unique names — email_df, meeting_df, timeline_df. Never use a bare df that might collide with another cell.
Return only values other cells need — everything else should be _-prefixed and stays private to the cell.
Import stdlib modules in setup too — even from datetime import datetime creates a top-level name. If two cells both import datetime, marimo errors. Import it once in setup and receive it as a parameter, or use it inside a _-prefixed helper function where it's naturally scoped.
Every non-utility cell must show a preview — see the "Cell Output Previews" section below.
Use separate display cells for DataFrames — the build cell returns the DataFrame and shows a mo.md() count/heading; a standalone display cell (e.g., def show_table(df): df) renders it as an interactive table the user can sort and filter.
Include room_name when listing meetings — the virtual room name provides useful context about where the meeting took place (e.g., standup-office-bogota). Show it as a column alongside title.
Keep cell output expressions at the top level — if a cell conditionally displays a DataFrame, initialize _output = None before the if/else, assign inside the branches, then put _output as the last top-level expression. Expressions inside if/else/for blocks are silently ignored by marimo.
Put all user parameters in a params cell as the first cell — date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook.
Declare cells as async def when using await — @app.cell followed by async def cell_name(...). This includes cells using asyncio.gather, await llm_call(...), or any async API.
Return classes/models from cells that define them — if a cell defines class MyModel(BaseModel), return it so other cells can use it as a parameter: return (MyModel,).
Use python-dotenv to load .env — add python-dotenv to PEP 723 dependencies and call load_dotenv(".env") early in the setup cell (before importing lib.llm). This ensures LLM_API_KEY and other env vars are available without requiring them to be pre-set in the shell.

Don't

Don't define the same variable name in two cells — even resp = ... in cell A and resp = ... in cell B is a fatal error.
Don't import inside non-setup cells — every import X defines a top-level variable X. If two cells both import asyncio, marimo raises MultipleDefinitionError and refuses to run. Put all imports in the setup cell and receive them as function parameters.
Don't use generic top-level names like df, rows, resp, data, result — either prefix with _ or give them a unique descriptive name.
Don't return temporary variables — if _rows is only used to build a DataFrame, keep it _-prefixed and only return the DataFrame.
Don't use await in a non-async cell — this causes marimo to save the cell as _unparsable_cell (a string literal that won't execute). Always use async def for cells that call async functions.
Don't define classes in a cell without returning them — a bare return or no return makes classes invisible to the DAG. Other cells can't receive them as parameters.
Don't put display expressions inside if/else/for blocks: marimo only renders the last top-level expression, so a DataFrame inside an if branch is silently discarded. Use the _output = None pattern from the Do list above, or split the cell as shown in Cell Output Must Be at the Top Level.

Cell Output Previews

Every cell that fetches, transforms, or produces data must display a preview so the user can validate results at each step. The only exceptions are utility cells (config, setup, helpers) that only define constants or functions.

Think from the user's perspective: when they open the notebook in marimo edit, each cell should tell them something useful — a count, a sample, a summary. Silent cells that do work but show nothing are hard to debug and validate.

What to show

Cell type	What to preview
API fetch (list of items)	`mo.md(f"Fetched {len(items)} meetings")`
DataFrame build	The DataFrame itself as last expression (renders as interactive table)
Scalar result	`mo.md(f"Contact: {name} (id={contact_id})")`
Search / filter	`mo.md(f"{len(hits)} results matching '{term}'")`
Final output	Full DataFrame or `mo.md()` summary as last expression

Example: fetch cell with preview

Bad — cell runs silently, user sees nothing:

@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    return (meetings,)

Good — cell shows a count so the user knows it worked:

@app.cell
def fetch_meetings(fetch_all, DATAINDEX, my_id, mo):
    meetings = fetch_all(f"{DATAINDEX}/query", {
        "entity_types": "meeting",
        "contact_ids": str(my_id),
    })
    mo.md(f"**Fetched {len(meetings)} meetings**")
    return (meetings,)

Example: transform cell with table preview

Bad — builds DataFrame but doesn't display it:

@app.cell
def build_table(meetings, pl):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows)
    return (meeting_df,)

Good — the build cell shows a mo.md() count, and a separate display cell renders the DataFrame as an interactive table:

@app.cell
def build_table(meetings, pl, mo):
    _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings]
    meeting_df = pl.DataFrame(_rows).sort("date")
    mo.md(f"### Meetings ({len(meeting_df)} results)")
    return (meeting_df,)

@app.cell
def show_meeting_table(meeting_df):
    meeting_df  # Renders as interactive sortable table

Separate display cells for DataFrames

When a cell builds a DataFrame, use two cells: one that builds and returns it (with a mo.md() summary), and a standalone display cell that renders it as a table. This keeps the build logic clean and gives the user an interactive table they can sort and filter in the marimo UI.

# Cell 1: build and return the DataFrame, show a count
@app.cell
def build_sentiment_table(analyzed_meetings, pl, mo):
    _rows = [...]
    sentiment_df = pl.DataFrame(_rows).sort("date", descending=True)
    mo.md(f"### Sentiment Analysis ({len(sentiment_df)} meetings)")
    return (sentiment_df,)

# Cell 2: standalone display — just the DataFrame, nothing else
@app.cell
def show_sentiment_table(sentiment_df):
    sentiment_df

This pattern makes every result inspectable. The mo.md() cell gives a quick count/heading; the display cell lets the user explore the full data interactively.

Utility cells (no preview needed)

Config, setup, and helper cells that only define constants or functions don't need previews:

@app.cell
def config():
    BASE = "http://localhost:42000"
    CONTACTDB = f"{BASE}/contactdb-api"
    DATAINDEX = f"{BASE}/dataindex/api/v1"
    return CONTACTDB, DATAINDEX

@app.cell
def helpers(client):
    def fetch_all(url, params):
        ...
    return (fetch_all,)

Tips

Use marimo edit during development to see cell outputs interactively
Make raw API responses the last expression in a cell to inspect their structure
Use polars over pandas for better performance and type safety
Set timeout=30 on httpx clients — some queries over large date ranges are slow
Name cells descriptively — function names appear in the marimo sidebar
