From eefac81e577f31b4269e41e1940df5ac00e8e8c7 Mon Sep 17 00:00:00 2001 From: Mathieu Virbel Date: Thu, 19 Feb 2026 11:36:32 -0600 Subject: [PATCH] feat: migrate to skills-based approach --- .agents/skills/checkout/SKILL.md | 52 ++ .agents/skills/company/SKILL.md | 49 ++ .agents/skills/connectors/SKILL.md | 105 +++ .agents/skills/contactdb/SKILL.md | 160 +++++ .agents/skills/dataindex/SKILL.md | 223 ++++++ .agents/skills/notebook-patterns/SKILL.md | 808 ++++++++++++++++++++++ .agents/skills/project-history/SKILL.md | 364 ++++++++++ .agents/skills/project-init/SKILL.md | 264 +++++++ .agents/skills/project-sync/SKILL.md | 344 +++++++++ .agents/skills/workflow/SKILL.md | 105 +++ AGENTS.md | 146 +--- MYSELF.example.md | 28 - README.md | 137 ++-- docs/contactdb-api.md | 2 +- docs/dataindex-api.md | 2 +- 15 files changed, 2565 insertions(+), 224 deletions(-) create mode 100644 .agents/skills/checkout/SKILL.md create mode 100644 .agents/skills/company/SKILL.md create mode 100644 .agents/skills/connectors/SKILL.md create mode 100644 .agents/skills/contactdb/SKILL.md create mode 100644 .agents/skills/dataindex/SKILL.md create mode 100644 .agents/skills/notebook-patterns/SKILL.md create mode 100644 .agents/skills/project-history/SKILL.md create mode 100644 .agents/skills/project-init/SKILL.md create mode 100644 .agents/skills/project-sync/SKILL.md create mode 100644 .agents/skills/workflow/SKILL.md delete mode 100644 MYSELF.example.md diff --git a/.agents/skills/checkout/SKILL.md b/.agents/skills/checkout/SKILL.md new file mode 100644 index 0000000..d0fb742 --- /dev/null +++ b/.agents/skills/checkout/SKILL.md @@ -0,0 +1,52 @@ +--- +name: checkout +description: Build a weekly checkout/review covering Sunday through today. Gathers meetings, emails, Zulip conversations, and Gitea activity, then produces a structured summary. +disable-model-invocation: true +--- + +# Weekly Review Builder + +Build my weekly checkout covering Sunday through today. + +1. 
**Get my identity** with `contactdb_get_me` to obtain my contact_id +2. **Determine date range**: Sunday to today (use `date -d "last sunday" +%Y-%m-%d`) +3. **Gather activity in parallel**: + - **Dataindex**: Launch **one subagent per day** (Sunday through today). Each subagent should query `dataindex_query_entities` for that specific day with my contact_id, looking for meetings, calendar events, emails, documents. Return a day-by-day summary. + - **Threaded Conversations**: Launch **one subagent per day** (Sunday through today). Each subagent should: + 1. Query `dataindex_query_entities` for entity_type `threaded_conversation` for that specific day with my contact_id + 2. For each conversation found, fetch all `conversation_message` entities using the conversation ID as the parent_id filter + 3. Return messages I participated in, with context + - **Gitea**: Launch one subagent to run `~/bin/gitea-activity -s START -e END` and extract commits, PRs (opened/merged/approved), and repositories worked on +4. **Query dataindex directly** for the full week as a backup to ensure nothing is missed + +**Build the checkout with this structure:** + +``` +# Weekly Review: [Date Range] + +****Objectives**** +- List 2-3 high-level goals for the week based on the main themes of work + +****Major Achievements**** +- Bullet points of concrete deliverables, grouped by theme +- Focus on shipped features, solved problems, infrastructure built + +****Code Activity**** +- Stats line: X commits across Y repositories, Z PRs total (N merged, M open) +- **New Repositories**: `[name](url)` - brief description +- **Pull Requests Merged**: `[#N Title](url)` - one per line with descriptive title +- **Pull Requests Opened (not merged)**: `[#N](url)` - include status if known (approved, draft, etc.) 
+ +****Team Interactions**** +- **Meeting Type (Nx)**: Brief description of purpose/outcome + With: Key participants +- **Notable conversations**: Date, participants, main subject discussed +``` + +**Rules:** +- Use `****Title****` format for section headers (not ##) +- All PRs and repositories must be markdown links `[name](url)` +- List merged PRs first, then open/unmerged ones +- Only include meaningful interactions (skip routine standups unless notable decisions made) +- No "who am I" header, no summary section at the end +- Focus on outcomes and business value, not just activity volume diff --git a/.agents/skills/company/SKILL.md b/.agents/skills/company/SKILL.md new file mode 100644 index 0000000..dc45f6a --- /dev/null +++ b/.agents/skills/company/SKILL.md @@ -0,0 +1,49 @@ +--- +name: company +description: Monadical company context. Use when you need to understand the organization structure, Zulip stream layout, communication tools, meeting/calendar relationships, or internal product names. +user-invocable: false +--- + +# Company Context + +## About Monadical + +Monadical is a software consultancy founded in 2016. The company operates across multiple locations: Montreal and Vancouver (Canada), and Medellin and Cali (Colombia). The team builds internal products alongside client work. + +### Internal Products + +- **Reflector** — Meeting recording and transcription tool (produces meeting entities in DataIndex) + +- **GreyHaven / InternalAI platform** — A local-first platform that aggregates personal data and resolves contacts to enable automation and analysis + +## Communication Tools + +| Tool | Role | Data in DataIndex? 
| |------------|-----------------------------|---------------------| | Zulip | Primary internal chat | Yes (connector: `zulip`) | | Fastmail/Email | External communication | Yes (connector: `mbsync_email`) | | Calendar | Scheduling (ICS feeds) | Yes (connector: `ics_calendar`) | | Reflector | Meeting recordings | Yes (connector: `reflector`) | | HedgeDoc | Collaborative documents | Yes (connector: `hedgedoc`) | + +## How the company works + +We use Zulip as our main hub for communication. Zulip has channels (top level) and topics (nested under each channel). Different conventions apply depending on the channel. + +### Zulip channels + +Here is a list of Zulip streams with context on how the company is organized: + +- InternalAI (zulip:stream:193) is about this specific platform. +- Leads (zulip:stream:78) is where we talk about our leads/clients. We usually create one topic per lead/client, so if you are searching for information about a client, always check whether a related topic exists that matches the client or company name. +- Checkins (zulip:stream:24) usually has one topic per employee. This is where an employee notes what they did or plan to do over a period of time, or posts status updates. Not everybody uses the system on a regular basis. +- Devcap (zulip:stream:156) is where we discuss our investments / due diligence before investing. One topic per company. +- General (zulip:stream:21) is for company-wide discussion of various topics and services. 
+- Engineering (zulip:stream:25) is where we talk about engineering issues / services / new tools to try +- Learning (zulip:stream:31) is where we share links about new tools / ideas or things worth learning about +- Reflector (zulip:stream:155) is a dedicated stream for Reflector development and usage +- GreyHaven is split across multiple streams: branding (zulip:stream:206), GreyHaven-specific leads (zulip:stream:208) with one topic per lead, and marketing (zulip:stream:212) + +### Meeting and Calendar + +Some people in the company have a dedicated Reflector room for their meetings. This shows up as `room_name` on the `meeting` entity. +For people like Max, DataIndex has calendar information, and most of his calendar events have a related Reflector meeting. However, there is no direct relation between calendar events and Reflector meetings: you have to correlate the two to figure out which meeting a given event corresponds to. diff --git a/.agents/skills/connectors/SKILL.md b/.agents/skills/connectors/SKILL.md new file mode 100644 index 0000000..458bca7 --- /dev/null +++ b/.agents/skills/connectors/SKILL.md @@ -0,0 +1,105 @@ +--- +name: connectors +description: Reference for all data connectors and their entity type mappings. Use when determining which connector produces which entity types, understanding connector-specific fields, or choosing the right data source for a query. +user-invocable: false +--- + +# Connectors and Data Sources + +Each connector ingests data from an external source into DataIndex. Connectors run periodic background syncs to keep data fresh. + +Use `list_connectors()` at runtime to see which connectors are actually configured — not all connectors below may be active in every deployment. 
+ +## Connector → Entity Type Mapping + +| Connector ID | Entity Types Produced | Description | +|------------------|-----------------------------------------------------------------|----------------------------------| +| `reflector` | `meeting` | Meeting recordings + transcripts | +| `ics_calendar` | `calendar_event` | ICS calendar feed events | +| `mbsync_email` | `email` | Email via mbsync IMAP sync | +| `zulip` | `conversation`, `conversation_message`, `threaded_conversation` | Zulip chat streams and topics | +| `babelfish` | `conversation_message`, `threaded_conversation` | Chat translation bridge | +| `hedgedoc` | `document` | HedgeDoc collaborative documents | +| `contactdb` | `contact` | Synced from ContactDB (static) | +| `browser_history`| `webpage` | Browser extension page visits | +| `api_document` | `document` | API-ingested documents (static) | + +## Per-Connector Details + +### `reflector` — Meeting Recordings + +Ingests meetings from Reflector, Monadical's meeting recording tool. + +- **Entity type:** `meeting` +- **Key fields:** `transcript`, `summary`, `participants`, `start_time`, `end_time`, `room_name` +- **Use cases:** Find meetings someone attended, search meeting transcripts, get summaries +- **Tip:** Filter with `contact_ids` to find meetings involving specific people. The `transcript` field contains speaker-diarized text. + +### `ics_calendar` — Calendar Events + +Parses ICS calendar feeds (Google Calendar, Outlook, etc.). + +- **Entity type:** `calendar_event` +- **Key fields:** `start_time`, `end_time`, `attendees`, `location`, `description`, `calendar_name` +- **Use cases:** Check upcoming events, find events with specific attendees, review past schedule +- **Tip:** Multiple calendar feeds may be configured as separate connectors (e.g., `personal_calendar`, `work_calendar`). Use `list_connectors()` to discover them. + +### `mbsync_email` — Email + +Syncs email via mbsync (IMAP). 
+ +- **Entity type:** `email` +- **Key fields:** `text_content`, `from_contact_id`, `to_contact_ids`, `cc_contact_ids`, `thread_id`, `has_attachments` +- **Use cases:** Find emails from/to someone, search email content, track email threads +- **Tip:** Use `from_contact_id` and `to_contact_ids` with `contact_ids` filter. For thread grouping, use the `thread_id` field. + +### `zulip` — Chat + +Ingests Zulip streams, topics, and messages. + +- **Entity types:** + - `conversation` — A Zulip stream/channel with recent messages + - `conversation_message` — Individual chat messages + - `threaded_conversation` — A topic thread within a stream +- **Key fields:** `message`, `mentioned_contact_ids`, `recent_messages` +- **Use cases:** Find discussions about a topic, track who said what, find @-mentions +- **Tip:** Use `threaded_conversation` to find topic-level discussions. Use `conversation_message` with `mentioned_contact_ids` to find messages that mention specific people. + +### `babelfish` — Translation Bridge + +Ingests translated chat messages from the Babelfish service. + +- **Entity types:** `conversation_message`, `threaded_conversation` +- **Use cases:** Similar to Zulip but for translated cross-language conversations +- **Tip:** Query alongside `zulip` connector for complete conversation coverage. + +### `hedgedoc` — Collaborative Documents + +Syncs documents from HedgeDoc (collaborative markdown editor). + +- **Entity type:** `document` +- **Key fields:** `content`, `description`, `url`, `revision_id` +- **Use cases:** Find documents by content, track document revisions +- **Tip:** Use `search()` for semantic document search rather than `query_entities` text filter. + +### `contactdb` — Contact Sync (Static) + +Mirrors contacts from ContactDB into DataIndex for unified search. + +- **Entity type:** `contact` +- **Note:** This is a read-only mirror. Use ContactDB MCP tools directly for contact operations. 
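
Since the `contactdb` connector is a read-only mirror, the practical pattern is: resolve the person in ContactDB first, then filter DataIndex entities by the returned ID. A minimal sketch of the two request URLs involved; the helper names are hypothetical, and the base URLs are the Caddy-proxied endpoints documented in the contactdb and dataindex skills:

```python
from urllib.parse import urlencode

# Assumed base URLs (Caddy-proxied, per the contactdb/dataindex skills).
CONTACTDB = "http://localhost:42000/contactdb-api"
DATAINDEX = "http://localhost:42000/dataindex/api/v1"

def contact_search_url(name):
    # Step 1: resolve a person to a contact_id via ContactDB.
    return f"{CONTACTDB}/api/contacts?{urlencode({'search': name})}"

def entities_for_contact_url(contact_id, entity_type):
    # Step 2: filter DataIndex entities by that contact_id.
    params = {"entity_types": entity_type, "contact_ids": str(contact_id)}
    return f"{DATAINDEX}/query?{urlencode(params)}"

print(contact_search_url("Alice"))
print(entities_for_contact_url(42, "email"))
```

Fetch the first URL, take an `id` from the returned `contacts` list, then use it in the second URL.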
+ +### `browser_history` — Browser Extension (Static) + +Captures visited webpages from a browser extension. + +- **Entity type:** `webpage` +- **Key fields:** `url`, `visit_time`, `text_content` +- **Use cases:** Find previously visited pages, search page content + +### `api_document` — API Documents (Static) + +Documents ingested via the REST API (e.g., uploaded PDFs, imported files). + +- **Entity type:** `document` +- **Note:** These are ingested via `POST /api/v1/ingest/documents`, not periodic sync. diff --git a/.agents/skills/contactdb/SKILL.md b/.agents/skills/contactdb/SKILL.md new file mode 100644 index 0000000..17ab330 --- /dev/null +++ b/.agents/skills/contactdb/SKILL.md @@ -0,0 +1,160 @@ +--- +name: contactdb +description: ContactDB REST API reference. Use when resolving people to contact_ids, searching contacts by name/email, or accessing relationships, notes, and platform identities. +user-invocable: false +--- + +# ContactDB API Reference + +ContactDB is the people directory. It stores contacts, their platform identities, relationships, notes, and links. Every person across all data sources resolves to a single ContactDB `contact_id`. + +**Base URL:** `http://localhost:42000/contactdb-api` (via Caddy) or `http://localhost:42800` (direct) + +## Core Entities + +### Contact + +The central entity — represents a person. + +| Field | Type | Description | +|----------------------|---------------------|------------------------------------------------| +| `id` | int | Unique contact ID | +| `name` | string | Display name | +| `emails` | EmailField[] | `{type, value, preferred}` | +| `phones` | PhoneField[] | `{type, value, preferred}` | +| `bio` | string? | Short biography | +| `avatar_url` | string? 
| Profile image URL | +| `personal_info` | PersonalInfo | Birthday, partner, children, role, company, location, how_we_met | +| `interests` | string[] | Topics of interest | +| `values` | string[] | Personal values | +| `tags` | string[] | User-assigned tags | +| `profile_description`| string? | Extended description | +| `is_placeholder` | bool | Auto-created stub (not yet fully resolved) | +| `is_service_account` | bool | Non-human account (bot, no-reply) | +| `stats` | ContactStats | Interaction statistics (see below) | +| `enrichment_data` | dict | Data from enrichment providers | +| `platform_identities`| PlatformIdentity[] | Identities on various platforms | +| `created_at` | datetime | When created | +| `updated_at` | datetime | Last modified | +| `merged_into_id` | int? | If merged, target contact ID | +| `deleted_at` | datetime? | Soft-delete timestamp | + +### ContactStats + +| Field | Type | Description | +|--------------------------|---------------|--------------------------------------| +| `total_messages` | int | Total messages across platforms | +| `platforms_count` | int | Number of platforms active on | +| `last_interaction_at` | string? | ISO datetime of last interaction | +| `interaction_count_30d` | int | Interactions in last 30 days | +| `interaction_count_90d` | int | Interactions in last 90 days | +| `hotness` | HotnessScore? | Composite engagement score (0-100) | + +### PlatformIdentity + +Links a contact to a specific platform account. + +| Field | Type | Description | +|--------------------|-----------|------------------------------------------| +| `id` | int | Identity record ID | +| `contact_id` | int | Parent contact | +| `source` | string | Data provenance (e.g., `dataindex_zulip`)| +| `platform` | string | Platform name (e.g., `email`, `zulip`) | +| `platform_user_id` | string | User ID on that platform | +| `display_name` | string? | Name shown on that platform | +| `avatar_url` | string? 
| Platform-specific avatar | +| `bio` | string? | Platform-specific bio | +| `extra_data` | dict | Additional platform-specific data | +| `first_seen_at` | datetime | When first observed | +| `last_seen_at` | datetime | When last observed | + +### Relationship + +Tracks connections between contacts. + +| Field | Type | Description | +|------------------------|-----------|--------------------------------------| +| `id` | int | Relationship ID | +| `from_contact_id` | int | Source contact | +| `to_contact_id` | int | Target contact | +| `relationship_type` | string | Type (e.g., "colleague", "client") | +| `since_date` | date? | When relationship started | +| `relationship_metadata`| dict | Additional metadata | + +### Note + +Free-text notes attached to a contact. + +| Field | Type | Description | +|--------------|----------|----------------------| +| `id` | int | Note ID | +| `contact_id` | int | Parent contact | +| `content` | string | Note text | +| `created_by` | string | Who wrote it | +| `created_at` | datetime | When created | + +### Link + +External URLs associated with a contact. + +| Field | Type | Description | +|--------------|----------|--------------------------| +| `id` | int | Link ID | +| `contact_id` | int | Parent contact | +| `type` | string | Link type (e.g., "github", "linkedin") | +| `label` | string | Display label | +| `url` | string | URL | + +## REST Endpoints + +### GET `/api/contacts` — List/search contacts + +Primary way to find contacts. Returns `{contacts: [...], total, limit, offset}`. + +**Query parameters:** + +| Parameter | Type | Description | +|------------------------|---------------|----------------------------------------------| +| `search` | string? | Search in name and bio | +| `is_placeholder` | bool? | Filter by placeholder status | +| `is_service_account` | bool? | Filter by service account status | +| `sort_by` | string? | `"hotness"`, `"name"`, or `"updated_at"` | +| `min_hotness` | float? 
| Minimum hotness score (0-100) | +| `max_hotness` | float? | Maximum hotness score (0-100) | +| `platforms` | string[]? | Contacts with ALL specified platforms (AND) | +| `last_interaction_from`| string? | ISO datetime lower bound | +| `last_interaction_to` | string? | ISO datetime upper bound | +| `limit` | int | Max results (1-100, default 50) | +| `offset` | int | Pagination offset (default 0) | + +### GET `/api/contacts/me` — Get self contact + +Returns the platform operator's own contact record. **Call this first** in most workflows to get your own `contact_id`. + +### GET `/api/contacts/{id}` — Get contact by ID + +Get full details for a single contact by numeric ID. + +### GET `/api/contacts/by-email/{email}` — Get contact by email + +Look up a contact by email address. + +### Other Endpoints + +| Method | Path | Description | +|--------|-----------------------------------------|----------------------------------| +| POST | `/api/contacts` | Create contact | +| PUT | `/api/contacts/{id}` | Update contact | +| DELETE | `/api/contacts/{id}` | Delete contact | +| POST | `/api/contacts/merge` | Merge two contacts | +| GET | `/api/contacts/{id}/relationships` | List relationships | +| GET | `/api/contacts/{id}/notes` | List notes | +| GET | `/api/contacts/{id}/links` | List links | +| GET | `/api/platform-identities/contacts/{id}`| List platform identities | + +## Usage Pattern + +1. **Start with `GET /api/contacts/me`** to get the operator's contact ID +2. **Search by name** with `GET /api/contacts?search=Alice` +3. **Use contact IDs** from results as filters in DataIndex queries (`contact_ids` parameter) +4. **Paginate** large result sets with `offset` increments diff --git a/.agents/skills/dataindex/SKILL.md b/.agents/skills/dataindex/SKILL.md new file mode 100644 index 0000000..92f1230 --- /dev/null +++ b/.agents/skills/dataindex/SKILL.md @@ -0,0 +1,223 @@ +--- +name: dataindex +description: DataIndex REST API reference. 
Use when querying unified data (emails, meetings, calendar events, Zulip conversations, documents) via GET /query, POST /search, or GET /entities/{id}. +user-invocable: false +--- + +# DataIndex API Reference + +DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an **entity** with a common base structure plus type-specific fields. + +**Base URL:** `http://localhost:42000/dataindex/api/v1` (via Caddy) or `http://localhost:42180/api/v1` (direct) + +## Entity Types + +All entities share these base fields: + +| Field | Type | Description | +|----------------------|-------------|---------------------------------------------| +| `id` | string | Format: `connector_name:native_id` | +| `entity_type` | string | One of the types below | +| `timestamp` | datetime | When the entity occurred | +| `contact_ids` | string[] | ContactDB IDs of people involved | +| `connector_id` | string | Which connector produced this | +| `title` | string? | Display title | +| `parent_id` | string? | Parent entity (e.g., thread for a message) | +| `raw_data` | dict | Original source data (excluded by default) | + +### `calendar_event` + +From ICS calendar feeds. + +| Field | Type | Description | +|-----------------------|-------------|--------------------------------| +| `start_time` | datetime? | Event start | +| `end_time` | datetime? | Event end | +| `all_day` | bool | All-day event flag | +| `description` | string? | Event description | +| `location` | string? | Event location | +| `attendees` | dict[] | Attendee list | +| `organizer_contact_id`| string? | ContactDB ID of organizer | +| `status` | string? | Event status | +| `calendar_name` | string? | Source calendar name | +| `meeting_url` | string? | Video call link | + +### `meeting` + +From Reflector (recorded meetings with transcripts). 
+ +| Field | Type | Description | +|--------------------|---------------------|-----------------------------------| +| `start_time` | datetime? | Meeting start | +| `end_time` | datetime? | Meeting end | +| `participants` | MeetingParticipant[]| People in the meeting | +| `meeting_platform` | string? | Platform (e.g., "jitsi") | +| `transcript` | string? | Full transcript text | +| `summary` | string? | AI-generated summary | +| `meeting_url` | string? | Meeting link | +| `recording_url` | string? | Recording link | +| `location` | string? | Physical location | +| `room_name` | string? | Virtual room name (also indicates meeting location — see below) | + +**MeetingParticipant** fields: `display_name`, `contact_id?`, `platform_user_id?`, `email?`, `speaker?` + +> **`room_name` as location indicator:** The `room_name` field often encodes where the meeting took place (e.g., a Jitsi room name like `standup-office-bogota`). Use it to infer the meeting location when `location` is not set. + +> **Participant and contact coverage is incomplete.** Meeting data comes from Reflector, which only tracks users who are logged into the Reflector platform. This means: +> +> - **`contact_ids`** only contains ContactDB IDs for Reflector-logged participants who were matched to a known contact. It will often be a **subset** of the actual attendees — do not assume it is the full list. +> - **`participants`** is more complete than `contact_ids` but still only includes people detected by Reflector. Not all participants have accounts or could be identified — some attendees may be entirely absent from this list. +> - **`contact_id` within a participant** may be `null` if the person was detected but couldn't be matched to a ContactDB entry. +> +> **Consequence for queries:** Filtering meetings by `contact_ids` will **miss meetings** where the person attended but wasn't logged into Reflector or wasn't resolved. To get better coverage, combine multiple strategies: +> +> 1. 
Filter by `contact_ids` for resolved participants +> 2. Search `participants[].display_name` client-side for name matches +> 3. Use `POST /search` with the person's name to search meeting transcripts and summaries + +### `email` + +From mbsync email sync. + +| Field | Type | Description | +|--------------------|-----------|--------------------------------------| +| `thread_id` | string? | Email thread grouping | +| `text_content` | string? | Plain text body | +| `html_content` | string? | HTML body | +| `snippet` | string? | Preview snippet | +| `from_contact_id` | string? | Sender's ContactDB ID | +| `to_contact_ids` | string[] | Recipient ContactDB IDs | +| `cc_contact_ids` | string[] | CC recipient ContactDB IDs | +| `has_attachments` | bool | Has attachments flag | +| `attachments` | dict[] | Attachment metadata | + +### `conversation` + +A Zulip stream/channel. + +| Field | Type | Description | +|--------------------|---------|----------------------------------------| +| `recent_messages` | dict[] | Recent messages in the conversation | + +### `conversation_message` + +A single message in a Zulip conversation. + +| Field | Type | Description | +|-------------------------|-----------|-----------------------------------| +| `message` | string? | Message text content | +| `mentioned_contact_ids` | string[] | ContactDB IDs of mentioned people | + +### `threaded_conversation` + +A Zulip topic thread (group of messages under a topic). + +| Field | Type | Description | +|--------------------|---------|----------------------------------------| +| `recent_messages` | dict[] | Recent messages in the thread | + +### `document` + +From HedgeDoc, API ingestion, or other document sources. + +| Field | Type | Description | +|----------------|-----------|------------------------------| +| `content` | string? | Document body text | +| `description` | string? | Document description | +| `mimetype` | string? | MIME type | +| `url` | string? 
| Source URL | +| `revision_id` | string? | Revision identifier | + +### `webpage` + +From browser history extension. + +| Field | Type | Description | +|----------------|-----------|------------------------------| +| `url` | string | Page URL | +| `visit_time` | datetime | When visited | +| `text_content` | string? | Page text content | + +## REST Endpoints + +### GET `/api/v1/query` — Exhaustive Filtered Enumeration + +Use when you need **all** entities matching specific criteria. Supports pagination. + +**When to use:** "List all meetings since January", "Get all emails from Alice", "Count calendar events this week" + +**Query parameters:** + +| Parameter | Type | Description | +|------------------|---------------|------------------------------------------------| +| `entity_types` | string (repeat) | Filter by type — repeat param for multiple: `?entity_types=email&entity_types=meeting` | +| `contact_ids` | string | Comma-separated ContactDB IDs: `"1,42"` | +| `connector_ids` | string | Comma-separated connector IDs: `"zulip,reflector"` | +| `date_from` | string | ISO datetime lower bound (UTC if no timezone) | +| `date_to` | string | ISO datetime upper bound | +| `search` | string? | Text filter on content fields | +| `parent_id` | string? | Filter by parent entity | +| `id_prefix` | string? | Filter entities by ID prefix (e.g., `zulip:stream:155`) | +| `thread_id` | string? | Filter emails by thread ID | +| `room_name` | string? | Filter meetings by room name | +| `limit` | int | Max results per page (default 50) | +| `offset` | int | Pagination offset (default 0) | +| `sort_by` | string | `"timestamp"` (default), `"title"`, `"contact_activity"`, etc. 
| +| `sort_order` | string | `"desc"` (default) or `"asc"` | +| `include_raw_data`| bool | Include raw_data field (default false) | + +**Response format:** + +```json +{ + "items": [...], + "total": 152, + "page": 1, + "size": 50, + "pages": 4 +} +``` + +**Pagination:** loop with offset increments until `offset >= total`. See the [notebook-patterns skill](.agents/skills/notebook-patterns/SKILL.md) for a reusable helper. + +### POST `/api/v1/search` — Semantic Search + +Use when you need **relevant** results for a natural-language question. Returns ranked text chunks. No pagination — set a higher `limit` instead. + +**When to use:** "What was discussed about the product roadmap?", "Find conversations about hiring" + +**Request body (JSON):** + +```json +{ + "search_text": "product roadmap decisions", + "entity_types": ["meeting", "threaded_conversation"], + "contact_ids": ["1", "42"], + "date_from": "2025-01-01T00:00:00Z", + "date_to": "2025-06-01T00:00:00Z", + "connector_ids": ["reflector", "zulip"], + "limit": 20 +} +``` + +**Response:** `{results: [...chunks], total_count}` — each chunk has `entity_ids`, `entity_type`, `connector_id`, `content`, `timestamp`. + +### GET `/api/v1/entities/{id}` — Get Entity by ID + +Retrieve full details of a single entity. The `entity_id` format is `connector_name:native_id`. + +### GET `/api/v1/connectors/status` — Connector Status + +Get sync status for all connectors (last sync time, entity count, health). 
+ +## Common Query Recipes + +| Question | entity_type + connector_id | +|---------------------------------------|------------------------------------------| +| Meetings I attended | `meeting` + `reflector`, with your contact_id | +| Upcoming calendar events | `calendar_event` + `ics_calendar`, date_from=now | +| Emails from someone | `email` + `mbsync_email`, with their contact_id | +| Zulip threads about a topic | `threaded_conversation` + `zulip`, search="topic" | +| All documents | `document` + `hedgedoc` | +| Chat messages mentioning someone | `conversation_message` + `zulip`, with contact_id | +| What was discussed about X? | Use `POST /search` with `search_text` | diff --git a/.agents/skills/notebook-patterns/SKILL.md b/.agents/skills/notebook-patterns/SKILL.md new file mode 100644 index 0000000..4070ede --- /dev/null +++ b/.agents/skills/notebook-patterns/SKILL.md @@ -0,0 +1,808 @@ +--- +name: notebook-patterns +description: Marimo notebook patterns for InternalAI data analysis. Use when creating or editing marimo notebooks — covers cell scoping, async cells, pagination helpers, analysis patterns, and do/don't rules. +user-invocable: false +--- + +# Marimo Notebook Patterns + +This guide covers how to create [marimo](https://marimo.io) notebooks for data analysis against the InternalAI platform APIs. Marimo notebooks are plain `.py` files with reactive cells — no `.ipynb` format, no Jupyter dependency. + +## Marimo Basics + +A marimo notebook is a Python file with `@app.cell` decorated functions. Each cell returns values as a tuple, and other cells receive them as function parameters — marimo builds a reactive DAG automatically. 
+ +```python +import marimo +app = marimo.App() + +@app.cell +def cell_one(): + x = 42 + return (x,) + +@app.cell +def cell_two(x): + # Re-runs automatically when x changes + result = x * 2 + return (result,) +``` + +**Key rules:** +- Cells declare dependencies via function parameters +- Cells return values as tuples: `return (var1, var2,)` +- The **last expression at the top level** of a cell is displayed as rich output in the marimo UI (dataframes render as tables, dicts as collapsible trees). Expressions inside `if`/`else`/`for` blocks do **not** count — see [Cell Output Must Be at the Top Level](#cell-output-must-be-at-the-top-level) below +- Use `mo.md("# heading")` for formatted markdown output (import `mo` once in setup — see below) +- No manual execution order; the DAG determines it +- **Variable names must be unique across cells.** Every variable assigned at the top level of a cell is tracked by marimo's DAG. If two cells both define `resp`, marimo raises `MultipleDefinitionError` and refuses to run. Prefix cell-local variables with `_` (e.g., `_resp`, `_rows`, `_data`) to make them **private** to that cell — marimo ignores `_`-prefixed names. +- **All imports must go in the `setup` cell.** Every `import` statement creates a top-level variable (e.g., `import asyncio` defines `asyncio`). If two cells both `import asyncio`, marimo raises `MultipleDefinitionError`. Place **all** imports in a single setup cell and pass them as cell parameters. Do NOT `import marimo as mo` or `import asyncio` in multiple cells — import once in `setup`, then receive via `def my_cell(mo, asyncio):`. + +### Cell Variable Scoping — Example + +This is the **most common mistake**. Any variable assigned at the top level of a cell (not inside a `def` or comprehension) is tracked by marimo. If two cells assign the same name, the notebook refuses to run. 
+ +**BROKEN** — `resp` is defined at top level in both cells: + +```python +# Cell A +@app.cell +def search_meetings(client, DATAINDEX): + resp = client.post(f"{DATAINDEX}/search", json={...}) # defines 'resp' + resp.raise_for_status() + results = resp.json()["results"] + return (results,) + +# Cell B +@app.cell +def fetch_details(client, DATAINDEX, results): + resp = client.get(f"{DATAINDEX}/entities/{results[0]}") # also defines 'resp' → ERROR + meeting = resp.json() + return (meeting,) +``` + +> **Error:** `MultipleDefinitionError: variable 'resp' is defined in multiple cells` + +**FIXED** — prefix cell-local variables with `_`: + +```python +# Cell A +@app.cell +def search_meetings(client, DATAINDEX): + _resp = client.post(f"{DATAINDEX}/search", json={...}) # _resp is cell-private + _resp.raise_for_status() + results = _resp.json()["results"] + return (results,) + +# Cell B +@app.cell +def fetch_details(client, DATAINDEX, results): + _resp = client.get(f"{DATAINDEX}/entities/{results[0]}") # _resp is cell-private, no conflict + meeting = _resp.json() + return (meeting,) +``` + +**Rule of thumb:** if a variable is only used within the cell to compute a return value, prefix it with `_`. Only leave names unprefixed if another cell needs to receive them. + +> **Note:** Variables inside nested `def` functions are naturally local and don't need `_` prefixes — e.g., `resp` inside a `def fetch_all(...)` helper is fine because it's scoped to the function, not the cell. + +### Cell Output Must Be at the Top Level + +Marimo only renders the **last expression at the top level** of a cell as rich output. An expression buried inside an `if`/`else`, `for`, `try`, or any other block is **not** displayed — it's silently discarded. 
+ +**BROKEN** — `_df` inside the `if` branch is never rendered, and `mo.md()` inside `if`/`else` is also discarded: + +```python +@app.cell +def show_results(results, mo): + if results: + _df = pl.DataFrame(results) + mo.md(f"**Found {len(results)} results**") + _df # Inside an if block — marimo does NOT display this + else: + mo.md("**No results found**") # Also inside a block — NOT displayed + return +``` + +**FIXED** — split into separate cells. Each cell displays exactly **one thing** at the top level: + +```python +# Cell 1: build the data, return it +@app.cell +def build_results(results, pl): + results_df = pl.DataFrame(results) if results else None + return (results_df,) + +# Cell 2: heading — mo.md() is the top-level expression (use ternary for conditional text) +@app.cell +def show_results_heading(results_df, mo): + mo.md(f"**Found {len(results_df)} results**" if results_df is not None else "**No results found**") + +# Cell 3: table — DataFrame is the top-level expression +@app.cell +def show_results_table(results_df): + results_df # Top-level expression — marimo renders this as interactive table +``` + +**Rules:** +- Each cell should display **one thing** — either `mo.md()` OR a DataFrame, never both +- `mo.md()` must be a **top-level expression**, not inside `if`/`else`/`for`/`try` blocks +- Build conditional text using variables or ternary expressions, then call `mo.md(_text)` at the top level +- For DataFrames, use a standalone display cell: `def show_table(df): df` + +### Async Cells + +When a cell uses `await` (e.g., for `llm_call` or `asyncio.gather`), you **must** declare it as `async def`: + +```python +@app.cell +async def analyze(meetings, llm_call, ResponseModel, asyncio): + async def _score(meeting): + return await llm_call(prompt=..., response_model=ResponseModel) + + results = await asyncio.gather(*[_score(_m) for _m in meetings]) + return (results,) +``` + +Note that `asyncio` is imported in the `setup` cell and received here as a parameter 
— never `import asyncio` inside individual cells. + +If you write `await` in a non-async cell, marimo cannot parse the cell and saves it as an `_unparsable_cell` string literal — the cell won't run, and you'll see `SyntaxError: 'return' outside function` or similar errors. See [Fixing `_unparsable_cell`](#fixing-_unparsable_cell) below. + +### Cells That Define Classes Must Return Them + +If a cell defines Pydantic models (or any class) that other cells need, it **must** return them: + +```python +# BaseModel and Field are imported in the setup cell and received as parameters +@app.cell +def models(BaseModel, Field): + class MeetingSentiment(BaseModel): + overall_sentiment: str + sentiment_score: int = Field(description="Score from -10 to +10") + + class FrustrationExtraction(BaseModel): + has_frustrations: bool + frustrations: list[dict] + + return MeetingSentiment, FrustrationExtraction # Other cells receive these as parameters +``` + +A bare `return` (or no return) means those classes are invisible to the rest of the notebook. + +### Fixing `_unparsable_cell` + +When marimo can't parse a cell into a proper `@app.cell` function, it saves the raw code as `app._unparsable_cell("...", name="cell_name")`. These cells **won't run** and show errors like `SyntaxError: 'return' outside function`. + +**Common causes:** +1. Using `await` without making the cell `async def` +2. Using `return` in code that marimo failed to wrap into a function (usually a side effect of cause 1) + +**How to fix:** Convert the `_unparsable_cell` string back into a proper `@app.cell` decorated function: + +```python +# BROKEN — saved as _unparsable_cell because of top-level await +app._unparsable_cell(""" +results = await asyncio.gather(...) +return results +""", name="my_cell") + +# FIXED — proper async cell function (asyncio imported in setup, received as parameter) +@app.cell +async def my_cell(some_dependency, asyncio): + results = await asyncio.gather(...) 
+    return (results,)
+```
+
+**Key differences to note when converting:**
+- Wrap the code in an `async def` function (if it uses `await`)
+- Add cell dependencies as function parameters (including imports like `asyncio`)
+- Return values as tuples: `return (var,)` not `return var`
+- Prefix cell-local variables with `_`
+- Never add `import` statements inside the cell — all imports belong in `setup`
+
+### Inline Dependencies with PEP 723
+
+Use PEP 723 `/// script` metadata so `uv run` auto-installs dependencies:
+
+```python
+# /// script
+# requires-python = ">=3.12"
+# dependencies = [
+#     "marimo",
+#     "httpx",
+#     "polars",
+#     "mirascope[openai]",
+#     "pydantic",
+#     "python-dotenv",
+# ]
+# ///
+```
+
+### Checking Notebooks Before Running
+
+Always run `marimo check` before opening or running a notebook. It catches common issues — duplicate variable definitions, `_unparsable_cell` blocks, branch expressions that won't display, and more — without needing to start the full editor:
+
+```bash
+uvx marimo check notebook.py        # Check a single notebook
+uvx marimo check workflows/         # Check all notebooks in a directory
+uvx marimo check --fix notebook.py  # Auto-fix fixable issues
+```
+
+**Run this after every edit.** A clean `marimo check` (no output, exit code 0) means the notebook is structurally valid. Any errors must be fixed before running.
+
+### Running Notebooks
+
+```bash
+uvx marimo edit notebook.py   # Interactive editor (best for development)
+uvx marimo run notebook.py    # Read-only web app
+uv run notebook.py            # Script mode (terminal output)
+```
+
+### Inspecting Cell Outputs
+
+In `marimo edit`, every cell's last top-level expression is displayed as rich output below the cell. 
This is the primary way to introspect API responses:
+
+- **Dicts/lists** render as collapsible JSON trees — click to expand nested fields
+- **Polars/Pandas DataFrames** render as interactive sortable tables
+- **Strings** render as plain text
+
+To inspect a raw API response, just make it the last expression:
+
+```python
+@app.cell
+def inspect_response(client, DATAINDEX):
+    _resp = client.get(f"{DATAINDEX}/query", params={
+        "entity_types": "meeting", "limit": 2,
+    })
+    _resp.json()  # This gets displayed as a collapsible JSON tree
+```
+
+To inspect an intermediate value alongside other work, combine the pieces into a single top-level expression with `mo.vstack` (marimo only renders the last expression, so separate `mo.md()` and `mo.accordion()` statements would drop all but the final one):
+
+```python
+@app.cell
+def debug_meetings(meetings, mo):
+    # One top-level expression: stack the count and the inspector together
+    mo.vstack([
+        mo.md(f"**Count:** {len(meetings)}"),
+        mo.accordion({"First meeting raw": mo.json(meetings[0])}) if meetings else mo.md("_no meetings_"),
+    ])
+```
+
+## Notebook Skeleton
+
+Every notebook against InternalAI follows this structure:
+
+```python
+# /// script
+# requires-python = ">=3.12"
+# dependencies = [
+#     "marimo",
+#     "httpx",
+#     "polars",
+#     "mirascope[openai]",
+#     "pydantic",
+#     "python-dotenv",
+# ]
+# ///
+
+import marimo
+app = marimo.App()
+
+@app.cell
+def params():
+    """User parameters — edit these to change the workflow's behavior."""
+    SEARCH_TERMS = ["greyhaven"]
+    DATE_FROM = "2026-01-01T00:00:00Z"
+    DATE_TO = "2026-02-01T00:00:00Z"
+    TARGET_PERSON = None  # Set to a name like "Alice" to filter by person, or None for all
+    return DATE_FROM, DATE_TO, SEARCH_TERMS, TARGET_PERSON
+
+@app.cell
+def config():
+    BASE = "http://localhost:42000"
+    CONTACTDB = f"{BASE}/contactdb-api"
+    DATAINDEX = f"{BASE}/dataindex/api/v1"
+    return (CONTACTDB, DATAINDEX,)
+
+@app.cell
+def setup():
+    from dotenv import load_dotenv
+    load_dotenv(".env")  # Load .env from the project root
+
+    import asyncio  # All imports go here — never import inside other cells
+    import httpx
+    import marimo as mo
+    import polars as pl
+    from pydantic import BaseModel, Field
+    client = 
httpx.Client(timeout=30)
+    return (asyncio, client, mo, pl, BaseModel, Field,)
+
+# --- your IN / ETL / OUT cells here ---
+
+if __name__ == "__main__":
+    app.run()
+```
+
+> **`load_dotenv(".env")`** reads the `.env` file explicitly by name. This makes `LLM_API_KEY` and other env vars available to `os.getenv()` calls in `lib/llm.py` without requiring the shell to have them pre-set. Always include `python-dotenv` in PEP 723 dependencies and call `load_dotenv(".env")` early in the setup cell.
+
+**The `params` cell must always be the first cell** after `app = marimo.App()`. It contains all user-configurable constants (search terms, date ranges, target names, etc.) as plain Python values. This way the user can tweak the workflow by editing a single cell at the top — no need to hunt through the code for hardcoded values.
+
+## Pagination Helper
+
+The DataIndex `GET /query` endpoint paginates with `limit` and `offset`. Always paginate — result sets can be large.
+
+```python
+@app.cell
+def helpers(client):
+    def fetch_all(url, params):
+        """Fetch all pages from a paginated DataIndex endpoint."""
+        all_items = []
+        limit = params.get("limit", 50)
+        params = {**params, "limit": limit, "offset": 0}
+        while True:
+            resp = client.get(url, params=params)
+            resp.raise_for_status()
+            data = resp.json()
+            all_items.extend(data["items"])
+            if params["offset"] + limit >= data["total"]:
+                break
+            params["offset"] += limit
+        return all_items
+
+    def resolve_contact(name, contactdb_url):
+        """Find a contact by name, return the first matching contact record (a dict with `id`, `name`, ...)."""
+        resp = client.get(f"{contactdb_url}/api/contacts", params={"search": name})
+        resp.raise_for_status()
+        contacts = resp.json()["contacts"]
+        if not contacts:
+            raise ValueError(f"No contact found for '{name}'")
+        return contacts[0]
+
+    return (fetch_all, resolve_contact,)
+```
+
+## Pattern 1: Emails Involving a Specific Person
+
+Emails have `from_contact_id`, `to_contact_ids`, and `cc_contact_ids`. 
The query API's `contact_ids` filter matches entities where the contact appears in **any** of these roles. + +```python +@app.cell +def find_person(resolve_contact, CONTACTDB): + target = resolve_contact("Alice", CONTACTDB) + target_id = target["id"] + target_name = target["name"] + return (target_id, target_name,) + +@app.cell +def fetch_emails(fetch_all, DATAINDEX, target_id): + emails = fetch_all(f"{DATAINDEX}/query", { + "entity_types": "email", + "contact_ids": str(target_id), + "date_from": "2025-01-01T00:00:00Z", + "sort_order": "desc", + }) + return (emails,) + +@app.cell +def email_table(emails, target_id, target_name, pl): + email_df = pl.DataFrame([{ + "date": e["timestamp"][:10], + "subject": e.get("title", "(no subject)"), + "direction": ( + "sent" if str(target_id) == str(e.get("from_contact_id")) + else "received" + ), + "snippet": (e.get("snippet") or e.get("text_content") or "")[:100], + } for e in emails]) + return (email_df,) + +@app.cell +def show_emails(email_df, target_name, mo): + mo.md(f"## Emails involving {target_name} ({len(email_df)} total)") + +@app.cell +def display_email_table(email_df): + email_df # Renders as interactive table in marimo edit +``` + +## Pattern 2: Meetings with a Specific Participant + +Meetings have a `participants` list where each entry may or may not have a resolved `contact_id`. The query API's `contact_ids` filter only matches **resolved** participants. + +**Strategy:** Query by `contact_ids` to get meetings with resolved participants, then optionally do a client-side check on `participants[].display_name` or `transcript` for unresolved ones. + +> **Always include `room_name` in meeting tables.** The `room_name` field contains the virtual room name (e.g., `standup-office-bogota`) and often indicates where the meeting took place. It's useful context when `title` is generic or missing — include it as a column alongside `title`. 
+
+```python
+@app.cell
+def fetch_meetings(fetch_all, DATAINDEX, target_id):
+    # Get meetings where the target appears in contact_ids
+    resolved_meetings = fetch_all(f"{DATAINDEX}/query", {
+        "entity_types": "meeting",
+        "contact_ids": str(target_id),
+        "date_from": "2025-01-01T00:00:00Z",
+    })
+    return (resolved_meetings,)
+
+@app.cell
+def meeting_table(resolved_meetings, pl):
+    _rows = []
+    for _m in resolved_meetings:
+        _participants = _m.get("participants", [])
+        _names = [_p["display_name"] for _p in _participants]
+        _rows.append({
+            "date": (_m.get("start_time") or _m["timestamp"])[:10],
+            "title": _m.get("title", "Untitled"),
+            "room_name": _m.get("room_name", ""),
+            "participants": ", ".join(_names),
+            "has_transcript": _m.get("transcript") is not None,
+            "has_summary": _m.get("summary") is not None,
+        })
+    meeting_df = pl.DataFrame(_rows)
+    return (meeting_df,)
+```
+
+To also find meetings where the person was present but **not resolved** (guest), search the transcript:
+
+```python
+@app.cell
+def search_unresolved(client, DATAINDEX, target_name):
+    # Semantic search for the person's name in meeting transcripts
+    _resp = client.post(f"{DATAINDEX}/search", json={
+        "search_text": target_name,
+        "entity_types": ["meeting"],
+        "limit": 50,
+    })
+    _resp.raise_for_status()
+    transcript_hits = _resp.json()["results"]
+    return (transcript_hits,)
+```
+
+## Pattern 3: Calendar Events → Meeting Correlation
+
+Calendar events and meetings are separate entities from different connectors. To find which calendar events had a corresponding recorded meeting, match by time overlap. 
+
+```python
+@app.cell
+def fetch_calendar_and_meetings(fetch_all, DATAINDEX, my_id):
+    events = fetch_all(f"{DATAINDEX}/query", {
+        "entity_types": "calendar_event",
+        "contact_ids": str(my_id),
+        "date_from": "2025-01-01T00:00:00Z",
+        "sort_by": "timestamp",
+        "sort_order": "asc",
+    })
+    meetings = fetch_all(f"{DATAINDEX}/query", {
+        "entity_types": "meeting",
+        "contact_ids": str(my_id),
+        "date_from": "2025-01-01T00:00:00Z",
+    })
+    return (events, meetings,)
+
+@app.cell
+def correlate(events, meetings, pl):
+    def _parse_dt(s):
+        # Importing inside the helper keeps `datetime` scoped to the function,
+        # not the cell (per the stdlib-imports rule in Do / Don't)
+        from datetime import datetime
+        if not s:
+            return None
+        return datetime.fromisoformat(s.replace("Z", "+00:00"))
+
+    # Index meetings by start_time for matching
+    _meeting_by_time = {}
+    for _m in meetings:
+        _start = _parse_dt(_m.get("start_time"))
+        if _start:
+            _meeting_by_time[_start] = _m
+
+    _rows = []
+    for _ev in events:
+        _ev_start = _parse_dt(_ev.get("start_time"))
+        if not _ev_start:
+            continue
+
+        # Find meeting within 15-min window of calendar event start
+        _matched = None
+        for _m_start, _m in _meeting_by_time.items():
+            if abs((_m_start - _ev_start).total_seconds()) < 900:
+                _matched = _m
+                break
+
+        _rows.append({
+            "date": _ev_start.strftime("%Y-%m-%d"),
+            "time": _ev_start.strftime("%H:%M"),
+            "event_title": _ev.get("title", "(untitled)"),
+            "has_recording": _matched is not None,
+            "meeting_title": _matched.get("title", "") if _matched else "",
+            "attendee_count": len(_ev.get("attendees", [])),
+        })
+
+    calendar_df = pl.DataFrame(_rows)
+    return (calendar_df,)
+```
+
+## Pattern 4: Full Interaction Timeline for a Person
+
+Combine emails, meetings, and Zulip messages into a single chronological view. 
+ +```python +@app.cell +def fetch_all_interactions(fetch_all, DATAINDEX, target_id): + all_entities = fetch_all(f"{DATAINDEX}/query", { + "contact_ids": str(target_id), + "date_from": "2025-01-01T00:00:00Z", + "sort_by": "timestamp", + "sort_order": "desc", + }) + return (all_entities,) + +@app.cell +def interaction_timeline(all_entities, target_name, pl): + _rows = [] + for _e in all_entities: + _etype = _e["entity_type"] + _summary = "" + if _etype == "email": + _summary = _e.get("snippet") or _e.get("title") or "" + elif _etype == "meeting": + _summary = _e.get("summary") or _e.get("title") or "" + elif _etype == "conversation_message": + _summary = (_e.get("message") or "")[:120] + elif _etype == "threaded_conversation": + _summary = _e.get("title") or "" + elif _etype == "calendar_event": + _summary = _e.get("title") or "" + else: + _summary = _e.get("title") or _e["entity_type"] + + _rows.append({ + "date": _e["timestamp"][:10], + "type": _etype, + "source": _e["connector_id"], + "summary": _summary[:120], + }) + + timeline_df = pl.DataFrame(_rows) + return (timeline_df,) + +@app.cell +def show_timeline(timeline_df, target_name, mo): + mo.md(f"## Interaction Timeline: {target_name} ({len(timeline_df)} events)") + +@app.cell +def display_timeline(timeline_df): + timeline_df +``` + +## Pattern 5: LLM Filtering with `lib.llm` + +When you need to classify, score, or extract structured information from each entity (e.g. "is this meeting about project X?", "rate the relevance of this email"), use the `llm_call` helper from `workflows/lib`. It sends each item to an LLM and parses the response into a typed Pydantic model. + +**Prerequisites:** Copy `.env.example` to `.env` and fill in your `LLM_API_KEY`. Add `mirascope`, `pydantic`, and `python-dotenv` to the notebook's PEP 723 dependencies. 
+ +```python +# /// script +# requires-python = ">=3.12" +# dependencies = [ +# "marimo", +# "httpx", +# "polars", +# "mirascope[openai]", +# "pydantic", +# "python-dotenv", +# ] +# /// +``` + +### Setup cell — load `.env` and import `llm_call` + +```python +@app.cell +def setup(): + from dotenv import load_dotenv + load_dotenv(".env") # Makes LLM_API_KEY available to lib/llm.py + + import asyncio + import httpx + import marimo as mo + import polars as pl + from pydantic import BaseModel, Field + from lib.llm import llm_call + client = httpx.Client(timeout=30) + return (asyncio, client, llm_call, mo, pl, BaseModel, Field,) +``` + +### Define a response model + +Create a Pydantic model that describes the structured output you want from the LLM: + +```python +@app.cell +def models(BaseModel, Field): + + class RelevanceScore(BaseModel): + relevant: bool + reason: str + score: int # 0-10 + + return (RelevanceScore,) +``` + +### Filter entities through the LLM + +Iterate over fetched entities and call `llm_call` for each one. Since `llm_call` is async, use `asyncio.gather` to process items concurrently: + +```python +@app.cell +async def llm_filter(meetings, llm_call, RelevanceScore, pl, mo, asyncio): + _topic = "Greyhaven" + + async def _score(meeting): + _text = meeting.get("summary") or meeting.get("title") or "" + _result = await llm_call( + prompt=f"Is this meeting about '{_topic}'?\n\nMeeting: {_text}", + response_model=RelevanceScore, + system_prompt="Score the relevance of this meeting to the given topic. 
Set relevant=true if score >= 5.",
+        )
+        return {**meeting, "llm_relevant": _result.relevant, "llm_reason": _result.reason, "llm_score": _result.score}
+
+    _scored_meetings = await asyncio.gather(*[_score(_m) for _m in meetings])
+    relevant_meetings = [_m for _m in _scored_meetings if _m["llm_relevant"]]
+
+    mo.md(f"**LLM filter:** {len(relevant_meetings)}/{len(meetings)} meetings relevant to '{_topic}'")
+    return (relevant_meetings,)
+```
+
+### Tips for LLM filtering
+
+- **Keep prompts short** — only include the fields the LLM needs (title, summary, snippet), not the entire raw entity.
+- **Use structured output** — always pass a `response_model` so you get typed fields back, not free-text.
+- **Batch wisely** — `asyncio.gather` sends all requests concurrently. For large datasets (100+ items), process in chunks to avoid rate limits.
+- **Cache results** — LLM calls are slow and cost money. If iterating on a notebook, consider storing scored results in a cell variable so you don't re-score on every edit.
+
+## Do / Don't — Quick Reference for LLM Agents
+
+When generating marimo notebooks, follow these rules strictly. Violations cause `MultipleDefinitionError` at runtime.
+
+### Do
+
+- **Prefix cell-local variables with `_`** — `_resp`, `_rows`, `_m`, `_data`, `_chunk`. Marimo ignores `_`-prefixed names so they won't clash across cells.
+- **Put all imports in the `setup` cell** and pass them as cell parameters: `def my_cell(client, mo, pl, asyncio):`. Never `import` inside other cells — even `import asyncio` in two async cells causes `MultipleDefinitionError`.
+- **Give returned DataFrames unique names** — `email_df`, `meeting_df`, `timeline_df`. Never use a bare `df` that might collide with another cell.
+- **Return only values other cells need** — everything else should be `_`-prefixed and stays private to the cell.
+- **Import stdlib modules in `setup` too** — even `from datetime import datetime` creates a top-level name. 
If two cells both import `datetime`, marimo errors. Import it once in `setup` and receive it as a parameter, or use it inside a `_`-prefixed helper function where it's naturally scoped. +- **Every non-utility cell must show a preview** — see the "Cell Output Previews" section below. +- **Use separate display cells for DataFrames** — the build cell returns the DataFrame and shows a `mo.md()` count/heading; a standalone display cell (e.g., `def show_table(df): df`) renders it as an interactive table the user can sort and filter. +- **Include `room_name` when listing meetings** — the virtual room name provides useful context about where the meeting took place (e.g., `standup-office-bogota`). Show it as a column alongside `title`. +- **Keep cell output expressions at the top level** — if a cell conditionally displays a DataFrame, initialize `_output = None` before the `if`/`else`, assign inside the branches, then put `_output` as the last top-level expression. Expressions inside `if`/`else`/`for` blocks are silently ignored by marimo. +- **Put all user parameters in a `params` cell as the first cell** — date ranges, search terms, target names, limits. Never hardcode these values deeper in the notebook. +- **Declare cells as `async def` when using `await`** — `@app.cell` followed by `async def cell_name(...)`. This includes cells using `asyncio.gather`, `await llm_call(...)`, or any async API. +- **Return classes/models from cells that define them** — if a cell defines `class MyModel(BaseModel)`, return it so other cells can use it as a parameter: `return (MyModel,)`. +- **Use `python-dotenv` to load `.env`** — add `python-dotenv` to PEP 723 dependencies and call `load_dotenv(".env")` early in the setup cell (before importing `lib.llm`). This ensures `LLM_API_KEY` and other env vars are available without requiring them to be pre-set in the shell. 
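+
+As a minimal sketch of the `_output = None` convention described in the Do list above (the cell and variable names here are illustrative):
+
+```python
+@app.cell
+def show_optional_table(results_df, mo):
+    # Build the output inside branches, but display it at the top level
+    _output = None
+    if results_df is not None and len(results_df) > 0:
+        _output = results_df
+    else:
+        _output = mo.md("**No results to display**")
+    _output  # Last top-level expression: this is what marimo renders
+```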
+ +### Don't + +- **Don't define the same variable name in two cells** — even `resp = ...` in cell A and `resp = ...` in cell B is a fatal error. +- **Don't `import` inside non-setup cells** — every `import X` defines a top-level variable `X`. If two cells both `import asyncio`, marimo raises `MultipleDefinitionError` and refuses to run. Put all imports in the `setup` cell and receive them as function parameters. +- **Don't use generic top-level names** like `df`, `rows`, `resp`, `data`, `result` — either prefix with `_` or give them a unique descriptive name. +- **Don't return temporary variables** — if `_rows` is only used to build a DataFrame, keep it `_`-prefixed and only return the DataFrame. +- **Don't use `await` in a non-async cell** — this causes marimo to save the cell as `_unparsable_cell` (a string literal that won't execute). Always use `async def` for cells that call async functions. +- **Don't define classes in a cell without returning them** — a bare `return` or no return makes classes invisible to the DAG. Other cells can't receive them as parameters. +- **Don't put display expressions inside `if`/`else`/`for` blocks** — marimo only renders the last top-level expression. A DataFrame inside an `if` branch is silently discarded. Use the `_output = None` pattern instead (see [Cell Output Must Be at the Top Level](#cell-output-must-be-at-the-top-level)). + +## Cell Output Previews + +Every cell that fetches, transforms, or produces data **must display a preview** so the user can validate results at each step. The only exceptions are **utility cells** (config, setup, helpers) that only define constants or functions. + +Think from the user's perspective: when they open the notebook in `marimo edit`, each cell should tell them something useful — a count, a sample, a summary. Silent cells that do work but show nothing are hard to debug and validate. 
+ +### What to show + +| Cell type | What to preview | +|-----------|----------------| +| API fetch (list of items) | `mo.md(f"**Fetched {len(items)} meetings**")` | +| DataFrame build | The DataFrame itself as last expression (renders as interactive table) | +| Scalar result | `mo.md(f"**Contact:** {name} (id={contact_id})")` | +| Search / filter | `mo.md(f"**{len(hits)} results** matching '{term}'")` | +| Final output | Full DataFrame or `mo.md()` summary as last expression | + +### Example: fetch cell with preview + +**Bad** — cell runs silently, user sees nothing: + +```python +@app.cell +def fetch_meetings(fetch_all, DATAINDEX, my_id): + meetings = fetch_all(f"{DATAINDEX}/query", { + "entity_types": "meeting", + "contact_ids": str(my_id), + }) + return (meetings,) +``` + +**Good** — cell shows a count so the user knows it worked: + +```python +@app.cell +def fetch_meetings(fetch_all, DATAINDEX, my_id, mo): + meetings = fetch_all(f"{DATAINDEX}/query", { + "entity_types": "meeting", + "contact_ids": str(my_id), + }) + mo.md(f"**Fetched {len(meetings)} meetings**") + return (meetings,) +``` + +### Example: transform cell with table preview + +**Bad** — builds DataFrame but doesn't display it: + +```python +@app.cell +def build_table(meetings, pl): + _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings] + meeting_df = pl.DataFrame(_rows) + return (meeting_df,) +``` + +**Good** — the build cell shows a `mo.md()` count, and a **separate display cell** renders the DataFrame as an interactive table: + +```python +@app.cell +def build_table(meetings, pl, mo): + _rows = [{"date": _m["timestamp"][:10], "title": _m.get("title", "")} for _m in meetings] + meeting_df = pl.DataFrame(_rows).sort("date") + mo.md(f"### Meetings ({len(meeting_df)} results)") + return (meeting_df,) + +@app.cell +def show_meeting_table(meeting_df): + meeting_df # Renders as interactive sortable table +``` + +### Separate display cells for DataFrames + +When a 
cell builds a DataFrame, use **two cells**: one that builds and returns it (with a `mo.md()` summary), and a standalone display cell that renders it as a table. This keeps the build logic clean and gives the user an interactive table they can sort and filter in the marimo UI. + +```python +# Cell 1: build and return the DataFrame, show a count +@app.cell +def build_sentiment_table(analyzed_meetings, pl, mo): + _rows = [...] + sentiment_df = pl.DataFrame(_rows).sort("date", descending=True) + mo.md(f"### Sentiment Analysis ({len(sentiment_df)} meetings)") + return (sentiment_df,) + +# Cell 2: standalone display — just the DataFrame, nothing else +@app.cell +def show_sentiment_table(sentiment_df): + sentiment_df +``` + +This pattern makes every result inspectable. The `mo.md()` cell gives a quick count/heading; the display cell lets the user explore the full data interactively. + +### Utility cells (no preview needed) + +Config, setup, and helper cells that only define constants or functions don't need previews: + +```python +@app.cell +def config(): + BASE = "http://localhost:42000" + CONTACTDB = f"{BASE}/contactdb-api" + DATAINDEX = f"{BASE}/dataindex/api/v1" + return CONTACTDB, DATAINDEX + +@app.cell +def helpers(client): + def fetch_all(url, params): + ... 
+    return (fetch_all,)
+```
+
+## Tips
+
+- Use `marimo edit` during development to see cell outputs interactively
+- Make raw API responses the last expression in a cell to inspect their structure
+- Use `polars` over `pandas` for better performance and type safety
+- Set `timeout=30` on httpx clients — some queries over large date ranges are slow
+- Name cells descriptively — function names appear in the marimo sidebar
diff --git a/.agents/skills/project-history/SKILL.md b/.agents/skills/project-history/SKILL.md
new file mode 100644
index 0000000..c8f42f9
--- /dev/null
+++ b/.agents/skills/project-history/SKILL.md
@@ -0,0 +1,364 @@
+---
+name: project-history
+description: Build initial historical timeline for a project. Queries all datasources and creates week-by-week analysis files up to a sync date. Requires project-init to have been run first (datasources.md must exist).
+disable-model-invocation: true
+argument-hint: [project-name] [date-from] [date-to]
+---
+
+# Build Project History
+
+**When to use:** After `/project-init` has been run and the user has reviewed `datasources.md`. This skill gathers historical data and builds the week-by-week timeline.
+
+**Precondition:** `projects/$0/datasources.md` must exist. If it doesn't, run `/project-init $0` first.
+
+## Step 1: Read Datasources
+
+Read `projects/$0/datasources.md` to determine:
+- Which Zulip stream IDs and search terms to query
+- Which git repository to clone/pull
+- Which meeting room names to filter by
+- Which entity types to prioritize
+
+## Step 2: Gather Historical Data
+
+Query data for the period `$1` to `$2`.
+
+### A. Query Zulip
+
+For each PRIMARY stream in datasources.md:
+
+```text
+# Paginate through all threaded conversations
+GET /api/v1/query
+  entity_types=threaded_conversation
+  connector_ids=zulip
+  date_from=$1
+  date_to=$2
+  search={project-search-term}
+  limit=100
+  offset=0
+```
+
+### B. 
Clone/Pull Git Repository
+
+```bash
+# First time
+git clone --depth 200 {url} ./tmp/$0-clone
+# Or if already cloned
+cd ./tmp/$0-clone && git pull
+
+# Extract commit history for the period
+git log --since="$1" --until="$2" --format="%H|%an|%ae|%ad|%s" --date=short
+git log --since="$1" --until="$2" --format="%an" | sort | uniq -c | sort -rn
+```
+
+### C. Query Meeting Recordings
+
+For each PRIMARY meeting room in datasources.md:
+
+```text
+GET /api/v1/query
+  entity_types=meeting
+  date_from=$1
+  date_to=$2
+  room_name={room-name}
+  limit=100
+```
+
+Also do a semantic search for broader coverage:
+
+```text
+POST /api/v1/search
+  search_text={project-name}
+  entity_types=["meeting"]
+  date_from=$1
+  date_to=$2
+  limit=50
+```
+
+## Step 3: Analyze by Week
+
+For each week in the period, create a week file. Group the gathered data into calendar weeks (Monday-Sunday).
+
+For each week, analyze:
+
+1. **Key Decisions** — Strategic choices, architecture changes, vendor selections, security responses
+2. **Technical Work** — Features developed, bug fixes, infrastructure changes, merges/PRs
+3. **Team Activity** — Who was active, new people, departures, role changes
+4. 
**Blockers** — Issues, delays, dependencies + +### Week file template + +**File:** `projects/$0/timeline/{year-month}/week-{n}.md` + +```markdown +# $0 - Week {n}, {Month} {Year} + +**Period:** {date-range} +**Status:** [Active/Quiet/Blocked] + +## Key Decisions + +### Decision Title +- **Decision:** What was decided +- **Date:** {date} +- **Who:** {decision-makers} +- **Impact:** Why it matters +- **Context:** Background + +## Technical Work + +- [{Date}] {Description} - {Who} + +## Team Activity + +### Core Contributors +- **Name:** Focus area + +### Occasional Contributors +- Name: What they contributed + +## GitHub Activity + +**Commits:** {count} +**Focus Areas:** +- Area 1 + +**Key Commits:** +- Hash: Description (Author) + +## Zulip Activity + +**Active Streams:** +- Stream: Topics discussed + +## Current Blockers + +1. Blocker description + +## Milestones Reached + +If any milestones were completed this week, document with business objective: +- **Milestone:** What was achieved +- **Business Objective:** WHY this matters (search for this in discussions, PRs, meetings) +- **Impact:** Quantifiable results if available + +## Next Week Focus + +- Priority 1 + +## Notes + +- Context and observations +- Always try to capture the WHY behind decisions and milestones +``` + +### Categorization principles + +**Key Decisions:** +- Technology migrations +- Architecture changes +- Vendor switches +- Security incidents +- Strategic pivots + +**Technical Work:** +- Feature implementations +- Bug fixes +- Infrastructure changes +- Refactoring + +**Skip Unless Meaningful:** +- Routine check-ins +- Minor documentation updates +- Social chat + +### Contributor types + +**Core Contributors:** Regular commits (multiple per week), active in technical discussions, making architectural decisions, reviewing PRs. + +**Occasional Contributors:** Sporadic commits, topic-specific involvement, testing/QA, feedback only. 
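The Monday–Sunday bucketing described in Step 3 can be sketched with the standard library (a minimal illustration — the function name is my own):

```python
from datetime import date, timedelta

def week_buckets(start: date, end: date) -> list[tuple[date, date]]:
    """Split an inclusive date range into Monday-Sunday calendar weeks."""
    monday = start - timedelta(days=start.weekday())  # snap back to this week's Monday
    buckets = []
    while monday <= end:
        buckets.append((monday, monday + timedelta(days=6)))
        monday += timedelta(days=7)
    return buckets

# A range spanning parts of three weeks yields three buckets:
# (2025-03-03, 2025-03-09), (2025-03-10, 2025-03-16), (2025-03-17, 2025-03-23)
print(week_buckets(date(2025, 3, 5), date(2025, 3, 18)))
```

Each bucket then maps to one `timeline/{year-month}/week-{n}.md` file.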
+ +## Step 4: Create/Update Timeline Index + +**File:** `projects/$0/timeline/index.md` + +```markdown +# $0 Timeline Index + +## {Year} + +### {Quarter} +- [Month Week 1](./{year}-{month}/week-1.md) +- [Month Week 2](./{year}-{month}/week-2.md) + +## Key Milestones + +| Date | Milestone | Business Objective | Status | +|------|-----------|-------------------|--------| +| Mar 2025 | SQLite → PostgreSQL migration | Improve query performance (107ms→27ms) and enable concurrent access for scaling | Complete | +| Jul 2025 | Chakra UI 3 migration | Modernize UI component library and improve accessibility | Complete | + +## Summary by Quarter + +### Q{X} {Year} +- **Milestone 1:** What happened + Business objective +- **Milestone 2:** What happened + Business objective +``` + +## Step 5: Create Project Dashboard (project.md) + +**File:** `projects/$0/project.md` + +Create the **living document** — the entry point showing current status: + +```markdown +# $0 Project + +**One-liner:** [Brief description] +**Status:** [Active/On Hold/Deprecated] +**Last Updated:** [Date] + +--- + +## This Week's Focus + +### Primary Objective +[What the team is working on right now - from the most recent week] + +### Active Work +- [From recent commits and discussions] + +### Blockers +- [Any current blockers] + +--- + +## Last Week's Focus + +### Delivered +- ✅ [What was completed] + +### Decisions Made +- [Key decisions from last week] + +--- + +## Team + +### Core Contributors (Active) +| Name | Focus | Availability | +|------|-------|--------------| +| [From git analysis] | [Area] | Full-time/Part-time | + +### Occasional Contributors +- [Name] - [Role] + +--- + +## Milestones + +### In Progress 🔄 +| Milestone | Target | Business Objective | +|-----------|--------|-------------------| +| [Active milestones from the data] | [Date] | [WHY this matters] | + +### Recently Completed ✅ +| Milestone | Date | Business Objective | +|-----------|------|-------------------| +| [Recently completed] 
| [Date] | [WHY this mattered] | + +### Lost in Sight / Paused ⏸️ +| Milestone | Status | Reason | +|-----------|--------|--------| +| [If any] | Paused | [Why] | + +--- + +## Recent Decisions + +### Week [N] (Current) +- **[Decision]** - [Context from data] + +--- + +## Quick Links + +- [📊 Timeline](./timeline/index.md) - Week-by-week history +- [📋 Background](./background.md) - Project architecture +- [🔌 Data Sources](./datasources.md) - How to gather information + +--- + +*This is a living document. It reflects the current state and changes frequently.* +``` + +**Fill in from the analyzed data:** +- Team members from git contributors +- Current focus from the most recent week's activity +- Milestones from major features/deployments found in the data +- Recent decisions from meeting transcripts and Zulip discussions + +## Step 6: Update Sync State + +Update `projects/$0/sync-state.md`: + +```markdown +# Sync State + +status: history_complete +created_at: {original date} +last_sync_date: $2 +initial_history_from: $1 +initial_history_to: $2 +``` + +## Common Patterns + +### Security Incident +```markdown +### Security Incident: {CVE-ID} +- **Discovered:** {date} +- **Severity:** CRITICAL/HIGH/MEDIUM +- **Who:** {discoverers} +- **Impact:** {description} +- **Actions:** + 1. Immediate fix + 2. Secrets rotated + 3. 
Monitoring added +``` + +### Technology Migration +```markdown +### Migration: {Old} -> {New} +- **Decision:** {date} +- **Who:** {decision-makers} +- **Timeline:** {duration} +- **Rationale:** {why} ← Always include the business objective +- **Status:** Complete/In Progress/Planned +``` + +**Important:** When documenting any milestone or decision, always search for and include the WHY: +- Performance improvements (quantify if possible: "reduced from X to Y") +- Business capabilities enabled ("allows concurrent access for scaling") +- User experience improvements ("improves accessibility") +- Risk mitigation ("addresses security vulnerability") +- Cost reduction ("eliminates cloud dependency") + +Look for this context in: meeting recordings, Zulip planning threads, PR descriptions, release notes. + +### Team Change +```markdown +### Team: {Name} {Joined/Left/Role Change} +- **Date:** {date} +- **From:** {old role} (if applicable) +- **To:** {new role} +- **Impact:** {on project} +``` + +## Key Rules + +- **Link to sources**: Always reference commit hashes, PR numbers, Zulip topic names, meeting dates +- **Be explicit about exclusions**: Document what streams/sources you're NOT analyzing and why +- **Write once**: Week files are historical records — don't modify them after creation +- **Paginate all queries**: Result sets can be large, always loop through all pages diff --git a/.agents/skills/project-init/SKILL.md b/.agents/skills/project-init/SKILL.md new file mode 100644 index 0000000..898a2a7 --- /dev/null +++ b/.agents/skills/project-init/SKILL.md @@ -0,0 +1,264 @@ +--- +name: project-init +description: Initialize a new project analysis. Creates directory structure, discovers relevant data sources (Zulip streams, git repos, meeting rooms), and writes datasources.md, background.md skeleton, and sync-state.md. +disable-model-invocation: true +argument-hint: [project-name] +--- + +# Initialize Project Analysis + +**When to use:** Starting analysis of a new project. 
This skill sets up the project structure and discovers data sources. It does NOT gather historical data — use `/project-history` for that after reviewing the datasources. + +## Step 1: Create Project Structure + +```bash +mkdir -p projects/$0/timeline +``` + +## Step 2: Discover and Document Data Sources + +Investigate what data sources exist for this project. Use the [connectors skill](../connectors/SKILL.md) and [company skill](../company/SKILL.md) for reference. + +### Discovery process + +1. **Zulip streams**: Search DataIndex for `threaded_conversation` entities matching the project name. Note which stream IDs appear. Cross-reference with the company skill's Zulip channel list to identify primary vs. secondary streams. +2. **Git repositories**: Ask the user for the repository URL, or search Gitea/GitHub if accessible. +3. **Meeting rooms**: Search DataIndex for `meeting` entities matching the project name. Note which `room_name` values appear — these are the relevant meeting rooms. +4. **Search terms**: Identify the project name, key technologies, and domain-specific terms that surface relevant data. +5. **Entity type priority**: Determine which entity types are most relevant (typically `threaded_conversation`, `meeting`, and possibly `email`). + +### Write datasources.md + +**File:** `projects/$0/datasources.md` + +```markdown +# $0 - Data Sources + +## Zulip Streams + +### PRIMARY Streams (Analyze All) +| Stream ID | Name | Topics | Priority | What to Look For | +|-----------|------|--------|----------|------------------| +| XXX | stream-name | N topics | CRITICAL | Development discussions | + +### SECONDARY Streams (Selective) +| Stream ID | Name | Topics to Analyze | Context | +|-----------|------|-------------------|---------| +| YYY | integration-stream | specific-topic | Integration work | + +### EXCLUDE +- stream-id-1: reason +- stream-id-2: reason + +## Git Repository + +**URL:** https://... 
+ +**Commands:** +``` +git clone {url} ./tmp/$0-clone +cd ./tmp/$0-clone +git log --format="%H|%an|%ae|%ad|%s" --date=short > commits.csv +git log --format="%an|%ae" | sort | uniq -c | sort -rn +``` + +## Meeting Rooms + +### PRIMARY +- room-name: Project-specific discussions + +### SECONDARY (Context Only) +- allhands: General updates + +### EXCLUDE +- personal-rooms: Other projects + +## Search Terms + +### Primary +- project-name +- key-technology-1 + +### Technical +- architecture-term-1 + +## Entity Types Priority +1. threaded_conversation (Zulip) +2. meeting (recordings) +3. [Exclude: calendar, email, document if not relevant] +``` + +## Step 3: Create Project Dashboard (Living Document) + +**File:** `projects/$0/project.md` + +This is the **entry point** — the living document showing current status. + +```markdown +# $0 Project + +**One-liner:** [Brief description] +**Status:** [Active/On Hold/Deprecated] +**Repository:** URL +**Last Updated:** [Date] + +--- + +## This Week's Focus + +### Primary Objective +[What the team is working on right now] + +### Active Work +- [Current task 1] +- [Current task 2] + +### Blockers +- [Any blockers] + +--- + +## Last Week's Focus + +### Delivered +- ✅ [What was completed] + +### Decisions Made +- [Key decisions from last week] + +--- + +## Team + +### Core Contributors (Active) +| Name | Focus | Availability | +|------|-------|--------------| +| [Name] | [Area] | Full-time/Part-time | + +### Occasional Contributors +- [Name] - [Role] + +--- + +## Milestones + +### In Progress 🔄 +| Milestone | Target | Business Objective | +|-----------|--------|-------------------| +| [Name] | [Date] | [WHY this matters] | + +### Recently Completed ✅ +| Milestone | Date | Business Objective | +|-----------|------|-------------------| +| [Name] | [Date] | [WHY this mattered] | + +### Lost in Sight / Paused ⏸️ +| Milestone | Status | Reason | +|-----------|--------|--------| +| [Name] | Paused | [Why paused] | + +--- + +## Recent 
Decisions + +### Week [N] (Current) +- **[Decision]** - [Context] + +### Week [N-1] +- **[Decision]** - [Context] + +--- + +## Quick Links + +- [📊 Timeline](./timeline/index.md) - Week-by-week history +- [📋 Background](./background.md) - Project architecture and details +- [🔌 Data Sources](./datasources.md) - How to gather information +- [⚙️ Sync State](./sync-state.md) - Last sync information + +--- + +*This is a living document. It reflects the current state and changes frequently.* +``` + +## Step 4: Create Background Skeleton + +**File:** `projects/$0/background.md` + +Static/architecture information that rarely changes. + +```markdown +# $0 - Background + +**Type:** [Web app/Mobile app/Library/Service] +**Repository:** URL + +## What is $0? + +[Brief description of what the project does] + +## Architecture + +### Components +- Component 1 - Purpose +- Component 2 - Purpose + +### Technology Stack +- Technology 1 - Usage +- Technology 2 - Usage + +## Data Sources + +See: [datasources.md](./datasources.md) + +## Timeline Structure + +Weekly timeline files are organized in the `timeline/` directory. + +## How This Project Is Updated + +1. Gather Data: Query Zulip, Git, meetings +2. Update Timeline: Create week-by-week entries +3. Update Project Dashboard: Refresh [project.md](./project.md) + +For current status, see: [project.md](./project.md) +``` + +## Step 5: Create Timeline Index + +**File:** `projects/$0/timeline/index.md` + +```markdown +# $0 Timeline Index + +## Key Milestones + +| Date | Milestone | Status | +|------|-----------|--------| +| [To be filled by project-history] | | | + +## Summary by Quarter + +[To be filled by project-history] +``` + +## Step 6: Initialize Sync State + +**File:** `projects/$0/sync-state.md` + +```markdown +# Sync State + +status: initialized +created_at: [today's date] +last_sync_date: null +initial_history_from: null +initial_history_to: null +``` + +## Done + +After this skill completes, the user should: +1. 
**Review `datasources.md`** — confirm the streams, repos, and meeting rooms are correct +2. **Edit `background.md`** — fill in any known project details +3. **Run `/project-history $0 [date-from] [date-to]`** — to build the initial historical timeline diff --git a/.agents/skills/project-sync/SKILL.md b/.agents/skills/project-sync/SKILL.md new file mode 100644 index 0000000..ad86174 --- /dev/null +++ b/.agents/skills/project-sync/SKILL.md @@ -0,0 +1,344 @@ +--- +name: project-sync +description: Sync a project timeline using subagents for parallelism. Splits work by week and datasource to stay within context limits. Handles both first-time and incremental syncs. +disable-model-invocation: true +argument-hint: [project-name] +--- + +# Project Sync + +**When to use:** Keep a project timeline up to date. Works whether the project has been synced before or not. + +**Precondition:** `projects/$0/datasources.md` must exist. If it doesn't, run `/project-init $0` first. + +## Architecture: Coordinator + Subagents + +This skill is designed for **subagent execution** to stay within context limits. The main agent acts as a **coordinator** that delegates data-intensive work to subagents. + +``` +Coordinator +├── Phase 1: Gather (parallel subagents, one per datasource) +│ ├── Subagent: Zulip → writes tmp/$0-sync/zulip.md +│ ├── Subagent: Git → writes tmp/$0-sync/git.md +│ └── Subagent: Meetings → writes tmp/$0-sync/meetings.md +│ +├── Phase 2: Synthesize (parallel subagents, one per week) +│ ├── Subagent: Week 1 → writes timeline/{year-month}/week-{n}.md +│ ├── Subagent: Week 2 → writes timeline/{year-month}/week-{n}.md +│ └── ... +│ +└── Phase 3: Finalize (coordinator directly) + ├── timeline/index.md (add links to new weeks) + ├── project.md (update living document) + └── sync-state.md (update sync status) +``` + +--- + +## Coordinator Steps + +### Step 1: Determine Sync Range + +Check whether `projects/$0/sync-state.md` exists. 
+ +**Case A — First sync (no sync-state.md):** +Default range is **last 12 months through today**. If the user provided explicit dates as extra arguments (`$1`, `$2`), use those instead. + +**Case B — Incremental sync (sync-state.md exists):** +Read `last_sync_date` from `projects/$0/sync-state.md`. Range is `last_sync_date` to today. + +### Step 2: Read Datasources + +Read `projects/$0/datasources.md` to determine: +- Zulip stream IDs and search terms +- Git repository URL +- Meeting room names +- Entity types to prioritize + +### Step 3: Prepare Scratch Directory + +```bash +mkdir -p tmp/$0-sync +``` + +This directory holds intermediate outputs from Phase 1 subagents. It is ephemeral — delete it after the sync completes. + +### Step 4: Compute Week Boundaries + +Split the sync range into ISO calendar weeks (Monday–Sunday). Produce a list of `(week_number, week_start, week_end, year_month)` tuples. This list drives Phase 2. + +--- + +## Phase 1: Gather Data (parallel subagents) + +Launch **one subagent per datasource**, all in parallel. Each subagent covers the **full sync range** and writes its output to a scratch file. The output must be organized by week so Phase 2 subagents can consume it. + +### Subagent: Zulip + +**Input:** Sync range, PRIMARY stream IDs and search terms from datasources.md. + +**Important:** `threaded_conversation` entities only contain the **last 50 messages** in a topic. To get complete message history for a week, you must query `conversation_message` entities. + +**Task:** Two-step process for each PRIMARY stream: + +**Step 1:** List all thread IDs in the stream using `id_prefix`: +``` +GET /api/v1/query + entity_types=threaded_conversation + connector_ids=zulip + id_prefix=zulip:stream:{stream_id} + limit=100 + offset=0 +``` + +This returns all thread entities (e.g., `zulip:stream:155:topic_name`). Save these IDs. 
+ +**Step 2:** For each week in the sync range, query messages from each thread: +``` +GET /api/v1/query + entity_types=conversation_message + connector_ids=zulip + parent_id={thread_id} # e.g., zulip:stream:155:standalone + date_from={week_start} + date_to={week_end} + limit=100 + offset=0 +``` + +Paginate through all messages for each thread/week combination. + +**Output:** Write `tmp/$0-sync/zulip.md` with results grouped by week: + +```markdown +## Week {n} ({week_start} to {week_end}) + +### Stream: {stream_name} +- **Topic:** {topic} ({date}, {message_count} messages, {participant_count} participants) + {brief summary or key quote} +``` + +### Subagent: Git + +**Input:** Sync range, git repository URL from datasources.md. + +**Task:** + +**Important:** Git commands may fail due to gitconfig permission issues. Use a temporary HOME directory: + +```bash +# Set temporary HOME to avoid gitconfig permission issues +export HOME=$(pwd)/.tmp-home +mkdir -p ./tmp + +# Clone if needed, pull if exists +if [ -d ./tmp/$0-clone ]; then + export HOME=$(pwd)/.tmp-home && cd ./tmp/$0-clone && git pull +else + export HOME=$(pwd)/.tmp-home && git clone --depth 500 {url} ./tmp/$0-clone + cd ./tmp/$0-clone +fi + +# Get commits in the date range +export HOME=$(pwd)/.tmp-home && git log --since="{range_start}" --until="{range_end}" --format="%H|%an|%ae|%ad|%s" --date=short + +# Get contributor statistics +export HOME=$(pwd)/.tmp-home && git log --since="{range_start}" --until="{range_end}" --format="%an" | sort | uniq -c | sort -rn +``` + +**Output:** Write `tmp/$0-sync/git.md` with results grouped by week: + +```markdown +## Week {n} ({week_start} to {week_end}) + +**Commits:** {count} +**Contributors:** {name} ({count}), {name} ({count}) + +### Key Commits +- `{short_hash}` {subject} — {author} ({date}) +``` + +### Subagent: Meetings + +**Input:** Sync range, meeting room names from datasources.md. 
+ +**Task:** For each PRIMARY room, query meetings and run semantic search: + +``` +GET /api/v1/query + entity_types=meeting + date_from={range_start} + date_to={range_end} + room_name={room-name} + limit=100 + +POST /api/v1/search + search_text={project-name} + entity_types=["meeting"] + date_from={range_start} + date_to={range_end} + limit=50 +``` + +**Output:** Write `tmp/$0-sync/meetings.md` with results grouped by week: + +```markdown +## Week {n} ({week_start} to {week_end}) + +### Meeting: {title} ({date}, {room}) +**Participants:** {names} +**Summary:** {brief summary} +**Key points:** +- {point} +``` + +--- + +## Phase 2: Synthesize Week Files (parallel subagents) + +After all Phase 1 subagents complete, launch **one subagent per week**, all in parallel. Each produces a single week file. + +### Subagent: Week {n} + +**Input:** The relevant `## Week {n}` sections extracted from each of: +- `tmp/$0-sync/zulip.md` +- `tmp/$0-sync/git.md` +- `tmp/$0-sync/meetings.md` + +Pass only the sections for this specific week — do NOT pass the full files. + +**Task:** Merge and analyze the data from all three sources. Categorize into: + +1. **Key Decisions** — Technology migrations, architecture changes, vendor switches, security incidents, strategic pivots +2. **Technical Work** — Feature implementations, bug fixes, infrastructure changes +3. **Team Activity** — Core vs. occasional contributors, role changes +4. **Blockers** — Issues, delays, dependencies + +**Milestones:** When documenting milestones, capture BOTH: +- **WHAT** — The technical achievement (e.g., "PostgreSQL migration") +- **WHY** — The business objective (e.g., "to improve query performance from 107ms to 27ms and enable concurrent access for scaling") + +Search for business objectives in: meeting discussions about roadmap, Zulip threads about planning, PR descriptions, release notes, and any "why are we doing this" conversations. 
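Slicing a single week's section out of a Phase 1 scratch file — so each week subagent receives only its own data — can be sketched as follows (the helper name and regex are my own):

```python
import re

def extract_week_section(scratch_text: str, week_n: int) -> str:
    """Return the '## Week {n}' section of a scratch file, or '' if absent."""
    # (?m) lets ^ match line starts; (?s) lets . span newlines.
    # The trailing space after the week number keeps week 1 from matching week 10.
    pattern = rf"(?ms)^## Week {week_n} .*?(?=^## |\Z)"
    match = re.search(pattern, scratch_text)
    return match.group(0) if match else ""
```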
+ +**Skip unless meaningful:** Routine check-ins, minor documentation updates, social chat. + +**Output:** Write `projects/$0/timeline/{year-month}/week-{n}.md` using the week file template from [project-history](../project-history/SKILL.md). Also return a **3-5 line summary** to the coordinator for use in Phase 3. + +Create the month directory first if needed: `mkdir -p projects/$0/timeline/{year-month}` + +--- + +## Phase 3: Finalize (coordinator directly) + +The coordinator collects the summaries returned by all Phase 2 subagents. These summaries are small enough to fit in the coordinator's context. + +### Step 5: Update Timeline Index + +Add links to new week files in `projects/$0/timeline/index.md`. Append entries under the appropriate year/quarter sections. Update milestones if any were reached. + +### Step 6: Update Project Dashboard (project.md) + +**File:** `projects/$0/project.md` + +This is the **living document** — update it with current status from the week summaries: + +**Update these sections:** + +1. **This Week's Focus** - What the team is actively working on now +2. **Last Week's Focus** - What was completed in the most recent week +3. **Team** - Current contributors and their focus areas +4. **Milestones** - Update status and add new ones with business objectives +5. 
**Recent Decisions** - Key decisions from the last 2-3 weeks + +**Milestone Format:** +```markdown +### In Progress 🔄 +| Milestone | Target | Business Objective | +|-----------|--------|-------------------| +| Standalone deployment | Feb 2026 | Enable non-developers to self-host without complex setup | + +### Recently Completed ✅ +| Milestone | Date | Business Objective | +|-----------|------|-------------------| +| PostgreSQL migration | Mar 2025 | Improve performance (107ms→27ms) and enable scaling | + +### Lost in Sight / Paused ⏸️ +| Milestone | Status | Reason | +|-----------|--------|--------| +| Feature X | Paused | Resources reallocated to higher priority | +``` + +**Note:** Milestones in this company change frequently — update status (in progress/done/paused) as needed. + +### Step 7: Update Sync State + +Create or update `projects/$0/sync-state.md`: + +**First sync (Case A):** + +```markdown +# Sync State + +status: synced +created_at: {today's date} +last_sync_date: {today's date} +initial_history_from: {range_start} +initial_history_to: {range_end} +last_incremental_sync: {today's date} +``` + +**Incremental sync (Case B):** + +```markdown +# Sync State + +status: synced +created_at: {original value} +last_sync_date: {today's date} +initial_history_from: {original value} +initial_history_to: {original value} +last_incremental_sync: {today's date} +``` + +### Step 8: Cleanup + +```bash +rm -rf tmp/$0-sync +``` + +### Step 9: Summary Report + +Output a brief summary: + +```markdown +## Sync Summary: {Date} + +### Period Covered +{range_start} to {range_end} + +### Key Changes +1. Decision: {brief description} +2. Feature: {what was built} +3. 
Team: {who joined/left} + +### Metrics +- {n} new commits +- {n} active contributors +- {n} weeks analyzed +- {n} new Zulip threads +- {n} meetings recorded + +### Current Status +[Status description] +``` + +--- + +## Key Rules + +- **Link to sources**: Always reference commit hashes, PR numbers, Zulip topic names, meeting dates +- **Be explicit about exclusions**: Document what you're NOT analyzing and why +- **Write once**: Week files are historical records — don't modify existing ones, only create new ones +- **Paginate all queries**: Always loop through all pages of results +- **Distinguish contributor types**: Core (regular activity) vs. occasional (sporadic) +- **Subagent isolation**: Each subagent should be self-contained. Pass only the data it needs — never the full scratch files +- **Fail gracefully**: If a datasource subagent fails (e.g., git clone errors, API down), the coordinator should continue with available data and note the gap in the summary diff --git a/.agents/skills/workflow/SKILL.md b/.agents/skills/workflow/SKILL.md new file mode 100644 index 0000000..5af81f4 --- /dev/null +++ b/.agents/skills/workflow/SKILL.md @@ -0,0 +1,105 @@ +--- +name: workflow +description: Create a marimo notebook for data analysis. Use when the request involves analysis over time periods, large data volumes, or when the user asks to "create a workflow". +disable-model-invocation: true +argument-hint: [topic] +--- + +# Workflow — Create a Marimo Notebook + +## When to create a marimo notebook + +Any request that involves **analysis over a period of time** (e.g., "meetings this month", "emails since January", "interaction trends") is likely to return a **large volume of data** — too much to process inline. In these cases, **always produce a marimo notebook** (a `.py` file following the patterns in the [notebook-patterns skill](.agents/skills/notebook-patterns/SKILL.md)). 
+ +Also create a notebook when the user asks to "create a workflow", "write a workflow", or "build an analysis". + +If you're unsure whether a question is simple enough to answer directly or needs a notebook, **ask the user**. + +## Always create a new workflow + +When the user requests a workflow, **always create a new notebook file**. Do **not** modify or re-run an existing workflow unless the user explicitly asks you to (e.g., "update workflow 001", "fix the sentiment notebook", "re-run the existing analysis"). Each new request gets its own sequentially numbered file — even if it covers a similar topic to an earlier workflow. + +## File naming and location + +All notebooks go in the **`workflows/`** directory. Use a sequential number prefix so workflows stay ordered by creation: + +``` +workflows/<NNN>_<topic>_<scope>.py +``` + +- `<NNN>` — zero-padded sequence number (`001`, `002`, …). Look at existing files in `workflows/` to determine the next number. +- `<topic>` — what is being analyzed, in snake_case (e.g., `greyhaven_meetings`, `alice_emails`, `hiring_discussions`) +- `<scope>` — time range or qualifier (e.g., `january`, `q1_2026`, `last_30d`, `all_time`) + +**Examples:** + +``` +workflows/001_greyhaven_meetings_january.py +workflows/002_alice_emails_q1_2026.py +workflows/003_hiring_discussions_last_30d.py +workflows/004_team_interaction_timeline_all_time.py +``` + +**Before creating a new workflow**, list existing files in `workflows/` to find the highest number and increment it. + +## Plan before you implement + +Before writing any notebook, **always propose a plan first** and get the user's approval. The plan should describe: + +1. **Goal** — What question are we answering? +2. **Data sources** — Which entity types and API endpoints will be used? +3. **Algorithm / ETL steps** — Step-by-step description of the data pipeline: what gets fetched, how it's filtered, joined, or aggregated, and what the final output looks like. +4. 
**Output format** — Table columns, charts, or summary statistics the user will see. + +Only proceed to implementation after the user confirms the plan. + +## Validate before delivering + +After writing or editing a notebook, **always run `uvx marimo check`** to verify it has no structural errors (duplicate variables, undefined names, branch expressions, etc.): + +```bash +uvx marimo check workflows/NNN_topic_scope.py +``` + +A clean check (no output, exit code 0) means the notebook is valid. Fix any errors before delivering the notebook to the user. + +## Steps + +1. **Identify people** — Use ContactDB to resolve names/emails to `contact_id` values. For "me"/"my" questions, always start with `GET /api/contacts/me`. +2. **Find data** — Use DataIndex `GET /query` (exhaustive, paginated) or `POST /search` (semantic, ranked) with `contact_ids`, `entity_types`, `date_from`/`date_to`, `connector_ids` filters. +3. **Analyze** — For simple answers, process the API response directly. For complex multi-step analysis, build a marimo notebook (see the [notebook-patterns skill](.agents/skills/notebook-patterns/SKILL.md) for detailed patterns). + +## Quick Example (Python) + +> "Find all emails involving Alice since January" + +```python +import httpx + +CONTACTDB = "http://localhost:42000/contactdb-api" +DATAINDEX = "http://localhost:42000/dataindex/api/v1" +client = httpx.Client(timeout=30) + +# 1. Resolve "Alice" to a contact_id +resp = client.get(f"{CONTACTDB}/api/contacts", params={"search": "Alice"}) +alice_id = resp.json()["contacts"][0]["id"] # e.g. 42 + +# 2. 
Fetch all emails involving Alice (with pagination) +emails = [] +offset = 0 +while True: + resp = client.get(f"{DATAINDEX}/query", params={ + "entity_types": "email", + "contact_ids": str(alice_id), + "date_from": "2025-01-01T00:00:00Z", + "limit": 50, + "offset": offset, + }) + data = resp.json() + emails.extend(data["items"]) + if offset + 50 >= data["total"]: + break + offset += 50 + +print(f"Found {len(emails)} emails involving Alice") +``` diff --git a/AGENTS.md b/AGENTS.md index cf9e0ca..82a327e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -2,16 +2,19 @@ The InternalAI platform aggregates company data from email, calendars, Zulip chat, meetings, and documents into two core APIs. These docs give LLM agents the context they need to build programmatic workflows — typically as marimo notebooks — that answer analytical questions about people and their interactions. +As an agent, assume you're running within our greywall sandbox. + ## Routing Table -| I need to... | Read | -|---------------------------------------------|-------------------------------| -| Know who the user is and what they care about | [MYSELF.md] | -| Understand the company and its tools | [company-context.md] | -| Look up people, contacts, relationships | [contactdb-api.md] | -| Query emails, meetings, chats, documents | [dataindex-api.md] | -| Know which connector provides what data | [connectors-and-sources.md] | -| Create a marimo analysis notebook | [notebook-patterns.md] | +| I need to... 
| Read | +|---------------------------------------------|---------------------------------------------------| +| Know who the user is and what they care about | [MYSELF.md] | +| Understand the company and its tools | [company skill] | +| Look up people, contacts, relationships | [contactdb skill] | +| Query emails, meetings, chats, documents | [dataindex skill] | +| Know which connector provides what data | [connectors skill] | +| Create a marimo analysis notebook | [workflow skill] + [notebook-patterns skill] | +| Build a weekly checkout | [checkout skill] | ## About the User @@ -28,8 +31,8 @@ If `MYSELF.md` does not exist, ask the user to copy `MYSELF.example.md` to `MYSE | Service | Swagger UI | OpenAPI JSON | |------------|---------------------------------------------------|----------------------------------------| -| ContactDB | `http://localhost:42000/contactdb-api/docs` | `/contactdb-api/openapi.json` | -| DataIndex | `http://localhost:42000/dataindex/docs` | `/dataindex/openapi.json` | +| ContactDB | `http://localhost:42000/contactdb-api/docs` (direct), or `http://caddy/contactdb-api/docs` (via greywall sandbox) | `/contactdb-api/openapi.json` | +| DataIndex | `http://localhost:42000/dataindex/docs` (direct), or `http://caddy/dataindex/docs` (via greywall sandbox) | `/dataindex/openapi.json` | ## Common Questions → API Calls @@ -51,117 +54,22 @@ Use this table to translate natural language questions into API calls. The base **Key pattern:** Any question about "me" / "my" / "I" requires calling `GET /contactdb-api/api/contacts/me` first to get your `contact_id`, then using that ID in subsequent DataIndex queries. -## Workflow - -### When to create a marimo notebook - -Any request that involves **analysis over a period of time** (e.g., "meetings this month", "emails since January", "interaction trends") is likely to return a **large volume of data** — too much to process inline. 
In these cases, **always produce a marimo notebook** (a `.py` file following the patterns in [notebook-patterns.md]). - -Also create a notebook when the user asks to "create a workflow", "write a workflow", or "build an analysis". - -If you're unsure whether a question is simple enough to answer directly or needs a notebook, **ask the user**. - -### Always create a new workflow - -When the user requests a workflow, **always create a new notebook file**. Do **not** modify or re-run an existing workflow unless the user explicitly asks you to (e.g., "update workflow 001", "fix the sentiment notebook", "re-run the existing analysis"). Each new request gets its own sequentially numbered file — even if it covers a similar topic to an earlier workflow. - -### File naming and location - -All notebooks go in the **`workflows/`** directory. Use a sequential number prefix so workflows stay ordered by creation: - -``` -workflows/__.py -``` - -- `` — zero-padded sequence number (`001`, `002`, …). Look at existing files in `workflows/` to determine the next number. -- `` — what is being analyzed, in snake_case (e.g., `greyhaven_meetings`, `alice_emails`, `hiring_discussions`) -- `` — time range or qualifier (e.g., `january`, `q1_2026`, `last_30d`, `all_time`) - -**Examples:** - -``` -workflows/001_greyhaven_meetings_january.py -workflows/002_alice_emails_q1_2026.py -workflows/003_hiring_discussions_last_30d.py -workflows/004_team_interaction_timeline_all_time.py -``` - -**Before creating a new workflow**, list existing files in `workflows/` to find the highest number and increment it. - -### Plan before you implement - -Before writing any notebook, **always propose a plan first** and get the user's approval. The plan should describe: - -1. **Goal** — What question are we answering? -2. **Data sources** — Which entity types and API endpoints will be used? -3. 
**Algorithm / ETL steps** — Step-by-step description of the data pipeline: what gets fetched, how it's filtered, joined, or aggregated, and what the final output looks like. -4. **Output format** — Table columns, charts, or summary statistics the user will see. - -Only proceed to implementation after the user confirms the plan. - -### Validate before delivering - -After writing or editing a notebook, **always run `uvx marimo check`** to verify it has no structural errors (duplicate variables, undefined names, branch expressions, etc.): - -```bash -uvx marimo check workflows/NNN_topic_scope.py -``` - -A clean check (no output, exit code 0) means the notebook is valid. Fix any errors before delivering the notebook to the user. - -### Steps - -1. **Identify people** — Use ContactDB to resolve names/emails to `contact_id` values. For "me"/"my" questions, always start with `GET /api/contacts/me`. -2. **Find data** — Use DataIndex `GET /query` (exhaustive, paginated) or `POST /search` (semantic, ranked) with `contact_ids`, `entity_types`, `date_from`/`date_to`, `connector_ids` filters. -3. **Analyze** — For simple answers, process the API response directly. For complex multi-step analysis, build a marimo notebook (see [notebook-patterns.md]). - -### Quick Example (Python) - -> "Find all emails involving Alice since January" - -```python -import httpx - -CONTACTDB = "http://localhost:42000/contactdb-api" -DATAINDEX = "http://localhost:42000/dataindex/api/v1" -client = httpx.Client(timeout=30) - -# 1. Resolve "Alice" to a contact_id -resp = client.get(f"{CONTACTDB}/api/contacts", params={"search": "Alice"}) -alice_id = resp.json()["contacts"][0]["id"] # e.g. 42 - -# 2. 
Fetch all emails involving Alice (with pagination) -emails = [] -offset = 0 -while True: - resp = client.get(f"{DATAINDEX}/query", params={ - "entity_types": "email", - "contact_ids": str(alice_id), - "date_from": "2025-01-01T00:00:00Z", - "limit": 50, - "offset": offset, - }) - data = resp.json() - emails.extend(data["items"]) - if offset + 50 >= data["total"]: - break - offset += 50 - -print(f"Found {len(emails)} emails involving Alice") -``` - ## File Index - [MYSELF.md] — User identity, role, collaborators, and preferences (gitignored, copy from `MYSELF.example.md`) -- [company-context.md] — Business context, team structure, vocabulary -- [contactdb-api.md] — ContactDB entities and REST endpoints -- [dataindex-api.md] — DataIndex entity types, query modes, REST endpoints -- [connectors-and-sources.md] — Connector-to-entity-type mapping -- [notebook-patterns.md] — Marimo notebook patterns and common API workflows +- [company skill] — Business context, team structure, vocabulary +- [contactdb skill] — ContactDB entities and REST endpoints +- [dataindex skill] — DataIndex entity types, query modes, REST endpoints +- [connectors skill] — Connector-to-entity-type mapping +- [workflow skill] — How to create marimo analysis notebooks +- [notebook-patterns skill] — Marimo notebook patterns and common API workflows +- [checkout skill] — Weekly review builder [MYSELF.md]: ./MYSELF.md -[company-context.md]: ./docs/company-context.md -[contactdb-api.md]: ./docs/contactdb-api.md -[dataindex-api.md]: ./docs/dataindex-api.md -[connectors-and-sources.md]: ./docs/connectors-and-sources.md -[notebook-patterns.md]: ./docs/notebook-patterns.md +[company skill]: ./.agents/skills/company/SKILL.md +[contactdb skill]: ./.agents/skills/contactdb/SKILL.md +[dataindex skill]: ./.agents/skills/dataindex/SKILL.md +[connectors skill]: ./.agents/skills/connectors/SKILL.md +[workflow skill]: ./.agents/skills/workflow/SKILL.md +[notebook-patterns skill]: 
./.agents/skills/notebook-patterns/SKILL.md +[checkout skill]: ./.agents/skills/checkout/SKILL.md diff --git a/MYSELF.example.md b/MYSELF.example.md deleted file mode 100644 index 195ee04..0000000 --- a/MYSELF.example.md +++ /dev/null @@ -1,28 +0,0 @@ -# About Me - -Copy this file to `MYSELF.md` and fill in your details. The agent reads it to personalize workflows and understand your role. `MYSELF.md` is gitignored — it stays local and private. - -## Identity - -- **Name:** Your Name -- **Role:** e.g. Engineering Lead, Product Manager, Designer -- **Contact ID** Your contact id from the contactdb - useful to prevent a call of me - -## What I work on - -Brief description of your current projects, responsibilities, or focus areas. This helps the agent scope queries — e.g., if you work on GreyHaven, the agent can default to filtering meetings/emails related to that project. - -## People I work with frequently - -List the names of people you interact with most. The agent can use these to suggest relevant filters or default `TARGET_PERSON` values in workflows. - -- Alice — role or context -- Bob — role or context - -## Preferences - -Any preferences for how you want workflows or analysis structured: - -- **Default date range:** e.g. "last 30 days", "current quarter" -- **Preferred output format:** e.g. "tables with counts", "timeline view" -- **Topics of interest:** e.g. "hiring", "client feedback", "sprint blockers" diff --git a/README.md b/README.md index 3050255..756be32 100644 --- a/README.md +++ b/README.md @@ -1,90 +1,77 @@ -# InternalAI Agent +# InternalAI Workspace -A documentation and pattern library that gives LLM agents the context they need to build data analysis workflows against Monadical's internal systems — ContactDB (people directory) and DataIndex (unified data from email, calendar, Zulip, meetings, documents). 
+Agent-assisted workspace for analyzing and tracking Monadical's internal data: meetings, emails, Zulip conversations, calendar events, documents, and git activity. -The goal is to use [opencode](https://opencode.ai) (or any LLM-powered coding tool) to iteratively create [marimo](https://marimo.io) notebook workflows that query and analyze company data. +## Skills -## Setup +Skills are agent instructions stored in `.agents/skills/`. They follow the [Agent Skills](https://agentskills.io) standard (same structure as `.claude/skills/`). Some are invoked by the user via `/name`, others are background knowledge the agent loads automatically when relevant. -1. Install [opencode](https://opencode.ai) -2. Make sure InternalAI is running locally (ContactDB + DataIndex accessible via http://localhost:42000) -3. Configure LiteLLM — add to `~/.config/opencode/config.json`: +### Task Skills (user-invoked) -```json -{ - "$schema": "https://opencode.ai/config.json", - "provider": { - "litellm": { - "npm": "@ai-sdk/openai-compatible", - "name": "Litellm", - "options": { - "baseURL": "https://litellm.app.monadical.io", - "apiKey": "xxxxx" - }, - "models": { - "Kimi-K2.5-dev": { - "name": "Kimi-K2.5-dev" - } - } - } - } -} -``` +These are workflows you trigger explicitly. The agent will not run them on its own. -Replace `xxxxx` with your actual LiteLLM API key. +| Skill | Invocation | Purpose | +|-------|-----------|---------| +| **project-init** | `/project-init [name]` | Set up a new project: create directory structure, discover data sources (Zulip streams, git repos, meeting rooms), write `datasources.md` and `background.md` skeleton. Stops before gathering data so you can review the sources. | +| **project-history** | `/project-history [name] [from] [to]` | Build the initial timeline for a project. Queries all datasources for a date range, creates week-by-week analysis files, builds the timeline index, and synthesizes the background. Requires `project-init` first. 
| +| **project-sync** | `/project-sync [name]` | Incremental update of a project timeline. Reads the last sync date from `sync-state.md`, fetches new data through today, creates new week files, and refreshes the timeline and background. | +| **checkout** | `/checkout` | Build a weekly review (Sunday through today). Gathers meetings, emails, Zulip conversations, and Gitea activity, then produces a structured checkout summary. | +| **workflow** | `/workflow [topic]` | Create a marimo notebook for data analysis. Use for any request involving analysis over time periods or large data volumes. | -4. **Set up your profile** — copy the example and fill in your name, role, and contact ID so the agent can personalize workflows: +### Reference Skills (agent-loaded automatically) -```bash -cp MYSELF.example.md MYSELF.md -``` +These provide background knowledge the agent loads when relevant. They don't appear in the `/` menu. -5. **(Optional) LLM filtering in workflows** — if your workflows need to classify or score entities via an LLM, copy `.env.example` to `.env` and fill in your key: +| Skill | What the agent learns | +|-------|----------------------| +| **connectors** | Which data connectors exist and what entity types they produce (reflector, zulip, email, calendar, etc.) | +| **dataindex** | How to query the DataIndex REST API (`GET /query`, `POST /search`, `GET /entities/{id}`) | +| **contactdb** | How to resolve people to contact IDs via the ContactDB REST API | +| **company** | Monadical org structure, Zulip channel layout, communication tools, meeting/calendar relationships | +| **notebook-patterns** | Marimo notebook rules: cell scoping, async patterns, pagination helpers, analysis templates | -```bash -cp .env.example .env -``` +## Project Tracking -The `workflows/lib` module provides an `llm_call` helper (using [mirascope](https://mirascope.io)) for structured LLM calls — see Pattern 5 in `docs/notebook-patterns.md`. +Project analysis files live in `projects/`. 
See [projects/README.md](projects/README.md) for the directory structure and categorization guidelines. -## Quickstart - -1. Run `opencode` from the project root -2. Ask it to create a workflow, e.g.: *"Create a workflow that shows all meetings about Greyhaven in January"* -3. The agent reads `AGENTS.md`, proposes a plan, and generates a notebook like `workflows/001_greyhaven_meetings_january.py` -4. Run it: `uvx marimo edit workflows/001_greyhaven_meetings_january.py` -5. Iterate — review the output in marimo, go back to opencode and ask for refinements - -## How AGENTS.md is Structured - -`AGENTS.md` is the entry point that opencode reads automatically. It routes the agent to the right documentation: - -| Topic | File | -|-------|------| -| Your identity, role, preferences | `MYSELF.md` (copy from `MYSELF.example.md`) | -| Company context, tools, connectors | `docs/company-context.md` | -| People, contacts, relationships | `docs/contactdb-api.md` | -| Querying emails, meetings, chats, docs | `docs/dataindex-api.md` | -| Connector-to-entity-type mappings | `docs/connectors-and-sources.md` | -| Notebook templates and patterns | `docs/notebook-patterns.md` | - -It also includes API base URLs, a translation table mapping natural-language questions to API calls, and rules for when/how to create workflow notebooks. 
- -## Project Structure +**Typical workflow:** ``` -internalai-agent/ -├── AGENTS.md # LLM agent routing guide (entry point) -├── MYSELF.example.md # User profile template (copy to MYSELF.md) -├── .env.example # LLM credentials template -├── docs/ -│ ├── company-context.md # Monadical org, tools, key concepts -│ ├── contactdb-api.md # ContactDB REST API reference -│ ├── dataindex-api.md # DataIndex REST API reference -│ ├── connectors-and-sources.md # Connector → entity type mappings -│ └── notebook-patterns.md # Marimo notebook templates and patterns -└── workflows/ - └── lib/ # Shared helpers for notebooks - ├── __init__.py - └── llm.py # llm_call() — structured LLM calls via mirascope +/project-init myproject # 1. Discover sources, create skeleton +# Review datasources.md, adjust if needed +/project-history myproject 2025-06-01 2026-02-17 # 2. Backfill history +# ... time passes ... +/project-sync myproject # 3. Incremental update ``` + +Each project produces: + +``` +projects/{name}/ +├── datasources.md # Where to find data (Zulip streams, git repos, meeting rooms) +├── background.md # Living doc: current status, team, architecture +├── sync-state.md # Tracks last sync date for incremental updates +└── timeline/ + ├── index.md # Navigation and milestones + └── {year-month}/ + └── week-{n}.md # One week of history (write-once) +``` + +## Data Analysis Workflows + +Analysis notebooks live in `workflows/`. Each is a marimo `.py` file. + +``` +/workflow meetings-with-alice # Creates workflows/NNN_meetings_with_alice.py +``` + +See the [workflow skill](.agents/skills/workflow/SKILL.md) for naming conventions and the [notebook-patterns skill](.agents/skills/notebook-patterns/SKILL.md) for marimo coding rules. 
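The `NNN_` prefix above is just the next free sequence number in `workflows/`. A minimal sketch of how that number might be computed (illustrative only — `next_workflow_name` is a hypothetical helper, not part of the repo; the authoritative naming rules live in the workflow skill):

```python
import re
from pathlib import Path

def next_workflow_name(workflows_dir: str, slug: str) -> str:
    """Scan workflows_dir for NNN_*.py files and return the next
    sequentially numbered filename, e.g. 003_meetings_with_alice.py."""
    nums = [
        int(m.group(1))
        for p in Path(workflows_dir).glob("*.py")
        if (m := re.match(r"(\d{3})_", p.name))
    ]
    return f"{max(nums, default=0) + 1:03d}_{slug}.py"
```

An empty `workflows/` directory yields a `001_` prefix; each later request increments the number even when it covers a similar topic to an earlier workflow.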
+ +## Data Sources + +All data flows through two APIs: + +- **DataIndex** (`localhost:42000/dataindex/api/v1` direct, `http://caddy/dataindex/api/v1` via greywall sandbox) — unified query interface for all entity types +- **ContactDB** (`localhost:42000/contactdb-api` direct, `http://caddy/contactdb-api/` via greywall sandbox) — people directory, resolves names/emails to contact IDs + +Connectors that feed DataIndex: `reflector` (meetings), `zulip` (chat), `mbsync_email` (email), `ics_calendar` (calendar), `hedgedoc` (documents), `browser_history` (web pages), `babelfish` (translations). diff --git a/docs/contactdb-api.md b/docs/contactdb-api.md index b8a3935..93b7102 100644 --- a/docs/contactdb-api.md +++ b/docs/contactdb-api.md @@ -2,7 +2,7 @@ ContactDB is the people directory. It stores contacts, their platform identities, relationships, notes, and links. Every person across all data sources resolves to a single ContactDB `contact_id`. -**Base URL:** `http://localhost:42000/contactdb-api` (via Caddy) or `http://localhost:42800` (direct) +**Base URL:** `http://localhost:42000/contactdb-api/` (direct) or `http://caddy/contactdb-api/` (via greywall sandbox) ## Core Entities diff --git a/docs/dataindex-api.md b/docs/dataindex-api.md index 534ccd9..aa946f8 100644 --- a/docs/dataindex-api.md +++ b/docs/dataindex-api.md @@ -2,7 +2,7 @@ DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an **entity** with a common base structure plus type-specific fields. -**Base URL:** `http://localhost:42000/dataindex/api/v1` (via Caddy) or `http://localhost:42180/api/v1` (direct) +**Base URL:** `http://localhost:42000/dataindex/api/v1/` (direct) or `http://caddy/dataindex/api/v1/` (via greywall sandbox) ## Entity Types
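The pagination contract is the same for every entity type: `GET /query` returns `items` plus a `total` count, so callers loop with `limit`/`offset` until the total is exhausted. A minimal sketch (assumptions: the direct base URL above, and that connector names such as `mbsync_email` are valid `connector_ids` values; `fetch_all` is an illustrative helper, not part of the repo):

```python
from typing import Any

# Direct base URL; inside the greywall sandbox use "http://caddy/dataindex/api/v1".
DATAINDEX = "http://localhost:42000/dataindex/api/v1"

def fetch_all(client: Any, params: dict, page_size: int = 50) -> list:
    """Page through GET /query until all `total` matching entities are collected."""
    items: list = []
    offset = 0
    while True:
        resp = client.get(
            f"{DATAINDEX}/query",
            params={**params, "limit": page_size, "offset": offset},
        )
        resp.raise_for_status()
        data = resp.json()
        items.extend(data["items"])
        offset += page_size
        if offset >= data["total"]:
            break
    return items

# Usage (requires httpx):
#   import httpx
#   with httpx.Client(timeout=30) as client:
#       emails = fetch_all(client, {
#           "entity_types": "email",
#           "connector_ids": "mbsync_email",
#           "date_from": "2026-01-01T00:00:00Z",
#       })
```

The helper takes any client object with a `get` method, so notebooks can reuse one shared `httpx.Client` across cells.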