# DataIndex API Reference

DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an **entity** with a common base structure plus type-specific fields.

**Base URL:** `http://localhost:42000/dataindex/api/v1` (via Caddy) or `http://localhost:42180/api/v1` (direct)

## Entity Types

All entities share these base fields:

| Field          | Type     | Description                                |
|----------------|----------|--------------------------------------------|
| `id`           | string   | Format: `connector_name:native_id`         |
| `entity_type`  | string   | One of the types below                     |
| `timestamp`    | datetime | When the entity occurred                   |
| `contact_ids`  | string[] | ContactDB IDs of people involved           |
| `connector_id` | string   | Which connector produced this              |
| `title`        | string?  | Display title                              |
| `parent_id`    | string?  | Parent entity (e.g., thread for a message) |
| `raw_data`     | dict     | Original source data (excluded by default) |

### `calendar_event`

From ICS calendar feeds.

| Field                  | Type      | Description               |
|------------------------|-----------|---------------------------|
| `start_time`           | datetime? | Event start               |
| `end_time`             | datetime? | Event end                 |
| `all_day`              | bool      | All-day event flag        |
| `description`          | string?   | Event description         |
| `location`             | string?   | Event location            |
| `attendees`            | dict[]    | Attendee list             |
| `organizer_contact_id` | string?   | ContactDB ID of organizer |
| `status`               | string?   | Event status              |
| `calendar_name`        | string?   | Source calendar name      |
| `meeting_url`          | string?   | Video call link           |

### `meeting`

From Reflector (recorded meetings with transcripts).

| Field              | Type                 | Description                                                            |
|--------------------|----------------------|------------------------------------------------------------------------|
| `start_time`       | datetime?            | Meeting start                                                          |
| `end_time`         | datetime?            | Meeting end                                                            |
| `participants`     | MeetingParticipant[] | People in the meeting                                                  |
| `meeting_platform` | string?              | Platform (e.g., "jitsi")                                               |
| `transcript`       | string?              | Full transcript text                                                   |
| `summary`          | string?              | AI-generated summary                                                   |
| `meeting_url`      | string?              | Meeting link                                                           |
| `recording_url`    | string?              | Recording link                                                         |
| `location`         | string?              | Physical location                                                      |
| `room_name`        | string?              | Virtual room name; often indicates the meeting location (see note below) |

**MeetingParticipant** fields: `display_name`, `contact_id?`, `platform_user_id?`, `email?`, `speaker?`

> **`room_name` as location indicator:** The `room_name` field often encodes where the meeting took place (e.g., a Jitsi room name like `standup-office-bogota`). Use it to infer the meeting location when `location` is not set.

> **Participant and contact coverage is incomplete.** Meeting data comes from Reflector, which only tracks users who are logged into the Reflector platform. This means:
>
> - **`contact_ids`** only contains ContactDB IDs for Reflector-logged participants who were matched to a known contact. It will often be a **subset** of the actual attendees; do not assume it is the full list.
> - **`participants`** is more complete than `contact_ids` but still only includes people detected by Reflector. Not all participants have accounts or could be identified, so some attendees may be entirely absent from this list.
> - **`contact_id` within a participant** may be `null` if the person was detected but couldn't be matched to a ContactDB entry.
>
> **Consequence for queries:** Filtering meetings by `contact_ids` will **miss meetings** where the person attended but wasn't logged into Reflector or wasn't resolved. To get better coverage, combine multiple strategies:
>
> 1. Filter by `contact_ids` for resolved participants
> 2. Search `participants[].display_name` client-side for name matches
> 3. Use `POST /search` with the person's name to search meeting transcripts and summaries

### `email`

From mbsync email sync.

| Field             | Type     | Description                |
|-------------------|----------|----------------------------|
| `thread_id`       | string?  | Email thread grouping      |
| `text_content`    | string?  | Plain text body            |
| `html_content`    | string?  | HTML body                  |
| `snippet`         | string?  | Preview snippet            |
| `from_contact_id` | string?  | Sender's ContactDB ID      |
| `to_contact_ids`  | string[] | Recipient ContactDB IDs    |
| `cc_contact_ids`  | string[] | CC recipient ContactDB IDs |
| `has_attachments` | bool     | Has attachments flag       |
| `attachments`     | dict[]   | Attachment metadata        |

### `conversation`

A Zulip stream/channel.

| Field             | Type   | Description                         |
|-------------------|--------|-------------------------------------|
| `recent_messages` | dict[] | Recent messages in the conversation |

### `conversation_message`

A single message in a Zulip conversation.

| Field                   | Type     | Description                       |
|-------------------------|----------|-----------------------------------|
| `message`               | string?  | Message text content              |
| `mentioned_contact_ids` | string[] | ContactDB IDs of mentioned people |

### `threaded_conversation`

A Zulip topic thread (group of messages under a topic).

| Field             | Type   | Description                   |
|-------------------|--------|-------------------------------|
| `recent_messages` | dict[] | Recent messages in the thread |

### `document`

From HedgeDoc, API ingestion, or other document sources.

| Field         | Type    | Description          |
|---------------|---------|----------------------|
| `content`     | string? | Document body text   |
| `description` | string? | Document description |
| `mimetype`    | string? | MIME type            |
| `url`         | string? | Source URL           |
| `revision_id` | string? | Revision identifier  |

### `webpage`

From browser history extension.

| Field          | Type     | Description       |
|----------------|----------|-------------------|
| `url`          | string   | Page URL          |
| `visit_time`   | datetime | When visited      |
| `text_content` | string?  | Page text content |

## REST Endpoints

Endpoint paths are shown with the `/api/v1` prefix that the base URLs above already include, so relative to a base URL the endpoints are `/query`, `/search`, `/entities/{id}`, and `/connectors/status`.

### GET `/api/v1/query`: Exhaustive Filtered Enumeration

Use when you need **all** entities matching specific criteria. Supports pagination.

**When to use:** "List all meetings since January", "Get all emails from Alice", "Count calendar events this week"

**Query parameters:**

| Parameter          | Type            | Description                                                                            |
|--------------------|-----------------|----------------------------------------------------------------------------------------|
| `entity_types`     | string (repeat) | Filter by type; repeat the parameter for multiple: `?entity_types=email&entity_types=meeting` |
| `contact_ids`      | string          | Comma-separated ContactDB IDs: `"1,42"`                                                |
| `connector_ids`    | string          | Comma-separated connector IDs: `"zulip,reflector"`                                     |
| `date_from`        | string          | ISO datetime lower bound (UTC if no timezone)                                          |
| `date_to`          | string          | ISO datetime upper bound                                                               |
| `search`           | string?         | Text filter on content fields                                                          |
| `parent_id`        | string?         | Filter by parent entity                                                                |
| `thread_id`        | string?         | Filter emails by thread ID                                                             |
| `room_name`        | string?         | Filter meetings by room name                                                           |
| `limit`            | int             | Max results per page (default 50)                                                      |
| `offset`           | int             | Pagination offset (default 0)                                                          |
| `sort_by`          | string          | `"timestamp"` (default), `"title"`, `"contact_activity"`, etc.                         |
| `sort_order`       | string          | `"desc"` (default) or `"asc"`                                                          |
| `include_raw_data` | bool            | Include the `raw_data` field (default false)                                           |

**Response format:**

```json
{
  "items": [...],
  "total": 152,
  "page": 1,
  "size": 50,
  "pages": 4
}
```

**Pagination:** request successive pages, increasing `offset` by `limit` each time, until `offset >= total`. See [notebook-patterns.md] for a reusable helper.

### POST `/api/v1/search`: Semantic Search

Use when you need **relevant** results for a natural-language question. Returns ranked text chunks. There is no pagination; set a higher `limit` instead.

**When to use:** "What was discussed about the product roadmap?", "Find conversations about hiring"

**Request body (JSON):**

```json
{
  "search_text": "product roadmap decisions",
  "entity_types": ["meeting", "threaded_conversation"],
  "contact_ids": ["1", "42"],
  "date_from": "2025-01-01T00:00:00Z",
  "date_to": "2025-06-01T00:00:00Z",
  "connector_ids": ["reflector", "zulip"],
  "limit": 20
}
```

**Response:** `{results: [...chunks], total_count}`. Each chunk has `entity_ids`, `entity_type`, `connector_id`, `content`, and `timestamp`.

### GET `/api/v1/entities/{id}`: Get Entity by ID

Retrieve full details of a single entity. The `id` format is `connector_name:native_id`.

### GET `/api/v1/connectors/status`: Connector Status

Get sync status for all connectors (last sync time, entity count, health).

## Common Query Recipes

| Question                         | entity_type + connector_id                          |
|----------------------------------|-----------------------------------------------------|
| Meetings I attended              | `meeting` + `reflector`, with your contact_id       |
| Upcoming calendar events         | `calendar_event` + `ics_calendar`, date_from=now    |
| Emails from someone              | `email` + `mbsync_email`, with their contact_id     |
| Zulip threads about a topic      | `threaded_conversation` + `zulip`, search="topic"   |
| All documents                    | `document` + `hedgedoc`                             |
| Chat messages mentioning someone | `conversation_message` + `zulip`, with contact_id   |
| What was discussed about X?      | Use `POST /search` with `search_text`               |

[notebook-patterns.md]: ./notebook-patterns.md
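The offset-based pagination described for `GET /query` can be sketched as follows. This is a minimal illustration, not the helper from notebook-patterns.md: the function names `build_query_params`, `http_fetch`, and `query_all` are hypothetical, and the sketch assumes that `total` counts all matching entities and that `limit` and `offset` behave as listed in the query parameter table. The page fetcher is passed in as a callable so the loop can be exercised without a running server.

```python
import json
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

# Direct base URL from this reference; swap in the Caddy URL if needed.
BASE_URL = "http://localhost:42180/api/v1"

Params = List[Tuple[str, str]]


def build_query_params(
    entity_types: Optional[List[str]] = None,
    contact_ids: Optional[List[str]] = None,
    connector_ids: Optional[List[str]] = None,
    date_from: Optional[str] = None,
    limit: int = 50,
    offset: int = 0,
) -> Params:
    """Build query parameters as (name, value) pairs (hypothetical helper)."""
    params: Params = []
    # entity_types is repeated once per type.
    for entity_type in entity_types or []:
        params.append(("entity_types", entity_type))
    # contact_ids and connector_ids are single comma-separated strings.
    if contact_ids:
        params.append(("contact_ids", ",".join(contact_ids)))
    if connector_ids:
        params.append(("connector_ids", ",".join(connector_ids)))
    if date_from:
        params.append(("date_from", date_from))
    params.append(("limit", str(limit)))
    params.append(("offset", str(offset)))
    return params


def http_fetch(params: Params) -> Dict[str, Any]:
    """Fetch one page from GET /query and decode the JSON response."""
    with urlopen(f"{BASE_URL}/query?{urlencode(params)}") as resp:
        return json.load(resp)


def query_all(
    fetch_page: Callable[[Params], Dict[str, Any]],
    page_size: int = 50,
    **filters: Any,
) -> Iterator[Dict[str, Any]]:
    """Yield every matching entity, advancing offset until offset >= total."""
    offset = 0
    while True:
        page = fetch_page(
            build_query_params(limit=page_size, offset=offset, **filters)
        )
        yield from page["items"]
        offset += page_size
        # Stop at the documented condition; an empty page is an extra guard
        # against looping forever if total changes between requests.
        if offset >= page["total"] or not page["items"]:
            break
```

Against a live instance, a call such as `list(query_all(http_fetch, entity_types=["meeting"], connector_ids=["reflector"]))` would collect all Reflector meetings. Because `contact_ids` coverage for meetings is incomplete, results filtered that way are best combined with the name-based strategies listed in the `meeting` section.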