
# DataIndex API Reference

DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an **entity** with a common base structure plus type-specific fields.

**Base URL:** `http://localhost:42000/dataindex/api/v1` (via Caddy) or `http://localhost:42180/api/v1` (direct)

## Entity Types

All entities share these base fields:

| Field                | Type        | Description                                 |
|----------------------|-------------|---------------------------------------------|
| `id`                 | string      | Format: `connector_name:native_id`          |
| `entity_type`        | string      | One of the types below                      |
| `timestamp`          | datetime    | When the entity occurred                    |
| `contact_ids`        | string[]    | ContactDB IDs of people involved            |
| `connector_id`       | string      | Which connector produced this               |
| `title`              | string?     | Display title                               |
| `parent_id`          | string?     | Parent entity (e.g., thread for a message)  |
| `raw_data`           | dict        | Original source data (excluded by default)  |

### `calendar_event`

From ICS calendar feeds.

| Field                 | Type        | Description                    |
|-----------------------|-------------|--------------------------------|
| `start_time`          | datetime?   | Event start                    |
| `end_time`            | datetime?   | Event end                      |
| `all_day`             | bool        | All-day event flag             |
| `description`         | string?     | Event description              |
| `location`            | string?     | Event location                 |
| `attendees`           | dict[]      | Attendee list                  |
| `organizer_contact_id`| string?     | ContactDB ID of organizer      |
| `status`              | string?     | Event status                   |
| `calendar_name`       | string?     | Source calendar name           |
| `meeting_url`         | string?     | Video call link                |

### `meeting`

From Reflector (recorded meetings with transcripts).

| Field              | Type                | Description                       |
|--------------------|---------------------|-----------------------------------|
| `start_time`       | datetime?           | Meeting start                     |
| `end_time`         | datetime?           | Meeting end                       |
| `participants`     | MeetingParticipant[]| People in the meeting             |
| `meeting_platform` | string?             | Platform (e.g., "jitsi")          |
| `transcript`       | string?             | Full transcript text              |
| `summary`          | string?             | AI-generated summary              |
| `meeting_url`      | string?             | Meeting link                      |
| `recording_url`    | string?             | Recording link                    |
| `location`         | string?             | Physical location                 |
| `room_name`        | string?             | Virtual room name; may also indicate the meeting location (see note below) |

**MeetingParticipant** fields: `display_name`, `contact_id?`, `platform_user_id?`, `email?`, `speaker?`

> **`room_name` as location indicator:** The `room_name` field often encodes where the meeting took place (e.g., a Jitsi room name like `standup-office-bogota`). Use it to infer the meeting location when `location` is not set.

> **Participant and contact coverage is incomplete.** Meeting data comes from Reflector, which only tracks users who are logged into the Reflector platform. This means:
>
> - **`contact_ids`** only contains ContactDB IDs for Reflector-logged participants who were matched to a known contact. It will often be a **subset** of the actual attendees — do not assume it is the full list.
> - **`participants`** is more complete than `contact_ids` but still only includes people detected by Reflector. Attendees who have no Reflector account or could not be identified may be missing from this list entirely.
> - **`contact_id` within a participant** may be `null` if the person was detected but couldn't be matched to a ContactDB entry.
>
> **Consequence for queries:** Filtering meetings by `contact_ids` will **miss meetings** where the person attended but wasn't logged into Reflector or wasn't resolved. To get better coverage, combine multiple strategies:
>
> 1. Filter by `contact_ids` for resolved participants
> 2. Search `participants[].display_name` client-side for name matches
> 3. Use `POST /search` with the person's name to search meeting transcripts and summaries

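Strategies 1 and 2 above can be combined client-side once meeting entities have been fetched. The following is a minimal sketch, not part of DataIndex itself: `meetings_involving` is a hypothetical helper, and it assumes each meeting is a plain dict carrying the `contact_ids` and `participants` fields described in the table above.

```python
def meetings_involving(meetings, contact_id, name):
    """Return meetings where the person appears by ContactDB ID or by display name."""
    needle = name.casefold()
    matched = []
    for meeting in meetings:
        # Strategy 1: resolved ContactDB IDs on the entity itself.
        by_id = contact_id in (meeting.get("contact_ids") or [])
        # Strategy 2: display-name match, which also catches unresolved participants.
        by_name = any(
            needle in (participant.get("display_name") or "").casefold()
            for participant in (meeting.get("participants") or [])
        )
        if by_id or by_name:
            matched.append(meeting)
    return matched


# Made-up sample data for illustration only.
sample_meetings = [
    {"id": "reflector:a", "contact_ids": ["42"], "participants": []},
    {
        "id": "reflector:b",
        "contact_ids": [],
        "participants": [{"display_name": "Alice Example", "contact_id": None}],
    },
    {
        "id": "reflector:c",
        "contact_ids": ["7"],
        "participants": [{"display_name": "Bob", "contact_id": "7"}],
    },
]

hits = meetings_involving(sample_meetings, contact_id="42", name="alice")
print([meeting["id"] for meeting in hits])  # ['reflector:a', 'reflector:b']
```

Substring matching on names can produce false positives (for example, two people sharing a first name), so results from the name strategy are best treated as candidates to review.
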
### `email`

From mbsync email sync.

| Field              | Type      | Description                          |
|--------------------|-----------|--------------------------------------|
| `thread_id`        | string?   | Email thread grouping                |
| `text_content`     | string?   | Plain text body                      |
| `html_content`     | string?   | HTML body                            |
| `snippet`          | string?   | Preview snippet                      |
| `from_contact_id`  | string?   | Sender's ContactDB ID               |
| `to_contact_ids`   | string[]  | Recipient ContactDB IDs             |
| `cc_contact_ids`   | string[]  | CC recipient ContactDB IDs          |
| `has_attachments`  | bool      | Has attachments flag                 |
| `attachments`      | dict[]    | Attachment metadata                  |

### `conversation`

A Zulip stream/channel.

| Field              | Type    | Description                            |
|--------------------|---------|----------------------------------------|
| `recent_messages`  | dict[]  | Recent messages in the conversation    |

### `conversation_message`

A single message in a Zulip conversation.

| Field                   | Type      | Description                       |
|-------------------------|-----------|-----------------------------------|
| `message`               | string?   | Message text content              |
| `mentioned_contact_ids` | string[]  | ContactDB IDs of mentioned people |

### `threaded_conversation`

A Zulip topic thread (group of messages under a topic).

| Field              | Type    | Description                            |
|--------------------|---------|----------------------------------------|
| `recent_messages`  | dict[]  | Recent messages in the thread          |

### `document`

From HedgeDoc, API ingestion, or other document sources.

| Field          | Type      | Description                  |
|----------------|-----------|------------------------------|
| `content`      | string?   | Document body text           |
| `description`  | string?   | Document description         |
| `mimetype`     | string?   | MIME type                    |
| `url`          | string?   | Source URL                   |
| `revision_id`  | string?   | Revision identifier          |

### `webpage`

From browser history extension.

| Field          | Type      | Description                  |
|----------------|-----------|------------------------------|
| `url`          | string    | Page URL                     |
| `visit_time`   | datetime  | When visited                 |
| `text_content` | string?   | Page text content            |

## REST Endpoints

### GET `/api/v1/query` — Exhaustive Filtered Enumeration

Use when you need **all** entities matching specific criteria. Supports pagination.

**When to use:** "List all meetings since January", "Get all emails from Alice", "Count calendar events this week"

**Query parameters:**

| Parameter        | Type          | Description                                    |
|------------------|---------------|------------------------------------------------|
| `entity_types`   | string (repeat) | Filter by type — repeat param for multiple: `?entity_types=email&entity_types=meeting` |
| `contact_ids`    | string        | Comma-separated ContactDB IDs: `"1,42"`        |
| `connector_ids`  | string        | Comma-separated connector IDs: `"zulip,reflector"` |
| `date_from`      | string        | ISO datetime lower bound (UTC if no timezone)  |
| `date_to`        | string        | ISO datetime upper bound                       |
| `search`         | string?       | Text filter on content fields                  |
| `parent_id`      | string?       | Filter by parent entity                        |
| `thread_id`      | string?       | Filter emails by thread ID                     |
| `room_name`      | string?       | Filter meetings by room name                   |
| `limit`          | int           | Max results per page (default 50)              |
| `offset`         | int           | Pagination offset (default 0)                  |
| `sort_by`        | string        | `"timestamp"` (default), `"title"`, `"contact_activity"`, etc. |
| `sort_order`     | string        | `"desc"` (default) or `"asc"`                  |
| `include_raw_data`| bool         | Include raw_data field (default false)         |

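The two list conventions in the table (repeated `entity_types`, comma-separated `contact_ids` and `connector_ids`) are easy to mix up. As a rough sketch using only the Python standard library, a URL could be assembled as below; `build_query_url` is a hypothetical helper, and the direct base URL is taken from the top of this page.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:42180/api/v1"  # direct base URL listed at the top of this page


def build_query_url(entity_types=(), contact_ids=(), connector_ids=(), **extra):
    """Build a GET /query URL following the parameter conventions in the table above."""
    # entity_types is sent as a repeated parameter, one pair per type.
    params = [("entity_types", entity_type) for entity_type in entity_types]
    # contact_ids and connector_ids are each sent as one comma-separated value.
    if contact_ids:
        params.append(("contact_ids", ",".join(contact_ids)))
    if connector_ids:
        params.append(("connector_ids", ",".join(connector_ids)))
    # Remaining scalar parameters (limit, offset, sort_order, ...) pass through unchanged.
    params.extend((key, value) for key, value in extra.items() if value is not None)
    return f"{BASE_URL}/query?{urlencode(params)}"


url = build_query_url(
    entity_types=["email", "meeting"],
    contact_ids=["1", "42"],
    limit=50,
    sort_order="asc",
)
print(url)
```
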
**Response format:**

```json
{
  "items": [...],
  "total": 152,
  "page": 1,
  "size": 50,
  "pages": 4
}
```

**Pagination:** increase `offset` by `limit` on each request until `offset >= total`. See [notebook-patterns.md] for a reusable helper.

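The offset loop can be sketched independently of any HTTP client. In this illustration, `fetch_all` is a hypothetical helper (not the one from notebook-patterns.md) that takes a `fetch_page(limit, offset)` callable returning the response shape shown above; the stand-in fetcher simulates 152 items so the loop can be run without a server.

```python
def fetch_all(fetch_page, limit=50):
    """Collect every item by advancing offset until it reaches total."""
    items = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        items.extend(page["items"])
        offset += limit
        # Stop at the documented condition; the empty-page check guards
        # against looping forever if total changes between requests.
        if offset >= page["total"] or not page["items"]:
            break
    return items


def fake_fetch_page(limit, offset):
    """Stand-in for a real GET /query call, for illustration only."""
    data = list(range(152))
    return {"items": data[offset:offset + limit], "total": len(data)}


all_items = fetch_all(fake_fetch_page)
print(len(all_items))  # 152
```

In real use, `fetch_page` would wrap a GET request to `/query` with the same filters on every call and only `offset` changing.
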
### POST `/api/v1/search` — Semantic Search

Use when you need **relevant** results for a natural-language question. Returns ranked text chunks. There is no pagination; raise `limit` to get more results.

**When to use:** "What was discussed about the product roadmap?", "Find conversations about hiring"

**Request body (JSON):**

```json
{
  "search_text": "product roadmap decisions",
  "entity_types": ["meeting", "threaded_conversation"],
  "contact_ids": ["1", "42"],
  "date_from": "2025-01-01T00:00:00Z",
  "date_to": "2025-06-01T00:00:00Z",
  "connector_ids": ["reflector", "zulip"],
  "limit": 20
}
```

**Response:** `{results: [...chunks], total_count}` — each chunk has `entity_ids`, `entity_type`, `connector_id`, `content`, `timestamp`.

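As a hedged sketch of how a client might call this endpoint, the snippet below builds the POST request with the standard library and collapses the returned chunks into a unique list of entity IDs. `build_search_request` and `unique_entity_ids` are hypothetical helpers, and the sample response is made up to match the shape described above.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:42180/api/v1"  # direct base URL listed at the top of this page


def build_search_request(search_text, limit=20, **filters):
    """Build (but do not send) a POST /search request."""
    body = {"search_text": search_text, "limit": limit}
    # Optional filters: entity_types, contact_ids, connector_ids, date_from, date_to.
    body.update({key: value for key, value in filters.items() if value is not None})
    return Request(
        f"{BASE_URL}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unique_entity_ids(response):
    """Return entity IDs from ranked chunks, first occurrence first."""
    seen = []
    for chunk in response.get("results", []):
        for entity_id in chunk.get("entity_ids", []):
            if entity_id not in seen:
                seen.append(entity_id)
    return seen


request = build_search_request(
    "product roadmap decisions",
    entity_types=["meeting", "threaded_conversation"],
    contact_ids=["1", "42"],
)

# Made-up response in the documented shape, for illustration only.
sample_response = {
    "results": [
        {"entity_ids": ["reflector:m1"], "entity_type": "meeting", "content": "..."},
        {"entity_ids": ["zulip:t9", "reflector:m1"], "entity_type": "threaded_conversation", "content": "..."},
    ],
    "total_count": 2,
}
ids = unique_entity_ids(sample_response)
print(ids)  # ['reflector:m1', 'zulip:t9']
```

The request can be sent with `urllib.request.urlopen(request)`; the resulting IDs can then be passed to `GET /entities/{id}` to load full entities.
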
### GET `/api/v1/entities/{id}` — Get Entity by ID

Retrieve full details of a single entity. The `{id}` path segment is the entity's `id`, in the format `connector_name:native_id`.

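Because the ID contains a colon, it may be worth percent-encoding it before placing it in the path. This is a sketch under two assumptions that the reference does not confirm: that the server decodes percent-encoded path segments, and that the connector name is everything before the first colon. Both helper names are hypothetical.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:42180/api/v1"  # direct base URL listed at the top of this page


def split_entity_id(entity_id):
    """Split 'connector_name:native_id' at the first colon (assumed convention)."""
    connector, separator, native_id = entity_id.partition(":")
    if not separator or not connector or not native_id:
        raise ValueError(f"not a connector_name:native_id value: {entity_id!r}")
    return connector, native_id


def entity_url(entity_id):
    """Build the GET /entities/{id} URL with the ID percent-encoded."""
    return f"{BASE_URL}/entities/{quote(entity_id, safe='')}"


print(split_entity_id("reflector:abc123"))  # ('reflector', 'abc123')
print(entity_url("reflector:abc123"))
```
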
### GET `/api/v1/connectors/status` — Connector Status

Get sync status for all connectors (last sync time, entity count, health).

## Common Query Recipes

| Question                              | Filters (`entity_types` + `connector_ids`, plus extras) |
|---------------------------------------|------------------------------------------|
| Meetings I attended                   | `meeting` + `reflector`, with your contact_id |
| Upcoming calendar events              | `calendar_event` + `ics_calendar`, date_from=now |
| Emails from someone                   | `email` + `mbsync_email`, with their contact_id |
| Zulip threads about a topic           | `threaded_conversation` + `zulip`, search="topic" |
| All documents                         | `document` + `hedgedoc`                  |
| Chat messages mentioning someone      | `conversation_message` + `zulip`, with contact_id |
| What was discussed about X?           | Use `POST /search` with `search_text`    |

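As one worked instance of the recipes above, the "Upcoming calendar events" row could translate into `GET /query` parameters roughly as follows. `upcoming_events_params` is a hypothetical helper; sorting ascending so the soonest event comes first is a choice made here, not something the table prescribes.

```python
from datetime import datetime, timezone


def upcoming_events_params(now=None):
    """Query parameters for calendar events starting from the current moment."""
    # Use an explicit UTC timestamp so the server does not have to guess the timezone.
    now = now or datetime.now(timezone.utc)
    return {
        "entity_types": ["calendar_event"],
        "connector_ids": "ics_calendar",
        "date_from": now.isoformat(),
        "sort_by": "timestamp",
        "sort_order": "asc",  # soonest event first
    }


params = upcoming_events_params(now=datetime(2025, 1, 1, tzinfo=timezone.utc))
print(params["date_from"])  # 2025-01-01T00:00:00+00:00
```
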
[notebook-patterns.md]: ./notebook-patterns.md