internalai-agent/docs/dataindex-api.md
2026-02-10 18:19:30 -06:00

# DataIndex API Reference
DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an **entity** with a common base structure plus type-specific fields.
**Base URL:** `http://localhost:42000/dataindex/api/v1` (via Caddy) or `http://localhost:42180/api/v1` (direct)
## Entity Types
All entities share these base fields:
| Field | Type | Description |
|----------------------|-------------|---------------------------------------------|
| `id` | string | Format: `connector_name:native_id` |
| `entity_type` | string | One of the types below |
| `timestamp` | datetime | When the entity occurred |
| `contact_ids` | string[] | ContactDB IDs of people involved |
| `connector_id` | string | Which connector produced this |
| `title` | string? | Display title |
| `parent_id` | string? | Parent entity (e.g., thread for a message) |
| `raw_data` | dict | Original source data (excluded by default) |
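
Because every `id` follows the `connector_name:native_id` format, it can be split client-side to recover the connector. The sketch below is illustrative only: `EntityId` and `split_entity_id` are hypothetical names, the sample IDs are made up, and it assumes connector names never contain a colon.

```python
from typing import NamedTuple


class EntityId(NamedTuple):
    connector_name: str
    native_id: str


def split_entity_id(entity_id: str) -> EntityId:
    """Split 'connector_name:native_id' into its two parts."""
    # Assumption: connector names never contain ':', so only the first colon
    # separates the parts and any later colons belong to the native ID.
    connector_name, sep, native_id = entity_id.partition(":")
    if not sep or not connector_name or not native_id:
        raise ValueError(f"not a connector_name:native_id pair: {entity_id!r}")
    return EntityId(connector_name, native_id)


# Hypothetical IDs, for illustration only.
print(split_entity_id("zulip:12345").connector_name)
print(split_entity_id("hedgedoc:a:b").native_id)
```
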
### `calendar_event`
From ICS calendar feeds.
| Field | Type | Description |
|-----------------------|-------------|--------------------------------|
| `start_time` | datetime? | Event start |
| `end_time` | datetime? | Event end |
| `all_day` | bool | All-day event flag |
| `description` | string? | Event description |
| `location` | string? | Event location |
| `attendees` | dict[] | Attendee list |
| `organizer_contact_id`| string? | ContactDB ID of organizer |
| `status` | string? | Event status |
| `calendar_name` | string? | Source calendar name |
| `meeting_url` | string? | Video call link |
### `meeting`
From Reflector (recorded meetings with transcripts).
| Field | Type | Description |
|--------------------|---------------------|-----------------------------------|
| `start_time` | datetime? | Meeting start |
| `end_time` | datetime? | Meeting end |
| `participants` | MeetingParticipant[]| People in the meeting |
| `meeting_platform` | string? | Platform (e.g., "jitsi") |
| `transcript` | string? | Full transcript text |
| `summary` | string? | AI-generated summary |
| `meeting_url` | string? | Meeting link |
| `recording_url` | string? | Recording link |
| `location` | string? | Physical location |
| `room_name` | string? | Virtual room name (also indicates meeting location — see below) |
**MeetingParticipant** fields: `display_name`, `contact_id?`, `platform_user_id?`, `email?`, `speaker?`
> **`room_name` as location indicator:** The `room_name` field often encodes where the meeting took place (e.g., a Jitsi room name like `standup-office-bogota`). Use it to infer the meeting location when `location` is not set.
> **Participant and contact coverage is incomplete.** Meeting data comes from Reflector, which only tracks users who are logged into the Reflector platform. This means:
>
> - **`contact_ids`** only contains ContactDB IDs for Reflector-logged participants who were matched to a known contact. It will often be a **subset** of the actual attendees — do not assume it is the full list.
> - **`participants`** is more complete than `contact_ids` but still only includes people detected by Reflector. Not all participants have accounts or could be identified — some attendees may be entirely absent from this list.
> - **`contact_id` within a participant** may be `null` if the person was detected but couldn't be matched to a ContactDB entry.
>
> **Consequence for queries:** Filtering meetings by `contact_ids` will **miss meetings** where the person attended but wasn't logged into Reflector or wasn't resolved. To get better coverage, combine multiple strategies:
>
> 1. Filter by `contact_ids` for resolved participants
> 2. Search `participants[].display_name` client-side for name matches
> 3. Use `POST /search` with the person's name to search meeting transcripts and summaries
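
The first two strategies above can be combined client-side once the meetings are fetched. This is a minimal sketch, not part of the API: `participant_name_matches` and `merge_by_id` are hypothetical helpers, the sample meetings are invented, and substring matching on `display_name` is an assumption that may produce false positives for short names.

```python
def participant_name_matches(meeting: dict, name: str) -> bool:
    """Strategy 2: case-insensitive substring match on participants[].display_name."""
    needle = name.casefold()
    participants = meeting.get("participants") or []
    return any(needle in (p.get("display_name") or "").casefold() for p in participants)


def merge_by_id(*result_sets: list) -> list:
    """Union several entity lists, keeping the first copy of each `id`."""
    merged: dict = {}
    for items in result_sets:
        for item in items:
            merged.setdefault(item["id"], item)
    return list(merged.values())


# Hypothetical sample data shaped like the `meeting` entities described here.
by_contact_id = [
    {"id": "reflector:m1", "contact_ids": ["42"], "participants": []},
]
all_meetings = [
    {"id": "reflector:m1", "contact_ids": ["42"], "participants": []},
    {"id": "reflector:m2", "contact_ids": [],
     "participants": [{"display_name": "Alice Example", "contact_id": None}]},
    {"id": "reflector:m3", "contact_ids": [],
     "participants": [{"display_name": "Bob", "contact_id": None}]},
]

by_name = [m for m in all_meetings if participant_name_matches(m, "alice")]
found = merge_by_id(by_contact_id, by_name)
print([m["id"] for m in found])  # ['reflector:m1', 'reflector:m2']
```

Results from the third strategy are text chunks rather than entities, so they would need to be mapped back through each chunk's `entity_ids` before merging.
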
### `email`
From mbsync email sync.
| Field | Type | Description |
|--------------------|-----------|--------------------------------------|
| `thread_id` | string? | Email thread grouping |
| `text_content` | string? | Plain text body |
| `html_content` | string? | HTML body |
| `snippet` | string? | Preview snippet |
| `from_contact_id` | string? | Sender's ContactDB ID |
| `to_contact_ids` | string[] | Recipient ContactDB IDs |
| `cc_contact_ids` | string[] | CC recipient ContactDB IDs |
| `has_attachments` | bool | Has attachments flag |
| `attachments` | dict[] | Attachment metadata |
### `conversation`
A Zulip stream/channel.
| Field | Type | Description |
|--------------------|---------|----------------------------------------|
| `recent_messages` | dict[] | Recent messages in the conversation |
### `conversation_message`
A single message in a Zulip conversation.
| Field | Type | Description |
|-------------------------|-----------|-----------------------------------|
| `message` | string? | Message text content |
| `mentioned_contact_ids` | string[] | ContactDB IDs of mentioned people |
### `threaded_conversation`
A Zulip topic thread (group of messages under a topic).
| Field | Type | Description |
|--------------------|---------|----------------------------------------|
| `recent_messages` | dict[] | Recent messages in the thread |
### `document`
From HedgeDoc, API ingestion, or other document sources.
| Field | Type | Description |
|----------------|-----------|------------------------------|
| `content` | string? | Document body text |
| `description` | string? | Document description |
| `mimetype` | string? | MIME type |
| `url` | string? | Source URL |
| `revision_id` | string? | Revision identifier |
### `webpage`
From browser history extension.
| Field | Type | Description |
|----------------|-----------|------------------------------|
| `url` | string | Page URL |
| `visit_time` | datetime | When visited |
| `text_content` | string? | Page text content |
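## REST Endpoints
Paths below are relative to the direct server root (`http://localhost:42180`). Through Caddy, prefix them with `/dataindex`, e.g. `http://localhost:42000/dataindex/api/v1/query`.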
### GET `/api/v1/query` — Exhaustive Filtered Enumeration
Use when you need **all** entities matching specific criteria. Supports pagination.
**When to use:** "List all meetings since January", "Get all emails from Alice", "Count calendar events this week"
**Query parameters:**
| Parameter | Type | Description |
|------------------|---------------|------------------------------------------------|
| `entity_types` | string (repeat) | Filter by type — repeat param for multiple: `?entity_types=email&entity_types=meeting` |
| `contact_ids` | string | Comma-separated ContactDB IDs: `"1,42"` |
| `connector_ids` | string | Comma-separated connector IDs: `"zulip,reflector"` |
| `date_from` | string | ISO datetime lower bound (UTC if no timezone) |
| `date_to` | string | ISO datetime upper bound |
| `search` | string? | Text filter on content fields |
| `parent_id` | string? | Filter by parent entity |
| `thread_id` | string? | Filter emails by thread ID |
| `room_name` | string? | Filter meetings by room name |
| `limit` | int | Max results per page (default 50) |
| `offset` | int | Pagination offset (default 0) |
| `sort_by` | string | `"timestamp"` (default), `"title"`, `"contact_activity"`, etc. |
| `sort_order` | string | `"desc"` (default) or `"asc"` |
| `include_raw_data`| bool | Include raw_data field (default false) |
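
The parameters mix two list encodings: `entity_types` is repeated, while `contact_ids` and `connector_ids` are comma-separated strings. The sketch below shows one way to build such a URL with the standard library. `build_query_url` is a hypothetical helper, and the lowercase boolean encoding is an assumption about what the server accepts.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:42180/api/v1"  # direct URL from this document


def build_query_url(entity_types=(), contact_ids=(), connector_ids=(), **filters) -> str:
    """Build a GET /query URL from the documented parameters."""
    # entity_types is repeated once per value.
    params = [("entity_types", t) for t in entity_types]
    # contact_ids and connector_ids are single comma-separated strings.
    if contact_ids:
        params.append(("contact_ids", ",".join(str(c) for c in contact_ids)))
    if connector_ids:
        params.append(("connector_ids", ",".join(connector_ids)))
    for key, value in filters.items():
        if value is None:
            continue
        # Assumption: the server expects lowercase true/false for booleans.
        params.append((key, str(value).lower() if isinstance(value, bool) else value))
    return f"{BASE_URL}/query?{urlencode(params)}"


print(build_query_url(entity_types=["email", "meeting"], contact_ids=[1, 42], limit=10))
# http://localhost:42180/api/v1/query?entity_types=email&entity_types=meeting&contact_ids=1%2C42&limit=10
```
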
**Response format:**
```json
{
"items": [...],
"total": 152,
"page": 1,
"size": 50,
"pages": 4
}
```
**Pagination:** request successive pages by increasing `offset` by `limit` each time, and stop once `offset >= total`. See [notebook-patterns.md] for a reusable helper.
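
A minimal sketch of that offset loop follows. It is not the helper from notebook-patterns.md: `iter_all` and `fake_fetch_page` are hypothetical names, and the fetch function is injected so the example runs without a server. In practice `fetch_page` would issue `GET /api/v1/query` with the given `offset` and `limit` and return the parsed JSON.

```python
from typing import Callable, Iterator


def iter_all(fetch_page: Callable[[int, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item, calling fetch_page(offset, limit) until offset >= total."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items = page["items"]
        yield from items
        offset += limit
        # Stop at the documented condition; the empty-page check guards
        # against looping forever if `total` changes between requests.
        if not items or offset >= page["total"]:
            break


# Stand-in for an HTTP call so the sketch runs without a server.
_fake_items = [{"id": f"demo:{i}"} for i in range(120)]


def fake_fetch_page(offset: int, limit: int) -> dict:
    return {"items": _fake_items[offset:offset + limit], "total": len(_fake_items)}


print(len(list(iter_all(fake_fetch_page))))  # 120
```
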
### POST `/api/v1/search` — Semantic Search
Use when you need **relevant** results for a natural-language question. Returns ranked text chunks. No pagination — set a higher `limit` instead.
**When to use:** "What was discussed about the product roadmap?", "Find conversations about hiring"
**Request body (JSON):**
```json
{
"search_text": "product roadmap decisions",
"entity_types": ["meeting", "threaded_conversation"],
"contact_ids": ["1", "42"],
"date_from": "2025-01-01T00:00:00Z",
"date_to": "2025-06-01T00:00:00Z",
"connector_ids": ["reflector", "zulip"],
"limit": 20
}
```
**Response:** a JSON object with `results` (the list of chunks) and `total_count`. Each chunk has `entity_ids`, `entity_type`, `connector_id`, `content`, and `timestamp`.
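
The request above can be sent with the standard library alone. The following is a sketch under stated assumptions: `build_search_request`, `unique_entity_ids`, and `search` are hypothetical helpers, the `Content-Type: application/json` header is assumed to be what the server expects, and `entity_ids` is assumed to be a list on each chunk.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:42180/api/v1"  # direct URL from this document


def build_search_request(search_text: str, limit: int = 20, **filters) -> Request:
    """Build the POST /search request; `filters` are optional body fields."""
    body = {"search_text": search_text, "limit": limit}
    body.update({k: v for k, v in filters.items() if v is not None})
    return Request(
        f"{BASE_URL}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unique_entity_ids(response: dict) -> list:
    """Collect the distinct entity IDs referenced by the chunks, in rank order."""
    seen: dict = {}
    for chunk in response.get("results", []):
        for entity_id in chunk.get("entity_ids", []):
            seen.setdefault(entity_id, None)
    return list(seen)


def search(search_text: str, **kwargs) -> dict:
    """Send the request; needs a running DataIndex instance."""
    with urlopen(build_search_request(search_text, **kwargs)) as resp:
        return json.load(resp)
```

Since chunks are fragments, `unique_entity_ids` gives the list of entities to fetch in full when the surrounding context matters.
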
### GET `/api/v1/entities/{id}` — Get Entity by ID
Retrieve the full details of a single entity. The `id` has the format `connector_name:native_id`.
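
Native IDs may contain characters that are unsafe in a URL path, so it can help to percent-encode the ID when building the request URL. A small sketch, where `entity_url` is a hypothetical helper and the sample IDs are made up; whether the server accepts percent-encoded IDs is an assumption to verify against a running instance.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:42180/api/v1"  # direct URL from this document


def entity_url(entity_id: str) -> str:
    """Build the GET /entities/{id} URL for one entity."""
    # Keep ':' readable; percent-encode everything else that is unsafe
    # in a single path segment (including '/').
    return f"{BASE_URL}/entities/{quote(entity_id, safe=':')}"


print(entity_url("reflector:abc123"))
# http://localhost:42180/api/v1/entities/reflector:abc123
```
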
### GET `/api/v1/connectors/status` — Connector Status
Get sync status for all connectors (last sync time, entity count, health).
## Common Query Recipes
| Question | `entity_types` + `connector_ids` (plus filters) |
|---------------------------------------|------------------------------------------|
| Meetings I attended | `meeting` + `reflector`, with your contact_id |
| Upcoming calendar events | `calendar_event` + `ics_calendar`, date_from=now |
| Emails from someone | `email` + `mbsync_email`, with their contact_id |
| Zulip threads about a topic | `threaded_conversation` + `zulip`, search="topic" |
| All documents | `document` + `hedgedoc` |
| Chat messages mentioning someone | `conversation_message` + `zulip`, with contact_id |
| What was discussed about X? | Use `POST /search` with `search_text` |
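
The `GET /query` recipes in this table translate directly into parameter sets. The sketch below is one possible mapping: `recipe_params` and the recipe keys are hypothetical names, and "now" is rendered as an ISO 8601 UTC timestamp, which is an assumption about the accepted format.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def recipe_params(recipe: str, contact_id: Optional[str] = None,
                  topic: Optional[str] = None,
                  now: Optional[datetime] = None) -> dict:
    """Return GET /query parameters for one of the recipes in the table."""
    now = now or datetime.now(timezone.utc)
    recipes = {
        "meetings_attended": {"entity_types": "meeting", "connector_ids": "reflector",
                              "contact_ids": contact_id},
        "upcoming_events": {"entity_types": "calendar_event", "connector_ids": "ics_calendar",
                            "date_from": now.isoformat()},
        "emails_from": {"entity_types": "email", "connector_ids": "mbsync_email",
                        "contact_ids": contact_id},
        "zulip_threads_about": {"entity_types": "threaded_conversation",
                                "connector_ids": "zulip", "search": topic},
        "all_documents": {"entity_types": "document", "connector_ids": "hedgedoc"},
        "messages_mentioning": {"entity_types": "conversation_message",
                                "connector_ids": "zulip", "contact_ids": contact_id},
    }
    # Drop filters the caller did not supply.
    return {k: v for k, v in recipes[recipe].items() if v is not None}


print(urlencode(recipe_params("emails_from", contact_id="42")))
# entity_types=email&connector_ids=mbsync_email&contact_ids=42
```
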
[notebook-patterns.md]: ./notebook-patterns.md