DataIndex API Reference

DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an entity with a common base structure plus type-specific fields.

Base URL: http://localhost:42000/dataindex/api/v1/ (direct) or http://caddy/dataindex/api/v1/ (via greywall sandbox)

Entity Types

All entities share these base fields:

| Field | Type | Description |
|---|---|---|
| id | string | Format: connector_name:native_id |
| entity_type | string | One of the types below |
| timestamp | datetime | When the entity occurred |
| contact_ids | string[] | ContactDB IDs of people involved |
| connector_id | string | Which connector produced this |
| title | string? | Display title |
| parent_id | string? | Parent entity (e.g., thread for a message) |
| raw_data | dict | Original source data (excluded by default) |
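
Since every id follows the connector_name:native_id format, the connector can be recovered from the id alone. The following is a minimal sketch; the helper name split_entity_id is hypothetical, and splitting on the first colon only is an assumption (it keeps any colons inside native_id intact).

```python
def split_entity_id(entity_id: str) -> tuple[str, str]:
    """Split 'connector_name:native_id' into (connector_name, native_id).

    Assumption: only the first colon is a separator, so a native_id
    that itself contains colons is returned unchanged.
    """
    connector, sep, native_id = entity_id.partition(":")
    if not sep or not connector or not native_id:
        raise ValueError(f"not a connector_name:native_id value: {entity_id!r}")
    return connector, native_id


if __name__ == "__main__":
    print(split_entity_id("zulip:12345"))
```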

calendar_event

From ICS calendar feeds.

| Field | Type | Description |
|---|---|---|
| start_time | datetime? | Event start |
| end_time | datetime? | Event end |
| all_day | bool | All-day event flag |
| description | string? | Event description |
| location | string? | Event location |
| attendees | dict[] | Attendee list |
| organizer_contact_id | string? | ContactDB ID of organizer |
| status | string? | Event status |
| calendar_name | string? | Source calendar name |
| meeting_url | string? | Video call link |

meeting

From Reflector (recorded meetings with transcripts).

| Field | Type | Description |
|---|---|---|
| start_time | datetime? | Meeting start |
| end_time | datetime? | Meeting end |
| participants | MeetingParticipant[] | People in the meeting |
| meeting_platform | string? | Platform (e.g., "jitsi") |
| transcript | string? | Full transcript text |
| summary | string? | AI-generated summary |
| meeting_url | string? | Meeting link |
| recording_url | string? | Recording link |
| location | string? | Physical location |
| room_name | string? | Virtual room name; often indicates the meeting location (see below) |

MeetingParticipant fields: display_name, contact_id?, platform_user_id?, email?, speaker?

room_name as location indicator: The room_name field often encodes where the meeting took place (e.g., a Jitsi room name like standup-office-bogota). Use it to infer the meeting location when location is not set.
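
The fallback described above can be sketched as follows. The function name meeting_location is hypothetical, and the sketch assumes a meeting entity is available as a plain dict with the fields listed in the table.

```python
from typing import Optional


def meeting_location(meeting: dict) -> Optional[str]:
    """Return the explicit location, else fall back to room_name.

    Empty strings and missing fields are both treated as "not set".
    """
    return meeting.get("location") or meeting.get("room_name") or None


if __name__ == "__main__":
    print(meeting_location({"location": None, "room_name": "standup-office-bogota"}))
```

The returned room name is only a hint; interpreting it (for example, reading an office name out of it) is left to the caller.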

Participant and contact coverage is incomplete. Meeting data comes from Reflector, which only tracks users who are logged into the Reflector platform. This means:

  • contact_ids only contains ContactDB IDs for Reflector-logged participants who were matched to a known contact. It will often be a subset of the actual attendees — do not assume it is the full list.
  • participants is more complete than contact_ids but still only includes people detected by Reflector. Not all participants have accounts or could be identified — some attendees may be entirely absent from this list.
  • contact_id within a participant may be null if the person was detected but couldn't be matched to a ContactDB entry.

Consequence for queries: Filtering meetings by contact_ids will miss meetings where the person attended but wasn't logged into Reflector or wasn't resolved. To get better coverage, combine multiple strategies:

  1. Filter by contact_ids for resolved participants
  2. Search participants[].display_name client-side for name matches
  3. Use POST /search with the person's name to search meeting transcripts and summaries
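
Strategies 1 and 2 can be combined client-side once the meetings have been fetched. This is a minimal sketch, assuming each meeting is a dict shaped like the meeting entity above; meeting_involves is a hypothetical helper, and IDs are compared as strings in case a connector returns them as numbers. Strategy 3 (POST /search) still has to be run separately.

```python
def meeting_involves(meeting: dict, contact_id: str, name: str) -> bool:
    """True if the meeting lists the contact ID, or has a participant
    whose contact_id matches or whose display_name contains `name`."""
    wanted = str(contact_id)

    # Strategy 1: resolved ContactDB IDs on the entity itself.
    if wanted in {str(c) for c in meeting.get("contact_ids") or []}:
        return True

    needle = name.casefold()
    for participant in meeting.get("participants") or []:
        pid = participant.get("contact_id")
        if pid is not None and str(pid) == wanted:
            return True
        # Strategy 2: case-insensitive substring match on display_name.
        display = (participant.get("display_name") or "").casefold()
        if needle and needle in display:
            return True
    return False


if __name__ == "__main__":
    sample = {"contact_ids": [], "participants": [{"display_name": "Alice Doe", "contact_id": None}]}
    print(meeting_involves(sample, "42", "alice"))
```

Name matching can produce false positives when two people share a name, so results found only through display_name are worth reviewing.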

email

From mbsync email sync.

| Field | Type | Description |
|---|---|---|
| thread_id | string? | Email thread grouping |
| text_content | string? | Plain text body |
| html_content | string? | HTML body |
| snippet | string? | Preview snippet |
| from_contact_id | string? | Sender's ContactDB ID |
| to_contact_ids | string[] | Recipient ContactDB IDs |
| cc_contact_ids | string[] | CC recipient ContactDB IDs |
| has_attachments | bool | Has attachments flag |
| attachments | dict[] | Attachment metadata |

conversation

A Zulip stream/channel.

| Field | Type | Description |
|---|---|---|
| recent_messages | dict[] | Recent messages in the conversation |

conversation_message

A single message in a Zulip conversation.

| Field | Type | Description |
|---|---|---|
| message | string? | Message text content |
| mentioned_contact_ids | string[] | ContactDB IDs of mentioned people |

threaded_conversation

A Zulip topic thread (group of messages under a topic).

| Field | Type | Description |
|---|---|---|
| recent_messages | dict[] | Recent messages in the thread |

document

From HedgeDoc, API ingestion, or other document sources.

| Field | Type | Description |
|---|---|---|
| content | string? | Document body text |
| description | string? | Document description |
| mimetype | string? | MIME type |
| url | string? | Source URL |
| revision_id | string? | Revision identifier |

webpage

From browser history extension.

| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| visit_time | datetime | When visited |
| text_content | string? | Page text content |

REST Endpoints

GET /api/v1/query — Exhaustive Filtered Enumeration

Use when you need all entities matching specific criteria. Supports pagination.

When to use: "List all meetings since January", "Get all emails from Alice", "Count calendar events this week"

Query parameters:

| Parameter | Type | Description |
|---|---|---|
| entity_types | string (repeat) | Filter by type; repeat the parameter for multiple values: ?entity_types=email&entity_types=meeting |
| contact_ids | string | Comma-separated ContactDB IDs: "1,42" |
| connector_ids | string | Comma-separated connector IDs: "zulip,reflector" |
| date_from | string | ISO datetime lower bound (UTC if no timezone) |
| date_to | string | ISO datetime upper bound |
| search | string? | Text filter on content fields |
| parent_id | string? | Filter by parent entity |
| thread_id | string? | Filter emails by thread ID |
| room_name | string? | Filter meetings by room name |
| limit | int | Max results per page (default 50) |
| offset | int | Pagination offset (default 0) |
| sort_by | string | "timestamp" (default), "title", "contact_activity", etc. |
| sort_order | string | "desc" (default) or "asc" |
| include_raw_data | bool | Include raw_data field (default false) |

Response format:

{
  "items": [...],
  "total": 152,
  "page": 1,
  "size": 50,
  "pages": 4
}

Pagination: request pages in a loop, increasing offset by limit each time, until offset >= total. See notebook-patterns.md for a reusable helper.
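
The offset loop can be sketched as below. This is not the helper from notebook-patterns.md; iter_all and http_fetch_page are hypothetical names, and the sketch assumes only the items and total fields of the response shown above. The page fetcher is passed in as a function so the loop can be exercised without a running server.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

# Direct base URL from this document; swap in the sandbox URL if needed.
BASE_URL = "http://localhost:42000/dataindex/api/v1"


def iter_all(fetch_page: Callable[[int, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item, advancing offset by limit until offset >= total."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items = page.get("items", [])
        yield from items
        offset += limit
        # The empty-page check guards against looping if total is stale.
        if offset >= page.get("total", 0) or not items:
            break


def http_fetch_page(params: dict) -> Callable[[int, int], dict]:
    """Build a fetcher for GET /query with fixed filter parameters."""
    def fetch(offset: int, limit: int) -> dict:
        # doseq=True repeats list-valued keys such as entity_types.
        query = urlencode({**params, "offset": offset, "limit": limit}, doseq=True)
        with urlopen(f"{BASE_URL}/query?{query}") as resp:
            return json.load(resp)
    return fetch


if __name__ == "__main__":
    fetcher = http_fetch_page({"entity_types": ["meeting"], "date_from": "2025-01-01T00:00:00Z"})
    for entity in iter_all(fetcher):
        print(entity["id"])
```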

POST /api/v1/search — Semantic Search

Use when you need relevant results for a natural-language question. Returns ranked text chunks. No pagination — set a higher limit instead.

When to use: "What was discussed about the product roadmap?", "Find conversations about hiring"

Request body (JSON):

{
  "search_text": "product roadmap decisions",
  "entity_types": ["meeting", "threaded_conversation"],
  "contact_ids": ["1", "42"],
  "date_from": "2025-01-01T00:00:00Z",
  "date_to": "2025-06-01T00:00:00Z",
  "connector_ids": ["reflector", "zulip"],
  "limit": 20
}

Response: {results: [...chunks], total_count} — each chunk has entity_ids, entity_type, connector_id, content, timestamp.
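
A request for this endpoint can be assembled with the standard library alone. The sketch below is illustrative: build_search_request is a hypothetical helper, the Content-Type header is an assumption about what the server expects, and the base URL is the direct one given at the top of this document.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:42000/dataindex/api/v1"


def build_search_request(search_text: str, *, limit: int = 20, **filters) -> Request:
    """Build a POST /search request; filters set to None are left out."""
    body = {"search_text": search_text, "limit": limit}
    body.update({key: value for key, value in filters.items() if value is not None})
    return Request(
        f"{BASE_URL}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_search_request(
        "product roadmap decisions",
        entity_types=["meeting", "threaded_conversation"],
        date_from="2025-01-01T00:00:00Z",
    )
    with urlopen(request) as resp:
        payload = json.load(resp)
    for chunk in payload["results"]:
        print(chunk["entity_type"], chunk["content"][:80])
```

Because the endpoint is not paginated, raising limit is the only way to see more chunks for the same query.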

GET /api/v1/entities/{id} — Get Entity by ID

Retrieve full details of a single entity. The id uses the same connector_name:native_id format as the id base field.

GET /api/v1/connectors/status — Connector Status

Get sync status for all connectors (last sync time, entity count, health).

Common Query Recipes

| Question | entity_type + connector_id |
|---|---|
| Meetings I attended | meeting + reflector, with your contact_id |
| Upcoming calendar events | calendar_event + ics_calendar, date_from=now |
| Emails from someone | email + mbsync_email, with their contact_id |
| Zulip threads about a topic | threaded_conversation + zulip, search="topic" |
| All documents | document + hedgedoc |
| Chat messages mentioning someone | conversation_message + zulip, with contact_id |
| What was discussed about X? | Use POST /search with search_text |
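
Each GET /query recipe above translates into a query string. The following is a minimal sketch of that translation; query_url is a hypothetical helper that repeats entity_types and joins list-valued filters with commas, as the parameter table describes.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "http://localhost:42000/dataindex/api/v1"


def query_url(entity_types: list, **params) -> str:
    """Build a GET /query URL from entity types and optional filters."""
    # entity_types is sent as a repeated parameter.
    pairs = [("entity_types", entity_type) for entity_type in entity_types]
    for key, value in params.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            # contact_ids and connector_ids are comma-separated strings.
            value = ",".join(str(item) for item in value)
        pairs.append((key, value))
    return f"{BASE_URL}/query?{urlencode(pairs)}"


if __name__ == "__main__":
    # "Upcoming calendar events": calendar_event + ics_calendar, date_from=now
    print(query_url(
        ["calendar_event"],
        connector_ids=["ics_calendar"],
        date_from=datetime.now(timezone.utc).isoformat(),
        sort_order="asc",
    ))
    # "Emails from someone": email + mbsync_email, with their contact_id
    print(query_url(["email"], connector_ids=["mbsync_email"], contact_ids=["42"]))
```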