Mathieu Virbel eefac81e57 feat: migrate to skills-based approach

2026-02-19 11:36:32 -06:00

DataIndex API Reference

DataIndex aggregates data from all connected sources (email, calendar, Zulip, meetings, documents) into a unified query interface. Every piece of data is an entity with a common base structure plus type-specific fields.

Base URL: http://localhost:42000/dataindex/api/v1/ (direct) or http://caddy/dataindex/api/v1/ (via greywall sandbox)

Entity Types

All entities share these base fields:

Field	Type	Description
`id`	string	Format: `connector_name:native_id`
`entity_type`	string	One of the types below
`timestamp`	datetime	When the entity occurred
`contact_ids`	string[]	ContactDB IDs of people involved
`connector_id`	string	Which connector produced this
`title`	string?	Display title
`parent_id`	string?	Parent entity (e.g., thread for a message)
`raw_data`	dict	Original source data (excluded by default)
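The `id` format can be taken apart on the client when you need to know which connector an entity came from. The following is a minimal sketch, not part of DataIndex: `split_entity_id` is a hypothetical helper, and it assumes that only the first colon separates the connector name from the native ID (native IDs such as email message IDs may contain colons themselves).

```python
def split_entity_id(entity_id: str) -> tuple[str, str]:
    """Split a 'connector_name:native_id' string at the first colon.

    Assumption: connector names never contain ':'; native IDs might.
    """
    connector, sep, native = entity_id.partition(":")
    if not sep or not connector or not native:
        raise ValueError(f"not a connector_name:native_id string: {entity_id!r}")
    return connector, native
```

For example, `split_entity_id("zulip:12345")` yields the connector name `zulip` and the native ID `12345`.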

`calendar_event`

From ICS calendar feeds.

Field	Type	Description
`start_time`	datetime?	Event start
`end_time`	datetime?	Event end
`all_day`	bool	All-day event flag
`description`	string?	Event description
`location`	string?	Event location
`attendees`	dict[]	Attendee list
`organizer_contact_id`	string?	ContactDB ID of organizer
`status`	string?	Event status
`calendar_name`	string?	Source calendar name
`meeting_url`	string?	Video call link

`meeting`

From Reflector (recorded meetings with transcripts).

Field	Type	Description
`start_time`	datetime?	Meeting start
`end_time`	datetime?	Meeting end
`participants`	MeetingParticipant[]	People in the meeting
`meeting_platform`	string?	Platform (e.g., "jitsi")
`transcript`	string?	Full transcript text
`summary`	string?	AI-generated summary
`meeting_url`	string?	Meeting link
`recording_url`	string?	Recording link
`location`	string?	Physical location
`room_name`	string?	Virtual room name (also indicates meeting location — see below)

MeetingParticipant fields: display_name, contact_id?, platform_user_id?, email?, speaker?

room_name as location indicator: The room_name field often encodes where the meeting took place (e.g., a Jitsi room name like standup-office-bogota). Use it to infer the meeting location when location is not set.
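The fallback described above can be sketched as a small client-side helper. `meeting_location` is a hypothetical name, and the function assumes a meeting entity has already been decoded into a Python dict; the returned `room_name` is only a hint and still needs interpretation by the caller.

```python
from typing import Optional


def meeting_location(meeting: dict) -> Optional[str]:
    """Prefer the explicit `location`; otherwise fall back to `room_name`."""
    # Empty strings are treated like missing values.
    return meeting.get("location") or meeting.get("room_name")
```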

Participant and contact coverage is incomplete. Meeting data comes from Reflector, which only tracks users who are logged into the Reflector platform. This means:

- `contact_ids` only contains ContactDB IDs for Reflector-logged participants who were matched to a known contact. It is often a subset of the actual attendees, so do not assume it is the full list.

- `participants` is more complete than `contact_ids` but still only includes people detected by Reflector. Not all participants have accounts or could be identified, so some attendees may be entirely absent from this list.

- `contact_id` within a participant may be null if the person was detected but couldn't be matched to a ContactDB entry.

Consequence for queries: Filtering meetings by contact_ids will miss meetings where the person attended but wasn't logged into Reflector or wasn't resolved. To get better coverage, combine multiple strategies:

1. Filter by `contact_ids` for resolved participants.

2. Search `participants[].display_name` client-side for name matches.

3. Use `POST /search` with the person's name to search meeting transcripts and summaries.
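The first two strategies can be combined client-side once meeting entities have been fetched. The sketch below is illustrative only: `meetings_involving` is a hypothetical helper, it assumes meetings are plain dicts shaped like the `meeting` entity above, and it uses case-insensitive substring matching on `display_name`, which can produce false positives for short or common names. The transcript search strategy is not covered here.

```python
def meetings_involving(meetings, contact_id, name):
    """Return meetings where a person appears by contact ID or display name."""
    needle = name.casefold()
    matched = []
    for meeting in meetings:
        participants = meeting.get("participants") or []
        # Strategy 1: the person was resolved to a ContactDB entry.
        by_id = contact_id in (meeting.get("contact_ids") or []) or any(
            p.get("contact_id") == contact_id for p in participants
        )
        # Strategy 2: the person was detected but possibly not resolved.
        by_name = any(
            needle in (p.get("display_name") or "").casefold()
            for p in participants
        )
        if by_id or by_name:
            matched.append(meeting)
    return matched
```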

`email`

From mbsync email sync.

Field	Type	Description
`thread_id`	string?	Email thread grouping
`text_content`	string?	Plain text body
`html_content`	string?	HTML body
`snippet`	string?	Preview snippet
`from_contact_id`	string?	Sender's ContactDB ID
`to_contact_ids`	string[]	Recipient ContactDB IDs
`cc_contact_ids`	string[]	CC recipient ContactDB IDs
`has_attachments`	bool	Has attachments flag
`attachments`	dict[]	Attachment metadata

`conversation`

A Zulip stream/channel.

Field	Type	Description
`recent_messages`	dict[]	Recent messages in the conversation

`conversation_message`

A single message in a Zulip conversation.

Field	Type	Description
`message`	string?	Message text content
`mentioned_contact_ids`	string[]	ContactDB IDs of mentioned people

`threaded_conversation`

A Zulip topic thread (group of messages under a topic).

Field	Type	Description
`recent_messages`	dict[]	Recent messages in the thread

`document`

From HedgeDoc, API ingestion, or other document sources.

Field	Type	Description
`content`	string?	Document body text
`description`	string?	Document description
`mimetype`	string?	MIME type
`url`	string?	Source URL
`revision_id`	string?	Revision identifier

`webpage`

From browser history extension.

Field	Type	Description
`url`	string	Page URL
`visit_time`	datetime	When visited
`text_content`	string?	Page text content

REST Endpoints

GET `/api/v1/query` — Exhaustive Filtered Enumeration

Use when you need all entities matching specific criteria. Supports pagination.

When to use: "List all meetings since January", "Get all emails from Alice", "Count calendar events this week"

Query parameters:

Parameter	Type	Description
`entity_types`	string (repeat)	Filter by type — repeat param for multiple: `?entity_types=email&entity_types=meeting`
`contact_ids`	string	Comma-separated ContactDB IDs: `"1,42"`
`connector_ids`	string	Comma-separated connector IDs: `"zulip,reflector"`
`date_from`	string	ISO datetime lower bound (UTC if no timezone)
`date_to`	string	ISO datetime upper bound
`search`	string?	Text filter on content fields
`parent_id`	string?	Filter by parent entity
`thread_id`	string?	Filter emails by thread ID
`room_name`	string?	Filter meetings by room name
`limit`	int	Max results per page (default 50)
`offset`	int	Pagination offset (default 0)
`sort_by`	string	`"timestamp"` (default), `"title"`, `"contact_activity"`, etc.
`sort_order`	string	`"desc"` (default) or `"asc"`
`include_raw_data`	bool	Include raw_data field (default false)
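The two list encodings in this table differ: `entity_types` is repeated once per value, while `contact_ids` and `connector_ids` are single comma-separated strings. A minimal sketch of building such a URL with the standard library follows. `build_query_url` is a hypothetical helper, the base URL is the direct one from the top of this document, and rendering booleans as lowercase `true`/`false` is an assumption about what the server accepts.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:42000/dataindex/api/v1"


def build_query_url(entity_types=(), contact_ids=(), connector_ids=(), **params):
    """Build a GET /query URL from keyword arguments."""
    # entity_types is a repeated parameter: one key/value pair per type.
    pairs = [("entity_types", t) for t in entity_types]
    # contact_ids and connector_ids are each one comma-separated value.
    if contact_ids:
        pairs.append(("contact_ids", ",".join(str(c) for c in contact_ids)))
    if connector_ids:
        pairs.append(("connector_ids", ",".join(connector_ids)))
    for key, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = str(value).lower()  # assumption: server expects true/false
        pairs.append((key, str(value)))
    return f"{BASE_URL}/query?{urlencode(pairs)}"
```

Note that `urlencode` percent-encodes the comma as `%2C`, which decodes back to a comma on the server side.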

Response format:

{
  "items": [...],
  "total": 152,
  "page": 1,
  "size": 50,
  "pages": 4
}

Pagination: request successive pages, increasing `offset` by `limit` each time, until `offset >= total`. See notebook-patterns.md for a reusable helper.
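As an illustration of the offset loop (this is not the helper from notebook-patterns.md), the sketch below takes the HTTP call as a parameter so the loop itself stays independent of any client library. `fetch_all` and `fetch_page` are hypothetical names; `fetch_page` is assumed to return the decoded response dict with `items` and `total`.

```python
def fetch_all(fetch_page, limit=50):
    """Collect every item by advancing `offset` until it reaches `total`.

    `fetch_page(offset=..., limit=...)` must return a dict with
    "items" (list) and "total" (int), as in the response format above.
    """
    items, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        items.extend(page["items"])
        offset += limit
        # Stop at the end of the result set; the empty-page check guards
        # against looping forever if `total` is stale or wrong.
        if offset >= page["total"] or not page["items"]:
            break
    return items
```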

POST `/api/v1/search` — Semantic Search

Use when you need relevant results for a natural-language question. Returns ranked text chunks. No pagination — set a higher limit instead.

When to use: "What was discussed about the product roadmap?", "Find conversations about hiring"

Request body (JSON):

{
  "search_text": "product roadmap decisions",
  "entity_types": ["meeting", "threaded_conversation"],
  "contact_ids": ["1", "42"],
  "date_from": "2025-01-01T00:00:00Z",
  "date_to": "2025-06-01T00:00:00Z",
  "connector_ids": ["reflector", "zulip"],
  "limit": 20
}

Response: {results: [...chunks], total_count} — each chunk has entity_ids, entity_type, connector_id, content, timestamp.
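A search call can be prepared with the standard library alone. The sketch below only builds the request object and does not send it. `build_search_request` is a hypothetical helper, the URL assumes the direct base URL from the top of this document, and the `Content-Type: application/json` header is an assumption based on the body being JSON.

```python
import json
from urllib.request import Request

SEARCH_URL = "http://localhost:42000/dataindex/api/v1/search"


def build_search_request(search_text, limit=20, **filters):
    """Build a POST /search request; filters with value None are omitted."""
    body = {"search_text": search_text, "limit": limit}
    body.update({k: v for k, v in filters.items() if v is not None})
    return Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Because this endpoint has no pagination, raise `limit` when more chunks are needed.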

GET `/api/v1/entities/{id}` — Get Entity by ID

Retrieve full details of a single entity. The `id` path parameter uses the same `connector_name:native_id` format as the `id` base field.
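Because native IDs come from the source systems, they may contain characters that are not safe in a URL path. The sketch below percent-encodes everything except the colon separator. `entity_url` is a hypothetical helper, and whether the server expects or accepts encoded IDs is an assumption to verify against a running instance.

```python
from urllib.parse import quote


def entity_url(entity_id, base_url="http://localhost:42000/dataindex/api/v1"):
    """Build the GET /entities/{id} URL, keeping ':' and escaping the rest."""
    return f"{base_url}/entities/{quote(entity_id, safe=':')}"
```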

GET `/api/v1/connectors/status` — Connector Status

Get sync status for all connectors (last sync time, entity count, health).

Common Query Recipes

Question	`entity_types` + `connector_ids` (plus other parameters)
Meetings I attended	`meeting` + `reflector`, with your contact_id
Upcoming calendar events	`calendar_event` + `ics_calendar`, date_from=now
Emails from someone	`email` + `mbsync_email`, with their contact_id
Zulip threads about a topic	`threaded_conversation` + `zulip`, search="topic"
All documents	`document` + `hedgedoc`
Chat messages mentioning someone	`conversation_message` + `zulip`, with contact_id
What was discussed about X?	Use `POST /search` with `search_text`
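As one worked recipe, "Upcoming calendar events" translates into `GET /query` parameters as sketched below. `upcoming_events_params` is a hypothetical helper; sorting ascending so the soonest event comes first is a design choice of this example, not something the recipe table prescribes.

```python
from datetime import datetime, timezone


def upcoming_events_params(now=None):
    """Query parameters for calendar events starting from `now` onward."""
    now = now or datetime.now(timezone.utc)
    return {
        "entity_types": "calendar_event",
        "connector_ids": "ics_calendar",
        "date_from": now.isoformat(),  # timezone-aware ISO datetime
        "sort_order": "asc",           # assumption: soonest event first
    }
```

The resulting dict can be passed to `urllib.parse.urlencode` and appended to the `/query` URL.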
