Compare commits

..

30 Commits

Author SHA1 Message Date
ccffdba75b chore(main): release 0.8.1 (#591) 2025-08-29 11:56:11 -06:00
84a381220b fix: make webhook secret/url allowing null (#590) 2025-08-29 11:55:18 -06:00
5f2f0e9317 chore(main): release 0.8.0 (#579) 2025-08-29 11:34:24 -06:00
88ed7cfa78 feat(rooms): add webhook for transcript completion (#578)
* feat(rooms): add webhook notifications for transcript completion

- Add webhook_url and webhook_secret fields to rooms table
- Create Celery task with 24-hour retry window using exponential backoff
- Send transcript metadata, diarized text, topics, and summaries via webhook
- Add HMAC signature verification for webhook security
- Add test endpoint POST /v1/rooms/{room_id}/webhook/test
- Update frontend with webhook configuration UI and test button
- Auto-generate webhook secret if not provided
- Trigger webhook after successful file pipeline processing for room recordings

* style: linting

* fix: remove unwanted files

* fix: update openapi gen

* fix: self-review

* docs: add comprehensive webhook documentation

- Document webhook configuration, events, and payloads
- Include transcript.completed and test event examples
- Add security considerations and best practices
- Provide example webhook receiver implementation
- Document retry policy and signature verification

* fix: remove audio_mp3_url from webhook payload

- Remove audio download URL generation from webhook
- Update documentation to reflect the change
- Keep only frontend_url for accessing transcripts

* docs: remove unwanted section

* fix: correct API method name and type imports for rooms

- Fix v1RoomsRetrieve to v1RoomsGet
- Update Room type to RoomDetails throughout frontend
- Fix type imports in useRoomList, RoomList, RoomTable, and RoomCards

* feat: add show/hide toggle for webhook secret field

- Add eye icon button to reveal/hide webhook secret when editing
- Show password dots when webhook secret is hidden
- Reset visibility state when opening/closing dialog
- Only show toggle button when editing existing room with secret

* fix: resolve event loop conflict in webhook test endpoint

- Extract webhook test logic into shared async function
- Call async function directly from FastAPI endpoint
- Keep Celery task wrapper for background processing
- Fixes RuntimeError: event loop already running

* refactor: remove unnecessary Celery task for webhook testing

- Webhook testing is synchronous and provides immediate feedback
- No need for background processing via Celery
- Keep only the async function called directly from API endpoint

* feat: improve webhook test error messages and display

- Show HTTP status code in error messages
- Parse JSON error responses to extract meaningful messages
- Improved UI layout for webhook test results
- Added colored background for success/error states
- Better text wrapping for long error messages

* docs: adjust doc

* fix: review

* fix: update attempts to match close 24h

* fix: add event_id

* fix: changed to uuid, to have new event_id when reprocess.

* style: linting

* fix: alembic revision
2025-08-29 10:07:49 -06:00
6f0c7c1a5e feat(cleanup): add automatic data retention for public instances (#574)
* feat(cleanup): add automatic data retention for public instances

- Add Celery task to clean up anonymous data after configurable retention period
- Delete transcripts, meetings, and orphaned recordings older than retention days
- Only runs when PUBLIC_MODE is enabled to prevent accidental data loss
- Properly removes all associated files (local and S3 storage)
- Add manual cleanup tool for testing and intervention
- Configure retention via PUBLIC_DATA_RETENTION_DAYS setting (default: 7 days)

Fixes #571

* fix: apply pre-commit formatting fixes

* fix: properly delete recording files from storage during cleanup

- Add storage deletion for orphaned recordings in both cleanup task and manual tool
- Delete from storage before removing database records
- Log warnings if storage deletion fails but continue with database cleanup

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* refactor: cleanup_old_data for better logging

* fix: linting

* test: fix meeting cleanup test to not require room controller

- Simplify test by directly inserting meetings into database
- Remove dependency on non-existent rooms_controller.create method
- Tests now pass successfully

* fix: linting

* refactor: simplify cleanup tool to use worker implementation

- Remove duplicate cleanup logic from manual tool
- Use the same _cleanup_old_public_data function from worker
- Remove dry-run feature as requested
- Prevent code duplication and ensure consistency
- Update documentation to reflect changes

* refactor: split cleanup worker into smaller functions

- Move all imports to the top of the file
- Extract cleanup logic into separate functions:
  - cleanup_old_transcripts()
  - cleanup_old_meetings()
  - cleanup_orphaned_recordings()
  - log_cleanup_results()
- Make code more maintainable and testable
- Add days parameter support to Celery task
- Update manual tool to work with refactored code

* feat: add TypedDict typing for cleanup stats

- Add CleanupStats TypedDict for better type safety
- Update all function signatures to use proper typing
- Add return type annotations to _cleanup_old_public_data
- Improves code maintainability and IDE support

* feat: add CASCADE DELETE to meeting_consent foreign key

- Add ondelete="CASCADE" to meeting_consent.meeting_id foreign key
- Generate and apply migration to update existing constraint
- Remove manual consent deletion from cleanup code
- Add unit test to verify CASCADE DELETE behavior

* style: linting

* fix: alembic migration branchpoint

* fix: correct downgrade constraint name in CASCADE DELETE migration

* fix: regenerate CASCADE DELETE migration with proper constraint names

- Delete problematic migration and regenerate with correct names
- Use explicit constraint name in both upgrade and downgrade
- Ensure migration works bidirectionally
- All tests passing including CASCADE DELETE test

* style: linting

* refactor: simplify cleanup to use transcripts as entry point

- Remove orphaned_recordings cleanup (not part of this PR scope)
- Remove separate old_meetings cleanup
- Transcripts are now the main entry point for cleanup
- Associated meetings and recordings are deleted with their transcript
- Use single database connection for all operations
- Update tests to reflect new approach

* refactor: cleanup and rename functions for clarity

- Rename _cleanup_old_public_data to cleanup_old_public_data (make public)
- Rename celery task to cleanup_old_public_data_task for clarity
- Update docstrings and improve code organization
- Remove unnecessary comments and simplify deletion logic
- Update tests to use new function names
- All tests passing

* style: linting\

* style: typing and review

* fix: add transaction on cleanup_single_transcript

* fix: naming

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
2025-08-29 08:47:14 -06:00
9dfd76996f fix: file pipeline status reporting and websocket updates (#589)
* feat: use file pipeline for upload and reprocess action

* fix: make file pipeline correctly report status events

* fix: duplication of transcripts_controller

* fix: tests

* test: fix file upload test

* test: fix reprocess

* fix: also patch from main_file_pipeline

(how patch is done is dependent of file import unfortunately)
2025-08-29 00:58:14 -06:00
55cc8637c6 ci: restrict workflow execution to main branch and add concurrency (#586)
* ci: try adding concurrency

* ci: restrict push on main branch

* ci: fix concurrency key

* ci: fix build concurrency

* refactor: apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
2025-08-28 16:43:17 -06:00
f5331a2107 style: more type annotations to parakeet transcriber (#581)
* feat: add comprehensive type annotations to Parakeet transcriber

- Add TypedDict for WordTiming with word, start, end fields
- Add NamedTuple for TimeSegment, AudioSegment, and TranscriptResult
- Add type hints to all generator functions (vad_segment_generator, batch_speech_segments, etc.)
- Add enforce_word_timing_constraints function to prevent word timing overlaps
- Refactor batch_segment_to_audio_segment to reuse pad_audio function

* doc: add note about space
2025-08-28 12:22:07 -06:00
Igor Loskutov
124ce03bf8 fix: Igor/evaluation (#575)
* fix: impossible import error (#563)

* evaluation cli - database events experiment

* hallucinations

* evaluation - unhallucinate

* evaluation - unhallucinate

* roll back reliability link

* self reviewio

* lint

* self review

* add file pipeline to cli

* add file pipeline to cli + sorting

* remove cli tests

* remove ai comments

* comments
2025-08-28 12:07:34 -04:00
7030e0f236 fix: optimize parakeet transcription batching algorithm (#577)
* refactor: optimize transcription batching to accumulate speech segments

- Changed VAD segment generator to return full audio array instead of segments
- Removed segment filtering step
- Modified batch_segments to accumulate maximum speech including silence
- Transcribe larger continuous chunks instead of individual speech segments

* fix: correct transcribe_batch call to use list and fix batch unpacking

* fix: simplify

* fix: remove unused variables

* fix: add typing
2025-08-27 10:32:04 -06:00
37f0110892 doc: update local model readme 2025-08-22 17:50:24 -06:00
cf2896a7f4 doc: update readme about installation instructions
Add a note about installation instructions being inaccurate.
2025-08-22 17:48:35 -06:00
aabf2c2572 chore(main): release 0.7.3 (#565) 2025-08-22 16:35:52 -06:00
6a7b08f016 doc: change readme intro 2025-08-22 16:26:25 -06:00
e2736563d9 doc: update readme with new images 2025-08-22 16:15:54 -06:00
0f54b7782d chore: ignore www/.env.[development,production] 2025-08-22 14:41:09 -06:00
359280dd34 fix: cleaned repo, and get git-leaks clean 2025-08-22 11:51:34 -06:00
9265d201b5 fix: restore previous behavior on live pipeline + audio downscaler (#561)
This commit restore the original behavior with frame cutting. While
silero is used on our gpu for files, look like it's not working great on
the live pipeline. To be investigated, but at the moment, what we keep
is:

- refactored to extract the downscale for further processing in the
pipeline
- remove any downscale implementation from audio_chunker and audio_merge
- removed batching from audio_merge too for now
2025-08-22 10:49:26 -06:00
52f9f533d7 chore(main): release 0.7.2 (#559) 2025-08-21 21:00:05 -06:00
0c3878ac3c fix: docker image not loading libgomp.so.1 for torch (#560)
On ARM64, the docker iamge crash because torch cannot load libgomp.so.1
-- Look like pytorch does not install the same packages depending the
platform.

AMD64:

/app/.venv/lib/python3.12/site-packages/torch/lib/libgomp.so.1
/app/.venv/lib/python3.12/site-packages/ctranslate2.libs/libgomp-a34b3233.so.1.0.0
/app/.venv/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0

ARM64:

/app/.venv/lib/python3.12/site-packages/ctranslate2.libs/libgomp-d22c30c5.so.1.0.0
/app/.venv/lib/python3.12/site-packages/scikit_learn.libs/libgomp-947d5fa1.so.1.0.0
/app/.venv/lib/python3.12/site-packages/torch.libs/libgomp-947d5fa1.so.1.0.0
2025-08-21 16:41:35 -06:00
Igor Loskutov
d70beee51b fix: include shared rooms to search (#558)
* include shared rooms to search

* tests vibe

* tests vibe

* tests vibe

* tests vibe

* tests vibe

* tests vibe

* tests vibe

* remove tests, thats too much
2025-08-21 14:52:29 -04:00
bc5b351d2b chore(main): release 0.7.1 (#557) 2025-08-20 23:23:27 -06:00
Igor Loskutov
07981e8090 fix: webvtt db null expectation mismatch (#556) 2025-08-20 23:22:41 -06:00
7e366f6338 chore(main): release 0.7.0 (#541) 2025-08-20 22:24:36 -06:00
7592679a35 build: separate silero-vad and force torch to be resolved without nvidia (#555)
* build: separate silero-vad and force torch to be resolved without nvidia

* build: also add torchaudio as cpu version
2025-08-20 22:23:48 -06:00
af16178f86 ci: use github-token to get around potential api throttling + rework dockerfile (#554)
* ci: use github-token to get around potential api throttling

* build: put pyannote-audio separate to the project

* fix: now that we have a readme, use it

* build: add UV_NO_CACHE
2025-08-20 21:59:29 -06:00
3ea7f6b7b6 feat: pipeline improvement with file processing, parakeet, silero-vad (#540)
* feat: improve pipeline threading, and transcriber (parakeet and silero vad)

* refactor: remove whisperx, implement parakeet

* refactor: make audio_chunker more smart and wait for speech, instead of fixed frame

* refactor: make audio merge to always downscale the audio to 16k for transcription

* refactor: make the audio transcript modal accepting batches

* refactor: improve type safety and remove prometheus metrics

- Add DiarizationSegment TypedDict for proper diarization typing
- Replace List/Optional with modern Python list/| None syntax
- Remove all Prometheus metrics from TranscriptDiarizationAssemblerProcessor
- Add comprehensive file processing pipeline with parallel execution
- Update processor imports and type annotations throughout
- Implement optimized file pipeline as default in process.py tool

* refactor: convert FileDiarizationProcessor I/O types to BaseModel

Update FileDiarizationInput and FileDiarizationOutput to inherit from
BaseModel instead of plain classes, following the standard pattern
used by other processors in the codebase.

* test: add tests for file transcript and diarization with pytest-recording

* build: add pytest-recording

* feat: add local pyannote for testing

* fix: replace PyAV AudioResampler with torchaudio for reliable audio processing

- Replace problematic PyAV AudioResampler that was causing ValueError: [Errno 22] Invalid argument
- Use torchaudio.functional.resample for robust sample rate conversion
- Optimize processing: skip conversion for already 16kHz mono audio
- Add direct WAV writing with Python wave module for better performance
- Consolidate duplicate downsample checks for cleaner code
- Maintain list[av.AudioFrame] input interface
- Required for Silero VAD which needs 16kHz mono audio

* fix: replace PyAV AudioResampler with torchaudio solution

- Resolves ValueError: [Errno 22] Invalid argument in AudioMergeProcessor
- Replaces problematic PyAV AudioResampler with torchaudio.functional.resample
- Optimizes processing to skip unnecessary conversions when audio is already 16kHz mono
- Uses direct WAV writing with Python's wave module for better performance
- Fixes test_basic_process to disable diarization (pyannote dependency not installed)
- Updates test expectations to match actual processor behavior
- Removes unused pydub dependency from pyproject.toml
- Adds comprehensive TEST_ANALYSIS.md documenting test suite status

* feat: add parameterized test for both diarization modes

- Adds @pytest.mark.parametrize to test_basic_process with enable_diarization=[False, True]
- Test with diarization=False always passes (tests core AudioMergeProcessor functionality)
- Test with diarization=True gracefully skips when pyannote.audio is not installed
- Provides comprehensive test coverage for both pipeline configurations

* fix: resolve pipeline property naming conflict in AudioDiarizationPyannoteProcessor

- Renames 'pipeline' property to 'diarization_pipeline' to avoid conflict with base Processor.pipeline attribute
- Fixes AttributeError: 'property 'pipeline' object has no setter' when set_pipeline() is called
- Updates property usage in _diarize method to use new name
- Now correctly supports pipeline initialization for diarization processing

* fix: add local for pyannote

* test: add diarization test

* fix: resample on audio merge now working

* fix: correctly restore timestamp

* fix: display exception in a threaded processor if that happen

* Update pyproject.toml

* ci: remove option

* ci: update astral-sh/setup-uv

* test: add monadical url for pytest-recording

* refactor: remove previous version

* build: move faster whisper to local dep

* test: fix missing import

* refactor: improve main_file_pipeline organization and error handling

- Move all imports to the top of the file
- Create unified EmptyPipeline class to replace duplicate mock pipeline code
- Remove timeout and fallback logic - let processors handle their own retries
- Fix error handling to raise any exception from parallel tasks
- Add proper type hints and validation for captured results

* fix: wrong function

* fix: remove task_done

* feat: add configurable file processing timeouts for modal processors

- Add TRANSCRIPT_FILE_TIMEOUT setting (default: 600s) for file transcription
- Add DIARIZATION_FILE_TIMEOUT setting (default: 600s) for file diarization
- Replace hardcoded timeout=600 with configurable settings in modal processors
- Allows customization of timeout values via environment variables

* fix: use logger

* fix: worker process meetings now use file pipeline

* fix: topic not gathered

* refactor: remove prepare(), pipeline now work

* refactor: implement many review from Igor

* test: add test for test_pipeline_main_file

* refactor: remove doc

* doc: add doc

* ci: update build to use native arm64 builder

* fix: merge fixes

* refactor: changes from Igor review + add test (not by default) to test gpu modal part

* ci: update to our own runner linux-amd64

* ci: try using suggested mode=min

* fix: update diarizer for latest modal, and use volume

* fix: modal file extension detection

* fix: put the diarizer as A100
2025-08-20 20:07:19 -06:00
Igor Loskutov
009590c080 feat: search frontend (#551)
* feat: better highlight

* feat(search): add long_summary to search vector for improved search results

- Update search vector to include long_summary with weight B (between title A and webvtt C)
- Modify SearchController to fetch long_summary and prioritize its snippets
- Generate snippets from long_summary first (max 2), then from webvtt for remaining slots
- Add comprehensive tests for long_summary search functionality
- Create migration to update search_vector_en column in PostgreSQL

This improves search quality by including summarized content which often contains
key topics and themes that may not be explicitly mentioned in the transcript.

* fix: address code review feedback for search enhancements

- Fix test file inconsistencies by removing references to non-existent model fields
  - Comment out tests for unimplemented features (room_ids, status filters, date ranges)
  - Update tests to only use currently available fields (room_id singular, no room_name/processing_status)
  - Mark future functionality tests with @pytest.mark.skip

- Make snippet counts configurable
  - Add LONG_SUMMARY_MAX_SNIPPETS constant (default: 2)
  - Replace hardcoded value with configurable constant

- Improve error handling consistency in WebVTT parsing
  - Use different log levels for different error types (debug for malformed, warning for decode, error for unexpected)
  - Add catch-all exception handler for unexpected errors
  - Include stack trace for critical errors

All existing tests pass with these changes.

* fix: correct datetime test to include required duration field

* feat: better highlight

* feat: search room names

* feat: acknowledge deleted room

* feat: search filters fix and rank removal

* chore: minor refactoring

* feat: better matches frontend

* chore: self-review (vibe)

* chore: self-review WIP

* chore: self-review WIP

* chore: self-review WIP

* chore: self-review WIP

* chore: self-review WIP

* chore: self-review WIP

* chore: self-review WIP

* remove swc (vibe)

* search url query sync (vibe)

* search url query sync (vibe)

* better casts and cap while

* PR review + simplify frontend hook

* pr: remove search db timeouts

* cleanup tests

* tests cleanup

* frontend cleanup

* index declarations

* refactor frontend (self-review)

* fix search pagination

* clear "x" for search input

* pagination max pages fix

* chore: cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* cleanup

* lockfile

* pr review
2025-08-20 20:56:45 -04:00
Igor Loskutov
fe5d344cff diarization cli: throw on modal errors (#553) 2025-08-20 10:21:52 -04:00
Igor Loskutov
86455ce573 chore: type fixes (#544)
* chore: type fixes

* chore: type fixes
2025-08-18 16:31:23 -04:00
128 changed files with 9350 additions and 7233 deletions

View File

@@ -2,6 +2,8 @@ name: Test Database Migrations
on:
push:
branches:
- main
paths:
- "server/migrations/**"
- "server/reflector/db/**"
@@ -17,6 +19,9 @@ on:
jobs:
test-migrations:
runs-on: ubuntu-latest
concurrency:
group: db-ubuntu-latest-${{ github.ref }}
cancel-in-progress: true
services:
postgres:
image: postgres:17

View File

@@ -8,18 +8,30 @@ env:
ECR_REPOSITORY: reflector
jobs:
deploy:
runs-on: ubuntu-latest
build:
strategy:
matrix:
include:
- platform: linux/amd64
runner: linux-amd64
arch: amd64
- platform: linux/arm64
runner: linux-arm64
arch: arm64
runs-on: ${{ matrix.runner }}
permissions:
deployments: write
contents: read
outputs:
registry: ${{ steps.login-ecr.outputs.registry }}
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@0e613a0980cbf65ed5b322eb7a1e075d28913a83
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
@@ -27,21 +39,52 @@ jobs:
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@62f4f872db3836360b72999f4b87f1ff13310f3a
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: aws-actions/amazon-ecr-login@v2
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
uses: docker/setup-buildx-action@v3
- name: Build and push
id: docker_build
uses: docker/build-push-action@v4
- name: Build and push ${{ matrix.arch }}
uses: docker/build-push-action@v5
with:
context: server
platforms: linux/amd64,linux/arm64
platforms: ${{ matrix.platform }}
push: true
tags: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:latest
cache-from: type=gha
cache-to: type=gha,mode=max
tags: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}:latest-${{ matrix.arch }}
cache-from: type=gha,scope=${{ matrix.arch }}
cache-to: type=gha,mode=max,scope=${{ matrix.arch }}
github-token: ${{ secrets.GHA_CACHE_TOKEN }}
provenance: false
create-manifest:
runs-on: ubuntu-latest
needs: [build]
permissions:
deployments: write
contents: read
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
uses: aws-actions/amazon-ecr-login@v2
- name: Create and push multi-arch manifest
run: |
# Get the registry URL (since we can't easily access job outputs in matrix)
ECR_REGISTRY=$(aws ecr describe-registry --query 'registryId' --output text).dkr.ecr.${{ env.AWS_REGION }}.amazonaws.com
docker manifest create \
$ECR_REGISTRY/${{ env.ECR_REPOSITORY }}:latest \
$ECR_REGISTRY/${{ env.ECR_REPOSITORY }}:latest-amd64 \
$ECR_REGISTRY/${{ env.ECR_REPOSITORY }}:latest-arm64
docker manifest push $ECR_REGISTRY/${{ env.ECR_REPOSITORY }}:latest
echo "✅ Multi-arch manifest pushed: $ECR_REGISTRY/${{ env.ECR_REPOSITORY }}:latest"

View File

@@ -5,12 +5,17 @@ on:
paths:
- "server/**"
push:
branches:
- main
paths:
- "server/**"
jobs:
pytest:
runs-on: ubuntu-latest
concurrency:
group: pytest-${{ github.ref }}
cancel-in-progress: true
services:
redis:
image: redis:6
@@ -19,29 +24,47 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v3
uses: astral-sh/setup-uv@v6
with:
enable-cache: true
working-directory: server
- name: Tests
run: |
cd server
uv run -m pytest -v tests
docker:
runs-on: ubuntu-latest
docker-amd64:
runs-on: linux-amd64
concurrency:
group: docker-amd64-${{ github.ref }}
cancel-in-progress: true
steps:
- uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Build and push
id: docker_build
uses: docker/build-push-action@v4
uses: docker/setup-buildx-action@v3
- name: Build AMD64
uses: docker/build-push-action@v6
with:
context: server
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
platforms: linux/amd64
cache-from: type=gha,scope=amd64
cache-to: type=gha,mode=max,scope=amd64
github-token: ${{ secrets.GHA_CACHE_TOKEN }}
docker-arm64:
runs-on: linux-arm64
concurrency:
group: docker-arm64-${{ github.ref }}
cancel-in-progress: true
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build ARM64
uses: docker/build-push-action@v6
with:
context: server
platforms: linux/arm64
cache-from: type=gha,scope=arm64
cache-to: type=gha,mode=max,scope=arm64
github-token: ${{ secrets.GHA_CACHE_TOKEN }}

4
.gitignore vendored
View File

@@ -14,4 +14,6 @@ data/
www/REFACTOR.md
www/reload-frontend
server/test.sqlite
CLAUDE.local.md
CLAUDE.local.md
www/.env.development
www/.env.production

1
.gitleaksignore Normal file
View File

@@ -0,0 +1 @@
b9d891d3424f371642cb032ecfd0e2564470a72c:server/tests/test_transcripts_recording_deletion.py:generic-api-key:15

View File

@@ -27,3 +27,8 @@ repos:
files: ^server/
- id: ruff-format
files: ^server/
- repo: https://github.com/gitleaks/gitleaks
rev: v8.28.0
hooks:
- id: gitleaks

View File

@@ -1,5 +1,67 @@
# Changelog
## [0.8.1](https://github.com/Monadical-SAS/reflector/compare/v0.8.0...v0.8.1) (2025-08-29)
### Bug Fixes
* make webhook secret/url allowing null ([#590](https://github.com/Monadical-SAS/reflector/issues/590)) ([84a3812](https://github.com/Monadical-SAS/reflector/commit/84a381220bc606231d08d6f71d4babc818fa3c75))
## [0.8.0](https://github.com/Monadical-SAS/reflector/compare/v0.7.3...v0.8.0) (2025-08-29)
### Features
* **cleanup:** add automatic data retention for public instances ([#574](https://github.com/Monadical-SAS/reflector/issues/574)) ([6f0c7c1](https://github.com/Monadical-SAS/reflector/commit/6f0c7c1a5e751713366886c8e764c2009e12ba72))
* **rooms:** add webhook for transcript completion ([#578](https://github.com/Monadical-SAS/reflector/issues/578)) ([88ed7cf](https://github.com/Monadical-SAS/reflector/commit/88ed7cfa7804794b9b54cad4c3facc8a98cf85fd))
### Bug Fixes
* file pipeline status reporting and websocket updates ([#589](https://github.com/Monadical-SAS/reflector/issues/589)) ([9dfd769](https://github.com/Monadical-SAS/reflector/commit/9dfd76996f851cc52be54feea078adbc0816dc57))
* Igor/evaluation ([#575](https://github.com/Monadical-SAS/reflector/issues/575)) ([124ce03](https://github.com/Monadical-SAS/reflector/commit/124ce03bf86044c18313d27228a25da4bc20c9c5))
* optimize parakeet transcription batching algorithm ([#577](https://github.com/Monadical-SAS/reflector/issues/577)) ([7030e0f](https://github.com/Monadical-SAS/reflector/commit/7030e0f23649a8cf6c1eb6d5889684a41ce849ec))
## [0.7.3](https://github.com/Monadical-SAS/reflector/compare/v0.7.2...v0.7.3) (2025-08-22)
### Bug Fixes
* cleaned repo, and get git-leaks clean ([359280d](https://github.com/Monadical-SAS/reflector/commit/359280dd340433ba4402ed69034094884c825e67))
* restore previous behavior on live pipeline + audio downscaler ([#561](https://github.com/Monadical-SAS/reflector/issues/561)) ([9265d20](https://github.com/Monadical-SAS/reflector/commit/9265d201b590d23c628c5f19251b70f473859043))
## [0.7.2](https://github.com/Monadical-SAS/reflector/compare/v0.7.1...v0.7.2) (2025-08-21)
### Bug Fixes
* docker image not loading libgomp.so.1 for torch ([#560](https://github.com/Monadical-SAS/reflector/issues/560)) ([773fccd](https://github.com/Monadical-SAS/reflector/commit/773fccd93e887c3493abc2e4a4864dddce610177))
* include shared rooms to search ([#558](https://github.com/Monadical-SAS/reflector/issues/558)) ([499eced](https://github.com/Monadical-SAS/reflector/commit/499eced3360b84fb3a90e1c8a3b554290d21adc2))
## [0.7.1](https://github.com/Monadical-SAS/reflector/compare/v0.7.0...v0.7.1) (2025-08-21)
### Bug Fixes
* webvtt db null expectation mismatch ([#556](https://github.com/Monadical-SAS/reflector/issues/556)) ([e67ad1a](https://github.com/Monadical-SAS/reflector/commit/e67ad1a4a2054467bfeb1e0258fbac5868aaaf21))
## [0.7.0](https://github.com/Monadical-SAS/reflector/compare/v0.6.1...v0.7.0) (2025-08-21)
### Features
* delete recording with transcript ([#547](https://github.com/Monadical-SAS/reflector/issues/547)) ([99cc984](https://github.com/Monadical-SAS/reflector/commit/99cc9840b3f5de01e0adfbfae93234042d706d13))
* pipeline improvement with file processing, parakeet, silero-vad ([#540](https://github.com/Monadical-SAS/reflector/issues/540)) ([bcc29c9](https://github.com/Monadical-SAS/reflector/commit/bcc29c9e0050ae215f89d460e9d645aaf6a5e486))
* postgresql migration and removal of sqlite in pytest ([#546](https://github.com/Monadical-SAS/reflector/issues/546)) ([cd1990f](https://github.com/Monadical-SAS/reflector/commit/cd1990f8f0fe1503ef5069512f33777a73a93d7f))
* search backend ([#537](https://github.com/Monadical-SAS/reflector/issues/537)) ([5f9b892](https://github.com/Monadical-SAS/reflector/commit/5f9b89260c9ef7f3c921319719467df22830453f))
* search frontend ([#551](https://github.com/Monadical-SAS/reflector/issues/551)) ([3657242](https://github.com/Monadical-SAS/reflector/commit/365724271ca6e615e3425125a69ae2b46ce39285))
### Bug Fixes
* evaluation cli event wrap ([#536](https://github.com/Monadical-SAS/reflector/issues/536)) ([941c3db](https://github.com/Monadical-SAS/reflector/commit/941c3db0bdacc7b61fea412f3746cc5a7cb67836))
* use structlog not logging ([#550](https://github.com/Monadical-SAS/reflector/issues/550)) ([27e2f81](https://github.com/Monadical-SAS/reflector/commit/27e2f81fda5232e53edc729d3e99c5ef03adbfe9))
## [0.6.1](https://github.com/Monadical-SAS/reflector/compare/v0.6.0...v0.6.1) (2025-08-06)

View File

@@ -1,497 +0,0 @@
# ICS Calendar Integration - Implementation Guide
## Overview
This document provides detailed implementation guidance for integrating ICS calendar feeds with Reflector rooms. Unlike CalDAV which requires complex authentication and protocol handling, ICS integration uses simple HTTP(S) fetching of calendar files.
## Key Differences from CalDAV Approach
| Aspect | CalDAV | ICS |
|--------|--------|-----|
| Protocol | WebDAV extension | HTTP/HTTPS GET |
| Authentication | Username/password, OAuth | Tokens embedded in URL |
| Data Access | Selective event queries | Full calendar download |
| Implementation | Complex (caldav library) | Simple (requests + icalendar) |
| Real-time Updates | Supported | Polling only |
| Write Access | Yes | No (read-only) |
## Technical Architecture
### 1. ICS Fetching Service
```python
# reflector/services/ics_sync.py
import requests
from icalendar import Calendar
from typing import List, Optional
from datetime import datetime, timedelta
class ICSFetchService:
def __init__(self):
self.session = requests.Session()
self.session.headers.update({'User-Agent': 'Reflector/1.0'})
def fetch_ics(self, url: str) -> str:
"""Fetch ICS file from URL (authentication via URL token if needed)."""
response = self.session.get(url, timeout=30)
response.raise_for_status()
return response.text
def parse_ics(self, ics_content: str) -> Calendar:
"""Parse ICS content into calendar object."""
return Calendar.from_ical(ics_content)
def extract_room_events(self, calendar: Calendar, room_url: str) -> List[dict]:
"""Extract events that match the room URL."""
events = []
for component in calendar.walk():
if component.name == "VEVENT":
# Check if event matches this room
if self._event_matches_room(component, room_url):
events.append(self._parse_event(component))
return events
def _event_matches_room(self, event, room_url: str) -> bool:
"""Check if event location or description contains room URL."""
location = str(event.get('LOCATION', ''))
description = str(event.get('DESCRIPTION', ''))
# Support various URL formats
patterns = [
room_url,
room_url.replace('https://', ''),
room_url.split('/')[-1], # Just room name
]
for pattern in patterns:
if pattern in location or pattern in description:
return True
return False
```
### 2. Database Schema
```sql
-- Modify room table
ALTER TABLE room ADD COLUMN ics_url TEXT; -- encrypted to protect embedded tokens
ALTER TABLE room ADD COLUMN ics_fetch_interval INTEGER DEFAULT 300; -- seconds
ALTER TABLE room ADD COLUMN ics_enabled BOOLEAN DEFAULT FALSE;
ALTER TABLE room ADD COLUMN ics_last_sync TIMESTAMP;
ALTER TABLE room ADD COLUMN ics_last_etag TEXT; -- for caching
-- Calendar events table
CREATE TABLE calendar_event (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
room_id UUID REFERENCES room(id) ON DELETE CASCADE,
external_id TEXT NOT NULL, -- ICS UID
title TEXT,
description TEXT,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL,
attendees JSONB,
location TEXT,
ics_raw_data TEXT, -- Store raw VEVENT for reference
last_synced TIMESTAMP DEFAULT NOW(),
is_deleted BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
UNIQUE(room_id, external_id)
);
-- Index for efficient queries
CREATE INDEX idx_calendar_event_room_start ON calendar_event(room_id, start_time);
CREATE INDEX idx_calendar_event_deleted ON calendar_event(is_deleted) WHERE NOT is_deleted;
```
### 3. Background Tasks
```python
# reflector/worker/tasks/ics_sync.py
from celery import shared_task
from datetime import datetime, timedelta
import hashlib
@shared_task
def sync_ics_calendars():
"""Sync all enabled ICS calendars based on their fetch intervals."""
rooms = Room.query.filter_by(ics_enabled=True).all()
for room in rooms:
# Check if it's time to sync based on fetch interval
if should_sync(room):
sync_room_calendar.delay(room.id)
@shared_task
def sync_room_calendar(room_id: str):
"""Sync calendar for a specific room."""
room = Room.query.get(room_id)
if not room or not room.ics_enabled:
return
try:
# Fetch ICS file (decrypt URL first)
service = ICSFetchService()
decrypted_url = decrypt_ics_url(room.ics_url)
ics_content = service.fetch_ics(decrypted_url)
# Check if content changed (using ETag or hash)
content_hash = hashlib.md5(ics_content.encode()).hexdigest()
if room.ics_last_etag == content_hash:
logger.info(f"No changes in ICS for room {room_id}")
return
# Parse and extract events
calendar = service.parse_ics(ics_content)
events = service.extract_room_events(calendar, room.url)
# Update database
sync_events_to_database(room_id, events)
# Update sync metadata
room.ics_last_sync = datetime.utcnow()
room.ics_last_etag = content_hash
db.session.commit()
except Exception as e:
logger.error(f"Failed to sync ICS for room {room_id}: {e}")
def should_sync(room) -> bool:
"""Check if room calendar should be synced."""
if not room.ics_last_sync:
return True
time_since_sync = datetime.utcnow() - room.ics_last_sync
return time_since_sync.total_seconds() >= room.ics_fetch_interval
```
### 4. Celery Beat Schedule
```python
# reflector/worker/celeryconfig.py
from celery.schedules import crontab
beat_schedule = {
'sync-ics-calendars': {
'task': 'reflector.worker.tasks.ics_sync.sync_ics_calendars',
'schedule': 60.0, # Check every minute which calendars need syncing
},
'pre-create-meetings': {
'task': 'reflector.worker.tasks.ics_sync.pre_create_calendar_meetings',
'schedule': 60.0, # Check every minute for upcoming meetings
},
}
```
## API Endpoints
### Room ICS Configuration
```python
# PATCH /v1/rooms/{room_id}
{
"ics_url": "https://calendar.google.com/calendar/ical/.../private-token/basic.ics",
"ics_fetch_interval": 300, # seconds
"ics_enabled": true
# URL will be encrypted in database to protect embedded tokens
}
```
### Manual Sync Trigger
```python
# POST /v1/rooms/{room_name}/ics/sync
# Response:
{
"status": "syncing",
"last_sync": "2024-01-15T10:30:00Z",
"events_found": 5
}
```
### ICS Status
```python
# GET /v1/rooms/{room_name}/ics/status
# Response:
{
"enabled": true,
"last_sync": "2024-01-15T10:30:00Z",
"next_sync": "2024-01-15T10:35:00Z",
"fetch_interval": 300,
"events_count": 12,
"upcoming_events": 3
}
```
## ICS Parsing Details
### Event Field Mapping
| ICS Field | Database Field | Notes |
|-----------|---------------|-------|
| UID | external_id | Unique identifier |
| SUMMARY | title | Event title |
| DESCRIPTION | description | Full description |
| DTSTART | start_time | Convert to UTC |
| DTEND | end_time | Convert to UTC |
| LOCATION | location | Check for room URL |
| ATTENDEE | attendees | Parse into JSON |
| ORGANIZER | attendees | Add as organizer |
| STATUS | (internal) | Filter cancelled events |
### Handling Recurring Events
```python
def expand_recurring_events(event, start_date, end_date):
"""Expand recurring events into individual occurrences."""
from dateutil.rrule import rrulestr
if 'RRULE' not in event:
return [event]
# Parse recurrence rule
rrule_str = event['RRULE'].to_ical().decode()
dtstart = event['DTSTART'].dt
# Generate occurrences
rrule = rrulestr(rrule_str, dtstart=dtstart)
occurrences = []
for dt in rrule.between(start_date, end_date):
# Clone event with new date
occurrence = event.copy()
occurrence['DTSTART'].dt = dt
if 'DTEND' in event:
duration = event['DTEND'].dt - event['DTSTART'].dt
occurrence['DTEND'].dt = dt + duration
# Unique ID for each occurrence
occurrence['UID'] = f"{event['UID']}_{dt.isoformat()}"
occurrences.append(occurrence)
return occurrences
```
### Timezone Handling
```python
def normalize_datetime(dt):
"""Convert various datetime formats to UTC."""
import pytz
from datetime import datetime
if hasattr(dt, 'dt'): # icalendar property
dt = dt.dt
if isinstance(dt, datetime):
if dt.tzinfo is None:
# Assume local timezone if naive
dt = pytz.timezone('UTC').localize(dt)
else:
# Convert to UTC
dt = dt.astimezone(pytz.UTC)
return dt
```
## Security Considerations
### 1. URL Validation
```python
def validate_ics_url(url: str) -> bool:
"""Validate ICS URL for security."""
from urllib.parse import urlparse
parsed = urlparse(url)
# Must be HTTPS in production
if not settings.DEBUG and parsed.scheme != 'https':
return False
# Prevent local file access
if parsed.scheme in ('file', 'ftp'):
return False
# Prevent internal network access
if is_internal_ip(parsed.hostname):
return False
return True
```
### 2. Rate Limiting
```python
# Implement per-room rate limiting
RATE_LIMITS = {
'min_fetch_interval': 60, # Minimum 1 minute between fetches
'max_requests_per_hour': 60, # Max 60 requests per hour per room
'max_file_size': 10 * 1024 * 1024, # Max 10MB ICS file
}
```
### 3. ICS URL Encryption
```python
from cryptography.fernet import Fernet
class URLEncryption:
def __init__(self):
self.cipher = Fernet(settings.ENCRYPTION_KEY)
def encrypt_url(self, url: str) -> str:
"""Encrypt ICS URL to protect embedded tokens."""
return self.cipher.encrypt(url.encode()).decode()
def decrypt_url(self, encrypted: str) -> str:
"""Decrypt ICS URL for fetching."""
return self.cipher.decrypt(encrypted.encode()).decode()
def mask_url(self, url: str) -> str:
"""Mask sensitive parts of URL for display."""
from urllib.parse import urlparse, urlunparse
parsed = urlparse(url)
# Keep scheme, host, and path structure but mask tokens
if '/private-' in parsed.path:
# Google Calendar format
parts = parsed.path.split('/private-')
masked_path = parts[0] + '/private-***' + parts[1].split('/')[-1]
elif 'token=' in url:
# Query parameter token
masked_path = parsed.path
parsed = parsed._replace(query='token=***')
else:
# Generic masking of path segments that look like tokens
import re
masked_path = re.sub(r'/[a-zA-Z0-9]{20,}/', '/***/', parsed.path)
return urlunparse(parsed._replace(path=masked_path))
```
## Testing Strategy
### 1. Unit Tests
```python
# tests/test_ics_sync.py
def test_ics_parsing():
"""Test ICS file parsing."""
ics_content = """BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
UID:test-123
SUMMARY:Team Meeting
LOCATION:https://reflector.monadical.com/engineering
DTSTART:20240115T100000Z
DTEND:20240115T110000Z
END:VEVENT
END:VCALENDAR"""
service = ICSFetchService()
calendar = service.parse_ics(ics_content)
events = service.extract_room_events(
calendar,
"https://reflector.monadical.com/engineering"
)
assert len(events) == 1
assert events[0]['title'] == 'Team Meeting'
```
### 2. Integration Tests
```python
def test_full_sync_flow():
"""Test complete sync workflow."""
# Create room with ICS URL (encrypt URL to protect tokens)
encryption = URLEncryption()
room = Room(
name="test-room",
ics_url=encryption.encrypt_url("https://example.com/calendar.ics?token=secret"),
ics_enabled=True
)
# Mock ICS fetch
with patch('requests.get') as mock_get:
mock_get.return_value.text = sample_ics_content
# Run sync
sync_room_calendar(room.id)
# Verify events created
events = CalendarEvent.query.filter_by(room_id=room.id).all()
assert len(events) > 0
```
## Common ICS Provider Configurations
### Google Calendar
- URL Format: `https://calendar.google.com/calendar/ical/{calendar_id}/private-{token}/basic.ics`
- Authentication via token embedded in URL
- Updates every 3-8 hours by default
### Outlook/Office 365
- URL Format: `https://outlook.office365.com/owa/calendar/{id}/calendar.ics`
- May include token in URL path or query parameters
- Real-time updates
### Apple iCloud
- URL Format: `webcal://p{XX}-caldav.icloud.com/published/2/{token}`
- Convert webcal:// to https://
- Token embedded in URL path
- Public calendars only
### Nextcloud/ownCloud
- URL Format: `https://cloud.example.com/remote.php/dav/public-calendars/{token}`
- Token embedded in URL path
- Configurable update frequency
## Migration from CalDAV
If migrating from an existing CalDAV implementation:
1. **Database Migration**: Rename fields from `caldav_*` to `ics_*`
2. **URL Conversion**: Most CalDAV servers provide ICS export endpoints
3. **Authentication**: Convert from username/password to URL-embedded tokens
4. **Remove Dependencies**: Uninstall caldav library, add icalendar
5. **Update Background Tasks**: Replace CalDAV sync with ICS fetch
## Performance Optimizations
1. **Caching**: Use ETag/Last-Modified headers to avoid refetching unchanged calendars
2. **Incremental Sync**: Store last sync timestamp, only process new/modified events
3. **Batch Processing**: Process multiple room calendars in parallel
4. **Connection Pooling**: Reuse HTTP connections for multiple requests
5. **Compression**: Support gzip encoding for large ICS files
## Monitoring and Debugging
### Metrics to Track
- Sync success/failure rate per room
- Average sync duration
- ICS file sizes
- Number of events processed
- Failed event matches
### Debug Logging
```python
logger.debug(f"Fetching ICS from {room.ics_url}")
logger.debug(f"ICS content size: {len(ics_content)} bytes")
logger.debug(f"Found {len(events)} matching events")
logger.debug(f"Event UIDs: {[e['external_id'] for e in events]}")
```
### Common Issues
1. **SSL Certificate Errors**: Add certificate validation options
2. **Timeout Issues**: Increase timeout for large calendars
3. **Encoding Problems**: Handle various character encodings
4. **Timezone Mismatches**: Always convert to UTC
5. **Memory Issues**: Stream large ICS files instead of loading entirely

337
PLAN.md
View File

@@ -1,337 +0,0 @@
# ICS Calendar Integration Plan
## Core Concept
ICS calendar URLs are attached to rooms (not users) to enable automatic meeting tracking and management through periodic fetching of calendar data.
## Database Schema Updates
### 1. Add ICS configuration to rooms
- Add `ics_url` field to room table (URL to .ics file, may include auth token)
- Add `ics_fetch_interval` field to room table (default: 5 minutes, configurable)
- Add `ics_enabled` boolean field to room table
- Add `ics_last_sync` timestamp field to room table
### 2. Create calendar_events table
- `id` - UUID primary key
- `room_id` - Foreign key to room
- `external_id` - ICS event UID
- `title` - Event title
- `description` - Event description
- `start_time` - Event start timestamp
- `end_time` - Event end timestamp
- `attendees` - JSON field with attendee list and status
- `location` - Meeting location (should contain room name)
- `last_synced` - Last sync timestamp
- `is_deleted` - Boolean flag for soft delete (preserve past events)
- `ics_raw_data` - TEXT field to store raw VEVENT data for reference
### 3. Update meeting table
- Add `calendar_event_id` - Foreign key to calendar_events
- Add `calendar_metadata` - JSON field for additional calendar data
- Remove unique constraint on room_id + active status (allow multiple active meetings per room)
## Backend Implementation
### 1. ICS Sync Service
- Create background task that runs based on room's `ics_fetch_interval` (default: 5 minutes)
- For each room with ICS enabled, fetch the .ics file via HTTP/HTTPS
- Parse ICS file using icalendar library
- Extract VEVENT components and filter events looking for room URL (e.g., "https://reflector.monadical.com/max")
- Store matching events in calendar_events table
- Mark events as "upcoming" if start_time is within next 30 minutes
- Pre-create Whereby meetings 1 minute before start (ensures no delay when users join)
- Soft-delete future events that were removed from calendar (set is_deleted=true)
- Never delete past events (preserve for historical record)
- Support authenticated ICS feeds via tokens embedded in URL
### 2. Meeting Management Updates
- Allow multiple active meetings per room
- Pre-create meeting record 1 minute before calendar event starts (ensures meeting is ready)
- Link meeting to calendar_event for metadata
- Keep meeting active for 15 minutes after last participant leaves (grace period)
- Don't auto-close if new participant joins within grace period
### 3. API Endpoints
- `GET /v1/rooms/{room_name}/meetings` - List all active and upcoming meetings for a room
- Returns filtered data based on user role (owner vs participant)
- `GET /v1/rooms/{room_name}/meetings/upcoming` - List upcoming meetings (next 30 min)
- Returns filtered data based on user role
- `POST /v1/rooms/{room_name}/meetings/{meeting_id}/join` - Join specific meeting
- `PATCH /v1/rooms/{room_id}` - Update room settings (including ICS configuration)
- ICS fields only visible/editable by room owner
- `POST /v1/rooms/{room_name}/ics/sync` - Trigger manual ICS sync
- Only accessible by room owner
- `GET /v1/rooms/{room_name}/ics/status` - Get ICS sync status and last fetch time
- Only accessible by room owner
## Frontend Implementation
### 1. Room Settings Page
- Add ICS configuration section
- Field for ICS URL (e.g., Google Calendar private URL, Outlook ICS export)
- Field for fetch interval (dropdown: 1 min, 5 min, 10 min, 30 min, 1 hour)
- Test connection button (validates ICS file can be fetched and parsed)
- Manual sync button
- Show last sync time and next scheduled sync
### 2. Meeting Selection Page (New)
- Show when accessing `/room/{room_name}`
- **Host view** (room owner):
- Full calendar event details
- Meeting title and description
- Complete attendee list with RSVP status
- Number of current participants
- Duration (how long it's been running)
- **Participant view** (non-owners):
- Meeting title only
- Date and time
- Number of current participants
- Duration (how long it's been running)
- No attendee list or description (privacy)
- Display upcoming meetings (visible 30min before):
- Show countdown to start
- Can click to join early → redirected to waiting page
- Waiting page shows countdown until meeting starts
- Meeting pre-created by background task (ready when users arrive)
- Option to create unscheduled meeting (uses existing flow)
### 3. Meeting Room Updates
- Show calendar metadata in meeting info
- Display invited attendees vs actual participants
- Show meeting title from calendar event
## Meeting Lifecycle
### 1. Meeting Creation
- Automatic: Pre-created 1 minute before calendar event starts (ensures Whereby room is ready)
- Manual: User creates unscheduled meeting (existing `/rooms/{room_name}/meeting` endpoint)
- Background task handles pre-creation to avoid delays when users join
### 2. Meeting Join Rules
- Can join active meetings immediately
- Can see upcoming meetings 30 minutes before start
- Can click to join upcoming meetings early → sent to waiting page
- Waiting page automatically transitions to meeting at scheduled time
- Unscheduled meetings always joinable (current behavior)
### 3. Meeting Closure Rules
- All meetings: 15-minute grace period after last participant leaves
- If participant rejoins within grace period, keep meeting active
- Calendar meetings: Force close 30 minutes after scheduled end time
- Unscheduled meetings: Keep active for 8 hours (current behavior)
## ICS Parsing Logic
### 1. Event Matching
- Parse ICS file using Python icalendar library
- Iterate through VEVENT components
- Check LOCATION field for full FQDN URL (e.g., "https://reflector.monadical.com/max")
- Check DESCRIPTION for room URL or mention
- Support multiple formats:
- Full URL: "https://reflector.monadical.com/max"
- With /room path: "https://reflector.monadical.com/room/max"
- Partial paths: "room/max", "/max room"
### 2. Attendee Extraction
- Parse ATTENDEE properties from VEVENT
- Extract email (MAILTO), name (CN parameter), and RSVP status (PARTSTAT)
- Store as JSON in calendar_events.attendees
### 3. Sync Strategy
- Fetch complete ICS file (contains all events)
- Filter events from (now - 1 hour) to (now + 24 hours) for processing
- Update existing events if LAST-MODIFIED or SEQUENCE changed
- Delete future events that no longer exist in ICS (start_time > now)
- Keep past events for historical record (never delete if start_time < now)
- Handle recurring events (RRULE) - expand to individual instances
- Track deleted calendar events to clean up future meetings
- Cache ICS file hash to detect changes and skip unnecessary processing
## Security Considerations
### 1. ICS URL Security
- ICS URLs may contain authentication tokens (e.g., Google Calendar private URLs)
- Store full ICS URLs encrypted using Fernet to protect embedded tokens
- Validate ICS URLs (must be HTTPS for production)
- Never expose full ICS URLs in API responses (return masked version)
- Rate limit ICS fetching to prevent abuse
### 2. Room Access
- Only room owner can configure ICS URL
- ICS URL shown as masked version to room owner (hides embedded tokens)
- ICS settings not visible to other users
- Meeting list visible to all room participants
- ICS fetch logs only visible to room owner
### 3. Meeting Privacy
- Full calendar details visible only to room owner
- Participants see limited info: title, date/time only
- Attendee list and description hidden from non-owners
- Meeting titles visible in room listing to all
## Implementation Phases
### Phase 1: Database and ICS Setup (Week 1) ✅ COMPLETED (2025-08-18)
1. ✅ Created database migrations for ICS fields and calendar_events table
- Added ics_url, ics_fetch_interval, ics_enabled, ics_last_sync, ics_last_etag to room table
- Created calendar_event table with ics_uid (instead of external_id) and proper typing
- Added calendar_event_id and calendar_metadata (JSONB) to meeting table
- Removed server_default from datetime fields for consistency
2. ✅ Installed icalendar Python library for ICS parsing
- Added icalendar>=6.0.0 to dependencies
- No encryption needed - ICS URLs are read-only
3. ✅ Built ICS fetch and sync service
- Simple HTTP fetching without unnecessary validation
- Proper TypedDict typing for event data structures
- Supports any standard ICS format
- Event matching on full room URL only
4. ✅ API endpoints for ICS configuration
- Room model updated to support ICS fields via existing PATCH endpoint
- POST /v1/rooms/{room_name}/ics/sync - Trigger manual sync (owner only)
- GET /v1/rooms/{room_name}/ics/status - Get sync status (owner only)
- GET /v1/rooms/{room_name}/meetings - List meetings with privacy controls
- GET /v1/rooms/{room_name}/meetings/upcoming - List upcoming meetings
5. ✅ Celery background tasks for periodic sync
- sync_room_ics - Sync individual room calendar
- sync_all_ics_calendars - Check all rooms and queue sync based on fetch intervals
- pre_create_upcoming_meetings - Pre-create Whereby meetings 1 minute before start
- Tasks scheduled in beat schedule (every minute for checking, respects individual intervals)
6. ✅ Tests written and passing
- 6 tests for Room ICS fields
- 7 tests for CalendarEvent model
- 7 tests for ICS sync service
- 11 tests for API endpoints
- 6 tests for background tasks
- All 31 ICS-related tests passing
### Phase 2: Meeting Management (Week 2) ✅ COMPLETED (2025-08-19)
1. ✅ Updated meeting lifecycle logic with grace period support
- 15-minute grace period after last participant leaves
- Automatic reactivation when participants rejoin
- Force close calendar meetings 30 minutes after scheduled end
2. ✅ Support multiple active meetings per room
- Removed unique constraint on active meetings
- Added get_all_active_for_room() method
- Added get_active_by_calendar_event() method
3. ✅ Implemented grace period logic
- Added last_participant_left_at and grace_period_minutes fields
- Process meetings task handles grace period checking
- Whereby webhooks clear grace period on participant join
4. ✅ Link meetings to calendar events
- Pre-created meetings properly linked via calendar_event_id
- Calendar metadata stored with meeting
- API endpoints for listing and joining specific meetings
### Phase 3: Frontend Meeting Selection (Week 3)
1. Build meeting selection page
2. Show active and upcoming meetings
3. Implement waiting page for early joiners
4. Add automatic transition from waiting to meeting
5. Support unscheduled meeting creation
### Phase 4: Calendar Integration UI (Week 4)
1. Add ICS settings to room configuration
2. Display calendar metadata in meetings
3. Show attendee information
4. Add sync status indicators
5. Show fetch interval and next sync time
## Success Metrics
- Zero merged meetings from consecutive calendar events
- Successful ICS sync from major providers (Google Calendar, Outlook, Apple Calendar, Nextcloud)
- Meeting join accuracy: correct meeting 100% of the time
- Grace period prevents 90% of accidental meeting closures
- Configurable fetch intervals reduce unnecessary API calls
## Design Decisions
1. **ICS attached to room, not user** - Prevents duplicate meetings from multiple calendars
2. **Multiple active meetings per room** - Supported with meeting selection page
3. **Grace period for rejoining** - 15 minutes after last participant leaves
4. **Upcoming meeting visibility** - Show 30 minutes before, join only on time
5. **Calendar data storage** - Attached to meeting record for full context
6. **No "ad-hoc" meetings** - Use existing meeting creation flow (unscheduled meetings)
7. **ICS configuration via room PATCH** - Reuse existing room configuration endpoint
8. **Event deletion handling** - Soft-delete future events, preserve past meetings
9. **Configurable fetch interval** - Balance between freshness and server load
10. **ICS over CalDAV** - Simpler implementation, wider compatibility, no complex auth
## Phase 2 Implementation Files
### Database Migrations
- `/server/migrations/versions/6025e9b2bef2_remove_one_active_meeting_per_room_.py` - Remove unique constraint
- `/server/migrations/versions/d4a1c446458c_add_grace_period_fields_to_meeting.py` - Add grace period fields
### Updated Models
- `/server/reflector/db/meetings.py` - Added grace period fields and new query methods
### Updated Services
- `/server/reflector/worker/process.py` - Enhanced with grace period logic and multiple meeting support
### Updated API
- `/server/reflector/views/rooms.py` - Added endpoints for listing active meetings and joining specific meetings
- `/server/reflector/views/whereby.py` - Clear grace period on participant join
### Tests
- `/server/tests/test_multiple_active_meetings.py` - Comprehensive tests for Phase 2 features (5 tests)
## Phase 1 Implementation Files Created
### Database Models
- `/server/reflector/db/rooms.py` - Updated with ICS fields (url, fetch_interval, enabled, last_sync, etag)
- `/server/reflector/db/calendar_events.py` - New CalendarEvent model with ics_uid and proper typing
- `/server/reflector/db/meetings.py` - Updated with calendar_event_id and calendar_metadata (JSONB)
### Services
- `/server/reflector/services/ics_sync.py` - ICS fetching and parsing with TypedDict for proper typing
### API Endpoints
- `/server/reflector/views/rooms.py` - Added ICS management endpoints with privacy controls
### Background Tasks
- `/server/reflector/worker/ics_sync.py` - Celery tasks for automatic periodic sync
- `/server/reflector/worker/app.py` - Updated beat schedule for ICS tasks
### Tests
- `/server/tests/test_room_ics.py` - Room model ICS fields tests (6 tests)
- `/server/tests/test_calendar_event.py` - CalendarEvent model tests (7 tests)
- `/server/tests/test_ics_sync.py` - ICS sync service tests (7 tests)
- `/server/tests/test_room_ics_api.py` - API endpoint tests (11 tests)
- `/server/tests/test_ics_background_tasks.py` - Background task tests (6 tests)
### Key Design Decisions
- No encryption needed - ICS URLs are read-only access
- Using ics_uid instead of external_id for clarity
- Proper TypedDict typing for event data structures
- Removed unnecessary URL validation and webcal handling
- calendar_metadata in meetings stores flexible calendar data (organizer, recurrence, etc)
- Background tasks query all rooms directly to avoid filtering issues
- Sync intervals respected per-room configuration
## Implementation Approach
### ICS Fetching vs CalDAV
- **ICS Benefits**:
- Simpler implementation (HTTP GET vs CalDAV protocol)
- Wider compatibility (all calendar apps can export ICS)
- No authentication complexity (simple URL with optional token)
- Easier debugging (ICS is plain text)
- Lower server requirements (no CalDAV library dependencies)
### Supported Calendar Providers
1. **Google Calendar**: Private ICS URL from calendar settings
2. **Outlook/Office 365**: ICS export URL from calendar sharing
3. **Apple Calendar**: Published calendar ICS URL
4. **Nextcloud**: Public/private calendar ICS export
5. **Any CalDAV server**: Via ICS export endpoint
### ICS URL Examples
- Google: `https://calendar.google.com/calendar/ical/{calendar_id}/private-{token}/basic.ics`
- Outlook: `https://outlook.live.com/owa/calendar/{id}/calendar.ics`
- Custom: `https://example.com/calendars/room-schedule.ics`
### Fetch Interval Configuration
- 1 minute: For critical/high-activity rooms
- 5 minutes (default): Balance of freshness and efficiency
- 10 minutes: Standard meeting rooms
- 30 minutes: Low-activity rooms
- 1 hour: Rarely-used rooms or stable schedules

View File

@@ -1,43 +1,60 @@
<div align="center">
<img width="100" alt="image" src="https://github.com/user-attachments/assets/66fb367b-2c89-4516-9912-f47ac59c6a7f"/>
# Reflector
Reflector Audio Management and Analysis is a cutting-edge web application under development by Monadical. It utilizes AI to record meetings, providing a permanent record with transcripts, translations, and automated summaries.
Reflector is an AI-powered audio transcription and meeting analysis platform that provides real-time transcription, speaker diarization, translation and summarization for audio content and live meetings. It works 100% with local models (whisper/parakeet, pyannote, seamless-m4t, and your local llm like phi-4).
[![Tests](https://github.com/monadical-sas/reflector/actions/workflows/pytests.yml/badge.svg?branch=main&event=push)](https://github.com/monadical-sas/reflector/actions/workflows/pytests.yml)
[![Tests](https://github.com/monadical-sas/reflector/actions/workflows/test_server.yml/badge.svg?branch=main&event=push)](https://github.com/monadical-sas/reflector/actions/workflows/test_server.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://opensource.org/licenses/MIT)
</div>
## Screenshots
</div>
<table>
<tr>
<td>
<a href="https://github.com/user-attachments/assets/3a976930-56c1-47ef-8c76-55d3864309e3">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/3a976930-56c1-47ef-8c76-55d3864309e3" />
<a href="https://github.com/user-attachments/assets/21f5597c-2930-4899-a154-f7bd61a59e97">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/21f5597c-2930-4899-a154-f7bd61a59e97" />
</a>
</td>
<td>
<a href="https://github.com/user-attachments/assets/bfe3bde3-08af-4426-a9a1-11ad5cd63b33">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/bfe3bde3-08af-4426-a9a1-11ad5cd63b33" />
<a href="https://github.com/user-attachments/assets/f6b9399a-5e51-4bae-b807-59128d0a940c">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/f6b9399a-5e51-4bae-b807-59128d0a940c" />
</a>
</td>
<td>
<a href="https://github.com/user-attachments/assets/7b60c9d0-efe4-474f-a27b-ea13bd0fabdc">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/7b60c9d0-efe4-474f-a27b-ea13bd0fabdc" />
<a href="https://github.com/user-attachments/assets/a42ce460-c1fd-4489-a995-270516193897">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/a42ce460-c1fd-4489-a995-270516193897" />
</a>
</td>
<td>
<a href="https://github.com/user-attachments/assets/21929f6d-c309-42fe-9c11-f1299e50fbd4">
<img width="700" alt="image" src="https://github.com/user-attachments/assets/21929f6d-c309-42fe-9c11-f1299e50fbd4" />
</a>
</td>
</tr>
</table>
## What is Reflector?
Reflector is a web application that utilizes local models to process audio content, providing:
- **Real-time Transcription**: Convert speech to text using [Whisper](https://github.com/openai/whisper) (multi-language) or [Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) (English) models
- **Speaker Diarization**: Identify and label different speakers using [Pyannote](https://github.com/pyannote/pyannote-audio) 3.1
- **Live Translation**: Translate audio content in real-time to many languages with [Facebook Seamless-M4T](https://github.com/facebookresearch/seamless_communication)
- **Topic Detection & Summarization**: Extract key topics and generate concise summaries using LLMs
- **Meeting Recording**: Create permanent records of meetings with searchable transcripts
Currently we provide [modal.com](https://modal.com/) gpu template to deploy.
## Background
The project architecture consists of three primary components:
- **Front-End**: NextJS React project hosted on Vercel, located in `www/`.
- **Back-End**: Python server that offers an API and data persistence, found in `server/`.
- **GPU implementation**: Providing services such as speech-to-text transcription, topic generation, automated summaries, and translations. Most reliable option is Modal deployment
- **Front-End**: NextJS React project hosted on Vercel, located in `www/`.
- **GPU implementation**: Providing services such as speech-to-text transcription, topic generation, automated summaries, and translations.
It also uses authentik for authentication if activated, and Vercel for deployment and configuration of the front-end.
It also uses authentik for authentication if activated.
## Contribution Guidelines
@@ -72,6 +89,8 @@ Note: We currently do not have instructions for Windows users.
## Installation
*Note: we're working toward better installation, theses instructions are not accurate for now*
### Frontend
Start with `cd www`.

3
server/.gitignore vendored
View File

@@ -176,7 +176,8 @@ artefacts/
audio_*.wav
# ignore local database
reflector.sqlite3
*.sqlite3
*.db
data/
dump.rdb

View File

@@ -1,7 +1,8 @@
FROM python:3.12-slim
ENV PYTHONUNBUFFERED=1 \
UV_LINK_MODE=copy
UV_LINK_MODE=copy \
UV_NO_CACHE=1
# builder install base dependencies
WORKDIR /tmp
@@ -13,8 +14,8 @@ ENV PATH="/root/.local/bin/:$PATH"
# install application dependencies
RUN mkdir -p /app
WORKDIR /app
COPY pyproject.toml uv.lock /app/
RUN touch README.md && env uv sync --compile-bytecode --locked
COPY pyproject.toml uv.lock README.md /app/
RUN uv sync --compile-bytecode --locked
# pre-download nltk packages
RUN uv run python -c "import nltk; nltk.download('punkt_tab'); nltk.download('averaged_perceptron_tagger_eng')"
@@ -26,4 +27,15 @@ COPY migrations /app/migrations
COPY reflector /app/reflector
WORKDIR /app
# Create symlink for libgomp if it doesn't exist (for ARM64 compatibility)
RUN if [ "$(uname -m)" = "aarch64" ] && [ ! -f /usr/lib/libgomp.so.1 ]; then \
LIBGOMP_PATH=$(find /app/.venv/lib -path "*/torch.libs/libgomp*.so.*" 2>/dev/null | head -n1); \
if [ -n "$LIBGOMP_PATH" ]; then \
ln -sf "$LIBGOMP_PATH" /usr/lib/libgomp.so.1; \
fi \
fi
# Pre-check just to make sure the image will not fail
RUN uv run python -c "import silero_vad.model"
CMD ["./runserver.sh"]

View File

@@ -40,3 +40,5 @@ uv run python -c "from reflector.pipelines.main_live_pipeline import task_pipeli
```bash
uv run python -c "from reflector.pipelines.main_live_pipeline import pipeline_post; pipeline_post(transcript_id='TRANSCRIPT_ID')"
```
.

View File

@@ -0,0 +1,95 @@
# Data Retention and Cleanup
## Overview
For public instances of Reflector, a data retention policy is automatically enforced to delete anonymous user data after a configurable period (default: 7 days). This ensures compliance with privacy expectations and prevents unbounded storage growth.
## Configuration
### Environment Variables
- `PUBLIC_MODE` (bool): Must be set to `true` to enable automatic cleanup
- `PUBLIC_DATA_RETENTION_DAYS` (int): Number of days to retain anonymous data (default: 7)
### What Gets Deleted
When data reaches the retention period, the following items are automatically removed:
1. **Transcripts** from anonymous users (where `user_id` is NULL):
- Database records
- Local files (audio.wav, audio.mp3, audio.json waveform)
- Storage files (cloud storage if configured)
## Automatic Cleanup
### Celery Beat Schedule
When `PUBLIC_MODE=true`, a Celery beat task runs daily at 3 AM to clean up old data:
```python
# Automatically scheduled when PUBLIC_MODE=true
"cleanup_old_public_data": {
"task": "reflector.worker.cleanup.cleanup_old_public_data",
"schedule": crontab(hour=3, minute=0), # Daily at 3 AM
}
```
### Running the Worker
Ensure both Celery worker and beat scheduler are running:
```bash
# Start Celery worker
uv run celery -A reflector.worker.app worker --loglevel=info
# Start Celery beat scheduler (in another terminal)
uv run celery -A reflector.worker.app beat
```
## Manual Cleanup
For testing or manual intervention, use the cleanup tool:
```bash
# Delete data older than 7 days (default)
uv run python -m reflector.tools.cleanup_old_data
# Delete data older than 30 days
uv run python -m reflector.tools.cleanup_old_data --days 30
```
Note: The manual tool uses the same implementation as the Celery worker task to ensure consistency.
## Important Notes
1. **User Data Deletion**: Only anonymous data (where `user_id` is NULL) is deleted. Authenticated user data is preserved.
2. **Storage Cleanup**: The system properly cleans up both local files and cloud storage when configured.
3. **Error Handling**: If individual deletions fail, the cleanup continues and logs errors. Failed deletions are reported in the task output.
4. **Public Instance Only**: The automatic cleanup task only runs when `PUBLIC_MODE=true` to prevent accidental data loss in private deployments.
## Testing
Run the cleanup tests:
```bash
uv run pytest tests/test_cleanup.py -v
```
## Monitoring
Check Celery logs for cleanup task execution:
```bash
# Look for cleanup task logs
grep "cleanup_old_public_data" celery.log
grep "Starting cleanup of old public data" celery.log
```
Task statistics are logged after each run:
- Number of transcripts deleted
- Number of meetings deleted
- Number of orphaned recordings deleted
- Any errors encountered

212
server/docs/webhook.md Normal file
View File

@@ -0,0 +1,212 @@
# Reflector Webhook Documentation
## Overview
Reflector supports webhook notifications to notify external systems when transcript processing is completed. Webhooks can be configured per room and are triggered automatically after a transcript is successfully processed.
## Configuration
Webhooks are configured at the room level with two fields:
- `webhook_url`: The HTTPS endpoint to receive webhook notifications
- `webhook_secret`: Optional secret key for HMAC signature verification (auto-generated if not provided)
## Events
### `transcript.completed`
Triggered when a transcript has been fully processed, including transcription, diarization, summarization, and topic detection.
### `test`
A test event that can be triggered manually to verify webhook configuration.
## Webhook Request Format
### Headers
All webhook requests include the following headers:
| Header | Description | Example |
|--------|-------------|---------|
| `Content-Type` | Always `application/json` | `application/json` |
| `User-Agent` | Identifies Reflector as the source | `Reflector-Webhook/1.0` |
| `X-Webhook-Event` | The event type | `transcript.completed` or `test` |
| `X-Webhook-Retry` | Current retry attempt number | `0`, `1`, `2`... |
| `X-Webhook-Signature` | HMAC signature (if secret configured) | `t=1735306800,v1=abc123...` |
### Signature Verification
If a webhook secret is configured, Reflector includes an HMAC-SHA256 signature in the `X-Webhook-Signature` header to verify the webhook authenticity.
The signature format is: `t={timestamp},v1={signature}`
To verify the signature:
1. Extract the timestamp and signature from the header
2. Create the signed payload: `{timestamp}.{request_body}`
3. Compute HMAC-SHA256 of the signed payload using your webhook secret
4. Compare the computed signature with the received signature
Example verification (Python):
```python
import hmac
import hashlib
def verify_webhook_signature(payload: bytes, signature_header: str, secret: str) -> bool:
# Parse header: "t=1735306800,v1=abc123..."
parts = dict(part.split("=") for part in signature_header.split(","))
timestamp = parts["t"]
received_signature = parts["v1"]
# Create signed payload
signed_payload = f"{timestamp}.{payload.decode('utf-8')}"
# Compute expected signature
expected_signature = hmac.new(
secret.encode("utf-8"),
signed_payload.encode("utf-8"),
hashlib.sha256
).hexdigest()
# Compare signatures
return hmac.compare_digest(expected_signature, received_signature)
```
## Event Payloads
### `transcript.completed` Event
This event includes a convenient URL for accessing the transcript:
- `frontend_url`: Direct link to view the transcript in the web interface
```json
{
"event": "transcript.completed",
"event_id": "transcript.completed-abc-123-def-456",
"timestamp": "2025-08-27T12:34:56.789012Z",
"transcript": {
"id": "abc-123-def-456",
"room_id": "room-789",
"created_at": "2025-08-27T12:00:00Z",
"duration": 1800.5,
"title": "Q3 Product Planning Meeting",
"short_summary": "Team discussed Q3 product roadmap, prioritizing mobile app features and API improvements.",
"long_summary": "The product team met to finalize the Q3 roadmap. Key decisions included...",
"webvtt": "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\n<v Speaker 1>Welcome everyone to today's meeting...",
"topics": [
{
"title": "Introduction and Agenda",
"summary": "Meeting kickoff with agenda review",
"timestamp": 0.0,
"duration": 120.0,
"webvtt": "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\n<v Speaker 1>Welcome everyone..."
},
{
"title": "Mobile App Features Discussion",
"summary": "Team reviewed proposed mobile app features for Q3",
"timestamp": 120.0,
"duration": 600.0,
"webvtt": "WEBVTT\n\n00:02:00.000 --> 00:02:10.000\n<v Speaker 2>Let's talk about the mobile app..."
}
],
"participants": [
{
"id": "participant-1",
"name": "John Doe",
"speaker": "Speaker 1"
},
{
"id": "participant-2",
"name": "Jane Smith",
"speaker": "Speaker 2"
}
],
"source_language": "en",
"target_language": "en",
"status": "completed",
"frontend_url": "https://app.reflector.com/transcripts/abc-123-def-456"
},
"room": {
"id": "room-789",
"name": "Product Team Room"
}
}
```
### `test` Event
```json
{
"event": "test",
"event_id": "test.2025-08-27T12:34:56.789012Z",
"timestamp": "2025-08-27T12:34:56.789012Z",
"message": "This is a test webhook from Reflector",
"room": {
"id": "room-789",
"name": "Product Team Room"
}
}
```
## Retry Policy
Webhooks are delivered with automatic retry logic to handle transient failures. When a webhook delivery fails due to server errors or network issues, Reflector will automatically retry the delivery multiple times over an extended period.
### Retry Mechanism
Reflector implements an exponential backoff strategy for webhook retries:
- **Initial retry delay**: 60 seconds after the first failure
- **Exponential backoff**: Each subsequent retry waits approximately twice as long as the previous one
- **Maximum retry interval**: 1 hour (backoff is capped at this duration)
- **Maximum retry attempts**: 30 attempts total
- **Total retry duration**: Retries continue for approximately 24 hours
### How Retries Work
When a webhook fails, Reflector will:
1. Wait 60 seconds, then retry (attempt #1)
2. If it fails again, wait ~2 minutes, then retry (attempt #2)
3. Continue doubling the wait time up to a maximum of 1 hour between attempts
4. Keep retrying at 1-hour intervals until successful or 30 attempts are exhausted
The `X-Webhook-Retry` header indicates the current retry attempt number (0 for the initial attempt, 1 for first retry, etc.), allowing your endpoint to track retry attempts.
### Retry Behavior by HTTP Status Code
| Status Code | Behavior |
|-------------|----------|
| 2xx (Success) | No retry, webhook marked as delivered |
| 4xx (Client Error) | No retry, request is considered permanently failed |
| 5xx (Server Error) | Automatic retry with exponential backoff |
| Network/Timeout Error | Automatic retry with exponential backoff |
**Important Notes:**
- Webhooks timeout after 30 seconds. If your endpoint takes longer to respond, it will be considered a timeout error and retried.
- During the retry period (~24 hours), you may receive the same webhook multiple times if your endpoint experiences intermittent failures.
- There is no mechanism to manually retry failed webhooks after the retry period expires.
## Testing Webhooks
You can test your webhook configuration before processing transcripts:
```http
POST /v1/rooms/{room_id}/webhook/test
```
Response:
```json
{
"success": true,
"status_code": 200,
"message": "Webhook test successful",
"response_preview": "OK"
}
```
Or in case of failure:
```json
{
"success": false,
"error": "Webhook request timed out (10 seconds)"
}
```

View File

@@ -4,7 +4,8 @@ This repository hold an API for the GPU implementation of the Reflector API serv
and use [Modal.com](https://modal.com)
- `reflector_diarizer.py` - Diarization API
- `reflector_transcriber.py` - Transcription API
- `reflector_transcriber.py` - Transcription API (Whisper)
- `reflector_transcriber_parakeet.py` - Transcription API (NVIDIA Parakeet)
- `reflector_translator.py` - Translation API
## Modal.com deployment
@@ -19,6 +20,10 @@ $ modal deploy reflector_transcriber.py
...
└── 🔨 Created web => https://xxxx--reflector-transcriber-web.modal.run
$ modal deploy reflector_transcriber_parakeet.py
...
└── 🔨 Created web => https://xxxx--reflector-transcriber-parakeet-web.modal.run
$ modal deploy reflector_llm.py
...
└── 🔨 Created web => https://xxxx--reflector-llm-web.modal.run
@@ -68,6 +73,86 @@ Authorization: bearer <REFLECTOR_APIKEY>
### Transcription
#### Parakeet Transcriber (`reflector_transcriber_parakeet.py`)
NVIDIA Parakeet is a state-of-the-art ASR model optimized for real-time transcription with superior word-level timestamps.
**GPU Configuration:**
- **A10G GPU** - Used for `/v1/audio/transcriptions` endpoint (small files, live transcription)
- Higher concurrency (max_inputs=10)
- Optimized for multiple small audio files
- Supports batch processing for efficiency
- **L40S GPU** - Used for `/v1/audio/transcriptions-from-url` endpoint (large files)
- Lower concurrency but more powerful processing
- Optimized for single large audio files
- VAD-based chunking for long-form audio
##### `/v1/audio/transcriptions` - Small file transcription
**request** (multipart/form-data)
- `file` or `files[]` - audio file(s) to transcribe
- `model` - model name (default: `nvidia/parakeet-tdt-0.6b-v2`)
- `language` - language code (default: `en`)
- `batch` - whether to use batch processing for multiple files (default: `true`)
**response**
```json
{
"text": "transcribed text",
"words": [
{"word": "hello", "start": 0.0, "end": 0.5},
{"word": "world", "start": 0.5, "end": 1.0}
],
"filename": "audio.mp3"
}
```
For multiple files with batch=true:
```json
{
"results": [
{
"filename": "audio1.mp3",
"text": "transcribed text",
"words": [...]
},
{
"filename": "audio2.mp3",
"text": "transcribed text",
"words": [...]
}
]
}
```
##### `/v1/audio/transcriptions-from-url` - Large file transcription
**request** (application/json)
```json
{
"audio_file_url": "https://example.com/audio.mp3",
"model": "nvidia/parakeet-tdt-0.6b-v2",
"language": "en",
"timestamp_offset": 0.0
}
```
**response**
```json
{
"text": "transcribed text from large file",
"words": [
{"word": "hello", "start": 0.0, "end": 0.5},
{"word": "world", "start": 0.5, "end": 1.0}
]
}
```
**Supported file types:** mp3, mp4, mpeg, mpga, m4a, wav, webm
#### Whisper Transcriber (`reflector_transcriber.py`)
`POST /transcribe`
**request** (multipart/form-data)

View File

@@ -4,14 +4,80 @@ Reflector GPU backend - diarizer
"""
import os
import uuid
from typing import Mapping, NewType
from urllib.parse import urlparse
import modal.gpu
from modal import App, Image, Secret, asgi_app, enter, method
from pydantic import BaseModel
import modal
PYANNOTE_MODEL_NAME: str = "pyannote/speaker-diarization-3.1"
MODEL_DIR = "/root/diarization_models"
app = App(name="reflector-diarizer")
UPLOADS_PATH = "/uploads"
SUPPORTED_FILE_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"]
DiarizerUniqFilename = NewType("DiarizerUniqFilename", str)
AudioFileExtension = NewType("AudioFileExtension", str)
app = modal.App(name="reflector-diarizer")
# Volume for temporary file uploads
upload_volume = modal.Volume.from_name("diarizer-uploads", create_if_missing=True)
def detect_audio_format(url: str, headers: Mapping[str, str]) -> AudioFileExtension:
parsed_url = urlparse(url)
url_path = parsed_url.path
for ext in SUPPORTED_FILE_EXTENSIONS:
if url_path.lower().endswith(f".{ext}"):
return AudioFileExtension(ext)
content_type = headers.get("content-type", "").lower()
if "audio/mpeg" in content_type or "audio/mp3" in content_type:
return AudioFileExtension("mp3")
if "audio/wav" in content_type:
return AudioFileExtension("wav")
if "audio/mp4" in content_type:
return AudioFileExtension("mp4")
raise ValueError(
f"Unsupported audio format for URL: {url}. "
f"Supported extensions: {', '.join(SUPPORTED_FILE_EXTENSIONS)}"
)
def download_audio_to_volume(
audio_file_url: str,
) -> tuple[DiarizerUniqFilename, AudioFileExtension]:
import requests
from fastapi import HTTPException
print(f"Checking audio file at: {audio_file_url}")
response = requests.head(audio_file_url, allow_redirects=True)
if response.status_code == 404:
raise HTTPException(status_code=404, detail="Audio file not found")
print(f"Downloading audio file from: {audio_file_url}")
response = requests.get(audio_file_url, allow_redirects=True)
if response.status_code != 200:
print(f"Download failed with status {response.status_code}: {response.text}")
raise HTTPException(
status_code=response.status_code,
detail=f"Failed to download audio file: {response.status_code}",
)
audio_suffix = detect_audio_format(audio_file_url, response.headers)
unique_filename = DiarizerUniqFilename(f"{uuid.uuid4()}.{audio_suffix}")
file_path = f"{UPLOADS_PATH}/{unique_filename}"
print(f"Writing file to: {file_path} (size: {len(response.content)} bytes)")
with open(file_path, "wb") as f:
f.write(response.content)
upload_volume.commit()
print(f"File saved as: {unique_filename}")
return unique_filename, audio_suffix
def migrate_cache_llm():
@@ -39,7 +105,7 @@ def download_pyannote_audio():
diarizer_image = (
Image.debian_slim(python_version="3.10.8")
modal.Image.debian_slim(python_version="3.10.8")
.pip_install(
"pyannote.audio==3.1.0",
"requests",
@@ -55,7 +121,8 @@ diarizer_image = (
"hf-transfer",
)
.run_function(
download_pyannote_audio, secrets=[Secret.from_name("my-huggingface-secret")]
download_pyannote_audio,
secrets=[modal.Secret.from_name("hf_token")],
)
.run_function(migrate_cache_llm)
.env(
@@ -70,53 +137,60 @@ diarizer_image = (
@app.cls(
gpu=modal.gpu.A100(size="40GB"),
gpu="A100",
timeout=60 * 30,
scaledown_window=60,
allow_concurrent_inputs=1,
image=diarizer_image,
volumes={UPLOADS_PATH: upload_volume},
enable_memory_snapshot=True,
experimental_options={"enable_gpu_snapshot": True},
secrets=[
modal.Secret.from_name("hf_token"),
],
)
@modal.concurrent(max_inputs=1)
class Diarizer:
@enter()
@modal.enter(snap=True)
def enter(self):
import torch
from pyannote.audio import Pipeline
self.use_gpu = torch.cuda.is_available()
self.device = "cuda" if self.use_gpu else "cpu"
print(f"Using device: {self.device}")
self.diarization_pipeline = Pipeline.from_pretrained(
PYANNOTE_MODEL_NAME, cache_dir=MODEL_DIR
PYANNOTE_MODEL_NAME,
cache_dir=MODEL_DIR,
use_auth_token=os.environ["HF_TOKEN"],
)
self.diarization_pipeline.to(torch.device(self.device))
@method()
def diarize(self, audio_data: str, audio_suffix: str, timestamp: float):
import tempfile
@modal.method()
def diarize(self, filename: str, timestamp: float = 0.0):
import torchaudio
with tempfile.NamedTemporaryFile("wb+", suffix=f".{audio_suffix}") as fp:
fp.write(audio_data)
upload_volume.reload()
print("Diarizing audio")
waveform, sample_rate = torchaudio.load(fp.name)
diarization = self.diarization_pipeline(
{"waveform": waveform, "sample_rate": sample_rate}
file_path = f"{UPLOADS_PATH}/{filename}"
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
print(f"Diarizing audio from: {file_path}")
waveform, sample_rate = torchaudio.load(file_path)
diarization = self.diarization_pipeline(
{"waveform": waveform, "sample_rate": sample_rate}
)
words = []
for diarization_segment, _, speaker in diarization.itertracks(yield_label=True):
words.append(
{
"start": round(timestamp + diarization_segment.start, 3),
"end": round(timestamp + diarization_segment.end, 3),
"speaker": int(speaker[-2:]),
}
)
words = []
for diarization_segment, _, speaker in diarization.itertracks(
yield_label=True
):
words.append(
{
"start": round(timestamp + diarization_segment.start, 3),
"end": round(timestamp + diarization_segment.end, 3),
"speaker": int(speaker[-2:]),
}
)
print("Diarization complete")
return {"diarization": words}
print("Diarization complete")
return {"diarization": words}
# -------------------------------------------------------------------
@@ -127,17 +201,18 @@ class Diarizer:
@app.function(
timeout=60 * 10,
scaledown_window=60 * 3,
allow_concurrent_inputs=40,
secrets=[
Secret.from_name("reflector-gpu"),
modal.Secret.from_name("reflector-gpu"),
],
volumes={UPLOADS_PATH: upload_volume},
image=diarizer_image,
)
@asgi_app()
@modal.concurrent(max_inputs=40)
@modal.asgi_app()
def web():
import requests
from fastapi import Depends, FastAPI, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel
diarizerstub = Diarizer()
@@ -153,35 +228,26 @@ def web():
headers={"WWW-Authenticate": "Bearer"},
)
def validate_audio_file(audio_file_url: str):
# Check if the audio file exists
response = requests.head(audio_file_url, allow_redirects=True)
if response.status_code == 404:
raise HTTPException(
status_code=response.status_code,
detail="The audio file does not exist.",
)
class DiarizationResponse(BaseModel):
result: dict
@app.post(
"/diarize", dependencies=[Depends(apikey_auth), Depends(validate_audio_file)]
)
def diarize(
audio_file_url: str, timestamp: float = 0.0
) -> HTTPException | DiarizationResponse:
# Currently the uploaded files are in mp3 format
audio_suffix = "mp3"
@app.post("/diarize", dependencies=[Depends(apikey_auth)])
def diarize(audio_file_url: str, timestamp: float = 0.0) -> DiarizationResponse:
unique_filename, audio_suffix = download_audio_to_volume(audio_file_url)
print("Downloading audio file")
response = requests.get(audio_file_url, allow_redirects=True)
print("Audio file downloaded successfully")
func = diarizerstub.diarize.spawn(
audio_data=response.content, audio_suffix=audio_suffix, timestamp=timestamp
)
result = func.get()
return result
try:
func = diarizerstub.diarize.spawn(
filename=unique_filename, timestamp=timestamp
)
result = func.get()
return result
finally:
try:
file_path = f"{UPLOADS_PATH}/{unique_filename}"
print(f"Deleting file: {file_path}")
os.remove(file_path)
upload_volume.commit()
except Exception as e:
print(f"Error cleaning up {unique_filename}: {e}")
return app

View File

@@ -0,0 +1,658 @@
import logging
import os
import sys
import threading
import uuid
from typing import Generator, Mapping, NamedTuple, NewType, TypedDict
from urllib.parse import urlparse
import modal
MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"
SUPPORTED_FILE_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"]
SAMPLERATE = 16000
UPLOADS_PATH = "/uploads"
CACHE_PATH = "/cache"
VAD_CONFIG = {
"batch_max_duration": 30.0,
"silence_padding": 0.5,
"window_size": 512,
}
ParakeetUniqFilename = NewType("ParakeetUniqFilename", str)
AudioFileExtension = NewType("AudioFileExtension", str)
class TimeSegment(NamedTuple):
"""Represents a time segment with start and end times."""
start: float
end: float
class AudioSegment(NamedTuple):
"""Represents an audio segment with timing and audio data."""
start: float
end: float
audio: any
class TranscriptResult(NamedTuple):
"""Represents a transcription result with text and word timings."""
text: str
words: list["WordTiming"]
class WordTiming(TypedDict):
"""Represents a word with its timing information."""
word: str
start: float
end: float
app = modal.App("reflector-transcriber-parakeet")
# Volume for caching model weights
model_cache = modal.Volume.from_name("parakeet-model-cache", create_if_missing=True)
# Volume for temporary file uploads
upload_volume = modal.Volume.from_name("parakeet-uploads", create_if_missing=True)
image = (
modal.Image.from_registry(
"nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04", add_python="3.12"
)
.env(
{
"HF_HUB_ENABLE_HF_TRANSFER": "1",
"HF_HOME": "/cache",
"DEBIAN_FRONTEND": "noninteractive",
"CXX": "g++",
"CC": "g++",
}
)
.apt_install("ffmpeg")
.pip_install(
"hf_transfer==0.1.9",
"huggingface_hub[hf-xet]==0.31.2",
"nemo_toolkit[asr]==2.3.0",
"cuda-python==12.8.0",
"fastapi==0.115.12",
"numpy<2",
"librosa==0.10.1",
"requests",
"silero-vad==5.1.0",
"torch",
)
.entrypoint([]) # silence chatty logs by container on start
)
def detect_audio_format(url: str, headers: Mapping[str, str]) -> AudioFileExtension:
parsed_url = urlparse(url)
url_path = parsed_url.path
for ext in SUPPORTED_FILE_EXTENSIONS:
if url_path.lower().endswith(f".{ext}"):
return AudioFileExtension(ext)
content_type = headers.get("content-type", "").lower()
if "audio/mpeg" in content_type or "audio/mp3" in content_type:
return AudioFileExtension("mp3")
if "audio/wav" in content_type:
return AudioFileExtension("wav")
if "audio/mp4" in content_type:
return AudioFileExtension("mp4")
raise ValueError(
f"Unsupported audio format for URL: {url}. "
f"Supported extensions: {', '.join(SUPPORTED_FILE_EXTENSIONS)}"
)
def download_audio_to_volume(
audio_file_url: str,
) -> tuple[ParakeetUniqFilename, AudioFileExtension]:
import requests
from fastapi import HTTPException
response = requests.head(audio_file_url, allow_redirects=True)
if response.status_code == 404:
raise HTTPException(status_code=404, detail="Audio file not found")
response = requests.get(audio_file_url, allow_redirects=True)
response.raise_for_status()
audio_suffix = detect_audio_format(audio_file_url, response.headers)
unique_filename = ParakeetUniqFilename(f"{uuid.uuid4()}.{audio_suffix}")
file_path = f"{UPLOADS_PATH}/{unique_filename}"
with open(file_path, "wb") as f:
f.write(response.content)
upload_volume.commit()
return unique_filename, audio_suffix
def pad_audio(audio_array, sample_rate: int = SAMPLERATE):
"""Add 0.5 seconds of silence if audio is less than 500ms.
This is a workaround for a Parakeet bug where very short audio (<500ms) causes:
ValueError: `char_offsets`: [] and `processed_tokens`: [157, 834, 834, 841]
have to be of the same length
See: https://github.com/NVIDIA/NeMo/issues/8451
"""
import numpy as np
audio_duration = len(audio_array) / sample_rate
if audio_duration < 0.5:
silence_samples = int(sample_rate * 0.5)
silence = np.zeros(silence_samples, dtype=np.float32)
return np.concatenate([audio_array, silence])
return audio_array
@app.cls(
gpu="A10G",
timeout=600,
scaledown_window=300,
image=image,
volumes={CACHE_PATH: model_cache, UPLOADS_PATH: upload_volume},
enable_memory_snapshot=True,
experimental_options={"enable_gpu_snapshot": True},
)
@modal.concurrent(max_inputs=10)
class TranscriberParakeetLive:
@modal.enter(snap=True)
def enter(self):
import nemo.collections.asr as nemo_asr
logging.getLogger("nemo_logger").setLevel(logging.CRITICAL)
self.lock = threading.Lock()
self.model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
device = next(self.model.parameters()).device
print(f"Model is on device: {device}")
@modal.method()
def transcribe_segment(
self,
filename: str,
):
import librosa
upload_volume.reload()
file_path = f"{UPLOADS_PATH}/{filename}"
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
audio_array, sample_rate = librosa.load(file_path, sr=SAMPLERATE, mono=True)
padded_audio = pad_audio(audio_array, sample_rate)
with self.lock:
with NoStdStreams():
(output,) = self.model.transcribe([padded_audio], timestamps=True)
text = output.text.strip()
words: list[WordTiming] = [
WordTiming(
# XXX the space added here is to match the output of whisper
# whisper add space to each words, while parakeet don't
word=word_info["word"] + " ",
start=round(word_info["start"], 2),
end=round(word_info["end"], 2),
)
for word_info in output.timestamp["word"]
]
return {"text": text, "words": words}
@modal.method()
def transcribe_batch(
self,
filenames: list[str],
):
import librosa
upload_volume.reload()
results = []
audio_arrays = []
# Load all audio files with padding
for filename in filenames:
file_path = f"{UPLOADS_PATH}/{filename}"
if not os.path.exists(file_path):
raise FileNotFoundError(f"Batch file not found: {file_path}")
audio_array, sample_rate = librosa.load(file_path, sr=SAMPLERATE, mono=True)
padded_audio = pad_audio(audio_array, sample_rate)
audio_arrays.append(padded_audio)
with self.lock:
with NoStdStreams():
outputs = self.model.transcribe(audio_arrays, timestamps=True)
# Process results for each file
for i, (filename, output) in enumerate(zip(filenames, outputs)):
text = output.text.strip()
words: list[WordTiming] = [
WordTiming(
word=word_info["word"] + " ",
start=round(word_info["start"], 2),
end=round(word_info["end"], 2),
)
for word_info in output.timestamp["word"]
]
results.append(
{
"filename": filename,
"text": text,
"words": words,
}
)
return results
# L40S class for file transcription (bigger files)
@app.cls(
gpu="L40S",
timeout=900,
image=image,
volumes={CACHE_PATH: model_cache, UPLOADS_PATH: upload_volume},
enable_memory_snapshot=True,
experimental_options={"enable_gpu_snapshot": True},
)
class TranscriberParakeetFile:
@modal.enter(snap=True)
def enter(self):
import nemo.collections.asr as nemo_asr
import torch
from silero_vad import load_silero_vad
logging.getLogger("nemo_logger").setLevel(logging.CRITICAL)
self.model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
device = next(self.model.parameters()).device
print(f"Model is on device: {device}")
torch.set_num_threads(1)
self.vad_model = load_silero_vad(onnx=False)
print("Silero VAD initialized")
@modal.method()
def transcribe_segment(
self,
filename: str,
timestamp_offset: float = 0.0,
):
import librosa
import numpy as np
from silero_vad import VADIterator
def load_and_convert_audio(file_path):
audio_array, sample_rate = librosa.load(file_path, sr=SAMPLERATE, mono=True)
return audio_array
def vad_segment_generator(
audio_array,
) -> Generator[TimeSegment, None, None]:
"""Generate speech segments using VAD with start/end sample indices"""
vad_iterator = VADIterator(self.vad_model, sampling_rate=SAMPLERATE)
window_size = VAD_CONFIG["window_size"]
start = None
for i in range(0, len(audio_array), window_size):
chunk = audio_array[i : i + window_size]
if len(chunk) < window_size:
chunk = np.pad(
chunk, (0, window_size - len(chunk)), mode="constant"
)
speech_dict = vad_iterator(chunk)
if not speech_dict:
continue
if "start" in speech_dict:
start = speech_dict["start"]
continue
if "end" in speech_dict and start is not None:
end = speech_dict["end"]
start_time = start / float(SAMPLERATE)
end_time = end / float(SAMPLERATE)
yield TimeSegment(start_time, end_time)
start = None
vad_iterator.reset_states()
def batch_speech_segments(
segments: Generator[TimeSegment, None, None], max_duration: int
) -> Generator[TimeSegment, None, None]:
"""
Input segments:
[0-2] [3-5] [6-8] [10-11] [12-15] [17-19] [20-22]
↓ (max_duration=10)
Output batches:
[0-8] [10-19] [20-22]
Note: silences are kept for better transcription, previous implementation was
passing segments separatly, but the output was less accurate.
"""
batch_start_time = None
batch_end_time = None
for segment in segments:
start_time, end_time = segment.start, segment.end
if batch_start_time is None or batch_end_time is None:
batch_start_time = start_time
batch_end_time = end_time
continue
total_duration = end_time - batch_start_time
if total_duration <= max_duration:
batch_end_time = end_time
continue
yield TimeSegment(batch_start_time, batch_end_time)
batch_start_time = start_time
batch_end_time = end_time
if batch_start_time is None or batch_end_time is None:
return
yield TimeSegment(batch_start_time, batch_end_time)
def batch_segment_to_audio_segment(
segments: Generator[TimeSegment, None, None],
audio_array,
) -> Generator[AudioSegment, None, None]:
"""Extract audio segments and apply padding for Parakeet compatibility.
Uses pad_audio to ensure segments are at least 0.5s long, preventing
Parakeet crashes. This padding may cause slight timing overlaps between
segments, which are corrected by enforce_word_timing_constraints.
"""
for segment in segments:
start_time, end_time = segment.start, segment.end
start_sample = int(start_time * SAMPLERATE)
end_sample = int(end_time * SAMPLERATE)
audio_segment = audio_array[start_sample:end_sample]
padded_segment = pad_audio(audio_segment, SAMPLERATE)
yield AudioSegment(start_time, end_time, padded_segment)
def transcribe_batch(model, audio_segments: list) -> list:
with NoStdStreams():
outputs = model.transcribe(audio_segments, timestamps=True)
return outputs
def enforce_word_timing_constraints(
words: list[WordTiming],
) -> list[WordTiming]:
"""Enforce that word end times don't exceed the start time of the next word.
Due to silence padding added in batch_segment_to_audio_segment for better
transcription accuracy, word timings from different segments may overlap.
This function ensures there are no overlaps by adjusting end times.
"""
if len(words) <= 1:
return words
enforced_words = []
for i, word in enumerate(words):
enforced_word = word.copy()
if i < len(words) - 1:
next_start = words[i + 1]["start"]
if enforced_word["end"] > next_start:
enforced_word["end"] = next_start
enforced_words.append(enforced_word)
return enforced_words
def emit_results(
results: list,
segments_info: list[AudioSegment],
) -> Generator[TranscriptResult, None, None]:
"""Yield transcribed text and word timings from model output, adjusting timestamps to absolute positions."""
for i, (output, segment) in enumerate(zip(results, segments_info)):
start_time, end_time = segment.start, segment.end
text = output.text.strip()
words: list[WordTiming] = [
WordTiming(
word=word_info["word"] + " ",
start=round(
word_info["start"] + start_time + timestamp_offset, 2
),
end=round(word_info["end"] + start_time + timestamp_offset, 2),
)
for word_info in output.timestamp["word"]
]
yield TranscriptResult(text, words)
upload_volume.reload()
file_path = f"{UPLOADS_PATH}/{filename}"
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
audio_array = load_and_convert_audio(file_path)
total_duration = len(audio_array) / float(SAMPLERATE)
all_text_parts: list[str] = []
all_words: list[WordTiming] = []
raw_segments = vad_segment_generator(audio_array)
speech_segments = batch_speech_segments(
raw_segments,
VAD_CONFIG["batch_max_duration"],
)
audio_segments = batch_segment_to_audio_segment(speech_segments, audio_array)
for batch in audio_segments:
audio_segment = batch.audio
results = transcribe_batch(self.model, [audio_segment])
for result in emit_results(
results,
[batch],
):
if not result.text:
continue
all_text_parts.append(result.text)
all_words.extend(result.words)
all_words = enforce_word_timing_constraints(all_words)
combined_text = " ".join(all_text_parts)
return {"text": combined_text, "words": all_words}
@app.function(
scaledown_window=60,
timeout=600,
secrets=[
modal.Secret.from_name("reflector-gpu"),
],
volumes={CACHE_PATH: model_cache, UPLOADS_PATH: upload_volume},
image=image,
)
@modal.concurrent(max_inputs=40)
@modal.asgi_app()
def web():
import os
import uuid
from fastapi import (
Body,
Depends,
FastAPI,
Form,
HTTPException,
UploadFile,
status,
)
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel
transcriber_live = TranscriberParakeetLive()
transcriber_file = TranscriberParakeetFile()
app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")
def apikey_auth(apikey: str = Depends(oauth2_scheme)):
if apikey == os.environ["REFLECTOR_GPU_APIKEY"]:
return
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Invalid API key",
headers={"WWW-Authenticate": "Bearer"},
)
class TranscriptResponse(BaseModel):
result: dict
@app.post("/v1/audio/transcriptions", dependencies=[Depends(apikey_auth)])
def transcribe(
file: UploadFile = None,
files: list[UploadFile] | None = None,
model: str = Form(MODEL_NAME),
language: str = Form("en"),
batch: bool = Form(False),
):
# Parakeet only supports English
if language != "en":
raise HTTPException(
status_code=400,
detail=f"Parakeet model only supports English. Got language='{language}'",
)
# Handle both single file and multiple files
if not file and not files:
raise HTTPException(
status_code=400, detail="Either 'file' or 'files' parameter is required"
)
if batch and not files:
raise HTTPException(
status_code=400, detail="Batch transcription requires 'files'"
)
upload_files = [file] if file else files
# Upload files to volume
uploaded_filenames = []
for upload_file in upload_files:
audio_suffix = upload_file.filename.split(".")[-1]
assert audio_suffix in SUPPORTED_FILE_EXTENSIONS
# Generate unique filename
unique_filename = f"{uuid.uuid4()}.{audio_suffix}"
file_path = f"{UPLOADS_PATH}/{unique_filename}"
print(f"Writing file to: {file_path}")
with open(file_path, "wb") as f:
content = upload_file.file.read()
f.write(content)
uploaded_filenames.append(unique_filename)
upload_volume.commit()
try:
# Use A10G live transcriber for per-file transcription
if batch and len(upload_files) > 1:
# Use batch transcription
func = transcriber_live.transcribe_batch.spawn(
filenames=uploaded_filenames,
)
results = func.get()
return {"results": results}
# Per-file transcription
results = []
for filename in uploaded_filenames:
func = transcriber_live.transcribe_segment.spawn(
filename=filename,
)
result = func.get()
result["filename"] = filename
results.append(result)
return {"results": results} if len(results) > 1 else results[0]
finally:
for filename in uploaded_filenames:
try:
file_path = f"{UPLOADS_PATH}/{filename}"
print(f"Deleting file: {file_path}")
os.remove(file_path)
except Exception as e:
print(f"Error deleting {filename}: {e}")
upload_volume.commit()
@app.post("/v1/audio/transcriptions-from-url", dependencies=[Depends(apikey_auth)])
def transcribe_from_url(
audio_file_url: str = Body(
..., description="URL of the audio file to transcribe"
),
model: str = Body(MODEL_NAME),
language: str = Body("en", description="Language code (only 'en' supported)"),
timestamp_offset: float = Body(0.0),
):
# Parakeet only supports English
if language != "en":
raise HTTPException(
status_code=400,
detail=f"Parakeet model only supports English. Got language='{language}'",
)
unique_filename, audio_suffix = download_audio_to_volume(audio_file_url)
try:
func = transcriber_file.transcribe_segment.spawn(
filename=unique_filename,
timestamp_offset=timestamp_offset,
)
result = func.get()
return result
finally:
try:
file_path = f"{UPLOADS_PATH}/{unique_filename}"
print(f"Deleting file: {file_path}")
os.remove(file_path)
upload_volume.commit()
except Exception as e:
print(f"Error cleaning up {unique_filename}: {e}")
return app
class NoStdStreams:
def __init__(self):
self.devnull = open(os.devnull, "w")
def __enter__(self):
self._stdout, self._stderr = sys.stdout, sys.stderr
self._stdout.flush()
self._stderr.flush()
sys.stdout, sys.stderr = self.devnull, self.devnull
def __exit__(self, exc_type, exc_value, traceback):
sys.stdout, sys.stderr = self._stdout, self._stderr
self.devnull.close()

View File

@@ -0,0 +1,36 @@
"""Add webhook fields to rooms
Revision ID: 0194f65cd6d3
Revises: 5a8907fd1d78
Create Date: 2025-08-27 09:03:19.610995
"""
from typing import Sequence, Union
import sqlalchemy as sa
from alembic import op
# revision identifiers, used by Alembic.
revision: str = "0194f65cd6d3"
down_revision: Union[str, None] = "5a8907fd1d78"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
with op.batch_alter_table("room", schema=None) as batch_op:
batch_op.add_column(sa.Column("webhook_url", sa.String(), nullable=True))
batch_op.add_column(sa.Column("webhook_secret", sa.String(), nullable=True))
# ### end Alembic commands ###
def downgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
with op.batch_alter_table("room", schema=None) as batch_op:
batch_op.drop_column("webhook_secret")
batch_op.drop_column("webhook_url")
# ### end Alembic commands ###

View File

@@ -0,0 +1,64 @@
"""add_long_summary_to_search_vector
Revision ID: 0ab2d7ffaa16
Revises: b1c33bd09963
Create Date: 2025-08-15 13:27:52.680211
"""
from typing import Sequence, Union
from alembic import op
# revision identifiers, used by Alembic.
revision: str = "0ab2d7ffaa16"
down_revision: Union[str, None] = "b1c33bd09963"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# Drop the existing search vector column and index
op.drop_index("idx_transcript_search_vector_en", table_name="transcript")
op.drop_column("transcript", "search_vector_en")
# Recreate the search vector column with long_summary included
op.execute("""
ALTER TABLE transcript ADD COLUMN search_vector_en tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(long_summary, '')), 'B') ||
setweight(to_tsvector('english', coalesce(webvtt, '')), 'C')
) STORED
""")
# Recreate the GIN index for the search vector
op.create_index(
"idx_transcript_search_vector_en",
"transcript",
["search_vector_en"],
postgresql_using="gin",
)
def downgrade() -> None:
# Drop the updated search vector column and index
op.drop_index("idx_transcript_search_vector_en", table_name="transcript")
op.drop_column("transcript", "search_vector_en")
# Recreate the original search vector column without long_summary
op.execute("""
ALTER TABLE transcript ADD COLUMN search_vector_en tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(webvtt, '')), 'B')
) STORED
""")
# Recreate the GIN index for the search vector
op.create_index(
"idx_transcript_search_vector_en",
"transcript",
["search_vector_en"],
postgresql_using="gin",
)

View File

@@ -0,0 +1,50 @@
"""add cascade delete to meeting consent foreign key
Revision ID: 5a8907fd1d78
Revises: 0ab2d7ffaa16
Create Date: 2025-08-26 17:26:50.945491
"""
from typing import Sequence, Union
from alembic import op
# revision identifiers, used by Alembic.
revision: str = "5a8907fd1d78"
down_revision: Union[str, None] = "0ab2d7ffaa16"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
with op.batch_alter_table("meeting_consent", schema=None) as batch_op:
batch_op.drop_constraint(
batch_op.f("meeting_consent_meeting_id_fkey"), type_="foreignkey"
)
batch_op.create_foreign_key(
batch_op.f("meeting_consent_meeting_id_fkey"),
"meeting",
["meeting_id"],
["id"],
ondelete="CASCADE",
)
# ### end Alembic commands ###
def downgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
with op.batch_alter_table("meeting_consent", schema=None) as batch_op:
batch_op.drop_constraint(
batch_op.f("meeting_consent_meeting_id_fkey"), type_="foreignkey"
)
batch_op.create_foreign_key(
batch_op.f("meeting_consent_meeting_id_fkey"),
"meeting",
["meeting_id"],
["id"],
)
# ### end Alembic commands ###

View File

@@ -1,53 +0,0 @@
"""remove_one_active_meeting_per_room_constraint
Revision ID: 6025e9b2bef2
Revises: 9f5c78d352d6
Create Date: 2025-08-18 18:45:44.418392
"""
from typing import Sequence, Union
import sqlalchemy as sa
from alembic import op
# revision identifiers, used by Alembic.
revision: str = "6025e9b2bef2"
down_revision: Union[str, None] = "9f5c78d352d6"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# Remove the unique constraint that prevents multiple active meetings per room
# This is needed to support calendar integration with overlapping meetings
# Check if index exists before trying to drop it
from alembic import context
if context.get_context().dialect.name == "postgresql":
conn = op.get_bind()
result = conn.execute(
sa.text(
"SELECT 1 FROM pg_indexes WHERE indexname = 'idx_one_active_meeting_per_room'"
)
)
if result.fetchone():
op.drop_index("idx_one_active_meeting_per_room", table_name="meeting")
else:
# For SQLite, just try to drop it
try:
op.drop_index("idx_one_active_meeting_per_room", table_name="meeting")
except:
pass
def downgrade() -> None:
# Restore the unique constraint
op.create_index(
"idx_one_active_meeting_per_room",
"meeting",
["room_id"],
unique=True,
postgresql_where=sa.text("is_active = true"),
sqlite_where=sa.text("is_active = 1"),
)

View File

@@ -0,0 +1,28 @@
"""webhook url and secret null by default
Revision ID: 61882a919591
Revises: 0194f65cd6d3
Create Date: 2025-08-29 11:46:36.738091
"""
from typing import Sequence, Union
# revision identifiers, used by Alembic.
revision: str = "61882a919591"
down_revision: Union[str, None] = "0194f65cd6d3"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
pass
# ### end Alembic commands ###
def downgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
pass
# ### end Alembic commands ###

View File

@@ -0,0 +1,41 @@
"""add_search_optimization_indexes
Revision ID: b1c33bd09963
Revises: 9f5c78d352d6
Create Date: 2025-08-14 17:26:02.117408
"""
from typing import Sequence, Union
from alembic import op
# revision identifiers, used by Alembic.
revision: str = "b1c33bd09963"
down_revision: Union[str, None] = "9f5c78d352d6"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# Add indexes for actual search filtering patterns used in frontend
# Based on /browse page filters: room_id and source_kind
# Index for room_id + created_at (for room-specific searches with date ordering)
op.create_index(
"idx_transcript_room_id_created_at",
"transcript",
["room_id", "created_at"],
if_not_exists=True,
)
# Index for source_kind alone (actively used filter in frontend)
op.create_index(
"idx_transcript_source_kind", "transcript", ["source_kind"], if_not_exists=True
)
def downgrade() -> None:
# Remove the indexes in reverse order
op.drop_index("idx_transcript_source_kind", "transcript", if_exists=True)
op.drop_index("idx_transcript_room_id_created_at", "transcript", if_exists=True)

View File

@@ -1,34 +0,0 @@
"""add_grace_period_fields_to_meeting
Revision ID: d4a1c446458c
Revises: 6025e9b2bef2
Create Date: 2025-08-18 18:50:37.768052
"""
from typing import Sequence, Union
import sqlalchemy as sa
from alembic import op
# revision identifiers, used by Alembic.
revision: str = "d4a1c446458c"
down_revision: Union[str, None] = "6025e9b2bef2"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# Add fields to track when participants left for grace period logic
op.add_column(
"meeting", sa.Column("last_participant_left_at", sa.DateTime(timezone=True))
)
op.add_column(
"meeting",
sa.Column("grace_period_minutes", sa.Integer, server_default=sa.text("15")),
)
def downgrade() -> None:
op.drop_column("meeting", "grace_period_minutes")
op.drop_column("meeting", "last_participant_left_at")

View File

@@ -32,7 +32,6 @@ dependencies = [
"redis>=5.0.1",
"python-jose[cryptography]>=3.3.0",
"python-multipart>=0.0.6",
"faster-whisper>=0.10.0",
"transformers>=4.36.2",
"jsonschema>=4.23.0",
"openai>=1.59.7",
@@ -41,7 +40,6 @@ dependencies = [
"llama-index-llms-openai-like>=0.4.0",
"pytest-env>=1.1.5",
"webvtt-py>=0.5.0",
"icalendar>=6.0.0",
]
[dependency-groups]
@@ -58,6 +56,7 @@ tests = [
"httpx-ws>=0.4.1",
"pytest-httpx>=0.23.1",
"pytest-celery>=0.0.0",
"pytest-recording>=0.13.4",
"pytest-docker>=3.2.3",
"asgi-lifespan>=2.1.0",
]
@@ -68,6 +67,15 @@ evaluation = [
"tqdm>=4.66.0",
"pydantic>=2.1.1",
]
local = [
"pyannote-audio>=3.3.2",
"faster-whisper>=0.10.0",
]
silero-vad = [
"silero-vad>=5.1.2",
"torch>=2.8.0",
"torchaudio>=2.8.0",
]
[tool.uv]
default-groups = [
@@ -75,6 +83,21 @@ default-groups = [
"tests",
"aws",
"evaluation",
"local",
"silero-vad"
]
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true
[tool.uv.sources]
torch = [
{ index = "pytorch-cpu" },
]
torchaudio = [
{ index = "pytorch-cpu" },
]
[build-system]
@@ -95,6 +118,9 @@ DATABASE_URL = "postgresql://test_user:test_password@localhost:15432/reflector_t
addopts = "-ra -q --disable-pytest-warnings --cov --cov-report html -v"
testpaths = ["tests"]
asyncio_mode = "auto"
markers = [
"gpu_modal: mark test to run only with GPU Modal endpoints (deselect with '-m \"not gpu_modal\"')",
]
[tool.ruff.lint]
select = [

View File

@@ -0,0 +1,27 @@
import asyncio
import functools
from reflector.db import get_database
def asynctask(f):
@functools.wraps(f)
def wrapper(*args, **kwargs):
async def run_with_db():
database = get_database()
await database.connect()
try:
return await f(*args, **kwargs)
finally:
await database.disconnect()
coro = run_with_db()
try:
loop = asyncio.get_running_loop()
except RuntimeError:
loop = None
if loop and loop.is_running():
return loop.run_until_complete(coro)
return asyncio.run(coro)
return wrapper

View File

@@ -24,7 +24,6 @@ def get_database() -> databases.Database:
# import models
import reflector.db.calendar_events # noqa
import reflector.db.meetings # noqa
import reflector.db.recordings # noqa
import reflector.db.rooms # noqa

View File

@@ -1,193 +0,0 @@
from datetime import datetime, timezone
from typing import Any
import sqlalchemy as sa
from pydantic import BaseModel, Field
from sqlalchemy.dialects.postgresql import JSONB
from reflector.db import get_database, metadata
from reflector.utils import generate_uuid4
calendar_events = sa.Table(
"calendar_event",
metadata,
sa.Column("id", sa.String, primary_key=True),
sa.Column(
"room_id",
sa.String,
sa.ForeignKey("room.id", ondelete="CASCADE"),
nullable=False,
),
sa.Column("ics_uid", sa.Text, nullable=False),
sa.Column("title", sa.Text),
sa.Column("description", sa.Text),
sa.Column("start_time", sa.DateTime(timezone=True), nullable=False),
sa.Column("end_time", sa.DateTime(timezone=True), nullable=False),
sa.Column("attendees", JSONB),
sa.Column("location", sa.Text),
sa.Column("ics_raw_data", sa.Text),
sa.Column("last_synced", sa.DateTime(timezone=True), nullable=False),
sa.Column("is_deleted", sa.Boolean, nullable=False, server_default=sa.false()),
sa.Column("created_at", sa.DateTime(timezone=True), nullable=False),
sa.Column("updated_at", sa.DateTime(timezone=True), nullable=False),
sa.UniqueConstraint("room_id", "ics_uid", name="uq_room_calendar_event"),
sa.Index("idx_calendar_event_room_start", "room_id", "start_time"),
sa.Index(
"idx_calendar_event_deleted",
"is_deleted",
postgresql_where=sa.text("NOT is_deleted"),
),
)
class CalendarEvent(BaseModel):
id: str = Field(default_factory=generate_uuid4)
room_id: str
ics_uid: str
title: str | None = None
description: str | None = None
start_time: datetime
end_time: datetime
attendees: list[dict[str, Any]] | None = None
location: str | None = None
ics_raw_data: str | None = None
last_synced: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
is_deleted: bool = False
created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
updated_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
class CalendarEventController:
async def get_by_room(
self,
room_id: str,
include_deleted: bool = False,
start_after: datetime | None = None,
end_before: datetime | None = None,
) -> list[CalendarEvent]:
"""Get calendar events for a room."""
query = calendar_events.select().where(calendar_events.c.room_id == room_id)
if not include_deleted:
query = query.where(calendar_events.c.is_deleted == False)
if start_after:
query = query.where(calendar_events.c.start_time >= start_after)
if end_before:
query = query.where(calendar_events.c.end_time <= end_before)
query = query.order_by(calendar_events.c.start_time.asc())
results = await get_database().fetch_all(query)
return [CalendarEvent(**result) for result in results]
async def get_upcoming(
self, room_id: str, minutes_ahead: int = 30
) -> list[CalendarEvent]:
"""Get upcoming events for a room within the specified minutes."""
now = datetime.now(timezone.utc)
future_time = now + timedelta(minutes=minutes_ahead)
query = (
calendar_events.select()
.where(
sa.and_(
calendar_events.c.room_id == room_id,
calendar_events.c.is_deleted == False,
calendar_events.c.start_time >= now,
calendar_events.c.start_time <= future_time,
)
)
.order_by(calendar_events.c.start_time.asc())
)
results = await get_database().fetch_all(query)
return [CalendarEvent(**result) for result in results]
async def get_by_ics_uid(self, room_id: str, ics_uid: str) -> CalendarEvent | None:
"""Get a calendar event by its ICS UID."""
query = calendar_events.select().where(
sa.and_(
calendar_events.c.room_id == room_id,
calendar_events.c.ics_uid == ics_uid,
)
)
result = await get_database().fetch_one(query)
return CalendarEvent(**result) if result else None
async def upsert(self, event: CalendarEvent) -> CalendarEvent:
"""Create or update a calendar event."""
existing = await self.get_by_ics_uid(event.room_id, event.ics_uid)
if existing:
# Update existing event
event.id = existing.id
event.created_at = existing.created_at
event.updated_at = datetime.now(timezone.utc)
query = (
calendar_events.update()
.where(calendar_events.c.id == existing.id)
.values(**event.model_dump())
)
else:
# Insert new event
query = calendar_events.insert().values(**event.model_dump())
await get_database().execute(query)
return event
async def soft_delete_missing(
self, room_id: str, current_ics_uids: list[str]
) -> int:
"""Soft delete future events that are no longer in the calendar."""
now = datetime.now(timezone.utc)
# First, get the IDs of events to delete
select_query = calendar_events.select().where(
sa.and_(
calendar_events.c.room_id == room_id,
calendar_events.c.start_time > now,
calendar_events.c.is_deleted == False,
calendar_events.c.ics_uid.notin_(current_ics_uids)
if current_ics_uids
else True,
)
)
to_delete = await get_database().fetch_all(select_query)
delete_count = len(to_delete)
if delete_count > 0:
# Now update them
update_query = (
calendar_events.update()
.where(
sa.and_(
calendar_events.c.room_id == room_id,
calendar_events.c.start_time > now,
calendar_events.c.is_deleted == False,
calendar_events.c.ics_uid.notin_(current_ics_uids)
if current_ics_uids
else True,
)
)
.values(is_deleted=True, updated_at=now)
)
await get_database().execute(update_query)
return delete_count
async def delete_by_room(self, room_id: str) -> int:
"""Hard delete all events for a room (used when room is deleted)."""
query = calendar_events.delete().where(calendar_events.c.room_id == room_id)
result = await get_database().execute(query)
return result.rowcount
# Add missing import
from datetime import timedelta
calendar_events_controller = CalendarEventController()

View File

@@ -1,10 +1,9 @@
from datetime import datetime
from typing import Any, Literal
from typing import Literal
import sqlalchemy as sa
from fastapi import HTTPException
from pydantic import BaseModel, Field
from sqlalchemy.dialects.postgresql import JSONB
from reflector.db import get_database, metadata
from reflector.db.rooms import Room
@@ -42,23 +41,25 @@ meetings = sa.Table(
nullable=False,
server_default=sa.true(),
),
sa.Column(
"calendar_event_id",
sa.String,
sa.ForeignKey("calendar_event.id", ondelete="SET NULL"),
),
sa.Column("calendar_metadata", JSONB),
sa.Column("last_participant_left_at", sa.DateTime(timezone=True)),
sa.Column("grace_period_minutes", sa.Integer, server_default=sa.text("15")),
sa.Index("idx_meeting_room_id", "room_id"),
sa.Index("idx_meeting_calendar_event", "calendar_event_id"),
sa.Index(
"idx_one_active_meeting_per_room",
"room_id",
unique=True,
postgresql_where=sa.text("is_active = true"),
),
)
meeting_consent = sa.Table(
"meeting_consent",
metadata,
sa.Column("id", sa.String, primary_key=True),
sa.Column("meeting_id", sa.String, sa.ForeignKey("meeting.id"), nullable=False),
sa.Column(
"meeting_id",
sa.String,
sa.ForeignKey("meeting.id", ondelete="CASCADE"),
nullable=False,
),
sa.Column("user_id", sa.String),
sa.Column("consent_given", sa.Boolean, nullable=False),
sa.Column("consent_timestamp", sa.DateTime(timezone=True), nullable=False),
@@ -89,11 +90,6 @@ class Meeting(BaseModel):
"none", "prompt", "automatic", "automatic-2nd-participant"
] = "automatic-2nd-participant"
num_clients: int = 0
is_active: bool = True
calendar_event_id: str | None = None
calendar_metadata: dict[str, Any] | None = None
last_participant_left_at: datetime | None = None
grace_period_minutes: int = 15
class MeetingController:
@@ -107,8 +103,6 @@ class MeetingController:
end_date: datetime,
user_id: str,
room: Room,
calendar_event_id: str | None = None,
calendar_metadata: dict[str, Any] | None = None,
):
"""
Create a new meeting
@@ -126,8 +120,6 @@ class MeetingController:
room_mode=room.room_mode,
recording_type=room.recording_type,
recording_trigger=room.recording_trigger,
calendar_event_id=calendar_event_id,
calendar_metadata=calendar_metadata,
)
query = meetings.insert().values(**meeting.model_dump())
await get_database().execute(query)
@@ -157,7 +149,6 @@ class MeetingController:
async def get_active(self, room: Room, current_time: datetime) -> Meeting:
"""
Get latest active meeting for a room.
For backward compatibility, returns the most recent active meeting.
"""
end_date = getattr(meetings.c, "end_date")
query = (
@@ -177,47 +168,6 @@ class MeetingController:
return Meeting(**result)
async def get_all_active_for_room(
self, room: Room, current_time: datetime
) -> list[Meeting]:
"""
Get all active meetings for a room.
This supports multiple concurrent meetings per room.
"""
end_date = getattr(meetings.c, "end_date")
query = (
meetings.select()
.where(
sa.and_(
meetings.c.room_id == room.id,
meetings.c.end_date > current_time,
meetings.c.is_active,
)
)
.order_by(end_date.desc())
)
results = await get_database().fetch_all(query)
return [Meeting(**result) for result in results]
async def get_active_by_calendar_event(
self, room: Room, calendar_event_id: str, current_time: datetime
) -> Meeting | None:
"""
Get active meeting for a specific calendar event.
"""
query = meetings.select().where(
sa.and_(
meetings.c.room_id == room.id,
meetings.c.calendar_event_id == calendar_event_id,
meetings.c.end_date > current_time,
meetings.c.is_active,
)
)
result = await get_database().fetch_one(query)
if not result:
return None
return Meeting(**result)
async def get_by_id(self, meeting_id: str, **kwargs) -> Meeting | None:
"""
Get a meeting by id
@@ -245,15 +195,6 @@ class MeetingController:
return meeting
async def get_by_calendar_event(self, calendar_event_id: str) -> Meeting | None:
query = meetings.select().where(
meetings.c.calendar_event_id == calendar_event_id
)
result = await get_database().fetch_one(query)
if not result:
return None
return Meeting(**result)
async def update_meeting(self, meeting_id: str, **kwargs):
query = meetings.update().where(meetings.c.id == meeting_id).values(**kwargs)
await get_database().execute(query)

View File

@@ -1,3 +1,4 @@
import secrets
from datetime import datetime, timezone
from sqlite3 import IntegrityError
from typing import Literal
@@ -40,15 +41,9 @@ rooms = sqlalchemy.Table(
sqlalchemy.Column(
"is_shared", sqlalchemy.Boolean, nullable=False, server_default=false()
),
sqlalchemy.Column("ics_url", sqlalchemy.Text),
sqlalchemy.Column("ics_fetch_interval", sqlalchemy.Integer, server_default="300"),
sqlalchemy.Column(
"ics_enabled", sqlalchemy.Boolean, nullable=False, server_default=false()
),
sqlalchemy.Column("ics_last_sync", sqlalchemy.DateTime(timezone=True)),
sqlalchemy.Column("ics_last_etag", sqlalchemy.Text),
sqlalchemy.Column("webhook_url", sqlalchemy.String, nullable=True),
sqlalchemy.Column("webhook_secret", sqlalchemy.String, nullable=True),
sqlalchemy.Index("idx_room_is_shared", "is_shared"),
sqlalchemy.Index("idx_room_ics_enabled", "ics_enabled"),
)
@@ -67,11 +62,8 @@ class Room(BaseModel):
"none", "prompt", "automatic", "automatic-2nd-participant"
] = "automatic-2nd-participant"
is_shared: bool = False
ics_url: str | None = None
ics_fetch_interval: int = 300
ics_enabled: bool = False
ics_last_sync: datetime | None = None
ics_last_etag: str | None = None
webhook_url: str | None = None
webhook_secret: str | None = None
class RoomController:
@@ -120,13 +112,15 @@ class RoomController:
recording_type: str,
recording_trigger: str,
is_shared: bool,
ics_url: str | None = None,
ics_fetch_interval: int = 300,
ics_enabled: bool = False,
webhook_url: str = "",
webhook_secret: str = "",
):
"""
Add a new room
"""
if webhook_url and not webhook_secret:
webhook_secret = secrets.token_urlsafe(32)
room = Room(
name=name,
user_id=user_id,
@@ -138,9 +132,8 @@ class RoomController:
recording_type=recording_type,
recording_trigger=recording_trigger,
is_shared=is_shared,
ics_url=ics_url,
ics_fetch_interval=ics_fetch_interval,
ics_enabled=ics_enabled,
webhook_url=webhook_url,
webhook_secret=webhook_secret,
)
query = rooms.insert().values(**room.model_dump())
try:
@@ -153,6 +146,9 @@ class RoomController:
"""
Update a room fields with key/values in values
"""
if values.get("webhook_url") and not values.get("webhook_secret"):
values["webhook_secret"] = secrets.token_urlsafe(32)
query = rooms.update().where(rooms.c.id == room.id).values(**values)
try:
await get_database().execute(query)

View File

@@ -1,24 +1,37 @@
"""Search functionality for transcripts and other entities."""
import itertools
from dataclasses import dataclass
from datetime import datetime
from io import StringIO
from typing import Annotated, Any, Dict
from typing import Annotated, Any, Dict, Iterator
import sqlalchemy
import webvtt
from pydantic import BaseModel, Field, constr, field_serializer
from fastapi import HTTPException
from pydantic import (
BaseModel,
Field,
NonNegativeFloat,
NonNegativeInt,
ValidationError,
constr,
field_serializer,
)
from reflector.db import get_database
from reflector.db.rooms import rooms
from reflector.db.transcripts import SourceKind, transcripts
from reflector.db.utils import is_postgresql
from reflector.logger import logger
DEFAULT_SEARCH_LIMIT = 20
SNIPPET_CONTEXT_LENGTH = 50 # Characters before/after match to include
DEFAULT_SNIPPET_MAX_LENGTH = 150
DEFAULT_MAX_SNIPPETS = 3
DEFAULT_SNIPPET_MAX_LENGTH = NonNegativeInt(150)
DEFAULT_MAX_SNIPPETS = NonNegativeInt(3)
LONG_SUMMARY_MAX_SNIPPETS = 2
SearchQueryBase = constr(min_length=1, strip_whitespace=True)
SearchQueryBase = constr(min_length=0, strip_whitespace=True)
SearchLimitBase = Annotated[int, Field(ge=1, le=100)]
SearchOffsetBase = Annotated[int, Field(ge=0)]
SearchTotalBase = Annotated[int, Field(ge=0)]
@@ -32,6 +45,82 @@ SearchTotal = Annotated[
SearchTotalBase, Field(description="Total number of search results")
]
WEBVTT_SPEC_HEADER = "WEBVTT"
WebVTTContent = Annotated[
str,
Field(min_length=len(WEBVTT_SPEC_HEADER), description="WebVTT content"),
]
class WebVTTProcessor:
"""Stateless processor for WebVTT content operations."""
@staticmethod
def parse(raw_content: str) -> WebVTTContent:
"""Parse WebVTT content and return it as a string."""
if not raw_content.startswith(WEBVTT_SPEC_HEADER):
raise ValueError(f"Invalid WebVTT content, no header {WEBVTT_SPEC_HEADER}")
return raw_content
@staticmethod
def extract_text(webvtt_content: WebVTTContent) -> str:
"""Extract plain text from WebVTT content using webvtt library."""
try:
buffer = StringIO(webvtt_content)
vtt = webvtt.read_buffer(buffer)
return " ".join(caption.text for caption in vtt if caption.text)
except webvtt.errors.MalformedFileError as e:
logger.warning(f"Malformed WebVTT content: {e}")
return ""
except (UnicodeDecodeError, ValueError) as e:
logger.warning(f"Failed to decode WebVTT content: {e}")
return ""
except AttributeError as e:
logger.error(
f"WebVTT parsing error - unexpected format: {e}", exc_info=True
)
return ""
except Exception as e:
logger.error(f"Unexpected error parsing WebVTT: {e}", exc_info=True)
return ""
@staticmethod
def generate_snippets(
webvtt_content: WebVTTContent,
query: str,
max_snippets: NonNegativeInt = DEFAULT_MAX_SNIPPETS,
) -> list[str]:
"""Generate snippets from WebVTT content."""
return SnippetGenerator.generate(
WebVTTProcessor.extract_text(webvtt_content),
query,
max_snippets=max_snippets,
)
@dataclass(frozen=True)
class SnippetCandidate:
"""Represents a candidate snippet with its position."""
_text: str
start: NonNegativeInt
_original_text_length: int
@property
def end(self) -> NonNegativeInt:
"""Calculate end position from start and raw text length."""
return self.start + len(self._text)
def text(self) -> str:
"""Get display text with ellipses added if needed."""
result = self._text.strip()
if self.start > 0:
result = "..." + result
if self.end < self._original_text_length:
result = result + "..."
return result
class SearchParameters(BaseModel):
"""Validated search parameters for full-text search."""
@@ -41,6 +130,7 @@ class SearchParameters(BaseModel):
offset: SearchOffset = 0
user_id: str | None = None
room_id: str | None = None
source_kind: SourceKind | None = None
class SearchResultDB(BaseModel):
@@ -64,13 +154,18 @@ class SearchResult(BaseModel):
title: str | None = None
user_id: str | None = None
room_id: str | None = None
room_name: str | None = None
source_kind: SourceKind
created_at: datetime
status: str = Field(..., min_length=1)
rank: float = Field(..., ge=0, le=1)
duration: float | None = Field(..., ge=0, description="Duration in seconds")
duration: NonNegativeFloat | None = Field(..., description="Duration in seconds")
search_snippets: list[str] = Field(
description="Text snippets around search matches"
)
total_match_count: NonNegativeInt = Field(
default=0, description="Total number of matches found in the transcript"
)
@field_serializer("created_at", when_used="json")
def serialize_datetime(self, dt: datetime) -> str:
@@ -79,84 +174,153 @@ class SearchResult(BaseModel):
return dt.isoformat()
class SearchController:
"""Controller for search operations across different entities."""
class SnippetGenerator:
"""Stateless generator for text snippets and match operations."""
@staticmethod
def _extract_webvtt_text(webvtt_content: str) -> str:
"""Extract plain text from WebVTT content using webvtt library."""
if not webvtt_content:
return ""
def find_all_matches(text: str, query: str) -> Iterator[int]:
"""Generate all match positions for a query in text."""
if not text:
logger.warning("Empty text for search query in find_all_matches")
return
if not query:
logger.warning("Empty query for search text in find_all_matches")
return
try:
buffer = StringIO(webvtt_content)
vtt = webvtt.read_buffer(buffer)
return " ".join(caption.text for caption in vtt if caption.text)
except (webvtt.errors.MalformedFileError, UnicodeDecodeError, ValueError) as e:
logger.warning(f"Failed to parse WebVTT content: {e}", exc_info=e)
return ""
except AttributeError as e:
logger.warning(f"WebVTT parsing error - unexpected format: {e}", exc_info=e)
return ""
text_lower = text.lower()
query_lower = query.lower()
start = 0
prev_start = start
while (pos := text_lower.find(query_lower, start)) != -1:
yield pos
start = pos + len(query_lower)
if start <= prev_start:
raise ValueError("panic! find_all_matches is not incremental")
prev_start = start
@staticmethod
def _generate_snippets(
def count_matches(text: str, query: str) -> NonNegativeInt:
"""Count total number of matches for a query in text."""
ZERO = NonNegativeInt(0)
if not text:
logger.warning("Empty text for search query in count_matches")
return ZERO
if not query:
logger.warning("Empty query for search text in count_matches")
return ZERO
return NonNegativeInt(
sum(1 for _ in SnippetGenerator.find_all_matches(text, query))
)
@staticmethod
def create_snippet(
text: str, match_pos: int, max_length: int = DEFAULT_SNIPPET_MAX_LENGTH
) -> SnippetCandidate:
"""Create a snippet from a match position."""
snippet_start = NonNegativeInt(max(0, match_pos - SNIPPET_CONTEXT_LENGTH))
snippet_end = min(len(text), match_pos + max_length - SNIPPET_CONTEXT_LENGTH)
snippet_text = text[snippet_start:snippet_end]
return SnippetCandidate(
_text=snippet_text, start=snippet_start, _original_text_length=len(text)
)
@staticmethod
def filter_non_overlapping(
candidates: Iterator[SnippetCandidate],
) -> Iterator[str]:
"""Filter out overlapping snippets and return only display text."""
last_end = 0
for candidate in candidates:
display_text = candidate.text()
# it means that next overlapping snippets simply don't get included
# it's fine as simplistic logic and users probably won't care much because they already have their search results just fin
if candidate.start >= last_end and display_text:
yield display_text
last_end = candidate.end
@staticmethod
def generate(
text: str,
q: SearchQuery,
max_length: int = DEFAULT_SNIPPET_MAX_LENGTH,
max_snippets: int = DEFAULT_MAX_SNIPPETS,
query: str,
max_length: NonNegativeInt = DEFAULT_SNIPPET_MAX_LENGTH,
max_snippets: NonNegativeInt = DEFAULT_MAX_SNIPPETS,
) -> list[str]:
"""Generate multiple snippets around all occurrences of search term."""
if not text or not q:
"""Generate snippets from text."""
if not text or not query:
logger.warning("Empty text or query for generate_snippets")
return []
snippets = []
lower_text = text.lower()
search_lower = q.lower()
candidates = (
SnippetGenerator.create_snippet(text, pos, max_length)
for pos in SnippetGenerator.find_all_matches(text, query)
)
filtered = SnippetGenerator.filter_non_overlapping(candidates)
snippets = list(itertools.islice(filtered, max_snippets))
last_snippet_end = 0
start_pos = 0
while len(snippets) < max_snippets:
match_pos = lower_text.find(search_lower, start_pos)
if match_pos == -1:
if not snippets and search_lower.split():
first_word = search_lower.split()[0]
match_pos = lower_text.find(first_word, start_pos)
if match_pos == -1:
break
else:
break
snippet_start = max(0, match_pos - SNIPPET_CONTEXT_LENGTH)
snippet_end = min(
len(text), match_pos + max_length - SNIPPET_CONTEXT_LENGTH
)
if snippet_start < last_snippet_end:
start_pos = match_pos + len(search_lower)
continue
snippet = text[snippet_start:snippet_end]
if snippet_start > 0:
snippet = "..." + snippet
if snippet_end < len(text):
snippet = snippet + "..."
snippet = snippet.strip()
if snippet:
snippets.append(snippet)
last_snippet_end = snippet_end
start_pos = match_pos + len(search_lower)
if start_pos >= len(text):
break
# Fallback to first word search if no full matches
# it's another assumption: proper snippet logic generation is quite complicated and tied to db logic, so simplification is used here
if not snippets and " " in query:
first_word = query.split()[0]
return SnippetGenerator.generate(text, first_word, max_length, max_snippets)
return snippets
@staticmethod
def from_summary(
summary: str,
query: str,
max_snippets: NonNegativeInt = LONG_SUMMARY_MAX_SNIPPETS,
) -> list[str]:
"""Generate snippets from summary text."""
return SnippetGenerator.generate(summary, query, max_snippets=max_snippets)
@staticmethod
def combine_sources(
summary: str | None,
webvtt: WebVTTContent | None,
query: str,
max_total: NonNegativeInt = DEFAULT_MAX_SNIPPETS,
) -> tuple[list[str], NonNegativeInt]:
"""Combine snippets from multiple sources and return total match count.
Returns (snippets, total_match_count) tuple.
snippets can be empty for real in case of e.g. title match
"""
webvtt_matches = 0
summary_matches = 0
if webvtt:
webvtt_text = WebVTTProcessor.extract_text(webvtt)
webvtt_matches = SnippetGenerator.count_matches(webvtt_text, query)
if summary:
summary_matches = SnippetGenerator.count_matches(summary, query)
total_matches = NonNegativeInt(webvtt_matches + summary_matches)
summary_snippets = (
SnippetGenerator.from_summary(summary, query) if summary else []
)
if len(summary_snippets) >= max_total:
return summary_snippets[:max_total], total_matches
remaining = max_total - len(summary_snippets)
webvtt_snippets = (
WebVTTProcessor.generate_snippets(webvtt, query, remaining)
if webvtt
else []
)
return summary_snippets + webvtt_snippets, total_matches
class SearchController:
"""Controller for search operations across different entities."""
@classmethod
async def search_transcripts(
cls, params: SearchParameters
@@ -172,39 +336,70 @@ class SearchController:
)
return [], 0
search_query = sqlalchemy.func.websearch_to_tsquery(
"english", params.query_text
base_columns = [
transcripts.c.id,
transcripts.c.title,
transcripts.c.created_at,
transcripts.c.duration,
transcripts.c.status,
transcripts.c.user_id,
transcripts.c.room_id,
transcripts.c.source_kind,
transcripts.c.webvtt,
transcripts.c.long_summary,
sqlalchemy.case(
(
transcripts.c.room_id.isnot(None) & rooms.c.id.is_(None),
"Deleted Room",
),
else_=rooms.c.name,
).label("room_name"),
]
if params.query_text:
search_query = sqlalchemy.func.websearch_to_tsquery(
"english", params.query_text
)
rank_column = sqlalchemy.func.ts_rank(
transcripts.c.search_vector_en,
search_query,
32, # normalization flag: rank/(rank+1) for 0-1 range
).label("rank")
else:
rank_column = sqlalchemy.cast(1.0, sqlalchemy.Float).label("rank")
columns = base_columns + [rank_column]
base_query = sqlalchemy.select(columns).select_from(
transcripts.join(rooms, transcripts.c.room_id == rooms.c.id, isouter=True)
)
base_query = sqlalchemy.select(
[
transcripts.c.id,
transcripts.c.title,
transcripts.c.created_at,
transcripts.c.duration,
transcripts.c.status,
transcripts.c.user_id,
transcripts.c.room_id,
transcripts.c.source_kind,
transcripts.c.webvtt,
sqlalchemy.func.ts_rank(
transcripts.c.search_vector_en,
search_query,
32, # normalization flag: rank/(rank+1) for 0-1 range
).label("rank"),
]
).where(transcripts.c.search_vector_en.op("@@")(search_query))
if params.query_text:
base_query = base_query.where(
transcripts.c.search_vector_en.op("@@")(search_query)
)
if params.user_id:
base_query = base_query.where(transcripts.c.user_id == params.user_id)
base_query = base_query.where(
sqlalchemy.or_(
transcripts.c.user_id == params.user_id, rooms.c.is_shared
)
)
else:
base_query = base_query.where(rooms.c.is_shared)
if params.room_id:
base_query = base_query.where(transcripts.c.room_id == params.room_id)
if params.source_kind:
base_query = base_query.where(
transcripts.c.source_kind == params.source_kind
)
if params.query_text:
order_by = sqlalchemy.desc(sqlalchemy.text("rank"))
else:
order_by = sqlalchemy.desc(transcripts.c.created_at)
query = base_query.order_by(order_by).limit(params.limit).offset(params.offset)
query = (
base_query.order_by(sqlalchemy.desc(sqlalchemy.text("rank")))
.limit(params.limit)
.offset(params.offset)
)
rs = await get_database().fetch_all(query)
count_query = sqlalchemy.select([sqlalchemy.func.count()]).select_from(
@@ -214,18 +409,40 @@ class SearchController:
def _process_result(r) -> SearchResult:
r_dict: Dict[str, Any] = dict(r)
webvtt: str | None = r_dict.pop("webvtt", None)
webvtt_raw: str | None = r_dict.pop("webvtt", None)
if webvtt_raw:
webvtt = WebVTTProcessor.parse(webvtt_raw)
else:
webvtt = None
long_summary: str | None = r_dict.pop("long_summary", None)
room_name: str | None = r_dict.pop("room_name", None)
db_result = SearchResultDB.model_validate(r_dict)
snippets = []
if webvtt:
plain_text = cls._extract_webvtt_text(webvtt)
snippets = cls._generate_snippets(plain_text, params.query_text)
snippets, total_match_count = SnippetGenerator.combine_sources(
long_summary, webvtt, params.query_text, DEFAULT_MAX_SNIPPETS
)
return SearchResult(**db_result.model_dump(), search_snippets=snippets)
return SearchResult(
**db_result.model_dump(),
room_name=room_name,
search_snippets=snippets,
total_match_count=total_match_count,
)
try:
results = [_process_result(r) for r in rs]
except ValidationError as e:
logger.error(f"Invalid search result data: {e}", exc_info=True)
raise HTTPException(
status_code=500, detail="Internal search result data consistency error"
)
except Exception as e:
logger.error(f"Error processing search results: {e}", exc_info=True)
raise
results = [_process_result(r) for r in rs]
return results, total
search_controller = SearchController()
webvtt_processor = WebVTTProcessor()
snippet_generator = SnippetGenerator()

View File

@@ -88,6 +88,8 @@ transcripts = sqlalchemy.Table(
sqlalchemy.Index("idx_transcript_created_at", "created_at"),
sqlalchemy.Index("idx_transcript_user_id_recording_id", "user_id", "recording_id"),
sqlalchemy.Index("idx_transcript_room_id", "room_id"),
sqlalchemy.Index("idx_transcript_source_kind", "source_kind"),
sqlalchemy.Index("idx_transcript_room_id_created_at", "room_id", "created_at"),
)
# Add PostgreSQL-specific full-text search column
@@ -99,7 +101,8 @@ if is_postgresql():
TSVECTOR,
sqlalchemy.Computed(
"setweight(to_tsvector('english', coalesce(title, '')), 'A') || "
"setweight(to_tsvector('english', coalesce(webvtt, '')), 'B')",
"setweight(to_tsvector('english', coalesce(long_summary, '')), 'B') || "
"setweight(to_tsvector('english', coalesce(webvtt, '')), 'C')",
persisted=True,
),
)
@@ -119,6 +122,15 @@ def generate_transcript_name() -> str:
return f"Transcript {now.strftime('%Y-%m-%d %H:%M:%S')}"
TranscriptStatus = Literal[
"idle", "uploaded", "recording", "processing", "error", "ended"
]
class StrValue(BaseModel):
value: str
class AudioWaveform(BaseModel):
data: list[float]
@@ -182,7 +194,7 @@ class Transcript(BaseModel):
id: str = Field(default_factory=generate_uuid4)
user_id: str | None = None
name: str = Field(default_factory=generate_transcript_name)
status: str = "idle"
status: TranscriptStatus = "idle"
duration: float = 0
created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
title: str | None = None
@@ -729,5 +741,27 @@ class TranscriptController:
transcript.delete_participant(participant_id)
await self.update(transcript, {"participants": transcript.participants_dump()})
async def set_status(
self, transcript_id: str, status: TranscriptStatus
) -> TranscriptEvent | None:
"""
Update the status of a transcript
Will add an event STATUS + update the status field of transcript
"""
async with self.transaction():
transcript = await self.get_by_id(transcript_id)
if not transcript:
raise Exception(f"Transcript {transcript_id} not found")
if transcript.status == status:
return
resp = await self.append_event(
transcript=transcript,
event="STATUS",
data=StrValue(value=status),
)
await self.update(transcript, {"status": status})
return resp
transcripts_controller = TranscriptController()

View File

@@ -0,0 +1,421 @@
"""
File-based processing pipeline
==============================
Optimized pipeline for processing complete audio/video files.
Uses parallel processing for transcription, diarization, and waveform generation.
"""
import asyncio
import uuid
from pathlib import Path
import av
import structlog
from celery import shared_task
from reflector.asynctask import asynctask
from reflector.db.rooms import rooms_controller
from reflector.db.transcripts import (
SourceKind,
Transcript,
TranscriptStatus,
transcripts_controller,
)
from reflector.logger import logger
from reflector.pipelines.main_live_pipeline import (
PipelineMainBase,
broadcast_to_sockets,
)
from reflector.processors import (
AudioFileWriterProcessor,
TranscriptFinalSummaryProcessor,
TranscriptFinalTitleProcessor,
TranscriptTopicDetectorProcessor,
)
from reflector.processors.audio_waveform_processor import AudioWaveformProcessor
from reflector.processors.file_diarization import FileDiarizationInput
from reflector.processors.file_diarization_auto import FileDiarizationAutoProcessor
from reflector.processors.file_transcript import FileTranscriptInput
from reflector.processors.file_transcript_auto import FileTranscriptAutoProcessor
from reflector.processors.transcript_diarization_assembler import (
TranscriptDiarizationAssemblerInput,
TranscriptDiarizationAssemblerProcessor,
)
from reflector.processors.types import (
DiarizationSegment,
TitleSummary,
)
from reflector.processors.types import (
Transcript as TranscriptType,
)
from reflector.settings import settings
from reflector.storage import get_transcripts_storage
from reflector.worker.webhook import send_transcript_webhook
class EmptyPipeline:
"""Empty pipeline for processors that need a pipeline reference"""
def __init__(self, logger: structlog.BoundLogger):
self.logger = logger
def get_pref(self, k, d=None):
return d
async def emit(self, event):
pass
class PipelineMainFile(PipelineMainBase):
"""
Optimized file processing pipeline.
Processes complete audio/video files with parallel execution.
"""
logger: structlog.BoundLogger = None
empty_pipeline = None
def __init__(self, transcript_id: str):
super().__init__(transcript_id=transcript_id)
self.logger = logger.bind(transcript_id=self.transcript_id)
self.empty_pipeline = EmptyPipeline(logger=self.logger)
def _handle_gather_exceptions(self, results: list, operation: str) -> None:
"""Handle exceptions from asyncio.gather with return_exceptions=True"""
for i, result in enumerate(results):
if not isinstance(result, Exception):
continue
self.logger.error(
f"Error in {operation} (task {i}): {result}",
transcript_id=self.transcript_id,
exc_info=result,
)
@broadcast_to_sockets
async def set_status(self, transcript_id: str, status: TranscriptStatus):
async with self.lock_transaction():
return await transcripts_controller.set_status(transcript_id, status)
async def process(self, file_path: Path):
"""Main entry point for file processing"""
self.logger.info(f"Starting file pipeline for {file_path}")
transcript = await self.get_transcript()
# Clear transcript as we're going to regenerate everything
async with self.transaction():
await transcripts_controller.update(
transcript,
{
"events": [],
"topics": [],
},
)
# Extract audio and write to transcript location
audio_path = await self.extract_and_write_audio(file_path, transcript)
# Upload for processing
audio_url = await self.upload_audio(audio_path, transcript)
# Run parallel processing
await self.run_parallel_processing(
audio_path,
audio_url,
transcript.source_language,
transcript.target_language,
)
self.logger.info("File pipeline complete")
await transcripts_controller.set_status(transcript.id, "ended")
async def extract_and_write_audio(
self, file_path: Path, transcript: Transcript
) -> Path:
"""Extract audio from video if needed and write to transcript location as MP3"""
self.logger.info(f"Processing audio file: {file_path}")
# Check if it's already audio-only
container = av.open(str(file_path))
has_video = len(container.streams.video) > 0
container.close()
# Use AudioFileWriterProcessor to write MP3 to transcript location
mp3_writer = AudioFileWriterProcessor(
path=transcript.audio_mp3_filename,
on_duration=self.on_duration,
)
# Process audio frames and write to transcript location
input_container = av.open(str(file_path))
for frame in input_container.decode(audio=0):
await mp3_writer.push(frame)
await mp3_writer.flush()
input_container.close()
if has_video:
self.logger.info(
f"Extracted audio from video and saved to {transcript.audio_mp3_filename}"
)
else:
self.logger.info(
f"Converted audio file and saved to {transcript.audio_mp3_filename}"
)
return transcript.audio_mp3_filename
async def upload_audio(self, audio_path: Path, transcript: Transcript) -> str:
"""Upload audio to storage for processing"""
storage = get_transcripts_storage()
if not storage:
raise Exception(
"Storage backend required for file processing. Configure TRANSCRIPT_STORAGE_* settings."
)
self.logger.info("Uploading audio to storage")
with open(audio_path, "rb") as f:
audio_data = f.read()
storage_path = f"file_pipeline/{transcript.id}/audio.mp3"
await storage.put_file(storage_path, audio_data)
audio_url = await storage.get_file_url(storage_path)
self.logger.info(f"Audio uploaded to {audio_url}")
return audio_url
async def run_parallel_processing(
self,
audio_path: Path,
audio_url: str,
source_language: str,
target_language: str,
):
"""Coordinate parallel processing of transcription, diarization, and waveform"""
self.logger.info(
"Starting parallel processing", transcript_id=self.transcript_id
)
# Phase 1: Parallel processing of independent tasks
transcription_task = self.transcribe_file(audio_url, source_language)
diarization_task = self.diarize_file(audio_url)
waveform_task = self.generate_waveform(audio_path)
results = await asyncio.gather(
transcription_task, diarization_task, waveform_task, return_exceptions=True
)
transcript_result = results[0]
diarization_result = results[1]
# Handle errors - raise any exception that occurred
self._handle_gather_exceptions(results, "parallel processing")
for result in results:
if isinstance(result, Exception):
raise result
# Phase 2: Assemble transcript with diarization
self.logger.info(
"Assembling transcript with diarization", transcript_id=self.transcript_id
)
processor = TranscriptDiarizationAssemblerProcessor()
input_data = TranscriptDiarizationAssemblerInput(
transcript=transcript_result, diarization=diarization_result or []
)
# Store result for retrieval
diarized_transcript: Transcript | None = None
async def capture_result(transcript):
nonlocal diarized_transcript
diarized_transcript = transcript
processor.on(capture_result)
await processor.push(input_data)
await processor.flush()
if not diarized_transcript:
raise ValueError("No diarized transcript captured")
# Phase 3: Generate topics from diarized transcript
self.logger.info("Generating topics", transcript_id=self.transcript_id)
topics = await self.detect_topics(diarized_transcript, target_language)
# Phase 4: Generate title and summaries in parallel
self.logger.info(
"Generating title and summaries", transcript_id=self.transcript_id
)
results = await asyncio.gather(
self.generate_title(topics),
self.generate_summaries(topics),
return_exceptions=True,
)
self._handle_gather_exceptions(results, "title and summary generation")
async def transcribe_file(self, audio_url: str, language: str) -> TranscriptType:
"""Transcribe complete file"""
processor = FileTranscriptAutoProcessor()
input_data = FileTranscriptInput(audio_url=audio_url, language=language)
# Store result for retrieval
result: TranscriptType | None = None
async def capture_result(transcript):
nonlocal result
result = transcript
processor.on(capture_result)
await processor.push(input_data)
await processor.flush()
if not result:
raise ValueError("No transcript captured")
return result
async def diarize_file(self, audio_url: str) -> list[DiarizationSegment] | None:
"""Get diarization for file"""
if not settings.DIARIZATION_BACKEND:
self.logger.info("Diarization disabled")
return None
processor = FileDiarizationAutoProcessor()
input_data = FileDiarizationInput(audio_url=audio_url)
# Store result for retrieval
result = None
async def capture_result(diarization_output):
nonlocal result
result = diarization_output.diarization
try:
processor.on(capture_result)
await processor.push(input_data)
await processor.flush()
return result
except Exception as e:
self.logger.error(f"Diarization failed: {e}")
return None
async def generate_waveform(self, audio_path: Path):
"""Generate and save waveform"""
transcript = await self.get_transcript()
processor = AudioWaveformProcessor(
audio_path=audio_path,
waveform_path=transcript.audio_waveform_filename,
on_waveform=self.on_waveform,
)
processor.set_pipeline(self.empty_pipeline)
await processor.flush()
async def detect_topics(
self, transcript: TranscriptType, target_language: str
) -> list[TitleSummary]:
"""Detect topics from complete transcript"""
chunk_size = 300
topics: list[TitleSummary] = []
async def on_topic(topic: TitleSummary):
topics.append(topic)
return await self.on_topic(topic)
topic_detector = TranscriptTopicDetectorProcessor(callback=on_topic)
topic_detector.set_pipeline(self.empty_pipeline)
for i in range(0, len(transcript.words), chunk_size):
chunk_words = transcript.words[i : i + chunk_size]
if not chunk_words:
continue
chunk_transcript = TranscriptType(
words=chunk_words, translation=transcript.translation
)
await topic_detector.push(chunk_transcript)
await topic_detector.flush()
return topics
async def generate_title(self, topics: list[TitleSummary]):
"""Generate title from topics"""
if not topics:
self.logger.warning("No topics for title generation")
return
processor = TranscriptFinalTitleProcessor(callback=self.on_title)
processor.set_pipeline(self.empty_pipeline)
for topic in topics:
await processor.push(topic)
await processor.flush()
async def generate_summaries(self, topics: list[TitleSummary]):
"""Generate long and short summaries from topics"""
if not topics:
self.logger.warning("No topics for summary generation")
return
transcript = await self.get_transcript()
processor = TranscriptFinalSummaryProcessor(
transcript=transcript,
callback=self.on_long_summary,
on_short_summary=self.on_short_summary,
)
processor.set_pipeline(self.empty_pipeline)
for topic in topics:
await processor.push(topic)
await processor.flush()
@shared_task
@asynctask
async def task_pipeline_file_process(*, transcript_id: str):
"""Celery task for file pipeline processing"""
transcript = await transcripts_controller.get_by_id(transcript_id)
if not transcript:
raise Exception(f"Transcript {transcript_id} not found")
pipeline = PipelineMainFile(transcript_id=transcript_id)
try:
await pipeline.set_status(transcript_id, "processing")
# Find the file to process
audio_file = next(transcript.data_path.glob("upload.*"), None)
if not audio_file:
audio_file = next(transcript.data_path.glob("audio.*"), None)
if not audio_file:
raise Exception("No audio file found to process")
await pipeline.process(audio_file)
except Exception:
await pipeline.set_status(transcript_id, "error")
raise
# Trigger webhook if this is a room recording with webhook configured
if transcript.source_kind == SourceKind.ROOM and transcript.room_id:
room = await rooms_controller.get_by_id(transcript.room_id)
if room and room.webhook_url:
logger.info(
"Dispatching webhook task",
transcript_id=transcript_id,
room_id=room.id,
webhook_url=room.webhook_url,
)
send_transcript_webhook.delay(
transcript_id, room.id, event_id=uuid.uuid4().hex
)

View File

@@ -22,7 +22,7 @@ from celery import chord, current_task, group, shared_task
from pydantic import BaseModel
from structlog import BoundLogger as Logger
from reflector.db import get_database
from reflector.asynctask import asynctask
from reflector.db.meetings import meeting_consent_controller, meetings_controller
from reflector.db.recordings import recordings_controller
from reflector.db.rooms import rooms_controller
@@ -32,6 +32,7 @@ from reflector.db.transcripts import (
TranscriptFinalLongSummary,
TranscriptFinalShortSummary,
TranscriptFinalTitle,
TranscriptStatus,
TranscriptText,
TranscriptTopic,
TranscriptWaveform,
@@ -40,8 +41,9 @@ from reflector.db.transcripts import (
from reflector.logger import logger
from reflector.pipelines.runner import PipelineMessage, PipelineRunner
from reflector.processors import (
AudioChunkerProcessor,
AudioChunkerAutoProcessor,
AudioDiarizationAutoProcessor,
AudioDownscaleProcessor,
AudioFileWriterProcessor,
AudioMergeProcessor,
AudioTranscriptAutoProcessor,
@@ -68,29 +70,6 @@ from reflector.zulip import (
)
def asynctask(f):
@functools.wraps(f)
def wrapper(*args, **kwargs):
async def run_with_db():
database = get_database()
await database.connect()
try:
return await f(*args, **kwargs)
finally:
await database.disconnect()
coro = run_with_db()
try:
loop = asyncio.get_running_loop()
except RuntimeError:
loop = None
if loop and loop.is_running():
return loop.run_until_complete(coro)
return asyncio.run(coro)
return wrapper
def broadcast_to_sockets(func):
"""
Decorator to broadcast transcript event to websockets
@@ -147,15 +126,18 @@ class StrValue(BaseModel):
class PipelineMainBase(PipelineRunner[PipelineMessage], Generic[PipelineMessage]):
transcript_id: str
ws_room_id: str | None = None
ws_manager: WebsocketManager | None = None
def prepare(self):
# prepare websocket
def __init__(self, transcript_id: str):
super().__init__()
self._lock = asyncio.Lock()
self.transcript_id = transcript_id
self.ws_room_id = f"ts:{self.transcript_id}"
self.ws_manager = get_ws_manager()
self._ws_manager = None
@property
def ws_manager(self) -> WebsocketManager:
if self._ws_manager is None:
self._ws_manager = get_ws_manager()
return self._ws_manager
async def get_transcript(self) -> Transcript:
# fetch the transcript
@@ -184,8 +166,15 @@ class PipelineMainBase(PipelineRunner[PipelineMessage], Generic[PipelineMessage]
]
@asynccontextmanager
async def transaction(self):
async def lock_transaction(self):
# This lock is to prevent multiple processor starting adding
# into event array at the same time
async with self._lock:
yield
@asynccontextmanager
async def transaction(self):
async with self.lock_transaction():
async with transcripts_controller.transaction():
yield
@@ -194,14 +183,14 @@ class PipelineMainBase(PipelineRunner[PipelineMessage], Generic[PipelineMessage]
# if it's the first part, update the status of the transcript
# but do not set the ended status yet.
if isinstance(self, PipelineMainLive):
status_mapping = {
status_mapping: dict[str, TranscriptStatus] = {
"started": "recording",
"push": "recording",
"flush": "processing",
"error": "error",
}
elif isinstance(self, PipelineMainFinalSummaries):
status_mapping = {
status_mapping: dict[str, TranscriptStatus] = {
"push": "processing",
"flush": "processing",
"error": "error",
@@ -217,22 +206,8 @@ class PipelineMainBase(PipelineRunner[PipelineMessage], Generic[PipelineMessage]
return
# when the status of the pipeline changes, update the transcript
async with self.transaction():
transcript = await self.get_transcript()
if status == transcript.status:
return
resp = await transcripts_controller.append_event(
transcript=transcript,
event="STATUS",
data=StrValue(value=status),
)
await transcripts_controller.update(
transcript,
{
"status": status,
},
)
return resp
async with self._lock:
return await transcripts_controller.set_status(self.transcript_id, status)
@broadcast_to_sockets
async def on_transcript(self, data):
@@ -355,7 +330,6 @@ class PipelineMainLive(PipelineMainBase):
async def create(self) -> Pipeline:
# create a context for the whole rtc transaction
# add a customised logger to the context
self.prepare()
transcript = await self.get_transcript()
processors = [
@@ -363,7 +337,8 @@ class PipelineMainLive(PipelineMainBase):
path=transcript.audio_wav_filename,
on_duration=self.on_duration,
),
AudioChunkerProcessor(),
AudioDownscaleProcessor(),
AudioChunkerAutoProcessor(),
AudioMergeProcessor(),
AudioTranscriptAutoProcessor.as_threaded(),
TranscriptLinerProcessor(),
@@ -376,6 +351,7 @@ class PipelineMainLive(PipelineMainBase):
pipeline.set_pref("audio:target_language", transcript.target_language)
pipeline.logger.bind(transcript_id=transcript.id)
pipeline.logger.info("Pipeline main live created")
pipeline.describe()
return pipeline
@@ -394,7 +370,6 @@ class PipelineMainDiarization(PipelineMainBase[AudioDiarizationInput]):
async def create(self) -> Pipeline:
# create a context for the whole rtc transaction
# add a customised logger to the context
self.prepare()
pipeline = Pipeline(
AudioDiarizationAutoProcessor(callback=self.on_topic),
)
@@ -435,8 +410,6 @@ class PipelineMainFromTopics(PipelineMainBase[TitleSummaryWithIdProcessorType]):
raise NotImplementedError
async def create(self) -> Pipeline:
self.prepare()
# get transcript
self._transcript = transcript = await self.get_transcript()
@@ -792,7 +765,7 @@ def pipeline_post(*, transcript_id: str):
chain_final_summaries,
) | task_pipeline_post_to_zulip.si(transcript_id=transcript_id)
chain.delay()
return chain.delay()
@get_transcript

View File

@@ -18,22 +18,14 @@ During its lifecycle, it will emit the following status:
import asyncio
from typing import Generic, TypeVar
from pydantic import BaseModel, ConfigDict
from reflector.logger import logger
from reflector.processors import Pipeline
PipelineMessage = TypeVar("PipelineMessage")
class PipelineRunner(BaseModel, Generic[PipelineMessage]):
model_config = ConfigDict(arbitrary_types_allowed=True)
status: str = "idle"
pipeline: Pipeline | None = None
def __init__(self, **kwargs):
super().__init__(**kwargs)
class PipelineRunner(Generic[PipelineMessage]):
def __init__(self):
self._task = None
self._q_cmd = asyncio.Queue(maxsize=4096)
self._ev_done = asyncio.Event()
@@ -42,6 +34,8 @@ class PipelineRunner(BaseModel, Generic[PipelineMessage]):
runner=id(self),
runner_cls=self.__class__.__name__,
)
self.status = "idle"
self.pipeline: Pipeline | None = None
async def create(self) -> Pipeline:
"""

View File

@@ -1,5 +1,7 @@
from .audio_chunker import AudioChunkerProcessor # noqa: F401
from .audio_chunker_auto import AudioChunkerAutoProcessor # noqa: F401
from .audio_diarization_auto import AudioDiarizationAutoProcessor # noqa: F401
from .audio_downscale import AudioDownscaleProcessor # noqa: F401
from .audio_file_writer import AudioFileWriterProcessor # noqa: F401
from .audio_merge import AudioMergeProcessor # noqa: F401
from .audio_transcript import AudioTranscriptProcessor # noqa: F401
@@ -11,6 +13,13 @@ from .base import ( # noqa: F401
Processor,
ThreadedProcessor,
)
from .file_diarization import FileDiarizationProcessor # noqa: F401
from .file_diarization_auto import FileDiarizationAutoProcessor # noqa: F401
from .file_transcript import FileTranscriptProcessor # noqa: F401
from .file_transcript_auto import FileTranscriptAutoProcessor # noqa: F401
from .transcript_diarization_assembler import (
TranscriptDiarizationAssemblerProcessor, # noqa: F401
)
from .transcript_final_summary import TranscriptFinalSummaryProcessor # noqa: F401
from .transcript_final_title import TranscriptFinalTitleProcessor # noqa: F401
from .transcript_liner import TranscriptLinerProcessor # noqa: F401

View File

@@ -1,28 +1,78 @@
from typing import Optional
import av
from prometheus_client import Counter, Histogram
from reflector.processors.base import Processor
class AudioChunkerProcessor(Processor):
"""
Assemble audio frames into chunks
Base class for assembling audio frames into chunks
"""
INPUT_TYPE = av.AudioFrame
OUTPUT_TYPE = list[av.AudioFrame]
def __init__(self, max_frames=256):
super().__init__()
m_chunk = Histogram(
"audio_chunker",
"Time spent in AudioChunker.chunk",
["backend"],
)
m_chunk_call = Counter(
"audio_chunker_call",
"Number of calls to AudioChunker.chunk",
["backend"],
)
m_chunk_success = Counter(
"audio_chunker_success",
"Number of successful calls to AudioChunker.chunk",
["backend"],
)
m_chunk_failure = Counter(
"audio_chunker_failure",
"Number of failed calls to AudioChunker.chunk",
["backend"],
)
def __init__(self, *args, **kwargs):
name = self.__class__.__name__
self.m_chunk = self.m_chunk.labels(name)
self.m_chunk_call = self.m_chunk_call.labels(name)
self.m_chunk_success = self.m_chunk_success.labels(name)
self.m_chunk_failure = self.m_chunk_failure.labels(name)
super().__init__(*args, **kwargs)
self.frames: list[av.AudioFrame] = []
self.max_frames = max_frames
async def _push(self, data: av.AudioFrame):
self.frames.append(data)
if len(self.frames) >= self.max_frames:
await self.flush()
"""Process incoming audio frame"""
# Validate audio format on first frame
if len(self.frames) == 0:
if data.sample_rate != 16000 or len(data.layout.channels) != 1:
raise ValueError(
f"AudioChunkerProcessor expects 16kHz mono audio, got {data.sample_rate}Hz "
f"with {len(data.layout.channels)} channel(s). "
f"Use AudioDownscaleProcessor before this processor."
)
try:
self.m_chunk_call.inc()
with self.m_chunk.time():
result = await self._chunk(data)
self.m_chunk_success.inc()
if result:
await self.emit(result)
except Exception:
self.m_chunk_failure.inc()
raise
async def _chunk(self, data: av.AudioFrame) -> Optional[list[av.AudioFrame]]:
"""
Process audio frame and return chunk when ready.
Subclasses should implement their chunking logic here.
"""
raise NotImplementedError
async def _flush(self):
frames = self.frames[:]
self.frames = []
if frames:
await self.emit(frames)
"""Flush any remaining frames when processing ends"""
raise NotImplementedError

View File

@@ -0,0 +1,32 @@
import importlib
from reflector.processors.audio_chunker import AudioChunkerProcessor
from reflector.settings import settings
class AudioChunkerAutoProcessor(AudioChunkerProcessor):
_registry = {}
@classmethod
def register(cls, name, kclass):
cls._registry[name] = kclass
def __new__(cls, name: str | None = None, **kwargs):
if name is None:
name = settings.AUDIO_CHUNKER_BACKEND
if name not in cls._registry:
module_name = f"reflector.processors.audio_chunker_{name}"
importlib.import_module(module_name)
# gather specific configuration for the processor
# search `AUDIO_CHUNKER_BACKEND_XXX_YYY`, push to constructor as `backend_xxx_yyy`
config = {}
name_upper = name.upper()
settings_prefix = "AUDIO_CHUNKER_"
config_prefix = f"{settings_prefix}{name_upper}_"
for key, value in settings:
if key.startswith(config_prefix):
config_name = key[len(settings_prefix) :].lower()
config[config_name] = value
return cls._registry[name](**config | kwargs)

View File

@@ -0,0 +1,34 @@
from typing import Optional
import av
from reflector.processors.audio_chunker import AudioChunkerProcessor
from reflector.processors.audio_chunker_auto import AudioChunkerAutoProcessor
class AudioChunkerFramesProcessor(AudioChunkerProcessor):
"""
Simple frame-based audio chunker that emits chunks after a fixed number of frames
"""
def __init__(self, max_frames=256, **kwargs):
super().__init__(**kwargs)
self.max_frames = max_frames
async def _chunk(self, data: av.AudioFrame) -> Optional[list[av.AudioFrame]]:
self.frames.append(data)
if len(self.frames) >= self.max_frames:
frames_to_emit = self.frames[:]
self.frames = []
return frames_to_emit
return None
async def _flush(self):
frames = self.frames[:]
self.frames = []
if frames:
await self.emit(frames)
AudioChunkerAutoProcessor.register("frames", AudioChunkerFramesProcessor)

View File

@@ -0,0 +1,298 @@
from typing import Optional
import av
import numpy as np
import torch
from silero_vad import VADIterator, load_silero_vad
from reflector.processors.audio_chunker import AudioChunkerProcessor
from reflector.processors.audio_chunker_auto import AudioChunkerAutoProcessor
class AudioChunkerSileroProcessor(AudioChunkerProcessor):
"""
Assemble audio frames into chunks with VAD-based speech detection using Silero VAD
"""
def __init__(
self,
block_frames=256,
max_frames=1024,
use_onnx=True,
min_frames=2,
**kwargs,
):
super().__init__(**kwargs)
self.block_frames = block_frames
self.max_frames = max_frames
self.min_frames = min_frames
# Initialize Silero VAD
self._init_vad(use_onnx)
def _init_vad(self, use_onnx=False):
"""Initialize Silero VAD model"""
try:
torch.set_num_threads(1)
self.vad_model = load_silero_vad(onnx=use_onnx)
self.vad_iterator = VADIterator(self.vad_model, sampling_rate=16000)
self.logger.info("Silero VAD initialized successfully")
except Exception as e:
self.logger.error(f"Failed to initialize Silero VAD: {e}")
self.vad_model = None
self.vad_iterator = None
async def _chunk(self, data: av.AudioFrame) -> Optional[list[av.AudioFrame]]:
"""Process audio frame and return chunk when ready"""
self.frames.append(data)
# Check for speech segments every 32 frames (~1 second)
if len(self.frames) >= 32 and len(self.frames) % 32 == 0:
return await self._process_block()
# Safety fallback - emit if we hit max frames
elif len(self.frames) >= self.max_frames:
self.logger.warning(
f"AudioChunkerSileroProcessor: Reached max frames ({self.max_frames}), "
f"emitting first {self.max_frames // 2} frames"
)
frames_to_emit = self.frames[: self.max_frames // 2]
self.frames = self.frames[self.max_frames // 2 :]
if len(frames_to_emit) >= self.min_frames:
return frames_to_emit
else:
self.logger.debug(
f"Ignoring fallback segment with {len(frames_to_emit)} frames "
f"(< {self.min_frames} minimum)"
)
return None
async def _process_block(self) -> Optional[list[av.AudioFrame]]:
# Need at least 32 frames for VAD detection (~1 second)
if len(self.frames) < 32 or self.vad_iterator is None:
return None
# Processing block with current buffer size
print(f"Processing block: {len(self.frames)} frames in buffer")
try:
# Convert frames to numpy array for VAD
audio_array = self._frames_to_numpy(self.frames)
if audio_array is None:
# Fallback: emit all frames if conversion failed
frames_to_emit = self.frames[:]
self.frames = []
if len(frames_to_emit) >= self.min_frames:
return frames_to_emit
else:
self.logger.debug(
f"Ignoring conversion-failed segment with {len(frames_to_emit)} frames "
f"(< {self.min_frames} minimum)"
)
return None
# Find complete speech segments in the buffer
speech_end_frame = self._find_speech_segment_end(audio_array)
if speech_end_frame is None or speech_end_frame <= 0:
# No speech found but buffer is getting large
if len(self.frames) > 512:
# Check if it's all silence and can be discarded
# No speech segment found, buffer at {len(self.frames)} frames
# Could emit silence or discard old frames here
# For now, keep first 256 frames and discard older silence
if len(self.frames) > 768:
self.logger.debug(
f"Discarding {len(self.frames) - 256} old frames (likely silence)"
)
self.frames = self.frames[-256:]
return None
# Calculate segment timing information
frames_to_emit = self.frames[:speech_end_frame]
# Get timing from av.AudioFrame
if frames_to_emit:
first_frame = frames_to_emit[0]
last_frame = frames_to_emit[-1]
sample_rate = first_frame.sample_rate
# Calculate duration
total_samples = sum(f.samples for f in frames_to_emit)
duration_seconds = total_samples / sample_rate if sample_rate > 0 else 0
# Get timestamps if available
start_time = (
first_frame.pts * first_frame.time_base if first_frame.pts else 0
)
end_time = (
last_frame.pts * last_frame.time_base if last_frame.pts else 0
)
# Convert to HH:MM:SS format for logging
def format_time(seconds):
if not seconds:
return "00:00:00"
total_seconds = int(float(seconds))
hours = total_seconds // 3600
minutes = (total_seconds % 3600) // 60
secs = total_seconds % 60
return f"{hours:02d}:{minutes:02d}:{secs:02d}"
start_formatted = format_time(start_time)
end_formatted = format_time(end_time)
# Keep remaining frames for next processing
remaining_after = len(self.frames) - speech_end_frame
# Single structured log line
self.logger.info(
"Speech segment found",
start=start_formatted,
end=end_formatted,
frames=speech_end_frame,
duration=round(duration_seconds, 2),
buffer_before=len(self.frames),
remaining=remaining_after,
)
# Keep remaining frames for next processing
self.frames = self.frames[speech_end_frame:]
# Filter out segments with too few frames
if len(frames_to_emit) >= self.min_frames:
return frames_to_emit
else:
self.logger.debug(
f"Ignoring segment with {len(frames_to_emit)} frames "
f"(< {self.min_frames} minimum)"
)
except Exception as e:
self.logger.error(f"Error in VAD processing: {e}")
# Fallback to simple chunking
if len(self.frames) >= self.block_frames:
frames_to_emit = self.frames[: self.block_frames]
self.frames = self.frames[self.block_frames :]
if len(frames_to_emit) >= self.min_frames:
return frames_to_emit
else:
self.logger.debug(
f"Ignoring exception-fallback segment with {len(frames_to_emit)} frames "
f"(< {self.min_frames} minimum)"
)
return None
def _frames_to_numpy(self, frames: list[av.AudioFrame]) -> Optional[np.ndarray]:
"""Convert av.AudioFrame list to numpy array for VAD processing"""
if not frames:
return None
try:
audio_data = []
for frame in frames:
frame_array = frame.to_ndarray()
if len(frame_array.shape) == 2:
frame_array = frame_array.flatten()
audio_data.append(frame_array)
if not audio_data:
return None
combined_audio = np.concatenate(audio_data)
# Ensure float32 format
if combined_audio.dtype == np.int16:
# Normalize int16 audio to float32 in range [-1.0, 1.0]
combined_audio = combined_audio.astype(np.float32) / 32768.0
elif combined_audio.dtype != np.float32:
combined_audio = combined_audio.astype(np.float32)
return combined_audio
except Exception as e:
self.logger.error(f"Error converting frames to numpy: {e}")
return None
def _find_speech_segment_end(self, audio_array: np.ndarray) -> Optional[int]:
"""Find complete speech segments and return frame index at segment end"""
if self.vad_iterator is None or len(audio_array) == 0:
return None
try:
# Process audio in 512-sample windows for VAD
window_size = 512
min_silence_windows = 3 # Require 3 windows of silence after speech
# Track speech state
in_speech = False
speech_start = None
speech_end = None
silence_count = 0
for i in range(0, len(audio_array), window_size):
chunk = audio_array[i : i + window_size]
if len(chunk) < window_size:
chunk = np.pad(chunk, (0, window_size - len(chunk)))
# Detect if this window has speech
speech_dict = self.vad_iterator(chunk, return_seconds=True)
# VADIterator returns dict with 'start' and 'end' when speech segments are detected
if speech_dict:
if not in_speech:
# Speech started
speech_start = i
in_speech = True
# Debug: print(f"Speech START at sample {i}, VAD: {speech_dict}")
silence_count = 0 # Reset silence counter
continue
if not in_speech:
continue
# We're in speech but found silence
silence_count += 1
if silence_count < min_silence_windows:
continue
# Found end of speech segment
speech_end = i - (min_silence_windows - 1) * window_size
# Debug: print(f"Speech END at sample {speech_end}")
# Convert sample position to frame index
samples_per_frame = self.frames[0].samples if self.frames else 1024
frame_index = speech_end // samples_per_frame
# Ensure we don't exceed buffer
frame_index = min(frame_index, len(self.frames))
return frame_index
return None
except Exception as e:
self.logger.error(f"Error finding speech segment: {e}")
return None
async def _flush(self):
frames = self.frames[:]
self.frames = []
if frames:
if len(frames) >= self.min_frames:
await self.emit(frames)
else:
self.logger.debug(
f"Ignoring flush segment with {len(frames)} frames "
f"(< {self.min_frames} minimum)"
)
AudioChunkerAutoProcessor.register("silero", AudioChunkerSileroProcessor)

View File

@@ -1,6 +1,7 @@
from reflector.processors.base import Processor
from reflector.processors.types import (
AudioDiarizationInput,
DiarizationSegment,
TitleSummary,
Word,
)
@@ -37,18 +38,21 @@ class AudioDiarizationProcessor(Processor):
async def _diarize(self, data: AudioDiarizationInput):
raise NotImplementedError
def assign_speaker(self, words: list[Word], diarization: list[dict]):
self._diarization_remove_overlap(diarization)
self._diarization_remove_segment_without_words(words, diarization)
self._diarization_merge_same_speaker(words, diarization)
self._diarization_assign_speaker(words, diarization)
@classmethod
def assign_speaker(cls, words: list[Word], diarization: list[DiarizationSegment]):
cls._diarization_remove_overlap(diarization)
cls._diarization_remove_segment_without_words(words, diarization)
cls._diarization_merge_same_speaker(diarization)
cls._diarization_assign_speaker(words, diarization)
def iter_words_from_topics(self, topics: TitleSummary):
@staticmethod
def iter_words_from_topics(topics: list[TitleSummary]):
for topic in topics:
for word in topic.transcript.words:
yield word
def is_word_continuation(self, word_prev, word):
@staticmethod
def is_word_continuation(word_prev, word):
"""
Return True if the word is a continuation of the previous word
by checking if the previous word is ending with a punctuation
@@ -61,7 +65,8 @@ class AudioDiarizationProcessor(Processor):
return False
return True
def _diarization_remove_overlap(self, diarization: list[dict]):
@staticmethod
def _diarization_remove_overlap(diarization: list[DiarizationSegment]):
"""
Remove overlap in diarization results
@@ -86,8 +91,9 @@ class AudioDiarizationProcessor(Processor):
else:
diarization_idx += 1
@staticmethod
def _diarization_remove_segment_without_words(
self, words: list[Word], diarization: list[dict]
words: list[Word], diarization: list[DiarizationSegment]
):
"""
Remove diarization segments without words
@@ -116,9 +122,8 @@ class AudioDiarizationProcessor(Processor):
else:
diarization_idx += 1
def _diarization_merge_same_speaker(
self, words: list[Word], diarization: list[dict]
):
@staticmethod
def _diarization_merge_same_speaker(diarization: list[DiarizationSegment]):
"""
Merge diarization contigous segments with the same speaker
@@ -135,7 +140,10 @@ class AudioDiarizationProcessor(Processor):
else:
diarization_idx += 1
def _diarization_assign_speaker(self, words: list[Word], diarization: list[dict]):
@classmethod
def _diarization_assign_speaker(
cls, words: list[Word], diarization: list[DiarizationSegment]
):
"""
Assign speaker to words based on diarization
@@ -143,7 +151,7 @@ class AudioDiarizationProcessor(Processor):
"""
word_idx = 0
last_speaker = None
last_speaker = 0
for d in diarization:
start = d["start"]
end = d["end"]
@@ -158,7 +166,7 @@ class AudioDiarizationProcessor(Processor):
# If it's a continuation, assign with the last speaker
is_continuation = False
if word_idx > 0 and word_idx < len(words) - 1:
is_continuation = self.is_word_continuation(
is_continuation = cls.is_word_continuation(
*words[word_idx - 1 : word_idx + 1]
)
if is_continuation:

View File

@@ -0,0 +1,74 @@
import os
import torch
import torchaudio
from pyannote.audio import Pipeline
from reflector.processors.audio_diarization import AudioDiarizationProcessor
from reflector.processors.audio_diarization_auto import AudioDiarizationAutoProcessor
from reflector.processors.types import AudioDiarizationInput, DiarizationSegment
class AudioDiarizationPyannoteProcessor(AudioDiarizationProcessor):
"""Local diarization processor using pyannote.audio library"""
def __init__(
self,
model_name: str = "pyannote/speaker-diarization-3.1",
pyannote_auth_token: str | None = None,
device: str | None = None,
**kwargs,
):
super().__init__(**kwargs)
self.model_name = model_name
self.auth_token = pyannote_auth_token or os.environ.get("HF_TOKEN")
self.device = device
if device is None:
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.logger.info(f"Loading pyannote diarization model: {self.model_name}")
self.diarization_pipeline = Pipeline.from_pretrained(
self.model_name, use_auth_token=self.auth_token
)
self.diarization_pipeline.to(torch.device(self.device))
self.logger.info(f"Diarization model loaded on device: {self.device}")
async def _diarize(self, data: AudioDiarizationInput) -> list[DiarizationSegment]:
try:
# Load audio file (audio_url is assumed to be a local file path)
self.logger.info(f"Loading local audio file: {data.audio_url}")
waveform, sample_rate = torchaudio.load(data.audio_url)
audio_input = {"waveform": waveform, "sample_rate": sample_rate}
self.logger.info("Running speaker diarization")
diarization = self.diarization_pipeline(audio_input)
# Convert pyannote diarization output to our format
segments = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
# Extract speaker number from label (e.g., "SPEAKER_00" -> 0)
speaker_id = 0
if speaker.startswith("SPEAKER_"):
try:
speaker_id = int(speaker.split("_")[-1])
except (ValueError, IndexError):
# Fallback to hash-based ID if parsing fails
speaker_id = hash(speaker) % 1000
segments.append(
{
"start": round(segment.start, 3),
"end": round(segment.end, 3),
"speaker": speaker_id,
}
)
self.logger.info(f"Diarization completed with {len(segments)} segments")
return segments
except Exception as e:
self.logger.exception(f"Diarization failed: {e}")
raise
AudioDiarizationAutoProcessor.register("pyannote", AudioDiarizationPyannoteProcessor)

View File

@@ -0,0 +1,60 @@
from typing import Optional
import av
from av.audio.resampler import AudioResampler
from reflector.processors.base import Processor
def copy_frame(frame: av.AudioFrame) -> av.AudioFrame:
frame_copy = frame.from_ndarray(
frame.to_ndarray(),
format=frame.format.name,
layout=frame.layout.name,
)
frame_copy.sample_rate = frame.sample_rate
frame_copy.pts = frame.pts
frame_copy.time_base = frame.time_base
return frame_copy
class AudioDownscaleProcessor(Processor):
"""
Downscale audio frames to 16kHz mono format
"""
INPUT_TYPE = av.AudioFrame
OUTPUT_TYPE = av.AudioFrame
def __init__(self, target_rate: int = 16000, target_layout: str = "mono", **kwargs):
super().__init__(**kwargs)
self.target_rate = target_rate
self.target_layout = target_layout
self.resampler: Optional[AudioResampler] = None
self.needs_resampling: Optional[bool] = None
async def _push(self, data: av.AudioFrame):
if self.needs_resampling is None:
self.needs_resampling = (
data.sample_rate != self.target_rate
or data.layout.name != self.target_layout
)
if self.needs_resampling:
self.resampler = AudioResampler(
format="s16", layout=self.target_layout, rate=self.target_rate
)
if not self.needs_resampling or not self.resampler:
await self.emit(data)
return
resampled_frames = self.resampler.resample(copy_frame(data))
for resampled_frame in resampled_frames:
await self.emit(resampled_frame)
async def _flush(self):
if self.needs_resampling and self.resampler:
final_frames = self.resampler.resample(None)
for frame in final_frames:
await self.emit(frame)

View File

@@ -16,37 +16,46 @@ class AudioMergeProcessor(Processor):
INPUT_TYPE = list[av.AudioFrame]
OUTPUT_TYPE = AudioFile
def __init__(self, **kwargs):
super().__init__(**kwargs)
async def _push(self, data: list[av.AudioFrame]):
if not data:
return
# get audio information from first frame
frame = data[0]
channels = len(frame.layout.channels)
sample_rate = frame.sample_rate
sample_width = frame.format.bytes
output_channels = len(frame.layout.channels)
output_sample_rate = frame.sample_rate
output_sample_width = frame.format.bytes
# create audio file
uu = uuid4().hex
fd = io.BytesIO()
# Use PyAV to write frames
out_container = av.open(fd, "w", format="wav")
out_stream = out_container.add_stream("pcm_s16le", rate=sample_rate)
out_stream = out_container.add_stream("pcm_s16le", rate=output_sample_rate)
out_stream.layout = frame.layout.name
for frame in data:
for packet in out_stream.encode(frame):
out_container.mux(packet)
# Flush the encoder
for packet in out_stream.encode(None):
out_container.mux(packet)
out_container.close()
fd.seek(0)
# emit audio file
audiofile = AudioFile(
name=f"{monotonic_ns()}-{uu}.wav",
fd=fd,
sample_rate=sample_rate,
channels=channels,
sample_width=sample_width,
sample_rate=output_sample_rate,
channels=output_channels,
sample_width=output_sample_width,
timestamp=data[0].pts * data[0].time_base,
)

View File

@@ -21,7 +21,11 @@ from reflector.settings import settings
class AudioTranscriptModalProcessor(AudioTranscriptProcessor):
def __init__(self, modal_api_key: str | None = None, **kwargs):
def __init__(
self,
modal_api_key: str | None = None,
**kwargs,
):
super().__init__()
if not settings.TRANSCRIPT_URL:
raise Exception(

View File

@@ -173,6 +173,7 @@ class Processor(Emitter):
except Exception:
self.m_processor_failure.inc()
self.logger.exception("Error in push")
raise
async def flush(self):
"""
@@ -240,33 +241,45 @@ class ThreadedProcessor(Processor):
self.INPUT_TYPE = processor.INPUT_TYPE
self.OUTPUT_TYPE = processor.OUTPUT_TYPE
self.executor = ThreadPoolExecutor(max_workers=max_workers)
self.queue = asyncio.Queue()
self.task = asyncio.get_running_loop().create_task(self.loop())
self.queue = asyncio.Queue(maxsize=50)
self.task: asyncio.Task | None = None
def set_pipeline(self, pipeline: "Pipeline"):
super().set_pipeline(pipeline)
self.processor.set_pipeline(pipeline)
async def loop(self):
while True:
data = await self.queue.get()
self.m_processor_queue.set(self.queue.qsize())
with self.m_processor_queue_in_progress.track_inprogress():
try:
if data is None:
await self.processor.flush()
break
try:
while True:
data = await self.queue.get()
self.m_processor_queue.set(self.queue.qsize())
with self.m_processor_queue_in_progress.track_inprogress():
try:
await self.processor.push(data)
except Exception:
self.logger.error(
f"Error in push {self.processor.__class__.__name__}"
", continue"
)
finally:
self.queue.task_done()
if data is None:
await self.processor.flush()
break
try:
await self.processor.push(data)
except Exception:
self.logger.error(
f"Error in push {self.processor.__class__.__name__}"
", continue"
)
finally:
self.queue.task_done()
except Exception as e:
logger.error(f"Crash in {self.__class__.__name__}: {e}", exc_info=e)
async def _ensure_task(self):
if self.task is None:
self.task = asyncio.get_running_loop().create_task(self.loop())
# XXX not doing a sleep here make the whole pipeline prior the thread
# to be running without having a chance to work on the task here.
await asyncio.sleep(0)
async def _push(self, data):
await self._ensure_task()
await self.queue.put(data)
async def _flush(self):

View File

@@ -0,0 +1,33 @@
from pydantic import BaseModel
from reflector.processors.base import Processor
from reflector.processors.types import DiarizationSegment
class FileDiarizationInput(BaseModel):
"""Input for file diarization containing audio URL"""
audio_url: str
class FileDiarizationOutput(BaseModel):
"""Output for file diarization containing speaker segments"""
diarization: list[DiarizationSegment]
class FileDiarizationProcessor(Processor):
"""
Diarize complete audio files from URL
"""
INPUT_TYPE = FileDiarizationInput
OUTPUT_TYPE = FileDiarizationOutput
async def _push(self, data: FileDiarizationInput):
result = await self._diarize(data)
if result:
await self.emit(result)
async def _diarize(self, data: FileDiarizationInput):
raise NotImplementedError

View File

@@ -0,0 +1,33 @@
import importlib
from reflector.processors.file_diarization import FileDiarizationProcessor
from reflector.settings import settings
class FileDiarizationAutoProcessor(FileDiarizationProcessor):
_registry = {}
@classmethod
def register(cls, name, kclass):
cls._registry[name] = kclass
def __new__(cls, name: str | None = None, **kwargs):
if name is None:
name = settings.DIARIZATION_BACKEND
if name not in cls._registry:
module_name = f"reflector.processors.file_diarization_{name}"
importlib.import_module(module_name)
# gather specific configuration for the processor
# search `DIARIZATION_BACKEND_XXX_YYY`, push to constructor as `backend_xxx_yyy`
config = {}
name_upper = name.upper()
settings_prefix = "DIARIZATION_"
config_prefix = f"{settings_prefix}{name_upper}_"
for key, value in settings:
if key.startswith(config_prefix):
config_name = key[len(settings_prefix) :].lower()
config[config_name] = value
return cls._registry[name](**config | kwargs)

View File

@@ -0,0 +1,57 @@
"""
File diarization implementation using the GPU service from modal.com
API will be a POST request to DIARIZATION_URL:
```
POST /diarize?audio_file_url=...&timestamp=0
Authorization: Bearer <modal_api_key>
```
"""
import httpx
from reflector.processors.file_diarization import (
FileDiarizationInput,
FileDiarizationOutput,
FileDiarizationProcessor,
)
from reflector.processors.file_diarization_auto import FileDiarizationAutoProcessor
from reflector.settings import settings
class FileDiarizationModalProcessor(FileDiarizationProcessor):
def __init__(self, modal_api_key: str | None = None, **kwargs):
super().__init__(**kwargs)
if not settings.DIARIZATION_URL:
raise Exception(
"DIARIZATION_URL required to use FileDiarizationModalProcessor"
)
self.diarization_url = settings.DIARIZATION_URL + "/diarize"
self.file_timeout = settings.DIARIZATION_FILE_TIMEOUT
self.modal_api_key = modal_api_key
async def _diarize(self, data: FileDiarizationInput):
"""Get speaker diarization for file"""
self.logger.info(f"Starting diarization from {data.audio_url}")
headers = {}
if self.modal_api_key:
headers["Authorization"] = f"Bearer {self.modal_api_key}"
async with httpx.AsyncClient(timeout=self.file_timeout) as client:
response = await client.post(
self.diarization_url,
headers=headers,
params={
"audio_file_url": data.audio_url,
"timestamp": 0,
},
)
response.raise_for_status()
diarization_data = response.json()["diarization"]
return FileDiarizationOutput(diarization=diarization_data)
FileDiarizationAutoProcessor.register("modal", FileDiarizationModalProcessor)

View File

@@ -0,0 +1,65 @@
from prometheus_client import Counter, Histogram
from reflector.processors.base import Processor
from reflector.processors.types import Transcript
class FileTranscriptInput:
"""Input for file transcription containing audio URL and language settings"""
def __init__(self, audio_url: str, language: str = "en"):
self.audio_url = audio_url
self.language = language
class FileTranscriptProcessor(Processor):
"""
Transcript complete audio files from URL
"""
INPUT_TYPE = FileTranscriptInput
OUTPUT_TYPE = Transcript
m_transcript = Histogram(
"file_transcript",
"Time spent in FileTranscript.transcript",
["backend"],
)
m_transcript_call = Counter(
"file_transcript_call",
"Number of calls to FileTranscript.transcript",
["backend"],
)
m_transcript_success = Counter(
"file_transcript_success",
"Number of successful calls to FileTranscript.transcript",
["backend"],
)
m_transcript_failure = Counter(
"file_transcript_failure",
"Number of failed calls to FileTranscript.transcript",
["backend"],
)
def __init__(self, *args, **kwargs):
name = self.__class__.__name__
self.m_transcript = self.m_transcript.labels(name)
self.m_transcript_call = self.m_transcript_call.labels(name)
self.m_transcript_success = self.m_transcript_success.labels(name)
self.m_transcript_failure = self.m_transcript_failure.labels(name)
super().__init__(*args, **kwargs)
async def _push(self, data: FileTranscriptInput):
try:
self.m_transcript_call.inc()
with self.m_transcript.time():
result = await self._transcript(data)
self.m_transcript_success.inc()
if result:
await self.emit(result)
except Exception:
self.m_transcript_failure.inc()
raise
async def _transcript(self, data: FileTranscriptInput):
raise NotImplementedError

View File

@@ -0,0 +1,32 @@
import importlib
from reflector.processors.file_transcript import FileTranscriptProcessor
from reflector.settings import settings
class FileTranscriptAutoProcessor(FileTranscriptProcessor):
_registry = {}
@classmethod
def register(cls, name, kclass):
cls._registry[name] = kclass
def __new__(cls, name: str | None = None, **kwargs):
if name is None:
name = settings.TRANSCRIPT_BACKEND
if name not in cls._registry:
module_name = f"reflector.processors.file_transcript_{name}"
importlib.import_module(module_name)
# gather specific configuration for the processor
# search `TRANSCRIPT_BACKEND_XXX_YYY`, push to constructor as `backend_xxx_yyy`
config = {}
name_upper = name.upper()
settings_prefix = "TRANSCRIPT_"
config_prefix = f"{settings_prefix}{name_upper}_"
for key, value in settings:
if key.startswith(config_prefix):
config_name = key[len(settings_prefix) :].lower()
config[config_name] = value
return cls._registry[name](**config | kwargs)

View File

@@ -0,0 +1,77 @@
"""
File transcription implementation using the GPU service from modal.com
API will be a POST request to TRANSCRIPT_URL:
```json
{
"audio_file_url": "https://...",
"language": "en",
"model": "parakeet-tdt-0.6b-v2",
"batch": true
}
```
"""
import httpx
from reflector.processors.file_transcript import (
FileTranscriptInput,
FileTranscriptProcessor,
)
from reflector.processors.file_transcript_auto import FileTranscriptAutoProcessor
from reflector.processors.types import Transcript, Word
from reflector.settings import settings
class FileTranscriptModalProcessor(FileTranscriptProcessor):
def __init__(self, modal_api_key: str | None = None, **kwargs):
super().__init__(**kwargs)
if not settings.TRANSCRIPT_URL:
raise Exception(
"TRANSCRIPT_URL required to use FileTranscriptModalProcessor"
)
self.transcript_url = settings.TRANSCRIPT_URL
self.file_timeout = settings.TRANSCRIPT_FILE_TIMEOUT
self.modal_api_key = modal_api_key
async def _transcript(self, data: FileTranscriptInput):
"""Send full file to Modal for transcription"""
url = f"{self.transcript_url}/v1/audio/transcriptions-from-url"
self.logger.info(f"Starting file transcription from {data.audio_url}")
headers = {}
if self.modal_api_key:
headers["Authorization"] = f"Bearer {self.modal_api_key}"
async with httpx.AsyncClient(timeout=self.file_timeout) as client:
response = await client.post(
url,
headers=headers,
json={
"audio_file_url": data.audio_url,
"language": data.language,
"batch": True,
},
)
response.raise_for_status()
result = response.json()
words = [
Word(
text=word_info["word"],
start=word_info["start"],
end=word_info["end"],
)
for word_info in result.get("words", [])
]
# words come not in order
words.sort(key=lambda w: w.start)
return Transcript(words=words)
# Register with the auto processor
FileTranscriptAutoProcessor.register("modal", FileTranscriptModalProcessor)

View File

@@ -0,0 +1,45 @@
"""
Processor to assemble transcript with diarization results
"""
from reflector.processors.audio_diarization import AudioDiarizationProcessor
from reflector.processors.base import Processor
from reflector.processors.types import DiarizationSegment, Transcript
class TranscriptDiarizationAssemblerInput:
"""Input containing transcript and diarization data"""
def __init__(self, transcript: Transcript, diarization: list[DiarizationSegment]):
self.transcript = transcript
self.diarization = diarization
class TranscriptDiarizationAssemblerProcessor(Processor):
"""
Assemble transcript with diarization results by applying speaker assignments
"""
INPUT_TYPE = TranscriptDiarizationAssemblerInput
OUTPUT_TYPE = Transcript
async def _push(self, data: TranscriptDiarizationAssemblerInput):
result = await self._assemble(data)
if result:
await self.emit(result)
async def _assemble(self, data: TranscriptDiarizationAssemblerInput):
"""Apply diarization to transcript words"""
if not data.diarization:
self.logger.info(
"No diarization data provided, returning original transcript"
)
return data.transcript
# Reuse logic from AudioDiarizationProcessor
processor = AudioDiarizationProcessor()
words = data.transcript.words
processor.assign_speaker(words, data.diarization)
self.logger.info(f"Applied diarization to {len(words)} words")
return data.transcript

View File

@@ -2,13 +2,22 @@ import io
import re
import tempfile
from pathlib import Path
from typing import Annotated
from typing import Annotated, TypedDict
from profanityfilter import ProfanityFilter
from pydantic import BaseModel, Field, PrivateAttr
from reflector.redis_cache import redis_cache
class DiarizationSegment(TypedDict):
"""Type definition for diarization segment containing speaker information"""
start: float
end: float
speaker: int
PUNC_RE = re.compile(r"[.;:?!…]")
profanity_filter = ProfanityFilter()

View File

@@ -1,296 +0,0 @@
import hashlib
from datetime import date, datetime, timedelta, timezone
from typing import TypedDict
import httpx
import pytz
from icalendar import Calendar, Event
from loguru import logger
from reflector.db.calendar_events import CalendarEvent, calendar_events_controller
from reflector.db.rooms import Room, rooms_controller
from reflector.settings import settings
class AttendeeData(TypedDict, total=False):
email: str | None
name: str | None
status: str | None
role: str | None
class EventData(TypedDict):
ics_uid: str
title: str | None
description: str | None
location: str | None
start_time: datetime
end_time: datetime
attendees: list[AttendeeData]
ics_raw_data: str
class SyncStats(TypedDict):
events_created: int
events_updated: int
events_deleted: int
class ICSFetchService:
def __init__(self):
self.client = httpx.AsyncClient(
timeout=30.0, headers={"User-Agent": "Reflector/1.0"}
)
async def fetch_ics(self, url: str) -> str:
response = await self.client.get(url)
response.raise_for_status()
return response.text
def parse_ics(self, ics_content: str) -> Calendar:
return Calendar.from_ical(ics_content)
def extract_room_events(
self, calendar: Calendar, room_name: str, room_url: str
) -> list[EventData]:
events = []
now = datetime.now(timezone.utc)
window_start = now - timedelta(hours=1)
window_end = now + timedelta(hours=24)
for component in calendar.walk():
if component.name == "VEVENT":
# Skip cancelled events
status = component.get("STATUS", "").upper()
if status == "CANCELLED":
continue
# Check if event matches this room
if self._event_matches_room(component, room_name, room_url):
event_data = self._parse_event(component)
# Only include events in our time window
if (
event_data
and window_start <= event_data["start_time"] <= window_end
):
events.append(event_data)
return events
def _event_matches_room(self, event: Event, room_name: str, room_url: str) -> bool:
location = str(event.get("LOCATION", ""))
description = str(event.get("DESCRIPTION", ""))
# Only match full room URL (with or without protocol)
patterns = [
room_url, # Full URL with protocol
room_url.replace("https://", ""), # Without https protocol
room_url.replace("http://", ""), # Without http protocol
]
# Check location and description for patterns
text_to_check = f"{location} {description}".lower()
for pattern in patterns:
if pattern.lower() in text_to_check:
return True
return False
def _parse_event(self, event: Event) -> EventData | None:
# Extract basic fields
uid = str(event.get("UID", ""))
summary = str(event.get("SUMMARY", ""))
description = str(event.get("DESCRIPTION", ""))
location = str(event.get("LOCATION", ""))
# Parse dates
dtstart = event.get("DTSTART")
dtend = event.get("DTEND")
if not dtstart:
return None
# Convert to datetime
start_time = self._normalize_datetime(
dtstart.dt if hasattr(dtstart, "dt") else dtstart
)
end_time = (
self._normalize_datetime(dtend.dt if hasattr(dtend, "dt") else dtend)
if dtend
else start_time + timedelta(hours=1)
)
# Parse attendees
attendees = self._parse_attendees(event)
# Get raw event data for storage
raw_data = event.to_ical().decode("utf-8")
return {
"ics_uid": uid,
"title": summary,
"description": description,
"location": location,
"start_time": start_time,
"end_time": end_time,
"attendees": attendees,
"ics_raw_data": raw_data,
}
def _normalize_datetime(self, dt) -> datetime:
# Handle date objects (all-day events)
if isinstance(dt, date) and not isinstance(dt, datetime):
# Convert to datetime at start of day in UTC
dt = datetime.combine(dt, datetime.min.time())
dt = pytz.UTC.localize(dt)
elif isinstance(dt, datetime):
# Add UTC timezone if naive
if dt.tzinfo is None:
dt = pytz.UTC.localize(dt)
else:
# Convert to UTC
dt = dt.astimezone(pytz.UTC)
return dt
def _parse_attendees(self, event: Event) -> list[AttendeeData]:
attendees = []
# Parse ATTENDEE properties
for attendee in event.get("ATTENDEE", []):
if not isinstance(attendee, list):
attendee = [attendee]
for att in attendee:
att_data: AttendeeData = {
"email": str(att).replace("mailto:", "") if att else None,
"name": att.params.get("CN") if hasattr(att, "params") else None,
"status": att.params.get("PARTSTAT")
if hasattr(att, "params")
else None,
"role": att.params.get("ROLE") if hasattr(att, "params") else None,
}
attendees.append(att_data)
# Add organizer
organizer = event.get("ORGANIZER")
if organizer:
org_data: AttendeeData = {
"email": str(organizer).replace("mailto:", "") if organizer else None,
"name": organizer.params.get("CN")
if hasattr(organizer, "params")
else None,
"role": "ORGANIZER",
}
attendees.append(org_data)
return attendees
class ICSSyncService:
def __init__(self):
self.fetch_service = ICSFetchService()
async def sync_room_calendar(self, room: Room) -> dict:
if not room.ics_enabled or not room.ics_url:
return {"status": "skipped", "reason": "ICS not configured"}
try:
# Check if it's time to sync
if not self._should_sync(room):
return {"status": "skipped", "reason": "Not time to sync yet"}
# Fetch ICS file
ics_content = await self.fetch_service.fetch_ics(room.ics_url)
# Check if content changed
content_hash = hashlib.md5(ics_content.encode()).hexdigest()
if room.ics_last_etag == content_hash:
logger.info(f"No changes in ICS for room {room.id}")
return {"status": "unchanged", "hash": content_hash}
# Parse calendar
calendar = self.fetch_service.parse_ics(ics_content)
# Build room URL
room_url = f"{settings.BASE_URL}/room/{room.name}"
# Extract matching events
events = self.fetch_service.extract_room_events(
calendar, room.name, room_url
)
# Sync events to database
sync_result = await self._sync_events_to_database(room.id, events)
# Update room sync metadata
await rooms_controller.update(
room,
{
"ics_last_sync": datetime.now(timezone.utc),
"ics_last_etag": content_hash,
},
mutate=False,
)
return {
"status": "success",
"hash": content_hash,
"events_found": len(events),
**sync_result,
}
except Exception as e:
logger.error(f"Failed to sync ICS for room {room.id}: {e}")
return {"status": "error", "error": str(e)}
def _should_sync(self, room: Room) -> bool:
if not room.ics_last_sync:
return True
time_since_sync = datetime.now(timezone.utc) - room.ics_last_sync
return time_since_sync.total_seconds() >= room.ics_fetch_interval
async def _sync_events_to_database(
self, room_id: str, events: list[EventData]
) -> SyncStats:
created = 0
updated = 0
# Track current event IDs
current_ics_uids = []
for event_data in events:
# Create CalendarEvent object
calendar_event = CalendarEvent(room_id=room_id, **event_data)
# Upsert event
existing = await calendar_events_controller.get_by_ics_uid(
room_id, event_data["ics_uid"]
)
if existing:
updated += 1
else:
created += 1
await calendar_events_controller.upsert(calendar_event)
current_ics_uids.append(event_data["ics_uid"])
# Soft delete events that are no longer in calendar
deleted = await calendar_events_controller.soft_delete_missing(
room_id, current_ics_uids
)
return {
"events_created": created,
"events_updated": updated,
"events_deleted": deleted,
}
# Global instance
ics_sync_service = ICSSyncService()

View File

@@ -1,3 +1,4 @@
from pydantic.types import PositiveInt
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -21,11 +22,16 @@ class Settings(BaseSettings):
# local data directory
DATA_DIR: str = "./data"
# Audio Chunking
# backends: silero, frames
AUDIO_CHUNKER_BACKEND: str = "frames"
# Audio Transcription
# backends: whisper, modal
TRANSCRIPT_BACKEND: str = "whisper"
TRANSCRIPT_URL: str | None = None
TRANSCRIPT_TIMEOUT: int = 90
TRANSCRIPT_FILE_TIMEOUT: int = 600
# Audio Transcription: modal backend
TRANSCRIPT_MODAL_API_KEY: str | None = None
@@ -66,10 +72,14 @@ class Settings(BaseSettings):
DIARIZATION_ENABLED: bool = True
DIARIZATION_BACKEND: str = "modal"
DIARIZATION_URL: str | None = None
DIARIZATION_FILE_TIMEOUT: int = 600
# Diarization: modal backend
DIARIZATION_MODAL_API_KEY: str | None = None
# Diarization: local pyannote.audio
DIARIZATION_PYANNOTE_AUTH_TOKEN: str | None = None
# Sentry
SENTRY_DSN: str | None = None
@@ -81,9 +91,8 @@ class Settings(BaseSettings):
AUTH_JWT_PUBLIC_KEY: str | None = "authentik.monadical.com_public.pem"
AUTH_JWT_AUDIENCE: str | None = None
# API public mode
# if set, all anonymous record will be public
PUBLIC_MODE: bool = False
PUBLIC_DATA_RETENTION_DAYS: PositiveInt = 7
# Min transcript length to generate topic + summary
MIN_TRANSCRIPT_LENGTH: int = 750

View File

@@ -0,0 +1,72 @@
#!/usr/bin/env python
"""
Manual cleanup tool for old public data.
Uses the same implementation as the Celery worker task.
"""
import argparse
import asyncio
import sys
import structlog
from reflector.settings import settings
from reflector.worker.cleanup import _cleanup_old_public_data
logger = structlog.get_logger(__name__)
async def cleanup_old_data(days: int = 7):
logger.info(
"Starting manual cleanup",
retention_days=days,
public_mode=settings.PUBLIC_MODE,
)
if not settings.PUBLIC_MODE:
logger.critical(
"WARNING: PUBLIC_MODE is False. "
"This tool is intended for public instances only."
)
raise Exception("Tool intended for public instances only")
result = await _cleanup_old_public_data(days=days)
if result:
logger.info(
"Cleanup completed",
transcripts_deleted=result.get("transcripts_deleted", 0),
meetings_deleted=result.get("meetings_deleted", 0),
recordings_deleted=result.get("recordings_deleted", 0),
errors_count=len(result.get("errors", [])),
)
if result.get("errors"):
logger.warning(
"Errors encountered during cleanup:", errors=result["errors"][:10]
)
else:
logger.info("Cleanup skipped or completed without results")
def main():
parser = argparse.ArgumentParser(
description="Clean up old transcripts and meetings"
)
parser.add_argument(
"--days",
type=int,
default=7,
help="Number of days to keep data (default: 7)",
)
args = parser.parse_args()
if args.days < 1:
logger.error("Days must be at least 1")
sys.exit(1)
asyncio.run(cleanup_old_data(days=args.days))
if __name__ == "__main__":
main()

View File

@@ -1,105 +1,220 @@
"""
Process audio file with diarization support
"""
import argparse
import asyncio
import json
import shutil
import sys
import time
from pathlib import Path
from typing import Any, Dict, List, Literal
import av
from reflector.db.transcripts import SourceKind, TranscriptTopic, transcripts_controller
from reflector.logger import logger
from reflector.processors import (
AudioChunkerProcessor,
AudioMergeProcessor,
AudioTranscriptAutoProcessor,
Pipeline,
PipelineEvent,
TranscriptFinalSummaryProcessor,
TranscriptFinalTitleProcessor,
TranscriptLinerProcessor,
TranscriptTopicDetectorProcessor,
TranscriptTranslatorAutoProcessor,
from reflector.pipelines.main_file_pipeline import (
task_pipeline_file_process as task_pipeline_file_process,
)
from reflector.pipelines.main_live_pipeline import pipeline_post as live_pipeline_post
from reflector.pipelines.main_live_pipeline import (
pipeline_process as live_pipeline_process,
)
from reflector.processors.base import BroadcastProcessor
async def process_audio_file(
filename,
event_callback,
only_transcript=False,
source_language="en",
target_language="en",
def serialize_topics(topics: List[TranscriptTopic]) -> List[Dict[str, Any]]:
"""Convert TranscriptTopic objects to JSON-serializable dicts"""
serialized = []
for topic in topics:
topic_dict = topic.model_dump()
serialized.append(topic_dict)
return serialized
def debug_print_speakers(serialized_topics: List[Dict[str, Any]]) -> None:
"""Print debug info about speakers found in topics"""
all_speakers = set()
for topic_dict in serialized_topics:
for word in topic_dict.get("words", []):
all_speakers.add(word.get("speaker", 0))
print(
f"Found {len(serialized_topics)} topics with speakers: {all_speakers}",
file=sys.stderr,
)
TranscriptId = str
# common interface for every flow: it needs an Entry in db with specific ceremony (file path + status + actual file in file system)
# ideally we want to get rid of it at some point
async def prepare_entry(
source_path: str,
source_language: str,
target_language: str,
) -> TranscriptId:
file_path = Path(source_path)
transcript = await transcripts_controller.add(
file_path.name,
# note that the real file upload has SourceKind: LIVE for the reason of it's an error
source_kind=SourceKind.FILE,
source_language=source_language,
target_language=target_language,
user_id=None,
)
logger.info(
f"Created empty transcript {transcript.id} for file {file_path.name} because technically we need an empty transcript before we start transcript"
)
# pipelines expect files as upload.*
extension = file_path.suffix
upload_path = transcript.data_path / f"upload{extension}"
upload_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(source_path, upload_path)
logger.info(f"Copied {source_path} to {upload_path}")
# pipelines expect entity status "uploaded"
await transcripts_controller.update(transcript, {"status": "uploaded"})
return transcript.id
# same reason as prepare_entry
async def extract_result_from_entry(
transcript_id: TranscriptId, output_path: str
) -> None:
post_final_transcript = await transcripts_controller.get_by_id(transcript_id)
# assert post_final_transcript.status == "ended"
# File pipeline doesn't set status to "ended", only live pipeline does https://github.com/Monadical-SAS/reflector/issues/582
topics = post_final_transcript.topics
if not topics:
raise RuntimeError(
f"No topics found for transcript {transcript_id} after processing"
)
serialized_topics = serialize_topics(topics)
if output_path:
# Write to JSON file
with open(output_path, "w") as f:
for topic_dict in serialized_topics:
json.dump(topic_dict, f)
f.write("\n")
print(f"Results written to {output_path}", file=sys.stderr)
else:
# Write to stdout as JSONL
for topic_dict in serialized_topics:
print(json.dumps(topic_dict))
debug_print_speakers(serialized_topics)
async def process_live_pipeline(
transcript_id: TranscriptId,
):
# build pipeline for audio processing
processors = [
AudioChunkerProcessor(),
AudioMergeProcessor(),
AudioTranscriptAutoProcessor.as_threaded(),
TranscriptLinerProcessor(),
TranscriptTranslatorAutoProcessor.as_threaded(),
]
if not only_transcript:
processors += [
TranscriptTopicDetectorProcessor.as_threaded(),
BroadcastProcessor(
processors=[
TranscriptFinalTitleProcessor.as_threaded(),
TranscriptFinalSummaryProcessor.as_threaded(),
],
),
]
"""Process transcript_id with transcription and diarization"""
# transcription output
pipeline = Pipeline(*processors)
pipeline.set_pref("audio:source_language", source_language)
pipeline.set_pref("audio:target_language", target_language)
pipeline.describe()
pipeline.on(event_callback)
print(f"Processing transcript_id {transcript_id}...", file=sys.stderr)
await live_pipeline_process(transcript_id=transcript_id)
print(f"Processing complete for transcript {transcript_id}", file=sys.stderr)
pre_final_transcript = await transcripts_controller.get_by_id(transcript_id)
# assert documented behaviour: after process, the pipeline isn't ended. this is the reason of calling pipeline_post
assert pre_final_transcript.status != "ended"
# at this point, diarization is running but we have no access to it. run diarization in parallel - one will hopefully win after polling
result = live_pipeline_post(transcript_id=transcript_id)
# result.ready() blocks even without await; it mutates result also
while not result.ready():
print(f"Status: {result.state}")
time.sleep(2)
async def process_file_pipeline(
transcript_id: TranscriptId,
):
"""Process audio/video file using the optimized file pipeline"""
# task_pipeline_file_process is a Celery task, need to use .delay() for async execution
result = task_pipeline_file_process.delay(transcript_id=transcript_id)
# Wait for the Celery task to complete
while not result.ready():
print(f"File pipeline status: {result.state}", file=sys.stderr)
time.sleep(2)
logger.info("File pipeline processing complete")
async def process(
source_path: str,
source_language: str,
target_language: str,
pipeline: Literal["live", "file"],
output_path: str = None,
):
from reflector.db import get_database
database = get_database()
# db connect is a part of ceremony
await database.connect()
# start processing audio
logger.info(f"Opening {filename}")
container = av.open(filename)
try:
logger.info("Start pushing audio into the pipeline")
for frame in container.decode(audio=0):
await pipeline.push(frame)
finally:
logger.info("Flushing the pipeline")
await pipeline.flush()
transcript_id = await prepare_entry(
source_path,
source_language,
target_language,
)
logger.info("All done !")
pipeline_handlers = {
"live": process_live_pipeline,
"file": process_file_pipeline,
}
handler = pipeline_handlers.get(pipeline)
if not handler:
raise ValueError(f"Unknown pipeline type: {pipeline}")
await handler(transcript_id)
await extract_result_from_entry(transcript_id, output_path)
finally:
await database.disconnect()
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser = argparse.ArgumentParser(
description="Process audio files with speaker diarization"
)
parser.add_argument("source", help="Source file (mp3, wav, mp4...)")
parser.add_argument("--only-transcript", "-t", action="store_true")
parser.add_argument("--source-language", default="en")
parser.add_argument("--target-language", default="en")
parser.add_argument(
"--pipeline",
required=True,
choices=["live", "file"],
help="Pipeline type to use for processing (live: streaming/incremental, file: batch/parallel)",
)
parser.add_argument(
"--source-language", default="en", help="Source language code (default: en)"
)
parser.add_argument(
"--target-language", default="en", help="Target language code (default: en)"
)
parser.add_argument("--output", "-o", help="Output file (output.jsonl)")
args = parser.parse_args()
output_fd = None
if args.output:
output_fd = open(args.output, "w")
async def event_callback(event: PipelineEvent):
processor = event.processor
# ignore some processor
if processor in ("AudioChunkerProcessor", "AudioMergeProcessor"):
return
logger.info(f"Event: {event}")
if output_fd:
output_fd.write(event.model_dump_json())
output_fd.write("\n")
asyncio.run(
process_audio_file(
process(
args.source,
event_callback,
only_transcript=args.only_transcript,
source_language=args.source_language,
target_language=args.target_language,
args.source_language,
args.target_language,
args.pipeline,
args.output,
)
)
if output_fd:
output_fd.close()
logger.info(f"Output written to {args.output}")

View File

@@ -1,315 +0,0 @@
"""
@vibe-generated
Process audio file with diarization support
===========================================
Extended version of process.py that includes speaker diarization.
This tool processes audio files locally without requiring the full server infrastructure.
"""
import asyncio
import tempfile
import uuid
from pathlib import Path
from typing import List
import av
from reflector.logger import logger
from reflector.processors import (
AudioChunkerProcessor,
AudioFileWriterProcessor,
AudioMergeProcessor,
AudioTranscriptAutoProcessor,
Pipeline,
PipelineEvent,
TranscriptFinalSummaryProcessor,
TranscriptFinalTitleProcessor,
TranscriptLinerProcessor,
TranscriptTopicDetectorProcessor,
TranscriptTranslatorAutoProcessor,
)
from reflector.processors.base import BroadcastProcessor, Processor
from reflector.processors.types import (
AudioDiarizationInput,
TitleSummary,
TitleSummaryWithId,
)
class TopicCollectorProcessor(Processor):
"""Collect topics for diarization"""
INPUT_TYPE = TitleSummary
OUTPUT_TYPE = TitleSummary
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.topics: List[TitleSummaryWithId] = []
self._topic_id = 0
async def _push(self, data: TitleSummary):
# Convert to TitleSummaryWithId and collect
self._topic_id += 1
topic_with_id = TitleSummaryWithId(
id=str(self._topic_id),
title=data.title,
summary=data.summary,
timestamp=data.timestamp,
duration=data.duration,
transcript=data.transcript,
)
self.topics.append(topic_with_id)
# Pass through the original topic
await self.emit(data)
def get_topics(self) -> List[TitleSummaryWithId]:
return self.topics
async def process_audio_file_with_diarization(
filename,
event_callback,
only_transcript=False,
source_language="en",
target_language="en",
enable_diarization=True,
diarization_backend="modal",
):
# Create temp file for audio if diarization is enabled
audio_temp_path = None
if enable_diarization:
audio_temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
audio_temp_path = audio_temp_file.name
audio_temp_file.close()
# Create processor for collecting topics
topic_collector = TopicCollectorProcessor()
# Build pipeline for audio processing
processors = []
# Add audio file writer at the beginning if diarization is enabled
if enable_diarization:
processors.append(AudioFileWriterProcessor(audio_temp_path))
# Add the rest of the processors
processors += [
AudioChunkerProcessor(),
AudioMergeProcessor(),
AudioTranscriptAutoProcessor.as_threaded(),
]
processors += [
TranscriptLinerProcessor(),
TranscriptTranslatorAutoProcessor.as_threaded(),
]
if not only_transcript:
processors += [
TranscriptTopicDetectorProcessor.as_threaded(),
# Collect topics for diarization
topic_collector,
BroadcastProcessor(
processors=[
TranscriptFinalTitleProcessor.as_threaded(),
TranscriptFinalSummaryProcessor.as_threaded(),
],
),
]
# Create main pipeline
pipeline = Pipeline(*processors)
pipeline.set_pref("audio:source_language", source_language)
pipeline.set_pref("audio:target_language", target_language)
pipeline.describe()
pipeline.on(event_callback)
# Start processing audio
logger.info(f"Opening {filename}")
container = av.open(filename)
try:
logger.info("Start pushing audio into the pipeline")
for frame in container.decode(audio=0):
await pipeline.push(frame)
finally:
logger.info("Flushing the pipeline")
await pipeline.flush()
# Run diarization if enabled and we have topics
if enable_diarization and not only_transcript and audio_temp_path:
topics = topic_collector.get_topics()
if topics:
logger.info(f"Starting diarization with {len(topics)} topics")
try:
from reflector.processors import AudioDiarizationAutoProcessor
diarization_processor = AudioDiarizationAutoProcessor(
name=diarization_backend
)
diarization_processor.set_pipeline(pipeline)
# For Modal backend, we need to upload the file to S3 first
if diarization_backend == "modal":
from datetime import datetime, timezone
from reflector.storage import get_transcripts_storage
from reflector.utils.s3_temp_file import S3TemporaryFile
storage = get_transcripts_storage()
# Generate a unique filename in evaluation folder
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
audio_filename = f"evaluation/diarization_temp/{timestamp}_{uuid.uuid4().hex}.wav"
# Use context manager for automatic cleanup
async with S3TemporaryFile(storage, audio_filename) as s3_file:
# Read and upload the audio file
with open(audio_temp_path, "rb") as f:
audio_data = f.read()
audio_url = await s3_file.upload(audio_data)
logger.info(f"Uploaded audio to S3: {audio_filename}")
# Create diarization input with S3 URL
diarization_input = AudioDiarizationInput(
audio_url=audio_url, topics=topics
)
# Run diarization
await diarization_processor.push(diarization_input)
await diarization_processor.flush()
logger.info("Diarization complete")
# File will be automatically cleaned up when exiting the context
else:
# For local backend, use local file path
audio_url = audio_temp_path
# Create diarization input
diarization_input = AudioDiarizationInput(
audio_url=audio_url, topics=topics
)
# Run diarization
await diarization_processor.push(diarization_input)
await diarization_processor.flush()
logger.info("Diarization complete")
except ImportError as e:
logger.error(f"Failed to import diarization dependencies: {e}")
logger.error(
"Install with: uv pip install pyannote.audio torch torchaudio"
)
logger.error(
"And set HF_TOKEN environment variable for pyannote models"
)
raise SystemExit(1)
except Exception as e:
logger.error(f"Diarization failed: {e}")
raise SystemExit(1)
else:
logger.warning("Skipping diarization: no topics available")
# Clean up temp file
if audio_temp_path:
try:
Path(audio_temp_path).unlink()
except Exception as e:
logger.warning(f"Failed to clean up temp file {audio_temp_path}: {e}")
logger.info("All done!")
if __name__ == "__main__":
import argparse
import os
parser = argparse.ArgumentParser(
description="Process audio files with optional speaker diarization"
)
parser.add_argument("source", help="Source file (mp3, wav, mp4...)")
parser.add_argument(
"--only-transcript",
"-t",
action="store_true",
help="Only generate transcript without topics/summaries",
)
parser.add_argument(
"--source-language", default="en", help="Source language code (default: en)"
)
parser.add_argument(
"--target-language", default="en", help="Target language code (default: en)"
)
parser.add_argument("--output", "-o", help="Output file (output.jsonl)")
parser.add_argument(
"--enable-diarization",
"-d",
action="store_true",
help="Enable speaker diarization",
)
parser.add_argument(
"--diarization-backend",
default="modal",
choices=["modal"],
help="Diarization backend to use (default: modal)",
)
args = parser.parse_args()
# Set REDIS_HOST to localhost if not provided
if "REDIS_HOST" not in os.environ:
os.environ["REDIS_HOST"] = "localhost"
logger.info("REDIS_HOST not set, defaulting to localhost")
output_fd = None
if args.output:
output_fd = open(args.output, "w")
async def event_callback(event: PipelineEvent):
processor = event.processor
data = event.data
# Ignore internal processors
if processor in (
"AudioChunkerProcessor",
"AudioMergeProcessor",
"AudioFileWriterProcessor",
"TopicCollectorProcessor",
"BroadcastProcessor",
):
return
# If diarization is enabled, skip the original topic events from the pipeline
# The diarization processor will emit the same topics but with speaker info
if processor == "TranscriptTopicDetectorProcessor" and args.enable_diarization:
return
# Log all events
logger.info(f"Event: {processor} - {type(data).__name__}")
# Write to output
if output_fd:
output_fd.write(event.model_dump_json())
output_fd.write("\n")
output_fd.flush()
asyncio.run(
process_audio_file_with_diarization(
args.source,
event_callback,
only_transcript=args.only_transcript,
source_language=args.source_language,
target_language=args.target_language,
enable_diarization=args.enable_diarization,
diarization_backend=args.diarization_backend,
)
)
if output_fd:
output_fd.close()
logger.info(f"Output written to {args.output}")

View File

@@ -53,7 +53,7 @@ async def run_single_processor(args):
async def event_callback(event: PipelineEvent):
processor = event.processor
# ignore some processor
if processor in ("AudioChunkerProcessor", "AudioMergeProcessor"):
if processor in ("AudioChunkerAutoProcessor", "AudioMergeProcessor"):
return
print(f"Event: {event}")
if output_fd:

View File

@@ -1,96 +0,0 @@
#!/usr/bin/env python3
"""
@vibe-generated
Test script for the diarization CLI tool
=========================================
This script helps test the diarization functionality with sample audio files.
"""
import asyncio
import sys
from pathlib import Path
from reflector.logger import logger
async def test_diarization(audio_file: str):
"""Test the diarization functionality"""
# Import the processing function
from process_with_diarization import process_audio_file_with_diarization
# Collect events
events = []
async def event_callback(event):
events.append({"processor": event.processor, "data": event.data})
logger.info(f"Event from {event.processor}")
# Process the audio file
logger.info(f"Processing audio file: {audio_file}")
try:
await process_audio_file_with_diarization(
audio_file,
event_callback,
only_transcript=False,
source_language="en",
target_language="en",
enable_diarization=True,
diarization_backend="modal",
)
# Analyze results
logger.info(f"Processing complete. Received {len(events)} events")
# Look for diarization results
diarized_topics = []
for event in events:
if "TitleSummary" in event["processor"]:
# Check if words have speaker information
if hasattr(event["data"], "transcript") and event["data"].transcript:
words = event["data"].transcript.words
if words and hasattr(words[0], "speaker"):
speakers = set(
w.speaker for w in words if hasattr(w, "speaker")
)
logger.info(
f"Found {len(speakers)} speakers in topic: {event['data'].title}"
)
diarized_topics.append(event["data"])
if diarized_topics:
logger.info(f"Successfully diarized {len(diarized_topics)} topics")
# Print sample output
sample_topic = diarized_topics[0]
logger.info("Sample diarized output:")
for i, word in enumerate(sample_topic.transcript.words[:10]):
logger.info(f" Word {i}: '{word.text}' - Speaker {word.speaker}")
else:
logger.warning("No diarization results found in output")
return events
except Exception as e:
logger.error(f"Error during processing: {e}")
raise
def main():
if len(sys.argv) < 2:
print("Usage: python test_diarization.py <audio_file>")
sys.exit(1)
audio_file = sys.argv[1]
if not Path(audio_file).exists():
print(f"Error: Audio file '{audio_file}' not found")
sys.exit(1)
# Run the test
asyncio.run(test_diarization(audio_file))
if __name__ == "__main__":
main()

View File

@@ -15,6 +15,7 @@ from reflector.db.meetings import meetings_controller
from reflector.db.rooms import rooms_controller
from reflector.settings import settings
from reflector.whereby import create_meeting, upload_logo
from reflector.worker.webhook import test_webhook
logger = logging.getLogger(__name__)
@@ -42,11 +43,11 @@ class Room(BaseModel):
recording_type: str
recording_trigger: str
is_shared: bool
ics_url: Optional[str] = None
ics_fetch_interval: int = 300
ics_enabled: bool = False
ics_last_sync: Optional[datetime] = None
ics_last_etag: Optional[str] = None
class RoomDetails(Room):
webhook_url: str | None
webhook_secret: str | None
class Meeting(BaseModel):
@@ -69,34 +70,40 @@ class CreateRoom(BaseModel):
recording_type: str
recording_trigger: str
is_shared: bool
ics_url: Optional[str] = None
ics_fetch_interval: int = 300
ics_enabled: bool = False
webhook_url: str
webhook_secret: str
class UpdateRoom(BaseModel):
name: Optional[str] = None
zulip_auto_post: Optional[bool] = None
zulip_stream: Optional[str] = None
zulip_topic: Optional[str] = None
is_locked: Optional[bool] = None
room_mode: Optional[str] = None
recording_type: Optional[str] = None
recording_trigger: Optional[str] = None
is_shared: Optional[bool] = None
ics_url: Optional[str] = None
ics_fetch_interval: Optional[int] = None
ics_enabled: Optional[bool] = None
name: str
zulip_auto_post: bool
zulip_stream: str
zulip_topic: str
is_locked: bool
room_mode: str
recording_type: str
recording_trigger: str
is_shared: bool
webhook_url: str
webhook_secret: str
class DeletionStatus(BaseModel):
status: str
@router.get("/rooms", response_model=Page[Room])
class WebhookTestResult(BaseModel):
success: bool
message: str = ""
error: str = ""
status_code: int | None = None
response_preview: str | None = None
@router.get("/rooms", response_model=Page[RoomDetails])
async def rooms_list(
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
) -> list[Room]:
) -> list[RoomDetails]:
if not user and not settings.PUBLIC_MODE:
raise HTTPException(status_code=401, detail="Not authenticated")
@@ -110,6 +117,18 @@ async def rooms_list(
)
@router.get("/rooms/{room_id}", response_model=RoomDetails)
async def rooms_get(
room_id: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
):
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_id_for_http(room_id, user_id=user_id)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
return room
@router.post("/rooms", response_model=Room)
async def rooms_create(
room: CreateRoom,
@@ -128,13 +147,12 @@ async def rooms_create(
recording_type=room.recording_type,
recording_trigger=room.recording_trigger,
is_shared=room.is_shared,
ics_url=room.ics_url,
ics_fetch_interval=room.ics_fetch_interval,
ics_enabled=room.ics_enabled,
webhook_url=room.webhook_url,
webhook_secret=room.webhook_secret,
)
@router.patch("/rooms/{room_id}", response_model=Room)
@router.patch("/rooms/{room_id}", response_model=RoomDetails)
async def rooms_update(
room_id: str,
info: UpdateRoom,
@@ -225,215 +243,22 @@ async def rooms_create_meeting(
return meeting
class ICSStatus(BaseModel):
status: str
last_sync: Optional[datetime] = None
next_sync: Optional[datetime] = None
last_etag: Optional[str] = None
events_count: int = 0
class ICSSyncResult(BaseModel):
status: str
hash: Optional[str] = None
events_found: int = 0
events_created: int = 0
events_updated: int = 0
events_deleted: int = 0
error: Optional[str] = None
@router.post("/rooms/{room_name}/ics/sync", response_model=ICSSyncResult)
async def rooms_sync_ics(
room_name: str,
@router.post("/rooms/{room_id}/webhook/test", response_model=WebhookTestResult)
async def rooms_test_webhook(
room_id: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
):
"""Test webhook configuration by sending a sample payload."""
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_name(room_name)
room = await rooms_controller.get_by_id(room_id)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
if user_id != room.user_id:
if user_id and room.user_id != user_id:
raise HTTPException(
status_code=403, detail="Only room owner can trigger ICS sync"
status_code=403, detail="Not authorized to test this room's webhook"
)
if not room.ics_enabled or not room.ics_url:
raise HTTPException(status_code=400, detail="ICS not configured for this room")
from reflector.services.ics_sync import ics_sync_service
result = await ics_sync_service.sync_room_calendar(room)
if result["status"] == "error":
raise HTTPException(
status_code=500, detail=result.get("error", "Unknown error")
)
return ICSSyncResult(**result)
@router.get("/rooms/{room_name}/ics/status", response_model=ICSStatus)
async def rooms_ics_status(
room_name: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
):
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_name(room_name)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
if user_id != room.user_id:
raise HTTPException(
status_code=403, detail="Only room owner can view ICS status"
)
next_sync = None
if room.ics_enabled and room.ics_last_sync:
next_sync = room.ics_last_sync + timedelta(seconds=room.ics_fetch_interval)
from reflector.db.calendar_events import calendar_events_controller
events = await calendar_events_controller.get_by_room(
room.id, include_deleted=False
)
return ICSStatus(
status="enabled" if room.ics_enabled else "disabled",
last_sync=room.ics_last_sync,
next_sync=next_sync,
last_etag=room.ics_last_etag,
events_count=len(events),
)
class CalendarEventResponse(BaseModel):
id: str
room_id: str
ics_uid: str
title: Optional[str] = None
description: Optional[str] = None
start_time: datetime
end_time: datetime
attendees: Optional[list[dict]] = None
location: Optional[str] = None
last_synced: datetime
created_at: datetime
updated_at: datetime
@router.get("/rooms/{room_name}/meetings", response_model=list[CalendarEventResponse])
async def rooms_list_meetings(
room_name: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
):
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_name(room_name)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
from reflector.db.calendar_events import calendar_events_controller
events = await calendar_events_controller.get_by_room(
room.id, include_deleted=False
)
if user_id != room.user_id:
for event in events:
event.description = None
event.attendees = None
return events
@router.get(
"/rooms/{room_name}/meetings/upcoming", response_model=list[CalendarEventResponse]
)
async def rooms_list_upcoming_meetings(
room_name: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
minutes_ahead: int = 30,
):
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_name(room_name)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
from reflector.db.calendar_events import calendar_events_controller
events = await calendar_events_controller.get_upcoming(
room.id, minutes_ahead=minutes_ahead
)
if user_id != room.user_id:
for event in events:
event.description = None
event.attendees = None
return events
@router.get("/rooms/{room_name}/meetings/active", response_model=list[Meeting])
async def rooms_list_active_meetings(
room_name: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
):
"""List all active meetings for a room (supports multiple active meetings)"""
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_name(room_name)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
current_time = datetime.now(timezone.utc)
meetings = await meetings_controller.get_all_active_for_room(
room=room, current_time=current_time
)
# Hide host URLs from non-owners
if user_id != room.user_id:
for meeting in meetings:
meeting.host_room_url = ""
return meetings
@router.post("/rooms/{room_name}/meetings/{meeting_id}/join", response_model=Meeting)
async def rooms_join_meeting(
room_name: str,
meeting_id: str,
user: Annotated[Optional[auth.UserInfo], Depends(auth.current_user_optional)],
):
"""Join a specific meeting by ID"""
user_id = user["sub"] if user else None
room = await rooms_controller.get_by_name(room_name)
if not room:
raise HTTPException(status_code=404, detail="Room not found")
meeting = await meetings_controller.get_by_id(meeting_id)
if not meeting:
raise HTTPException(status_code=404, detail="Meeting not found")
if meeting.room_id != room.id:
raise HTTPException(
status_code=403, detail="Meeting does not belong to this room"
)
if not meeting.is_active:
raise HTTPException(status_code=400, detail="Meeting is not active")
current_time = datetime.now(timezone.utc)
if meeting.end_date <= current_time:
raise HTTPException(status_code=400, detail="Meeting has ended")
# Hide host URL from non-owners
if user_id != room.user_id:
meeting.host_room_url = ""
return meeting
result = await test_webhook(room_id)
return WebhookTestResult(**result)

View File

@@ -160,6 +160,7 @@ async def transcripts_search(
limit: SearchLimitParam = DEFAULT_SEARCH_LIMIT,
offset: SearchOffsetParam = 0,
room_id: Optional[str] = None,
source_kind: Optional[SourceKind] = None,
user: Annotated[
Optional[auth.UserInfo], Depends(auth.current_user_optional)
] = None,
@@ -173,7 +174,12 @@ async def transcripts_search(
user_id = user["sub"] if user else None
search_params = SearchParameters(
query_text=q, limit=limit, offset=offset, user_id=user_id, room_id=room_id
query_text=q,
limit=limit,
offset=offset,
user_id=user_id,
room_id=room_id,
source_kind=source_kind,
)
results, total = await search_controller.search_transcripts(search_params)

View File

@@ -6,7 +6,7 @@ from pydantic import BaseModel
import reflector.auth as auth
from reflector.db.transcripts import transcripts_controller
from reflector.pipelines.main_live_pipeline import task_pipeline_process
from reflector.pipelines.main_file_pipeline import task_pipeline_file_process
router = APIRouter()
@@ -40,7 +40,7 @@ async def transcript_process(
return ProcessStatus(status="already running")
# schedule a background task process the file
task_pipeline_process.delay(transcript_id=transcript_id)
task_pipeline_file_process.delay(transcript_id=transcript_id)
return ProcessStatus(status="ok")

View File

@@ -6,7 +6,7 @@ from pydantic import BaseModel
import reflector.auth as auth
from reflector.db.transcripts import transcripts_controller
from reflector.pipelines.main_live_pipeline import task_pipeline_process
from reflector.pipelines.main_file_pipeline import task_pipeline_file_process
router = APIRouter()
@@ -92,6 +92,6 @@ async def transcript_record_upload(
await transcripts_controller.update(transcript, {"status": "uploaded"})
# launch a background task to process the file
task_pipeline_process.delay(transcript_id=transcript_id)
task_pipeline_file_process.delay(transcript_id=transcript_id)
return UploadStatus(status="ok")

View File

@@ -68,13 +68,8 @@ async def whereby_webhook(event: WherebyWebhookEvent, request: Request):
raise HTTPException(status_code=404, detail="Meeting not found")
if event.type in ["room.client.joined", "room.client.left"]:
update_data = {"num_clients": event.data["numClients"]}
# Clear grace period if participant joined
if event.type == "room.client.joined" and event.data["numClients"] > 0:
if meeting.last_participant_left_at:
update_data["last_participant_left_at"] = None
await meetings_controller.update_meeting(meeting.id, **update_data)
await meetings_controller.update_meeting(
meeting.id, num_clients=event.data["numClients"]
)
return {"status": "ok"}

View File

@@ -19,7 +19,7 @@ else:
"reflector.pipelines.main_live_pipeline",
"reflector.worker.healthcheck",
"reflector.worker.process",
"reflector.worker.ics_sync",
"reflector.worker.cleanup",
]
)
@@ -37,16 +37,18 @@ else:
"task": "reflector.worker.process.reprocess_failed_recordings",
"schedule": crontab(hour=5, minute=0), # Midnight EST
},
"sync_all_ics_calendars": {
"task": "reflector.worker.ics_sync.sync_all_ics_calendars",
"schedule": 60.0, # Run every minute to check which rooms need sync
},
"pre_create_upcoming_meetings": {
"task": "reflector.worker.ics_sync.pre_create_upcoming_meetings",
"schedule": 30.0, # Run every 30 seconds to pre-create meetings
},
}
if settings.PUBLIC_MODE:
app.conf.beat_schedule["cleanup_old_public_data"] = {
"task": "reflector.worker.cleanup.cleanup_old_public_data_task",
"schedule": crontab(hour=3, minute=0),
}
logger.info(
"Public mode cleanup enabled",
retention_days=settings.PUBLIC_DATA_RETENTION_DAYS,
)
if settings.HEALTHCHECK_URL:
app.conf.beat_schedule["healthcheck_ping"] = {
"task": "reflector.worker.healthcheck.healthcheck_ping",

View File

@@ -0,0 +1,156 @@
"""
Main task for cleanup old public data.
Deletes old anonymous transcripts and their associated meetings/recordings.
Transcripts are the main entry point - any associated data is also removed.
"""
import asyncio
from datetime import datetime, timedelta, timezone
from typing import TypedDict
import structlog
from celery import shared_task
from databases import Database
from pydantic.types import PositiveInt
from reflector.asynctask import asynctask
from reflector.db import get_database
from reflector.db.meetings import meetings
from reflector.db.recordings import recordings
from reflector.db.transcripts import transcripts, transcripts_controller
from reflector.settings import settings
from reflector.storage import get_recordings_storage
logger = structlog.get_logger(__name__)
class CleanupStats(TypedDict):
"""Statistics for cleanup operation."""
transcripts_deleted: int
meetings_deleted: int
recordings_deleted: int
errors: list[str]
async def delete_single_transcript(
db: Database, transcript_data: dict, stats: CleanupStats
):
transcript_id = transcript_data["id"]
meeting_id = transcript_data["meeting_id"]
recording_id = transcript_data["recording_id"]
try:
async with db.transaction(isolation="serializable"):
if meeting_id:
await db.execute(meetings.delete().where(meetings.c.id == meeting_id))
stats["meetings_deleted"] += 1
logger.info("Deleted associated meeting", meeting_id=meeting_id)
if recording_id:
recording = await db.fetch_one(
recordings.select().where(recordings.c.id == recording_id)
)
if recording:
try:
await get_recordings_storage().delete_file(
recording["object_key"]
)
except Exception as storage_error:
logger.warning(
"Failed to delete recording from storage",
recording_id=recording_id,
object_key=recording["object_key"],
error=str(storage_error),
)
await db.execute(
recordings.delete().where(recordings.c.id == recording_id)
)
stats["recordings_deleted"] += 1
logger.info(
"Deleted associated recording", recording_id=recording_id
)
await transcripts_controller.remove_by_id(transcript_id)
stats["transcripts_deleted"] += 1
logger.info(
"Deleted transcript",
transcript_id=transcript_id,
created_at=transcript_data["created_at"].isoformat(),
)
except Exception as e:
error_msg = f"Failed to delete transcript {transcript_id}: {str(e)}"
logger.error(error_msg, exc_info=e)
stats["errors"].append(error_msg)
async def cleanup_old_transcripts(
db: Database, cutoff_date: datetime, stats: CleanupStats
):
"""Delete old anonymous transcripts and their associated recordings/meetings."""
query = transcripts.select().where(
(transcripts.c.created_at < cutoff_date) & (transcripts.c.user_id.is_(None))
)
old_transcripts = await db.fetch_all(query)
logger.info(f"Found {len(old_transcripts)} old transcripts to delete")
for transcript_data in old_transcripts:
await delete_single_transcript(db, transcript_data, stats)
def log_cleanup_results(stats: CleanupStats):
logger.info(
"Cleanup completed",
transcripts_deleted=stats["transcripts_deleted"],
meetings_deleted=stats["meetings_deleted"],
recordings_deleted=stats["recordings_deleted"],
errors_count=len(stats["errors"]),
)
if stats["errors"]:
logger.warning(
"Cleanup completed with errors",
errors=stats["errors"][:10],
)
async def cleanup_old_public_data(
days: PositiveInt | None = None,
) -> CleanupStats | None:
if days is None:
days = settings.PUBLIC_DATA_RETENTION_DAYS
if not settings.PUBLIC_MODE:
logger.info("Skipping cleanup - not a public instance")
return None
cutoff_date = datetime.now(timezone.utc) - timedelta(days=days)
logger.info(
"Starting cleanup of old public data",
cutoff_date=cutoff_date.isoformat(),
)
stats: CleanupStats = {
"transcripts_deleted": 0,
"meetings_deleted": 0,
"recordings_deleted": 0,
"errors": [],
}
db = get_database()
await cleanup_old_transcripts(db, cutoff_date, stats)
log_cleanup_results(stats)
return stats
@shared_task(
autoretry_for=(Exception,),
retry_kwargs={"max_retries": 3, "countdown": 300},
)
@asynctask
def cleanup_old_public_data_task(days: int | None = None):
asyncio.run(cleanup_old_public_data(days=days))

View File

@@ -1,209 +0,0 @@
from datetime import datetime, timedelta, timezone
import structlog
from celery import shared_task
from celery.utils.log import get_task_logger
from reflector.db import get_database
from reflector.db.meetings import meetings_controller
from reflector.db.rooms import rooms, rooms_controller
from reflector.services.ics_sync import ics_sync_service
from reflector.whereby import create_meeting, upload_logo
logger = structlog.wrap_logger(get_task_logger(__name__))
@shared_task
def sync_room_ics(room_id: str):
asynctask(_sync_room_ics_async(room_id))
async def _sync_room_ics_async(room_id: str):
try:
room = await rooms_controller.get_by_id(room_id)
if not room:
logger.warning("Room not found for ICS sync", room_id=room_id)
return
if not room.ics_enabled or not room.ics_url:
logger.debug("ICS not enabled for room", room_id=room_id)
return
logger.info("Starting ICS sync for room", room_id=room_id, room_name=room.name)
result = await ics_sync_service.sync_room_calendar(room)
if result["status"] == "success":
logger.info(
"ICS sync completed successfully",
room_id=room_id,
events_found=result.get("events_found", 0),
events_created=result.get("events_created", 0),
events_updated=result.get("events_updated", 0),
events_deleted=result.get("events_deleted", 0),
)
elif result["status"] == "unchanged":
logger.debug("ICS content unchanged", room_id=room_id)
elif result["status"] == "error":
logger.error("ICS sync failed", room_id=room_id, error=result.get("error"))
else:
logger.debug(
"ICS sync skipped", room_id=room_id, reason=result.get("reason")
)
except Exception as e:
logger.error("Unexpected error during ICS sync", room_id=room_id, error=str(e))
@shared_task
def sync_all_ics_calendars():
asynctask(_sync_all_ics_calendars_async())
async def _sync_all_ics_calendars_async():
try:
logger.info("Starting sync for all ICS-enabled rooms")
# Get ALL rooms - not filtered by is_shared
query = rooms.select().where(
rooms.c.ics_enabled == True, rooms.c.ics_url != None
)
all_rooms = await get_database().fetch_all(query)
ics_enabled_rooms = list(all_rooms)
logger.info(f"Found {len(ics_enabled_rooms)} rooms with ICS enabled")
for room_data in ics_enabled_rooms:
room_id = room_data["id"]
room = await rooms_controller.get_by_id(room_id)
if not room:
continue
if not _should_sync(room):
logger.debug("Skipping room, not time to sync yet", room_id=room_id)
continue
sync_room_ics.delay(room_id)
logger.info("Queued sync tasks for all eligible rooms")
except Exception as e:
logger.error("Error in sync_all_ics_calendars", error=str(e))
def _should_sync(room) -> bool:
if not room.ics_last_sync:
return True
time_since_sync = datetime.now(timezone.utc) - room.ics_last_sync
return time_since_sync.total_seconds() >= room.ics_fetch_interval
@shared_task
def pre_create_upcoming_meetings():
asynctask(_pre_create_upcoming_meetings_async())
async def _pre_create_upcoming_meetings_async():
try:
logger.info("Starting pre-creation of upcoming meetings")
from reflector.db.calendar_events import calendar_events_controller
# Get ALL rooms with ICS enabled
query = rooms.select().where(
rooms.c.ics_enabled == True, rooms.c.ics_url != None
)
all_rooms = await get_database().fetch_all(query)
now = datetime.now(timezone.utc)
pre_create_window = now + timedelta(minutes=1)
for room_data in all_rooms:
room_id = room_data["id"]
room = await rooms_controller.get_by_id(room_id)
if not room:
continue
events = await calendar_events_controller.get_upcoming(
room_id, minutes_ahead=2
)
for event in events:
if event.start_time <= pre_create_window:
existing_meeting = await meetings_controller.get_by_calendar_event(
event.id
)
if not existing_meeting:
logger.info(
"Pre-creating meeting for calendar event",
room_id=room_id,
event_id=event.id,
event_title=event.title,
)
try:
end_date = event.end_time or (
event.start_time + timedelta(hours=1)
)
whereby_meeting = await create_meeting(
event.title or "Scheduled Meeting",
end_date=end_date,
room=room,
)
await upload_logo(
whereby_meeting["roomName"], "./images/logo.png"
)
meeting = await meetings_controller.create(
id=whereby_meeting["meetingId"],
room_name=whereby_meeting["roomName"],
room_url=whereby_meeting["roomUrl"],
host_room_url=whereby_meeting["hostRoomUrl"],
start_date=datetime.fromisoformat(
whereby_meeting["startDate"]
),
end_date=datetime.fromisoformat(
whereby_meeting["endDate"]
),
user_id=room.user_id,
room=room,
calendar_event_id=event.id,
calendar_metadata={
"title": event.title,
"description": event.description,
"attendees": event.attendees,
},
)
logger.info(
"Meeting pre-created successfully",
meeting_id=meeting.id,
event_id=event.id,
)
except Exception as e:
logger.error(
"Failed to pre-create meeting",
room_id=room_id,
event_id=event.id,
error=str(e),
)
logger.info("Completed pre-creation check for upcoming meetings")
except Exception as e:
logger.error("Error in pre_create_upcoming_meetings", error=str(e))
def asynctask(coro):
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
return loop.run_until_complete(coro)
finally:
loop.close()

View File

@@ -1,6 +1,6 @@
import json
import os
from datetime import datetime, timedelta, timezone
from datetime import datetime, timezone
from urllib.parse import unquote
import av
@@ -14,7 +14,8 @@ from reflector.db.meetings import meetings_controller
from reflector.db.recordings import Recording, recordings_controller
from reflector.db.rooms import rooms_controller
from reflector.db.transcripts import SourceKind, transcripts_controller
from reflector.pipelines.main_live_pipeline import asynctask, task_pipeline_process
from reflector.pipelines.main_file_pipeline import task_pipeline_file_process
from reflector.pipelines.main_live_pipeline import asynctask
from reflector.settings import settings
from reflector.whereby import get_room_sessions
@@ -140,82 +141,30 @@ async def process_recording(bucket_name: str, object_key: str):
await transcripts_controller.update(transcript, {"status": "uploaded"})
task_pipeline_process.delay(transcript_id=transcript.id)
task_pipeline_file_process.delay(transcript_id=transcript.id)
@shared_task
@asynctask
async def process_meetings():
"""
Checks which meetings are still active and deactivates those that have ended.
Supports multiple active meetings per room and grace period logic.
"""
logger.info("Processing meetings")
meetings = await meetings_controller.get_all_active()
current_time = datetime.now(timezone.utc)
for meeting in meetings:
should_deactivate = False
is_active = False
end_date = meeting.end_date
if end_date.tzinfo is None:
end_date = end_date.replace(tzinfo=timezone.utc)
# Check if meeting has passed its scheduled end time
if end_date <= current_time:
# For calendar meetings, force close 30 minutes after scheduled end
if meeting.calendar_event_id:
if current_time > end_date + timedelta(minutes=30):
should_deactivate = True
logger.info(
"Meeting %s forced closed 30 min after calendar end", meeting.id
)
else:
# Unscheduled meetings follow normal closure rules
should_deactivate = True
# Check Whereby room sessions only if not already deactivating
if not should_deactivate and end_date > current_time:
if end_date > datetime.now(timezone.utc):
response = await get_room_sessions(meeting.room_name)
room_sessions = response.get("results", [])
has_active_sessions = room_sessions and any(
is_active = not room_sessions or any(
rs["endedAt"] is None for rs in room_sessions
)
if not has_active_sessions:
# No active sessions - check grace period
if meeting.num_clients == 0:
if meeting.last_participant_left_at:
# Check if grace period has expired
grace_period = timedelta(minutes=meeting.grace_period_minutes)
if (
current_time
> meeting.last_participant_left_at + grace_period
):
should_deactivate = True
logger.info("Meeting %s grace period expired", meeting.id)
else:
# First time all participants left, record the time
await meetings_controller.update_meeting(
meeting.id, last_participant_left_at=current_time
)
logger.info(
"Meeting %s marked empty at %s", meeting.id, current_time
)
else:
# Has active sessions - clear grace period if set
if meeting.last_participant_left_at:
await meetings_controller.update_meeting(
meeting.id, last_participant_left_at=None
)
logger.info(
"Meeting %s reactivated - participant rejoined", meeting.id
)
if should_deactivate:
if not is_active:
await meetings_controller.update_meeting(meeting.id, is_active=False)
logger.info("Meeting %s is deactivated", meeting.id)
logger.info("Processed %d meetings", len(meetings))
logger.info("Processed meetings")
@shared_task

View File

@@ -0,0 +1,258 @@
"""Webhook task for sending transcript notifications."""
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone
import httpx
import structlog
from celery import shared_task
from celery.utils.log import get_task_logger
from reflector.db.rooms import rooms_controller
from reflector.db.transcripts import transcripts_controller
from reflector.pipelines.main_live_pipeline import asynctask
from reflector.settings import settings
from reflector.utils.webvtt import topics_to_webvtt
logger = structlog.wrap_logger(get_task_logger(__name__))
def generate_webhook_signature(payload: bytes, secret: str, timestamp: str) -> str:
"""Generate HMAC signature for webhook payload."""
signed_payload = f"{timestamp}.{payload.decode('utf-8')}"
hmac_obj = hmac.new(
secret.encode("utf-8"),
signed_payload.encode("utf-8"),
hashlib.sha256,
)
return hmac_obj.hexdigest()
@shared_task(
bind=True,
max_retries=30,
default_retry_delay=60,
retry_backoff=True,
retry_backoff_max=3600, # Max 1 hour between retries
)
@asynctask
async def send_transcript_webhook(
self,
transcript_id: str,
room_id: str,
event_id: str,
):
log = logger.bind(
transcript_id=transcript_id,
room_id=room_id,
retry_count=self.request.retries,
)
try:
# Fetch transcript and room
transcript = await transcripts_controller.get_by_id(transcript_id)
if not transcript:
log.error("Transcript not found, skipping webhook")
return
room = await rooms_controller.get_by_id(room_id)
if not room:
log.error("Room not found, skipping webhook")
return
if not room.webhook_url:
log.info("No webhook URL configured for room, skipping")
return
# Generate WebVTT content from topics
topics_data = []
if transcript.topics:
# Build topics data with diarized content per topic
for topic in transcript.topics:
topic_webvtt = topics_to_webvtt([topic]) if topic.words else ""
topics_data.append(
{
"title": topic.title,
"summary": topic.summary,
"timestamp": topic.timestamp,
"duration": topic.duration,
"webvtt": topic_webvtt,
}
)
# Build webhook payload
frontend_url = f"{settings.UI_BASE_URL}/transcripts/{transcript.id}"
participants = [
{"id": p.id, "name": p.name, "speaker": p.speaker}
for p in (transcript.participants or [])
]
payload_data = {
"event": "transcript.completed",
"event_id": event_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"transcript": {
"id": transcript.id,
"room_id": transcript.room_id,
"created_at": transcript.created_at.isoformat(),
"duration": transcript.duration,
"title": transcript.title,
"short_summary": transcript.short_summary,
"long_summary": transcript.long_summary,
"webvtt": transcript.webvtt,
"topics": topics_data,
"participants": participants,
"source_language": transcript.source_language,
"target_language": transcript.target_language,
"status": transcript.status,
"frontend_url": frontend_url,
},
"room": {
"id": room.id,
"name": room.name,
},
}
# Convert to JSON
payload_json = json.dumps(payload_data, separators=(",", ":"))
payload_bytes = payload_json.encode("utf-8")
# Generate signature if secret is configured
headers = {
"Content-Type": "application/json",
"User-Agent": "Reflector-Webhook/1.0",
"X-Webhook-Event": "transcript.completed",
"X-Webhook-Retry": str(self.request.retries),
}
if room.webhook_secret:
timestamp = str(int(datetime.now(timezone.utc).timestamp()))
signature = generate_webhook_signature(
payload_bytes, room.webhook_secret, timestamp
)
headers["X-Webhook-Signature"] = f"t={timestamp},v1={signature}"
# Send webhook with timeout
async with httpx.AsyncClient(timeout=30.0) as client:
log.info(
"Sending webhook",
url=room.webhook_url,
payload_size=len(payload_bytes),
)
response = await client.post(
room.webhook_url,
content=payload_bytes,
headers=headers,
)
response.raise_for_status()
log.info(
"Webhook sent successfully",
status_code=response.status_code,
response_size=len(response.content),
)
except httpx.HTTPStatusError as e:
log.error(
"Webhook failed with HTTP error",
status_code=e.response.status_code,
response_text=e.response.text[:500], # First 500 chars
)
# Don't retry on client errors (4xx)
if 400 <= e.response.status_code < 500:
log.error("Client error, not retrying")
return
# Retry on server errors (5xx)
raise self.retry(exc=e)
except (httpx.ConnectError, httpx.TimeoutException) as e:
# Retry on network errors
log.error("Webhook failed with connection error", error=str(e))
raise self.retry(exc=e)
except Exception as e:
# Retry on unexpected errors
log.exception("Unexpected error in webhook task", error=str(e))
raise self.retry(exc=e)
async def test_webhook(room_id: str) -> dict:
"""
Test webhook configuration by sending a sample payload.
Returns immediately with success/failure status.
This is the shared implementation used by both the API endpoint and Celery task.
"""
try:
room = await rooms_controller.get_by_id(room_id)
if not room:
return {"success": False, "error": "Room not found"}
if not room.webhook_url:
return {"success": False, "error": "No webhook URL configured"}
now = (datetime.now(timezone.utc).isoformat(),)
payload_data = {
"event": "test",
"event_id": uuid.uuid4().hex,
"timestamp": now,
"message": "This is a test webhook from Reflector",
"room": {
"id": room.id,
"name": room.name,
},
}
payload_json = json.dumps(payload_data, separators=(",", ":"))
payload_bytes = payload_json.encode("utf-8")
# Generate headers with signature
headers = {
"Content-Type": "application/json",
"User-Agent": "Reflector-Webhook/1.0",
"X-Webhook-Event": "test",
}
if room.webhook_secret:
timestamp = str(int(datetime.now(timezone.utc).timestamp()))
signature = generate_webhook_signature(
payload_bytes, room.webhook_secret, timestamp
)
headers["X-Webhook-Signature"] = f"t={timestamp},v1={signature}"
# Send test webhook with short timeout
async with httpx.AsyncClient(timeout=10.0) as client:
response = await client.post(
room.webhook_url,
content=payload_bytes,
headers=headers,
)
return {
"success": response.is_success,
"status_code": response.status_code,
"message": f"Webhook test {'successful' if response.is_success else 'failed'}",
"response_preview": response.text if response.text else None,
}
except httpx.TimeoutException:
return {
"success": False,
"error": "Webhook request timed out (10 seconds)",
}
except httpx.ConnectError as e:
return {
"success": False,
"error": f"Could not connect to webhook URL: {str(e)}",
}
except Exception as e:
return {
"success": False,
"error": f"Unexpected error: {str(e)}",
}

View File

@@ -0,0 +1,40 @@
interactions:
- request:
body: ''
headers:
accept:
- '*/*'
accept-encoding:
- gzip, deflate
authorization:
- DUMMY_API_KEY
connection:
- keep-alive
content-length:
- '0'
host:
- monadical-sas--reflector-diarizer-web.modal.run
user-agent:
- python-httpx/0.27.2
method: POST
uri: https://monadical-sas--reflector-diarizer-web.modal.run/diarize?audio_file_url=https%3A%2F%2Freflector-github-pytest.s3.us-east-1.amazonaws.com%2Ftest_mathieu_hello.mp3&timestamp=0
response:
body:
string: '{"diarization":[{"start":0.823,"end":1.91,"speaker":0},{"start":2.572,"end":6.409,"speaker":0},{"start":6.783,"end":10.62,"speaker":0},{"start":11.231,"end":14.168,"speaker":0},{"start":14.796,"end":19.295,"speaker":0}]}'
headers:
Alt-Svc:
- h3=":443"; ma=2592000
Content-Length:
- '220'
Content-Type:
- application/json
Date:
- Wed, 13 Aug 2025 18:25:34 GMT
Modal-Function-Call-Id:
- fc-01K2JAVNEP6N7Y1Y7W3T98BCXK
Vary:
- accept-encoding
status:
code: 200
message: OK
version: 1

View File

@@ -0,0 +1,46 @@
interactions:
- request:
body: '{"audio_file_url": "https://reflector-github-pytest.s3.us-east-1.amazonaws.com/test_mathieu_hello.mp3",
"language": "en", "batch": true}'
headers:
accept:
- '*/*'
accept-encoding:
- gzip, deflate
authorization:
- DUMMY_API_KEY
connection:
- keep-alive
content-length:
- '136'
content-type:
- application/json
host:
- monadical-sas--reflector-transcriber-parakeet-web.modal.run
user-agent:
- python-httpx/0.27.2
method: POST
uri: https://monadical-sas--reflector-transcriber-parakeet-web.modal.run/v1/audio/transcriptions-from-url
response:
body:
string: '{"text":"Hi there everyone. Today I want to share my incredible experience
with Reflector. a Q teenage product that revolutionizes audio processing.
With reflector, I can easily convert any audio into accurate transcription.
saving me hours of tedious manual work.","words":[{"word":"Hi","start":0.87,"end":1.19},{"word":"there","start":1.19,"end":1.35},{"word":"everyone.","start":1.51,"end":1.83},{"word":"Today","start":2.63,"end":2.87},{"word":"I","start":3.36,"end":3.52},{"word":"want","start":3.6,"end":3.76},{"word":"to","start":3.76,"end":3.92},{"word":"share","start":3.92,"end":4.16},{"word":"my","start":4.16,"end":4.4},{"word":"incredible","start":4.32,"end":4.96},{"word":"experience","start":4.96,"end":5.44},{"word":"with","start":5.44,"end":5.68},{"word":"Reflector.","start":5.68,"end":6.24},{"word":"a","start":6.93,"end":7.01},{"word":"Q","start":7.01,"end":7.17},{"word":"teenage","start":7.25,"end":7.65},{"word":"product","start":7.89,"end":8.29},{"word":"that","start":8.29,"end":8.61},{"word":"revolutionizes","start":8.61,"end":9.65},{"word":"audio","start":9.65,"end":10.05},{"word":"processing.","start":10.05,"end":10.53},{"word":"With","start":11.27,"end":11.43},{"word":"reflector,","start":11.51,"end":12.15},{"word":"I","start":12.31,"end":12.39},{"word":"can","start":12.39,"end":12.55},{"word":"easily","start":12.55,"end":12.95},{"word":"convert","start":12.95,"end":13.43},{"word":"any","start":13.43,"end":13.67},{"word":"audio","start":13.67,"end":13.99},{"word":"into","start":14.98,"end":15.06},{"word":"accurate","start":15.22,"end":15.54},{"word":"transcription.","start":15.7,"end":16.34},{"word":"saving","start":16.99,"end":17.15},{"word":"me","start":17.31,"end":17.47},{"word":"hours","start":17.47,"end":17.87},{"word":"of","start":17.87,"end":18.11},{"word":"tedious","start":18.11,"end":18.67},{"word":"manual","start":18.67,"end":19.07},{"word":"work.","start":19.07,"end":19.31}]}'
headers:
Alt-Svc:
- h3=":443"; ma=2592000
Content-Length:
- '1933'
Content-Type:
- application/json
Date:
- Wed, 13 Aug 2025 18:26:59 GMT
Modal-Function-Call-Id:
- fc-01K2JAWC7GAMKX4DSJ21WV31NG
Vary:
- accept-encoding
status:
code: 200
message: OK
version: 1

View File

@@ -0,0 +1,84 @@
interactions:
- request:
body: '{"audio_file_url": "https://reflector-github-pytest.s3.us-east-1.amazonaws.com/test_mathieu_hello.mp3",
"language": "en", "batch": true}'
headers:
accept:
- '*/*'
accept-encoding:
- gzip, deflate
authorization:
- DUMMY_API_KEY
connection:
- keep-alive
content-length:
- '136'
content-type:
- application/json
host:
- monadical-sas--reflector-transcriber-parakeet-web.modal.run
user-agent:
- python-httpx/0.27.2
method: POST
uri: https://monadical-sas--reflector-transcriber-parakeet-web.modal.run/v1/audio/transcriptions-from-url
response:
body:
string: '{"text":"Hi there everyone. Today I want to share my incredible experience
with Reflector. a Q teenage product that revolutionizes audio processing.
With reflector, I can easily convert any audio into accurate transcription.
saving me hours of tedious manual work.","words":[{"word":"Hi","start":0.87,"end":1.19},{"word":"there","start":1.19,"end":1.35},{"word":"everyone.","start":1.51,"end":1.83},{"word":"Today","start":2.63,"end":2.87},{"word":"I","start":3.36,"end":3.52},{"word":"want","start":3.6,"end":3.76},{"word":"to","start":3.76,"end":3.92},{"word":"share","start":3.92,"end":4.16},{"word":"my","start":4.16,"end":4.4},{"word":"incredible","start":4.32,"end":4.96},{"word":"experience","start":4.96,"end":5.44},{"word":"with","start":5.44,"end":5.68},{"word":"Reflector.","start":5.68,"end":6.24},{"word":"a","start":6.93,"end":7.01},{"word":"Q","start":7.01,"end":7.17},{"word":"teenage","start":7.25,"end":7.65},{"word":"product","start":7.89,"end":8.29},{"word":"that","start":8.29,"end":8.61},{"word":"revolutionizes","start":8.61,"end":9.65},{"word":"audio","start":9.65,"end":10.05},{"word":"processing.","start":10.05,"end":10.53},{"word":"With","start":11.27,"end":11.43},{"word":"reflector,","start":11.51,"end":12.15},{"word":"I","start":12.31,"end":12.39},{"word":"can","start":12.39,"end":12.55},{"word":"easily","start":12.55,"end":12.95},{"word":"convert","start":12.95,"end":13.43},{"word":"any","start":13.43,"end":13.67},{"word":"audio","start":13.67,"end":13.99},{"word":"into","start":14.98,"end":15.06},{"word":"accurate","start":15.22,"end":15.54},{"word":"transcription.","start":15.7,"end":16.34},{"word":"saving","start":16.99,"end":17.15},{"word":"me","start":17.31,"end":17.47},{"word":"hours","start":17.47,"end":17.87},{"word":"of","start":17.87,"end":18.11},{"word":"tedious","start":18.11,"end":18.67},{"word":"manual","start":18.67,"end":19.07},{"word":"work.","start":19.07,"end":19.31}]}'
headers:
Alt-Svc:
- h3=":443"; ma=2592000
Content-Length:
- '1933'
Content-Type:
- application/json
Date:
- Wed, 13 Aug 2025 18:27:02 GMT
Modal-Function-Call-Id:
- fc-01K2JAYZ1AR2HE422VJVKBWX9Z
Vary:
- accept-encoding
status:
code: 200
message: OK
- request:
body: ''
headers:
accept:
- '*/*'
accept-encoding:
- gzip, deflate
authorization:
- DUMMY_API_KEY
connection:
- keep-alive
content-length:
- '0'
host:
- monadical-sas--reflector-diarizer-web.modal.run
user-agent:
- python-httpx/0.27.2
method: POST
uri: https://monadical-sas--reflector-diarizer-web.modal.run/diarize?audio_file_url=https%3A%2F%2Freflector-github-pytest.s3.us-east-1.amazonaws.com%2Ftest_mathieu_hello.mp3&timestamp=0
response:
body:
string: '{"diarization":[{"start":0.823,"end":1.91,"speaker":0},{"start":2.572,"end":6.409,"speaker":0},{"start":6.783,"end":10.62,"speaker":0},{"start":11.231,"end":14.168,"speaker":0},{"start":14.796,"end":19.295,"speaker":0}]}'
headers:
Alt-Svc:
- h3=":443"; ma=2592000
Content-Length:
- '220'
Content-Type:
- application/json
Date:
- Wed, 13 Aug 2025 18:27:18 GMT
Modal-Function-Call-Id:
- fc-01K2JAZ1M34NQRJK03CCFK95D6
Vary:
- accept-encoding
status:
code: 200
message: OK
version: 1

View File

@@ -5,7 +5,29 @@ from unittest.mock import patch
import pytest
# Pytest-docker configuration
@pytest.fixture(scope="session", autouse=True)
def settings_configuration():
# theses settings are linked to monadical for pytest-recording
# if a fork is done, they have to provide their own url when cassettes needs to be updated
# modal api keys has to be defined by the user
from reflector.settings import settings
settings.TRANSCRIPT_BACKEND = "modal"
settings.TRANSCRIPT_URL = (
"https://monadical-sas--reflector-transcriber-parakeet-web.modal.run"
)
settings.DIARIZATION_BACKEND = "modal"
settings.DIARIZATION_URL = "https://monadical-sas--reflector-diarizer-web.modal.run"
@pytest.fixture(scope="module")
def vcr_config():
"""VCR configuration to filter sensitive headers"""
return {
"filter_headers": [("authorization", "DUMMY_API_KEY")],
}
@pytest.fixture(scope="session")
def docker_compose_file(pytestconfig):
return os.path.join(str(pytestconfig.rootdir), "tests", "docker-compose.test.yml")
@@ -156,6 +178,63 @@ async def dummy_diarization():
yield
@pytest.fixture
async def dummy_file_transcript():
from reflector.processors.file_transcript import FileTranscriptProcessor
from reflector.processors.types import Transcript, Word
class TestFileTranscriptProcessor(FileTranscriptProcessor):
async def _transcript(self, data):
return Transcript(
text="Hello world. How are you today?",
words=[
Word(start=0.0, end=0.5, text="Hello", speaker=0),
Word(start=0.5, end=0.6, text=" ", speaker=0),
Word(start=0.6, end=1.0, text="world", speaker=0),
Word(start=1.0, end=1.1, text=".", speaker=0),
Word(start=1.1, end=1.2, text=" ", speaker=0),
Word(start=1.2, end=1.5, text="How", speaker=0),
Word(start=1.5, end=1.6, text=" ", speaker=0),
Word(start=1.6, end=1.8, text="are", speaker=0),
Word(start=1.8, end=1.9, text=" ", speaker=0),
Word(start=1.9, end=2.1, text="you", speaker=0),
Word(start=2.1, end=2.2, text=" ", speaker=0),
Word(start=2.2, end=2.5, text="today", speaker=0),
Word(start=2.5, end=2.6, text="?", speaker=0),
],
)
with patch(
"reflector.processors.file_transcript_auto.FileTranscriptAutoProcessor.__new__"
) as mock_auto:
mock_auto.return_value = TestFileTranscriptProcessor()
yield
@pytest.fixture
async def dummy_file_diarization():
from reflector.processors.file_diarization import (
FileDiarizationOutput,
FileDiarizationProcessor,
)
from reflector.processors.types import DiarizationSegment
class TestFileDiarizationProcessor(FileDiarizationProcessor):
async def _diarize(self, data):
return FileDiarizationOutput(
diarization=[
DiarizationSegment(start=0.0, end=1.1, speaker=0),
DiarizationSegment(start=1.2, end=2.6, speaker=1),
]
)
with patch(
"reflector.processors.file_diarization_auto.FileDiarizationAutoProcessor.__new__"
) as mock_auto:
mock_auto.return_value = TestFileDiarizationProcessor()
yield
@pytest.fixture
async def dummy_transcript_translator():
from reflector.processors.transcript_translator import TranscriptTranslatorProcessor
@@ -216,9 +295,13 @@ async def dummy_storage():
with (
patch("reflector.storage.base.Storage.get_instance") as mock_storage,
patch("reflector.storage.get_transcripts_storage") as mock_get_transcripts,
patch(
"reflector.pipelines.main_file_pipeline.get_transcripts_storage"
) as mock_get_transcripts2,
):
mock_storage.return_value = dummy
mock_get_transcripts.return_value = dummy
mock_get_transcripts2.return_value = dummy
yield
@@ -238,7 +321,10 @@ def celery_config():
@pytest.fixture(scope="session")
def celery_includes():
return ["reflector.pipelines.main_live_pipeline"]
return [
"reflector.pipelines.main_live_pipeline",
"reflector.pipelines.main_file_pipeline",
]
@pytest.fixture
@@ -280,7 +366,7 @@ async def fake_transcript_with_topics(tmpdir, client):
transcript = await transcripts_controller.get_by_id(tid)
assert transcript is not None
await transcripts_controller.update(transcript, {"status": "finished"})
await transcripts_controller.update(transcript, {"status": "ended"})
# manually copy a file at the expected location
audio_filename = transcript.audio_mp3_filename

View File

@@ -1,7 +1,7 @@
version: '3.8'
version: "3.8"
services:
postgres_test:
image: postgres:15
image: postgres:17
environment:
POSTGRES_DB: reflector_test
POSTGRES_USER: test_user
@@ -10,4 +10,4 @@ services:
- "15432:5432"
command: postgres -c fsync=off -c synchronous_commit=off -c full_page_writes=off
tmpfs:
- /var/lib/postgresql/data:rw,noexec,nosuid,size=1g
- /var/lib/postgresql/data:rw,noexec,nosuid,size=1g

View File

@@ -1,351 +0,0 @@
"""
Tests for CalendarEvent model.
"""
from datetime import datetime, timedelta, timezone
import pytest
from reflector.db.calendar_events import CalendarEvent, calendar_events_controller
from reflector.db.rooms import rooms_controller
@pytest.mark.asyncio
async def test_calendar_event_create():
"""Test creating a calendar event."""
# Create a room first
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
# Create calendar event
now = datetime.now(timezone.utc)
event = CalendarEvent(
room_id=room.id,
ics_uid="test-event-123",
title="Team Meeting",
description="Weekly team sync",
start_time=now + timedelta(hours=1),
end_time=now + timedelta(hours=2),
location=f"https://example.com/room/{room.name}",
attendees=[
{"email": "alice@example.com", "name": "Alice", "status": "ACCEPTED"},
{"email": "bob@example.com", "name": "Bob", "status": "TENTATIVE"},
],
)
# Save event
saved_event = await calendar_events_controller.upsert(event)
assert saved_event.ics_uid == "test-event-123"
assert saved_event.title == "Team Meeting"
assert saved_event.room_id == room.id
assert len(saved_event.attendees) == 2
@pytest.mark.asyncio
async def test_calendar_event_get_by_room():
"""Test getting calendar events for a room."""
# Create room
room = await rooms_controller.add(
name="events-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
# Create multiple events
for i in range(3):
event = CalendarEvent(
room_id=room.id,
ics_uid=f"event-{i}",
title=f"Meeting {i}",
start_time=now + timedelta(hours=i),
end_time=now + timedelta(hours=i + 1),
)
await calendar_events_controller.upsert(event)
# Get events for room
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 3
assert all(e.room_id == room.id for e in events)
assert events[0].title == "Meeting 0"
assert events[1].title == "Meeting 1"
assert events[2].title == "Meeting 2"
@pytest.mark.asyncio
async def test_calendar_event_get_upcoming():
"""Test getting upcoming events within time window."""
# Create room
room = await rooms_controller.add(
name="upcoming-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
# Create events at different times
# Past event (should not be included)
past_event = CalendarEvent(
room_id=room.id,
ics_uid="past-event",
title="Past Meeting",
start_time=now - timedelta(hours=2),
end_time=now - timedelta(hours=1),
)
await calendar_events_controller.upsert(past_event)
# Upcoming event within 30 minutes
upcoming_event = CalendarEvent(
room_id=room.id,
ics_uid="upcoming-event",
title="Upcoming Meeting",
start_time=now + timedelta(minutes=15),
end_time=now + timedelta(minutes=45),
)
await calendar_events_controller.upsert(upcoming_event)
# Future event beyond 30 minutes
future_event = CalendarEvent(
room_id=room.id,
ics_uid="future-event",
title="Future Meeting",
start_time=now + timedelta(hours=2),
end_time=now + timedelta(hours=3),
)
await calendar_events_controller.upsert(future_event)
# Get upcoming events (default 30 minutes)
upcoming = await calendar_events_controller.get_upcoming(room.id)
assert len(upcoming) == 1
assert upcoming[0].ics_uid == "upcoming-event"
# Get upcoming with custom window
upcoming_extended = await calendar_events_controller.get_upcoming(
room.id, minutes_ahead=180
)
assert len(upcoming_extended) == 2
assert upcoming_extended[0].ics_uid == "upcoming-event"
assert upcoming_extended[1].ics_uid == "future-event"
@pytest.mark.asyncio
async def test_calendar_event_upsert():
"""Test upserting (create/update) calendar events."""
# Create room
room = await rooms_controller.add(
name="upsert-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
# Create new event
event = CalendarEvent(
room_id=room.id,
ics_uid="upsert-test",
title="Original Title",
start_time=now,
end_time=now + timedelta(hours=1),
)
created = await calendar_events_controller.upsert(event)
assert created.title == "Original Title"
# Update existing event
event.title = "Updated Title"
event.description = "Added description"
updated = await calendar_events_controller.upsert(event)
assert updated.title == "Updated Title"
assert updated.description == "Added description"
assert updated.ics_uid == "upsert-test"
# Verify only one event exists
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 1
assert events[0].title == "Updated Title"
@pytest.mark.asyncio
async def test_calendar_event_soft_delete():
"""Test soft deleting events no longer in calendar."""
# Create room
room = await rooms_controller.add(
name="delete-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
# Create multiple events
for i in range(4):
event = CalendarEvent(
room_id=room.id,
ics_uid=f"event-{i}",
title=f"Meeting {i}",
start_time=now + timedelta(hours=i),
end_time=now + timedelta(hours=i + 1),
)
await calendar_events_controller.upsert(event)
# Soft delete events not in current list
current_ids = ["event-0", "event-2"] # Keep events 0 and 2
deleted_count = await calendar_events_controller.soft_delete_missing(
room.id, current_ids
)
assert deleted_count == 2 # Should delete events 1 and 3
# Get non-deleted events
events = await calendar_events_controller.get_by_room(
room.id, include_deleted=False
)
assert len(events) == 2
assert {e.ics_uid for e in events} == {"event-0", "event-2"}
# Get all events including deleted
all_events = await calendar_events_controller.get_by_room(
room.id, include_deleted=True
)
assert len(all_events) == 4
@pytest.mark.asyncio
async def test_calendar_event_past_events_not_deleted():
"""Test that past events are not soft deleted."""
# Create room
room = await rooms_controller.add(
name="past-events-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
# Create past event
past_event = CalendarEvent(
room_id=room.id,
ics_uid="past-event",
title="Past Meeting",
start_time=now - timedelta(hours=2),
end_time=now - timedelta(hours=1),
)
await calendar_events_controller.upsert(past_event)
# Create future event
future_event = CalendarEvent(
room_id=room.id,
ics_uid="future-event",
title="Future Meeting",
start_time=now + timedelta(hours=1),
end_time=now + timedelta(hours=2),
)
await calendar_events_controller.upsert(future_event)
# Try to soft delete all events (only future should be deleted)
deleted_count = await calendar_events_controller.soft_delete_missing(room.id, [])
assert deleted_count == 1 # Only future event deleted
# Verify past event still exists
events = await calendar_events_controller.get_by_room(
room.id, include_deleted=False
)
assert len(events) == 1
assert events[0].ics_uid == "past-event"
@pytest.mark.asyncio
async def test_calendar_event_with_raw_ics_data():
"""Test storing raw ICS data with calendar event."""
# Create room
room = await rooms_controller.add(
name="raw-ics-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
raw_ics = """BEGIN:VEVENT
UID:test-raw-123
SUMMARY:Test Event
DTSTART:20240101T100000Z
DTEND:20240101T110000Z
END:VEVENT"""
event = CalendarEvent(
room_id=room.id,
ics_uid="test-raw-123",
title="Test Event",
start_time=datetime.now(timezone.utc),
end_time=datetime.now(timezone.utc) + timedelta(hours=1),
ics_raw_data=raw_ics,
)
saved = await calendar_events_controller.upsert(event)
assert saved.ics_raw_data == raw_ics
# Retrieve and verify
retrieved = await calendar_events_controller.get_by_ics_uid(room.id, "test-raw-123")
assert retrieved is not None
assert retrieved.ics_raw_data == raw_ics

View File

@@ -0,0 +1,287 @@
from datetime import datetime, timedelta, timezone
from unittest.mock import AsyncMock, patch
import pytest
from reflector.db.recordings import Recording, recordings_controller
from reflector.db.transcripts import SourceKind, transcripts_controller
from reflector.worker.cleanup import cleanup_old_public_data
@pytest.mark.asyncio
async def test_cleanup_old_public_data_skips_when_not_public():
"""Test that cleanup is skipped when PUBLIC_MODE is False."""
with patch("reflector.worker.cleanup.settings") as mock_settings:
mock_settings.PUBLIC_MODE = False
result = await cleanup_old_public_data()
# Should return early without doing anything
assert result is None
@pytest.mark.asyncio
async def test_cleanup_old_public_data_deletes_old_anonymous_transcripts():
"""Test that old anonymous transcripts are deleted."""
# Create old and new anonymous transcripts
old_date = datetime.now(timezone.utc) - timedelta(days=8)
new_date = datetime.now(timezone.utc) - timedelta(days=2)
# Create old anonymous transcript (should be deleted)
old_transcript = await transcripts_controller.add(
name="Old Anonymous Transcript",
source_kind=SourceKind.FILE,
user_id=None, # Anonymous
)
# Manually update created_at to be old
from reflector.db import get_database
from reflector.db.transcripts import transcripts
await get_database().execute(
transcripts.update()
.where(transcripts.c.id == old_transcript.id)
.values(created_at=old_date)
)
# Create new anonymous transcript (should NOT be deleted)
new_transcript = await transcripts_controller.add(
name="New Anonymous Transcript",
source_kind=SourceKind.FILE,
user_id=None, # Anonymous
)
# Create old transcript with user (should NOT be deleted)
old_user_transcript = await transcripts_controller.add(
name="Old User Transcript",
source_kind=SourceKind.FILE,
user_id="user123",
)
await get_database().execute(
transcripts.update()
.where(transcripts.c.id == old_user_transcript.id)
.values(created_at=old_date)
)
with patch("reflector.worker.cleanup.settings") as mock_settings:
mock_settings.PUBLIC_MODE = True
mock_settings.PUBLIC_DATA_RETENTION_DAYS = 7
# Mock the storage deletion
with patch("reflector.db.transcripts.get_transcripts_storage") as mock_storage:
mock_storage.return_value.delete_file = AsyncMock()
result = await cleanup_old_public_data()
# Check results
assert result["transcripts_deleted"] == 1
assert result["errors"] == []
# Verify old anonymous transcript was deleted
assert await transcripts_controller.get_by_id(old_transcript.id) is None
# Verify new anonymous transcript still exists
assert await transcripts_controller.get_by_id(new_transcript.id) is not None
# Verify user transcript still exists
assert await transcripts_controller.get_by_id(old_user_transcript.id) is not None
@pytest.mark.asyncio
async def test_cleanup_deletes_associated_meeting_and_recording():
"""Test that meetings and recordings associated with old transcripts are deleted."""
from reflector.db import get_database
from reflector.db.meetings import meetings
from reflector.db.transcripts import transcripts
old_date = datetime.now(timezone.utc) - timedelta(days=8)
# Create a meeting
meeting_id = "test-meeting-for-transcript"
await get_database().execute(
meetings.insert().values(
id=meeting_id,
room_name="Meeting with Transcript",
room_url="https://example.com/meeting",
host_room_url="https://example.com/meeting-host",
start_date=old_date,
end_date=old_date + timedelta(hours=1),
user_id=None,
room_id=None,
)
)
# Create a recording
recording = await recordings_controller.create(
Recording(
bucket_name="test-bucket",
object_key="test-recording.mp4",
recorded_at=old_date,
)
)
# Create an old transcript with both meeting and recording
old_transcript = await transcripts_controller.add(
name="Old Transcript with Meeting and Recording",
source_kind=SourceKind.ROOM,
user_id=None,
meeting_id=meeting_id,
recording_id=recording.id,
)
# Update created_at to be old
await get_database().execute(
transcripts.update()
.where(transcripts.c.id == old_transcript.id)
.values(created_at=old_date)
)
with patch("reflector.worker.cleanup.settings") as mock_settings:
mock_settings.PUBLIC_MODE = True
mock_settings.PUBLIC_DATA_RETENTION_DAYS = 7
# Mock storage deletion
with patch("reflector.db.transcripts.get_transcripts_storage") as mock_storage:
mock_storage.return_value.delete_file = AsyncMock()
with patch(
"reflector.worker.cleanup.get_recordings_storage"
) as mock_rec_storage:
mock_rec_storage.return_value.delete_file = AsyncMock()
result = await cleanup_old_public_data()
# Check results
assert result["transcripts_deleted"] == 1
assert result["meetings_deleted"] == 1
assert result["recordings_deleted"] == 1
assert result["errors"] == []
# Verify transcript was deleted
assert await transcripts_controller.get_by_id(old_transcript.id) is None
# Verify meeting was deleted
query = meetings.select().where(meetings.c.id == meeting_id)
meeting_result = await get_database().fetch_one(query)
assert meeting_result is None
# Verify recording was deleted
assert await recordings_controller.get_by_id(recording.id) is None
@pytest.mark.asyncio
async def test_cleanup_handles_errors_gracefully():
"""Test that cleanup continues even when individual deletions fail."""
old_date = datetime.now(timezone.utc) - timedelta(days=8)
# Create multiple old transcripts
transcript1 = await transcripts_controller.add(
name="Transcript 1",
source_kind=SourceKind.FILE,
user_id=None,
)
transcript2 = await transcripts_controller.add(
name="Transcript 2",
source_kind=SourceKind.FILE,
user_id=None,
)
# Update created_at to be old
from reflector.db import get_database
from reflector.db.transcripts import transcripts
for t_id in [transcript1.id, transcript2.id]:
await get_database().execute(
transcripts.update()
.where(transcripts.c.id == t_id)
.values(created_at=old_date)
)
with patch("reflector.worker.cleanup.settings") as mock_settings:
mock_settings.PUBLIC_MODE = True
mock_settings.PUBLIC_DATA_RETENTION_DAYS = 7
# Mock remove_by_id to fail for the first transcript
original_remove = transcripts_controller.remove_by_id
call_count = 0
async def mock_remove_by_id(transcript_id, user_id=None):
nonlocal call_count
call_count += 1
if call_count == 1:
raise Exception("Simulated deletion error")
return await original_remove(transcript_id, user_id)
with patch.object(
transcripts_controller, "remove_by_id", side_effect=mock_remove_by_id
):
result = await cleanup_old_public_data()
# Should have one successful deletion and one error
assert result["transcripts_deleted"] == 1
assert len(result["errors"]) == 1
assert "Failed to delete transcript" in result["errors"][0]
@pytest.mark.asyncio
async def test_meeting_consent_cascade_delete():
"""Test that meeting_consent records are automatically deleted when meeting is deleted."""
from reflector.db import get_database
from reflector.db.meetings import (
meeting_consent,
meeting_consent_controller,
meetings,
)
# Create a meeting
meeting_id = "test-cascade-meeting"
await get_database().execute(
meetings.insert().values(
id=meeting_id,
room_name="Test Meeting for CASCADE",
room_url="https://example.com/cascade-test",
host_room_url="https://example.com/cascade-test-host",
start_date=datetime.now(timezone.utc),
end_date=datetime.now(timezone.utc) + timedelta(hours=1),
user_id="test-user",
room_id=None,
)
)
# Create consent records for this meeting
consent1_id = "consent-1"
consent2_id = "consent-2"
await get_database().execute(
meeting_consent.insert().values(
id=consent1_id,
meeting_id=meeting_id,
user_id="user1",
consent_given=True,
consent_timestamp=datetime.now(timezone.utc),
)
)
await get_database().execute(
meeting_consent.insert().values(
id=consent2_id,
meeting_id=meeting_id,
user_id="user2",
consent_given=False,
consent_timestamp=datetime.now(timezone.utc),
)
)
# Verify consent records exist
consents = await meeting_consent_controller.get_by_meeting_id(meeting_id)
assert len(consents) == 2
# Delete the meeting
await get_database().execute(meetings.delete().where(meetings.c.id == meeting_id))
# Verify meeting is deleted
query = meetings.select().where(meetings.c.id == meeting_id)
result = await get_database().fetch_one(query)
assert result is None
# Verify consent records are automatically deleted (CASCADE DELETE)
consents_after = await meeting_consent_controller.get_by_meeting_id(meeting_id)
assert len(consents_after) == 0

View File

@@ -0,0 +1,330 @@
"""
Tests for GPU Modal transcription endpoints.
These tests are marked with the "gpu-modal" group and will not run by default.
Run them with: pytest -m gpu-modal tests/test_gpu_modal_transcript_parakeet.py
Required environment variables:
- TRANSCRIPT_URL: URL to the Modal.com endpoint (required)
- TRANSCRIPT_MODAL_API_KEY: API key for authentication (optional)
- TRANSCRIPT_MODEL: Model name to use (optional, defaults to nvidia/parakeet-tdt-0.6b-v2)
Example with pytest (override default addopts to run ONLY gpu_modal tests):
TRANSCRIPT_URL=https://monadical-sas--reflector-transcriber-parakeet-web-dev.modal.run \
TRANSCRIPT_MODAL_API_KEY=your-api-key \
uv run -m pytest -m gpu_modal --no-cov tests/test_gpu_modal_transcript.py
# Or with completely clean options:
uv run -m pytest -m gpu_modal -o addopts="" tests/
Running Modal locally for testing:
modal serve gpu/modal_deployments/reflector_transcriber_parakeet.py
# This will give you a local URL like https://xxxxx--reflector-transcriber-parakeet-web-dev.modal.run to test against
"""
import os
import tempfile
from pathlib import Path
import httpx
import pytest
# Test audio file URL for testing
TEST_AUDIO_URL = (
"https://reflector-github-pytest.s3.us-east-1.amazonaws.com/test_mathieu_hello.mp3"
)
def get_modal_transcript_url():
"""Get and validate the Modal transcript URL from environment."""
url = os.environ.get("TRANSCRIPT_URL")
if not url:
pytest.skip(
"TRANSCRIPT_URL environment variable is required for GPU Modal tests"
)
return url
def get_auth_headers():
"""Get authentication headers if API key is available."""
api_key = os.environ.get("TRANSCRIPT_MODAL_API_KEY")
if api_key:
return {"Authorization": f"Bearer {api_key}"}
return {}
def get_model_name():
"""Get the model name from environment or use default."""
return os.environ.get("TRANSCRIPT_MODEL", "nvidia/parakeet-tdt-0.6b-v2")
@pytest.mark.gpu_modal
class TestGPUModalTranscript:
"""Test suite for GPU Modal transcription endpoints."""
def test_transcriptions_from_url(self):
"""Test the /v1/audio/transcriptions-from-url endpoint."""
url = get_modal_transcript_url()
headers = get_auth_headers()
with httpx.Client(timeout=60.0) as client:
response = client.post(
f"{url}/v1/audio/transcriptions-from-url",
json={
"audio_file_url": TEST_AUDIO_URL,
"model": get_model_name(),
"language": "en",
"timestamp_offset": 0.0,
},
headers=headers,
)
assert response.status_code == 200, f"Request failed: {response.text}"
result = response.json()
# Verify response structure
assert "text" in result
assert "words" in result
assert isinstance(result["text"], str)
assert isinstance(result["words"], list)
# Verify content is meaningful
assert len(result["text"]) > 0, "Transcript text should not be empty"
assert len(result["words"]) > 0, "Words list must not be empty"
# Verify word structure
for word in result["words"]:
assert "word" in word
assert "start" in word
assert "end" in word
assert isinstance(word["start"], (int, float))
assert isinstance(word["end"], (int, float))
assert word["start"] <= word["end"]
def test_transcriptions_single_file(self):
"""Test the /v1/audio/transcriptions endpoint with a single file."""
url = get_modal_transcript_url()
headers = get_auth_headers()
# Download test audio file to upload
with httpx.Client(timeout=60.0) as client:
audio_response = client.get(TEST_AUDIO_URL)
audio_response.raise_for_status()
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp_file:
tmp_file.write(audio_response.content)
tmp_file_path = tmp_file.name
try:
# Upload the file for transcription
with open(tmp_file_path, "rb") as f:
files = {"file": ("test_audio.mp3", f, "audio/mpeg")}
data = {
"model": get_model_name(),
"language": "en",
"batch": "false",
}
response = client.post(
f"{url}/v1/audio/transcriptions",
files=files,
data=data,
headers=headers,
)
assert response.status_code == 200, f"Request failed: {response.text}"
result = response.json()
# Verify response structure for single file
assert "text" in result
assert "words" in result
assert "filename" in result
assert isinstance(result["text"], str)
assert isinstance(result["words"], list)
# Verify content
assert len(result["text"]) > 0, "Transcript text should not be empty"
finally:
Path(tmp_file_path).unlink(missing_ok=True)
def test_transcriptions_multiple_files(self):
"""Test the /v1/audio/transcriptions endpoint with multiple files (non-batch mode)."""
url = get_modal_transcript_url()
headers = get_auth_headers()
# Create multiple test files (we'll use the same audio content for simplicity)
with httpx.Client(timeout=60.0) as client:
audio_response = client.get(TEST_AUDIO_URL)
audio_response.raise_for_status()
audio_content = audio_response.content
temp_files = []
try:
# Create 3 temporary files
for i in range(3):
tmp_file = tempfile.NamedTemporaryFile(suffix=".mp3", delete=False)
tmp_file.write(audio_content)
tmp_file.close()
temp_files.append(tmp_file.name)
# Upload multiple files for transcription (non-batch)
files = [
("files", (f"test_audio_{i}.mp3", open(f, "rb"), "audio/mpeg"))
for i, f in enumerate(temp_files)
]
data = {
"model": get_model_name(),
"language": "en",
"batch": "false",
}
response = client.post(
f"{url}/v1/audio/transcriptions",
files=files,
data=data,
headers=headers,
)
# Close file handles
for _, file_tuple in files:
file_tuple[1].close()
assert response.status_code == 200, f"Request failed: {response.text}"
result = response.json()
# Verify response structure for multiple files (non-batch)
assert "results" in result
assert isinstance(result["results"], list)
assert len(result["results"]) == 3
for idx, file_result in enumerate(result["results"]):
assert "text" in file_result
assert "words" in file_result
assert "filename" in file_result
assert isinstance(file_result["text"], str)
assert isinstance(file_result["words"], list)
assert len(file_result["text"]) > 0
finally:
for f in temp_files:
Path(f).unlink(missing_ok=True)
def test_transcriptions_multiple_files_batch(self):
"""Test the /v1/audio/transcriptions endpoint with multiple files in batch mode."""
url = get_modal_transcript_url()
headers = get_auth_headers()
# Create multiple test files
with httpx.Client(timeout=60.0) as client:
audio_response = client.get(TEST_AUDIO_URL)
audio_response.raise_for_status()
audio_content = audio_response.content
temp_files = []
try:
# Create 3 temporary files
for i in range(3):
tmp_file = tempfile.NamedTemporaryFile(suffix=".mp3", delete=False)
tmp_file.write(audio_content)
tmp_file.close()
temp_files.append(tmp_file.name)
# Upload multiple files for batch transcription
files = [
("files", (f"test_audio_{i}.mp3", open(f, "rb"), "audio/mpeg"))
for i, f in enumerate(temp_files)
]
data = {
"model": get_model_name(),
"language": "en",
"batch": "true",
}
response = client.post(
f"{url}/v1/audio/transcriptions",
files=files,
data=data,
headers=headers,
)
# Close file handles
for _, file_tuple in files:
file_tuple[1].close()
assert response.status_code == 200, f"Request failed: {response.text}"
result = response.json()
# Verify response structure for batch mode
assert "results" in result
assert isinstance(result["results"], list)
assert len(result["results"]) == 3
for idx, batch_result in enumerate(result["results"]):
assert "text" in batch_result
assert "words" in batch_result
assert "filename" in batch_result
assert isinstance(batch_result["text"], str)
assert isinstance(batch_result["words"], list)
assert len(batch_result["text"]) > 0
finally:
for f in temp_files:
Path(f).unlink(missing_ok=True)
def test_transcriptions_error_handling(self):
"""Test error handling for invalid requests."""
url = get_modal_transcript_url()
headers = get_auth_headers()
with httpx.Client(timeout=60.0) as client:
# Test with unsupported language
response = client.post(
f"{url}/v1/audio/transcriptions-from-url",
json={
"audio_file_url": TEST_AUDIO_URL,
"model": get_model_name(),
"language": "fr", # Parakeet only supports English
"timestamp_offset": 0.0,
},
headers=headers,
)
assert response.status_code == 400
assert "only supports English" in response.text
def test_transcriptions_with_timestamp_offset(self):
"""Test transcription with timestamp offset parameter."""
url = get_modal_transcript_url()
headers = get_auth_headers()
with httpx.Client(timeout=60.0) as client:
# Test with timestamp offset
response = client.post(
f"{url}/v1/audio/transcriptions-from-url",
json={
"audio_file_url": TEST_AUDIO_URL,
"model": get_model_name(),
"language": "en",
"timestamp_offset": 10.0, # Add 10 second offset
},
headers=headers,
)
assert response.status_code == 200, f"Request failed: {response.text}"
result = response.json()
# Verify response structure
assert "text" in result
assert "words" in result
assert len(result["words"]) > 0, "Words list must not be empty"
# Verify that timestamps have been offset
for word in result["words"]:
# All timestamps should be >= 10.0 due to offset
assert (
word["start"] >= 10.0
), f"Word start time {word['start']} should be >= 10.0"
assert (
word["end"] >= 10.0
), f"Word end time {word['end']} should be >= 10.0"

View File

@@ -1,230 +0,0 @@
from datetime import datetime, timedelta, timezone
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from icalendar import Calendar, Event
from reflector.db.calendar_events import calendar_events_controller
from reflector.db.rooms import rooms_controller
from reflector.worker.ics_sync import (
_should_sync,
_sync_all_ics_calendars_async,
_sync_room_ics_async,
)
@pytest.mark.asyncio
async def test_sync_room_ics_task():
room = await rooms_controller.add(
name="task-test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/task.ics",
ics_enabled=True,
)
cal = Calendar()
event = Event()
event.add("uid", "task-event-1")
event.add("summary", "Task Test Meeting")
from reflector.settings import settings
event.add("location", f"{settings.BASE_URL}/room/{room.name}")
now = datetime.now(timezone.utc)
event.add("dtstart", now + timedelta(hours=1))
event.add("dtend", now + timedelta(hours=2))
cal.add_component(event)
ics_content = cal.to_ical().decode("utf-8")
with patch(
"reflector.services.ics_sync.ICSFetchService.fetch_ics", new_callable=AsyncMock
) as mock_fetch:
mock_fetch.return_value = ics_content
await _sync_room_ics_async(room.id)
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 1
assert events[0].ics_uid == "task-event-1"
@pytest.mark.asyncio
async def test_sync_room_ics_disabled():
room = await rooms_controller.add(
name="disabled-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_enabled=False,
)
await _sync_room_ics_async(room.id)
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 0
@pytest.mark.asyncio
async def test_sync_all_ics_calendars():
room1 = await rooms_controller.add(
name="sync-all-1",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/1.ics",
ics_enabled=True,
)
room2 = await rooms_controller.add(
name="sync-all-2",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/2.ics",
ics_enabled=True,
)
room3 = await rooms_controller.add(
name="sync-all-3",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_enabled=False,
)
with patch("reflector.worker.ics_sync.sync_room_ics.delay") as mock_delay:
await _sync_all_ics_calendars_async()
assert mock_delay.call_count == 2
called_room_ids = [call.args[0] for call in mock_delay.call_args_list]
assert room1.id in called_room_ids
assert room2.id in called_room_ids
assert room3.id not in called_room_ids
@pytest.mark.asyncio
async def test_should_sync_logic():
room = MagicMock()
room.ics_last_sync = None
assert _should_sync(room) is True
room.ics_last_sync = datetime.now(timezone.utc) - timedelta(seconds=100)
room.ics_fetch_interval = 300
assert _should_sync(room) is False
room.ics_last_sync = datetime.now(timezone.utc) - timedelta(seconds=400)
room.ics_fetch_interval = 300
assert _should_sync(room) is True
@pytest.mark.asyncio
async def test_sync_respects_fetch_interval():
now = datetime.now(timezone.utc)
room1 = await rooms_controller.add(
name="interval-test-1",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/interval.ics",
ics_enabled=True,
ics_fetch_interval=300,
)
await rooms_controller.update(
room1,
{"ics_last_sync": now - timedelta(seconds=100)},
)
room2 = await rooms_controller.add(
name="interval-test-2",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/interval2.ics",
ics_enabled=True,
ics_fetch_interval=60,
)
await rooms_controller.update(
room2,
{"ics_last_sync": now - timedelta(seconds=100)},
)
with patch("reflector.worker.ics_sync.sync_room_ics.delay") as mock_delay:
await _sync_all_ics_calendars_async()
assert mock_delay.call_count == 1
assert mock_delay.call_args[0][0] == room2.id
@pytest.mark.asyncio
async def test_sync_handles_errors_gracefully():
room = await rooms_controller.add(
name="error-task-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/error.ics",
ics_enabled=True,
)
with patch(
"reflector.services.ics_sync.ICSFetchService.fetch_ics", new_callable=AsyncMock
) as mock_fetch:
mock_fetch.side_effect = Exception("Network error")
await _sync_room_ics_async(room.id)
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 0

View File

@@ -1,289 +0,0 @@
from datetime import datetime, timedelta, timezone
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
from icalendar import Calendar, Event
from reflector.db.calendar_events import calendar_events_controller
from reflector.db.rooms import rooms_controller
from reflector.services.ics_sync import ICSFetchService, ICSSyncService
@pytest.mark.asyncio
async def test_ics_fetch_service_event_matching():
service = ICSFetchService()
room_name = "test-room"
room_url = "https://example.com/room/test-room"
# Create test event
event = Event()
event.add("uid", "test-123")
event.add("summary", "Test Meeting")
# Test matching with full URL in location
event.add("location", "https://example.com/room/test-room")
assert service._event_matches_room(event, room_name, room_url) is True
# Test matching with URL without protocol
event["location"] = "example.com/room/test-room"
assert service._event_matches_room(event, room_name, room_url) is True
# Test matching in description
event["location"] = "Conference Room A"
event.add("description", f"Join at {room_url}")
assert service._event_matches_room(event, room_name, room_url) is True
# Test non-matching
event["location"] = "Different Room"
event["description"] = "No room URL here"
assert service._event_matches_room(event, room_name, room_url) is False
# Test partial paths should NOT match anymore
event["location"] = "/room/test-room"
assert service._event_matches_room(event, room_name, room_url) is False
event["location"] = f"Room: {room_name}"
assert service._event_matches_room(event, room_name, room_url) is False
@pytest.mark.asyncio
async def test_ics_fetch_service_parse_event():
service = ICSFetchService()
# Create test event
event = Event()
event.add("uid", "test-456")
event.add("summary", "Team Standup")
event.add("description", "Daily team sync")
event.add("location", "https://example.com/room/standup")
now = datetime.now(timezone.utc)
event.add("dtstart", now)
event.add("dtend", now + timedelta(hours=1))
# Add attendees
event.add("attendee", "mailto:alice@example.com", parameters={"CN": "Alice"})
event.add("attendee", "mailto:bob@example.com", parameters={"CN": "Bob"})
event.add("organizer", "mailto:carol@example.com", parameters={"CN": "Carol"})
# Parse event
result = service._parse_event(event)
assert result is not None
assert result["ics_uid"] == "test-456"
assert result["title"] == "Team Standup"
assert result["description"] == "Daily team sync"
assert result["location"] == "https://example.com/room/standup"
assert len(result["attendees"]) == 3 # 2 attendees + 1 organizer
@pytest.mark.asyncio
async def test_ics_fetch_service_extract_room_events():
service = ICSFetchService()
room_name = "meeting"
room_url = "https://example.com/room/meeting"
# Create calendar with multiple events
cal = Calendar()
# Event 1: Matches room
event1 = Event()
event1.add("uid", "match-1")
event1.add("summary", "Planning Meeting")
event1.add("location", room_url)
now = datetime.now(timezone.utc)
event1.add("dtstart", now + timedelta(hours=2))
event1.add("dtend", now + timedelta(hours=3))
cal.add_component(event1)
# Event 2: Doesn't match room
event2 = Event()
event2.add("uid", "no-match")
event2.add("summary", "Other Meeting")
event2.add("location", "https://example.com/room/other")
event2.add("dtstart", now + timedelta(hours=4))
event2.add("dtend", now + timedelta(hours=5))
cal.add_component(event2)
# Event 3: Matches room in description
event3 = Event()
event3.add("uid", "match-2")
event3.add("summary", "Review Session")
event3.add("description", f"Meeting link: {room_url}")
event3.add("dtstart", now + timedelta(hours=6))
event3.add("dtend", now + timedelta(hours=7))
cal.add_component(event3)
# Event 4: Cancelled event (should be skipped)
event4 = Event()
event4.add("uid", "cancelled")
event4.add("summary", "Cancelled Meeting")
event4.add("location", room_url)
event4.add("status", "CANCELLED")
event4.add("dtstart", now + timedelta(hours=8))
event4.add("dtend", now + timedelta(hours=9))
cal.add_component(event4)
# Extract events
events = service.extract_room_events(cal, room_name, room_url)
assert len(events) == 2
assert events[0]["ics_uid"] == "match-1"
assert events[1]["ics_uid"] == "match-2"
@pytest.mark.asyncio
async def test_ics_sync_service_sync_room_calendar():
# Create room
room = await rooms_controller.add(
name="sync-test",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/test.ics",
ics_enabled=True,
)
# Mock ICS content
cal = Calendar()
event = Event()
event.add("uid", "sync-event-1")
event.add("summary", "Sync Test Meeting")
# Use the actual BASE_URL from settings
from reflector.settings import settings
event.add("location", f"{settings.BASE_URL}/room/{room.name}")
now = datetime.now(timezone.utc)
event.add("dtstart", now + timedelta(hours=1))
event.add("dtend", now + timedelta(hours=2))
cal.add_component(event)
ics_content = cal.to_ical().decode("utf-8")
# Create sync service and mock fetch
sync_service = ICSSyncService()
with patch.object(
sync_service.fetch_service, "fetch_ics", new_callable=AsyncMock
) as mock_fetch:
mock_fetch.return_value = ics_content
# First sync
result = await sync_service.sync_room_calendar(room)
assert result["status"] == "success"
assert result["events_found"] == 1
assert result["events_created"] == 1
assert result["events_updated"] == 0
assert result["events_deleted"] == 0
# Verify event was created
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 1
assert events[0].ics_uid == "sync-event-1"
assert events[0].title == "Sync Test Meeting"
# Second sync with same content (should be unchanged)
# Refresh room to get updated etag and force sync by setting old sync time
room = await rooms_controller.get_by_id(room.id)
await rooms_controller.update(
room, {"ics_last_sync": datetime.now(timezone.utc) - timedelta(minutes=10)}
)
result = await sync_service.sync_room_calendar(room)
assert result["status"] == "unchanged"
# Third sync with updated event
event["summary"] = "Updated Meeting Title"
cal = Calendar()
cal.add_component(event)
ics_content = cal.to_ical().decode("utf-8")
mock_fetch.return_value = ics_content
# Force sync by clearing etag
await rooms_controller.update(room, {"ics_last_etag": None})
result = await sync_service.sync_room_calendar(room)
assert result["status"] == "success"
assert result["events_created"] == 0
assert result["events_updated"] == 1
# Verify event was updated
events = await calendar_events_controller.get_by_room(room.id)
assert len(events) == 1
assert events[0].title == "Updated Meeting Title"
@pytest.mark.asyncio
async def test_ics_sync_service_should_sync():
service = ICSSyncService()
# Room never synced
room = MagicMock()
room.ics_last_sync = None
room.ics_fetch_interval = 300
assert service._should_sync(room) is True
# Room synced recently
room.ics_last_sync = datetime.now(timezone.utc) - timedelta(seconds=100)
assert service._should_sync(room) is False
# Room sync due
room.ics_last_sync = datetime.now(timezone.utc) - timedelta(seconds=400)
assert service._should_sync(room) is True
@pytest.mark.asyncio
async def test_ics_sync_service_skip_disabled():
service = ICSSyncService()
# Room with ICS disabled
room = MagicMock()
room.ics_enabled = False
room.ics_url = "https://calendar.example.com/test.ics"
result = await service.sync_room_calendar(room)
assert result["status"] == "skipped"
assert result["reason"] == "ICS not configured"
# Room without URL
room.ics_enabled = True
room.ics_url = None
result = await service.sync_room_calendar(room)
assert result["status"] == "skipped"
assert result["reason"] == "ICS not configured"
@pytest.mark.asyncio
async def test_ics_sync_service_error_handling():
# Create room
room = await rooms_controller.add(
name="error-test",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/error.ics",
ics_enabled=True,
)
sync_service = ICSSyncService()
with patch.object(
sync_service.fetch_service, "fetch_ics", new_callable=AsyncMock
) as mock_fetch:
mock_fetch.side_effect = Exception("Network error")
result = await sync_service.sync_room_calendar(room)
assert result["status"] == "error"
assert "Network error" in result["error"]

View File

@@ -1,283 +0,0 @@
"""Tests for multiple active meetings per room functionality."""
from datetime import datetime, timedelta, timezone
import pytest
from reflector.db.calendar_events import CalendarEvent, calendar_events_controller
from reflector.db.meetings import meetings_controller
from reflector.db.rooms import rooms_controller
@pytest.mark.asyncio
async def test_multiple_active_meetings_per_room():
"""Test that multiple active meetings can exist for the same room."""
# Create a room
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
current_time = datetime.now(timezone.utc)
end_time = current_time + timedelta(hours=2)
# Create first meeting
meeting1 = await meetings_controller.create(
id="meeting-1",
room_name="test-meeting-1",
room_url="https://whereby.com/test-1",
host_room_url="https://whereby.com/test-1-host",
start_date=current_time,
end_date=end_time,
user_id="test-user",
room=room,
)
# Create second meeting for the same room (should succeed now)
meeting2 = await meetings_controller.create(
id="meeting-2",
room_name="test-meeting-2",
room_url="https://whereby.com/test-2",
host_room_url="https://whereby.com/test-2-host",
start_date=current_time,
end_date=end_time,
user_id="test-user",
room=room,
)
# Both meetings should be active
active_meetings = await meetings_controller.get_all_active_for_room(
room=room, current_time=current_time
)
assert len(active_meetings) == 2
assert meeting1.id in [m.id for m in active_meetings]
assert meeting2.id in [m.id for m in active_meetings]
@pytest.mark.asyncio
async def test_get_active_by_calendar_event():
"""Test getting active meeting by calendar event ID."""
# Create a room
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
# Create a calendar event
event = CalendarEvent(
room_id=room.id,
ics_uid="test-event-uid",
title="Test Meeting",
start_time=datetime.now(timezone.utc),
end_time=datetime.now(timezone.utc) + timedelta(hours=1),
)
event = await calendar_events_controller.upsert(event)
current_time = datetime.now(timezone.utc)
end_time = current_time + timedelta(hours=2)
# Create meeting linked to calendar event
meeting = await meetings_controller.create(
id="meeting-cal-1",
room_name="test-meeting-cal",
room_url="https://whereby.com/test-cal",
host_room_url="https://whereby.com/test-cal-host",
start_date=current_time,
end_date=end_time,
user_id="test-user",
room=room,
calendar_event_id=event.id,
calendar_metadata={"title": event.title},
)
# Should find the meeting by calendar event
found_meeting = await meetings_controller.get_active_by_calendar_event(
room=room, calendar_event_id=event.id, current_time=current_time
)
assert found_meeting is not None
assert found_meeting.id == meeting.id
assert found_meeting.calendar_event_id == event.id
@pytest.mark.asyncio
async def test_grace_period_logic():
"""Test that meetings have a grace period after last participant leaves."""
# Create a room
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
current_time = datetime.now(timezone.utc)
end_time = current_time + timedelta(hours=2)
# Create meeting
meeting = await meetings_controller.create(
id="meeting-grace",
room_name="test-meeting-grace",
room_url="https://whereby.com/test-grace",
host_room_url="https://whereby.com/test-grace-host",
start_date=current_time,
end_date=end_time,
user_id="test-user",
room=room,
)
# Test grace period logic by simulating different states
# Simulate first time all participants left
await meetings_controller.update_meeting(
meeting.id, num_clients=0, last_participant_left_at=current_time
)
# Within grace period (10 min) - should still be active
await meetings_controller.update_meeting(
meeting.id, last_participant_left_at=current_time - timedelta(minutes=10)
)
updated_meeting = await meetings_controller.get_by_id(meeting.id)
assert updated_meeting.is_active is True # Still active during grace period
# Simulate grace period expired (20 min) and deactivate
await meetings_controller.update_meeting(
meeting.id, last_participant_left_at=current_time - timedelta(minutes=20)
)
# Manually test the grace period logic that would be in process_meetings
updated_meeting = await meetings_controller.get_by_id(meeting.id)
if updated_meeting.last_participant_left_at:
grace_period = timedelta(minutes=updated_meeting.grace_period_minutes)
if current_time > updated_meeting.last_participant_left_at + grace_period:
await meetings_controller.update_meeting(meeting.id, is_active=False)
updated_meeting = await meetings_controller.get_by_id(meeting.id)
assert updated_meeting.is_active is False # Now deactivated
@pytest.mark.asyncio
async def test_calendar_meeting_force_close_after_30_min():
"""Test that calendar meetings force close 30 minutes after scheduled end."""
# Create a room
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
# Create a calendar event
event = CalendarEvent(
room_id=room.id,
ics_uid="test-event-force",
title="Test Meeting Force Close",
start_time=datetime.now(timezone.utc) - timedelta(hours=2),
end_time=datetime.now(timezone.utc) - timedelta(minutes=35), # Ended 35 min ago
)
event = await calendar_events_controller.upsert(event)
current_time = datetime.now(timezone.utc)
# Create meeting linked to calendar event
meeting = await meetings_controller.create(
id="meeting-force",
room_name="test-meeting-force",
room_url="https://whereby.com/test-force",
host_room_url="https://whereby.com/test-force-host",
start_date=event.start_time,
end_date=event.end_time,
user_id="test-user",
room=room,
calendar_event_id=event.id,
)
# Test that calendar meetings force close 30 min after scheduled end
# The meeting ended 35 minutes ago, so it should be force closed
# Manually test the force close logic that would be in process_meetings
if meeting.calendar_event_id:
if current_time > meeting.end_date + timedelta(minutes=30):
await meetings_controller.update_meeting(meeting.id, is_active=False)
updated_meeting = await meetings_controller.get_by_id(meeting.id)
assert updated_meeting.is_active is False # Force closed after 30 min
@pytest.mark.asyncio
async def test_participant_rejoin_clears_grace_period():
"""Test that participant rejoining clears the grace period."""
# Create a room
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
current_time = datetime.now(timezone.utc)
end_time = current_time + timedelta(hours=2)
# Create meeting with grace period already set
meeting = await meetings_controller.create(
id="meeting-rejoin",
room_name="test-meeting-rejoin",
room_url="https://whereby.com/test-rejoin",
host_room_url="https://whereby.com/test-rejoin-host",
start_date=current_time,
end_date=end_time,
user_id="test-user",
room=room,
)
# Set last_participant_left_at to simulate grace period
await meetings_controller.update_meeting(
meeting.id,
last_participant_left_at=current_time - timedelta(minutes=5),
num_clients=0,
)
# Simulate participant rejoining - clear grace period
await meetings_controller.update_meeting(
meeting.id, last_participant_left_at=None, num_clients=1
)
updated_meeting = await meetings_controller.get_by_id(meeting.id)
assert updated_meeting.last_participant_left_at is None # Grace period cleared
assert updated_meeting.is_active is True # Still active

View File

@@ -0,0 +1,633 @@
"""
Tests for PipelineMainFile - file-based processing pipeline
This test verifies the complete file processing pipeline without mocking much,
ensuring all processors are correctly invoked and the happy path works correctly.
"""
from pathlib import Path
from unittest.mock import AsyncMock, MagicMock, patch
from uuid import uuid4
import pytest
from reflector.pipelines.main_file_pipeline import PipelineMainFile
from reflector.processors.file_diarization import FileDiarizationOutput
from reflector.processors.types import (
DiarizationSegment,
TitleSummary,
Word,
)
from reflector.processors.types import (
Transcript as TranscriptType,
)
@pytest.fixture
async def dummy_file_transcript():
"""Mock FileTranscriptAutoProcessor for file processing"""
from reflector.processors.file_transcript import FileTranscriptProcessor
class TestFileTranscriptProcessor(FileTranscriptProcessor):
async def _transcript(self, data):
return TranscriptType(
text="Hello world. How are you today?",
words=[
Word(start=0.0, end=0.5, text="Hello", speaker=0),
Word(start=0.5, end=0.6, text=" ", speaker=0),
Word(start=0.6, end=1.0, text="world", speaker=0),
Word(start=1.0, end=1.1, text=".", speaker=0),
Word(start=1.1, end=1.2, text=" ", speaker=0),
Word(start=1.2, end=1.5, text="How", speaker=0),
Word(start=1.5, end=1.6, text=" ", speaker=0),
Word(start=1.6, end=1.8, text="are", speaker=0),
Word(start=1.8, end=1.9, text=" ", speaker=0),
Word(start=1.9, end=2.1, text="you", speaker=0),
Word(start=2.1, end=2.2, text=" ", speaker=0),
Word(start=2.2, end=2.5, text="today", speaker=0),
Word(start=2.5, end=2.6, text="?", speaker=0),
],
)
with patch(
"reflector.processors.file_transcript_auto.FileTranscriptAutoProcessor.__new__"
) as mock_auto:
mock_auto.return_value = TestFileTranscriptProcessor()
yield
@pytest.fixture
async def dummy_file_diarization():
"""Mock FileDiarizationAutoProcessor for file processing"""
from reflector.processors.file_diarization import FileDiarizationProcessor
class TestFileDiarizationProcessor(FileDiarizationProcessor):
async def _diarize(self, data):
return FileDiarizationOutput(
diarization=[
DiarizationSegment(start=0.0, end=1.1, speaker=0),
DiarizationSegment(start=1.2, end=2.6, speaker=1),
]
)
with patch(
"reflector.processors.file_diarization_auto.FileDiarizationAutoProcessor.__new__"
) as mock_auto:
mock_auto.return_value = TestFileDiarizationProcessor()
yield
@pytest.fixture
async def mock_transcript_in_db(tmpdir):
"""Create a mock transcript in the database"""
from reflector.db.transcripts import Transcript
from reflector.settings import settings
# Set the DATA_DIR to our tmpdir
original_data_dir = settings.DATA_DIR
settings.DATA_DIR = str(tmpdir)
transcript_id = str(uuid4())
data_path = Path(tmpdir) / transcript_id
data_path.mkdir(parents=True, exist_ok=True)
# Create mock transcript object
transcript = Transcript(
id=transcript_id,
name="Test Transcript",
status="processing",
source_kind="file",
source_language="en",
target_language="en",
)
# Mock the controller to return our transcript
try:
with patch(
"reflector.pipelines.main_file_pipeline.transcripts_controller.get_by_id"
) as mock_get:
mock_get.return_value = transcript
with patch(
"reflector.pipelines.main_live_pipeline.transcripts_controller.get_by_id"
) as mock_get2:
mock_get2.return_value = transcript
with patch(
"reflector.pipelines.main_live_pipeline.transcripts_controller.update"
) as mock_update:
mock_update.return_value = None
yield transcript
finally:
# Restore original DATA_DIR
settings.DATA_DIR = original_data_dir
@pytest.fixture
async def mock_storage():
"""Mock storage for file uploads"""
from reflector.storage.base import Storage
class TestStorage(Storage):
async def _put_file(self, path, data):
return None
async def _get_file_url(self, path):
return f"http://test-storage/{path}"
async def _get_file(self, path):
return b"test_audio_data"
async def _delete_file(self, path):
return None
storage = TestStorage()
# Add mock tracking for verification
storage._put_file = AsyncMock(side_effect=storage._put_file)
storage._get_file_url = AsyncMock(side_effect=storage._get_file_url)
with patch(
"reflector.pipelines.main_file_pipeline.get_transcripts_storage"
) as mock_get:
mock_get.return_value = storage
yield storage
@pytest.fixture
async def mock_audio_file_writer():
"""Mock AudioFileWriterProcessor to avoid actual file writing"""
with patch(
"reflector.pipelines.main_file_pipeline.AudioFileWriterProcessor"
) as mock_writer_class:
mock_writer = AsyncMock()
mock_writer.push = AsyncMock()
mock_writer.flush = AsyncMock()
mock_writer_class.return_value = mock_writer
yield mock_writer
@pytest.fixture
async def mock_waveform_processor():
"""Mock AudioWaveformProcessor"""
with patch(
"reflector.pipelines.main_file_pipeline.AudioWaveformProcessor"
) as mock_waveform_class:
mock_waveform = AsyncMock()
mock_waveform.set_pipeline = MagicMock()
mock_waveform.flush = AsyncMock()
mock_waveform_class.return_value = mock_waveform
yield mock_waveform
@pytest.fixture
async def mock_topic_detector():
"""Mock TranscriptTopicDetectorProcessor"""
with patch(
"reflector.pipelines.main_file_pipeline.TranscriptTopicDetectorProcessor"
) as mock_topic_class:
mock_topic = AsyncMock()
mock_topic.set_pipeline = MagicMock()
mock_topic.push = AsyncMock()
mock_topic.flush_called = False
# When flush is called, simulate topic detection by calling the callback
async def flush_with_callback():
mock_topic.flush_called = True
if hasattr(mock_topic, "_callback"):
# Create a minimal transcript for the TitleSummary
test_transcript = TranscriptType(words=[], text="test transcript")
await mock_topic._callback(
TitleSummary(
title="Test Topic",
summary="Test topic summary",
timestamp=0.0,
duration=10.0,
transcript=test_transcript,
)
)
mock_topic.flush = flush_with_callback
def init_with_callback(callback=None):
mock_topic._callback = callback
return mock_topic
mock_topic_class.side_effect = init_with_callback
yield mock_topic
@pytest.fixture
async def mock_title_processor():
"""Mock TranscriptFinalTitleProcessor"""
with patch(
"reflector.pipelines.main_file_pipeline.TranscriptFinalTitleProcessor"
) as mock_title_class:
mock_title = AsyncMock()
mock_title.set_pipeline = MagicMock()
mock_title.push = AsyncMock()
mock_title.flush_called = False
# When flush is called, simulate title generation by calling the callback
async def flush_with_callback():
mock_title.flush_called = True
if hasattr(mock_title, "_callback"):
from reflector.processors.types import FinalTitle
await mock_title._callback(FinalTitle(title="Test Title"))
mock_title.flush = flush_with_callback
def init_with_callback(callback=None):
mock_title._callback = callback
return mock_title
mock_title_class.side_effect = init_with_callback
yield mock_title
@pytest.fixture
async def mock_summary_processor():
"""Mock TranscriptFinalSummaryProcessor"""
with patch(
"reflector.pipelines.main_file_pipeline.TranscriptFinalSummaryProcessor"
) as mock_summary_class:
mock_summary = AsyncMock()
mock_summary.set_pipeline = MagicMock()
mock_summary.push = AsyncMock()
mock_summary.flush_called = False
# When flush is called, simulate summary generation by calling the callbacks
async def flush_with_callback():
mock_summary.flush_called = True
from reflector.processors.types import FinalLongSummary, FinalShortSummary
if hasattr(mock_summary, "_callback"):
await mock_summary._callback(
FinalLongSummary(long_summary="Test long summary", duration=10.0)
)
if hasattr(mock_summary, "_on_short_summary"):
await mock_summary._on_short_summary(
FinalShortSummary(short_summary="Test short summary", duration=10.0)
)
mock_summary.flush = flush_with_callback
def init_with_callback(transcript=None, callback=None, on_short_summary=None):
mock_summary._callback = callback
mock_summary._on_short_summary = on_short_summary
return mock_summary
mock_summary_class.side_effect = init_with_callback
yield mock_summary
@pytest.mark.asyncio
async def test_pipeline_main_file_process(
tmpdir,
mock_transcript_in_db,
dummy_file_transcript,
dummy_file_diarization,
mock_storage,
mock_audio_file_writer,
mock_waveform_processor,
mock_topic_detector,
mock_title_processor,
mock_summary_processor,
):
"""
Test the complete PipelineMainFile processing pipeline.
This test verifies:
1. Audio extraction and writing
2. Audio upload to storage
3. Parallel processing of transcription, diarization, and waveform
4. Assembly of transcript with diarization
5. Topic detection
6. Title and summary generation
"""
# Create a test audio file
test_audio_path = Path(__file__).parent / "records" / "test_mathieu_hello.wav"
# Copy test audio to the transcript's data path as if it was uploaded
upload_path = mock_transcript_in_db.data_path / "upload.wav"
upload_path.write_bytes(test_audio_path.read_bytes())
# Also create the audio.mp3 file that would be created by AudioFileWriterProcessor
# Since we're mocking AudioFileWriterProcessor, we need to create this manually
mp3_path = mock_transcript_in_db.data_path / "audio.mp3"
mp3_path.write_bytes(b"mock_mp3_data")
# Track callback invocations
callback_marks = {
"on_status": [],
"on_duration": [],
"on_waveform": [],
"on_topic": [],
"on_title": [],
"on_long_summary": [],
"on_short_summary": [],
}
# Create pipeline with mocked callbacks
pipeline = PipelineMainFile(transcript_id=mock_transcript_in_db.id)
# Override callbacks to track invocations
async def track_callback(name, data):
callback_marks[name].append(data)
# Call the original callback
original = getattr(PipelineMainFile, name)
return await original(pipeline, data)
for callback_name in callback_marks.keys():
setattr(
pipeline,
callback_name,
lambda data, n=callback_name: track_callback(n, data),
)
# Mock av.open for audio processing
with patch("reflector.pipelines.main_file_pipeline.av.open") as mock_av:
# Mock container for checking video streams
mock_container = MagicMock()
mock_container.streams.video = [] # No video streams (audio only)
mock_container.close = MagicMock()
# Mock container for decoding audio frames
mock_decode_container = MagicMock()
mock_decode_container.decode.return_value = iter(
[MagicMock()]
) # One mock audio frame
mock_decode_container.close = MagicMock()
# Return different containers for different calls
mock_av.side_effect = [mock_container, mock_decode_container]
# Run the pipeline
await pipeline.process(upload_path)
# Verify audio extraction and writing
assert mock_audio_file_writer.push.called
assert mock_audio_file_writer.flush.called
# Verify storage upload
assert mock_storage._put_file.called
assert mock_storage._get_file_url.called
# Verify waveform generation
assert mock_waveform_processor.flush.called
assert mock_waveform_processor.set_pipeline.called
# Verify topic detection
assert mock_topic_detector.push.called
assert mock_topic_detector.flush_called
# Verify title generation
assert mock_title_processor.push.called
assert mock_title_processor.flush_called
# Verify summary generation
assert mock_summary_processor.push.called
assert mock_summary_processor.flush_called
# Verify callbacks were invoked
assert len(callback_marks["on_topic"]) > 0, "Topic callback should be invoked"
assert len(callback_marks["on_title"]) > 0, "Title callback should be invoked"
assert (
len(callback_marks["on_long_summary"]) > 0
), "Long summary callback should be invoked"
assert (
len(callback_marks["on_short_summary"]) > 0
), "Short summary callback should be invoked"
print(f"Callback marks: {callback_marks}")
# Verify the pipeline completed successfully
assert pipeline.logger is not None
print("PipelineMainFile test completed successfully!")
@pytest.mark.asyncio
async def test_pipeline_main_file_with_video(
tmpdir,
mock_transcript_in_db,
dummy_file_transcript,
dummy_file_diarization,
mock_storage,
mock_audio_file_writer,
mock_waveform_processor,
mock_topic_detector,
mock_title_processor,
mock_summary_processor,
):
"""
Test PipelineMainFile with video input (verifies audio extraction).
"""
# Create a test audio file
test_audio_path = Path(__file__).parent / "records" / "test_mathieu_hello.wav"
# Copy test audio to the transcript's data path as if it was a video upload
upload_path = mock_transcript_in_db.data_path / "upload.mp4"
upload_path.write_bytes(test_audio_path.read_bytes())
# Also create the audio.mp3 file that would be created by AudioFileWriterProcessor
mp3_path = mock_transcript_in_db.data_path / "audio.mp3"
mp3_path.write_bytes(b"mock_mp3_data")
# Create pipeline
pipeline = PipelineMainFile(transcript_id=mock_transcript_in_db.id)
# Mock av.open for video processing
with patch("reflector.pipelines.main_file_pipeline.av.open") as mock_av:
# Mock container for checking video streams
mock_container = MagicMock()
mock_container.streams.video = [MagicMock()] # Has video streams
mock_container.close = MagicMock()
# Mock container for decoding audio frames
mock_decode_container = MagicMock()
mock_decode_container.decode.return_value = iter(
[MagicMock()]
) # One mock audio frame
mock_decode_container.close = MagicMock()
# Return different containers for different calls
mock_av.side_effect = [mock_container, mock_decode_container]
# Run the pipeline
await pipeline.process(upload_path)
# Verify audio extraction from video
assert mock_audio_file_writer.push.called
assert mock_audio_file_writer.flush.called
# Verify the rest of the pipeline completed
assert mock_storage._put_file.called
assert mock_waveform_processor.flush.called
assert mock_topic_detector.push.called
assert mock_title_processor.push.called
assert mock_summary_processor.push.called
print("PipelineMainFile video test completed successfully!")
@pytest.mark.asyncio
async def test_pipeline_main_file_no_diarization(
tmpdir,
mock_transcript_in_db,
dummy_file_transcript,
mock_storage,
mock_audio_file_writer,
mock_waveform_processor,
mock_topic_detector,
mock_title_processor,
mock_summary_processor,
):
"""
Test PipelineMainFile with diarization disabled.
"""
from reflector.settings import settings
# Disable diarization
with patch.object(settings, "DIARIZATION_BACKEND", None):
# Create a test audio file
test_audio_path = Path(__file__).parent / "records" / "test_mathieu_hello.wav"
# Copy test audio to the transcript's data path
upload_path = mock_transcript_in_db.data_path / "upload.wav"
upload_path.write_bytes(test_audio_path.read_bytes())
# Also create the audio.mp3 file
mp3_path = mock_transcript_in_db.data_path / "audio.mp3"
mp3_path.write_bytes(b"mock_mp3_data")
# Create pipeline
pipeline = PipelineMainFile(transcript_id=mock_transcript_in_db.id)
# Mock av.open for audio processing
with patch("reflector.pipelines.main_file_pipeline.av.open") as mock_av:
# Mock container for checking video streams
mock_container = MagicMock()
mock_container.streams.video = [] # No video streams
mock_container.close = MagicMock()
# Mock container for decoding audio frames
mock_decode_container = MagicMock()
mock_decode_container.decode.return_value = iter([MagicMock()])
mock_decode_container.close = MagicMock()
# Return different containers for different calls
mock_av.side_effect = [mock_container, mock_decode_container]
# Run the pipeline
await pipeline.process(upload_path)
# Verify the pipeline completed without diarization
assert mock_storage._put_file.called
assert mock_waveform_processor.flush.called
assert mock_topic_detector.push.called
assert mock_title_processor.push.called
assert mock_summary_processor.push.called
print("PipelineMainFile no-diarization test completed successfully!")
@pytest.mark.asyncio
async def test_task_pipeline_file_process(
tmpdir,
mock_transcript_in_db,
dummy_file_transcript,
dummy_file_diarization,
mock_storage,
mock_audio_file_writer,
mock_waveform_processor,
mock_topic_detector,
mock_title_processor,
mock_summary_processor,
):
"""
Test the Celery task entry point for file pipeline processing.
"""
# Direct import of the underlying async function, bypassing the asynctask decorator
# Create a test audio file in the transcript's data path
test_audio_path = Path(__file__).parent / "records" / "test_mathieu_hello.wav"
upload_path = mock_transcript_in_db.data_path / "upload.wav"
upload_path.write_bytes(test_audio_path.read_bytes())
# Also create the audio.mp3 file
mp3_path = mock_transcript_in_db.data_path / "audio.mp3"
mp3_path.write_bytes(b"mock_mp3_data")
# Mock av.open for audio processing
with patch("reflector.pipelines.main_file_pipeline.av.open") as mock_av:
# Mock container for checking video streams
mock_container = MagicMock()
mock_container.streams.video = [] # No video streams
mock_container.close = MagicMock()
# Mock container for decoding audio frames
mock_decode_container = MagicMock()
mock_decode_container.decode.return_value = iter([MagicMock()])
mock_decode_container.close = MagicMock()
# Return different containers for different calls
mock_av.side_effect = [mock_container, mock_decode_container]
# Get the original async function without the asynctask decorator
# The function is wrapped, so we need to call it differently
# For now, we test the pipeline directly since the task is just a thin wrapper
from reflector.pipelines.main_file_pipeline import PipelineMainFile
pipeline = PipelineMainFile(transcript_id=mock_transcript_in_db.id)
await pipeline.process(upload_path)
# Verify the pipeline was executed through the task
assert mock_audio_file_writer.push.called
assert mock_audio_file_writer.flush.called
assert mock_storage._put_file.called
assert mock_waveform_processor.flush.called
assert mock_topic_detector.push.called
assert mock_title_processor.push.called
assert mock_summary_processor.push.called
print("task_pipeline_file_process test completed successfully!")
@pytest.mark.asyncio
async def test_pipeline_file_process_no_transcript():
"""
Test the pipeline with a non-existent transcript.
"""
from reflector.pipelines.main_file_pipeline import PipelineMainFile
# Mock the controller to return None (transcript not found)
with patch(
"reflector.pipelines.main_file_pipeline.transcripts_controller.get_by_id"
) as mock_get:
mock_get.return_value = None
pipeline = PipelineMainFile(transcript_id=str(uuid4()))
# Should raise an exception for missing transcript when get_transcript is called
with pytest.raises(Exception, match="Transcript not found"):
await pipeline.get_transcript()
@pytest.mark.asyncio
async def test_pipeline_file_process_no_audio_file(
mock_transcript_in_db,
):
"""
Test the pipeline when no audio file is found.
"""
from reflector.pipelines.main_file_pipeline import PipelineMainFile
# Don't create any audio files in the data path
# The pipeline's process should handle missing files gracefully
pipeline = PipelineMainFile(transcript_id=mock_transcript_in_db.id)
# Try to process a non-existent file
non_existent_path = mock_transcript_in_db.data_path / "nonexistent.wav"
# This should fail when trying to open the file with av
with pytest.raises(Exception):
await pipeline.process(non_existent_path)

View File

@@ -0,0 +1,265 @@
"""
Tests for Modal-based processors using pytest-recording for HTTP recording/playbook
Note: theses tests require full modal configuration to be able to record
vcr cassettes
Configuration required for the first recording:
- TRANSCRIPT_BACKEND=modal
- TRANSCRIPT_URL=https://xxxxx--reflector-transcriber-parakeet-web.modal.run
- TRANSCRIPT_MODAL_API_KEY=xxxxx
- DIARIZATION_BACKEND=modal
- DIARIZATION_URL=https://xxxxx--reflector-diarizer-web.modal.run
- DIARIZATION_MODAL_API_KEY=xxxxx
"""
from unittest.mock import patch
import pytest
from reflector.processors.file_diarization import FileDiarizationInput
from reflector.processors.file_diarization_modal import FileDiarizationModalProcessor
from reflector.processors.file_transcript import FileTranscriptInput
from reflector.processors.file_transcript_modal import FileTranscriptModalProcessor
from reflector.processors.transcript_diarization_assembler import (
TranscriptDiarizationAssemblerInput,
TranscriptDiarizationAssemblerProcessor,
)
from reflector.processors.types import DiarizationSegment, Transcript, Word
# Public test audio file hosted on S3 specifically for reflector pytests
TEST_AUDIO_URL = (
"https://reflector-github-pytest.s3.us-east-1.amazonaws.com/test_mathieu_hello.mp3"
)
@pytest.mark.asyncio
async def test_file_transcript_modal_processor_missing_url():
with patch("reflector.processors.file_transcript_modal.settings") as mock_settings:
mock_settings.TRANSCRIPT_URL = None
with pytest.raises(Exception, match="TRANSCRIPT_URL required"):
FileTranscriptModalProcessor(modal_api_key="test-api-key")
@pytest.mark.asyncio
async def test_file_diarization_modal_processor_missing_url():
with patch("reflector.processors.file_diarization_modal.settings") as mock_settings:
mock_settings.DIARIZATION_URL = None
with pytest.raises(Exception, match="DIARIZATION_URL required"):
FileDiarizationModalProcessor(modal_api_key="test-api-key")
@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_file_diarization_modal_processor(vcr):
"""Test FileDiarizationModalProcessor using public audio URL and Modal API"""
from reflector.settings import settings
processor = FileDiarizationModalProcessor(
modal_api_key=settings.DIARIZATION_MODAL_API_KEY
)
test_input = FileDiarizationInput(audio_url=TEST_AUDIO_URL)
result = await processor._diarize(test_input)
# Verify the result structure
assert result is not None
assert hasattr(result, "diarization")
assert isinstance(result.diarization, list)
# Check structure of each diarization segment
for segment in result.diarization:
assert "start" in segment
assert "end" in segment
assert "speaker" in segment
assert isinstance(segment["start"], (int, float))
assert isinstance(segment["end"], (int, float))
assert isinstance(segment["speaker"], int)
# Basic sanity check - start should be before end
assert segment["start"] < segment["end"]
@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_file_transcript_modal_processor():
"""Test FileTranscriptModalProcessor using public audio URL and Modal API"""
from reflector.settings import settings
processor = FileTranscriptModalProcessor(
modal_api_key=settings.TRANSCRIPT_MODAL_API_KEY
)
test_input = FileTranscriptInput(
audio_url=TEST_AUDIO_URL,
language="en",
)
# This will record the HTTP interaction on first run, replay on subsequent runs
result = await processor._transcript(test_input)
# Verify the result structure
assert result is not None
assert hasattr(result, "words")
assert isinstance(result.words, list)
# Check structure of each word if present
for word in result.words:
assert hasattr(word, "text")
assert hasattr(word, "start")
assert hasattr(word, "end")
assert isinstance(word.start, (int, float))
assert isinstance(word.end, (int, float))
assert isinstance(word.text, str)
# Basic sanity check - start should be before or equal to end
assert word.start <= word.end
@pytest.mark.asyncio
async def test_transcript_diarization_assembler_processor():
"""Test TranscriptDiarizationAssemblerProcessor without VCR (no HTTP requests)"""
# Create test transcript with words
words = [
Word(text="Hello", start=0.0, end=1.0, speaker=0),
Word(text=" ", start=1.0, end=1.1, speaker=0),
Word(text="world", start=1.1, end=2.0, speaker=0),
Word(text=".", start=2.0, end=2.1, speaker=0),
Word(text=" ", start=2.1, end=2.2, speaker=0),
Word(text="How", start=2.2, end=2.8, speaker=0),
Word(text=" ", start=2.8, end=2.9, speaker=0),
Word(text="are", start=2.9, end=3.2, speaker=0),
Word(text=" ", start=3.2, end=3.3, speaker=0),
Word(text="you", start=3.3, end=3.8, speaker=0),
Word(text="?", start=3.8, end=3.9, speaker=0),
]
transcript = Transcript(words=words)
# Create test diarization segments
diarization = [
DiarizationSegment(start=0.0, end=2.1, speaker=0),
DiarizationSegment(start=2.1, end=3.9, speaker=1),
]
# Create processor and test input
processor = TranscriptDiarizationAssemblerProcessor()
test_input = TranscriptDiarizationAssemblerInput(
transcript=transcript, diarization=diarization
)
# Track emitted results
emitted_results = []
async def capture_result(result):
emitted_results.append(result)
processor.on(capture_result)
# Process the input
await processor.push(test_input)
# Verify result was emitted
assert len(emitted_results) == 1
result = emitted_results[0]
# Verify result structure
assert isinstance(result, Transcript)
assert len(result.words) == len(words)
# Verify speaker assignments were applied
# Words 0-3 (indices) should be speaker 0 (time 0.0-2.0)
# Words 4-10 (indices) should be speaker 1 (time 2.1-3.9)
for i in range(4): # First 4 words (Hello, space, world, .)
assert (
result.words[i].speaker == 0
), f"Word {i} '{result.words[i].text}' should be speaker 0, got {result.words[i].speaker}"
for i in range(4, 11): # Remaining words (space, How, space, are, space, you, ?)
assert (
result.words[i].speaker == 1
), f"Word {i} '{result.words[i].text}' should be speaker 1, got {result.words[i].speaker}"
@pytest.mark.asyncio
async def test_transcript_diarization_assembler_no_diarization():
"""Test TranscriptDiarizationAssemblerProcessor with no diarization data"""
# Create test transcript
words = [Word(text="Hello", start=0.0, end=1.0, speaker=0)]
transcript = Transcript(words=words)
# Create processor and test input with empty diarization
processor = TranscriptDiarizationAssemblerProcessor()
test_input = TranscriptDiarizationAssemblerInput(
transcript=transcript, diarization=[]
)
# Track emitted results
emitted_results = []
async def capture_result(result):
emitted_results.append(result)
processor.on(capture_result)
# Process the input
await processor.push(test_input)
# Verify original transcript was returned unchanged
assert len(emitted_results) == 1
result = emitted_results[0]
assert result is transcript # Should be the same object
assert result.words[0].speaker == 0 # Original speaker unchanged
@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_full_modal_pipeline_integration(vcr):
"""Integration test: Transcription -> Diarization -> Assembly
This test demonstrates the full pipeline:
1. Run transcription via Modal
2. Run diarization via Modal
3. Assemble transcript with diarization
"""
from reflector.settings import settings
# Step 1: Transcription
transcript_processor = FileTranscriptModalProcessor(
modal_api_key=settings.TRANSCRIPT_MODAL_API_KEY
)
transcript_input = FileTranscriptInput(audio_url=TEST_AUDIO_URL, language="en")
transcript = await transcript_processor._transcript(transcript_input)
# Step 2: Diarization
diarization_processor = FileDiarizationModalProcessor(
modal_api_key=settings.DIARIZATION_MODAL_API_KEY
)
diarization_input = FileDiarizationInput(audio_url=TEST_AUDIO_URL)
diarization_result = await diarization_processor._diarize(diarization_input)
# Step 3: Assembly
assembler = TranscriptDiarizationAssemblerProcessor()
assembly_input = TranscriptDiarizationAssemblerInput(
transcript=transcript, diarization=diarization_result.diarization
)
# Track assembled result
assembled_results = []
async def capture_result(result):
assembled_results.append(result)
assembler.on(capture_result)
await assembler.push(assembly_input)
# Verify the full pipeline worked
assert len(assembled_results) == 1
final_transcript = assembled_results[0]
# Verify the final transcript has the original words with updated speaker info
assert isinstance(final_transcript, Transcript)
assert len(final_transcript.words) == len(transcript.words)
assert len(final_transcript.words) > 0
# Verify some words have been assigned speakers from diarization
speakers_found = set(word.speaker for word in final_transcript.words)
assert len(speakers_found) > 0 # At least some speaker assignments

View File

@@ -1,39 +0,0 @@
import pytest
@pytest.mark.asyncio
async def test_basic_process(
dummy_transcript,
dummy_llm,
dummy_processors,
):
# goal is to start the server, and send rtc audio to it
# validate the events received
from pathlib import Path
from reflector.settings import settings
from reflector.tools.process import process_audio_file
# LLM_BACKEND no longer exists in settings
# settings.LLM_BACKEND = "test"
settings.TRANSCRIPT_BACKEND = "whisper"
# event callback
marks = {}
async def event_callback(event):
if event.processor not in marks:
marks[event.processor] = 0
marks[event.processor] += 1
# invoke the process and capture events
path = Path(__file__).parent / "records" / "test_mathieu_hello.wav"
await process_audio_file(path.as_posix(), event_callback)
print(marks)
# validate the events
assert marks["TranscriptLinerProcessor"] == 1
assert marks["TranscriptTranslatorPassthroughProcessor"] == 1
assert marks["TranscriptTopicDetectorProcessor"] == 1
assert marks["TranscriptFinalSummaryProcessor"] == 1
assert marks["TranscriptFinalTitleProcessor"] == 1

View File

@@ -1,225 +0,0 @@
"""
Tests for Room model ICS calendar integration fields.
"""
from datetime import datetime, timezone
import pytest
from reflector.db.rooms import rooms_controller
@pytest.mark.asyncio
async def test_room_create_with_ics_fields():
"""Test creating a room with ICS calendar fields."""
room = await rooms_controller.add(
name="test-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.google.com/calendar/ical/test/private-token/basic.ics",
ics_fetch_interval=600,
ics_enabled=True,
)
assert room.name == "test-room"
assert (
room.ics_url
== "https://calendar.google.com/calendar/ical/test/private-token/basic.ics"
)
assert room.ics_fetch_interval == 600
assert room.ics_enabled is True
assert room.ics_last_sync is None
assert room.ics_last_etag is None
@pytest.mark.asyncio
async def test_room_update_ics_configuration():
"""Test updating room ICS configuration."""
# Create room without ICS
room = await rooms_controller.add(
name="update-test",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
assert room.ics_enabled is False
assert room.ics_url is None
# Update with ICS configuration
await rooms_controller.update(
room,
{
"ics_url": "https://outlook.office365.com/owa/calendar/test/calendar.ics",
"ics_fetch_interval": 300,
"ics_enabled": True,
},
)
assert (
room.ics_url == "https://outlook.office365.com/owa/calendar/test/calendar.ics"
)
assert room.ics_fetch_interval == 300
assert room.ics_enabled is True
@pytest.mark.asyncio
async def test_room_ics_sync_metadata():
"""Test updating room ICS sync metadata."""
room = await rooms_controller.add(
name="sync-test",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://example.com/calendar.ics",
ics_enabled=True,
)
# Update sync metadata
sync_time = datetime.now(timezone.utc)
await rooms_controller.update(
room,
{
"ics_last_sync": sync_time,
"ics_last_etag": "abc123hash",
},
)
assert room.ics_last_sync == sync_time
assert room.ics_last_etag == "abc123hash"
@pytest.mark.asyncio
async def test_room_get_with_ics_fields():
"""Test retrieving room with ICS fields."""
# Create room
created_room = await rooms_controller.add(
name="get-test",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="webcal://calendar.example.com/feed.ics",
ics_fetch_interval=900,
ics_enabled=True,
)
# Get by ID
room = await rooms_controller.get_by_id(created_room.id)
assert room is not None
assert room.ics_url == "webcal://calendar.example.com/feed.ics"
assert room.ics_fetch_interval == 900
assert room.ics_enabled is True
# Get by name
room = await rooms_controller.get_by_name("get-test")
assert room is not None
assert room.ics_url == "webcal://calendar.example.com/feed.ics"
assert room.ics_fetch_interval == 900
assert room.ics_enabled is True
@pytest.mark.asyncio
async def test_room_list_with_ics_enabled_filter():
"""Test listing rooms filtered by ICS enabled status."""
# Create rooms with and without ICS
room1 = await rooms_controller.add(
name="ics-enabled-1",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=True,
ics_enabled=True,
ics_url="https://calendar1.example.com/feed.ics",
)
room2 = await rooms_controller.add(
name="ics-disabled",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=True,
ics_enabled=False,
)
room3 = await rooms_controller.add(
name="ics-enabled-2",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=True,
ics_enabled=True,
ics_url="https://calendar2.example.com/feed.ics",
)
# Get all rooms
all_rooms = await rooms_controller.get_all()
assert len(all_rooms) == 3
# Filter for ICS-enabled rooms (would need to implement this in controller)
ics_rooms = [r for r in all_rooms if r["ics_enabled"]]
assert len(ics_rooms) == 2
assert all(r["ics_enabled"] for r in ics_rooms)
@pytest.mark.asyncio
async def test_room_default_ics_values():
"""Test that ICS fields have correct default values."""
room = await rooms_controller.add(
name="default-test",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
# Don't specify ICS fields
)
assert room.ics_url is None
assert room.ics_fetch_interval == 300 # Default 5 minutes
assert room.ics_enabled is False
assert room.ics_last_sync is None
assert room.ics_last_etag is None

View File

@@ -1,385 +0,0 @@
from datetime import datetime, timedelta, timezone
from unittest.mock import AsyncMock, patch
import pytest
from icalendar import Calendar, Event
from reflector.db.calendar_events import CalendarEvent, calendar_events_controller
from reflector.db.rooms import rooms_controller
@pytest.fixture
async def authenticated_client(client):
from reflector.app import app
from reflector.auth import current_user_optional
app.dependency_overrides[current_user_optional] = lambda: {
"sub": "test-user",
"email": "test@example.com",
}
yield client
del app.dependency_overrides[current_user_optional]
@pytest.mark.asyncio
async def test_create_room_with_ics_fields(authenticated_client):
client = authenticated_client
response = await client.post(
"/rooms",
json={
"name": "test-ics-room",
"zulip_auto_post": False,
"zulip_stream": "",
"zulip_topic": "",
"is_locked": False,
"room_mode": "normal",
"recording_type": "cloud",
"recording_trigger": "automatic-2nd-participant",
"is_shared": False,
"ics_url": "https://calendar.example.com/test.ics",
"ics_fetch_interval": 600,
"ics_enabled": True,
},
)
assert response.status_code == 200
data = response.json()
assert data["name"] == "test-ics-room"
assert data["ics_url"] == "https://calendar.example.com/test.ics"
assert data["ics_fetch_interval"] == 600
assert data["ics_enabled"] is True
@pytest.mark.asyncio
async def test_update_room_ics_configuration(authenticated_client):
client = authenticated_client
response = await client.post(
"/rooms",
json={
"name": "update-ics-room",
"zulip_auto_post": False,
"zulip_stream": "",
"zulip_topic": "",
"is_locked": False,
"room_mode": "normal",
"recording_type": "cloud",
"recording_trigger": "automatic-2nd-participant",
"is_shared": False,
},
)
assert response.status_code == 200
room_id = response.json()["id"]
response = await client.patch(
f"/rooms/{room_id}",
json={
"ics_url": "https://calendar.google.com/updated.ics",
"ics_fetch_interval": 300,
"ics_enabled": True,
},
)
assert response.status_code == 200
data = response.json()
assert data["ics_url"] == "https://calendar.google.com/updated.ics"
assert data["ics_fetch_interval"] == 300
assert data["ics_enabled"] is True
@pytest.mark.asyncio
async def test_trigger_ics_sync(authenticated_client):
client = authenticated_client
room = await rooms_controller.add(
name="sync-api-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/api.ics",
ics_enabled=True,
)
cal = Calendar()
event = Event()
event.add("uid", "api-test-event")
event.add("summary", "API Test Meeting")
from reflector.settings import settings
event.add("location", f"{settings.BASE_URL}/room/{room.name}")
now = datetime.now(timezone.utc)
event.add("dtstart", now + timedelta(hours=1))
event.add("dtend", now + timedelta(hours=2))
cal.add_component(event)
ics_content = cal.to_ical().decode("utf-8")
with patch(
"reflector.services.ics_sync.ICSFetchService.fetch_ics", new_callable=AsyncMock
) as mock_fetch:
mock_fetch.return_value = ics_content
response = await client.post(f"/rooms/{room.name}/ics/sync")
assert response.status_code == 200
data = response.json()
assert data["status"] == "success"
assert data["events_found"] == 1
assert data["events_created"] == 1
@pytest.mark.asyncio
async def test_trigger_ics_sync_unauthorized(client):
room = await rooms_controller.add(
name="sync-unauth-room",
user_id="owner-123",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/api.ics",
ics_enabled=True,
)
response = await client.post(f"/rooms/{room.name}/ics/sync")
assert response.status_code == 403
assert "Only room owner can trigger ICS sync" in response.json()["detail"]
@pytest.mark.asyncio
async def test_trigger_ics_sync_not_configured(authenticated_client):
client = authenticated_client
room = await rooms_controller.add(
name="sync-not-configured",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_enabled=False,
)
response = await client.post(f"/rooms/{room.name}/ics/sync")
assert response.status_code == 400
assert "ICS not configured" in response.json()["detail"]
@pytest.mark.asyncio
async def test_get_ics_status(authenticated_client):
client = authenticated_client
room = await rooms_controller.add(
name="status-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/status.ics",
ics_enabled=True,
ics_fetch_interval=300,
)
now = datetime.now(timezone.utc)
await rooms_controller.update(
room,
{"ics_last_sync": now, "ics_last_etag": "test-etag"},
)
response = await client.get(f"/rooms/{room.name}/ics/status")
assert response.status_code == 200
data = response.json()
assert data["status"] == "enabled"
assert data["last_etag"] == "test-etag"
assert data["events_count"] == 0
@pytest.mark.asyncio
async def test_get_ics_status_unauthorized(client):
room = await rooms_controller.add(
name="status-unauth",
user_id="owner-456",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
ics_url="https://calendar.example.com/status.ics",
ics_enabled=True,
)
response = await client.get(f"/rooms/{room.name}/ics/status")
assert response.status_code == 403
assert "Only room owner can view ICS status" in response.json()["detail"]
@pytest.mark.asyncio
async def test_list_room_meetings(authenticated_client):
client = authenticated_client
room = await rooms_controller.add(
name="meetings-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
event1 = CalendarEvent(
room_id=room.id,
ics_uid="meeting-1",
title="Past Meeting",
start_time=now - timedelta(hours=2),
end_time=now - timedelta(hours=1),
)
await calendar_events_controller.upsert(event1)
event2 = CalendarEvent(
room_id=room.id,
ics_uid="meeting-2",
title="Future Meeting",
description="Team sync",
start_time=now + timedelta(hours=1),
end_time=now + timedelta(hours=2),
attendees=[{"email": "test@example.com"}],
)
await calendar_events_controller.upsert(event2)
response = await client.get(f"/rooms/{room.name}/meetings")
assert response.status_code == 200
data = response.json()
assert len(data) == 2
assert data[0]["title"] == "Past Meeting"
assert data[1]["title"] == "Future Meeting"
assert data[1]["description"] == "Team sync"
assert data[1]["attendees"] == [{"email": "test@example.com"}]
@pytest.mark.asyncio
async def test_list_room_meetings_non_owner(client):
room = await rooms_controller.add(
name="meetings-privacy",
user_id="owner-789",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
event = CalendarEvent(
room_id=room.id,
ics_uid="private-meeting",
title="Meeting Title",
description="Sensitive info",
start_time=datetime.now(timezone.utc) + timedelta(hours=1),
end_time=datetime.now(timezone.utc) + timedelta(hours=2),
attendees=[{"email": "private@example.com"}],
)
await calendar_events_controller.upsert(event)
response = await client.get(f"/rooms/{room.name}/meetings")
assert response.status_code == 200
data = response.json()
assert len(data) == 1
assert data[0]["title"] == "Meeting Title"
assert data[0]["description"] is None
assert data[0]["attendees"] is None
@pytest.mark.asyncio
async def test_list_upcoming_meetings(authenticated_client):
client = authenticated_client
room = await rooms_controller.add(
name="upcoming-room",
user_id="test-user",
zulip_auto_post=False,
zulip_stream="",
zulip_topic="",
is_locked=False,
room_mode="normal",
recording_type="cloud",
recording_trigger="automatic-2nd-participant",
is_shared=False,
)
now = datetime.now(timezone.utc)
past_event = CalendarEvent(
room_id=room.id,
ics_uid="past",
title="Past",
start_time=now - timedelta(hours=1),
end_time=now - timedelta(minutes=30),
)
await calendar_events_controller.upsert(past_event)
soon_event = CalendarEvent(
room_id=room.id,
ics_uid="soon",
title="Soon",
start_time=now + timedelta(minutes=15),
end_time=now + timedelta(minutes=45),
)
await calendar_events_controller.upsert(soon_event)
later_event = CalendarEvent(
room_id=room.id,
ics_uid="later",
title="Later",
start_time=now + timedelta(hours=2),
end_time=now + timedelta(hours=3),
)
await calendar_events_controller.upsert(later_event)
response = await client.get(f"/rooms/{room.name}/meetings/upcoming")
assert response.status_code == 200
data = response.json()
assert len(data) == 1
assert data[0]["title"] == "Soon"
response = await client.get(
f"/rooms/{room.name}/meetings/upcoming", params={"minutes_ahead": 180}
)
assert response.status_code == 200
data = response.json()
assert len(data) == 2
assert data[0]["title"] == "Soon"
assert data[1]["title"] == "Later"
@pytest.mark.asyncio
async def test_room_not_found_endpoints(client):
response = await client.post("/rooms/nonexistent/ics/sync")
assert response.status_code == 404
response = await client.get("/rooms/nonexistent/ics/status")
assert response.status_code == 404
response = await client.get("/rooms/nonexistent/meetings")
assert response.status_code == 404
response = await client.get("/rooms/nonexistent/meetings/upcoming")
assert response.status_code == 404

View File

@@ -2,13 +2,18 @@
import json
from datetime import datetime, timezone
from unittest.mock import AsyncMock, patch
import pytest
from pydantic import ValidationError
from reflector.db import get_database
from reflector.db.search import SearchParameters, search_controller
from reflector.db.transcripts import transcripts
from reflector.db.search import (
SearchController,
SearchParameters,
SearchResult,
search_controller,
)
from reflector.db.transcripts import SourceKind, transcripts
@pytest.mark.asyncio
@@ -18,39 +23,137 @@ async def test_search_postgresql_only():
assert results == []
assert total == 0
try:
SearchParameters(query_text="")
assert False, "Should have raised validation error"
except ValidationError:
pass # Expected
# Test that whitespace query raises validation error
try:
SearchParameters(query_text=" ")
assert False, "Should have raised validation error"
except ValidationError:
pass # Expected
params_empty = SearchParameters(query_text="")
results_empty, total_empty = await search_controller.search_transcripts(
params_empty
)
assert isinstance(results_empty, list)
assert isinstance(total_empty, int)
@pytest.mark.asyncio
async def test_search_input_validation():
try:
SearchParameters(query_text="")
assert False, "Should have raised ValidationError"
except ValidationError:
pass # Expected
async def test_search_with_empty_query():
"""Test that empty query returns all transcripts."""
params = SearchParameters(query_text="")
results, total = await search_controller.search_transcripts(params)
assert isinstance(results, list)
assert isinstance(total, int)
if len(results) > 1:
for i in range(len(results) - 1):
assert results[i].created_at >= results[i + 1].created_at
@pytest.mark.asyncio
async def test_empty_transcript_title_only_match():
"""Test that transcripts with title-only matches return empty snippets."""
test_id = "test-empty-9b3f2a8d"
# Test that whitespace query raises validation error
try:
SearchParameters(query_text=" \t\n ")
assert False, "Should have raised ValidationError"
except ValidationError:
pass # Expected
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
test_data = {
"id": test_id,
"name": "Empty Transcript",
"title": "Empty Meeting",
"status": "completed",
"locked": False,
"duration": 0.0,
"created_at": datetime.now(timezone.utc),
"short_summary": None,
"long_summary": None,
"topics": json.dumps([]),
"events": json.dumps([]),
"participants": json.dumps([]),
"source_language": "en",
"target_language": "en",
"reviewed": False,
"audio_location": "local",
"share_mode": "private",
"source_kind": "room",
"webvtt": None,
"user_id": "test-user-1",
}
await get_database().execute(transcripts.insert().values(**test_data))
params = SearchParameters(query_text="empty", user_id="test-user-1")
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = next((r for r in results if r.id == test_id), None)
assert found is not None, "Should find transcript by title match"
assert found.search_snippets == []
assert found.total_match_count == 0
finally:
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
await get_database().disconnect()
@pytest.mark.asyncio
async def test_search_with_long_summary():
"""Test that long_summary content is searchable."""
test_id = "test-long-summary-8a9f3c2d"
try:
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
test_data = {
"id": test_id,
"name": "Test Long Summary",
"title": "Regular Meeting",
"status": "completed",
"locked": False,
"duration": 1800.0,
"created_at": datetime.now(timezone.utc),
"short_summary": "Brief overview",
"long_summary": "Detailed discussion about quantum computing applications and blockchain technology integration",
"topics": json.dumps([]),
"events": json.dumps([]),
"participants": json.dumps([]),
"source_language": "en",
"target_language": "en",
"reviewed": False,
"audio_location": "local",
"share_mode": "private",
"source_kind": "room",
"webvtt": """WEBVTT
00:00:00.000 --> 00:00:10.000
Basic meeting content without special keywords.""",
"user_id": "test-user-2",
}
await get_database().execute(transcripts.insert().values(**test_data))
params = SearchParameters(query_text="quantum computing", user_id="test-user-2")
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = any(r.id == test_id for r in results)
assert found, "Should find transcript by long_summary content"
test_result = next((r for r in results if r.id == test_id), None)
assert test_result
assert len(test_result.search_snippets) > 0
assert "quantum computing" in test_result.search_snippets[0].lower()
finally:
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
await get_database().disconnect()
@pytest.mark.asyncio
async def test_postgresql_search_with_data():
# collision is improbable
test_id = "test-search-e2e-7f3a9b2c"
try:
@@ -90,32 +193,31 @@ The search feature should support complex queries with ranking.
00:00:30.000 --> 00:00:40.000
We need to implement PostgreSQL tsvector for better performance.""",
"user_id": "test-user-3",
}
await get_database().execute(transcripts.insert().values(**test_data))
# Test 1: Search for a word in title
params = SearchParameters(query_text="planning")
params = SearchParameters(query_text="planning", user_id="test-user-3")
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = any(r.id == test_id for r in results)
assert found, "Should find test transcript by title word"
# Test 2: Search for a word in webvtt content
params = SearchParameters(query_text="tsvector")
params = SearchParameters(query_text="tsvector", user_id="test-user-3")
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = any(r.id == test_id for r in results)
assert found, "Should find test transcript by webvtt content"
# Test 3: Search with multiple words
params = SearchParameters(query_text="engineering planning")
params = SearchParameters(
query_text="engineering planning", user_id="test-user-3"
)
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = any(r.id == test_id for r in results)
assert found, "Should find test transcript by multiple words"
# Test 4: Verify SearchResult structure
test_result = next((r for r in results if r.id == test_id), None)
if test_result:
assert test_result.title == "Engineering Planning Meeting Q4 2024"
@@ -123,15 +225,17 @@ We need to implement PostgreSQL tsvector for better performance.""",
assert test_result.duration == 1800.0
assert 0 <= test_result.rank <= 1, "Rank should be normalized to 0-1"
# Test 5: Search with OR operator
params = SearchParameters(query_text="tsvector OR nosuchword")
params = SearchParameters(
query_text="tsvector OR nosuchword", user_id="test-user-3"
)
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = any(r.id == test_id for r in results)
assert found, "Should find test transcript with OR query"
# Test 6: Quoted phrase search
params = SearchParameters(query_text='"full-text search"')
params = SearchParameters(
query_text='"full-text search"', user_id="test-user-3"
)
results, total = await search_controller.search_transcripts(params)
assert total >= 1
found = any(r.id == test_id for r in results)
@@ -142,3 +246,240 @@ We need to implement PostgreSQL tsvector for better performance.""",
transcripts.delete().where(transcripts.c.id == test_id)
)
await get_database().disconnect()
@pytest.fixture
def sample_search_params():
"""Create sample search parameters for testing."""
return SearchParameters(
query_text="test query",
limit=20,
offset=0,
user_id="test-user",
room_id="room1",
)
@pytest.fixture
def mock_db_result():
"""Create a mock database result."""
return {
"id": "test-transcript-id",
"title": "Test Transcript",
"created_at": datetime(2024, 6, 15, tzinfo=timezone.utc),
"duration": 3600.0,
"status": "completed",
"user_id": "test-user",
"room_id": "room1",
"source_kind": SourceKind.LIVE,
"webvtt": "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\nThis is a test transcript",
"rank": 0.95,
}
class TestSearchParameters:
"""Test SearchParameters model validation and functionality."""
def test_search_parameters_with_available_filters(self):
"""Test creating SearchParameters with currently available filter options."""
params = SearchParameters(
query_text="search term",
limit=50,
offset=10,
user_id="user123",
room_id="room1",
)
assert params.query_text == "search term"
assert params.limit == 50
assert params.offset == 10
assert params.user_id == "user123"
assert params.room_id == "room1"
def test_search_parameters_defaults(self):
"""Test SearchParameters with default values."""
params = SearchParameters(query_text="test")
assert params.query_text == "test"
assert params.limit == 20
assert params.offset == 0
assert params.user_id is None
assert params.room_id is None
class TestSearchControllerFilters:
"""Test SearchController functionality with various filters."""
@pytest.mark.asyncio
async def test_search_with_source_kind_filter(self):
"""Test search filtering by source_kind."""
controller = SearchController()
with (
patch("reflector.db.search.is_postgresql", return_value=True),
patch("reflector.db.search.get_database") as mock_db,
):
mock_db.return_value.fetch_all = AsyncMock(return_value=[])
mock_db.return_value.fetch_val = AsyncMock(return_value=0)
params = SearchParameters(query_text="test", source_kind=SourceKind.LIVE)
results, total = await controller.search_transcripts(params)
assert results == []
assert total == 0
mock_db.return_value.fetch_all.assert_called_once()
@pytest.mark.asyncio
async def test_search_with_single_room_id(self):
"""Test search filtering by single room ID (currently supported)."""
controller = SearchController()
with (
patch("reflector.db.search.is_postgresql", return_value=True),
patch("reflector.db.search.get_database") as mock_db,
):
mock_db.return_value.fetch_all = AsyncMock(return_value=[])
mock_db.return_value.fetch_val = AsyncMock(return_value=0)
params = SearchParameters(
query_text="test",
room_id="room1",
)
results, total = await controller.search_transcripts(params)
assert results == []
assert total == 0
mock_db.return_value.fetch_all.assert_called_once()
@pytest.mark.asyncio
async def test_search_result_includes_available_fields(self, mock_db_result):
"""Test that search results include available fields like source_kind."""
controller = SearchController()
with (
patch("reflector.db.search.is_postgresql", return_value=True),
patch("reflector.db.search.get_database") as mock_db,
):
class MockRow:
def __init__(self, data):
self._data = data
self._mapping = data
def __iter__(self):
return iter(self._data.items())
def __getitem__(self, key):
return self._data[key]
def keys(self):
return self._data.keys()
mock_row = MockRow(mock_db_result)
mock_db.return_value.fetch_all = AsyncMock(return_value=[mock_row])
mock_db.return_value.fetch_val = AsyncMock(return_value=1)
params = SearchParameters(query_text="test")
results, total = await controller.search_transcripts(params)
assert total == 1
assert len(results) == 1
result = results[0]
assert isinstance(result, SearchResult)
assert result.id == "test-transcript-id"
assert result.title == "Test Transcript"
assert result.rank == 0.95
class TestSearchEndpointParsing:
"""Test parameter parsing in the search endpoint."""
def test_parse_comma_separated_room_ids(self):
"""Test parsing comma-separated room IDs."""
room_ids_str = "room1,room2,room3"
parsed = [rid.strip() for rid in room_ids_str.split(",") if rid.strip()]
assert parsed == ["room1", "room2", "room3"]
room_ids_str = "room1, room2 , room3"
parsed = [rid.strip() for rid in room_ids_str.split(",") if rid.strip()]
assert parsed == ["room1", "room2", "room3"]
room_ids_str = "room1,,room3,"
parsed = [rid.strip() for rid in room_ids_str.split(",") if rid.strip()]
assert parsed == ["room1", "room3"]
def test_parse_source_kind(self):
"""Test parsing source_kind values."""
for kind_str in ["live", "file", "room"]:
parsed = SourceKind(kind_str)
assert parsed == SourceKind(kind_str)
with pytest.raises(ValueError):
SourceKind("invalid_kind")
class TestSearchResultModel:
"""Test SearchResult model and serialization."""
def test_search_result_with_available_fields(self):
"""Test SearchResult model with currently available fields populated."""
result = SearchResult(
id="test-id",
title="Test Title",
user_id="user-123",
room_id="room-456",
source_kind=SourceKind.ROOM,
created_at=datetime(2024, 6, 15, tzinfo=timezone.utc),
status="completed",
rank=0.85,
duration=1800.5,
search_snippets=["snippet 1", "snippet 2"],
)
assert result.id == "test-id"
assert result.title == "Test Title"
assert result.user_id == "user-123"
assert result.room_id == "room-456"
assert result.status == "completed"
assert result.rank == 0.85
assert result.duration == 1800.5
assert len(result.search_snippets) == 2
def test_search_result_with_optional_fields_none(self):
"""Test SearchResult model with optional fields as None."""
result = SearchResult(
id="test-id",
source_kind=SourceKind.FILE,
created_at=datetime.now(timezone.utc),
status="processing",
rank=0.5,
search_snippets=[],
title=None,
user_id=None,
room_id=None,
duration=None,
)
assert result.title is None
assert result.user_id is None
assert result.room_id is None
assert result.duration is None
def test_search_result_datetime_field(self):
"""Test that SearchResult accepts datetime field."""
result = SearchResult(
id="test-id",
source_kind=SourceKind.LIVE,
created_at=datetime(2024, 6, 15, 12, 30, 45, tzinfo=timezone.utc),
status="completed",
rank=0.9,
duration=None,
search_snippets=[],
)
assert result.created_at == datetime(
2024, 6, 15, 12, 30, 45, tzinfo=timezone.utc
)

View File

@@ -0,0 +1,166 @@
"""Tests for long_summary in search functionality."""
import json
from datetime import datetime, timezone
import pytest
from reflector.db import get_database
from reflector.db.search import SearchParameters, search_controller
from reflector.db.transcripts import transcripts
@pytest.mark.asyncio
async def test_long_summary_snippet_prioritization():
"""Test that snippets from long_summary are prioritized over webvtt content."""
test_id = "test-snippet-priority-3f9a2b8c"
try:
# Clean up any existing test data
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
test_data = {
"id": test_id,
"name": "Test Snippet Priority",
"title": "Meeting About Projects",
"status": "completed",
"locked": False,
"duration": 1800.0,
"created_at": datetime.now(timezone.utc),
"short_summary": "Project discussion",
"long_summary": (
"The team discussed advanced robotics applications including "
"autonomous navigation systems and sensor fusion techniques. "
"Robotics development will focus on real-time processing."
),
"topics": json.dumps([]),
"events": json.dumps([]),
"participants": json.dumps([]),
"source_language": "en",
"target_language": "en",
"reviewed": False,
"audio_location": "local",
"share_mode": "private",
"source_kind": "room",
"webvtt": """WEBVTT
00:00:00.000 --> 00:00:10.000
We talked about many different topics today.
00:00:10.000 --> 00:00:20.000
The robotics project is making good progress.
00:00:20.000 --> 00:00:30.000
We need to consider various implementation approaches.""",
"user_id": "test-user-priority",
}
await get_database().execute(transcripts.insert().values(**test_data))
# Search for "robotics" which appears in both long_summary and webvtt
params = SearchParameters(query_text="robotics", user_id="test-user-priority")
results, total = await search_controller.search_transcripts(params)
assert total >= 1
test_result = next((r for r in results if r.id == test_id), None)
assert test_result, "Should find the test transcript"
snippets = test_result.search_snippets
assert len(snippets) > 0, "Should have at least one snippet"
# The first snippets should be from long_summary (more detailed content)
first_snippet = snippets[0].lower()
assert (
"advanced robotics" in first_snippet or "autonomous" in first_snippet
), f"First snippet should be from long_summary with detailed content. Got: {snippets[0]}"
# With max 3 snippets, we should get both from long_summary and webvtt
assert len(snippets) <= 3, "Should respect max snippets limit"
# All snippets should contain the search term
for snippet in snippets:
assert (
"robotics" in snippet.lower()
), f"Snippet should contain search term: {snippet}"
finally:
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
await get_database().disconnect()
@pytest.mark.asyncio
async def test_long_summary_only_search():
"""Test searching for content that only exists in long_summary."""
test_id = "test-long-only-8b3c9f2a"
try:
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
test_data = {
"id": test_id,
"name": "Test Long Only",
"title": "Standard Meeting",
"status": "completed",
"locked": False,
"duration": 1800.0,
"created_at": datetime.now(timezone.utc),
"short_summary": "Team sync",
"long_summary": (
"Detailed analysis of cryptocurrency market trends and "
"decentralized finance protocols. Discussion included "
"yield farming strategies and liquidity pool mechanics."
),
"topics": json.dumps([]),
"events": json.dumps([]),
"participants": json.dumps([]),
"source_language": "en",
"target_language": "en",
"reviewed": False,
"audio_location": "local",
"share_mode": "private",
"source_kind": "room",
"webvtt": """WEBVTT
00:00:00.000 --> 00:00:10.000
Team meeting about general project updates.
00:00:10.000 --> 00:00:20.000
Discussion of timeline and deliverables.""",
"user_id": "test-user-long",
}
await get_database().execute(transcripts.insert().values(**test_data))
# Search for terms only in long_summary
params = SearchParameters(query_text="cryptocurrency", user_id="test-user-long")
results, total = await search_controller.search_transcripts(params)
found = any(r.id == test_id for r in results)
assert found, "Should find transcript by long_summary-only content"
test_result = next((r for r in results if r.id == test_id), None)
assert test_result
assert len(test_result.search_snippets) > 0
# Verify the snippet is about cryptocurrency
snippet = test_result.search_snippets[0].lower()
assert "cryptocurrency" in snippet, "Snippet should contain the search term"
# Search for "yield farming" - a more specific term
params2 = SearchParameters(query_text="yield farming", user_id="test-user-long")
results2, total2 = await search_controller.search_transcripts(params2)
found2 = any(r.id == test_id for r in results2)
assert found2, "Should find transcript by specific long_summary phrase"
finally:
await get_database().execute(
transcripts.delete().where(transcripts.c.id == test_id)
)
await get_database().disconnect()

View File

@@ -1,6 +1,10 @@
"""Unit tests for search snippet generation."""
from reflector.db.search import SearchController
from reflector.db.search import (
SnippetCandidate,
SnippetGenerator,
WebVTTProcessor,
)
class TestExtractWebVTT:
@@ -16,7 +20,7 @@ class TestExtractWebVTT:
00:00:10.000 --> 00:00:20.000
<v Speaker1>Indeed it is a test of WebVTT parsing.
"""
result = SearchController._extract_webvtt_text(webvtt)
result = WebVTTProcessor.extract_text(webvtt)
assert "Hello world, this is a test" in result
assert "Indeed it is a test" in result
assert "<v Speaker" not in result
@@ -25,12 +29,11 @@ class TestExtractWebVTT:
def test_extract_empty_webvtt(self):
"""Test empty WebVTT returns empty string."""
assert SearchController._extract_webvtt_text("") == ""
assert SearchController._extract_webvtt_text(None) == ""
assert WebVTTProcessor.extract_text("") == ""
def test_extract_malformed_webvtt(self):
"""Test malformed WebVTT returns empty string."""
result = SearchController._extract_webvtt_text("Not a valid WebVTT")
result = WebVTTProcessor.extract_text("Not a valid WebVTT")
assert result == ""
@@ -39,8 +42,7 @@ class TestGenerateSnippets:
def test_multiple_matches(self):
"""Test finding multiple occurrences of search term in long text."""
# Create text with Python mentions far apart to get separate snippets
separator = " This is filler text. " * 20 # ~400 chars of padding
separator = " This is filler text. " * 20
text = (
"Python is great for machine learning."
+ separator
@@ -51,18 +53,16 @@ class TestGenerateSnippets:
+ "The Python community is very supportive."
)
snippets = SearchController._generate_snippets(text, "Python")
# With enough separation, we should get multiple snippets
assert len(snippets) >= 2 # At least 2 distinct snippets
snippets = SnippetGenerator.generate(text, "Python")
assert len(snippets) >= 2
# Each snippet should contain "Python"
for snippet in snippets:
assert "python" in snippet.lower()
def test_single_match(self):
"""Test single occurrence returns one snippet."""
text = "This document discusses artificial intelligence and its applications."
snippets = SearchController._generate_snippets(text, "artificial intelligence")
snippets = SnippetGenerator.generate(text, "artificial intelligence")
assert len(snippets) == 1
assert "artificial intelligence" in snippets[0].lower()
@@ -70,24 +70,22 @@ class TestGenerateSnippets:
def test_no_matches(self):
"""Test no matches returns empty list."""
text = "This is some random text without the search term."
snippets = SearchController._generate_snippets(text, "machine learning")
snippets = SnippetGenerator.generate(text, "machine learning")
assert snippets == []
def test_case_insensitive_search(self):
"""Test search is case insensitive."""
# Add enough text between matches to get separate snippets
text = (
"MACHINE LEARNING is important for modern applications. "
+ "It requires lots of data and computational resources. " * 5 # Padding
+ "It requires lots of data and computational resources. " * 5
+ "Machine Learning rocks and transforms industries. "
+ "Deep learning is a subset of it. " * 5 # More padding
+ "Deep learning is a subset of it. " * 5
+ "Finally, machine learning will shape our future."
)
snippets = SearchController._generate_snippets(text, "machine learning")
snippets = SnippetGenerator.generate(text, "machine learning")
# Should find at least 2 (might be 3 if text is long enough)
assert len(snippets) >= 2
for snippet in snippets:
assert "machine learning" in snippet.lower()
@@ -95,61 +93,55 @@ class TestGenerateSnippets:
def test_partial_match_fallback(self):
"""Test fallback to first word when exact phrase not found."""
text = "We use machine intelligence for processing."
snippets = SearchController._generate_snippets(text, "machine learning")
snippets = SnippetGenerator.generate(text, "machine learning")
# Should fall back to finding "machine"
assert len(snippets) == 1
assert "machine" in snippets[0].lower()
def test_snippet_ellipsis(self):
"""Test ellipsis added for truncated snippets."""
# Long text where match is in the middle
text = "a " * 100 + "TARGET_WORD special content here" + " b" * 100
snippets = SearchController._generate_snippets(text, "TARGET_WORD")
snippets = SnippetGenerator.generate(text, "TARGET_WORD")
assert len(snippets) == 1
assert "..." in snippets[0] # Should have ellipsis
assert "..." in snippets[0]
assert "TARGET_WORD" in snippets[0]
def test_overlapping_snippets_deduplicated(self):
"""Test overlapping matches don't create duplicate snippets."""
text = "test test test word" * 10 # Repeated pattern
snippets = SearchController._generate_snippets(text, "test")
text = "test test test word" * 10
snippets = SnippetGenerator.generate(text, "test")
# Should get unique snippets, not duplicates
assert len(snippets) <= 3
assert len(snippets) == len(set(snippets)) # All unique
assert len(snippets) == len(set(snippets))
def test_empty_inputs(self):
"""Test empty text or search term returns empty list."""
assert SearchController._generate_snippets("", "search") == []
assert SearchController._generate_snippets("text", "") == []
assert SearchController._generate_snippets("", "") == []
assert SnippetGenerator.generate("", "search") == []
assert SnippetGenerator.generate("text", "") == []
assert SnippetGenerator.generate("", "") == []
def test_max_snippets_limit(self):
"""Test respects max_snippets parameter."""
# Create text with well-separated occurrences
separator = " filler " * 50 # Ensure snippets don't overlap
text = ("Python is amazing" + separator) * 10 # 10 occurrences
separator = " filler " * 50
text = ("Python is amazing" + separator) * 10
# Test with different limits
snippets_1 = SearchController._generate_snippets(text, "Python", max_snippets=1)
snippets_1 = SnippetGenerator.generate(text, "Python", max_snippets=1)
assert len(snippets_1) == 1
snippets_2 = SearchController._generate_snippets(text, "Python", max_snippets=2)
snippets_2 = SnippetGenerator.generate(text, "Python", max_snippets=2)
assert len(snippets_2) == 2
snippets_5 = SearchController._generate_snippets(text, "Python", max_snippets=5)
assert len(snippets_5) == 5 # Should get exactly 5 with enough separation
snippets_5 = SnippetGenerator.generate(text, "Python", max_snippets=5)
assert len(snippets_5) == 5
def test_snippet_length(self):
"""Test snippet length is reasonable."""
text = "word " * 200 # Long text
snippets = SearchController._generate_snippets(text, "word")
text = "word " * 200
snippets = SnippetGenerator.generate(text, "word")
for snippet in snippets:
# Default max_length is 150 + some context
assert len(snippet) <= 200 # Some buffer for ellipsis
assert len(snippet) <= 200
class TestFullPipeline:
@@ -157,7 +149,6 @@ class TestFullPipeline:
def test_webvtt_to_snippets_integration(self):
"""Test full pipeline from WebVTT to search snippets."""
# Create WebVTT with well-separated content for multiple snippets
webvtt = (
"""WEBVTT
@@ -182,17 +173,362 @@ class TestFullPipeline:
"""
)
# Extract and generate snippets
plain_text = SearchController._extract_webvtt_text(webvtt)
snippets = SearchController._generate_snippets(plain_text, "machine learning")
plain_text = WebVTTProcessor.extract_text(webvtt)
snippets = SnippetGenerator.generate(plain_text, "machine learning")
# Should find at least 2 snippets (text might still be close together)
assert len(snippets) >= 1 # At minimum one snippet containing matches
assert len(snippets) <= 3 # At most 3 by default
assert len(snippets) >= 1
assert len(snippets) <= 3
# No WebVTT artifacts in snippets
for snippet in snippets:
assert "machine learning" in snippet.lower()
assert "<v Speaker" not in snippet
assert "00:00" not in snippet
assert "-->" not in snippet
class TestMultiWordQueryBehavior:
"""Tests for multi-word query behavior and exact phrase matching."""
def test_multi_word_query_snippet_behavior(self):
"""Test that multi-word queries generate snippets based on exact phrase matching."""
sample_text = """This is a sample transcript where user Alice is talking.
Later in the conversation, jordan mentions something important.
The user jordan collaboration was successful.
Another user named Bob joins the discussion."""
user_snippets = SnippetGenerator.generate(sample_text, "user")
assert len(user_snippets) == 2, "Should find 2 snippets for 'user'"
jordan_snippets = SnippetGenerator.generate(sample_text, "jordan")
assert len(jordan_snippets) >= 1, "Should find at least 1 snippet for 'jordan'"
multi_word_snippets = SnippetGenerator.generate(sample_text, "user jordan")
assert len(multi_word_snippets) == 1, (
"Should return exactly 1 snippet for 'user jordan' "
"(only the exact phrase match, not individual word occurrences)"
)
snippet = multi_word_snippets[0]
assert (
"user jordan" in snippet.lower()
), "The snippet should contain the exact phrase 'user jordan'"
assert (
"alice" not in snippet.lower()
), "The snippet should not include the first standalone 'user' with Alice"
def test_multi_word_query_without_exact_match(self):
"""Test snippet generation when exact phrase is not found."""
sample_text = """User Alice is here. Bob and jordan are talking.
Later jordan mentions something. The user is happy."""
snippets = SnippetGenerator.generate(sample_text, "user jordan")
assert (
len(snippets) >= 1
), "Should find at least 1 snippet when falling back to first word"
all_snippets_text = " ".join(snippets).lower()
assert (
"user" in all_snippets_text
), "Snippets should contain 'user' (the first word)"
def test_exact_phrase_at_text_boundaries(self):
"""Test snippet generation when exact phrase appears at text boundaries."""
text_start = "user jordan started the meeting. Other content here."
snippets = SnippetGenerator.generate(text_start, "user jordan")
assert len(snippets) == 1
assert "user jordan" in snippets[0].lower()
text_end = "Other content here. The meeting ended with user jordan"
snippets = SnippetGenerator.generate(text_end, "user jordan")
assert len(snippets) == 1
assert "user jordan" in snippets[0].lower()
def test_multi_word_query_matches_words_appearing_separately_and_together(self):
"""Test that multi-word queries prioritize exact phrase matches over individual word occurrences."""
sample_text = """This is a sample transcript where user Alice is talking.
Later in the conversation, jordan mentions something important.
The user jordan collaboration was successful.
Another user named Bob joins the discussion."""
search_query = "user jordan"
snippets = SnippetGenerator.generate(sample_text, search_query)
assert len(snippets) == 1, (
f"Expected exactly 1 snippet for '{search_query}' when exact phrase exists, "
f"got {len(snippets)}. Should ignore individual word occurrences."
)
snippet = snippets[0]
assert (
search_query in snippet.lower()
), f"Snippet should contain the exact phrase '{search_query}'. Got: {snippet}"
assert (
"jordan mentions" in snippet.lower()
), f"Snippet should include context before the exact phrase match. Got: {snippet}"
assert (
"alice" not in snippet.lower()
), f"Snippet should not include separate occurrences of individual words. Got: {snippet}"
text_2 = """The alpha version was released.
Beta testing started yesterday.
The alpha beta integration is complete."""
snippets_2 = SnippetGenerator.generate(text_2, "alpha beta")
assert len(snippets_2) == 1, "Should return 1 snippet for exact phrase match"
assert "alpha beta" in snippets_2[0].lower(), "Should contain exact phrase"
assert (
"version" not in snippets_2[0].lower()
), "Should not include first separate occurrence"
class TestSnippetGenerationEnhanced:
"""Additional snippet generation tests from test_search_enhancements.py."""
def test_snippet_generation_from_webvtt(self):
"""Test snippet generation from WebVTT content."""
webvtt_content = """WEBVTT
00:00:00.000 --> 00:00:05.000
This is the beginning of the transcript
00:00:05.000 --> 00:00:10.000
The search term appears here in the middle
00:00:10.000 --> 00:00:15.000
And this is the end of the content"""
plain_text = WebVTTProcessor.extract_text(webvtt_content)
snippets = SnippetGenerator.generate(plain_text, "search term")
assert len(snippets) > 0
assert any("search term" in snippet.lower() for snippet in snippets)
def test_extract_webvtt_text_with_malformed_variations(self):
"""Test WebVTT extraction with various malformed content."""
malformed_vtt = "This is not valid WebVTT content"
result = WebVTTProcessor.extract_text(malformed_vtt)
assert result == ""
partial_vtt = "WEBVTT\nNo timestamps here"
result = WebVTTProcessor.extract_text(partial_vtt)
assert result == "" or "No timestamps" not in result
class TestPureFunctions:
"""Test the pure functions extracted for functional programming."""
def test_find_all_matches(self):
"""Test finding all match positions in text."""
text = "Python is great. Python is powerful. I love Python."
matches = list(SnippetGenerator.find_all_matches(text, "Python"))
assert matches == [0, 17, 44]
matches = list(SnippetGenerator.find_all_matches(text, "python"))
assert matches == [0, 17, 44]
matches = list(SnippetGenerator.find_all_matches(text, "Ruby"))
assert matches == []
matches = list(SnippetGenerator.find_all_matches("", "test"))
assert matches == []
matches = list(SnippetGenerator.find_all_matches("test", ""))
assert matches == []
def test_create_snippet(self):
"""Test creating a snippet from a match position."""
text = "This is a long text with the word Python in the middle and more text after."
snippet = SnippetGenerator.create_snippet(text, 35, max_length=150)
assert "Python" in snippet.text()
assert snippet.start >= 0
assert snippet.end <= len(text)
assert isinstance(snippet, SnippetCandidate)
assert len(snippet.text()) > 0
assert snippet.start <= snippet.end
long_text = "A" * 200
snippet = SnippetGenerator.create_snippet(long_text, 100, max_length=50)
assert snippet.text().startswith("...")
assert snippet.text().endswith("...")
snippet = SnippetGenerator.create_snippet("short text", 0, max_length=100)
assert snippet.start == 0
assert "short text" in snippet.text()
def test_filter_non_overlapping(self):
"""Test filtering overlapping snippets."""
candidates = [
SnippetCandidate(_text="First snippet", start=0, _original_text_length=100),
SnippetCandidate(_text="Overlapping", start=10, _original_text_length=100),
SnippetCandidate(
_text="Third snippet", start=40, _original_text_length=100
),
SnippetCandidate(
_text="Fourth snippet", start=65, _original_text_length=100
),
]
filtered = list(SnippetGenerator.filter_non_overlapping(iter(candidates)))
assert filtered == [
"First snippet...",
"...Third snippet...",
"...Fourth snippet...",
]
filtered = list(SnippetGenerator.filter_non_overlapping(iter([])))
assert filtered == []
def test_generate_integration(self):
"""Test the main SnippetGenerator.generate function."""
text = "Machine learning is amazing. Machine learning transforms data. Learn machine learning today."
snippets = SnippetGenerator.generate(text, "machine learning")
assert len(snippets) <= 3
assert all("machine learning" in s.lower() for s in snippets)
snippets = SnippetGenerator.generate(text, "machine learning", max_snippets=2)
assert len(snippets) <= 2
snippets = SnippetGenerator.generate(text, "machine vision")
assert len(snippets) > 0
assert any("machine" in s.lower() for s in snippets)
def test_extract_webvtt_text_basic(self):
"""Test WebVTT text extraction (basic test, full tests exist elsewhere)."""
webvtt = """WEBVTT
00:00:00.000 --> 00:00:02.000
Hello world
00:00:02.000 --> 00:00:04.000
This is a test"""
result = WebVTTProcessor.extract_text(webvtt)
assert "Hello world" in result
assert "This is a test" in result
# Test empty input
assert WebVTTProcessor.extract_text("") == ""
assert WebVTTProcessor.extract_text(None) == ""
def test_generate_webvtt_snippets(self):
"""Test generating snippets from WebVTT content."""
webvtt = """WEBVTT
00:00:00.000 --> 00:00:02.000
Python programming is great
00:00:02.000 --> 00:00:04.000
Learn Python today"""
snippets = WebVTTProcessor.generate_snippets(webvtt, "Python")
assert len(snippets) > 0
assert any("Python" in s for s in snippets)
snippets = WebVTTProcessor.generate_snippets("", "Python")
assert snippets == []
def test_from_summary(self):
"""Test generating snippets from summary text."""
summary = "This meeting discussed Python development and machine learning applications."
snippets = SnippetGenerator.from_summary(summary, "Python")
assert len(snippets) > 0
assert any("Python" in s for s in snippets)
long_summary = "Python " * 20
snippets = SnippetGenerator.from_summary(long_summary, "Python")
assert len(snippets) <= 2
def test_combine_sources(self):
"""Test combining snippets from multiple sources."""
summary = "Python is a great programming language."
webvtt = """WEBVTT
00:00:00.000 --> 00:00:02.000
Learn Python programming
00:00:02.000 --> 00:00:04.000
Python is powerful"""
snippets, total_count = SnippetGenerator.combine_sources(
summary, webvtt, "Python", max_total=3
)
assert len(snippets) <= 3
assert len(snippets) > 0
assert total_count > 0
snippets, total_count = SnippetGenerator.combine_sources(
summary, None, "Python", max_total=3
)
assert len(snippets) > 0
assert all("Python" in s for s in snippets)
assert total_count == 1
snippets, total_count = SnippetGenerator.combine_sources(
None, webvtt, "Python", max_total=3
)
assert len(snippets) > 0
assert total_count == 2
long_summary = "Python " * 10
snippets, total_count = SnippetGenerator.combine_sources(
long_summary, webvtt, "Python", max_total=2
)
assert len(snippets) == 2
assert total_count >= 10
def test_match_counting_sum_logic(self):
"""Test that match counting correctly sums matches from both sources."""
summary = "data science uses data analysis and data mining techniques"
webvtt = """WEBVTT
00:00:00.000 --> 00:00:02.000
Big data processing
00:00:02.000 --> 00:00:04.000
data visualization and data storage"""
snippets, total_count = SnippetGenerator.combine_sources(
summary, webvtt, "data", max_total=3
)
assert total_count == 6
assert len(snippets) <= 3
summary_snippets, summary_count = SnippetGenerator.combine_sources(
summary, None, "data", max_total=3
)
assert summary_count == 3
webvtt_snippets, webvtt_count = SnippetGenerator.combine_sources(
None, webvtt, "data", max_total=3
)
assert webvtt_count == 3
snippets_empty, count_empty = SnippetGenerator.combine_sources(
None, None, "data", max_total=3
)
assert snippets_empty == []
assert count_empty == 0
def test_edge_cases(self):
"""Test edge cases for the pure functions."""
text = "Test with special: @#$%^&*() characters"
snippets = SnippetGenerator.generate(text, "@#$%")
assert len(snippets) > 0
long_query = "a" * 100
snippets = SnippetGenerator.generate("Some text", long_query)
assert snippets == []
text = "Unicode test: café, naïve, 日本語"
snippets = SnippetGenerator.generate(text, "café")
assert len(snippets) > 0
assert "café" in snippets[0]

View File

@@ -19,7 +19,7 @@ async def fake_transcript(tmpdir, client):
transcript = await transcripts_controller.get_by_id(tid)
assert transcript is not None
await transcripts_controller.update(transcript, {"status": "finished"})
await transcripts_controller.update(transcript, {"status": "ended"})
# manually copy a file at the expected location
audio_filename = transcript.audio_mp3_filename

View File

@@ -29,10 +29,10 @@ async def client(app_lifespan):
@pytest.mark.asyncio
async def test_transcript_process(
tmpdir,
whisper_transcript,
dummy_llm,
dummy_processors,
dummy_diarization,
dummy_file_transcript,
dummy_file_diarization,
dummy_storage,
client,
):
@@ -56,8 +56,8 @@ async def test_transcript_process(
assert response.status_code == 200
assert response.json()["status"] == "ok"
# wait for processing to finish (max 10 minutes)
timeout_seconds = 600 # 10 minutes
# wait for processing to finish (max 1 minute)
timeout_seconds = 60
start_time = time.monotonic()
while (time.monotonic() - start_time) < timeout_seconds:
# fetch the transcript and check if it is ended
@@ -75,9 +75,10 @@ async def test_transcript_process(
)
assert response.status_code == 200
assert response.json()["status"] == "ok"
await asyncio.sleep(2)
# wait for processing to finish (max 10 minutes)
timeout_seconds = 600 # 10 minutes
# wait for processing to finish (max 1 minute)
timeout_seconds = 60
start_time = time.monotonic()
while (time.monotonic() - start_time) < timeout_seconds:
# fetch the transcript and check if it is ended
@@ -99,4 +100,4 @@ async def test_transcript_process(
response = await client.get(f"/transcripts/{tid}/topics")
assert response.status_code == 200
assert len(response.json()) == 1
assert "want to share" in response.json()[0]["transcript"]
assert "Hello world. How are you today?" in response.json()[0]["transcript"]

View File

@@ -12,7 +12,8 @@ async def test_transcript_upload_file(
tmpdir,
dummy_llm,
dummy_processors,
dummy_diarization,
dummy_file_transcript,
dummy_file_diarization,
dummy_storage,
client,
):
@@ -36,8 +37,8 @@ async def test_transcript_upload_file(
assert response.status_code == 200
assert response.json()["status"] == "ok"
# wait the processing to finish (max 10 minutes)
timeout_seconds = 600 # 10 minutes
# wait the processing to finish (max 1 minute)
timeout_seconds = 60
start_time = time.monotonic()
while (time.monotonic() - start_time) < timeout_seconds:
# fetch the transcript and check if it is ended
@@ -47,7 +48,7 @@ async def test_transcript_upload_file(
break
await asyncio.sleep(1)
else:
pytest.fail(f"Processing timed out after {timeout_seconds} seconds")
return pytest.fail(f"Processing timed out after {timeout_seconds} seconds")
# check the transcript is ended
transcript = resp.json()
@@ -59,4 +60,4 @@ async def test_transcript_upload_file(
response = await client.get(f"/transcripts/{tid}/topics")
assert response.status_code == 200
assert len(response.json()) == 1
assert "want to share" in response.json()[0]["transcript"]
assert "Hello world. How are you today?" in response.json()[0]["transcript"]

872
server/uv.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,26 +1,67 @@
import React from "react";
import React, { useEffect } from "react";
import { Pagination, IconButton, ButtonGroup } from "@chakra-ui/react";
import { LuChevronLeft, LuChevronRight } from "react-icons/lu";
// explicitly 1-based to prevent +/-1-confusion errors
export const FIRST_PAGE = 1 as PaginationPage;
export const parsePaginationPage = (
page: number,
):
| {
value: PaginationPage;
}
| {
error: string;
} => {
if (page < FIRST_PAGE)
return {
error: "Page must be greater than 0",
};
if (!Number.isInteger(page))
return {
error: "Page must be an integer",
};
return {
value: page as PaginationPage,
};
};
export type PaginationPage = number & { __brand: "PaginationPage" };
export const PaginationPage = (page: number): PaginationPage => {
const v = parsePaginationPage(page);
if ("error" in v) throw new Error(v.error);
return v.value;
};
export const paginationPageTo0Based = (page: PaginationPage): number =>
page - FIRST_PAGE;
type PaginationProps = {
page: number;
setPage: (page: number) => void;
page: PaginationPage;
setPage: (page: PaginationPage) => void;
total: number;
size: number;
};
export const totalPages = (total: number, size: number) => {
return Math.ceil(total / size);
};
export default function PaginationComponent(props: PaginationProps) {
const { page, setPage, total, size } = props;
const totalPages = Math.ceil(total / size);
if (totalPages <= 1) return null;
useEffect(() => {
if (page > totalPages(total, size)) {
console.error(
`Page number (${page}) is greater than total pages (${totalPages}) in pagination`,
);
}
}, [page, totalPages(total, size)]);
return (
<Pagination.Root
count={total}
pageSize={size}
page={page}
onPageChange={(details) => setPage(details.page)}
onPageChange={(details) => setPage(PaginationPage(details.page))}
style={{ display: "flex", justifyContent: "center" }}
>
<ButtonGroup variant="ghost" size="xs">

View File

@@ -1,34 +0,0 @@
import React, { useState } from "react";
import { Flex, Input, Button } from "@chakra-ui/react";
interface SearchBarProps {
onSearch: (searchTerm: string) => void;
}
export default function SearchBar({ onSearch }: SearchBarProps) {
const [searchInputValue, setSearchInputValue] = useState("");
const handleSearch = () => {
onSearch(searchInputValue);
};
const handleKeyDown = (event: React.KeyboardEvent) => {
if (event.key === "Enter") {
handleSearch();
}
};
return (
<Flex alignItems="center">
<Input
placeholder="Search transcriptions..."
value={searchInputValue}
onChange={(e) => setSearchInputValue(e.target.value)}
onKeyDown={handleKeyDown}
/>
<Button ml={2} onClick={handleSearch}>
Search
</Button>
</Flex>
);
}

View File

@@ -4,8 +4,8 @@ import { LuMenu, LuTrash, LuRotateCw } from "react-icons/lu";
interface TranscriptActionsMenuProps {
transcriptId: string;
onDelete: (transcriptId: string) => (e: any) => void;
onReprocess: (transcriptId: string) => (e: any) => void;
onDelete: (transcriptId: string) => void;
onReprocess: (transcriptId: string) => void;
}
export default function TranscriptActionsMenu({
@@ -24,11 +24,17 @@ export default function TranscriptActionsMenu({
<Menu.Content>
<Menu.Item
value="reprocess"
onClick={(e) => onReprocess(transcriptId)(e)}
onClick={() => onReprocess(transcriptId)}
>
<LuRotateCw /> Reprocess
</Menu.Item>
<Menu.Item value="delete" onClick={(e) => onDelete(transcriptId)(e)}>
<Menu.Item
value="delete"
onClick={(e) => {
e.stopPropagation();
onDelete(transcriptId);
}}
>
<LuTrash /> Delete
</Menu.Item>
</Menu.Content>

View File

@@ -1,27 +1,290 @@
import React from "react";
import { Box, Stack, Text, Flex, Link, Spinner } from "@chakra-ui/react";
import React, { useState } from "react";
import {
Box,
Stack,
Text,
Flex,
Link,
Spinner,
Badge,
HStack,
VStack,
} from "@chakra-ui/react";
import NextLink from "next/link";
import { GetTranscriptMinimal } from "../../../api";
import { formatTimeMs, formatLocalDate } from "../../../lib/time";
import TranscriptStatusIcon from "./TranscriptStatusIcon";
import TranscriptActionsMenu from "./TranscriptActionsMenu";
import {
highlightMatches,
generateTextFragment,
} from "../../../lib/textHighlight";
import { SearchResult } from "../../../api";
interface TranscriptCardsProps {
transcripts: GetTranscriptMinimal[];
onDelete: (transcriptId: string) => (e: any) => void;
onReprocess: (transcriptId: string) => (e: any) => void;
loading?: boolean;
results: SearchResult[];
query: string;
isLoading?: boolean;
onDelete: (transcriptId: string) => void;
onReprocess: (transcriptId: string) => void;
}
function highlightText(text: string, query: string): React.ReactNode {
if (!query) return text;
const matches = highlightMatches(text, query);
if (matches.length === 0) return text;
// Sort matches by index to process them in order
const sortedMatches = [...matches].sort((a, b) => a.index - b.index);
const parts: React.ReactNode[] = [];
let lastIndex = 0;
sortedMatches.forEach((match, i) => {
// Add text before the match
if (match.index > lastIndex) {
parts.push(
<Text as="span" key={`text-${i}`} display="inline">
{text.slice(lastIndex, match.index)}
</Text>,
);
}
// Add the highlighted match
parts.push(
<Text
as="mark"
key={`match-${i}`}
bg="yellow.200"
px={0.5}
display="inline"
>
{match.match}
</Text>,
);
lastIndex = match.index + match.match.length;
});
// Add remaining text after last match
if (lastIndex < text.length) {
parts.push(
<Text as="span" key={`text-end`} display="inline">
{text.slice(lastIndex)}
</Text>,
);
}
return parts;
}
const transcriptHref = (
transcriptId: string,
mainSnippet: string,
query: string,
): `/transcripts/${string}` => {
const urlTextFragment = mainSnippet
? generateTextFragment(mainSnippet, query)
: null;
const urlTextFragmentWithHash = urlTextFragment
? `#${urlTextFragment.k}=${encodeURIComponent(urlTextFragment.v)}`
: "";
return `/transcripts/${transcriptId}${urlTextFragmentWithHash}`;
};
// note that it's strongly tied to search logic - in case you want to use it independently, refactor
function TranscriptCard({
result,
query,
onDelete,
onReprocess,
}: {
result: SearchResult;
query: string;
onDelete: (transcriptId: string) => void;
onReprocess: (transcriptId: string) => void;
}) {
const [isExpanded, setIsExpanded] = useState(false);
const mainSnippet = result.search_snippets[0];
const additionalSnippets = result.search_snippets.slice(1);
const totalMatches = result.total_match_count || 0;
const snippetsShown = result.search_snippets.length;
const remainingMatches = totalMatches - snippetsShown;
const hasAdditionalSnippets = additionalSnippets.length > 0;
const resultTitle = result.title || "Unnamed Transcript";
const formattedDuration = result.duration
? formatTimeMs(result.duration)
: "N/A";
const formattedDate = formatLocalDate(result.created_at);
const source =
result.source_kind === "room"
? result.room_name || result.room_id
: result.source_kind;
const handleExpandClick = (e: React.MouseEvent) => {
e.preventDefault();
e.stopPropagation();
setIsExpanded(!isExpanded);
};
return (
<Box borderWidth={1} p={4} borderRadius="md" fontSize="sm">
<Flex justify="space-between" alignItems="flex-start" gap="2">
<Box>
<TranscriptStatusIcon status={result.status} />
</Box>
<Box flex="1">
{/* Title with highlighting and text fragment for deep linking */}
<Link
as={NextLink}
href={transcriptHref(result.id, mainSnippet, query)}
fontWeight="600"
display="block"
mb={2}
>
{highlightText(resultTitle, query)}
</Link>
{/* Metadata - Horizontal on desktop, vertical on mobile */}
<Flex
direction={{ base: "column", md: "row" }}
gap={{ base: 1, md: 2 }}
fontSize="xs"
color="gray.600"
flexWrap="wrap"
align={{ base: "flex-start", md: "center" }}
>
<Flex align="center" gap={1}>
<Text fontWeight="medium" color="gray.500">
Source:
</Text>
<Text>{source}</Text>
</Flex>
<Text display={{ base: "none", md: "block" }} color="gray.400">
</Text>
<Flex align="center" gap={1}>
<Text fontWeight="medium" color="gray.500">
Date:
</Text>
<Text>{formattedDate}</Text>
</Flex>
<Text display={{ base: "none", md: "block" }} color="gray.400">
</Text>
<Flex align="center" gap={1}>
<Text fontWeight="medium" color="gray.500">
Duration:
</Text>
<Text>{formattedDuration}</Text>
</Flex>
</Flex>
{/* Search Results Section - only show when searching */}
{mainSnippet && (
<>
{/* Main Snippet */}
<Box
mt={3}
p={2}
bg="gray.50"
borderLeft="2px solid"
borderLeftColor="blue.400"
borderRadius="sm"
fontSize="xs"
>
<Text color="gray.700">
{highlightText(mainSnippet, query)}
</Text>
</Box>
{hasAdditionalSnippets && (
<>
<Flex
mt={2}
p={2}
bg="blue.50"
borderRadius="sm"
cursor="pointer"
onClick={handleExpandClick}
_hover={{ bg: "blue.100" }}
align="center"
justify="space-between"
>
<HStack gap={2}>
<Badge
bg="blue.500"
color="white"
fontSize="xs"
px={2}
borderRadius="full"
>
{remainingMatches > 0
? `${additionalSnippets.length + remainingMatches}+`
: additionalSnippets.length}
</Badge>
<Text fontSize="xs" color="blue.600" fontWeight="medium">
more{" "}
{additionalSnippets.length + remainingMatches === 1
? "match"
: "matches"}
{remainingMatches > 0 &&
` (${additionalSnippets.length} shown)`}
</Text>
</HStack>
<Text fontSize="xs" color="blue.600">
{isExpanded ? "▲" : "▼"}
</Text>
</Flex>
{/* Additional Snippets */}
{isExpanded && (
<VStack align="stretch" gap={2} mt={2}>
{additionalSnippets.map((snippet, index) => (
<Box
key={index}
p={2}
bg="gray.50"
borderLeft="2px solid"
borderLeftColor="gray.300"
borderRadius="sm"
fontSize="xs"
>
<Text color="gray.700">
{highlightText(snippet, query)}
</Text>
</Box>
))}
</VStack>
)}
</>
)}
</>
)}
</Box>
<TranscriptActionsMenu
transcriptId={result.id}
onDelete={onDelete}
onReprocess={onReprocess}
/>
</Flex>
</Box>
);
}
export default function TranscriptCards({
transcripts,
results,
query,
isLoading,
onDelete,
onReprocess,
loading,
}: TranscriptCardsProps) {
return (
<Box display={{ base: "block", lg: "none" }} position="relative">
{loading && (
<Box position="relative">
{isLoading && (
<Flex
position="absolute"
top={0}
@@ -37,48 +300,19 @@ export default function TranscriptCards({
</Flex>
)}
<Box
opacity={loading ? 0.9 : 1}
pointerEvents={loading ? "none" : "auto"}
opacity={isLoading ? 0.9 : 1}
pointerEvents={isLoading ? "none" : "auto"}
transition="opacity 0.2s ease-in-out"
>
<Stack gap={2}>
{transcripts.map((item) => (
<Box
key={item.id}
borderWidth={1}
p={4}
borderRadius="md"
fontSize="sm"
>
<Flex justify="space-between" alignItems="flex-start" gap="2">
<Box>
<TranscriptStatusIcon status={item.status} />
</Box>
<Box flex="1">
<Link
as={NextLink}
href={`/transcripts/${item.id}`}
fontWeight="600"
display="block"
>
{item.title || "Unnamed Transcript"}
</Link>
<Text>
Source:{" "}
{item.source_kind === "room"
? item.room_name
: item.source_kind}
</Text>
<Text>Date: {formatLocalDate(item.created_at)}</Text>
<Text>Duration: {formatTimeMs(item.duration)}</Text>
</Box>
<TranscriptActionsMenu
transcriptId={item.id}
onDelete={onDelete}
onReprocess={onReprocess}
/>
</Flex>
</Box>
<Stack gap={3}>
{results.map((result) => (
<TranscriptCard
key={result.id}
result={result}
query={query}
onDelete={onDelete}
onReprocess={onReprocess}
/>
))}
</Stack>
</Box>

View File

@@ -1,99 +0,0 @@
import React from "react";
import { Box, Table, Link, Flex, Spinner } from "@chakra-ui/react";
import NextLink from "next/link";
import { GetTranscriptMinimal } from "../../../api";
import { formatTimeMs, formatLocalDate } from "../../../lib/time";
import TranscriptStatusIcon from "./TranscriptStatusIcon";
import TranscriptActionsMenu from "./TranscriptActionsMenu";
interface TranscriptTableProps {
transcripts: GetTranscriptMinimal[];
onDelete: (transcriptId: string) => (e: any) => void;
onReprocess: (transcriptId: string) => (e: any) => void;
loading?: boolean;
}
export default function TranscriptTable({
transcripts,
onDelete,
onReprocess,
loading,
}: TranscriptTableProps) {
return (
<Box display={{ base: "none", lg: "block" }} position="relative">
{loading && (
<Flex
position="absolute"
top={0}
left={0}
right={0}
bottom={0}
align="center"
justify="center"
>
<Spinner size="xl" color="gray.700" />
</Flex>
)}
<Box
opacity={loading ? 0.9 : 1}
pointerEvents={loading ? "none" : "auto"}
transition="opacity 0.2s ease-in-out"
>
<Table.Root>
<Table.Header>
<Table.Row>
<Table.ColumnHeader
width="16px"
fontWeight="600"
></Table.ColumnHeader>
<Table.ColumnHeader width="400px" fontWeight="600">
Transcription Title
</Table.ColumnHeader>
<Table.ColumnHeader width="150px" fontWeight="600">
Source
</Table.ColumnHeader>
<Table.ColumnHeader width="200px" fontWeight="600">
Date
</Table.ColumnHeader>
<Table.ColumnHeader width="100px" fontWeight="600">
Duration
</Table.ColumnHeader>
<Table.ColumnHeader
width="50px"
fontWeight="600"
></Table.ColumnHeader>
</Table.Row>
</Table.Header>
<Table.Body>
{transcripts.map((item) => (
<Table.Row key={item.id}>
<Table.Cell>
<TranscriptStatusIcon status={item.status} />
</Table.Cell>
<Table.Cell>
<Link as={NextLink} href={`/transcripts/${item.id}`}>
{item.title || "Unnamed Transcript"}
</Link>
</Table.Cell>
<Table.Cell>
{item.source_kind === "room"
? item.room_name
: item.source_kind}
</Table.Cell>
<Table.Cell>{formatLocalDate(item.created_at)}</Table.Cell>
<Table.Cell>{formatTimeMs(item.duration)}</Table.Cell>
<Table.Cell>
<TranscriptActionsMenu
transcriptId={item.id}
onDelete={onDelete}
onReprocess={onReprocess}
/>
</Table.Cell>
</Table.Row>
))}
</Table.Body>
</Table.Root>
</Box>
</Box>
);
}

Some files were not shown because too many files have changed in this diff Show More