reflector/server/docs/data_retention.md
Mathieu Virbel 6f0c7c1a5e feat(cleanup): add automatic data retention for public instances (#574)
* feat(cleanup): add automatic data retention for public instances

- Add Celery task to clean up anonymous data after configurable retention period
- Delete transcripts, meetings, and orphaned recordings older than retention days
- Only runs when PUBLIC_MODE is enabled to prevent accidental data loss
- Properly removes all associated files (local and S3 storage)
- Add manual cleanup tool for testing and intervention
- Configure retention via PUBLIC_DATA_RETENTION_DAYS setting (default: 7 days)

Fixes #571

* fix: apply pre-commit formatting fixes

* fix: properly delete recording files from storage during cleanup

- Add storage deletion for orphaned recordings in both cleanup task and manual tool
- Delete from storage before removing database records
- Log warnings if storage deletion fails but continue with database cleanup

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* refactor: cleanup_old_data for better logging

* fix: linting

* test: fix meeting cleanup test to not require room controller

- Simplify test by directly inserting meetings into database
- Remove dependency on non-existent rooms_controller.create method
- Tests now pass successfully

* fix: linting

* refactor: simplify cleanup tool to use worker implementation

- Remove duplicate cleanup logic from manual tool
- Use the same _cleanup_old_public_data function from worker
- Remove dry-run feature as requested
- Prevent code duplication and ensure consistency
- Update documentation to reflect changes

* refactor: split cleanup worker into smaller functions

- Move all imports to the top of the file
- Extract cleanup logic into separate functions:
  - cleanup_old_transcripts()
  - cleanup_old_meetings()
  - cleanup_orphaned_recordings()
  - log_cleanup_results()
- Make code more maintainable and testable
- Add days parameter support to Celery task
- Update manual tool to work with refactored code

* feat: add TypedDict typing for cleanup stats

- Add CleanupStats TypedDict for better type safety
- Update all function signatures to use proper typing
- Add return type annotations to _cleanup_old_public_data
- Improves code maintainability and IDE support

* feat: add CASCADE DELETE to meeting_consent foreign key

- Add ondelete="CASCADE" to meeting_consent.meeting_id foreign key
- Generate and apply migration to update existing constraint
- Remove manual consent deletion from cleanup code
- Add unit test to verify CASCADE DELETE behavior

* style: linting

* fix: alembic migration branchpoint

* fix: correct downgrade constraint name in CASCADE DELETE migration

* fix: regenerate CASCADE DELETE migration with proper constraint names

- Delete problematic migration and regenerate with correct names
- Use explicit constraint name in both upgrade and downgrade
- Ensure migration works bidirectionally
- All tests passing including CASCADE DELETE test

* style: linting

* refactor: simplify cleanup to use transcripts as entry point

- Remove orphaned_recordings cleanup (not part of this PR scope)
- Remove separate old_meetings cleanup
- Transcripts are now the main entry point for cleanup
- Associated meetings and recordings are deleted with their transcript
- Use single database connection for all operations
- Update tests to reflect new approach

* refactor: cleanup and rename functions for clarity

- Rename _cleanup_old_public_data to cleanup_old_public_data (make public)
- Rename celery task to cleanup_old_public_data_task for clarity
- Update docstrings and improve code organization
- Remove unnecessary comments and simplify deletion logic
- Update tests to use new function names
- All tests passing

* style: linting

* style: typing and review

* fix: add transaction on cleanup_single_transcript

* fix: naming

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
2025-08-29 08:47:14 -06:00


Data Retention and Cleanup

Overview

Public instances of Reflector automatically enforce a data retention policy: anonymous user data is deleted after a configurable period (default: 7 days). This limits how long anonymous data is kept and prevents unbounded storage growth.

Configuration

Environment Variables

  • PUBLIC_MODE (bool): Must be set to true to enable automatic cleanup
  • PUBLIC_DATA_RETENTION_DAYS (int): Number of days to retain anonymous data (default: 7)
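As a rough illustration of how these two settings combine, the sketch below reads them from the environment and derives the cutoff timestamp before which anonymous data counts as expired. The names `RetentionSettings`, `load_retention_settings`, and `retention_cutoff`, as well as the accepted boolean spellings, are assumptions made for this example and are not taken from Reflector's settings code.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, NamedTuple


class RetentionSettings(NamedTuple):
    """Hypothetical container for the two documented variables."""
    public_mode: bool
    retention_days: int


def load_retention_settings(env: Mapping[str, str] = os.environ) -> RetentionSettings:
    """Parse PUBLIC_MODE and PUBLIC_DATA_RETENTION_DAYS from a mapping."""
    # Assumption: "true", "1", and "yes" (any case) count as enabled.
    raw_mode = env.get("PUBLIC_MODE", "false").strip().lower()
    public_mode = raw_mode in {"true", "1", "yes"}
    # Falls back to the documented default of 7 days.
    retention_days = int(env.get("PUBLIC_DATA_RETENTION_DAYS", "7"))
    return RetentionSettings(public_mode, retention_days)


def retention_cutoff(settings: RetentionSettings, now: datetime) -> datetime:
    """Return the timestamp before which anonymous data is expired."""
    return now - timedelta(days=settings.retention_days)


settings = load_retention_settings({"PUBLIC_MODE": "true"})
now = datetime(2025, 8, 29, tzinfo=timezone.utc)
print(settings.retention_days)                 # 7
print(retention_cutoff(settings, now).date())  # 2025-08-22
```

Passing the environment as a mapping keeps the parsing testable without touching the real process environment.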

What Gets Deleted

Once data is older than the retention period, the following items are automatically removed:

  1. Transcripts from anonymous users (where user_id is NULL):
    • Database records
    • Local files (audio.wav, audio.mp3, audio.json waveform)
    • Storage files (cloud storage if configured)
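The selection rule described above (anonymous and older than the retention period) can be sketched in a few lines. `TranscriptRow` is a hypothetical stand-in for a transcript record, and measuring age from a `created_at` field is an assumption of this example; the real query may use a different column.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class TranscriptRow:
    """Hypothetical stand-in for a transcript database record."""
    id: str
    user_id: Optional[str]  # None means the transcript is anonymous
    created_at: datetime    # assumption: age is measured from creation time


def select_expired_anonymous(
    rows: List[TranscriptRow], now: datetime, retention_days: int = 7
) -> List[TranscriptRow]:
    """Return rows that are anonymous AND older than the retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in rows if r.user_id is None and r.created_at < cutoff]


now = datetime(2025, 8, 29, tzinfo=timezone.utc)
rows = [
    TranscriptRow("old-anon", None, now - timedelta(days=10)),
    TranscriptRow("new-anon", None, now - timedelta(days=2)),
    TranscriptRow("old-user", "user-1", now - timedelta(days=10)),
]
print([r.id for r in select_expired_anonymous(rows, now)])  # ['old-anon']
```

Only the first row qualifies: the second is still within the retention window, and the third belongs to an authenticated user.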

Automatic Cleanup

Celery Beat Schedule

When PUBLIC_MODE=true, a Celery beat task runs daily at 3 AM (in the timezone Celery is configured with, which is UTC unless overridden) to clean up old data:

# Automatically scheduled when PUBLIC_MODE=true
"cleanup_old_public_data": {
    "task": "reflector.worker.cleanup.cleanup_old_public_data_task",
    "schedule": crontab(hour=3, minute=0),  # Daily at 3 AM
}

Running the Worker

Ensure that both the Celery worker and the beat scheduler are running:

# Start Celery worker
uv run celery -A reflector.worker.app worker --loglevel=info

# Start Celery beat scheduler (in another terminal)
uv run celery -A reflector.worker.app beat

Manual Cleanup

For testing or manual intervention, use the cleanup tool:

# Delete data older than 7 days (default)
uv run python -m reflector.tools.cleanup_old_data

# Delete data older than 30 days
uv run python -m reflector.tools.cleanup_old_data --days 30

Note: The manual tool uses the same implementation as the Celery worker task to ensure consistency.
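For readers curious how a `--days` option of this kind is typically wired up, here is a minimal argparse sketch. It is not the tool's actual source: `parse_args`, `DEFAULT_RETENTION_DAYS`, and the rejection of non-positive values are assumptions made for illustration.

```python
import argparse
from typing import List, Optional

# Mirrors the documented default of PUBLIC_DATA_RETENTION_DAYS.
DEFAULT_RETENTION_DAYS = 7


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line of a hypothetical cleanup tool."""
    parser = argparse.ArgumentParser(
        description="Delete anonymous data older than N days (illustrative sketch)"
    )
    parser.add_argument(
        "--days",
        type=int,
        default=DEFAULT_RETENTION_DAYS,
        help="retention period in days",
    )
    args = parser.parse_args(argv)
    # Assumption: a zero or negative retention period is treated as a mistake.
    if args.days < 1:
        parser.error("--days must be a positive integer")
    return args


print(parse_args([]).days)                # 7
print(parse_args(["--days", "30"]).days)  # 30
```

A real tool would then pass `args.days` to the shared cleanup function, so that the manual and scheduled paths behave identically.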

Important Notes

  1. User Data Deletion: Only anonymous data (where user_id is NULL) is deleted. Authenticated user data is preserved.

  2. Storage Cleanup: The cleanup removes local files and, when cloud storage is configured, the corresponding stored objects.

  3. Error Handling: If individual deletions fail, the cleanup continues and logs errors. Failed deletions are reported in the task output.

  4. Public Instance Only: The automatic cleanup task only runs when PUBLIC_MODE=true to prevent accidental data loss in private deployments.
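The error-handling behaviour in note 3 (continue on failure, report errors at the end) follows a common pattern, sketched below. The commit history mentions a `CleanupStats` TypedDict, but the fields shown here, the `cleanup_all` function, and the `fake_delete` callback are assumptions for this example rather than the worker's real code.

```python
import logging
from typing import Callable, Iterable, List, TypedDict

logger = logging.getLogger("cleanup_sketch")


class CleanupStats(TypedDict):
    """Illustrative stats shape; the real fields may differ."""
    transcripts_deleted: int
    errors: List[str]


def cleanup_all(
    transcript_ids: Iterable[str], delete_one: Callable[[str], None]
) -> CleanupStats:
    """Delete each transcript, recording failures instead of aborting."""
    stats: CleanupStats = {"transcripts_deleted": 0, "errors": []}
    for transcript_id in transcript_ids:
        try:
            delete_one(transcript_id)
            stats["transcripts_deleted"] += 1
        except Exception as exc:  # keep going so one failure does not block the rest
            message = f"Failed to delete transcript {transcript_id}: {exc}"
            logger.error(message)
            stats["errors"].append(message)
    return stats


def fake_delete(transcript_id: str) -> None:
    """Hypothetical deletion callback that fails for one id."""
    if transcript_id == "t2":
        raise RuntimeError("storage unavailable")


stats = cleanup_all(["t1", "t2", "t3"], fake_delete)
print(stats["transcripts_deleted"])  # 2
print(len(stats["errors"]))          # 1
```

Collecting error messages in the returned stats is what lets a caller report failed deletions after the loop has finished.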

Testing

Run the cleanup tests:

uv run pytest tests/test_cleanup.py -v

Monitoring

Check Celery logs for cleanup task execution:

# Look for cleanup task logs
grep "cleanup_old_public_data" celery.log
grep "Starting cleanup of old public data" celery.log

Task statistics are logged after each run:

  • Number of transcripts deleted
  • Number of meetings deleted
  • Number of recordings deleted
  • Any errors encountered