mirror of
https://github.com/Monadical-SAS/reflector.git
synced 2025-12-20 20:29:06 +00:00
feat(cleanup): add automatic data retention for public instances (#574)
* feat(cleanup): add automatic data retention for public instances
- Add Celery task to clean up anonymous data after configurable retention period
- Delete transcripts, meetings, and orphaned recordings older than retention days
- Only runs when PUBLIC_MODE is enabled to prevent accidental data loss
- Properly removes all associated files (local and S3 storage)
- Add manual cleanup tool for testing and intervention
- Configure retention via PUBLIC_DATA_RETENTION_DAYS setting (default: 7 days)

Fixes #571

* fix: apply pre-commit formatting fixes

* fix: properly delete recording files from storage during cleanup
- Add storage deletion for orphaned recordings in both cleanup task and manual tool
- Delete from storage before removing database records
- Log warnings if storage deletion fails but continue with database cleanup

* Apply suggestion from @pr-agent-monadical[bot]
Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* Apply suggestion from @pr-agent-monadical[bot]
Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* refactor: cleanup_old_data for better logging

* fix: linting

* test: fix meeting cleanup test to not require room controller
- Simplify test by directly inserting meetings into database
- Remove dependency on non-existent rooms_controller.create method
- Tests now pass successfully

* fix: linting

* refactor: simplify cleanup tool to use worker implementation
- Remove duplicate cleanup logic from manual tool
- Use the same _cleanup_old_public_data function from worker
- Remove dry-run feature as requested
- Prevent code duplication and ensure consistency
- Update documentation to reflect changes

* refactor: split cleanup worker into smaller functions
- Move all imports to the top of the file
- Extract cleanup logic into separate functions:
  - cleanup_old_transcripts()
  - cleanup_old_meetings()
  - cleanup_orphaned_recordings()
  - log_cleanup_results()
- Make code more maintainable and testable
- Add days parameter support to Celery task
- Update manual tool to work with refactored code

* feat: add TypedDict typing for cleanup stats
- Add CleanupStats TypedDict for better type safety
- Update all function signatures to use proper typing
- Add return type annotations to _cleanup_old_public_data
- Improves code maintainability and IDE support

* feat: add CASCADE DELETE to meeting_consent foreign key
- Add ondelete="CASCADE" to meeting_consent.meeting_id foreign key
- Generate and apply migration to update existing constraint
- Remove manual consent deletion from cleanup code
- Add unit test to verify CASCADE DELETE behavior

* style: linting

* fix: alembic migration branchpoint

* fix: correct downgrade constraint name in CASCADE DELETE migration

* fix: regenerate CASCADE DELETE migration with proper constraint names
- Delete problematic migration and regenerate with correct names
- Use explicit constraint name in both upgrade and downgrade
- Ensure migration works bidirectionally
- All tests passing including CASCADE DELETE test

* style: linting

* refactor: simplify cleanup to use transcripts as entry point
- Remove orphaned_recordings cleanup (not part of this PR scope)
- Remove separate old_meetings cleanup
- Transcripts are now the main entry point for cleanup
- Associated meetings and recordings are deleted with their transcript
- Use single database connection for all operations
- Update tests to reflect new approach

* refactor: cleanup and rename functions for clarity
- Rename _cleanup_old_public_data to cleanup_old_public_data (make public)
- Rename celery task to cleanup_old_public_data_task for clarity
- Update docstrings and improve code organization
- Remove unnecessary comments and simplify deletion logic
- Update tests to use new function names
- All tests passing

* style: linting

* style: typing and review

* fix: add transaction on cleanup_single_transcript

* fix: naming

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
95
server/docs/data_retention.md
Normal file
@@ -0,0 +1,95 @@
# Data Retention and Cleanup

## Overview

For public instances of Reflector, a data retention policy is automatically enforced to delete anonymous user data after a configurable period (default: 7 days). This ensures compliance with privacy expectations and prevents unbounded storage growth.

## Configuration

### Environment Variables

- `PUBLIC_MODE` (bool): Must be set to `true` to enable automatic cleanup
- `PUBLIC_DATA_RETENTION_DAYS` (int): Number of days to retain anonymous data (default: 7)

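As a rough illustration of how these two settings could be interpreted, the sketch below reads them from the environment and derives the cutoff timestamp before which anonymous data counts as expired. The names `RetentionSettings`, `load_retention_settings`, and `retention_cutoff`, as well as the accepted truthy strings, are assumptions made for this example and are not taken from Reflector's settings code.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class RetentionSettings:
    """Hypothetical container for the two retention settings."""

    public_mode: bool
    retention_days: int


def load_retention_settings(env: Optional[Mapping[str, str]] = None) -> RetentionSettings:
    """Read PUBLIC_MODE and PUBLIC_DATA_RETENTION_DAYS from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    # Assumption: "1", "true" and "yes" (case-insensitive) all enable public mode.
    public_mode = env.get("PUBLIC_MODE", "false").strip().lower() in {"1", "true", "yes"}
    retention_days = int(env.get("PUBLIC_DATA_RETENTION_DAYS", "7"))
    if retention_days < 1:
        raise ValueError("PUBLIC_DATA_RETENTION_DAYS must be at least 1")
    return RetentionSettings(public_mode=public_mode, retention_days=retention_days)


def retention_cutoff(settings: RetentionSettings, now: Optional[datetime] = None) -> datetime:
    """Return the timestamp before which anonymous data is considered expired."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=settings.retention_days)
```

Passing the environment as a mapping keeps the helper easy to exercise in tests without touching the real process environment.
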
### What Gets Deleted

Once data is older than the retention period, the following items are automatically removed:

1. **Transcripts** from anonymous users (where `user_id` is NULL):
   - Database records
   - Local files (audio.wav, audio.mp3, audio.json waveform)
   - Storage files (cloud storage if configured)

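The selection rule above (anonymous and older than the retention period) can be sketched as a single query. This is a minimal illustration against an in-memory SQLite database; the table name `transcript`, the `created_at` column, and both helper functions are hypothetical and do not reflect Reflector's actual schema or database layer.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import List


def find_expired_anonymous_transcripts(conn: sqlite3.Connection, cutoff: datetime) -> List[str]:
    """Return ids of transcripts with no owner that were created before the cutoff."""
    rows = conn.execute(
        "SELECT id FROM transcript "
        "WHERE user_id IS NULL AND created_at < ? "
        "ORDER BY created_at",
        (cutoff.isoformat(),),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db(now: datetime) -> sqlite3.Connection:
    """Create a throwaway database with one expired anonymous transcript."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript (id TEXT PRIMARY KEY, user_id TEXT, created_at TEXT)"
    )
    # Timestamps are stored as ISO 8601 strings with the same UTC offset,
    # so plain string comparison orders them chronologically.
    rows = [
        ("old-anon", None, (now - timedelta(days=10)).isoformat()),
        ("new-anon", None, (now - timedelta(days=1)).isoformat()),
        ("old-user", "user-1", (now - timedelta(days=10)).isoformat()),
    ]
    conn.executemany("INSERT INTO transcript VALUES (?, ?, ?)", rows)
    return conn
```

With a 7-day cutoff, only the anonymous transcript created 10 days ago is selected; the recent anonymous transcript and the authenticated user's transcript are left alone.
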
## Automatic Cleanup

### Celery Beat Schedule

When `PUBLIC_MODE=true`, a Celery beat task runs daily at 3 AM to clean up old data:

```python
# Automatically scheduled when PUBLIC_MODE=true
"cleanup_old_public_data": {
    "task": "reflector.worker.cleanup.cleanup_old_public_data",
    "schedule": crontab(hour=3, minute=0),  # Daily at 3 AM
}
```

### Running the Worker

Ensure both Celery worker and beat scheduler are running:

```bash
# Start Celery worker
uv run celery -A reflector.worker.app worker --loglevel=info

# Start Celery beat scheduler (in another terminal)
uv run celery -A reflector.worker.app beat
```

## Manual Cleanup

For testing or manual intervention, use the cleanup tool:

```bash
# Delete data older than 7 days (default)
uv run python -m reflector.tools.cleanup_old_data

# Delete data older than 30 days
uv run python -m reflector.tools.cleanup_old_data --days 30
```

Note: The manual tool uses the same implementation as the Celery worker task to ensure consistency.

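To show how a `--days` option with a configured fallback might be wired up, here is a minimal argument-parsing sketch. It is not the source of `reflector.tools.cleanup_old_data`; `parse_args`, `resolve_days`, and the validation rule are assumptions for illustration only.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; --days is optional and has no CLI-level default."""
    parser = argparse.ArgumentParser(
        description="Delete anonymous data older than a given number of days"
    )
    parser.add_argument(
        "--days",
        type=int,
        default=None,
        help="retention period in days (falls back to the configured value)",
    )
    return parser.parse_args(argv)


def resolve_days(cli_days: Optional[int], configured_days: int = 7) -> int:
    """Prefer the CLI value; otherwise use the configured retention period."""
    days = configured_days if cli_days is None else cli_days
    if days < 1:
        # Assumption: a non-positive retention period is treated as a usage error.
        raise ValueError("--days must be at least 1")
    return days
```

Keeping the fallback in `resolve_days` rather than in the parser default means the tool and the scheduled task can share one configured value.
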
## Important Notes

1. **User Data Deletion**: Only anonymous data (where `user_id` is NULL) is deleted. Authenticated user data is preserved.

2. **Storage Cleanup**: Both local files and cloud storage objects (when cloud storage is configured) are removed along with the database records.

3. **Error Handling**: If individual deletions fail, the cleanup continues and logs errors. Failed deletions are reported in the task output.

4. **Public Instance Only**: The automatic cleanup task only runs when `PUBLIC_MODE=true` to prevent accidental data loss in private deployments.

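The error-handling behaviour in note 3 (continue after a failed deletion, log it, and report it in the result) can be sketched as follows. The commit message mentions a `CleanupStats` TypedDict, but its fields are not shown here, so the fields below, the `cleanup_all` function, and the `delete_one` callback are hypothetical.

```python
import logging
from typing import Callable, Iterable, List, TypedDict

logger = logging.getLogger("cleanup_sketch")


class CleanupStats(TypedDict):
    """Hypothetical stats shape; the real field names may differ."""

    transcripts_deleted: int
    errors: List[str]


def cleanup_all(
    transcript_ids: Iterable[str],
    delete_one: Callable[[str], None],
) -> CleanupStats:
    """Delete each transcript in turn; one failure never aborts the whole run."""
    stats: CleanupStats = {"transcripts_deleted": 0, "errors": []}
    for transcript_id in transcript_ids:
        try:
            delete_one(transcript_id)
        except Exception as exc:  # deliberately broad: record the failure and move on
            logger.error("Failed to delete transcript %s: %s", transcript_id, exc)
            stats["errors"].append(f"{transcript_id}: {exc}")
        else:
            stats["transcripts_deleted"] += 1
    return stats
```

Returning the collected errors alongside the counts lets the caller log a single summary at the end of the run.
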
## Testing

Run the cleanup tests:

```bash
uv run pytest tests/test_cleanup.py -v
```

## Monitoring

Check Celery logs for cleanup task execution:

```bash
# Look for cleanup task logs
grep "cleanup_old_public_data" celery.log
grep "Starting cleanup of old public data" celery.log
```

Task statistics are logged after each run:
- Number of transcripts deleted
- Number of meetings deleted
- Number of orphaned recordings deleted
- Any errors encountered