Files
reflector/server/reflector/tools/cleanup_old_data.py
Mathieu Virbel 6f0c7c1a5e feat(cleanup): add automatic data retention for public instances (#574)
* feat(cleanup): add automatic data retention for public instances

- Add Celery task to clean up anonymous data after configurable retention period
- Delete transcripts, meetings, and orphaned recordings older than retention days
- Only runs when PUBLIC_MODE is enabled to prevent accidental data loss
- Properly removes all associated files (local and S3 storage)
- Add manual cleanup tool for testing and intervention
- Configure retention via PUBLIC_DATA_RETENTION_DAYS setting (default: 7 days)

Fixes #571

* fix: apply pre-commit formatting fixes

* fix: properly delete recording files from storage during cleanup

- Add storage deletion for orphaned recordings in both cleanup task and manual tool
- Delete from storage before removing database records
- Log warnings if storage deletion fails but continue with database cleanup

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* refactor: cleanup_old_data for better logging

* fix: linting

* test: fix meeting cleanup test to not require room controller

- Simplify test by directly inserting meetings into database
- Remove dependency on non-existent rooms_controller.create method
- Tests now pass successfully

* fix: linting

* refactor: simplify cleanup tool to use worker implementation

- Remove duplicate cleanup logic from manual tool
- Use the same _cleanup_old_public_data function from worker
- Remove dry-run feature as requested
- Prevent code duplication and ensure consistency
- Update documentation to reflect changes

* refactor: split cleanup worker into smaller functions

- Move all imports to the top of the file
- Extract cleanup logic into separate functions:
  - cleanup_old_transcripts()
  - cleanup_old_meetings()
  - cleanup_orphaned_recordings()
  - log_cleanup_results()
- Make code more maintainable and testable
- Add days parameter support to Celery task
- Update manual tool to work with refactored code

* feat: add TypedDict typing for cleanup stats

- Add CleanupStats TypedDict for better type safety
- Update all function signatures to use proper typing
- Add return type annotations to _cleanup_old_public_data
- Improves code maintainability and IDE support

* feat: add CASCADE DELETE to meeting_consent foreign key

- Add ondelete="CASCADE" to meeting_consent.meeting_id foreign key
- Generate and apply migration to update existing constraint
- Remove manual consent deletion from cleanup code
- Add unit test to verify CASCADE DELETE behavior

* style: linting

* fix: alembic migration branchpoint

* fix: correct downgrade constraint name in CASCADE DELETE migration

* fix: regenerate CASCADE DELETE migration with proper constraint names

- Delete problematic migration and regenerate with correct names
- Use explicit constraint name in both upgrade and downgrade
- Ensure migration works bidirectionally
- All tests passing including CASCADE DELETE test

* style: linting

* refactor: simplify cleanup to use transcripts as entry point

- Remove orphaned_recordings cleanup (not part of this PR scope)
- Remove separate old_meetings cleanup
- Transcripts are now the main entry point for cleanup
- Associated meetings and recordings are deleted with their transcript
- Use single database connection for all operations
- Update tests to reflect new approach

* refactor: cleanup and rename functions for clarity

- Rename _cleanup_old_public_data to cleanup_old_public_data (make public)
- Rename celery task to cleanup_old_public_data_task for clarity
- Update docstrings and improve code organization
- Remove unnecessary comments and simplify deletion logic
- Update tests to use new function names
- All tests passing

* style: linting\

* style: typing and review

* fix: add transaction on cleanup_single_transcript

* fix: naming

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
2025-08-29 08:47:14 -06:00

73 lines
1.8 KiB
Python

#!/usr/bin/env python
"""
Manual cleanup tool for old public data.
Uses the same implementation as the Celery worker task.
"""
import argparse
import asyncio
import sys
import structlog
from reflector.settings import settings
from reflector.worker.cleanup import _cleanup_old_public_data
logger = structlog.get_logger(__name__)
async def cleanup_old_data(days: int = 7):
logger.info(
"Starting manual cleanup",
retention_days=days,
public_mode=settings.PUBLIC_MODE,
)
if not settings.PUBLIC_MODE:
logger.critical(
"WARNING: PUBLIC_MODE is False. "
"This tool is intended for public instances only."
)
raise Exception("Tool intended for public instances only")
result = await _cleanup_old_public_data(days=days)
if result:
logger.info(
"Cleanup completed",
transcripts_deleted=result.get("transcripts_deleted", 0),
meetings_deleted=result.get("meetings_deleted", 0),
recordings_deleted=result.get("recordings_deleted", 0),
errors_count=len(result.get("errors", [])),
)
if result.get("errors"):
logger.warning(
"Errors encountered during cleanup:", errors=result["errors"][:10]
)
else:
logger.info("Cleanup skipped or completed without results")
def main():
parser = argparse.ArgumentParser(
description="Clean up old transcripts and meetings"
)
parser.add_argument(
"--days",
type=int,
default=7,
help="Number of days to keep data (default: 7)",
)
args = parser.parse_args()
if args.days < 1:
logger.error("Days must be at least 1")
sys.exit(1)
asyncio.run(cleanup_old_data(days=args.days))
if __name__ == "__main__":
main()