Mirror of https://github.com/Monadical-SAS/reflector.git, synced 2025-12-20 12:19:06 +00:00
* feat(cleanup): add automatic data retention for public instances
  - Add Celery task to clean up anonymous data after configurable retention period
  - Delete transcripts, meetings, and orphaned recordings older than retention days
  - Only runs when PUBLIC_MODE is enabled to prevent accidental data loss
  - Properly removes all associated files (local and S3 storage)
  - Add manual cleanup tool for testing and intervention
  - Configure retention via PUBLIC_DATA_RETENTION_DAYS setting (default: 7 days)

  Fixes #571
* fix: apply pre-commit formatting fixes
* fix: properly delete recording files from storage during cleanup
  - Add storage deletion for orphaned recordings in both cleanup task and manual tool
  - Delete from storage before removing database records
  - Log warnings if storage deletion fails but continue with database cleanup
* Apply suggestion from @pr-agent-monadical[bot]
  Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
* Apply suggestion from @pr-agent-monadical[bot]
  Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
* refactor: cleanup_old_data for better logging
* fix: linting
* test: fix meeting cleanup test to not require room controller
  - Simplify test by directly inserting meetings into database
  - Remove dependency on non-existent rooms_controller.create method
  - Tests now pass successfully
* fix: linting
* refactor: simplify cleanup tool to use worker implementation
  - Remove duplicate cleanup logic from manual tool
  - Use the same _cleanup_old_public_data function from worker
  - Remove dry-run feature as requested
  - Prevent code duplication and ensure consistency
  - Update documentation to reflect changes
* refactor: split cleanup worker into smaller functions
  - Move all imports to the top of the file
  - Extract cleanup logic into separate functions:
    - cleanup_old_transcripts()
    - cleanup_old_meetings()
    - cleanup_orphaned_recordings()
    - log_cleanup_results()
  - Make code more maintainable and testable
  - Add days parameter support to Celery task
  - Update manual tool to work with refactored code
* feat: add TypedDict typing for cleanup stats
  - Add CleanupStats TypedDict for better type safety
  - Update all function signatures to use proper typing
  - Add return type annotations to _cleanup_old_public_data
  - Improves code maintainability and IDE support
* feat: add CASCADE DELETE to meeting_consent foreign key
  - Add ondelete="CASCADE" to meeting_consent.meeting_id foreign key
  - Generate and apply migration to update existing constraint
  - Remove manual consent deletion from cleanup code
  - Add unit test to verify CASCADE DELETE behavior
* style: linting
* fix: alembic migration branchpoint
* fix: correct downgrade constraint name in CASCADE DELETE migration
* fix: regenerate CASCADE DELETE migration with proper constraint names
  - Delete problematic migration and regenerate with correct names
  - Use explicit constraint name in both upgrade and downgrade
  - Ensure migration works bidirectionally
  - All tests passing including CASCADE DELETE test
* style: linting
* refactor: simplify cleanup to use transcripts as entry point
  - Remove orphaned_recordings cleanup (not part of this PR scope)
  - Remove separate old_meetings cleanup
  - Transcripts are now the main entry point for cleanup
  - Associated meetings and recordings are deleted with their transcript
  - Use single database connection for all operations
  - Update tests to reflect new approach
* refactor: cleanup and rename functions for clarity
  - Rename _cleanup_old_public_data to cleanup_old_public_data (make public)
  - Rename celery task to cleanup_old_public_data_task for clarity
  - Update docstrings and improve code organization
  - Remove unnecessary comments and simplify deletion logic
  - Update tests to use new function names
  - All tests passing
* style: linting
* style: typing and review
* fix: add transaction on cleanup_single_transcript
* fix: naming

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
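The retention rule described in the commit message can be illustrated with a minimal sketch. The key names in `CleanupStats` are the ones the manual tool reads from the cleanup result; the field types and the helper names `retention_cutoff`, `is_expired`, and `empty_stats` are assumptions made for illustration and are not taken from `reflector.worker.cleanup`.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, TypedDict


class CleanupStats(TypedDict):
    # Key names match what the manual tool reads; the types are assumed.
    transcripts_deleted: int
    meetings_deleted: int
    recordings_deleted: int
    errors: List[str]


def empty_stats() -> CleanupStats:
    """Return a zeroed stats dictionary (hypothetical helper)."""
    return {
        "transcripts_deleted": 0,
        "meetings_deleted": 0,
        "recordings_deleted": 0,
        "errors": [],
    }


def retention_cutoff(days: int, now: Optional[datetime] = None) -> datetime:
    """Return the timestamp before which data counts as expired (hypothetical helper)."""
    if days < 1:
        raise ValueError("days must be at least 1")
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(days=days)


def is_expired(created_at: datetime, days: int, now: Optional[datetime] = None) -> bool:
    """True when a record created at `created_at` is older than the retention window."""
    return created_at < retention_cutoff(days, now)
```

Passing `now` explicitly keeps the cutoff deterministic in tests; production code would normally rely on the current UTC time.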
73 lines
1.8 KiB
Python
#!/usr/bin/env python
"""
Manual cleanup tool for old public data.
Uses the same implementation as the Celery worker task.
"""

import argparse
import asyncio
import sys

import structlog

from reflector.settings import settings
from reflector.worker.cleanup import _cleanup_old_public_data

logger = structlog.get_logger(__name__)


async def cleanup_old_data(days: int = 7):
    """Run the shared cleanup once and log a summary of the result."""
    logger.info(
        "Starting manual cleanup",
        retention_days=days,
        public_mode=settings.PUBLIC_MODE,
    )

    # Refuse to run on non-public instances to prevent accidental data loss.
    if not settings.PUBLIC_MODE:
        logger.critical(
            "WARNING: PUBLIC_MODE is False. "
            "This tool is intended for public instances only."
        )
        raise Exception("Tool intended for public instances only")

    result = await _cleanup_old_public_data(days=days)

    if result:
        logger.info(
            "Cleanup completed",
            transcripts_deleted=result.get("transcripts_deleted", 0),
            meetings_deleted=result.get("meetings_deleted", 0),
            recordings_deleted=result.get("recordings_deleted", 0),
            errors_count=len(result.get("errors", [])),
        )
        if result.get("errors"):
            # Log only the first 10 errors to keep the output readable.
            logger.warning(
                "Errors encountered during cleanup:", errors=result["errors"][:10]
            )
    else:
        logger.info("Cleanup skipped or completed without results")


def main():
    parser = argparse.ArgumentParser(
        description="Clean up old transcripts and meetings"
    )
    parser.add_argument(
        "--days",
        type=int,
        default=7,
        help="Number of days to keep data (default: 7)",
    )

    args = parser.parse_args()

    # A retention period below one day would delete data created today.
    if args.days < 1:
        logger.error("Days must be at least 1")
        sys.exit(1)

    asyncio.run(cleanup_old_data(days=args.days))


if __name__ == "__main__":
    main()
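One possible variation on the `--days` check in `main()` is to validate the value inside the parser, so that an invalid value is reported through argparse's own usage error. The sketch below is not part of the tool; `positive_int` and `build_parser` are hypothetical names introduced for illustration.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse `type` callable: parse an integer and reject values below 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer") from None
    if number < 1:
        raise argparse.ArgumentTypeError("days must be at least 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser equivalent to the tool's, with validation moved into `type`."""
    parser = argparse.ArgumentParser(
        description="Clean up old transcripts and meetings"
    )
    parser.add_argument(
        "--days",
        type=positive_int,
        default=7,
        help="Number of days to keep data (default: 7)",
    )
    return parser
```

With this approach, `--days 0` makes argparse print a usage message and exit with status 2, whereas the explicit check in `main()` logs through structlog and exits with status 1. Which behavior is preferable depends on how the tool is invoked.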