feat(cleanup): add automatic data retention for public instances (#574)

* feat(cleanup): add automatic data retention for public instances

- Add Celery task to clean up anonymous data after configurable retention period
- Delete transcripts, meetings, and orphaned recordings older than the retention period
- Only runs when PUBLIC_MODE is enabled to prevent accidental data loss
- Properly removes all associated files (local and S3 storage)
- Add manual cleanup tool for testing and intervention
- Configure retention via PUBLIC_DATA_RETENTION_DAYS setting (default: 7 days; see the sketch below)
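
As a sketch, enabling this on a deployment comes down to two environment variables (names from the settings change in this PR; values illustrative, assuming the usual pydantic-settings .env loading):

    PUBLIC_MODE=true
    PUBLIC_DATA_RETENTION_DAYS=7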

Fixes #571

* fix: apply pre-commit formatting fixes

* fix: properly delete recording files from storage during cleanup

- Add storage deletion for orphaned recordings in both cleanup task and manual tool
- Delete from storage before removing database records
- Log warnings if storage deletion fails but continue with database cleanup

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* Apply suggestion from @pr-agent-monadical[bot]

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>

* refactor: improve logging in cleanup_old_data

* fix: linting

* test: fix meeting cleanup test to not require room controller

- Simplify test by directly inserting meetings into database
- Remove dependency on non-existent rooms_controller.create method
- Tests now pass successfully

* fix: linting

* refactor: simplify cleanup tool to use worker implementation

- Remove duplicate cleanup logic from manual tool
- Use the same _cleanup_old_public_data function from worker
- Remove dry-run feature as requested
- Prevent code duplication and ensure consistency
- Update documentation to reflect changes

* refactor: split cleanup worker into smaller functions

- Move all imports to the top of the file
- Extract cleanup logic into separate functions:
  - cleanup_old_transcripts()
  - cleanup_old_meetings()
  - cleanup_orphaned_recordings()
  - log_cleanup_results()
- Make code more maintainable and testable
- Add days parameter support to Celery task
- Update manual tool to work with refactored code

* feat: add TypedDict typing for cleanup stats

- Add CleanupStats TypedDict for better type safety
- Update all function signatures to use proper typing
- Add return type annotations to _cleanup_old_public_data
- Improves code maintainability and IDE support

* feat: add CASCADE DELETE to meeting_consent foreign key

- Add ondelete="CASCADE" to meeting_consent.meeting_id foreign key
- Generate and apply migration to update existing constraint
- Remove manual consent deletion from cleanup code
- Add unit test to verify CASCADE DELETE behavior (behavior sketched below)
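
Not the project's test itself, but a self-contained sketch of the behavior the constraint guarantees, using plain SQLAlchemy with in-memory SQLite (SQLite only enforces foreign keys when asked; Postgres enforces them always):

    import sqlalchemy as sa
    from sqlalchemy import event

    metadata = sa.MetaData()
    meeting = sa.Table("meeting", metadata, sa.Column("id", sa.String, primary_key=True))
    meeting_consent = sa.Table(
        "meeting_consent",
        metadata,
        sa.Column("id", sa.String, primary_key=True),
        sa.Column(
            "meeting_id",
            sa.String,
            sa.ForeignKey("meeting.id", ondelete="CASCADE"),
            nullable=False,
        ),
    )

    engine = sa.create_engine("sqlite://")

    @event.listens_for(engine, "connect")
    def _enable_fks(dbapi_conn, _record):
        dbapi_conn.execute("PRAGMA foreign_keys=ON")  # SQLite-only quirk

    metadata.create_all(engine)

    with engine.begin() as conn:
        conn.execute(meeting.insert().values(id="m1"))
        conn.execute(meeting_consent.insert().values(id="c1", meeting_id="m1"))
        conn.execute(meeting.delete().where(meeting.c.id == "m1"))
        # the dependent consent row is gone without any manual cleanup
        assert conn.execute(sa.select(meeting_consent)).fetchall() == []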

* style: linting

* fix: alembic migration branchpoint

* fix: correct downgrade constraint name in CASCADE DELETE migration

* fix: regenerate CASCADE DELETE migration with proper constraint names

- Delete problematic migration and regenerate with correct names
- Use explicit constraint name in both upgrade and downgrade (see the sketch after this list)
- Ensure migration works bidirectionally
- All tests passing including CASCADE DELETE test
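
The regenerated migration file itself is not part of the diff below; its shape, with the constraint name spelled out in both directions, would be roughly as follows (revision ids and the constraint name are hypothetical - use the names Alembic actually generated):

    from alembic import op

    revision = "abc123"  # hypothetical
    down_revision = "def456"  # hypothetical

    CONSTRAINT = "meeting_consent_meeting_id_fkey"  # assumed Postgres-style name

    def upgrade() -> None:
        op.drop_constraint(CONSTRAINT, "meeting_consent", type_="foreignkey")
        op.create_foreign_key(
            CONSTRAINT,
            "meeting_consent",
            "meeting",
            ["meeting_id"],
            ["id"],
            ondelete="CASCADE",
        )

    def downgrade() -> None:
        op.drop_constraint(CONSTRAINT, "meeting_consent", type_="foreignkey")
        op.create_foreign_key(
            CONSTRAINT, "meeting_consent", "meeting", ["meeting_id"], ["id"]
        )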

* style: linting

* refactor: simplify cleanup to use transcripts as entry point

- Remove orphaned_recordings cleanup (not part of this PR scope)
- Remove separate old_meetings cleanup
- Transcripts are now the main entry point for cleanup
- Associated meetings and recordings are deleted with their transcript
- Use single database connection for all operations
- Update tests to reflect new approach

* refactor: cleanup and rename functions for clarity

- Rename _cleanup_old_public_data to cleanup_old_public_data (make public)
- Rename celery task to cleanup_old_public_data_task for clarity
- Update docstrings and improve code organization
- Remove unnecessary comments and simplify deletion logic
- Update tests to use new function names
- All tests passing

* style: linting

* style: typing and review

* fix: add transaction on cleanup_single_transcript

* fix: naming

---------

Co-authored-by: pr-agent-monadical[bot] <198624643+pr-agent-monadical[bot]@users.noreply.github.com>
committed by GitHub on 2025-08-29 08:47:14 -06:00
parent 9dfd76996f
commit 6f0c7c1a5e
11 changed files with 708 additions and 28 deletions


@@ -0,0 +1,27 @@
+import asyncio
+import functools
+
+from reflector.db import get_database
+
+
+def asynctask(f):
+    @functools.wraps(f)
+    def wrapper(*args, **kwargs):
+        async def run_with_db():
+            database = get_database()
+            await database.connect()
+            try:
+                return await f(*args, **kwargs)
+            finally:
+                await database.disconnect()
+
+        coro = run_with_db()
+        try:
+            loop = asyncio.get_running_loop()
+        except RuntimeError:
+            loop = None
+
+        if loop and loop.is_running():
+            return loop.run_until_complete(coro)
+        return asyncio.run(coro)
+
+    return wrapper
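
For context, the decorator is designed to sit under @shared_task so an async task body runs to completion with a connected database; a minimal usage sketch (the task name here is hypothetical):

    from celery import shared_task

    from reflector.asynctask import asynctask

    @shared_task
    @asynctask
    async def my_job():  # hypothetical task
        ...  # body runs via asyncio.run() with the database connected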


@@ -54,7 +54,12 @@ meeting_consent = sa.Table(
     "meeting_consent",
     metadata,
     sa.Column("id", sa.String, primary_key=True),
-    sa.Column("meeting_id", sa.String, sa.ForeignKey("meeting.id"), nullable=False),
+    sa.Column(
+        "meeting_id",
+        sa.String,
+        sa.ForeignKey("meeting.id", ondelete="CASCADE"),
+        nullable=False,
+    ),
     sa.Column("user_id", sa.String),
     sa.Column("consent_given", sa.Boolean, nullable=False),
     sa.Column("consent_timestamp", sa.DateTime(timezone=True), nullable=False),


@@ -13,6 +13,7 @@ import av
 import structlog
 from celery import shared_task
+from reflector.asynctask import asynctask
 from reflector.db.transcripts import (
     Transcript,
     TranscriptStatus,
@@ -21,7 +22,6 @@ from reflector.db.transcripts import (
 from reflector.logger import logger
 from reflector.pipelines.main_live_pipeline import (
     PipelineMainBase,
-    asynctask,
     broadcast_to_sockets,
 )
 from reflector.processors import (


@@ -22,7 +22,7 @@ from celery import chord, current_task, group, shared_task
 from pydantic import BaseModel
 from structlog import BoundLogger as Logger
-from reflector.db import get_database
+from reflector.asynctask import asynctask
 from reflector.db.meetings import meeting_consent_controller, meetings_controller
 from reflector.db.recordings import recordings_controller
 from reflector.db.rooms import rooms_controller
@@ -70,29 +70,6 @@ from reflector.zulip import (
 )
-
-
-def asynctask(f):
-    @functools.wraps(f)
-    def wrapper(*args, **kwargs):
-        async def run_with_db():
-            database = get_database()
-            await database.connect()
-            try:
-                return await f(*args, **kwargs)
-            finally:
-                await database.disconnect()
-
-        coro = run_with_db()
-        try:
-            loop = asyncio.get_running_loop()
-        except RuntimeError:
-            loop = None
-
-        if loop and loop.is_running():
-            return loop.run_until_complete(coro)
-        return asyncio.run(coro)
-
-    return wrapper
-
-
 def broadcast_to_sockets(func):
     """
     Decorator to broadcast transcript event to websockets


@@ -1,3 +1,4 @@
+from pydantic.types import PositiveInt
 from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -90,9 +91,8 @@ class Settings(BaseSettings):
     AUTH_JWT_PUBLIC_KEY: str | None = "authentik.monadical.com_public.pem"
     AUTH_JWT_AUDIENCE: str | None = None
 
     # API public mode
     # if set, all anonymous record will be public
     PUBLIC_MODE: bool = False
+    PUBLIC_DATA_RETENTION_DAYS: PositiveInt = 7
 
     # Min transcript length to generate topic + summary
     MIN_TRANSCRIPT_LENGTH: int = 750


@@ -0,0 +1,72 @@
+#!/usr/bin/env python
+"""
+Manual cleanup tool for old public data.
+Uses the same implementation as the Celery worker task.
+"""
+
+import argparse
+import asyncio
+import sys
+
+import structlog
+
+from reflector.settings import settings
+from reflector.worker.cleanup import cleanup_old_public_data
+
+logger = structlog.get_logger(__name__)
+
+
+async def cleanup_old_data(days: int = 7):
+    logger.info(
+        "Starting manual cleanup",
+        retention_days=days,
+        public_mode=settings.PUBLIC_MODE,
+    )
+
+    if not settings.PUBLIC_MODE:
+        logger.critical(
+            "WARNING: PUBLIC_MODE is False. "
+            "This tool is intended for public instances only."
+        )
+        raise Exception("Tool intended for public instances only")
+
+    result = await cleanup_old_public_data(days=days)
+
+    if result:
+        logger.info(
+            "Cleanup completed",
+            transcripts_deleted=result.get("transcripts_deleted", 0),
+            meetings_deleted=result.get("meetings_deleted", 0),
+            recordings_deleted=result.get("recordings_deleted", 0),
+            errors_count=len(result.get("errors", [])),
+        )
+        if result.get("errors"):
+            logger.warning(
+                "Errors encountered during cleanup:", errors=result["errors"][:10]
+            )
+    else:
+        logger.info("Cleanup skipped or completed without results")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Clean up old transcripts and meetings"
+    )
+    parser.add_argument(
+        "--days",
+        type=int,
+        default=7,
+        help="Number of days to keep data (default: 7)",
+    )
+    args = parser.parse_args()
+
+    if args.days < 1:
+        logger.error("Days must be at least 1")
+        sys.exit(1)
+
+    asyncio.run(cleanup_old_data(days=args.days))
+
+
+if __name__ == "__main__":
+    main()
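
Assuming the script is saved as cleanup.py (the diff does not show its path), a manual run looks like:

    python cleanup.py --days 3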


@@ -19,6 +19,7 @@ else:
             "reflector.pipelines.main_live_pipeline",
             "reflector.worker.healthcheck",
             "reflector.worker.process",
+            "reflector.worker.cleanup",
         ]
     )
@@ -38,6 +39,16 @@ else:
         },
     }
 
+    if settings.PUBLIC_MODE:
+        app.conf.beat_schedule["cleanup_old_public_data"] = {
+            "task": "reflector.worker.cleanup.cleanup_old_public_data_task",
+            "schedule": crontab(hour=3, minute=0),
+        }
+        logger.info(
+            "Public mode cleanup enabled",
+            retention_days=settings.PUBLIC_DATA_RETENTION_DAYS,
+        )
+
     if settings.HEALTHCHECK_URL:
         app.conf.beat_schedule["healthcheck_ping"] = {
             "task": "reflector.worker.healthcheck.healthcheck_ping",

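With Celery's default UTC timezone, crontab(hour=3, minute=0) fires the cleanup once a day at 03:00 UTC.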

@@ -0,0 +1,156 @@
+"""
+Main task for cleaning up old public data.
+
+Deletes old anonymous transcripts and their associated meetings/recordings.
+Transcripts are the main entry point - any associated data is also removed.
+"""
+
+from datetime import datetime, timedelta, timezone
+from typing import TypedDict
+
+import structlog
+from celery import shared_task
+from databases import Database
+from pydantic.types import PositiveInt
+
+from reflector.asynctask import asynctask
+from reflector.db import get_database
+from reflector.db.meetings import meetings
+from reflector.db.recordings import recordings
+from reflector.db.transcripts import transcripts, transcripts_controller
+from reflector.settings import settings
+from reflector.storage import get_recordings_storage
+
+logger = structlog.get_logger(__name__)
+
+
+class CleanupStats(TypedDict):
+    """Statistics for cleanup operation."""
+
+    transcripts_deleted: int
+    meetings_deleted: int
+    recordings_deleted: int
+    errors: list[str]
+
+
+async def delete_single_transcript(
+    db: Database, transcript_data: dict, stats: CleanupStats
+):
+    transcript_id = transcript_data["id"]
+    meeting_id = transcript_data["meeting_id"]
+    recording_id = transcript_data["recording_id"]
+
+    try:
+        async with db.transaction(isolation="serializable"):
+            if meeting_id:
+                await db.execute(meetings.delete().where(meetings.c.id == meeting_id))
+                stats["meetings_deleted"] += 1
+                logger.info("Deleted associated meeting", meeting_id=meeting_id)
+
+            if recording_id:
+                recording = await db.fetch_one(
+                    recordings.select().where(recordings.c.id == recording_id)
+                )
+                if recording:
+                    try:
+                        await get_recordings_storage().delete_file(
+                            recording["object_key"]
+                        )
+                    except Exception as storage_error:
+                        logger.warning(
+                            "Failed to delete recording from storage",
+                            recording_id=recording_id,
+                            object_key=recording["object_key"],
+                            error=str(storage_error),
+                        )
+                    await db.execute(
+                        recordings.delete().where(recordings.c.id == recording_id)
+                    )
+                    stats["recordings_deleted"] += 1
+                    logger.info(
+                        "Deleted associated recording", recording_id=recording_id
+                    )
+
+            await transcripts_controller.remove_by_id(transcript_id)
+            stats["transcripts_deleted"] += 1
+            logger.info(
+                "Deleted transcript",
+                transcript_id=transcript_id,
+                created_at=transcript_data["created_at"].isoformat(),
+            )
+    except Exception as e:
+        error_msg = f"Failed to delete transcript {transcript_id}: {str(e)}"
+        logger.error(error_msg, exc_info=e)
+        stats["errors"].append(error_msg)
+
+
+async def cleanup_old_transcripts(
+    db: Database, cutoff_date: datetime, stats: CleanupStats
+):
+    """Delete old anonymous transcripts and their associated recordings/meetings."""
+    query = transcripts.select().where(
+        (transcripts.c.created_at < cutoff_date) & (transcripts.c.user_id.is_(None))
+    )
+    old_transcripts = await db.fetch_all(query)
+    logger.info(f"Found {len(old_transcripts)} old transcripts to delete")
+
+    for transcript_data in old_transcripts:
+        await delete_single_transcript(db, transcript_data, stats)
+
+
+def log_cleanup_results(stats: CleanupStats):
+    logger.info(
+        "Cleanup completed",
+        transcripts_deleted=stats["transcripts_deleted"],
+        meetings_deleted=stats["meetings_deleted"],
+        recordings_deleted=stats["recordings_deleted"],
+        errors_count=len(stats["errors"]),
+    )
+    if stats["errors"]:
+        logger.warning(
+            "Cleanup completed with errors",
+            errors=stats["errors"][:10],
+        )
+
+
+async def cleanup_old_public_data(
+    days: PositiveInt | None = None,
+) -> CleanupStats | None:
+    if days is None:
+        days = settings.PUBLIC_DATA_RETENTION_DAYS
+
+    if not settings.PUBLIC_MODE:
+        logger.info("Skipping cleanup - not a public instance")
+        return None
+
+    cutoff_date = datetime.now(timezone.utc) - timedelta(days=days)
+    logger.info(
+        "Starting cleanup of old public data",
+        cutoff_date=cutoff_date.isoformat(),
+    )
+
+    stats: CleanupStats = {
+        "transcripts_deleted": 0,
+        "meetings_deleted": 0,
+        "recordings_deleted": 0,
+        "errors": [],
+    }
+
+    db = get_database()
+    await cleanup_old_transcripts(db, cutoff_date, stats)
+
+    log_cleanup_results(stats)
+    return stats
+
+
+@shared_task(
+    autoretry_for=(Exception,),
+    retry_kwargs={"max_retries": 3, "countdown": 300},
+)
+@asynctask
+async def cleanup_old_public_data_task(days: int | None = None):
+    return await cleanup_old_public_data(days=days)
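
Once the worker is running, the task can also be kicked off ad hoc with Celery's standard enqueue API - a short sketch:

    from reflector.worker.cleanup import cleanup_old_public_data_task

    # queue a one-off cleanup with a custom retention window
    cleanup_old_public_data_task.delay(days=3)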