mirror of https://github.com/Monadical-SAS/reflector.git synced 2026-02-06 02:36:47 +00:00

Files

Igor Loskutov d9dfa17b5a docs: Add presence race condition design document

Comprehensive analysis of the race condition where users in the same
Reflector room can end up in different Daily.co rooms.

Contents:
- Problem statement and symptoms
- Evidence from Hypothesis simulation
- Current system analysis with code references
- Detailed race condition timeline
- Why current mitigations (Daily API, fallback) are insufficient
- Three solution options with trade-offs
- Recommended approach: Track "intent to join" via /joining endpoint
- Implementation checklist and file references

Key insight: The race is a data model gap, not a timing issue. Backend
needs explicit knowledge of joining users before Daily presence API
sees them.

2026-02-05 18:49:06 -05:00

17 KiB

Raw Blame History

Presence System Race Condition: Design Document

Executive Summary

Users in the same Reflector room can end up in different Daily.co rooms due to race conditions in meeting lifecycle management. This document details the root cause, why current mitigations are insufficient, and proposes a solution that eliminates the race by design.

Problem Statement

When a user quickly leaves and rejoins a meeting (e.g., closes tab and reopens within seconds), they may find themselves in a different Daily.co room than other participants in the same Reflector room. This breaks the core assumption that all users in a Reflector room share the same video call.

Symptoms

User A and User B are in the same Reflector room but different Daily.co rooms
User reports "I can't see/hear the other participant"
Meeting appears active but users are isolated

Evidence: Hypothesis Simulation

A simulation was built to model the presence system and find race conditions through randomized action sequences.

Location: server/tests/simulation/

cd server

# Current system config - finds race conditions
uv run pytest tests/simulation/test_presence_race.py::test_presence_race_conditions_current_system
# Result: XFAIL (expected failure - race found)

# Fixed system config - no race conditions
uv run pytest tests/simulation/test_presence_race.py::test_presence_no_race_conditions_fixed_system
# Result: PASS

The simulation models:

Discrete time clock for deterministic replay
Daily.co rooms, participants, presence API with configurable lag
Reflector meetings, sessions, webhooks
User state machine: idle → joining → handshaking → connected → leaving → idle
Background tasks: poll_daily_room_presence, process_meetings

Key Finding

The simulation proves that even with the Daily API call, a race window exists during WebRTC handshake when users are invisible to the presence API.

Current System Analysis

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Frontend  │────▶│   Backend   │────▶│  Daily.co   │
│  (Next.js)  │     │  (FastAPI)  │     │    API      │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  Database   │
                    │  (Sessions) │
                    └─────────────┘

Relevant Code Paths

1. Meeting Join Flow

File: server/reflector/views/rooms.py
Endpoint: POST /rooms/{room_name}/meeting
Returns existing active meeting or creates new one
User then connects to Daily.co via WebRTC (frontend)

2. Presence Polling

File: server/reflector/worker/process.py:642
Function: poll_daily_room_presence()
Called by webhooks (participant.joined, participant.left) and /joined, /leave endpoints
Queries Daily API for current participants
Updates daily_participant_sessions table in database

3. Meeting Deactivation

File: server/reflector/worker/process.py:754
Function: process_meetings()
Runs periodically (every 60s via Celery beat)
Checks if meetings should be deactivated

Current implementation (lines 806-833):

if meeting.platform == "daily":
    try:
        presence = await client.get_room_presence(meeting.room_name)
        has_active_sessions = presence.total_count > 0
        # ...
    except Exception:
        logger_.warning("Daily.co presence API failed, falling back to DB sessions")
        room_sessions = await client.get_room_sessions(meeting.room_name)
        has_active_sessions = bool(
            room_sessions and any(s.ended_at is None for s in room_sessions)
        )

Key observation: The code already uses the Daily API (get_room_presence), not just the database. The race condition persists despite this.

Endpoints from feature-leave-endpoint Branch

The feature-leave-endpoint branch added explicit leave/join notifications:

Endpoint	Purpose	Trigger
`POST /rooms/{room_name}/meetings/{meeting_id}/join`	Get meeting info	User navigates to room
`POST /rooms/{room_name}/meetings/{meeting_id}/joined`	Signal connection complete	After WebRTC connects
`POST /rooms/{room_name}/meetings/{meeting_id}/leave`	Signal user leaving	Tab close via sendBeacon

These endpoints trigger poll_daily_room_presence_task to update session state faster than waiting for webhooks.

Race Condition: Detailed Analysis

The Fundamental Problem

The backend has no knowledge of users who are in the process of joining (WebRTC handshake phase).

Data sources available to backend:

Source	What it knows	Limitation
Daily Presence API	Currently connected users	0-500ms lag; doesn't see handshaking users
Database sessions	Historical join/leave events	Stale; updated by polls
Webhooks	Join/leave events	Delayed; can fail

Gap: No source knows about users between "decided to join" and "WebRTC handshake complete".

Race Scenario Timeline

T+0ms:    User A connected to Meeting M1, visible in Daily presence
T+1000ms: User A closes browser tab
T+1050ms: participant.left webhook fires → poll_daily_room_presence queued
T+1500ms: User A reopens tab (quick rejoin)
T+1600ms: POST /meeting returns M1 (still active)
T+1700ms: Frontend starts WebRTC handshake
T+2000ms: User A in handshake - NOT visible to Daily presence API
T+2100ms: poll runs → sees 0 participants → marks session as left_at
T+3000ms: process_meetings runs
T+3100ms: Daily API returns 0 participants (user still handshaking)
T+3200ms: has_active_sessions=False, has_had_sessions=True
T+3300ms: Meeting deactivated, Daily room deleted
T+4000ms: User A WebRTC completes → Daily room is gone!
T+5000ms: User B joins same Reflector room → new Meeting M2 created

RESULT: User A orphaned, User B in different Daily room

Why Current Mitigations Are Insufficient

1. Using Daily API (already implemented)

The code already calls get_room_presence() instead of relying solely on database sessions. This doesn't help because the Daily presence API itself doesn't see users during WebRTC handshake (0-500ms consistency lag + handshake duration of 500-3000ms).

2. Fallback to Database

When Daily API fails, the code falls back to database sessions. This is worse because database is even more stale than the API.

3. Leave/Join Endpoints

The /joined and /leave endpoints trigger immediate polls, reducing the window but not eliminating it. The poll still only sees what Daily presence API reports.

Proposed Solutions

Option A: Grace Period (Not Recommended)

Add a time-based buffer before deactivation.

GRACE_PERIOD_SECONDS = 10

if not has_active_sessions and has_had_sessions:
    recent_activity = await get_recent_activity(meeting_id, within_seconds=GRACE_PERIOD_SECONDS)
    if recent_activity:
        continue  # Skip deactivation

Pros:

Simple to implement
Low risk

Cons:

Arbitrary timeout value (why 10s? why not 5s or 30s?)
Feels like a hack ("setTimeout solution")
Delays legitimate deactivation
Doesn't eliminate race, just makes it less likely

Option B: Track "Intent to Join" (Recommended)

Add explicit state tracking for users who are in the process of joining.

New endpoint: POST /rooms/{room_name}/meetings/{meeting_id}/joining

Flow change:

Current:
1. POST /join → get meeting info
2. Render Daily iframe (start WebRTC)
3. POST /joined (after connected)

Proposed:
1. POST /join → get meeting info
2. POST /joining → "I'm about to connect" ← NEW (wait for 200 OK)
3. Render Daily iframe (start WebRTC)
4. POST /joined (after connected)

Backend tracking:

# On /joining endpoint
await pending_joins.create(meeting_id=meeting_id, user_id=user_id, created_at=now())

# In process_meetings
pending = await pending_joins.get_recent(meeting_id, max_age_seconds=30)
if pending:
    logger.info("Meeting has pending joins, skipping deactivation")
    continue

# On /joined endpoint or timeout
await pending_joins.delete(meeting_id=meeting_id, user_id=user_id)

Pros:

Eliminates race by design (backend knows before Daily does)
Explicit state machine, not time-based guessing
Clear semantics

Cons:

Adds ~50-200ms latency (one round-trip before iframe renders)
Requires frontend changes
Needs cleanup mechanism for abandoned joins (user closes tab during handshake)

Option C: Optimistic Locking with Version

Track meeting "version" that must match for deactivation.

Concept: Each join attempt increments a version. Deactivation only proceeds if version hasn't changed since presence check.

Cons:

Complex to implement correctly
Still has edge cases with concurrent joins

Recommended Approach: Option B

Track "Intent to Join" is the cleanest solution because it:

Eliminates the race by design - no timing windows
Makes state explicit - joining/connected/leaving are tracked, not inferred
Aligns with existing patterns - similar to /joined and /leave endpoints
No arbitrary timeouts - unlike grace period

Data Model Change

Add tracking for pending joins. Options:

Storage	Pros	Cons
Redis key	Fast, auto-expire	Lost on Redis restart
Database table	Persistent, queryable	Slightly slower
In-memory	Fastest	Lost on server restart

Recommendation: Redis with TTL (30s expiry) for simplicity. Pending joins are ephemeral - if Redis restarts, worst case is a brief deactivation delay.

# Redis key format
pending_join:{meeting_id}:{user_id} = {timestamp}
# TTL: 30 seconds

Implementation Checklist

Backend: Add /joining endpoint
- File: server/reflector/views/rooms.py
- Creates Redis key with 30s TTL
- Returns 200 OK
Backend: Modify process_meetings()
- File: server/reflector/worker/process.py
- Before deactivation, check for pending joins
- If any exist, skip deactivation
Backend: Modify /joined endpoint
- Clear pending join on successful connection
Frontend: Call /joining before WebRTC
- File: www/app/[roomName]/components/DailyRoom.tsx
- Await response before rendering Daily iframe
Update simulation
- Add joining state tracking to match new design
- Verify race condition is eliminated
Integration tests
- Test quick rejoin scenario
- Test abandoned join (user closes during handshake)
- Test concurrent joins from multiple users

Files Reference

Core Files to Modify

File	Purpose
`server/reflector/views/rooms.py`	Add `/joining` endpoint
`server/reflector/worker/process.py`	Check pending joins before deactivation
`www/app/[roomName]/components/DailyRoom.tsx`	Call `/joining` before WebRTC

Reference Files

File	Contains
`server/reflector/video_platforms/daily.py:128`	`get_room_presence()` - Daily API call
`server/reflector/worker/process.py:642`	`poll_daily_room_presence()` - presence polling
`server/reflector/views/daily.py:125`	Webhook handlers
`server/tests/simulation/`	Hypothesis simulation proving the race
`server/tests/test_daily_presence_deactivation.py`	Unit tests for presence logic

Simulation Files

File	Purpose
`tests/simulation/system.py`	Main simulation engine
`tests/simulation/config.py`	Current vs fixed system configs
`tests/simulation/state.py`	State dataclasses
`tests/simulation/test_presence_race.py`	Hypothesis stateful tests
`tests/simulation/test_targeted_scenarios.py`	Specific race scenarios
`server/reflector/presence/model.py`	Shared state machine model

Alternative Considered: Remove DB Fallback

One simpler change discussed: remove the database fallback when Daily API fails, and "fail loudly" instead.

# Current (with fallback)
try:
    presence = await client.get_room_presence(meeting.room_name)
    has_active_sessions = presence.total_count > 0
except Exception:
    # Fallback to stale DB
    room_sessions = await client.get_room_sessions(meeting.room_name)
    has_active_sessions = bool(room_sessions and any(s.ended_at is None for s in room_sessions))

# Proposed (fail loudly)
try:
    presence = await client.get_room_presence(meeting.room_name)
    has_active_sessions = presence.total_count > 0
except Exception:
    logger.error("Daily API failed, skipping deactivation check for this meeting")
    continue  # Don't deactivate if we can't verify

This helps but doesn't eliminate the race - it only removes one failure mode (stale DB). The core race (handshake invisibility) remains.

Conclusion

The presence system race condition is a data model gap, not a timing issue that can be solved with grace periods. The backend needs explicit knowledge of users who intend to join, before they become visible to the Daily presence API.

The recommended fix is to add a /joining endpoint that the frontend calls before starting WebRTC. This creates a "reservation" that prevents premature meeting deactivation during the handshake window.

This approach:

Eliminates the race by design
Adds minimal latency (~50-200ms)
Follows explicit state machine principles
Avoids arbitrary timeout hacks

Appendix: Simulation Test Results

$ uv run pytest tests/simulation/ -v

tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_uses_model_states PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_respects_transitions PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_invalid_transitions_checked PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_implements_protocols PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_uses_shared_invariants PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_has_all_states PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_valid_transitions PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_invalid_transitions_raise PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_transitions PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_rejects_invalid PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_tracks_history PASSED
tests/simulation/test_model_conformance.py::TestInvariantConsistency::test_invariants_same_between_model_and_simulation PASSED
tests/simulation/test_model_conformance.py::test_quick_conformance_check PASSED
tests/simulation/test_presence_race.py::TestPresenceRaceFixed::runTest PASSED
tests/simulation/test_presence_race.py::test_presence_race_conditions_current_system XFAIL
tests/simulation/test_presence_race.py::test_presence_no_race_conditions_fixed_system PASSED
tests/simulation/test_presence_race.py::test_smoke_presence_simulation PASSED
tests/simulation/test_targeted_scenarios.py::TestQuickRejoinRace::test_quick_rejoin_causes_split PASSED
tests/simulation/test_targeted_scenarios.py::TestQuickRejoinRace::test_quick_rejoin_fixed_system PASSED
tests/simulation/test_targeted_scenarios.py::TestSimultaneousJoins::test_two_users_join_simultaneously PASSED
tests/simulation/test_targeted_scenarios.py::TestProcessMeetingsRace::test_process_meetings_during_handshake PASSED
tests/simulation/test_targeted_scenarios.py::TestPresenceLagRace::test_presence_lag_causes_incorrect_count PASSED
tests/simulation/test_targeted_scenarios.py::TestMeetingDeactivationEdgeCases::test_deactivation_with_no_sessions PASSED
tests/simulation/test_targeted_scenarios.py::TestMeetingDeactivationEdgeCases::test_deactivation_requires_had_sessions PASSED
tests/simulation/test_targeted_scenarios.py::TestEventLogTracing::test_event_log_captures_flow PASSED
tests/simulation/test_targeted_scenarios.py::test_config_presets PASSED

================== 25 passed, 1 xfailed ==================

The xfail test (test_presence_race_conditions_current_system) demonstrates that the current system configuration has race conditions that can be found through randomized testing.

17 KiB Raw Blame History

Presence System Race Condition: Design Document

Executive Summary

Problem Statement

Symptoms

Evidence: Hypothesis Simulation

Key Finding

Current System Analysis

Architecture Overview

Relevant Code Paths

1. Meeting Join Flow

2. Presence Polling

3. Meeting Deactivation

Endpoints from feature-leave-endpoint Branch

Race Condition: Detailed Analysis

The Fundamental Problem

Race Scenario Timeline

Why Current Mitigations Are Insufficient

1. Using Daily API (already implemented)

2. Fallback to Database

3. Leave/Join Endpoints

Proposed Solutions

Option A: Grace Period (Not Recommended)

Option B: Track "Intent to Join" (Recommended)

Option C: Optimistic Locking with Version

Recommended Approach: Option B

Data Model Change

Implementation Checklist

Files Reference

Core Files to Modify

Reference Files

Simulation Files

Alternative Considered: Remove DB Fallback

Conclusion

Appendix: Simulation Test Results

17 KiB

Raw Blame History