mirror of
https://github.com/Monadical-SAS/reflector.git
synced 2026-02-06 02:36:47 +00:00
docs: Add presence race condition design document
Comprehensive analysis of the race condition where users in the same Reflector room can end up in different Daily.co rooms. Contents: - Problem statement and symptoms - Evidence from Hypothesis simulation - Current system analysis with code references - Detailed race condition timeline - Why current mitigations (Daily API, fallback) are insufficient - Three solution options with trade-offs - Recommended approach: Track "intent to join" via /joining endpoint - Implementation checklist and file references Key insight: The race is a data model gap, not a timing issue. Backend needs explicit knowledge of joining users before Daily presence API sees them.
This commit is contained in:
414
PRESENCE_RACE_DESIGN_DOC.md
Normal file
414
PRESENCE_RACE_DESIGN_DOC.md
Normal file
@@ -0,0 +1,414 @@
|
|||||||
|
# Presence System Race Condition: Design Document
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Users in the same Reflector room can end up in **different Daily.co rooms** due to race conditions in meeting lifecycle management. This document details the root cause, why current mitigations are insufficient, and proposes a solution that eliminates the race by design.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem Statement
|
||||||
|
|
||||||
|
When a user quickly leaves and rejoins a meeting (e.g., closes tab and reopens within seconds), they may find themselves in a different Daily.co room than other participants in the same Reflector room. This breaks the core assumption that all users in a Reflector room share the same video call.
|
||||||
|
|
||||||
|
### Symptoms
|
||||||
|
- User A and User B are in the same Reflector room but different Daily.co rooms
|
||||||
|
- User reports "I can't see/hear the other participant"
|
||||||
|
- Meeting appears active but users are isolated
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Evidence: Hypothesis Simulation
|
||||||
|
|
||||||
|
A simulation was built to model the presence system and find race conditions through randomized action sequences.
|
||||||
|
|
||||||
|
**Location**: `server/tests/simulation/`
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd server
|
||||||
|
|
||||||
|
# Current system config - finds race conditions
|
||||||
|
uv run pytest tests/simulation/test_presence_race.py::test_presence_race_conditions_current_system
|
||||||
|
# Result: XFAIL (expected failure - race found)
|
||||||
|
|
||||||
|
# Fixed system config - no race conditions
|
||||||
|
uv run pytest tests/simulation/test_presence_race.py::test_presence_no_race_conditions_fixed_system
|
||||||
|
# Result: PASS
|
||||||
|
```
|
||||||
|
|
||||||
|
The simulation models:
|
||||||
|
- Discrete time clock for deterministic replay
|
||||||
|
- Daily.co rooms, participants, presence API with configurable lag
|
||||||
|
- Reflector meetings, sessions, webhooks
|
||||||
|
- User state machine: `idle → joining → handshaking → connected → leaving → idle`
|
||||||
|
- Background tasks: `poll_daily_room_presence`, `process_meetings`
|
||||||
|
|
||||||
|
### Key Finding
|
||||||
|
|
||||||
|
The simulation proves that **even with the Daily API call**, a race window exists during WebRTC handshake when users are invisible to the presence API.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Current System Analysis
|
||||||
|
|
||||||
|
### Architecture Overview
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
|
||||||
|
│ Frontend │────▶│ Backend │────▶│ Daily.co │
|
||||||
|
│ (Next.js) │ │ (FastAPI) │ │ API │
|
||||||
|
└─────────────┘ └─────────────┘ └─────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────┐
|
||||||
|
│ Database │
|
||||||
|
│ (Sessions) │
|
||||||
|
└─────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### Relevant Code Paths
|
||||||
|
|
||||||
|
#### 1. Meeting Join Flow
|
||||||
|
- **File**: `server/reflector/views/rooms.py`
|
||||||
|
- **Endpoint**: `POST /rooms/{room_name}/meeting`
|
||||||
|
- Returns existing active meeting or creates new one
|
||||||
|
- User then connects to Daily.co via WebRTC (frontend)
|
||||||
|
|
||||||
|
#### 2. Presence Polling
|
||||||
|
- **File**: `server/reflector/worker/process.py:642`
|
||||||
|
- **Function**: `poll_daily_room_presence()`
|
||||||
|
- Called by webhooks (`participant.joined`, `participant.left`) and `/joined`, `/leave` endpoints
|
||||||
|
- Queries Daily API for current participants
|
||||||
|
- Updates `daily_participant_sessions` table in database
|
||||||
|
|
||||||
|
#### 3. Meeting Deactivation
|
||||||
|
- **File**: `server/reflector/worker/process.py:754`
|
||||||
|
- **Function**: `process_meetings()`
|
||||||
|
- Runs periodically (every 60s via Celery beat)
|
||||||
|
- Checks if meetings should be deactivated
|
||||||
|
|
||||||
|
**Current implementation** (lines 806-833):
|
||||||
|
```python
|
||||||
|
if meeting.platform == "daily":
|
||||||
|
try:
|
||||||
|
presence = await client.get_room_presence(meeting.room_name)
|
||||||
|
has_active_sessions = presence.total_count > 0
|
||||||
|
# ...
|
||||||
|
except Exception:
|
||||||
|
logger_.warning("Daily.co presence API failed, falling back to DB sessions")
|
||||||
|
room_sessions = await client.get_room_sessions(meeting.room_name)
|
||||||
|
has_active_sessions = bool(
|
||||||
|
room_sessions and any(s.ended_at is None for s in room_sessions)
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Key observation**: The code already uses the Daily API (`get_room_presence`), not just the database. The race condition persists despite this.
|
||||||
|
|
||||||
|
### Endpoints from feature-leave-endpoint Branch
|
||||||
|
|
||||||
|
The `feature-leave-endpoint` branch added explicit leave/join notifications:
|
||||||
|
|
||||||
|
| Endpoint | Purpose | Trigger |
|
||||||
|
|----------|---------|---------|
|
||||||
|
| `POST /rooms/{room_name}/meetings/{meeting_id}/join` | Get meeting info | User navigates to room |
|
||||||
|
| `POST /rooms/{room_name}/meetings/{meeting_id}/joined` | Signal connection complete | After WebRTC connects |
|
||||||
|
| `POST /rooms/{room_name}/meetings/{meeting_id}/leave` | Signal user leaving | Tab close via sendBeacon |
|
||||||
|
|
||||||
|
These endpoints trigger `poll_daily_room_presence_task` to update session state faster than waiting for webhooks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Race Condition: Detailed Analysis
|
||||||
|
|
||||||
|
### The Fundamental Problem
|
||||||
|
|
||||||
|
**The backend has no knowledge of users who are in the process of joining (WebRTC handshake phase).**
|
||||||
|
|
||||||
|
Data sources available to backend:
|
||||||
|
| Source | What it knows | Limitation |
|
||||||
|
|--------|---------------|------------|
|
||||||
|
| Daily Presence API | Currently connected users | 0-500ms lag; doesn't see handshaking users |
|
||||||
|
| Database sessions | Historical join/leave events | Stale; updated by polls |
|
||||||
|
| Webhooks | Join/leave events | Delayed; can fail |
|
||||||
|
|
||||||
|
**Gap**: No source knows about users between "decided to join" and "WebRTC handshake complete".
|
||||||
|
|
||||||
|
### Race Scenario Timeline
|
||||||
|
|
||||||
|
```
|
||||||
|
T+0ms: User A connected to Meeting M1, visible in Daily presence
|
||||||
|
T+1000ms: User A closes browser tab
|
||||||
|
T+1050ms: participant.left webhook fires → poll_daily_room_presence queued
|
||||||
|
T+1500ms: User A reopens tab (quick rejoin)
|
||||||
|
T+1600ms: POST /meeting returns M1 (still active)
|
||||||
|
T+1700ms: Frontend starts WebRTC handshake
|
||||||
|
T+2000ms: User A in handshake - NOT visible to Daily presence API
|
||||||
|
T+2100ms: poll runs → sees 0 participants → marks session as left_at
|
||||||
|
T+3000ms: process_meetings runs
|
||||||
|
T+3100ms: Daily API returns 0 participants (user still handshaking)
|
||||||
|
T+3200ms: has_active_sessions=False, has_had_sessions=True
|
||||||
|
T+3300ms: Meeting deactivated, Daily room deleted
|
||||||
|
T+4000ms: User A WebRTC completes → Daily room is gone!
|
||||||
|
T+5000ms: User B joins same Reflector room → new Meeting M2 created
|
||||||
|
|
||||||
|
RESULT: User A orphaned, User B in different Daily room
|
||||||
|
```
|
||||||
|
|
||||||
|
### Why Current Mitigations Are Insufficient
|
||||||
|
|
||||||
|
#### 1. Using Daily API (already implemented)
|
||||||
|
The code already calls `get_room_presence()` instead of relying solely on database sessions. **This doesn't help** because the Daily presence API itself doesn't see users during WebRTC handshake (0-500ms consistency lag + handshake duration of 500-3000ms).
|
||||||
|
|
||||||
|
#### 2. Fallback to Database
|
||||||
|
When Daily API fails, the code falls back to database sessions. This is **worse** because database is even more stale than the API.
|
||||||
|
|
||||||
|
#### 3. Leave/Join Endpoints
|
||||||
|
The `/joined` and `/leave` endpoints trigger immediate polls, reducing the window but **not eliminating it**. The poll still only sees what Daily presence API reports.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Proposed Solutions
|
||||||
|
|
||||||
|
### Option A: Grace Period (Not Recommended)
|
||||||
|
|
||||||
|
Add a time-based buffer before deactivation.
|
||||||
|
|
||||||
|
```python
|
||||||
|
GRACE_PERIOD_SECONDS = 10
|
||||||
|
|
||||||
|
if not has_active_sessions and has_had_sessions:
|
||||||
|
recent_activity = await get_recent_activity(meeting_id, within_seconds=GRACE_PERIOD_SECONDS)
|
||||||
|
if recent_activity:
|
||||||
|
continue # Skip deactivation
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Simple to implement
|
||||||
|
- Low risk
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Arbitrary timeout value (why 10s? why not 5s or 30s?)
|
||||||
|
- Feels like a hack ("setTimeout solution")
|
||||||
|
- Delays legitimate deactivation
|
||||||
|
- Doesn't eliminate race, just makes it less likely
|
||||||
|
|
||||||
|
### Option B: Track "Intent to Join" (Recommended)
|
||||||
|
|
||||||
|
Add explicit state tracking for users who are in the process of joining.
|
||||||
|
|
||||||
|
**New endpoint**: `POST /rooms/{room_name}/meetings/{meeting_id}/joining`
|
||||||
|
|
||||||
|
**Flow change**:
|
||||||
|
```
|
||||||
|
Current:
|
||||||
|
1. POST /join → get meeting info
|
||||||
|
2. Render Daily iframe (start WebRTC)
|
||||||
|
3. POST /joined (after connected)
|
||||||
|
|
||||||
|
Proposed:
|
||||||
|
1. POST /join → get meeting info
|
||||||
|
2. POST /joining → "I'm about to connect" ← NEW (wait for 200 OK)
|
||||||
|
3. Render Daily iframe (start WebRTC)
|
||||||
|
4. POST /joined (after connected)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Backend tracking**:
|
||||||
|
```python
|
||||||
|
# On /joining endpoint
|
||||||
|
await pending_joins.create(meeting_id=meeting_id, user_id=user_id, created_at=now())
|
||||||
|
|
||||||
|
# In process_meetings
|
||||||
|
pending = await pending_joins.get_recent(meeting_id, max_age_seconds=30)
|
||||||
|
if pending:
|
||||||
|
logger.info("Meeting has pending joins, skipping deactivation")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# On /joined endpoint or timeout
|
||||||
|
await pending_joins.delete(meeting_id=meeting_id, user_id=user_id)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
- Eliminates race by design (backend knows before Daily does)
|
||||||
|
- Explicit state machine, not time-based guessing
|
||||||
|
- Clear semantics
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Adds ~50-200ms latency (one round-trip before iframe renders)
|
||||||
|
- Requires frontend changes
|
||||||
|
- Needs cleanup mechanism for abandoned joins (user closes tab during handshake)
|
||||||
|
|
||||||
|
### Option C: Optimistic Locking with Version
|
||||||
|
|
||||||
|
Track meeting "version" that must match for deactivation.
|
||||||
|
|
||||||
|
**Concept**: Each join attempt increments a version. Deactivation only proceeds if version hasn't changed since presence check.
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
- Complex to implement correctly
|
||||||
|
- Still has edge cases with concurrent joins
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended Approach: Option B
|
||||||
|
|
||||||
|
**Track "Intent to Join"** is the cleanest solution because it:
|
||||||
|
|
||||||
|
1. **Eliminates the race by design** - no timing windows
|
||||||
|
2. **Makes state explicit** - joining/connected/leaving are tracked, not inferred
|
||||||
|
3. **Aligns with existing patterns** - similar to `/joined` and `/leave` endpoints
|
||||||
|
4. **No arbitrary timeouts** - unlike grace period
|
||||||
|
|
||||||
|
### Data Model Change
|
||||||
|
|
||||||
|
Add tracking for pending joins. Options:
|
||||||
|
|
||||||
|
| Storage | Pros | Cons |
|
||||||
|
|---------|------|------|
|
||||||
|
| Redis key | Fast, auto-expire | Lost on Redis restart |
|
||||||
|
| Database table | Persistent, queryable | Slightly slower |
|
||||||
|
| In-memory | Fastest | Lost on server restart |
|
||||||
|
|
||||||
|
**Recommendation**: Redis with TTL (30s expiry) for simplicity. Pending joins are ephemeral - if Redis restarts, worst case is a brief deactivation delay.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Redis key format
|
||||||
|
pending_join:{meeting_id}:{user_id} = {timestamp}
|
||||||
|
# TTL: 30 seconds
|
||||||
|
```
|
||||||
|
|
||||||
|
### Implementation Checklist
|
||||||
|
|
||||||
|
1. **Backend: Add `/joining` endpoint**
|
||||||
|
- File: `server/reflector/views/rooms.py`
|
||||||
|
- Creates Redis key with 30s TTL
|
||||||
|
- Returns 200 OK
|
||||||
|
|
||||||
|
2. **Backend: Modify `process_meetings()`**
|
||||||
|
- File: `server/reflector/worker/process.py`
|
||||||
|
- Before deactivation, check for pending joins
|
||||||
|
- If any exist, skip deactivation
|
||||||
|
|
||||||
|
3. **Backend: Modify `/joined` endpoint**
|
||||||
|
- Clear pending join on successful connection
|
||||||
|
|
||||||
|
4. **Frontend: Call `/joining` before WebRTC**
|
||||||
|
- File: `www/app/[roomName]/components/DailyRoom.tsx`
|
||||||
|
- Await response before rendering Daily iframe
|
||||||
|
|
||||||
|
5. **Update simulation**
|
||||||
|
- Add `joining` state tracking to match new design
|
||||||
|
- Verify race condition is eliminated
|
||||||
|
|
||||||
|
6. **Integration tests**
|
||||||
|
- Test quick rejoin scenario
|
||||||
|
- Test abandoned join (user closes during handshake)
|
||||||
|
- Test concurrent joins from multiple users
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Files Reference
|
||||||
|
|
||||||
|
### Core Files to Modify
|
||||||
|
| File | Purpose |
|
||||||
|
|------|---------|
|
||||||
|
| `server/reflector/views/rooms.py` | Add `/joining` endpoint |
|
||||||
|
| `server/reflector/worker/process.py` | Check pending joins before deactivation |
|
||||||
|
| `www/app/[roomName]/components/DailyRoom.tsx` | Call `/joining` before WebRTC |
|
||||||
|
|
||||||
|
### Reference Files
|
||||||
|
| File | Contains |
|
||||||
|
|------|----------|
|
||||||
|
| `server/reflector/video_platforms/daily.py:128` | `get_room_presence()` - Daily API call |
|
||||||
|
| `server/reflector/worker/process.py:642` | `poll_daily_room_presence()` - presence polling |
|
||||||
|
| `server/reflector/views/daily.py:125` | Webhook handlers |
|
||||||
|
| `server/tests/simulation/` | Hypothesis simulation proving the race |
|
||||||
|
| `server/tests/test_daily_presence_deactivation.py` | Unit tests for presence logic |
|
||||||
|
|
||||||
|
### Simulation Files
|
||||||
|
| File | Purpose |
|
||||||
|
|------|---------|
|
||||||
|
| `tests/simulation/system.py` | Main simulation engine |
|
||||||
|
| `tests/simulation/config.py` | Current vs fixed system configs |
|
||||||
|
| `tests/simulation/state.py` | State dataclasses |
|
||||||
|
| `tests/simulation/test_presence_race.py` | Hypothesis stateful tests |
|
||||||
|
| `tests/simulation/test_targeted_scenarios.py` | Specific race scenarios |
|
||||||
|
| `server/reflector/presence/model.py` | Shared state machine model |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Alternative Considered: Remove DB Fallback
|
||||||
|
|
||||||
|
One simpler change discussed: remove the database fallback when Daily API fails, and "fail loudly" instead.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Current (with fallback)
|
||||||
|
try:
|
||||||
|
presence = await client.get_room_presence(meeting.room_name)
|
||||||
|
has_active_sessions = presence.total_count > 0
|
||||||
|
except Exception:
|
||||||
|
# Fallback to stale DB
|
||||||
|
room_sessions = await client.get_room_sessions(meeting.room_name)
|
||||||
|
has_active_sessions = bool(room_sessions and any(s.ended_at is None for s in room_sessions))
|
||||||
|
|
||||||
|
# Proposed (fail loudly)
|
||||||
|
try:
|
||||||
|
presence = await client.get_room_presence(meeting.room_name)
|
||||||
|
has_active_sessions = presence.total_count > 0
|
||||||
|
except Exception:
|
||||||
|
logger.error("Daily API failed, skipping deactivation check for this meeting")
|
||||||
|
continue # Don't deactivate if we can't verify
|
||||||
|
```
|
||||||
|
|
||||||
|
**This helps but doesn't eliminate the race** - it only removes one failure mode (stale DB). The core race (handshake invisibility) remains.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
The presence system race condition is a **data model gap**, not a timing issue that can be solved with grace periods. The backend needs explicit knowledge of users who intend to join, before they become visible to the Daily presence API.
|
||||||
|
|
||||||
|
The recommended fix is to add a `/joining` endpoint that the frontend calls before starting WebRTC. This creates a "reservation" that prevents premature meeting deactivation during the handshake window.
|
||||||
|
|
||||||
|
This approach:
|
||||||
|
- Eliminates the race by design
|
||||||
|
- Adds minimal latency (~50-200ms)
|
||||||
|
- Follows explicit state machine principles
|
||||||
|
- Avoids arbitrary timeout hacks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Appendix: Simulation Test Results
|
||||||
|
|
||||||
|
```
|
||||||
|
$ uv run pytest tests/simulation/ -v
|
||||||
|
|
||||||
|
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_uses_model_states PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_respects_transitions PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_invalid_transitions_checked PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_implements_protocols PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_uses_shared_invariants PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_has_all_states PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_valid_transitions PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_invalid_transitions_raise PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_transitions PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_rejects_invalid PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_tracks_history PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::TestInvariantConsistency::test_invariants_same_between_model_and_simulation PASSED
|
||||||
|
tests/simulation/test_model_conformance.py::test_quick_conformance_check PASSED
|
||||||
|
tests/simulation/test_presence_race.py::TestPresenceRaceFixed::runTest PASSED
|
||||||
|
tests/simulation/test_presence_race.py::test_presence_race_conditions_current_system XFAIL
|
||||||
|
tests/simulation/test_presence_race.py::test_presence_no_race_conditions_fixed_system PASSED
|
||||||
|
tests/simulation/test_presence_race.py::test_smoke_presence_simulation PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestQuickRejoinRace::test_quick_rejoin_causes_split PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestQuickRejoinRace::test_quick_rejoin_fixed_system PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestSimultaneousJoins::test_two_users_join_simultaneously PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestProcessMeetingsRace::test_process_meetings_during_handshake PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestPresenceLagRace::test_presence_lag_causes_incorrect_count PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestMeetingDeactivationEdgeCases::test_deactivation_with_no_sessions PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestMeetingDeactivationEdgeCases::test_deactivation_requires_had_sessions PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::TestEventLogTracing::test_event_log_captures_flow PASSED
|
||||||
|
tests/simulation/test_targeted_scenarios.py::test_config_presets PASSED
|
||||||
|
|
||||||
|
================== 25 passed, 1 xfailed ==================
|
||||||
|
```
|
||||||
|
|
||||||
|
The `xfail` test (`test_presence_race_conditions_current_system`) demonstrates that the current system configuration has race conditions that can be found through randomized testing.
|
||||||
Reference in New Issue
Block a user