Files
reflector/PRESENCE_RACE_DESIGN_DOC.md
Igor Loskutov d9dfa17b5a docs: Add presence race condition design document
Comprehensive analysis of the race condition where users in the same
Reflector room can end up in different Daily.co rooms.

Contents:
- Problem statement and symptoms
- Evidence from Hypothesis simulation
- Current system analysis with code references
- Detailed race condition timeline
- Why current mitigations (Daily API, fallback) are insufficient
- Three solution options with trade-offs
- Recommended approach: Track "intent to join" via /joining endpoint
- Implementation checklist and file references

Key insight: The race is a data model gap, not a timing issue. Backend
needs explicit knowledge of joining users before Daily presence API
sees them.
2026-02-05 18:49:06 -05:00

17 KiB

Presence System Race Condition: Design Document

Executive Summary

Users in the same Reflector room can end up in different Daily.co rooms due to race conditions in meeting lifecycle management. This document details the root cause, why current mitigations are insufficient, and proposes a solution that eliminates the race by design.


Problem Statement

When a user quickly leaves and rejoins a meeting (e.g., closes tab and reopens within seconds), they may find themselves in a different Daily.co room than other participants in the same Reflector room. This breaks the core assumption that all users in a Reflector room share the same video call.

Symptoms

  • User A and User B are in the same Reflector room but different Daily.co rooms
  • User reports "I can't see/hear the other participant"
  • Meeting appears active but users are isolated

Evidence: Hypothesis Simulation

A simulation was built to model the presence system and find race conditions through randomized action sequences.

Location: server/tests/simulation/

cd server

# Current system config - finds race conditions
uv run pytest tests/simulation/test_presence_race.py::test_presence_race_conditions_current_system
# Result: XFAIL (expected failure - race found)

# Fixed system config - no race conditions
uv run pytest tests/simulation/test_presence_race.py::test_presence_no_race_conditions_fixed_system
# Result: PASS

The simulation models:

  • Discrete time clock for deterministic replay
  • Daily.co rooms, participants, presence API with configurable lag
  • Reflector meetings, sessions, webhooks
  • User state machine: idle → joining → handshaking → connected → leaving → idle
  • Background tasks: poll_daily_room_presence, process_meetings

Key Finding

The simulation proves that even with the Daily API call, a race window exists during WebRTC handshake when users are invisible to the presence API.


Current System Analysis

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Frontend  │────▶│   Backend   │────▶│  Daily.co   │
│  (Next.js)  │     │  (FastAPI)  │     │    API      │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  Database   │
                    │  (Sessions) │
                    └─────────────┘

Relevant Code Paths

1. Meeting Join Flow

  • File: server/reflector/views/rooms.py
  • Endpoint: POST /rooms/{room_name}/meeting
  • Returns existing active meeting or creates new one
  • User then connects to Daily.co via WebRTC (frontend)

2. Presence Polling

  • File: server/reflector/worker/process.py:642
  • Function: poll_daily_room_presence()
  • Called by webhooks (participant.joined, participant.left) and /joined, /leave endpoints
  • Queries Daily API for current participants
  • Updates daily_participant_sessions table in database

3. Meeting Deactivation

  • File: server/reflector/worker/process.py:754
  • Function: process_meetings()
  • Runs periodically (every 60s via Celery beat)
  • Checks if meetings should be deactivated

Current implementation (lines 806-833):

if meeting.platform == "daily":
    try:
        presence = await client.get_room_presence(meeting.room_name)
        has_active_sessions = presence.total_count > 0
        # ...
    except Exception:
        logger_.warning("Daily.co presence API failed, falling back to DB sessions")
        room_sessions = await client.get_room_sessions(meeting.room_name)
        has_active_sessions = bool(
            room_sessions and any(s.ended_at is None for s in room_sessions)
        )

Key observation: The code already uses the Daily API (get_room_presence), not just the database. The race condition persists despite this.

Endpoints from feature-leave-endpoint Branch

The feature-leave-endpoint branch added explicit leave/join notifications:

Endpoint Purpose Trigger
POST /rooms/{room_name}/meetings/{meeting_id}/join Get meeting info User navigates to room
POST /rooms/{room_name}/meetings/{meeting_id}/joined Signal connection complete After WebRTC connects
POST /rooms/{room_name}/meetings/{meeting_id}/leave Signal user leaving Tab close via sendBeacon

These endpoints trigger poll_daily_room_presence_task to update session state faster than waiting for webhooks.


Race Condition: Detailed Analysis

The Fundamental Problem

The backend has no knowledge of users who are in the process of joining (WebRTC handshake phase).

Data sources available to backend:

Source What it knows Limitation
Daily Presence API Currently connected users 0-500ms lag; doesn't see handshaking users
Database sessions Historical join/leave events Stale; updated by polls
Webhooks Join/leave events Delayed; can fail

Gap: No source knows about users between "decided to join" and "WebRTC handshake complete".

Race Scenario Timeline

T+0ms:    User A connected to Meeting M1, visible in Daily presence
T+1000ms: User A closes browser tab
T+1050ms: participant.left webhook fires → poll_daily_room_presence queued
T+1500ms: User A reopens tab (quick rejoin)
T+1600ms: POST /meeting returns M1 (still active)
T+1700ms: Frontend starts WebRTC handshake
T+2000ms: User A in handshake - NOT visible to Daily presence API
T+2100ms: poll runs → sees 0 participants → marks session as left_at
T+3000ms: process_meetings runs
T+3100ms: Daily API returns 0 participants (user still handshaking)
T+3200ms: has_active_sessions=False, has_had_sessions=True
T+3300ms: Meeting deactivated, Daily room deleted
T+4000ms: User A WebRTC completes → Daily room is gone!
T+5000ms: User B joins same Reflector room → new Meeting M2 created

RESULT: User A orphaned, User B in different Daily room

Why Current Mitigations Are Insufficient

1. Using Daily API (already implemented)

The code already calls get_room_presence() instead of relying solely on database sessions. This doesn't help because the Daily presence API itself doesn't see users during WebRTC handshake (0-500ms consistency lag + handshake duration of 500-3000ms).

2. Fallback to Database

When Daily API fails, the code falls back to database sessions. This is worse because database is even more stale than the API.

3. Leave/Join Endpoints

The /joined and /leave endpoints trigger immediate polls, reducing the window but not eliminating it. The poll still only sees what Daily presence API reports.


Proposed Solutions

Add a time-based buffer before deactivation.

GRACE_PERIOD_SECONDS = 10

if not has_active_sessions and has_had_sessions:
    recent_activity = await get_recent_activity(meeting_id, within_seconds=GRACE_PERIOD_SECONDS)
    if recent_activity:
        continue  # Skip deactivation

Pros:

  • Simple to implement
  • Low risk

Cons:

  • Arbitrary timeout value (why 10s? why not 5s or 30s?)
  • Feels like a hack ("setTimeout solution")
  • Delays legitimate deactivation
  • Doesn't eliminate race, just makes it less likely

Add explicit state tracking for users who are in the process of joining.

New endpoint: POST /rooms/{room_name}/meetings/{meeting_id}/joining

Flow change:

Current:
1. POST /join → get meeting info
2. Render Daily iframe (start WebRTC)
3. POST /joined (after connected)

Proposed:
1. POST /join → get meeting info
2. POST /joining → "I'm about to connect" ← NEW (wait for 200 OK)
3. Render Daily iframe (start WebRTC)
4. POST /joined (after connected)

Backend tracking:

# On /joining endpoint
await pending_joins.create(meeting_id=meeting_id, user_id=user_id, created_at=now())

# In process_meetings
pending = await pending_joins.get_recent(meeting_id, max_age_seconds=30)
if pending:
    logger.info("Meeting has pending joins, skipping deactivation")
    continue

# On /joined endpoint or timeout
await pending_joins.delete(meeting_id=meeting_id, user_id=user_id)

Pros:

  • Eliminates race by design (backend knows before Daily does)
  • Explicit state machine, not time-based guessing
  • Clear semantics

Cons:

  • Adds ~50-200ms latency (one round-trip before iframe renders)
  • Requires frontend changes
  • Needs cleanup mechanism for abandoned joins (user closes tab during handshake)

Option C: Optimistic Locking with Version

Track meeting "version" that must match for deactivation.

Concept: Each join attempt increments a version. Deactivation only proceeds if version hasn't changed since presence check.

Cons:

  • Complex to implement correctly
  • Still has edge cases with concurrent joins

Track "Intent to Join" is the cleanest solution because it:

  1. Eliminates the race by design - no timing windows
  2. Makes state explicit - joining/connected/leaving are tracked, not inferred
  3. Aligns with existing patterns - similar to /joined and /leave endpoints
  4. No arbitrary timeouts - unlike grace period

Data Model Change

Add tracking for pending joins. Options:

Storage Pros Cons
Redis key Fast, auto-expire Lost on Redis restart
Database table Persistent, queryable Slightly slower
In-memory Fastest Lost on server restart

Recommendation: Redis with TTL (30s expiry) for simplicity. Pending joins are ephemeral - if Redis restarts, worst case is a brief deactivation delay.

# Redis key format
pending_join:{meeting_id}:{user_id} = {timestamp}
# TTL: 30 seconds

Implementation Checklist

  1. Backend: Add /joining endpoint

    • File: server/reflector/views/rooms.py
    • Creates Redis key with 30s TTL
    • Returns 200 OK
  2. Backend: Modify process_meetings()

    • File: server/reflector/worker/process.py
    • Before deactivation, check for pending joins
    • If any exist, skip deactivation
  3. Backend: Modify /joined endpoint

    • Clear pending join on successful connection
  4. Frontend: Call /joining before WebRTC

    • File: www/app/[roomName]/components/DailyRoom.tsx
    • Await response before rendering Daily iframe
  5. Update simulation

    • Add joining state tracking to match new design
    • Verify race condition is eliminated
  6. Integration tests

    • Test quick rejoin scenario
    • Test abandoned join (user closes during handshake)
    • Test concurrent joins from multiple users

Files Reference

Core Files to Modify

File Purpose
server/reflector/views/rooms.py Add /joining endpoint
server/reflector/worker/process.py Check pending joins before deactivation
www/app/[roomName]/components/DailyRoom.tsx Call /joining before WebRTC

Reference Files

File Contains
server/reflector/video_platforms/daily.py:128 get_room_presence() - Daily API call
server/reflector/worker/process.py:642 poll_daily_room_presence() - presence polling
server/reflector/views/daily.py:125 Webhook handlers
server/tests/simulation/ Hypothesis simulation proving the race
server/tests/test_daily_presence_deactivation.py Unit tests for presence logic

Simulation Files

File Purpose
tests/simulation/system.py Main simulation engine
tests/simulation/config.py Current vs fixed system configs
tests/simulation/state.py State dataclasses
tests/simulation/test_presence_race.py Hypothesis stateful tests
tests/simulation/test_targeted_scenarios.py Specific race scenarios
server/reflector/presence/model.py Shared state machine model

Alternative Considered: Remove DB Fallback

One simpler change discussed: remove the database fallback when Daily API fails, and "fail loudly" instead.

# Current (with fallback)
try:
    presence = await client.get_room_presence(meeting.room_name)
    has_active_sessions = presence.total_count > 0
except Exception:
    # Fallback to stale DB
    room_sessions = await client.get_room_sessions(meeting.room_name)
    has_active_sessions = bool(room_sessions and any(s.ended_at is None for s in room_sessions))

# Proposed (fail loudly)
try:
    presence = await client.get_room_presence(meeting.room_name)
    has_active_sessions = presence.total_count > 0
except Exception:
    logger.error("Daily API failed, skipping deactivation check for this meeting")
    continue  # Don't deactivate if we can't verify

This helps but doesn't eliminate the race - it only removes one failure mode (stale DB). The core race (handshake invisibility) remains.


Conclusion

The presence system race condition is a data model gap, not a timing issue that can be solved with grace periods. The backend needs explicit knowledge of users who intend to join, before they become visible to the Daily presence API.

The recommended fix is to add a /joining endpoint that the frontend calls before starting WebRTC. This creates a "reservation" that prevents premature meeting deactivation during the handshake window.

This approach:

  • Eliminates the race by design
  • Adds minimal latency (~50-200ms)
  • Follows explicit state machine principles
  • Avoids arbitrary timeout hacks

Appendix: Simulation Test Results

$ uv run pytest tests/simulation/ -v

tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_uses_model_states PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_respects_transitions PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_invalid_transitions_checked PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_implements_protocols PASSED
tests/simulation/test_model_conformance.py::TestModelConformance::test_simulation_uses_shared_invariants PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_has_all_states PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_valid_transitions PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_state_machine_invalid_transitions_raise PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_transitions PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_rejects_invalid PASSED
tests/simulation/test_model_conformance.py::TestProductionStateMachine::test_guarded_user_state_tracks_history PASSED
tests/simulation/test_model_conformance.py::TestInvariantConsistency::test_invariants_same_between_model_and_simulation PASSED
tests/simulation/test_model_conformance.py::test_quick_conformance_check PASSED
tests/simulation/test_presence_race.py::TestPresenceRaceFixed::runTest PASSED
tests/simulation/test_presence_race.py::test_presence_race_conditions_current_system XFAIL
tests/simulation/test_presence_race.py::test_presence_no_race_conditions_fixed_system PASSED
tests/simulation/test_presence_race.py::test_smoke_presence_simulation PASSED
tests/simulation/test_targeted_scenarios.py::TestQuickRejoinRace::test_quick_rejoin_causes_split PASSED
tests/simulation/test_targeted_scenarios.py::TestQuickRejoinRace::test_quick_rejoin_fixed_system PASSED
tests/simulation/test_targeted_scenarios.py::TestSimultaneousJoins::test_two_users_join_simultaneously PASSED
tests/simulation/test_targeted_scenarios.py::TestProcessMeetingsRace::test_process_meetings_during_handshake PASSED
tests/simulation/test_targeted_scenarios.py::TestPresenceLagRace::test_presence_lag_causes_incorrect_count PASSED
tests/simulation/test_targeted_scenarios.py::TestMeetingDeactivationEdgeCases::test_deactivation_with_no_sessions PASSED
tests/simulation/test_targeted_scenarios.py::TestMeetingDeactivationEdgeCases::test_deactivation_requires_had_sessions PASSED
tests/simulation/test_targeted_scenarios.py::TestEventLogTracing::test_event_log_captures_flow PASSED
tests/simulation/test_targeted_scenarios.py::test_config_presets PASSED

================== 25 passed, 1 xfailed ==================

The xfail test (test_presence_race_conditions_current_system) demonstrates that the current system configuration has race conditions that can be found through randomized testing.