Skip to content

Fix ghost participants accumulating in sessions#904

Merged
marcodejongh merged 4 commits intomainfrom
fix/ghost-participants-cleanup
Mar 2, 2026
Merged

Fix ghost participants accumulating in sessions#904
marcodejongh merged 4 commits intomainfrom
fix/ghost-participants-cleanup

Conversation

@marcodejongh
Copy link
Owner

Summary

  • Fix ghost participants: When the backend restarts (HMR, deployment, crash), WebSocket connections drop but Redis session membership sets (boardsesh:session:SESSION_ID:members) are never cleaned up. Each browser reconnect adds a new connection ID while old ones persist, causing inflated participant counts (e.g., 28 members from 4 dead instances when only 2 browsers are connected).
  • Add dead instance cleanup: Discovers dead instances via heartbeat checking, prunes their orphaned connections on startup and every 2 minutes, with a Lua script for atomic session member pruning and leader re-election.
  • Replace board_session_clients with board_session_participants: One row per authenticated user per session (permanent historical record) instead of per-connection rows that accumulate as ghosts. Leader state stays Redis-only.

Changes

Part A: Redis Ghost Cleanup (distributed-state.ts)

  • Added PRUNE_STALE_SESSION_MEMBERS_SCRIPT Lua script for atomic stale member removal + leader re-election
  • Added discoverDeadInstances(), cleanupDeadInstanceConnections(), cleanupStaleSessionMembers() methods
  • Fixed getSessionMemberCount() to filter stale entries via pipeline EXISTS checks (was using raw SCARD)
  • Fixed hasSessionMembers() to delegate to corrected count method
  • Added startup cleanup call in start()
  • Added 2h TTL on instance connection sets for self-healing

Part B: Replace board_session_clients (room-manager.ts, schema)

  • New board_session_participants table with UNIQUE(session_id, user_id) — only for authenticated users
  • Refactored persistSessionJoin() to upsert into new table instead of board_session_clients
  • Removed persistSessionLeave() (participants are permanent records)
  • Removed persistLeaderChange() (leader state is Redis-only)
  • Removed Postgres write from updateUsername()
  • Updated mySessions query to use distributedState.getSessionMemberCount()
  • Old board_session_clients table is no longer written to (can be dropped in follow-up)

Part C: Periodic Cleanup (server.ts)

  • Added dead instance cleanup interval (every 2 minutes)

Test plan

  • npm run typecheck:backend and npm run typecheck:db pass
  • Start backend, create session with 2 browsers
  • Kill and restart backend → startup cleanup removes stale connections
  • Reconnect browsers → participant count matches actual count (not inflated)
  • Check Redis: SCARD boardsesh:session:SESSION_ID:members equals actual count
  • Check Postgres: board_session_participants has one row per authenticated user
  • Old board_session_clients is no longer written to

🤖 Generated with Claude Code

When the backend restarts (HMR, deployment, crash), WebSocket connections
drop but Redis session membership sets are never cleaned up. Each browser
reconnect adds a new connection ID while old ones persist, causing inflated
participant counts.

This commit:

- Adds a Lua script to atomically prune stale session members and re-elect
  leaders when needed
- Adds dead instance discovery via heartbeat checking and batch cleanup of
  orphaned connections on startup and every 2 minutes
- Fixes getSessionMemberCount/hasSessionMembers to filter out stale entries
  by checking connection hash existence
- Adds TTL (2h) to instance connection sets for self-healing
- Replaces per-connection board_session_clients rows with a per-user
  board_session_participants table (one row per authenticated user per
  session, permanent historical record)
- Removes persistSessionLeave and persistLeaderChange Postgres writes
  (leader state is Redis-only)
- Updates mySessions query to use distributed state for participant counts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 2, 2026

Claude Review

Ready to merge - Minor issues noted below, but nothing blocking.

Test Coverage

  • Missing tests for new cleanup methods: The new methods discoverDeadInstances(), cleanupDeadInstanceConnections(), and cleanupStaleSessionMembers() in packages/backend/src/services/distributed-state.ts lack dedicated unit tests. The existing test suite covers connection management and leader election but doesn't verify the dead instance cleanup logic or the PRUNE_STALE_SESSION_MEMBERS_SCRIPT Lua script behavior.

Documentation

  • docs/websocket-implementation.md: Consider updating this documentation to reflect:
    • The new board_session_participants table (permanent historical record vs. per-connection tracking)
    • The dead instance cleanup mechanism (heartbeat-based discovery + 2-minute periodic cleanup)
    • How participant counts are now calculated using getSessionMemberCount() with stale entry filtering

Minor Notes

  • The PR cleanly replaces the ghost participant tracking approach with a more robust solution using heartbeat-based instance discovery and atomic Lua scripts for cleanup
  • The migration properly creates the new board_session_participants table with appropriate indexes

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f9f0c9769

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…cycle

Instead of adding another setInterval in server.ts, piggyback the cleanup
on the existing 30s heartbeat in DistributedStateManager. Runs every 4th
heartbeat (~2 minutes), keeping the cleanup logic co-located with the
state it manages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 2, 2026

Claude Review

Ready to merge - Minor issues noted below, but nothing blocking.

Documentation

  • docs/entity-structure.md references board_session_clients in 4+ places (lines 62, 116, 269, 278) but should mention the new board_session_participants table or note the migration status

Test Coverage

  • No tests added for new cleanup functions (discoverDeadInstances, cleanupDeadInstanceConnections, cleanupStaleSessionMembers, PRUNE_STALE_SESSION_MEMBERS_SCRIPT Lua script)
  • These are critical recovery mechanisms that would benefit from integration tests

Code Review

The implementation looks solid. The Lua scripts are well-designed for atomic operations, the periodic cleanup piggybacks efficiently on heartbeats, and the self-healing TTLs provide good defense-in-depth.

- Pass authenticatedUserId to registerClient in WebSocket onConnect so
  client.userId is populated and persistSessionJoin actually inserts into
  board_session_participants for authenticated users
- Refresh instance connection-set TTL on every heartbeat (not just at
  registration) so long-lived instances remain discoverable after a crash
- Add tests for discoverDeadInstances, cleanupDeadInstanceConnections,
  cleanupStaleSessionMembers, and getSessionMemberCount stale filtering
- Update websocket-implementation.md: document active dead instance
  cleanup, board_session_participants table, TTL annotations, and
  participant tracking architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link

claude bot commented Mar 2, 2026

Claude Review

Ready to merge - Well-designed solution for ghost participant cleanup with comprehensive tests and updated documentation.

No significant issues found. The implementation correctly:

  • Adds active dead instance cleanup on startup and every ~2 minutes
  • Filters stale members at read time via pipelined EXISTS checks
  • Uses atomic Lua scripts for session pruning and leader re-election
  • Replaces per-connection board_session_clients with per-user board_session_participants table
  • Updates docs/websocket-implementation.md with the new architecture

Replace the unique index on (session_id, user_id) with a proper composite
primary key for consistency with other tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@marcodejongh marcodejongh merged commit 8df8752 into main Mar 2, 2026
8 of 9 checks passed
@marcodejongh marcodejongh deleted the fix/ghost-participants-cleanup branch March 2, 2026 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant