Fix ghost participants accumulating in sessions by marcodejongh · Pull Request #904 · marcodejongh/boardsesh

marcodejongh · 2026-03-02T20:34:08Z

Summary

Fix ghost participants: When the backend restarts (HMR, deployment, crash), WebSocket connections drop but Redis session membership sets (boardsesh:session:SESSION_ID:members) are never cleaned up. Each browser reconnect adds a new connection ID while old ones persist, causing inflated participant counts (e.g., 28 members from 4 dead instances when only 2 browsers are connected).
Add dead instance cleanup: Discovers dead instances via heartbeat checking, prunes their orphaned connections on startup and every 2 minutes, with a Lua script for atomic session member pruning and leader re-election.
Replace board_session_clients with board_session_participants: One row per authenticated user per session (permanent historical record) instead of per-connection rows that accumulate as ghosts. Leader state stays Redis-only.

Changes

Part A: Redis Ghost Cleanup (`distributed-state.ts`)

Added PRUNE_STALE_SESSION_MEMBERS_SCRIPT Lua script for atomic stale member removal + leader re-election
Added discoverDeadInstances(), cleanupDeadInstanceConnections(), cleanupStaleSessionMembers() methods
Fixed getSessionMemberCount() to filter stale entries via pipeline EXISTS checks (was using raw SCARD)
Fixed hasSessionMembers() to delegate to corrected count method
Added startup cleanup call in start()
Added 2h TTL on instance connection sets for self-healing

Part B: Replace `board_session_clients` (`room-manager.ts`, schema)

New board_session_participants table with UNIQUE(session_id, user_id) — only for authenticated users
Refactored persistSessionJoin() to upsert into new table instead of board_session_clients
Removed persistSessionLeave() (participants are permanent records)
Removed persistLeaderChange() (leader state is Redis-only)
Removed Postgres write from updateUsername()
Updated mySessions query to use distributedState.getSessionMemberCount()
Old board_session_clients table is no longer written to (can be dropped in follow-up)

Part C: Periodic Cleanup (`server.ts`)

Added dead instance cleanup interval (every 2 minutes)

Test plan

npm run typecheck:backend and npm run typecheck:db pass
Start backend, create session with 2 browsers
Kill and restart backend → startup cleanup removes stale connections
Reconnect browsers → participant count matches actual count (not inflated)
Check Redis: SCARD boardsesh:session:SESSION_ID:members equals actual count
Check Postgres: board_session_participants has one row per authenticated user
Old board_session_clients is no longer written to

🤖 Generated with Claude Code

When the backend restarts (HMR, deployment, crash), WebSocket connections drop but Redis session membership sets are never cleaned up. Each browser reconnect adds a new connection ID while old ones persist, causing inflated participant counts. This commit: - Adds a Lua script to atomically prune stale session members and re-elect leaders when needed - Adds dead instance discovery via heartbeat checking and batch cleanup of orphaned connections on startup and every 2 minutes - Fixes getSessionMemberCount/hasSessionMembers to filter out stale entries by checking connection hash existence - Adds TTL (2h) to instance connection sets for self-healing - Replaces per-connection board_session_clients rows with a per-user board_session_participants table (one row per authenticated user per session, permanent historical record) - Removes persistSessionLeave and persistLeaderChange Postgres writes (leader state is Redis-only) - Updates mySessions query to use distributed state for participant counts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-02T20:37:29Z

Claude Review

✅ Ready to merge - Minor issues noted below, but nothing blocking.

Test Coverage

Missing tests for new cleanup methods: The new methods discoverDeadInstances(), cleanupDeadInstanceConnections(), and cleanupStaleSessionMembers() in packages/backend/src/services/distributed-state.ts lack dedicated unit tests. The existing test suite covers connection management and leader election but doesn't verify the dead instance cleanup logic or the PRUNE_STALE_SESSION_MEMBERS_SCRIPT Lua script behavior.

Documentation

docs/websocket-implementation.md: Consider updating this documentation to reflect:
- The new board_session_participants table (permanent historical record vs. per-connection tracking)
- The dead instance cleanup mechanism (heartbeat-based discovery + 2-minute periodic cleanup)
- How participant counts are now calculated using getSessionMemberCount() with stale entry filtering

Minor Notes

The PR cleanly replaces the ghost participant tracking approach with a more robust solution using heartbeat-based instance discovery and atomic Lua scripts for cleanup
The migration properly creates the new board_session_participants table with appropriate indexes

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f9f0c9769

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

packages/backend/src/services/room-manager.ts

packages/backend/src/services/distributed-state.ts

…cycle Instead of adding another setInterval in server.ts, piggyback the cleanup on the existing 30s heartbeat in DistributedStateManager. Runs every 4th heartbeat (~2 minutes), keeping the cleanup logic co-located with the state it manages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-02T20:43:08Z

Claude Review

✅ Ready to merge - Minor issues noted below, but nothing blocking.

Documentation

docs/entity-structure.md references board_session_clients in 4+ places (lines 62, 116, 269, 278) but should mention the new board_session_participants table or note the migration status

Test Coverage

No tests added for new cleanup functions (discoverDeadInstances, cleanupDeadInstanceConnections, cleanupStaleSessionMembers, PRUNE_STALE_SESSION_MEMBERS_SCRIPT Lua script)
These are critical recovery mechanisms that would benefit from integration tests

Code Review

The implementation looks solid. The Lua scripts are well-designed for atomic operations, the periodic cleanup piggybacks efficiently on heartbeats, and the self-healing TTLs provide good defense-in-depth.

- Pass authenticatedUserId to registerClient in WebSocket onConnect so client.userId is populated and persistSessionJoin actually inserts into board_session_participants for authenticated users - Refresh instance connection-set TTL on every heartbeat (not just at registration) so long-lived instances remain discoverable after a crash - Add tests for discoverDeadInstances, cleanupDeadInstanceConnections, cleanupStaleSessionMembers, and getSessionMemberCount stale filtering - Update websocket-implementation.md: document active dead instance cleanup, board_session_participants table, TTL annotations, and participant tracking architecture Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-02T20:49:21Z

Claude Review

✅ Ready to merge - Well-designed solution for ghost participant cleanup with comprehensive tests and updated documentation.

No significant issues found. The implementation correctly:

Adds active dead instance cleanup on startup and every ~2 minutes
Filters stale members at read time via pipelined EXISTS checks
Uses atomic Lua scripts for session pruning and leader re-election
Replaces per-connection board_session_clients with per-user board_session_participants table
Updates docs/websocket-implementation.md with the new architecture

Replace the unique index on (session_id, user_id) with a proper composite primary key for consistency with other tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector bot reviewed Mar 2, 2026

View reviewed changes

packages/backend/src/services/room-manager.ts Show resolved Hide resolved

packages/backend/src/services/distributed-state.ts Show resolved Hide resolved

Add composite primary key to board_session_participants

bde7076

Replace the unique index on (session_id, user_id) with a proper composite primary key for consistency with other tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

marcodejongh merged commit 8df8752 into main Mar 2, 2026
8 of 9 checks passed

marcodejongh deleted the fix/ghost-participants-cleanup branch March 2, 2026 21:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ghost participants accumulating in sessions#904

Fix ghost participants accumulating in sessions#904
marcodejongh merged 4 commits intomainfrom
fix/ghost-participants-cleanup

marcodejongh commented Mar 2, 2026

Uh oh!

claude bot commented Mar 2, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Uh oh!

claude bot commented Mar 2, 2026

Uh oh!

claude bot commented Mar 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marcodejongh commented Mar 2, 2026

Summary

Changes

Part A: Redis Ghost Cleanup (distributed-state.ts)

Part B: Replace board_session_clients (room-manager.ts, schema)

Part C: Periodic Cleanup (server.ts)

Test plan

Uh oh!

claude bot commented Mar 2, 2026

Claude Review

Test Coverage

Documentation

Minor Notes

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

claude bot commented Mar 2, 2026

Claude Review

Documentation

Test Coverage

Code Review

Uh oh!

claude bot commented Mar 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Part A: Redis Ghost Cleanup (`distributed-state.ts`)

Part B: Replace `board_session_clients` (`room-manager.ts`, schema)

Part C: Periodic Cleanup (`server.ts`)

claude bot commented Mar 2, 2026 •

edited

Loading