fix: tolerate redis outages in realtime websockets #612
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 04e00a0eab
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    except Exception as exc:
        logger.warning(f"[Realtime] Redis pubsub subscriber unavailable; remote websocket routing disabled: {exc}")
        return
Keep retrying the subscriber after startup Redis outages
When Redis is down during startup, this branch exits _subscriber_loop permanently while start() has already set _started = True and backend/app/main.py only calls realtime_router.start(...) once during lifespan startup. If Redis recovers later, this instance can still publish to other instances, but it never resubscribes to its own instance channel, so cross-instance messages addressed to it are dropped until the process restarts. Consider retrying subscription instead of returning from the subscriber task.
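One way to act on this suggestion is to wrap the subscription in a retry loop with capped exponential backoff, so the task survives a startup outage instead of returning. The sketch below is a minimal illustration, not the PR's code: `subscriber_loop_with_retry`, `subscribe_and_listen`, and the delay parameters are hypothetical names, and the real `_subscriber_loop` may be structured differently.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("realtime")


async def subscriber_loop_with_retry(
    subscribe_and_listen: Callable[[], Awaitable[None]],
    *,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Keep trying to subscribe until the listener exits cleanly.

    `subscribe_and_listen` is a hypothetical coroutine that subscribes to the
    instance channel and processes messages; it raises if Redis is unreachable.
    """
    delay = initial_delay
    while True:
        try:
            await subscribe_and_listen()
            return  # listener finished without error (for example, on shutdown)
        except asyncio.CancelledError:
            raise  # let task cancellation propagate during shutdown
        except Exception as exc:
            logger.warning(
                "[Realtime] Redis pubsub subscriber unavailable; retrying in %.1fs: %s",
                delay,
                exc,
            )
            await sleep(delay)
            # Double the wait after each failure, up to max_delay.
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter exists only so the backoff can be exercised without real waiting; production code would leave it at `asyncio.sleep`.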
    except Exception as exc:
        logger.warning(f"[Realtime] Redis presence lookup failed: {exc}")
        return []
Preserve local viewer checks when Redis is unavailable
With Redis unavailable, _list_presence() now returns an empty list, so manager.is_user_viewing_session() always reports false even though register_connection() now allows local WebSockets to stay connected without Redis. In that local-only fallback scenario, maybe_mark_session_read_for_active_viewer() in backend/app/api/websocket.py stops marking messages read for the user currently viewing the session, leaving unread state stale while chat delivery works. The fallback needs to include local active connections or avoid treating Redis failure as no local viewers.
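A possible shape for that fallback is to consult local connections first and treat a Redis failure as "no remote viewers" rather than "no viewers at all". The sketch below assumes a simplified model: the `Viewer` record and the `local_viewers` and `list_remote_presence` parameters are hypothetical stand-ins for whatever `active_connections` and `_list_presence()` actually hold.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Viewer:
    """Hypothetical presence record: one user viewing one session."""
    user_id: str
    session_id: str


def is_user_viewing_session(
    user_id: str,
    session_id: str,
    local_viewers: Iterable[Viewer],
    list_remote_presence: Callable[[], Iterable[Viewer]],
) -> bool:
    target = Viewer(user_id, session_id)

    # Local connections are known without Redis, so check them first.
    if target in set(local_viewers):
        return True

    # Only the cross-instance lookup depends on Redis.
    try:
        remote = list_remote_presence()
    except Exception:
        # Redis failure means remote viewers are unknown, not that local ones vanished.
        return False
    return target in set(remote)
```

With this ordering, a user connected to the current instance is still reported as viewing the session during an outage, so read-marking can continue for local viewers.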
04e00a0 to be4a912
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be4a912ab0
    except Exception as exc:
        logger.warning(f"[Realtime] Redis presence unavailable; using local websocket only: {exc}")
        return connection_id
Retry presence registration after Redis returns
When Redis is unavailable during register_connection, the socket is kept only in active_connections and this branch returns without ever writing its presence record later. If Redis recovers while the WebSocket stays open, remote instances still cannot discover this viewer in _list_presence, so cross-instance messages for that user/session are skipped until the user reconnects. Consider retrying the presence write or refreshing local connections once Redis is reachable again.
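One way to retry the presence write is to remember which connections failed to register and flush them once Redis is reachable again, for example from a periodic task or a reconnect hook. The sketch below is illustrative only: `PendingPresence`, `write_presence`, and `flush` are hypothetical names, and the real presence record likely carries more than a connection ID.

```python
from typing import Callable, Set


class PendingPresence:
    """Tracks connections whose presence record has not reached Redis yet."""

    def __init__(self, write_presence: Callable[[str], None]) -> None:
        # `write_presence` is a hypothetical callable that writes one presence
        # record to Redis and raises if Redis is unavailable.
        self._write_presence = write_presence
        self._pending: Set[str] = set()

    def register(self, connection_id: str) -> bool:
        """Try to write presence; queue the connection for retry on failure."""
        try:
            self._write_presence(connection_id)
        except Exception:
            self._pending.add(connection_id)
            return False
        self._pending.discard(connection_id)
        return True

    def unregister(self, connection_id: str) -> None:
        """Forget a connection that closed before its presence was written."""
        self._pending.discard(connection_id)

    def flush(self) -> int:
        """Retry queued writes; stop at the first failure. Returns writes that succeeded."""
        written = 0
        for connection_id in sorted(self._pending):  # sorted() copies, so discarding is safe
            try:
                self._write_presence(connection_id)
            except Exception:
                break  # Redis is still down; keep the rest queued
            self._pending.discard(connection_id)
            written += 1
        return written
```

Calling `unregister` on disconnect matters here: without it, a later `flush` could publish presence for a socket that no longer exists.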
Summary
Fixes agent chat appearing unresponsive when Redis is unavailable locally.
The WebSocket setup path previously required Redis presence registration to succeed. If Redis was down, /ws/chat/{agent_id} failed during setup before the frontend could receive chat responses. This change keeps local WebSocket delivery working and degrades only cross-instance realtime routing when Redis is unavailable.
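The degradation described above can be pictured as delivering to local sockets first and treating the Redis publish as best effort. This is a minimal sketch under assumed names, not the PR's implementation: `deliver_message`, `local_senders`, and `publish_remote` are hypothetical, and the real code is presumably asynchronous.

```python
import logging
from typing import Callable, Iterable, Tuple

logger = logging.getLogger("realtime")


def deliver_message(
    message: str,
    local_senders: Iterable[Callable[[str], None]],
    publish_remote: Callable[[str], None],
) -> Tuple[int, bool]:
    """Send to local sockets unconditionally, then attempt cross-instance routing.

    Returns (number of local deliveries, whether the remote publish succeeded).
    """
    delivered = 0
    for send in local_senders:
        send(message)  # local delivery never touches Redis
        delivered += 1

    try:
        publish_remote(message)  # hypothetical Redis publish to other instances
        remote_ok = True
    except Exception as exc:
        logger.warning("[Realtime] Redis publish failed; delivered locally only: %s", exc)
        remote_ok = False

    return delivered, remote_ok
```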
Checklist