fix: tolerate redis outages in realtime websockets #612

Open
tongshiyuan wants to merge 1 commit into dataelement:main from tongshiyuan:realtime-redis-fallback-fix
@tongshiyuan
Summary

Fixes agent chat appearing unresponsive when Redis is unavailable locally.

The WebSocket setup path previously required Redis presence registration to succeed. If Redis was down,
/ws/chat/{agent_id} failed during setup before the frontend could receive chat responses. This change
keeps local WebSocket delivery working and degrades only cross-instance realtime routing when Redis is
unavailable.
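
The degradation described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: `DownRedis`, the `presence:{agent_id}` key layout, and the `instance-1` value are hypothetical stand-ins, and only the names `register_connection` and `active_connections` are taken from the review discussion.

```python
import asyncio
import logging
import uuid

logger = logging.getLogger("realtime")


class DownRedis:
    """Hypothetical stand-in for a Redis client whose server is unreachable."""

    async def hset(self, key, field, value):
        raise ConnectionError("redis unavailable")


class ConnectionManager:
    """Hypothetical manager; the real class in the PR may look different."""

    def __init__(self, redis):
        self.redis = redis
        self.active_connections = {}  # connection_id -> websocket-like object

    async def register_connection(self, agent_id, websocket):
        connection_id = uuid.uuid4().hex
        # Local delivery only needs the in-process registry.
        self.active_connections[connection_id] = websocket
        try:
            # Presence is what other instances would use to route to this socket.
            await self.redis.hset(f"presence:{agent_id}", connection_id, "instance-1")
        except Exception as exc:
            # Degrade to local-only delivery instead of failing WebSocket setup.
            logger.warning("Redis presence unavailable; using local websocket only: %s", exc)
        return connection_id


manager = ConnectionManager(DownRedis())
connection_id = asyncio.run(manager.register_connection("agent-42", object()))
```

With Redis down, the call still returns a connection id and the socket stays in the local registry, so only cross-instance routing is lost.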

Checklist

  • Tested locally
  • No unrelated changes included

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04e00a0eab

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +171 to +173
except Exception as exc:
    logger.warning(f"[Realtime] Redis pubsub subscriber unavailable; remote websocket routing disabled: {exc}")
    return
[P2] Keep retrying the subscriber after startup Redis outages

When Redis is down during startup, this branch exits _subscriber_loop permanently while start() has already set _started = True and backend/app/main.py only calls realtime_router.start(...) once during lifespan startup. If Redis recovers later, this instance can still publish to other instances, but it never resubscribes to its own instance channel, so cross-instance messages addressed to it are dropped until the process restarts. Consider retrying subscription instead of returning from the subscriber task.
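
One way to read this suggestion is a subscriber task that retries with capped backoff instead of returning. The sketch below is illustrative only: `FlakyPubSub`, `subscriber_loop`, the channel name, and the delay values are hypothetical, and a real loop would continue reading messages after a successful subscribe rather than returning.

```python
import asyncio
import logging

logger = logging.getLogger("realtime")


class FlakyPubSub:
    """Hypothetical pubsub stand-in: subscribe() fails a few times, then succeeds."""

    def __init__(self, failures=2):
        self.failures = failures
        self.subscribed = False

    async def subscribe(self, channel):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("redis unavailable")
        self.subscribed = True


async def subscriber_loop(pubsub, channel, base_delay=0.01, max_delay=0.05):
    """Keep trying to subscribe instead of exiting on the first failure."""
    delay = base_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            await pubsub.subscribe(channel)
            return attempts  # a real loop would go on to read messages here
        except Exception as exc:
            logger.warning("Redis pubsub subscriber unavailable; retrying in %.2fs: %s", delay, exc)
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # capped exponential backoff


pubsub = FlakyPubSub(failures=2)
attempts = asyncio.run(subscriber_loop(pubsub, "realtime:instance-1"))
```

Because `asyncio.CancelledError` does not derive from `Exception`, cancelling the task during shutdown still stops the loop.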

Comment on lines +213 to 215
except Exception as exc:
    logger.warning(f"[Realtime] Redis presence lookup failed: {exc}")
    return []
[P2] Preserve local viewer checks when Redis is unavailable

With Redis unavailable, _list_presence() now returns an empty list, so manager.is_user_viewing_session() always reports false even though register_connection() now allows local WebSockets to stay connected without Redis. In that local-only fallback scenario, maybe_mark_session_read_for_active_viewer() in backend/app/api/websocket.py stops marking messages read for the user currently viewing the session, leaving unread state stale while chat delivery works. The fallback needs to include local active connections or avoid treating Redis failure as no local viewers.
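
A fallback along these lines might look like the sketch below. It is not the PR's implementation: the shape of `active_connections`, the `_local_presence` helper, the `presence:{session_id}` key, and the assumption that stored presence values are dicts are all hypothetical; only `_list_presence` and `is_user_viewing_session` are names taken from the comment.

```python
import asyncio
import logging

logger = logging.getLogger("realtime")


class DownRedis:
    """Hypothetical stand-in for a Redis client whose server is unreachable."""

    async def hgetall(self, key):
        raise ConnectionError("redis unavailable")


class ConnectionManager:
    def __init__(self, redis):
        self.redis = redis
        # Assumed shape: connection_id -> (user_id, session_id) for local sockets.
        self.active_connections = {}

    def _local_presence(self, session_id):
        """Presence entries derived only from sockets held by this instance."""
        return [
            {"connection_id": cid, "user_id": uid, "session_id": sid}
            for cid, (uid, sid) in self.active_connections.items()
            if sid == session_id
        ]

    async def _list_presence(self, session_id):
        try:
            raw = await self.redis.hgetall(f"presence:{session_id}")
        except Exception as exc:
            logger.warning("Redis presence lookup failed; using local viewers: %s", exc)
            return self._local_presence(session_id)
        # Assumes each stored value is already a dict with a "user_id" key.
        return list(raw.values())

    async def is_user_viewing_session(self, user_id, session_id):
        entries = await self._list_presence(session_id)
        return any(entry["user_id"] == user_id for entry in entries)


manager = ConnectionManager(DownRedis())
manager.active_connections["conn-1"] = ("user-7", "session-9")
viewing = asyncio.run(manager.is_user_viewing_session("user-7", "session-9"))
not_viewing = asyncio.run(manager.is_user_viewing_session("user-8", "session-9"))
```

With this shape, a Redis failure narrows the answer to local viewers rather than reporting that nobody is viewing.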

tongshiyuan force-pushed the realtime-redis-fallback-fix branch from 04e00a0 to be4a912 on May 27, 2026 09:24
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be4a912ab0

Comment on lines +61 to 63
except Exception as exc:
    logger.warning(f"[Realtime] Redis presence unavailable; using local websocket only: {exc}")
    return connection_id
[P2] Retry presence registration after Redis returns

When Redis is unavailable during register_connection, the socket is kept only in active_connections and this branch returns without ever writing its presence record later. If Redis recovers while the WebSocket stays open, remote instances still cannot discover this viewer in _list_presence, so cross-instance messages for that user/session are skipped until the user reconnects. Consider retrying the presence write or refreshing local connections once Redis is reachable again.
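
The retry suggested here could be sketched as a pending set that a periodic task drains once Redis is reachable again. Everything below is hypothetical illustration rather than the PR's code: `RecoveringRedis`, the `unregistered` set, `refresh_presence`, the key layout, and the simplified `register_connection` signature are assumptions.

```python
import asyncio
import logging

logger = logging.getLogger("realtime")


class RecoveringRedis:
    """Hypothetical stand-in: the first `failures` writes fail, later ones succeed."""

    def __init__(self, failures=1):
        self.failures = failures
        self.presence = {}

    async def hset(self, key, field, value):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("redis unavailable")
        self.presence.setdefault(key, {})[field] = value


class ConnectionManager:
    def __init__(self, redis):
        self.redis = redis
        self.active_connections = {}  # connection_id -> session_id
        self.unregistered = set()     # connection ids whose presence write failed

    async def _write_presence(self, connection_id):
        session_id = self.active_connections[connection_id]
        await self.redis.hset(f"presence:{session_id}", connection_id, "instance-1")

    async def register_connection(self, connection_id, session_id):
        self.active_connections[connection_id] = session_id
        try:
            await self._write_presence(connection_id)
        except Exception as exc:
            logger.warning("Redis presence unavailable; will retry later: %s", exc)
            self.unregistered.add(connection_id)
        return connection_id

    async def refresh_presence(self):
        """Retry pending presence writes; intended to run periodically."""
        for connection_id in list(self.unregistered):
            if connection_id not in self.active_connections:
                self.unregistered.discard(connection_id)  # socket closed meanwhile
                continue
            try:
                await self._write_presence(connection_id)
            except Exception:
                return  # Redis still down; try again on the next tick
            self.unregistered.discard(connection_id)


async def main():
    manager = ConnectionManager(RecoveringRedis(failures=1))
    await manager.register_connection("conn-1", "session-9")
    pending_before = set(manager.unregistered)
    await manager.refresh_presence()
    return pending_before, set(manager.unregistered), manager.redis.presence


pending_before, pending_after, presence = asyncio.run(main())
```

Running the refresh from an existing heartbeat or from the subscriber's reconnect path would avoid adding a separate background task, though which fits best depends on how the router is structured.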
