Skip to content

fix: Handle Redis connection errors gracefully instead of panicking#111

Open
sbernauer wants to merge 4 commits intomainfrom
fix/redis-connection
Open

fix: Handle Redis connection errors gracefully instead of panicking#111
sbernauer wants to merge 4 commits intomainfrom
fix/redis-connection

Conversation

@sbernauer
Copy link
Copy Markdown
Member

Maybe fixes #109

When Redis goes down (e.g. master failover), a broken pipe error would cause get_queued_query_count to panic due to an .unwrap(), which then poisoned the RwLock used in metrics callbacks, cascading into further panics and leaving pods unresponsive until the liveness probe killed them.

  • Replace .unwrap() with proper error propagation in get_queued_query_count
  • Handle poisoned locks, closed channels, and panicked threads gracefully in both metrics callbacks

sbernauer and others added 4 commits April 17, 2026 15:57
When Redis goes down (e.g. master failover), a broken pipe error would
cause `get_queued_query_count` to panic due to an `.unwrap()`, which
then poisoned the RwLock used in metrics callbacks, cascading into
further panics and leaving pods unresponsive until the liveness probe
killed them.

- Replace `.unwrap()` with proper error propagation in
  `get_queued_query_count`
- Handle poisoned locks, closed channels, and panicked threads
  gracefully in both metrics callbacks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sbernauer sbernauer self-assigned this Apr 17, 2026
@sbernauer sbernauer moved this to Development: In Progress in Stackable Engineering Apr 17, 2026
@sbernauer sbernauer moved this from Development: In Progress to Development: Waiting for Review in Stackable Engineering Apr 17, 2026
@NickLarsenNZ NickLarsenNZ self-requested a review April 17, 2026 14:57
Copy link
Copy Markdown
Member

@NickLarsenNZ NickLarsenNZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

.scard::<_, Option<u64>>(queued_query_set_name(cluster_group))
.await
.unwrap()
.context(GetQueuedQueryCountSnafu { cluster_group })?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valuable lesson :D

Comment thread trino-lb/src/metrics.rs
match receiver.blocking_recv() {
Some(v) => Some(v),
None => {
error!("cluster_counts_per_state metrics channel closed");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the strings could be interpolated (so if the metric name changes, then so does the error). Not sure how feasible that is, so totally optional right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Development: Waiting for Review

Development

Successfully merging this pull request may close these issues.

trino-lb is stuck after redis master shuts down

2 participants