Skip to content

fix(pb-envoy): hard abort ws to tunnel on ping error#4776

Open
MasterPtato wants to merge 1 commit into04-24-fix_outbound_implement_drain_grace_periodfrom
04-24-fix_pb-envoy_hard_abort_ws_to_tunnel_on_ping_error
Open

fix(pb-envoy): hard abort ws to tunnel on ping error#4776
MasterPtato wants to merge 1 commit into04-24-fix_outbound_implement_drain_grace_periodfrom
04-24-fix_pb-envoy_hard_abort_ws_to_tunnel_on_ping_error

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@MasterPtato MasterPtato force-pushed the 04-24-fix_pb-envoy_hard_abort_ws_to_tunnel_on_ping_error branch from 3aa9cc4 to f537c84 Compare April 25, 2026 01:14
@MasterPtato MasterPtato force-pushed the 04-24-fix_outbound_implement_drain_grace_period branch from e50b3e3 to 5f322c0 Compare April 25, 2026 01:14
@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Code Review: fix(pb-envoy): hard abort ws to tunnel on ping error

Overview

This PR addresses three issues:

  1. actor_event_demuxer.rs: Adds a CancellationToken to cleanly shut down channel handler tasks when the demuxer is destroyed, and replaces unsafe task aborts during GC with natural channel closure.
  2. lib.rs: Hard-aborts ws_to_tunnel when the ping task errors, enforcing the invariant that an actor's KV is only accessed from one place at a time.
  3. namespace_runner_config.rs: Fixes a significant bug in the v4->v5 migration default for drain_grace_period.

Bug Fix: Critical issue in namespace_runner_config.rs

The change from 10_000 to 10 for drain_grace_period is the most impactful fix here. drain_grace_period is consumed as Duration::from_secs(drain_grace_period as u64) (see pegboard-outbound/src/lib.rs:565), so the old default of 10_000 would have set a drain grace period of ~2.78 hours for any namespace migrating from v4. The corrected value of 10 seconds correctly matches the legacy serverless_drain_grace_period default of 10,000 ms. This is a significant migration bug fix and deserves explicit mention in the PR description.


actor_event_demuxer.rs - clean and correct

  • The CancellationToken + DropGuard pattern is the right tool here. When the demuxer is dropped, all channel handlers exit cleanly without explicit abort calls.
  • Replacing channel.handle.abort() in the GC path with natural tx-side drop is safer: the task completes any in-flight dispatch_events call before exiting, rather than being torn down mid-signal. The removed TODO: Verify aborting is safe here comment confirms this was a known concern.
  • The tokio::select! in channel_handler correctly races channel closure against cancellation.

Minor concern: in shutdown(), channel handles previously removed by GC (their JoinHandle was dropped/detached) are not awaited. If a GC-expired task is mid-dispatch_events at shutdown time, the cancellation token helps it exit, but there is no back-pressure wait on those detached tasks. Pre-existing limitation rather than a regression.

Nit - comment grammar:

// This will drop the tx side of the channel and it's task will exit automatically.

it's should be its.


lib.rs - correctness fix, minor comment issue

The core change is correct: a ping failure means the actor-side WebSocket connection is broken, so in-flight KV requests via ws_to_tunnel must be terminated immediately. Sending the soft abort signal AND hard-aborting achieves this.

One thing worth confirming: does ws_to_tunnel_task hold any state requiring cleanup before abort (e.g. in-progress SQLite transactions, open file handles)? A hard abort via AbortHandle skips all cleanup code. If cleanup invariants exist, they should be caught at the task boundary or confirmed safe to skip.

The removal of the unused _tx2 clones is a clean simplification - those clones were only needed when the abort signals lived in separately spawned tasks without shared scope. Inline tokio::join! blocks share the outer scope directly.

Comment wording is unclear:

The comment reads: "This stops all in flight kv requests from being completed immediately." This phrasing reads as if it prevents them from completing quickly - the intent is the opposite (it immediately cancels them). The comment should be rewritten to clarify.


Summary

Area Assessment
drain_grace_period fix Critical bug fix: 10,000 s to 10 s is correct
CancellationToken approach Clean and idiomatic
Hard abort on ping failure Correct - confirm ws_to_tunnel has no cleanup invariants
Comment quality Minor grammar and clarity nits
Test coverage No tests added; ping-error path and demuxer cancellation would benefit from coverage

The PR description checklist is empty - worth filling in (noting this is a bug fix and which tests were run).

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Code Review

Overview

This PR has two distinct changes:

  1. Main fix (lib.rs, actor_event_demuxer.rs): On ping task failure, hard-abort the ws_to_tunnel task to enforce the invariant that an actor's KV store is only accessed from one place at a time.
  2. Bundled bug fix (namespace_runner_config.rs): Fix a unit confusion in the v4->v5 migration default for drain_grace_period.

lib.rs — Hard abort on ping failure

The core change is sound. When the ping task errors (connection timeout or dead actor), sending a soft abort via the watch channel is insufficient because ws_to_tunnel may be blocked mid-await inside a long-running KV operation. Without the hard abort, a new actor instance could start accessing the same KV prefix concurrently before the stale task exits. Using JoinHandle::abort_handle() is the correct tool here.

One minor comment issue:

"This stops all in flight kv requests from being completed immediately."

The phrasing is ambiguous — "immediately" reads as modifying "completed" rather than "stops". Also:

"an actor's KV is only being from one place at a time"

Missing "accessed". Suggested revision:

// If the ping task fails, hard-abort ws_to_tunnel immediately to cancel any
// in-flight KV requests. This enforces the invariant that an actor's KV is
// only accessed from one place at a time.

Cleanup win: Removing the three .clone() calls on the abort senders is valid — the tokio::join! async blocks capture by reference (not move), and watch::Sender::send takes &self, so shared borrows across the join arms work correctly.


actor_event_demuxer.rs — Graceful shutdown via CancellationToken

The design is correct and an improvement over the old handle.abort() approach:

  • Dropping the tx sender naturally drains in-flight channel work before the handler exits (recv_many returns 0 when all senders are gone), which is better than hard-aborting mid-dispatch.
  • The DropGuard approach ensures the cancellation token fires even if shutdown() is never called explicitly — a good defensive pattern.
  • In shutdown(), both paths (tx-dropped and cancellation) fire simultaneously, but that is harmless since both cause the handler to break its loop.

One note: the tokio::select! in channel_handler is unbiased by default, so if both the rx and the cancellation token are ready simultaneously, there is no preference for cancellation. For a shutdown path this is fine, but worth knowing.

Minor typo in comment: "it's task will exit automatically" should be "its task will exit automatically".


namespace_runner_config.rsdrain_grace_period fix

This is a significant correctness fix. The field is consumed as Duration::from_secs(drain_grace_period as u64) in pegboard-outbound/src/lib.rs:565, meaning the unit is seconds. The old migration default of 10_000 set a drain grace period of ~2.7 hours. The fix to 10 correctly matches the deprecated serverless_drain_grace_period config default (10,000 ms = 10 s).

This fix is important for any serverless runners provisioned via the v4->v5 migration path.


Summary

Area Verdict
Hard abort on ping failure Correct, necessary for KV safety
Demuxer cancellation token Cleaner than abort(), shutdown logic is correct
drain_grace_period fix Fixes a silent ~2.7-hour default regression
Comment quality Two comments have typos/missing words
Test coverage No new tests; a note on how the ping-error path was manually verified would help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant