fix(pb-envoy): hard abort ws to tunnel on ping error by MasterPtato · Pull Request #4776 · rivet-dev/rivet

MasterPtato · 2026-04-25T01:08:15Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

MasterPtato · 2026-04-25T01:08:31Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-04-25T01:16:56Z

Code Review: fix(pb-envoy): hard abort ws to tunnel on ping error

Overview

This PR addresses three issues:

actor_event_demuxer.rs: Adds a CancellationToken to cleanly shut down channel handler tasks when the demuxer is destroyed, and replaces unsafe task aborts during GC with natural channel closure.
lib.rs: Hard-aborts ws_to_tunnel when the ping task errors, enforcing the invariant that an actor's KV is only accessed from one place at a time.
namespace_runner_config.rs: Fixes a significant bug in the v4->v5 migration default for drain_grace_period.

Bug Fix: Critical issue in `namespace_runner_config.rs`

The change from 10_000 to 10 for drain_grace_period is the most impactful fix here. drain_grace_period is consumed as Duration::from_secs(drain_grace_period as u64) (see pegboard-outbound/src/lib.rs:565), so the old default of 10_000 would have set a drain grace period of ~2.78 hours for any namespace migrating from v4. The corrected value of 10 seconds correctly matches the legacy serverless_drain_grace_period default of 10,000 ms. This is a significant migration bug fix and deserves explicit mention in the PR description.

`actor_event_demuxer.rs` - clean and correct

The CancellationToken + DropGuard pattern is the right tool here. When the demuxer is dropped, all channel handlers exit cleanly without explicit abort calls.
Replacing channel.handle.abort() in the GC path with natural tx-side drop is safer: the task completes any in-flight dispatch_events call before exiting, rather than being torn down mid-signal. The removed TODO: Verify aborting is safe here comment confirms this was a known concern.
The tokio::select! in channel_handler correctly races channel closure against cancellation.

Minor concern: in shutdown(), channel handles previously removed by GC (their JoinHandle was dropped/detached) are not awaited. If a GC-expired task is mid-dispatch_events at shutdown time, the cancellation token helps it exit, but there is no back-pressure wait on those detached tasks. Pre-existing limitation rather than a regression.

Nit - comment grammar:

// This will drop the tx side of the channel and it's task will exit automatically.

it's should be its.

`lib.rs` - correctness fix, minor comment issue

The core change is correct: a ping failure means the actor-side WebSocket connection is broken, so in-flight KV requests via ws_to_tunnel must be terminated immediately. Sending the soft abort signal AND hard-aborting achieves this.

One thing worth confirming: does ws_to_tunnel_task hold any state requiring cleanup before abort (e.g. in-progress SQLite transactions, open file handles)? A hard abort via AbortHandle skips all cleanup code. If cleanup invariants exist, they should be caught at the task boundary or confirmed safe to skip.

The removal of the unused _tx2 clones is a clean simplification - those clones were only needed when the abort signals lived in separately spawned tasks without shared scope. Inline tokio::join! blocks share the outer scope directly.

Comment wording is unclear:

The comment reads: "This stops all in flight kv requests from being completed immediately." This phrasing reads as if it prevents them from completing quickly - the intent is the opposite (it immediately cancels them). The comment should be rewritten to clarify.

Summary

Area	Assessment
`drain_grace_period` fix	Critical bug fix: 10,000 s to 10 s is correct
`CancellationToken` approach	Clean and idiomatic
Hard abort on ping failure	Correct - confirm `ws_to_tunnel` has no cleanup invariants
Comment quality	Minor grammar and clarity nits
Test coverage	No tests added; ping-error path and demuxer cancellation would benefit from coverage

The PR description checklist is empty - worth filling in (noting this is a bug fix and which tests were run).

claude · 2026-04-25T01:19:21Z

Code Review

Overview

This PR has two distinct changes:

Main fix (lib.rs, actor_event_demuxer.rs): On ping task failure, hard-abort the ws_to_tunnel task to enforce the invariant that an actor's KV store is only accessed from one place at a time.
Bundled bug fix (namespace_runner_config.rs): Fix a unit confusion in the v4->v5 migration default for drain_grace_period.

`lib.rs` — Hard abort on ping failure

The core change is sound. When the ping task errors (connection timeout or dead actor), sending a soft abort via the watch channel is insufficient because ws_to_tunnel may be blocked mid-await inside a long-running KV operation. Without the hard abort, a new actor instance could start accessing the same KV prefix concurrently before the stale task exits. Using JoinHandle::abort_handle() is the correct tool here.

One minor comment issue:

"This stops all in flight kv requests from being completed immediately."

The phrasing is ambiguous — "immediately" reads as modifying "completed" rather than "stops". Also:

"an actor's KV is only being from one place at a time"

Missing "accessed". Suggested revision:

// If the ping task fails, hard-abort ws_to_tunnel immediately to cancel any
// in-flight KV requests. This enforces the invariant that an actor's KV is
// only accessed from one place at a time.

Cleanup win: Removing the three .clone() calls on the abort senders is valid — the tokio::join! async blocks capture by reference (not move), and watch::Sender::send takes &self, so shared borrows across the join arms work correctly.

`actor_event_demuxer.rs` — Graceful shutdown via CancellationToken

The design is correct and an improvement over the old handle.abort() approach:

Dropping the tx sender naturally drains in-flight channel work before the handler exits (recv_many returns 0 when all senders are gone), which is better than hard-aborting mid-dispatch.
The DropGuard approach ensures the cancellation token fires even if shutdown() is never called explicitly — a good defensive pattern.
In shutdown(), both paths (tx-dropped and cancellation) fire simultaneously, but that is harmless since both cause the handler to break its loop.

One note: the tokio::select! in channel_handler is unbiased by default, so if both the rx and the cancellation token are ready simultaneously, there is no preference for cancellation. For a shutdown path this is fine, but worth knowing.

Minor typo in comment: "it's task will exit automatically" should be "its task will exit automatically".

`namespace_runner_config.rs` — `drain_grace_period` fix

This is a significant correctness fix. The field is consumed as Duration::from_secs(drain_grace_period as u64) in pegboard-outbound/src/lib.rs:565, meaning the unit is seconds. The old migration default of 10_000 set a drain grace period of ~2.7 hours. The fix to 10 correctly matches the deprecated serverless_drain_grace_period config default (10,000 ms = 10 s).

This fix is important for any serverless runners provisioned via the v4->v5 migration path.

Summary

Area	Verdict
Hard abort on ping failure	Correct, necessary for KV safety
Demuxer cancellation token	Cleaner than abort(), shutdown logic is correct
`drain_grace_period` fix	Fixes a silent ~2.7-hour default regression
Comment quality	Two comments have typos/missing words
Test coverage	No new tests; a note on how the ping-error path was manually verified would help

MasterPtato requested a review from NathanFlurry April 25, 2026 01:08

MasterPtato mentioned this pull request Apr 25, 2026

fix(outbound): implement drain grace period #4775

Open

11 tasks

fix(pb-envoy): hard abort ws to tunnel on ping error

f537c84

MasterPtato force-pushed the 04-24-fix_pb-envoy_hard_abort_ws_to_tunnel_on_ping_error branch from 3aa9cc4 to f537c84 Compare April 25, 2026 01:14

MasterPtato force-pushed the 04-24-fix_outbound_implement_drain_grace_period branch from e50b3e3 to 5f322c0 Compare April 25, 2026 01:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pb-envoy): hard abort ws to tunnel on ping error#4776

fix(pb-envoy): hard abort ws to tunnel on ping error#4776
MasterPtato wants to merge 1 commit into04-24-fix_outbound_implement_drain_grace_periodfrom
04-24-fix_pb-envoy_hard_abort_ws_to_tunnel_on_ping_error

MasterPtato commented Apr 25, 2026

Uh oh!

MasterPtato commented Apr 25, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 25, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MasterPtato commented Apr 25, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

MasterPtato commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: fix(pb-envoy): hard abort ws to tunnel on ping error

Overview

Bug Fix: Critical issue in namespace_runner_config.rs

actor_event_demuxer.rs - clean and correct

lib.rs - correctness fix, minor comment issue

Summary

Uh oh!

claude Bot commented Apr 25, 2026

Code Review

Overview

lib.rs — Hard abort on ping failure

actor_event_demuxer.rs — Graceful shutdown via CancellationToken

namespace_runner_config.rs — drain_grace_period fix

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MasterPtato commented Apr 25, 2026 •

edited

Loading

claude Bot commented Apr 25, 2026 •

edited

Loading

Bug Fix: Critical issue in `namespace_runner_config.rs`

`actor_event_demuxer.rs` - clean and correct

`lib.rs` - correctness fix, minor comment issue

`lib.rs` — Hard abort on ping failure

`actor_event_demuxer.rs` — Graceful shutdown via CancellationToken

`namespace_runner_config.rs` — `drain_grace_period` fix