Skip to content

fix(rivetkit): refresh stale actor handles#4794

Closed
NathanFlurry wants to merge 1 commit into04-24-replay/rivetkit-napi-private-test-layoutfrom
04-24-replay/ts-client-stale-actor-retry
Closed

fix(rivetkit): refresh stale actor handles#4794
NathanFlurry wants to merge 1 commit into04-24-replay/rivetkit-napi-private-test-layoutfrom
04-24-replay/ts-client-stale-actor-retry

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

PR #4794 Review: fix(rivetkit): refresh stale actor handles

Files changed: 2 TypeScript files, 110 additions, 61 deletions.

Summary: Wraps #sendQueueMessage in an AsyncMutex to serialize concurrent queue sends; adds #shouldRetryQueueDispatchOverload to retry on actor/overloaded; adds not_configured and dropped_reply to the stale-actor error set in both actor-handle.ts and actor-query.ts.


Issues Found

1. Severe throughput regression: mutex serializes all concurrent queue sends

The entire #sendQueueMessage body, including the retry loop and each #waitForRetryWindow() call, runs inside #queueSendMutex.run(...). All concurrent send() calls on the same ActorHandleRaw instance are now fully serialized — happy-path sends (no retries needed) still queue up behind each other.

#waitForRetryWindow() sleeps 100 ms per attempt; with max attempt counts of 24 (getForKey) and 60 (getOrCreate), a single failing send can hold the mutex for up to 2.4-6 seconds, blocking all concurrent senders for that duration. ActorConnRaw.#sendQueueMessage in actor-conn.ts has no such mutex, creating a surprising asymmetry.

The mutex is only needed to protect shared mutable state #resolvedActorId / #resolvingActorId during retry paths. Consider a narrower fix: acquire the mutex only on the retry/re-resolve path (around #clearResolvedActorId + re-resolution), not the entire send loop.

2. #shouldRetryQueueDispatchOverload does not clear the cached actor ID

When this returns true, #clearResolvedActorId() is never called. If an actor is overloaded because it is shutting down, the client will keep hammering the same instance until maxAttempts is exhausted rather than re-resolving. Contrast with #shouldRetryDynamicLifecycleError, which always clears the cached ID and sets useQueryTarget = true. Since overload retries now fire before lifecycle retries, this can exhaust all attempts before re-resolution is attempted.

3. dropped_reply in stale-handle detection may cause unnecessary actor recreation

actor/dropped_reply fires when a Rust Reply channel is dropped before a response is sent — this can happen during normal in-progress shutdown. Adding it to the stale-error set causes #resolvedActorId to be cleared and retried with useQueryTarget = true. For getOrCreate, this could inadvertently create a second actor if the first is mid-destruction but not yet removed from the registry. Additionally, dispatch.rs server-side already suppresses dropped_reply from workflow dispatch results (Ok((false, None))), making client-side and server-side handling asymmetric. A comment explaining why dropped_reply should trigger handle refresh would help.

4. not_configured in the lifecycle retry set is overly broad

actor/not_configured is returned when a capability like SQLite or WebSocket is not wired up — a permanent configuration error, not a transient lifecycle state. Retrying it will consume all attempts (up to 60 x 100 ms = 6 seconds) before surfacing the error, misleading developers about what went wrong. All other codes in this list (starting, stopping, destroyed_*, not_found) are transient. Either exclude not_configured, or add a comment explaining why retrying is expected to eventually succeed.

5. #sendActionNow and #fetchWithResolvedActor still have the race condition

#sendQueueMessage is now serialized via mutex, but the action() and fetch() paths also mutate #resolvedActorId via #clearResolvedActorId() / #invalidateResolvedActorId() and still have the original race. If the motivation was protecting shared resolved-actor state, the mutex should be extended to cover those methods or the resolution should be addressed uniformly.

6. AsyncMutex imported from a database module

import { AsyncMutex } from "@/common/database/shared";

AsyncMutex is a general-purpose utility living alongside SQLite binding helpers. Importing it into a client-facing networking module creates unexpected coupling. Consider moving it to @/common/utils or re-exporting from a neutral utilities entry point.

7. No tests for new behavior

No new or updated tests cover:

  • Serialization behavior of the new #queueSendMutex (concurrent sends are ordered; second send sees updated actor ID from first retry).
  • #shouldRetryQueueDispatchOverload retry logic on actor/overloaded.
  • New stale-error codes not_configured and dropped_reply triggering handle refresh.

Minor / Non-blocking

  • Retry ordering shift is undocumented. shouldRetryQueueDispatchOverload now fires before shouldRetryDynamicLifecycleError. An inline comment explaining the priority order (overload retries then lifecycle retries then stale-ID invalidation) would make this easier to maintain.
  • isDynamicActorQuery guard in #shouldRetryQueueDispatchOverload is redundant for static queries (getForId returns maxAttempts = 1, making attempt >= maxAttempts - 1 always true). The guard is correct but a clarifying comment would help.
  • PR description is empty. Type-of-change, test plan, and checklist are all unfilled, making it hard to understand the intended behavior and verification approach.

@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4794

All packages published as 0.0.0-pr.4794.efd07f2 with tag pr-4794.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-efd07f2
docker pull rivetdev/engine:full-efd07f2
Individual packages
npm install rivetkit@pr-4794
npm install @rivetkit/react@pr-4794
npm install @rivetkit/rivetkit-napi@pr-4794
npm install @rivetkit/workflow-engine@pr-4794

@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from f582eeb to 9634c9b Compare April 27, 2026 03:27
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/rivetkit-napi-private-test-layout branch 2 times, most recently from 7cd073f to 8691b85 Compare April 27, 2026 03:59
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from 9634c9b to 0ad9786 Compare April 27, 2026 03:59
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/rivetkit-napi-private-test-layout branch from 8691b85 to f2496c3 Compare April 27, 2026 04:35
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from 0ad9786 to 02bf749 Compare April 27, 2026 04:35
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/rivetkit-napi-private-test-layout branch from f2496c3 to bb40216 Compare April 27, 2026 05:37
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from 02bf749 to 8bd50ce Compare April 27, 2026 05:37
@NathanFlurry NathanFlurry marked this pull request as ready for review April 27, 2026 05:56
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from 8bd50ce to ccd2fc4 Compare April 27, 2026 07:30
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/rivetkit-napi-private-test-layout branch 2 times, most recently from 86aa51c to dcc5feb Compare April 27, 2026 07:57
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch 2 times, most recently from 1b393b1 to 77a93a5 Compare April 27, 2026 08:31
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from 77a93a5 to d5cb423 Compare April 27, 2026 17:35
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/rivetkit-napi-private-test-layout branch from abf3c28 to ab43d8c Compare April 27, 2026 19:06
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch 2 times, most recently from bbd3d54 to f84be28 Compare April 27, 2026 19:40
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/rivetkit-napi-private-test-layout branch from ab43d8c to 9ec450d Compare April 27, 2026 19:40
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from f84be28 to f589b15 Compare April 27, 2026 20:48
@NathanFlurry NathanFlurry force-pushed the 04-24-replay/ts-client-stale-actor-retry branch from f589b15 to dcb690f Compare April 27, 2026 21:38
@NathanFlurry
Copy link
Copy Markdown
Member Author

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant