Skip to content

fix(rivetkit): share lifecycle shutdown grace#4810

Closed
NathanFlurry wants to merge 1 commit intomainfrom
codex/lifecycle-grace-fix
Closed

fix(rivetkit): share lifecycle shutdown grace#4810
NathanFlurry wants to merge 1 commit intomainfrom
codex/lifecycle-grace-fix

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

This was referenced Apr 27, 2026
Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

PR Review: fix(rivetkit): share lifecycle shutdown grace

Summary

The core fix is correct and well-motivated: collapsing the per-phase shutdown budgets (run handler, onSleep/onDestroy, shutdown tasks) into a single shared wall-clock grace period prevents the total actor teardown time from exceeding the engine's hard-kill threshold. The Rust task_lifecycle.rs test uses start_paused = true + tokio::time::advance for deterministic behavior — a good pattern.


Issues to Address Before Merge

1. _ => {} on a closed enum in task_lifecycle.rs (~line 88) — CLAUDE.md violation

match event {
    ActorEvent::SerializeState { reply, .. } => { ... }
    ActorEvent::RunGracefulCleanup { reply, .. } => { ... }
    _ => {}  // silently drops all other variants
}

CLAUDE.md explicitly prohibits _ => fall-through on Rust enums. ActorEvent is a closed internal enum with many variants. Silent drops could mask future bugs. Use _ => unreachable!() or enumerate all expected silent variants explicitly.

2. #waitForRunHandler receives a pre-computed duration instead of an absolute deadline (mod.ts ~line 980)

const shutdownTaskDeadlineTs = Date.now() + this.#getEffectiveSleepGracePeriod();
await this.#waitForRunHandler(shutdownTaskDeadlineTs - Date.now()); // fragile

Every other shutdown helper (#callOnSleep, #callOnDestroy, #waitShutdownTasks, #waitForIdleSleepWindow) receives the absolute deadlineTs and computes remaining budget internally. #waitForRunHandler computing remaining at the call site means any code inserted between the two lines silently eats into the budget. Change the signature to accept deadlineTs: number for consistency and robustness.

3. runStopTimeout override is still set in actor-driver.ts (~line 1066) but no longer consumed

Before this PR, runStopTimeout gave the run handler its own bounded budget. Now the run handler uses the shared grace period, but staticActor.overrides.runStopTimeout = drainMax is still being set. Either remove the dead assignment or document why it is retained.

4. Rust destroy-grace path still uses effective_on_destroy_timeout() while TS uses sleepGracePeriod for both modes

In task.rs start_grace:

ShutdownKind::Destroy => self.factory.config().effective_on_destroy_timeout(),

The TS side now uses sleepGracePeriod as the shared budget for both sleep and destroy. The Rust destroy path uses a separate timeout. The new test only covers ShutdownKind::Sleep. If this asymmetry is intentional, add a comment. If unintentional, add a ShutdownKind::Destroy symmetric test.

5. Test coverage regression from removing lifecycle.rs

The deleted lifecycle.rs contained 13 tests across startup, sleep-shutdown, and destroy-shutdown paths. The new task_lifecycle.rs provides 2 tests. Scenarios not visibly covered elsewhere include:

  • startup_catches_run_handler_panics
  • startup_restores_connections_and_processes_overdue_events
  • startup_runs_on_migrate_before_on_wake_for_new_and_restored_actors
  • sleep_shutdown_reports_error_when_on_sleep_fails
  • destroy_shutdown_skips_idle_wait_and_disconnects_all_connections
  • destroy_shutdown_reports_error_when_on_destroy_fails

Please confirm these are covered elsewhere, or add coverage for the destroy-mode equivalents at minimum.


Minor

  • std::sync::Mutex in task_lifecycle.rs: NotifyOnDrop uses Drop (a forced-sync context), so parking_lot::Mutex is correct per CLAUDE.md rather than std::sync::Mutex.
  • The fixture rename to use sleepGracePeriod: 100 clearly reflects the new behavior.
  • The actor-driver.ts wiring of sleepGracePeriod as an override is the right approach.

Reviewed using the repository's CLAUDE.md conventions.

@NathanFlurry NathanFlurry force-pushed the codex/lifecycle-grace-fix branch from 32b1dc2 to 228f064 Compare April 27, 2026 19:06
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4810 to 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow April 27, 2026 19:06
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from a9093f9 to c1976e3 Compare April 27, 2026 20:48
@NathanFlurry NathanFlurry force-pushed the codex/lifecycle-grace-fix branch from 228f064 to ad4ef06 Compare April 27, 2026 20:48
@NathanFlurry NathanFlurry force-pushed the codex/lifecycle-grace-fix branch from ad4ef06 to 57d5a84 Compare April 27, 2026 22:10
Base automatically changed from 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow to main April 27, 2026 22:10
@NathanFlurry
Copy link
Copy Markdown
Member Author

Landed in main as ad4ef06 (identical patch, different parent SHA due to rebase). Closing.

@NathanFlurry NathanFlurry deleted the codex/lifecycle-grace-fix branch April 27, 2026 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant