Skip to content

test(engine): cover envoy Sleep crash policy wake and recovery paths#4816

Closed
NathanFlurry wants to merge 1 commit intoengine-stabilize/sqlite-storage-vbare-protocolfrom
engine-stabilize/envoy-sleep-crash-policy-tests
Closed

test(engine): cover envoy Sleep crash policy wake and recovery paths#4816
NathanFlurry wants to merge 1 commit intoengine-stabilize/sqlite-storage-vbare-protocolfrom
engine-stabilize/envoy-sleep-crash-policy-tests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Code Review: PR #4816test(engine): cover envoy Sleep crash policy wake and recovery paths

Summary

This PR adds three new integration tests for the envoy actor sleep crash policy, plus a spawn_wake_ping helper:

  1. envoy_crash_policy_sleep_wakes_on_request — verifies that a sleeping actor (put to sleep after its first crash) is woken up and becomes connectable when an HTTP request arrives.
  2. envoy_crash_policy_sleep_recovers_after_wake_crash — verifies that an actor that crashes twice still eventually becomes connectable (the recovery path survives a second crash after wake).
  3. envoy_crash_policy_sleep_persists_on_creation — verifies that CrashPolicy::Sleep set at creation time is durably persisted on the actor record.

All code is in engine/packages/engine/tests/envoy/actors_lifecycle.rs. There is no production code change.


Code Quality

Overall: The new tests are well-structured and follow the existing file conventions (same common::run driver, same Arc<Mutex<usize>> pattern for crash counting, same poll-with-timeout loops, same tracing calls). The spawn_wake_ping helper is clearly motivated.


Issues Found

Minor: Unbounded inner poll loops produce unhelpful timeout messages

The "wait for crash-induced sleep" loop in both envoy_crash_policy_sleep_wakes_on_request and envoy_crash_policy_sleep_recovers_after_wake_crash has no timeout:

loop {
    let actor = common::try_get_actor(...)...;
    if actor.sleep_ts.is_some() && actor.connectable_ts.is_none() {
        break;
    }
    tokio::time::sleep(tokio::time::Duration::from_millis(50)).await;
}

If the actor never enters sleep (e.g., due to a regression), the test hangs until the outer wall-clock limit fires, producing a generic "test timed out" failure with no actionable message. Wrapping this inner loop with tokio::time::timeout(...) would give a clearer failure message. The existing envoy_crash_policy_sleep test has the same weakness, but it's worth flagging in new tests.

Nit: Redundant asserts after loop exit

In both envoy_crash_policy_sleep_wakes_on_request and envoy_crash_policy_sleep_recovers_after_wake_crash, the outer wait loop only breaks when actor.connectable_ts.is_some(), so the immediately following assert:

assert!(
    actor.connectable_ts.is_some(),
    "actor should be connectable after waking from sleep"
);

is always true at that point and can never fail. The meaningful check is the panic inside the timeout branch. These asserts can be removed to reduce noise (though harmless).

Nit: Em dash in comment violates CLAUDE.md style

In envoy_crash_policy_sleep_recovers_after_wake_crash:

// recovery path tolerates additional crashes after wake without
// abandoning the actor — eventual success is what the test asserts on,

Per CLAUDE.md: "Do not use em dashes (—). Use periods to separate sentences instead." This should be rewritten as two sentences.

Info: reqwest::Client::new() per-spawn in spawn_wake_ping

Creating a new reqwest::Client inside each spawned task skips connection pooling. For a single fire-and-forget test request this is fine and consistent with existing test code in this file, but worth noting.


Subtle Assumption in envoy_crash_policy_sleep_persists_on_creation

This test creates an actor and immediately calls assert_actor_exists (a single-shot read, no polling) without waiting for the actor to be scheduled. This assumes the actors_create endpoint returns only after the actor record is durably readable via the list API. This should hold in practice, but it's a subtle invariant worth documenting if the behavior ever changes.


Test Coverage Assessment

The three tests cover exactly the gap described in the PR title. The existing envoy_crash_policy_sleep only verified the actor enters sleep state after a crash; it did not verify wake-on-request, recovery after wake-crash, or persistence of the crash policy field. Coverage is appropriate.

The design choice to fire the wake request in the background via spawn_wake_ping (fire-and-forget) is correct: the test cares about the wake effect (actor becomes connectable), not whether the triggering request succeeds, since the actor may crash again before responding.


Summary

Severity Issue
Minor Two inner poll loops lack a timeout, producing unhelpful failure messages on regression.
Nit Redundant assert!(actor.connectable_ts.is_some()) after loops that already guarantee this.
Nit Em dash () in a comment block, violating CLAUDE.md style.
Info reqwest::Client::new() per-spawn is consistent with existing test code but not ideal.

None of the issues are blockers. This is a clean, well-motivated addition of missing test coverage. Recommend merging after addressing the em dash nit (the only CLAUDE.md violation) and optionally wrapping the unbounded inner loops with timeouts for better failure diagnostics.

@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-sleep-crash-policy-tests branch 3 times, most recently from df86b9b to b0976d0 Compare April 27, 2026 19:40
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/sqlite-storage-vbare-protocol branch 2 times, most recently from 3bb9717 to 9b501ed Compare April 27, 2026 20:07
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-sleep-crash-policy-tests branch from b0976d0 to 0e9363a Compare April 27, 2026 20:07
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/sqlite-storage-vbare-protocol branch from 9b501ed to 58ff56b Compare April 27, 2026 20:30
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-sleep-crash-policy-tests branch 2 times, most recently from 2861431 to ff021cb Compare April 27, 2026 20:48
@NathanFlurry NathanFlurry marked this pull request as ready for review April 27, 2026 20:49
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-sleep-crash-policy-tests branch from ff021cb to b9734d6 Compare April 27, 2026 21:29
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/sqlite-storage-vbare-protocol branch from 3d61f72 to 6d1d4e3 Compare April 27, 2026 21:29
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/sqlite-storage-vbare-protocol branch from 6d1d4e3 to ee84fb7 Compare April 27, 2026 21:38
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-sleep-crash-policy-tests branch from b9734d6 to 1109409 Compare April 27, 2026 21:38
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-sleep-crash-policy-tests branch from 1109409 to 48a6f5b Compare April 27, 2026 22:03
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/sqlite-storage-vbare-protocol branch from ee84fb7 to 0c85098 Compare April 27, 2026 22:03
@NathanFlurry
Copy link
Copy Markdown
Member Author

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.

@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4816

All packages published as 0.0.0-pr.4816.86daff8 with tag pr-4816.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-86daff8
docker pull rivetdev/engine:full-86daff8
Individual packages
npm install rivetkit@pr-4816
npm install @rivetkit/react@pr-4816
npm install @rivetkit/rivetkit-napi@pr-4816
npm install @rivetkit/workflow-engine@pr-4816

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant