Skip to content

test(engine): add regression test for alarm-during-sleep-transition race#4755

Merged
NathanFlurry merged 1 commit intomainfrom
sleep-cleanup/10-alarm-overdue-during-sleep-test
Apr 27, 2026
Merged

test(engine): add regression test for alarm-during-sleep-transition race#4755
NathanFlurry merged 1 commit intomainfrom
sleep-cleanup/10-alarm-overdue-during-sleep-test

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 24, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

PR Review: Regression Test for Alarm-During-Sleep-Transition Race

Overview

Adds a single regression test alarm_overdue_during_sleep_transition_fires_via_reallocation in engine/packages/engine/tests/runner/actors_alarm.rs to guard against the alarm-during-sleep-transition race. The scenario: a near-overdue alarm is silently dropped when Decision::Sleep clears state.alarm_ts without first checking whether the alarm has already fired.

The test reuses existing infrastructure (AlarmAndSleepOnceActor, wait_for_actor_wake_from_alarm, setup_runner, create_actor) with a 100ms alarm offset to create the timing pressure needed to hit the race window.


Issues

1. #[ignore] renders this an inert regression guard

The test never runs in CI, so if the overdue-alarm reallocation path in Decision::Sleep is later reverted there is no automated catch. Other #[ignore]-annotated tests in the file (envoy_at_max_capacity) are ignored for infrastructure capacity reasons; this one is different — the fix is in production logic that should stay covered.

Options:

  • Remove #[ignore] and bump the alarm offset to ~500ms to give more margin. The test takes at most 10–15 s, which is reasonable for CI.
  • If the race window is too non-deterministic for reliable CI, at minimum add a .agent/todo entry so the gap is not invisible.

2. Lifecycle subscription set up after the race window opens

lifecycle_rx is created after ready_rx.await. With a 100ms alarm, the actor can complete the sleep transition, receive the overdue-alarm reallocation, restart at generation 1, and emit Started { generation: 1 } — all before subscribe_lifecycle_events() is called. In that case the test hangs until the 10 s timeout fires.

The existing alarm_fires_at_correct_time avoids this by subscribing after wait_for_actor_sleep so the only outstanding event is the wake. Here the subscription should be set up before ready_rx.await (or before create_actor) so no lifecycle events can be missed:

let lifecycle_rx = runner.subscribe_lifecycle_events();

let res = common::create_actor(...).await;
let actor_id = res.actor.actor_id.to_string();
ready_rx.await.expect("actor should send ready signal");

wait_for_actor_wake_from_alarm(lifecycle_rx, &actor_id, 1, 10)
    .await
    .expect("...");

3. Minor: inline comment duplication

The comment block just before wait_for_actor_wake_from_alarm restates what the .expect(...) message already explains. It can be removed to reduce noise.


Positives

  • The doc comment on the test function is thorough and correctly describes the before/after behaviour.
  • Reuse of AlarmAndSleepOnceActor (gen-0 sleeps, gen-1 stays awake) is the right fixture for this scenario.
  • The 10 s TestOpts timeout is appropriate for a 100ms alarm with scheduling slack.
  • tracing::info!, Arc<Mutex<Option<Sender>>>, and structured field usage all follow file conventions.

@NathanFlurry NathanFlurry force-pushed the sleep-cleanup/10-alarm-overdue-during-sleep-test branch from 50500b5 to d151463 Compare April 24, 2026 13:16
@NathanFlurry NathanFlurry force-pushed the sleep-cleanup/09-rust-sleep-tests branch from ed93ef7 to 8d93174 Compare April 24, 2026 13:16
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

Preview packages published to npm

Install with:

npm install rivetkit@pr-4755

All packages published as 0.0.0-pr.4755.64f52bc with tag pr-4755.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-64f52bc
docker pull rivetdev/engine:full-64f52bc
Individual packages
npm install rivetkit@pr-4755
npm install @rivetkit/react@pr-4755
npm install @rivetkit/rivetkit-napi@pr-4755
npm install @rivetkit/workflow-engine@pr-4755

Base automatically changed from sleep-cleanup/09-rust-sleep-tests to main April 27, 2026 07:13
@NathanFlurry NathanFlurry merged commit 5322de5 into main Apr 27, 2026
18 of 22 checks passed
@NathanFlurry NathanFlurry deleted the sleep-cleanup/10-alarm-overdue-during-sleep-test branch April 27, 2026 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant