Skip to content

fix(envoy-client): dedupe replayed commands by index#4811

Closed
NathanFlurry wants to merge 2 commits intoengine-stabilize/envoy-client-lifecyclefrom
engine-stabilize/envoy-client-command-dedup
Closed

fix(envoy-client): dedupe replayed commands by index#4811
NathanFlurry wants to merge 2 commits intoengine-stabilize/envoy-client-lifecyclefrom
engine-stabilize/envoy-client-command-dedup

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Stack Context

Builds on #4801 (engine-stabilize/envoy-client-lifecycle), which already added
an immediate send_command_ack after every handle_commands batch. This PR
closes the residual race where a command arrives, is processed, but the ack
frame doesn't reach the engine before the WS drops — on reconnect,
pegboard-envoy re-streams the same command from UDB and the envoy processes
it twice.

Why?

EnvoyContext::handle_commands had no dedup on the receive path. Two concrete
bugs:

  1. Replayed CommandStartActor clobbered live actors. insert_actor
    unconditionally re-inserts into ctx.actors[actor_id][generation],
    dropping the existing handle, event_history, and active_http_request_count
    on the floor. The spawned actor task was orphaned.
  2. Replayed CommandStopActor re-fired ToActor::Stop. Idempotent in
    best case, double-shutdown signal in worst case.

ActorEntry::last_command_idx already existed but was only read by
send_command_ack — never checked against incoming command indices.

What?

  • EnvoyContext gains processed_command_idx: HashMap<(String, u32), i64>.
    Persists across remove_actor so a replayed start for an
    already-stopped actor cannot resurrect it.
  • handle_commands skips any command whose checkpoint.index <= last_processed
    for the (actor_id, generation) pair, then records the new index before
    dispatching.
  • New integration test under tests/command_dedup.rs covers same-index replay,
    stale lower-index replay, monotonic processing, and per-actor / per-generation
    isolation. Lives in tests/ rather than inline #[cfg(test)] mod tests
    because a pre-existing 7-vs-8 arg mismatch in actor.rs test code blocks the
    lib-test target on this branch — the integration test target compiles
    independently.

Test plan

  • cargo check -p rivet-envoy-client
  • cargo test -p rivet-envoy-client --test command_dedup (2 passed)

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

PR Review: fix(envoy-client): dedupe replayed commands by index

Overall: The fix is correct and well-targeted. The two concrete bugs (replayed CommandStartActor clobbering live actors, replayed CommandStopActor re-firing ToActor::Stop) were real. A context-level dedup map that survives remove_actor is the right design. A few issues are worth addressing before merging.


Issues

1. Stopped actors are never acked after removal

send_command_ack iterates only ctx.actors, so once an actor is removed via remove_actor, its final command index is never acked to the engine. With processed_command_idx intentionally surviving removal, the engine will keep replaying the stop command on every reconnect — the dedup silences the double-dispatch locally, but the engine's replay loop stays open indefinitely.

Concretely: a CommandStopActor at index 5 is processed, processed_command_idx[("a", 1)] = 5 is set, remove_actor cleans up ctx.actors["a"], but send_command_ack never emits a checkpoint for ("a", 1). On reconnect the engine re-sends index 5, the dedup drops it — but the cycle repeats forever.

Fix: After the ctx.actors loop in send_command_ack, also emit checkpoints for entries in processed_command_idx that aren't already covered by a live actor.

2. Two parallel tracking mechanisms can diverge

ActorEntry::last_command_idx and ctx.processed_command_idx are now both updated for commands but independently. last_command_idx is pruned when an actor is removed; processed_command_idx persists. The two mechanisms serve different roles (ack source vs. replay filter), but without a comment this distinction is invisible and will invite future drift. A brief comment clarifying which is the ack source and which is the replay filter would reduce confusion.

3. Missing test: CommandStartActor replay

The PR description identifies replayed CommandStartActor as the primary bug, but command_dedup.rs only adds tests for CommandStopActor. A test asserting that a replayed start command does not re-create the actor would directly cover bug #1 and prevent regression. Currently the critical path has no automated coverage.

4. Unbounded map growth

processed_command_idx is never pruned — one entry per (actor_id, generation) pair ever seen. In practice this is bounded by the total actors ever hosted per envoy process, but a comment noting this assumption (or a size cap for long-lived envoys cycling through many actors) would be prudent.


Minor

  • tracing::debug! uses actor_id = %checkpoint.actor_id (Display) but last_idx as a bare positional value. Per CLAUDE.md conventions, prefer named structured fields: last_idx = last_idx or ?last_idx.

  • std::collections::HashMap in EnvoyContext is correct here — EnvoyContext is owned exclusively by the single envoy_loop task, so the scc::HashMap rule (for shared concurrent maps) does not apply.

  • Tests correctly live under tests/ rather than inline #[cfg(test)], matching CLAUDE.md conventions.

  • The pre-existing 7-vs-8 arg mismatch in actor.rs test code noted in the PR description should be tracked as a follow-up to keep the test surface coherent after this lands.

@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-command-dedup branch 2 times, most recently from e4a53d3 to a6d7cb6 Compare April 27, 2026 19:40
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-lifecycle branch from 31dfe70 to 45f80a0 Compare April 27, 2026 19:40
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-command-dedup branch from a6d7cb6 to 93e48e9 Compare April 27, 2026 20:48
@NathanFlurry NathanFlurry marked this pull request as ready for review April 27, 2026 20:49
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-command-dedup branch from 93e48e9 to 4ef7f4e Compare April 27, 2026 21:29
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-lifecycle branch from 3dbc40f to 1cd5bb4 Compare April 27, 2026 21:29
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-lifecycle branch from 1cd5bb4 to d5c12d7 Compare April 27, 2026 21:38
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-command-dedup branch from 4ef7f4e to 71ea925 Compare April 27, 2026 21:38
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-lifecycle branch from d5c12d7 to 436dba7 Compare April 27, 2026 22:03
@NathanFlurry NathanFlurry force-pushed the engine-stabilize/envoy-client-command-dedup branch from 71ea925 to 058a579 Compare April 27, 2026 22:03
@NathanFlurry
Copy link
Copy Markdown
Member Author

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.

@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4811

All packages published as 0.0.0-pr.4811.6b7e318 with tag pr-4811.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-6b7e318
docker pull rivetdev/engine:full-6b7e318
Individual packages
npm install rivetkit@pr-4811
npm install @rivetkit/react@pr-4811
npm install @rivetkit/rivetkit-napi@pr-4811
npm install @rivetkit/workflow-engine@pr-4811

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4811

All packages published as 0.0.0-pr.4811.6b7e318 with tag pr-4811.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-6b7e318
docker pull rivetdev/engine:full-6b7e318
Individual packages
npm install rivetkit@pr-4811
npm install @rivetkit/react@pr-4811
npm install @rivetkit/rivetkit-napi@pr-4811
npm install @rivetkit/workflow-engine@pr-4811

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant