fix(envoy-client): clear processed_command_idx after ack send #4812

Closed
NathanFlurry wants to merge 1 commit into engine-stabilize/envoy-client-command-dedup from 04-27-fix_envoy-client_clear_processed_command_idx_after_ack_send
Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude Bot commented Apr 27, 2026

Code Review

Overview

This PR clears processed_command_idx entries after a command ack is sent to pegboard-envoy. Previously these entries were never removed, causing the dedup map to grow unboundedly across actor lifecycles. The fix is narrowly scoped and the race condition is documented honestly.


Correctness: Scope of the Clear is Right

The cleanup loop iterates last_command_checkpoints, which is built from ctx.actors — so only currently active actors get their entries cleared. Entries for already-stopped actors (intentionally kept across remove_actor) are not touched. This preserves the resurrection-prevention invariant described in the field’s doc comment.


Race Condition (Acknowledged TODO)

The TODO comment describes a real correctness gap:

  1. ws_send returns (bytes accepted by OS)
  2. processed_command_idx entries cleared (this PR)
  3. WS drops before pegboard-envoy commits the FDB clear_range
  4. Reconnect: pegboard-envoy replays the commands
  5. No dedup entry: CommandStartActor re-executes for an active actor, violating the "at most one actor instance per actor id" invariant

The window is narrow (bytes-in-flight vs. FDB commit gap), but it is real. The proper fix is an ack-of-ack from pegboard-envoy before clearing. Consider tracking this in .agent/todo/ so it does not get lost.
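
One way the ack-of-ack idea could look is sketched below. This is a minimal model under stated assumptions, not the envoy-client implementation: `DedupState`, `pending_clear`, `should_process`, `on_ack_sent`, `on_ack_confirmed`, and `on_reconnect` are hypothetical names, the `i64` index type is assumed, and the "drop anything at or below the last seen index" rule is an assumed dedup policy. The point it illustrates is that entries stay in the dedup map until the server confirms the ack, so a replay after a dropped connection is still rejected.

```rust
use std::collections::HashMap;

/// Dedup key: (actor_id, generation), matching the tuple key in the review's snippet.
type Key = (String, u32);

/// Hypothetical stand-in for the client's dedup state.
#[derive(Default)]
struct DedupState {
    /// Highest command index already processed per actor generation (index type assumed).
    processed_command_idx: HashMap<Key, i64>,
    /// Checkpoints that were sent in an ack but not yet confirmed by the server.
    pending_clear: HashMap<Key, i64>,
}

impl DedupState {
    /// Returns true if the command is new and should run (assumed dedup rule).
    fn should_process(&mut self, key: &Key, idx: i64) -> bool {
        let seen = self.processed_command_idx.get(key).copied();
        if let Some(seen_idx) = seen {
            if idx <= seen_idx {
                return false; // replayed command: drop it
            }
        }
        self.processed_command_idx.insert(key.clone(), idx);
        true
    }

    /// After the ack is handed to the socket: remember what was acked, keep dedup entries.
    fn on_ack_sent(&mut self, checkpoints: &[(Key, i64)]) {
        for (key, idx) in checkpoints {
            self.pending_clear.insert(key.clone(), *idx);
        }
    }

    /// When the hypothetical ack-of-ack arrives: entries can now be dropped safely.
    fn on_ack_confirmed(&mut self) {
        for (key, acked_idx) in self.pending_clear.drain() {
            // Clear only if no newer command was processed since the ack was sent.
            if self.processed_command_idx.get(&key) == Some(&acked_idx) {
                self.processed_command_idx.remove(&key);
            }
        }
    }

    /// On reconnect without confirmation: keep dedup entries so replays are rejected.
    fn on_reconnect(&mut self) {
        self.pending_clear.clear();
    }
}
```

With this shape, step 5 of the sequence above no longer re-executes the command: the entry is still present when the replay arrives, because nothing was cleared before confirmation.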


Minor: Unnecessary Clone

last_command_checkpoints.clone() clones the entire Vec (including each actor_id: String) only so that the keys can still be iterated after the original is moved into ws_send. The cleanup loop then clones each actor_id a second time to build the HashMap lookup key. Collecting the cleanup keys before the send avoids the full-Vec clone and leaves one clone per actor_id:

let cleanup_keys: Vec<(String, u32)> = last_command_checkpoints
    .iter()
    .map(|cp| (cp.actor_id.clone(), cp.generation))
    .collect();

ws_send(
    &ctx.shared,
    protocol::ToRivet::ToRivetAckCommands(protocol::ToRivetAckCommands {
        last_command_checkpoints,
    }),
)
.await;

for (actor_id, generation) in cleanup_keys {
    ctx.processed_command_idx.remove(&(actor_id, generation));
}

Missing Test Coverage

tests/command_dedup.rs has good dedup coverage but no test for send_command_ack. A test should verify:

  • After calling send_command_ack, active-actor entries are removed from processed_command_idx
  • Entries for actors already removed via remove_actor are not cleared (resurrection guard still holds)
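
The two properties above can be checked against a simplified model like the one below. It is a sketch only: `Ctx` here is a hypothetical, synchronous stand-in for the real context, the `i64` index type is assumed, and `send_command_ack` returns the checkpoints instead of awaiting `ws_send`. A real test in tests/command_dedup.rs would exercise the actual async function.

```rust
use std::collections::HashMap;

/// Dedup key: (actor_id, generation).
type Key = (String, u32);

/// Simplified, hypothetical model of the envoy-client context.
#[derive(Default)]
struct Ctx {
    /// Currently active actors with their last command index (value type assumed).
    actors: HashMap<Key, i64>,
    /// Dedup map under test.
    processed_command_idx: HashMap<Key, i64>,
}

impl Ctx {
    /// Stops tracking an actor but deliberately keeps its dedup entry
    /// (the resurrection guard described in the review).
    fn remove_actor(&mut self, key: &Key) {
        self.actors.remove(key);
    }

    /// Stand-in for send_command_ack: builds checkpoints from active actors,
    /// then clears dedup entries for those actors only.
    fn send_command_ack(&mut self) -> Vec<(Key, i64)> {
        let checkpoints: Vec<(Key, i64)> = self
            .actors
            .iter()
            .map(|(key, idx)| (key.clone(), *idx))
            .collect();
        // The real code would await ws_send with the checkpoints here.
        for (key, _) in &checkpoints {
            self.processed_command_idx.remove(key);
        }
        checkpoints
    }
}

/// Exercises both properties; panics if either one does not hold.
fn check_ack_clears_only_active_entries() {
    let mut ctx = Ctx::default();
    let active = ("actor-active".to_string(), 1u32);
    let stopped = ("actor-stopped".to_string(), 1u32);
    ctx.actors.insert(active.clone(), 3);
    ctx.actors.insert(stopped.clone(), 7);
    ctx.processed_command_idx.insert(active.clone(), 3);
    ctx.processed_command_idx.insert(stopped.clone(), 7);

    ctx.remove_actor(&stopped);
    let sent = ctx.send_command_ack();

    assert_eq!(sent.len(), 1); // only the active actor is acked
    assert!(!ctx.processed_command_idx.contains_key(&active));
    assert_eq!(ctx.processed_command_idx.get(&stopped), Some(&7));
}
```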

Comment Style

The TODO block is seven lines long, while project conventions call for at most one short line. A single inline comment is sufficient, for example:

// TODO: clear only after ack-of-ack from pegboard-envoy to close replay race window

The full analysis can go in .agent/todo/.


@NathanFlurry NathanFlurry force-pushed the 04-27-fix_envoy-client_clear_processed_command_idx_after_ack_send branch from f65d4b3 to 5c033a7 Compare April 27, 2026 17:35