Skip to content

refactor(prover-client): deferred epoch finalization, checkpoint-ready promise, and reorg safety#22749

Open
PhilWindle wants to merge 2 commits intomerge-train/spartanfrom
phil/a-954-orchestrator-restructure-deferred-finalization-checkpoints
Open

refactor(prover-client): deferred epoch finalization, checkpoint-ready promise, and reorg safety#22749
PhilWindle wants to merge 2 commits intomerge-train/spartanfrom
phil/a-954-orchestrator-restructure-deferred-finalization-checkpoints

Conversation

@PhilWindle
Copy link
Copy Markdown
Collaborator

Summary

Restructures the orchestrator to support optimistic proving (A-954):

  • startNewEpoch() no longer takes totalNumCheckpoints or finalBlobBatchingChallenges — neither is known at epoch start
  • New finalizeEpochStructure(totalNumCheckpoints, finalBlobBatchingChallenges) method called after all checkpoints are processed
  • Checkpoints-ready promise resolves when all checkpoints complete block-level proving (reactive — re-evaluates on checkpoint removal)
  • Two-input gate: checkpoint roots only enqueue when both block merge proofs AND epoch structure are ready
  • removeLastCheckpoint() for L1 reorg safety (only valid before finalization)
  • No behavior change: proving still triggers at epoch end. This is infrastructure for optimistic proving.

Fixes A-954

Test plan

  • 36 new tests in orchestrator_deferred_finalization.test.ts covering:
    • Happy path (single and multi-checkpoint epochs)
    • Deferred finalization (incremental checkpoints, double-finalize error)
    • Checkpoints-ready promise (resolves when ready, doesn't resolve early, re-evaluates on removal)
    • Two-input gate (proofs-first-then-finalize and finalize-first-then-proofs orderings)
    • Reorg safety (remove last checkpoint, throws after finalize, sequential removes)
    • Edge cases (empty epoch, cancellation, starting new epoch while previous active)
  • All existing orchestrator tests updated and passing

@PhilWindle PhilWindle added ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels Apr 23, 2026
@PhilWindle PhilWindle force-pushed the phil/a-954-orchestrator-restructure-deferred-finalization-checkpoints branch 9 times, most recently from 6919b6f to f560e42 Compare April 24, 2026 13:15
@PhilWindle PhilWindle marked this pull request as ready for review April 24, 2026 13:36
…y promise, and reorg safety (A-954)

Restructures the orchestrator to support optimistic proving:

- startNewEpoch() no longer takes totalNumCheckpoints or finalBlobBatchingChallenges
- New finalizeEpochStructure() method called after all checkpoints are processed
- Checkpoints-ready promise (callback-based) resolves when all checkpoints complete block-level proving
- Two-input gate: checkpoint roots only enqueue when both block merge proofs and epoch structure are ready
- removeLastCheckpoint() for L1 reorg safety — closes world state forks and cleans up chonk verifier cache
- Checkpoints must be registered sequentially (enforced); block processing remains parallel
- EpochProvingJob registers checkpoints in order, then processes blocks via asyncPool
- No behavior change: proving still triggers at epoch end

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PhilWindle PhilWindle force-pushed the phil/a-954-orchestrator-restructure-deferred-finalization-checkpoints branch from f560e42 to 806825c Compare April 24, 2026 16:14
* Removes the last checkpoint from the epoch. Only valid before `finalizeEpochStructure` has been called.
* Returns the removed checkpoint, or undefined if there are no checkpoints.
*/
public removeLastCheckpoint(): CheckpointProvingState | undefined {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up PR this has changed to allow removal of a checkpoint at any index.

* includes the removed checkpoint
* - Block merge jobs may be enqueued wastefully but will hit the same safe gates
*/
public removeLastCheckpoint() {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later PR changes this to remove any checkpoint, not just last.

Comment on lines +258 to +261
// Wait for all checkpoints to complete block-level proving before finalizing the epoch structure.
// In the current non-optimistic flow this will resolve once all block merge proofs complete.
// In the future optimistic flow, this will be awaited alongside the epoch-end signal.
await this.prover.waitForAllCheckpointsReady();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this part. Why do we want to wait for checkpoints to finish proving here? What we want is to wait for all checkpoints in the epoch to have been mined, so we can finalize the epoch with all the blob data, right?

Comment on lines +122 to +125
// Transition to FULL if all checkpoints are added.
if (this.checkpoints.filter(c => !!c).length === totalNumCheckpoints) {
this.provingStateLifecycle = PROVING_STATE_LIFECYCLE.PROVING_STATE_FULL;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this function expected to be called? The jsdoc for the function states that "Finalizes the epoch structure after all checkpoints have been added", but this check hints that not all checkpoints may be added yet?

Comment on lines +399 to 402
const activeCheckpoints = this.checkpoints.filter(c => !!c) as CheckpointProvingState[];
for (let i = 0; i < activeCheckpoints.length; i++) {
const checkpoint = activeCheckpoints[i];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up this changes semantics. If there's a gap in this.checkpoints (maybe because one checkpoint finalized simulation earlier than a previous one?) then we may compute the outHashHint for a checkpoint using the outHash from a checkpoint that's not the immediately previous one in the epoch.

Comment on lines +73 to +74
/** All callbacks waiting for checkpoints to be block-level ready. */
private checkpointsReadyCallbacks: Array<{ resolve: () => void; reject: (reason: string) => void }> = [];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: while I like this pattern, we're not using it anywhere else for proving orchestration. Maybe we can drop it and use the same checks we do elsewhere, for consistency's sake? Only if it's not too much effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants