chore(pegboard): simplify v1 to v2 migration to not rely on sqlite #4797

Closed
NathanFlurry wants to merge 1 commit into 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow from 04-26-chore_pegboard_simplify_v1_to_v2_migration_to_not_rely_on_sqlite

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented Apr 27, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude Bot commented Apr 27, 2026

PR Review: chore(pegboard): simplify v1 to v2 migration to not rely on sqlite

This is a solid cleanup. The start_new_generation refactor eliminates real duplication, moving rusqlite/tempfile to dev-dependencies is a meaningful production binary improvement, and the new metric makes the rejection path observable. A few items worth addressing before merging:


Medium: Silent no-op on missing pending stage post-write (commit.rs)

After the UDB write for a chunk completes, the in-memory stage is updated with two separate get_async calls:

// post-write success path
if let Some(mut entry) = self.pending_stages.get_async(&stage_key).await {
    let stage = entry.get_mut();
    stage.next_chunk_idx += 1;
    stage.saw_last_chunk = request.is_last;
}
// post-write error path
if let Some(mut entry) = self.pending_stages.get_async(&stage_key).await {
    entry.get_mut().error_message = Some(err.to_string());
}

Previously, Arc::clone was done once before the UDB write and the mutex was held for the full operation, so the entry could not disappear. Now there is a TOCTOU (time-of-check to time-of-use) window: if a concurrent commit_finalize removes the entry between the pre-write check (line ~498) and the post-write update (line ~597), the next_chunk_idx/saw_last_chunk/error_message update is silently dropped. CLAUDE.md's "Fail-By-Default Runtime" constraint explicitly calls out silent no-ops for required runtime behavior.

Options: (a) return an error if the entry is missing post-write, (b) hold the scc entry across the UDB write via entry_async to prevent concurrent removal, or (c) add a comment documenting why the silent drop cannot happen in practice (e.g., Gasoline serializes per-actor so there can be no racing commit_finalize).
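
To make option (a) concrete, here is a minimal sketch of a post-write update that fails loudly when the stage is gone. It is not the code in commit.rs: a std Mutex<HashMap> stands in for the scc map and its async accessors, and PendingStages, StageUpdateError, and record_chunk_written are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the per-stage bookkeeping named in the review.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PendingStage {
    pub next_chunk_idx: u32,
    pub saw_last_chunk: bool,
    pub error_message: Option<String>,
}

/// Hypothetical error type for a stage that vanished mid-operation.
#[derive(Debug, PartialEq)]
pub enum StageUpdateError {
    StageMissing { stage_key: String },
}

/// Assumption: a Mutex<HashMap> replaces the real concurrent map so the
/// sketch compiles with the standard library alone.
pub struct PendingStages {
    inner: Mutex<HashMap<String, PendingStage>>,
}

impl PendingStages {
    pub fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    pub fn insert(&self, stage_key: &str, stage: PendingStage) {
        self.inner.lock().unwrap().insert(stage_key.to_string(), stage);
    }

    pub fn remove(&self, stage_key: &str) -> Option<PendingStage> {
        self.inner.lock().unwrap().remove(stage_key)
    }

    /// Post-write success path: advance the stage, or return an error
    /// instead of silently skipping the update when the entry is missing.
    pub fn record_chunk_written(
        &self,
        stage_key: &str,
        is_last: bool,
    ) -> Result<u32, StageUpdateError> {
        let mut map = self.inner.lock().unwrap();
        match map.get_mut(stage_key) {
            Some(stage) => {
                stage.next_chunk_idx += 1;
                stage.saw_last_chunk = is_last;
                Ok(stage.next_chunk_idx)
            }
            None => Err(StageUpdateError::StageMissing {
                stage_key: stage_key.to_string(),
            }),
        }
    }
}
```

The same shape would apply to the error path: a missing entry becomes a returned error rather than a skipped if-let branch, so the caller decides whether it is fatal.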


Low: Removed journal auto-recovery — no operator runbook

The old code recovered from a pending journal sidecar by opening the SQLite file via rusqlite (which rolls back the uncommitted transaction automatically). The new code rejects the actor and emits SQLITE_MIGRATION_REJECTED_JOURNAL_TOTAL. This is observable, but there is no documented manual recovery path for operators when this counter is nonzero. Consider adding a note in the code comment or a .agent/todo/ entry pointing to what an operator should do.
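
For readers unfamiliar with the rejection path, the behavior described above could look roughly like the sketch below. Everything here is an assumption for illustration: a bare atomic stands in for the SQLITE_MIGRATION_REJECTED_JOURNAL_TOTAL metric, reject_if_journal_pending and MigrationRejected are hypothetical names, and treating an empty journal as "nothing pending" is a guess, not something the PR confirms.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the real metric; a production build would go through its
/// metrics library rather than a bare atomic.
pub static REJECTED_JOURNAL_TOTAL: AtomicU64 = AtomicU64::new(0);

/// Hypothetical rejection reason surfaced to the caller.
#[derive(Debug, PartialEq, Eq)]
pub enum MigrationRejected {
    PendingJournal { journal_len: usize },
}

/// Reject the migration when a non-empty journal sidecar is present, and
/// count the rejection so operators can see it happened.
/// Assumption: an absent or empty journal means nothing is pending.
pub fn reject_if_journal_pending(journal: Option<&[u8]>) -> Result<(), MigrationRejected> {
    match journal {
        Some(bytes) if !bytes.is_empty() => {
            REJECTED_JOURNAL_TOTAL.fetch_add(1, Ordering::Relaxed);
            Err(MigrationRejected::PendingJournal { journal_len: bytes.len() })
        }
        Some(_) | None => Ok(()),
    }
}
```

A runbook note could then say what an operator does with the actor identified by the rejection, since the counter alone only says that something was rejected.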


Low: Weaker integrity guarantee after removing rusqlite

recover_v1_snapshot previously ran PRAGMA quick_check(1) after opening via rusqlite. validate_v1_main replaces this with header-only byte validation (magic bytes, page size, alignment). A database with a valid header but corrupt page data will now pass validation and be imported into v2. The test rejects_v1_main_with_corrupt_magic_byte pins the new path correctly, but there is no test for the valid-header/corrupt-page scenario. This is a deliberate trade-off given the PR title, but it is worth acknowledging explicitly, either in a comment or a follow-up issue.
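
To illustrate what header-only validation can and cannot catch, here is a sketch based on the published SQLite file format (16-byte magic string, big-endian page size at byte offset 16, where the value 1 means 65536). It is not the body of validate_v1_main; validate_header_sketch and HeaderError are hypothetical names, and the exact checks in the PR may differ.

```rust
/// The 16-byte magic string that starts every SQLite 3 database file.
const SQLITE_MAGIC: &[u8; 16] = b"SQLite format 3\0";
/// The SQLite database header occupies the first 100 bytes of the file.
const HEADER_LEN: usize = 100;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
    BadPageSize(u32),
    Misaligned { len: usize, page_size: u32 },
}

/// Checks magic bytes, page size, and length alignment only.
/// Returns the page size on success. Page contents are never inspected.
pub fn validate_header_sketch(bytes: &[u8]) -> Result<u32, HeaderError> {
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::TooShort);
    }
    if !bytes.starts_with(SQLITE_MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    // Page size is a big-endian u16 at offset 16; the value 1 encodes 65536.
    let raw = u16::from_be_bytes([bytes[16], bytes[17]]);
    let page_size: u32 = if raw == 1 { 65_536 } else { u32::from(raw) };
    if page_size < 512 || !page_size.is_power_of_two() {
        return Err(HeaderError::BadPageSize(page_size));
    }
    if bytes.len() % page_size as usize != 0 {
        return Err(HeaderError::Misaligned { len: bytes.len(), page_size });
    }
    Ok(page_size)
}
```

Because nothing past the header is read, flipping bytes inside a page leaves the result unchanged, which is exactly the valid-header/corrupt-page gap a follow-up test could pin.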


Low: Concurrent migration race without in-process lock

The removed SQLITE_MIGRATION_LOCKS per-actor mutex prevented two concurrent callers from entering maybe_migrate_v1_to_v2 for the same actor simultaneously. The distributed-level lease check (MigratingFromV1 + lease_expires_at) still guards correctness, and Gasoline serializes per-actor workflow steps, so the in-process lock was redundant in practice. A brief comment explaining why the in-process lock is no longer needed (Gasoline serialization) would help future readers who might otherwise think it was accidentally dropped.
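
As a rough model of why the lease alone can guard correctness, the sketch below shows a pure decision function over the migration state. Only MigratingFromV1 and lease_expires_at come from the review; the other variants, LeaseDecision, and try_acquire_migration_lease are hypothetical, and a real implementation would have to apply the decision inside a single storage transaction for the guarantee to hold.

```rust
/// Hypothetical migration states; only `MigratingFromV1` and
/// `lease_expires_at` are names taken from the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MigrationState {
    V1,
    MigratingFromV1 { lease_expires_at: u64 },
    V2,
}

/// Hypothetical outcome of a lease check.
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseDecision {
    /// The caller may migrate and must persist the returned state.
    Acquired(MigrationState),
    /// Another caller holds an unexpired lease.
    HeldByOther,
    /// Nothing to do; the actor is already on v2.
    AlreadyMigrated,
}

/// Decide whether a caller may start migrating, given the stored state.
pub fn try_acquire_migration_lease(
    state: MigrationState,
    now_ms: u64,
    lease_ms: u64,
) -> LeaseDecision {
    match state {
        MigrationState::V2 => LeaseDecision::AlreadyMigrated,
        MigrationState::MigratingFromV1 { lease_expires_at } if now_ms < lease_expires_at => {
            LeaseDecision::HeldByOther
        }
        // Either never started, or a previous holder's lease has expired.
        MigrationState::V1 | MigrationState::MigratingFromV1 { .. } => {
            LeaseDecision::Acquired(MigrationState::MigratingFromV1 {
                lease_expires_at: now_ms.saturating_add(lease_ms),
            })
        }
    }
}
```

Under this model a second concurrent caller sees HeldByOther until the lease expires, which is the property the in-process mutex duplicated.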


Style: Pre-existing inline #[cfg(test)] in engine.rs

engine/packages/sqlite-storage/src/engine.rs has an inline #[cfg(test)] mod tests block. CLAUDE.md requires Rust tests to live under tests/. This pre-dates this PR and isn't a blocker, but worth a follow-up.


What looks good

  • start_new_generation extraction cleanly deduplicates three call sites with identical generation += 1 / migrate / SendOutbound sequences.
  • rusqlite + tempfile moved to [dev-dependencies] — production binary no longer links SQLite.
  • New SQLITE_MIGRATION_REJECTED_JOURNAL_TOTAL metric makes the failure mode observable.
  • validate_v1_main header checks are clear and the magic-byte test pins the validation path.
  • Structured tracing fields, complete-sentence comments, no _ => fall-throughs — all pass CLAUDE.md checks.
  • V1File.size_bytes removal replaced with bytes.is_empty() — correct and simpler.

Concern | Severity
--- | ---
Silent no-op on missing pending stage post-write | Medium (should fix or document)
No operator runbook for REJECTED_JOURNAL_TOTAL > 0 | Low
Weaker integrity check (header-only validation) | Low
Concurrent migration race without in-process lock | Low (needs comment)
Inline #[cfg(test)] in engine.rs | Pre-existing, not a blocker

