Skip to content

fix(pegboard): move sqlite v1 migration into actor workflow#4787

Merged
NathanFlurry merged 1 commit intomainfrom
04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow
Apr 27, 2026
Merged

fix(pegboard): move sqlite v1 migration into actor workflow#4787
NathanFlurry merged 1 commit intomainfrom
04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 26, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Code Review: fix(pegboard): move sqlite v1 migration into actor workflow

Note: This is a draft PR. Review reflects the state of the code as submitted.

Overview

This PR refactors the SQLite v1→v2 migration by:

  1. Extracting migration logic from pegboard-envoy/src/sqlite_runtime.rs into a new self-contained pegboard/src/actor_sqlite.rs module.
  2. Moving migration execution from eager dispatch-time into a dedicated workflow activity (MigrateSqliteV1ToV2) inside the actor2 runtime workflow.
  3. Changing SqliteEngine::db from Arc<universaldb::Database> to universaldb::Database (owned value).

The motivation is sound: running migration as a workflow activity gives it proper retry semantics, error surfacing, and isolation from the actor dispatch path.


Issues

Migration locks map grows unboundedly

// actor_sqlite.rs
static SQLITE_MIGRATION_LOCKS: OnceLock<HashMap<String, Arc<Mutex<()>>>> = OnceLock::new();

async fn actor_migration_lock(actor_id: &str) -> Arc<Mutex<()>> {
    match migration_locks().entry_async(actor_id).await {
        Entry::Occupied(entry) => Arc::clone(entry.get()),
        Entry::Vacant(entry) => Arc::clone(entry.insert_entry(Arc::new(Mutex::new(()))).get()),
    }
}

Entries are never evicted from SQLITE_MIGRATION_LOCKS. In a long-running pegboard process that handles many unique actors over its lifetime, this leaks one Arc<Mutex<()>> per actor ID. Since migration is a one-shot path, the lock is no longer needed after the migration completes. Consider removing the entry after the guard is dropped, or switch to a TTL-bounded moka::Cache (already in the workspace dependency list per CLAUDE.md conventions).

Duplicate constants between impl and test

SQLITE_V1_MAX_MIGRATION_BYTES and SQLITE_V1_MIGRATION_LEASE_MS are defined in both actor_sqlite.rs and tests/actor_sqlite_migration.rs. The test file uses them to construct boundary inputs (e.g. SQLITE_V1_MAX_MIGRATION_BYTES + 1). These should be pub(crate) constants exported from actor_sqlite and re-used in the test so a future value change cannot silently create a mismatch.

Three duplicate call sites for migration step

migrate_sqlite_v1_to_v2_step is called at three places in runtime.rs (after reschedule_actor, and in two branches of handle_stopped). If the allocation paths share a common post-allocation hook or helper, consolidating the calls there would reduce the chance of a future allocation path missing the migration step. If the branching is intrinsic to the workflow, a comment explaining why all three sites are needed would help reviewers.

60-second lease may be shorter than the migration itself

SQLITE_V1_MIGRATION_LEASE_MS = 60_000 is the threshold for declaring a prior migration attempt stale. For databases near the 128 MiB limit, the migration involves: reading all V1 chunks from KV, writing to a tempdir, running PRAGMA quick_check, encoding as LTX, and staging chunks back through SqliteEngine. On cold or overloaded storage backends this could exceed 60 seconds, causing a second concurrent actor allocation to restart the migration while the first is still finalizing. Consider increasing the lease (e.g. 5 minutes) or persisting a heartbeat during staging so the lease can be extended.

WAL/SHM rejection is terminal with no recovery path

ensure!(
    !v1_file_exists(db, recipient, FILE_TAG_WAL).await?,
    "unexpected sqlite v1 WAL sidecar present"
);

If a v1 actor snapshot has a WAL sidecar, migration hard-fails permanently — v2 data doesn't exist and migration will never succeed. At minimum, include the actor ID in the error and ensure it surfaces in alerting. Consider whether the WAL can be merged into the main file using rusqlite's PRAGMA wal_checkpoint(TRUNCATE) before the integrity check, since Connection::open in rollback-journal mode already handles journals.


Suggestions

Tempdir disk I/O for large databases. The recovery step writes the full v1 main and journal bytes to a tempdir before opening with rusqlite. For databases near 128 MiB this means ~256 MiB of disk writes. tempfile::tempdir() uses TMPDIR which may be disk-backed. Worth noting in a comment, especially since this runs synchronously in a workflow activity.

db.as_ref() to &db refactor is correct. No concerns here; removing the Arc wrapping reduces reference-counting overhead throughout sqlite-storage and the diff is consistent.

Metrics are well-instrumented. Per-phase failure labels on SQLITE_MIGRATION_FAILURES_TOTAL are exactly the right granularity for diagnosing production migration failures.

Double-checked locking pattern is correct. The sqlite_v1_data_exists check before acquiring the lock, followed by a re-check inside the lock, correctly handles the race where two concurrent allocations both see v1 data.

Test coverage is thorough. The new tests/actor_sqlite_migration.rs covers the happy path, empty databases, journal recovery, stale migration retry, native v2 precedence, and all major error conditions. One gap: a test for the concurrent-allocation case (two simultaneous migrate_v1_to_v2 calls for the same actor) would exercise the lock + double-check logic end-to-end.


Minor style notes (per CLAUDE.md)

  • migration_error returns the original anyhow::Error unchanged after incrementing the metric and logging. Naming it record_migration_error would make the side-effect-only nature more obvious to future readers.
  • Log fields like duration_ms = start.elapsed().as_millis() should use structured fields correctly: start.elapsed().as_millis() returns a u128, which will format fine, but it's worth confirming tracing handles it without a .to_string() call in the hot path.
  • The draft PR has several driver tests marked as not yet passing (actor-db, actor-db-raw, actor-db-init-order). Ensure these are resolved before merge since they are directly related to the SQLite path being modified.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 26, 2026

Preview packages published to npm

Install with:

npm install rivetkit@pr-4787

All packages published as 0.0.0-pr.4787.4a3bf94 with tag pr-4787.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-4a3bf94
docker pull rivetdev/engine:full-4a3bf94
Individual packages
npm install rivetkit@pr-4787
npm install @rivetkit/react@pr-4787
npm install @rivetkit/rivetkit-napi@pr-4787
npm install @rivetkit/workflow-engine@pr-4787

@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 24d59db to 81b9731 Compare April 27, 2026 01:06
@NathanFlurry NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 7ecf3ef to 7b73290 Compare April 27, 2026 01:06
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 81b9731 to 6d8d78f Compare April 27, 2026 01:17
@NathanFlurry NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 7b73290 to 796c8b3 Compare April 27, 2026 01:17
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 6d8d78f to fffe86c Compare April 27, 2026 01:47
@NathanFlurry NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 796c8b3 to 63dbe6f Compare April 27, 2026 01:47
@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

PR Review: fix(pegboard): move sqlite v1 migration into actor workflow

This PR relocates the SQLite v1-to-v2 migration from pegboard-envoy::sqlite_runtime into the pegboard_actor2 workflow as an idempotent activity, and cleans up the SqliteEngine API (de-Arc-ing the db field, removing Arc<Mutex<PendingStage>>). The motivation is correct: migration logic belongs in durable workflow history rather than ephemeral per-connection startup code.


Critical

v2.bare schema is modified without a version bump

engine/sdks/schemas/envoy-protocol/v2.bare removes sqliteSchemaVersion: u32 from CommandStartActor, and the matching TypeScript codec is updated in lockstep, but PROTOCOL_VERSION stays at 2. The CLAUDE.md rule is explicit: "Never modify an existing published *.bare runner protocol version." The correct approach is to add v3.bare, introduce a v3::CommandStartActor without the field, update versioned.rs to convert v2 to v3 on decode, and bump PROTOCOL_VERSION. Any deployed pegboard-envoy speaking v2 wire format will now emit a message missing a field that existing envoy clients at an older binary expect, creating a deserialization skew window during rolling deploys.


Significant

Migration lock removed without concurrency analysis

The old code held a per-actor Arc<Mutex<()>> lock across maybe_migrate_v1_to_v2. The new code relies on workflow activity semantics. The lease check (ensure!(!stage_in_progress || lease_expires_at <= timestamp::now())) provides some defense but does not serialize it only errors when a live stage is detected. If the workflow activity model guarantees at-most-one-execution per activity instance, this assumption should be documented in a comment.

Journal sidecar behavior is a silent regression

The old recover_v1_snapshot opened the journaled database via rusqlite, which applies the journal automatically and recovers committed rows from a crash. The new read_v1_main increments SQLITE_MIGRATION_REJECTED_JOURNAL_TOTAL and bails entirely, leaving actors permanently stuck on v1. This is a functional regression for any actor that was mid-write when it last ran on v1. The comment says "manual triage required" but there is no tooling or documentation for what that looks like. This behavior change deserves a prominent note in the PR description.

populate_start_command no longer checks sqlite_v1_data_exists

The new code calls maybe_load_sqlite_startup_data unconditionally, which calls sqlite_engine.try_load_meta. For from_v1 = true actors where MigrateSqliteV1ToV2 has not yet completed, the ensure!(!matches!(meta.origin, SqliteOrigin::MigratingFromV1)) guard will hard-fail the start command. Confirm this race cannot happen given the workflow ordering, and if it cannot, add an explanatory comment.


Moderate

Duplicate constant definitions in test file

engine/packages/pegboard/tests/actor_sqlite_migration.rs re-declares all the SQLITE_V1_* and FILE_TAG_* constants already defined in actor_sqlite.rs. If either set drifts, tests pass against stale values. Consider making these pub(crate) and importing them in the test file.

SQLITE_MIGRATION_PAGES histogram uses wrong bucket set

MICRO_BUCKETS is designed for sub-millisecond latencies. Page counts are integers potentially in the thousands; most observations will land in the overflow bucket. Use a page-count-appropriate set (e.g., [0, 1, 4, 16, 64, 256, 1024, 4096, 16384]).

ctx.v(2) version gate on the migration activity

Confirm that existing pegboard_actor2 workflows already running with from_v1 = true replay consistently after the binary upgrade. Document what workflow version 1 behavior means for from_v1 = true actors so it is clear ctx.v(2) correctly skips migration on old replays.

Per-call SqliteEngine drops _compaction_rx immediately

The CompactionCoordinator is never spawned for the ephemeral migration engine. If commit_finalize produces a compaction notification it is silently discarded. This is safe (compaction will happen via the live engine), but worth a comment.


Minor

CLAUDE.md section rename drops constraints

The rename from "Important Domain Information" to "Terminology" removes the "Never claim zero cold start" bullet and the rivet.gg deprecation line. Confirm the removal is intentional.

actor_import_export_e2e.rs may reference a removed module path

engine/packages/engine/tests/actor_import_export_e2e.rs calls pegboard::actor_sqlite_v2::import_actor and export_actor. If actor_sqlite_v2 was renamed or removed by this PR, this will be a compile error. Confirm the test still compiles.


Summary

The structural decision to move migration into the workflow is sound and the test coverage in actor_sqlite_migration.rs is thorough. The blocking issue is the v2.bare protocol schema modification without a version bump which violates the repo explicit invariant and risks deserialization skew during rolling deploys. The journal sidecar behavior change is also a functional regression that should be documented. Recommend addressing the protocol version concern before merging.

Base automatically changed from cleanup/db-lifecycle-coverage to main April 27, 2026 07:13
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 3e13940 to f6d9a16 Compare April 27, 2026 07:51
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch 2 times, most recently from 01215ac to 7d7aa2d Compare April 27, 2026 17:39
@NathanFlurry NathanFlurry marked this pull request as ready for review April 27, 2026 19:05
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 7d7aa2d to a9093f9 Compare April 27, 2026 19:06
@NathanFlurry NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from a9093f9 to c1976e3 Compare April 27, 2026 20:48
@NathanFlurry NathanFlurry merged commit c1976e3 into main Apr 27, 2026
7 of 27 checks passed
@NathanFlurry NathanFlurry deleted the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch April 27, 2026 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant