fix(pegboard): move sqlite v1 migration into actor workflow by NathanFlurry · Pull Request #4787 · rivet-dev/rivet

NathanFlurry · 2026-04-26T22:42:08Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

NathanFlurry · 2026-04-26T22:42:49Z

fix(pegboard): move sqlite v1 migration into actor workflow #4787 : 2 dependent PRs (#4797 , #4810 ) 👈 (View in Graphite)
test(rivetkit): cover c.db access from onWake/onSleep/onDestroy #4786
fix(rivetkit-napi): run onMigrate before createState/onCreate/createVars #4785
refactor(rivetkit-core): merge ready+started into lifecycle_started, drop redundant napi calls #4784
fix(rivetkit): drain native sleep side tasks reliably #4782 : 1 other dependent PR (#4783 )
fix(rivetkit): route raw request fetches to actors #4781
fix(pegboard): refresh runner config after envoy connect #4778 : 2 other dependent PRs (#4779 , #4798 )
chore(kitchen-sink): configurable endpoint #4766 : 2 other dependent PRs (#4754 , #4777 )
refactor(rivetkit-core): rename StopReason to ShutdownKind and kill enum fallthroughs #4765
chore(envoy-client): log command and event payloads #4751
fix(rivetkit): minor sleep-cleanup follow-ups #4761
docs(rivetkit): clean up stale setPreventSleep references in docs and JSDoc #4760
docs(rivetkit): document keepAwake/waitUntil counter-arm race and surface bridge errors #4759
test(engine): force overdue branch in alarm-during-sleep test with negative offset #4758
fix(rivetkit-core): return stopping not starting when sleep/destroy called mid-shutdown #4757
test(rivetkit): remove sleepWithPreventSleep fixture and tests #4756
test(engine): add regression test for alarm-during-sleep-transition race #4755
test(rivetkit-core): cover sleep/destroy guards and prevent_sleep no-op #4750
docs: add keepAwake vs waitUntil table and internal sleep-sequence invariants #4749
feat(rivetkit): expose c.keepAwake(promise) through native path #4748
refactor(rivetkit): deprecate setPreventSleep to no-op in TypeScript #4747
fix(rivetkit-core): enumerate non-drained counters in grace deadline warn log #4746
fix(rivetkit-core): abort graceful cleanup task when grace deadline elapses #4745
fix(rivetkit-core): error on sleep/destroy before startup or already-requested #4743
fix(rivetkit-core): cap SerializeState shutdown timeout at 15s #4742
refactor(rivetkit-core): remove prevent_sleep field and deprecate set_prevent_sleep to no-op #4741
fix(api-public): remove orphan import_export actors module #4764
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-04-26T22:46:10Z

Code Review: fix(pegboard): move sqlite v1 migration into actor workflow

Note: This is a draft PR. Review reflects the state of the code as submitted.

Overview

This PR refactors the SQLite v1→v2 migration by:

Extracting migration logic from pegboard-envoy/src/sqlite_runtime.rs into a new self-contained pegboard/src/actor_sqlite.rs module.
Moving migration execution from eager dispatch-time into a dedicated workflow activity (MigrateSqliteV1ToV2) inside the actor2 runtime workflow.
Changing SqliteEngine::db from Arc<universaldb::Database> to universaldb::Database (owned value).

The motivation is sound: running migration as a workflow activity gives it proper retry semantics, error surfacing, and isolation from the actor dispatch path.

Issues

Migration locks map grows unboundedly

// actor_sqlite.rs
static SQLITE_MIGRATION_LOCKS: OnceLock<HashMap<String, Arc<Mutex<()>>>> = OnceLock::new();

async fn actor_migration_lock(actor_id: &str) -> Arc<Mutex<()>> {
    match migration_locks().entry_async(actor_id).await {
        Entry::Occupied(entry) => Arc::clone(entry.get()),
        Entry::Vacant(entry) => Arc::clone(entry.insert_entry(Arc::new(Mutex::new(()))).get()),
    }
}

Entries are never evicted from SQLITE_MIGRATION_LOCKS. In a long-running pegboard process that handles many unique actors over its lifetime, this leaks one Arc<Mutex<()>> per actor ID. Since migration is a one-shot path, the lock is no longer needed after the migration completes. Consider removing the entry after the guard is dropped, or switch to a TTL-bounded moka::Cache (already in the workspace dependency list per CLAUDE.md conventions).

Duplicate constants between impl and test

SQLITE_V1_MAX_MIGRATION_BYTES and SQLITE_V1_MIGRATION_LEASE_MS are defined in both actor_sqlite.rs and tests/actor_sqlite_migration.rs. The test file uses them to construct boundary inputs (e.g. SQLITE_V1_MAX_MIGRATION_BYTES + 1). These should be pub(crate) constants exported from actor_sqlite and re-used in the test so a future value change cannot silently create a mismatch.

Three duplicate call sites for migration step

migrate_sqlite_v1_to_v2_step is called at three places in runtime.rs (after reschedule_actor, and in two branches of handle_stopped). If the allocation paths share a common post-allocation hook or helper, consolidating the calls there would reduce the chance of a future allocation path missing the migration step. If the branching is intrinsic to the workflow, a comment explaining why all three sites are needed would help reviewers.

60-second lease may be shorter than the migration itself

SQLITE_V1_MIGRATION_LEASE_MS = 60_000 is the threshold for declaring a prior migration attempt stale. For databases near the 128 MiB limit, the migration involves: reading all V1 chunks from KV, writing to a tempdir, running PRAGMA quick_check, encoding as LTX, and staging chunks back through SqliteEngine. On cold or overloaded storage backends this could exceed 60 seconds, causing a second concurrent actor allocation to restart the migration while the first is still finalizing. Consider increasing the lease (e.g. 5 minutes) or persisting a heartbeat during staging so the lease can be extended.

WAL/SHM rejection is terminal with no recovery path

ensure!(
    !v1_file_exists(db, recipient, FILE_TAG_WAL).await?,
    "unexpected sqlite v1 WAL sidecar present"
);

If a v1 actor snapshot has a WAL sidecar, migration hard-fails permanently — v2 data doesn't exist and migration will never succeed. At minimum, include the actor ID in the error and ensure it surfaces in alerting. Consider whether the WAL can be merged into the main file using rusqlite's PRAGMA wal_checkpoint(TRUNCATE) before the integrity check, since Connection::open in rollback-journal mode already handles journals.

Suggestions

Tempdir disk I/O for large databases. The recovery step writes the full v1 main and journal bytes to a tempdir before opening with rusqlite. For databases near 128 MiB this means ~256 MiB of disk writes. tempfile::tempdir() uses TMPDIR which may be disk-backed. Worth noting in a comment, especially since this runs synchronously in a workflow activity.

db.as_ref() to &db refactor is correct. No concerns here; removing the Arc wrapping reduces reference-counting overhead throughout sqlite-storage and the diff is consistent.

Metrics are well-instrumented. Per-phase failure labels on SQLITE_MIGRATION_FAILURES_TOTAL are exactly the right granularity for diagnosing production migration failures.

Double-checked locking pattern is correct. The sqlite_v1_data_exists check before acquiring the lock, followed by a re-check inside the lock, correctly handles the race where two concurrent allocations both see v1 data.

Test coverage is thorough. The new tests/actor_sqlite_migration.rs covers the happy path, empty databases, journal recovery, stale migration retry, native v2 precedence, and all major error conditions. One gap: a test for the concurrent-allocation case (two simultaneous migrate_v1_to_v2 calls for the same actor) would exercise the lock + double-check logic end-to-end.

Minor style notes (per CLAUDE.md)

migration_error returns the original anyhow::Error unchanged after incrementing the metric and logging. Naming it record_migration_error would make the side-effect-only nature more obvious to future readers.
Log fields like duration_ms = start.elapsed().as_millis() should use structured fields correctly: start.elapsed().as_millis() returns a u128, which will format fine, but it's worth confirming tracing handles it without a .to_string() call in the hot path.
The draft PR has several driver tests marked as not yet passing (actor-db, actor-db-raw, actor-db-init-order). Ensure these are resolved before merge since they are directly related to the SQLite path being modified.

github-actions · 2026-04-26T22:55:54Z

Preview packages published to npm

Install with:

npm install rivetkit@pr-4787

All packages published as 0.0.0-pr.4787.4a3bf94 with tag pr-4787.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-4a3bf94
docker pull rivetdev/engine:full-4a3bf94

Individual packages

npm install rivetkit@pr-4787
npm install @rivetkit/react@pr-4787
npm install @rivetkit/rivetkit-napi@pr-4787
npm install @rivetkit/workflow-engine@pr-4787

claude · 2026-04-27T01:53:25Z

PR Review: fix(pegboard): move sqlite v1 migration into actor workflow

This PR relocates the SQLite v1-to-v2 migration from pegboard-envoy::sqlite_runtime into the pegboard_actor2 workflow as an idempotent activity, and cleans up the SqliteEngine API (de-Arc-ing the db field, removing Arc<Mutex<PendingStage>>). The motivation is correct: migration logic belongs in durable workflow history rather than ephemeral per-connection startup code.

Critical

v2.bare schema is modified without a version bump

engine/sdks/schemas/envoy-protocol/v2.bare removes sqliteSchemaVersion: u32 from CommandStartActor, and the matching TypeScript codec is updated in lockstep, but PROTOCOL_VERSION stays at 2. The CLAUDE.md rule is explicit: "Never modify an existing published *.bare runner protocol version." The correct approach is to add v3.bare, introduce a v3::CommandStartActor without the field, update versioned.rs to convert v2 to v3 on decode, and bump PROTOCOL_VERSION. Any deployed pegboard-envoy speaking v2 wire format will now emit a message missing a field that existing envoy clients at an older binary expect, creating a deserialization skew window during rolling deploys.

Significant

Migration lock removed without concurrency analysis

The old code held a per-actor Arc<Mutex<()>> lock across maybe_migrate_v1_to_v2. The new code relies on workflow activity semantics. The lease check (ensure!(!stage_in_progress || lease_expires_at <= timestamp::now())) provides some defense but does not serialize it only errors when a live stage is detected. If the workflow activity model guarantees at-most-one-execution per activity instance, this assumption should be documented in a comment.

Journal sidecar behavior is a silent regression

The old recover_v1_snapshot opened the journaled database via rusqlite, which applies the journal automatically and recovers committed rows from a crash. The new read_v1_main increments SQLITE_MIGRATION_REJECTED_JOURNAL_TOTAL and bails entirely, leaving actors permanently stuck on v1. This is a functional regression for any actor that was mid-write when it last ran on v1. The comment says "manual triage required" but there is no tooling or documentation for what that looks like. This behavior change deserves a prominent note in the PR description.

populate_start_command no longer checks sqlite_v1_data_exists

The new code calls maybe_load_sqlite_startup_data unconditionally, which calls sqlite_engine.try_load_meta. For from_v1 = true actors where MigrateSqliteV1ToV2 has not yet completed, the ensure!(!matches!(meta.origin, SqliteOrigin::MigratingFromV1)) guard will hard-fail the start command. Confirm this race cannot happen given the workflow ordering, and if it cannot, add an explanatory comment.

Moderate

Duplicate constant definitions in test file

engine/packages/pegboard/tests/actor_sqlite_migration.rs re-declares all the SQLITE_V1_* and FILE_TAG_* constants already defined in actor_sqlite.rs. If either set drifts, tests pass against stale values. Consider making these pub(crate) and importing them in the test file.

SQLITE_MIGRATION_PAGES histogram uses wrong bucket set

MICRO_BUCKETS is designed for sub-millisecond latencies. Page counts are integers potentially in the thousands; most observations will land in the overflow bucket. Use a page-count-appropriate set (e.g., [0, 1, 4, 16, 64, 256, 1024, 4096, 16384]).

ctx.v(2) version gate on the migration activity

Confirm that existing pegboard_actor2 workflows already running with from_v1 = true replay consistently after the binary upgrade. Document what workflow version 1 behavior means for from_v1 = true actors so it is clear ctx.v(2) correctly skips migration on old replays.

Per-call SqliteEngine drops _compaction_rx immediately

The CompactionCoordinator is never spawned for the ephemeral migration engine. If commit_finalize produces a compaction notification it is silently discarded. This is safe (compaction will happen via the live engine), but worth a comment.

Minor

CLAUDE.md section rename drops constraints

The rename from "Important Domain Information" to "Terminology" removes the "Never claim zero cold start" bullet and the rivet.gg deprecation line. Confirm the removal is intentional.

actor_import_export_e2e.rs may reference a removed module path

engine/packages/engine/tests/actor_import_export_e2e.rs calls pegboard::actor_sqlite_v2::import_actor and export_actor. If actor_sqlite_v2 was renamed or removed by this PR, this will be a compile error. Confirm the test still compiles.

Summary

The structural decision to move migration into the workflow is sound and the test coverage in actor_sqlite_migration.rs is thorough. The blocking issue is the v2.bare protocol schema modification without a version bump which violates the repo explicit invariant and risks deserialization skew during rolling deploys. The journal sidecar behavior change is also a functional regression that should be documented. Recommend addressing the protocol version concern before merging.

NathanFlurry mentioned this pull request Apr 26, 2026

test(rivetkit): cover c.db access from onWake/onSleep/onDestroy #4786

Merged

11 tasks

NathanFlurry mentioned this pull request Apr 27, 2026

chore(sqlite): add open close lifecycle for envoy actors #4788

Closed

11 tasks

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 24d59db to 81b9731 Compare April 27, 2026 01:06

NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 7ecf3ef to 7b73290 Compare April 27, 2026 01:06

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 81b9731 to 6d8d78f Compare April 27, 2026 01:17

NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 7b73290 to 796c8b3 Compare April 27, 2026 01:17

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 6d8d78f to fffe86c Compare April 27, 2026 01:47

NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 796c8b3 to 63dbe6f Compare April 27, 2026 01:47

NathanFlurry force-pushed the cleanup/db-lifecycle-coverage branch from 63dbe6f to fc55a97 Compare April 27, 2026 03:24

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from fffe86c to 1b1fb61 Compare April 27, 2026 03:24

This was referenced Apr 27, 2026

chore(pegboard): simplify v1 to v2 migration to not rely on sqlite #4797

Closed

fix(rivetkit-core): remove legacy timeouts #4799

Closed

test(rivetkit): re-enable gateway URL and direct-registry coverage #4659

Closed

Base automatically changed from cleanup/db-lifecycle-coverage to main April 27, 2026 07:13

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 3e13940 to f6d9a16 Compare April 27, 2026 07:51

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch 2 times, most recently from 01215ac to 7d7aa2d Compare April 27, 2026 17:39

NathanFlurry marked this pull request as ready for review April 27, 2026 19:05

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from 7d7aa2d to a9093f9 Compare April 27, 2026 19:06

fix(pegboard): move sqlite v1 migration into actor workflow

c1976e3

NathanFlurry force-pushed the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch from a9093f9 to c1976e3 Compare April 27, 2026 20:48

NathanFlurry mentioned this pull request Apr 27, 2026

ci(publish): skip lfs for preview checkouts #4819

Closed

11 tasks

NathanFlurry merged commit c1976e3 into main Apr 27, 2026
7 of 27 checks passed

NathanFlurry deleted the 04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow branch April 27, 2026 22:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pegboard): move sqlite v1 migration into actor workflow#4787

fix(pegboard): move sqlite v1 migration into actor workflow#4787
NathanFlurry merged 1 commit intomainfrom
04-26-fix_pegboard_move_sqlite_v1_migration_into_actor_workflow

NathanFlurry commented Apr 26, 2026

Uh oh!

NathanFlurry commented Apr 26, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 26, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented Apr 26, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

NathanFlurry commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: fix(pegboard): move sqlite v1 migration into actor workflow

Overview

Issues

Migration locks map grows unboundedly

Duplicate constants between impl and test

Three duplicate call sites for migration step

60-second lease may be shorter than the migration itself

WAL/SHM rejection is terminal with no recovery path

Suggestions

Minor style notes (per CLAUDE.md)

Uh oh!

github-actions Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Preview packages published to npm

Uh oh!

claude Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: fix(pegboard): move sqlite v1 migration into actor workflow

Critical

Significant

Moderate

Minor

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NathanFlurry commented Apr 26, 2026 •

edited

Loading

claude Bot commented Apr 26, 2026 •

edited

Loading

github-actions Bot commented Apr 26, 2026 •

edited

Loading

claude Bot commented Apr 27, 2026 •

edited

Loading