Skip to content

fix(pb): fix pool config defaults#4774

Open
MasterPtato wants to merge 1 commit into04-24-fix_pb_remove_actor_sleep_ts_when_allocatedfrom
04-24-fix_pb_fix_pool_config_defaults
Open

fix(pb): fix pool config defaults#4774
MasterPtato wants to merge 1 commit into04-24-fix_pb_remove_actor_sleep_ts_when_allocatedfrom
04-24-fix_pb_fix_pool_config_defaults

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@MasterPtato MasterPtato force-pushed the 04-24-fix_pb_remove_actor_sleep_ts_when_allocated branch from 1afa22c to fd614fc Compare April 25, 2026 01:14
@MasterPtato MasterPtato force-pushed the 04-24-fix_pb_fix_pool_config_defaults branch from a37d543 to ac730ee Compare April 25, 2026 01:14
@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Code Review

This PR makes three targeted bug fixes to pool config defaults. All changes are correct and address real issues. Some notes below.

Change 1: drain_on_version_upgrade default: falsetrue

This is a behavioral change that affects any deployment relying on the implicit default. Draining on version upgrade is the safer operational behavior (avoids routing traffic to stale runners), so true is the better default. Worth noting in a changelog or migration note since existing configs that omitted the field will now behave differently after a redeploy.

Change 2: drain_grace_period default: 10_00010

Clear bug fix. The field is documented in seconds, so the old default of 10_000 was ~2.8 hours rather than the intended 10 seconds. The inline comment claims this mirrors the deprecated config.pegboard.serverless_drain_grace_period value — if that deprecated field was in milliseconds (10,000 ms = 10 s), the comment is now misleading and should be updated or removed to avoid re-introducing the confusion.

Change 3: Division-by-zero guard in runner_pool.rs

The fix prevents a panic when slots_per_runner == 0, but silently falls back to treating every slot as its own runner. Since the field is not Option<u32>, the caller is required to supply a value but there is no upstream validation rejecting zero. Consider whether a zero value should be rejected at config parse time with an explicit error rather than silently over-provisioning. The current fix is still an improvement over panicking.

Minor observations

  • No tests were added or updated to pin the new default values. Given these are config defaults that previously caused silent operational bugs, a unit test asserting the Into conversion produces the correct defaults (10 s grace period, drain_on_version_upgrade = true) would prevent regression.
  • The Serverless fields slots_per_runner, min_runners, max_runners, and runners_margin are all marked deprecated in the struct. If they are no longer the canonical way to configure pool sizing, it may be worth adding a validation warning when they are supplied to nudge users to migrate.

Overall the fixes are correct and the PR is safe to merge. The stale comment on drain_grace_period and the lack of regression tests are the main things worth addressing before or after merging.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 25, 2026

Code Review: fix(pb): fix pool config defaults

Overview

This PR fixes three bugs in pool configuration defaults:

  1. drain_on_version_upgrade default changed from falsetrue
  2. drain_grace_period default changed from 10_00010
  3. Division-by-zero protection when slots_per_runner is 0

Analysis

drain_on_version_upgrade default: falsetrue

Correct change. Draining on version upgrade is the safer default — it avoids killing in-flight actors during a deployment. Opting into disruptive upgrades should be explicit.

✅ Division-by-zero guard: slots_per_runner.max(1)

(adjusted_desired_slots as u32).div_ceil(slots_per_runner.max(1))

Clear defensive fix. slots_per_runner = 0 would panic on integer division, and .max(1) keeps the semantics sensible (0 slots → behave as 1 to avoid division error). Good.

⚠️ drain_grace_period default: 10_00010

The struct documents the field as seconds:

/// Seconds.
drain_grace_period: Option<u32>,

The old default of 10_000 (seconds) = ~2.77 hours is obviously wrong. The deprecated global fallback (config.pegboard.serverless_drain_grace_period) defaults to 10_000 milliseconds = 10 seconds, so the correct equivalent in seconds is 10. Fix is correct.

However, there is a related inconsistency not addressed in this PR:

In engine/sdks/rust/data/src/versioned/namespace_runner_config.rs (v4 → v5 migration), the hardcoded value is still 10_000:

// Default to deprecated config value (config.pegboard.serverless_drain_grace_period)
drain_grace_period: 10_000,  // line 312 — not changed in this PR

If the v5 BARE schema field is also in seconds (matching the live RunnerConfigKind::Serverless type), existing migrated configs will carry a 10,000-second (~2.77 hour) drain grace period. This should be 10 to match the fix applied here.

Also worth noting: in engine/packages/pegboard/src/workflows/serverless/conn.rs, the runner config's drain_grace_period is currently skipped via .. and the deprecated global config is used instead — so this fix does not affect runtime behavior yet. When this field is eventually wired in, the migration value at line 312 in versioned/namespace_runner_config.rs will become load-bearing.


Summary

Change Assessment
drain_on_version_upgrade default → true ✅ Correct, safer default
drain_grace_period default → 10 ✅ Correct unit fix
slots_per_runner.max(1) div-by-zero guard ✅ Correct defensive fix
v4→v5 migration drain_grace_period: 10_000 (line 312 in namespace_runner_config.rs) ⚠️ Not updated — likely the same unit bug

Suggested follow-up: Update the v4→v5 migration default at engine/sdks/rust/data/src/versioned/namespace_runner_config.rs:312 from 10_000 to 10 to match this fix, or document why it should remain different. The comment on that line references the same deprecated config as justification.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant