feat: add experimental Codex session supervisor#147
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (13)
📝 WalkthroughWalkthroughadds a gated session supervisor for the codex cli, integrates it into the startup flow, adds a backend toggle and config getter, adds abort-aware quota probing, refactors storage transaction path handling, and adds comprehensive supervisor and storage tests ( Changes
Sequence Diagram(s)sequenceDiagram
participant caller as codex.js
participant supervisor as codex-supervisor.js
participant runtime as runtime/config
participant lock as filesystem-lock
participant storage as storage/accounts
participant quota as quota-probe
participant child as child-process
participant codex as real-codex-binary
caller->>supervisor: runCodexSupervisorIfEnabled(rawArgs)
supervisor->>runtime: getCodexCliSessionSupervisor(pluginConfig)
runtime-->>supervisor: true/false
alt supervisor disabled
supervisor-->>caller: null
caller->>codex: forward args (startup sync)
else supervisor enabled
supervisor->>lock: acquire lock + heartbeat
lock-->>supervisor: acquired
supervisor->>storage: load account manager
storage-->>supervisor: accounts
alt non-interactive or not-eligible
supervisor->>codex: forward args
codex-->>supervisor: exit
else interactive & eligible
supervisor->>quota: probe current account (signal aware)
quota-->>supervisor: snapshot
alt rotate required
supervisor->>child: signal SIGINT/SIGTERM/SIGKILL
child-->>supervisor: exit
supervisor->>storage: commit new account
storage-->>supervisor: persisted
supervisor->>codex: forward args with new account
else prewarm
supervisor->>quota: prewarm next account async
supervisor->>codex: forward current account
end
codex-->>supervisor: exit
end
supervisor->>lock: release + cleanup
supervisor-->>caller: exit code
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes notes (explicit risks to review)
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
8b878c5 to
98c5185
Compare
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/storage.ts (1)
2146-2155:⚠️ Potential issue | 🔴 Criticalcapture the flagged path with the transaction too.
lib/storage.ts:2146-2155freezes the accounts file, butsaveFlaggedAccountsUnlocked()still resolves its destination fromgetFlaggedAccountsPath()at call time inlib/storage.ts:2415-2419. if the handler switches storage roots mid-transaction, accounts are written back to the original path while flagged state goes to the new path, so rollback no longer protects the same storage pair. please thread a captured flagged path through this transaction and add a drift regression alongsidetest/storage.test.ts:1095-1154; the current combined-transaction cases intest/storage.test.ts:1202-1345never exercise this split-write path. as per coding guidelines,lib/**: focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios.suggested fix
async function saveFlaggedAccountsUnlocked( storage: FlaggedAccountStorageV1, + storagePath = getFlaggedAccountsPath(), ): Promise<void> { - const path = getFlaggedAccountsPath(); + const path = storagePath; const markerPath = getIntentionalResetMarkerPath(path); const uniqueSuffix = `${Date.now()}.${Math.random().toString(36).slice(2, 8)}`; const tempPath = `${path}.${uniqueSuffix}.tmp`; @@ export async function withAccountAndFlaggedStorageTransaction<T>( handler: ( @@ ): Promise<T> { return withStorageLock(async () => { const storagePath = getStoragePath(); + const flaggedStoragePath = join(dirname(storagePath), FLAGGED_ACCOUNTS_FILE_NAME); const state = { @@ const persist = async ( accountStorage: AccountStorageV3, flaggedStorage: FlaggedAccountStorageV1, ): Promise<void> => { const previousAccounts = cloneAccountStorageForPersistence(state.snapshot); const nextAccounts = cloneAccountStorageForPersistence(accountStorage); await saveAccountsUnlocked(nextAccounts, storagePath); try { - await saveFlaggedAccountsUnlocked(flaggedStorage); + await saveFlaggedAccountsUnlocked(flaggedStorage, flaggedStoragePath); state.snapshot = nextAccounts; } catch (error) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/storage.ts` around lines 2146 - 2155, The persist transaction freezes account file but calls saveFlaggedAccountsUnlocked which resolves its path at call time; capture the flagged-path up-front (call getFlaggedAccountsPath() inside persist into e.g. flaggedPathCaptured) and pass that path into saveFlaggedAccountsUnlocked (and any helpers that resolve paths) so both accounts and flagged writes target the same storage root; update cloneAccountStorageForPersistence/state.snapshot usage unchanged but ensure saveAccountsUnlocked and saveFlaggedAccountsUnlocked are called with the captured paths, add a regression vitest near the existing storage tests that simulates handler storage-root rotation mid-transaction to assert rollback still restores both files, and update any affected queue logic to explicitly retry/handle EBUSY and 429 per existing queue abstractions as noted in coding guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/codex-manager/settings-hub.ts`:
- Around line 329-333: The codexCliSessionSupervisor toggle is currently listed
in BACKEND_TOGGLE_OPTIONS but not rendered by BACKEND_CATEGORY_OPTIONS, so
backendSettingsSnapshot() / buildBackendConfigPatch() will include it in backend
resets and silently clear it; fix by removing "codexCliSessionSupervisor" from
BACKEND_TOGGLE_OPTIONS (or alternatively add it to BACKEND_CATEGORY_OPTIONS and
ensure the backend UI renders it) so that the key is only in the shared backend
option set if the backend UI can actually display and preserve it—update the
array where BACKEND_TOGGLE_OPTIONS is defined and adjust any UI rendering logic
if you choose to move it into BACKEND_CATEGORY_OPTIONS.
In `@package.json`:
- Line 54: Update the smoke npm script "test:session-supervisor:smoke" to
include the storage regressions by adding test/storage.test.ts to the list so
the smoke run exercises the changed storage logic (the path-equivalence and
transaction/export safety code in lib/storage.ts around the path-drift and
windows-handling sections). Modify the package.json target that currently lists
tests like test/codex-supervisor.test.ts to also run test/storage.test.ts, then
commit the package.json change so CI's smoke job covers the storage changes.
In `@scripts/codex-supervisor.js`:
- Around line 551-585: Add a short in-source comment above the pendingRefresh
handler explaining that the read-then-write window in pendingRefresh (which
calls readSupervisorLockPayload and then writeSupervisorLockPayload using
lockPath, ownerId, acquiredAt, ttlMs) is not a critical TOCTOU because lock
acquisition uses exclusive create ('wx') so a concurrent supervisor cannot
silently take the same filename, and the code explicitly checks payload.ownerId
and calls stopWithError on mismatch; mention these two protections
(exclusive-create on acquisition and ownerId mismatch check) so future readers
understand the mitigation.
- Around line 34-35: The module-level maps snapshotProbeCache and
sessionRolloutPathById persist across process reloads and need a public reset
entrypoint; add and export a production-safe cache-reset function (e.g., export
function resetSupervisorCaches() or two exports resetSnapshotProbeCache() and
resetSessionRolloutPathCache()) that calls snapshotProbeCache.clear() and
sessionRolloutPathById.clear(), and document/ensure callers (such as the
supervisor runtime reload path) invoke it when reloading; keep the existing
__testOnly.clearAllProbeSnapshotCache and
__testOnly.clearSessionBindingPathCache behavior but provide this new public API
for embedded/long-running use.
- Around line 114-118: The parseNumberEnv helper currently signature
parseNumberEnv(name, fallback, min = 0) ignores an intended max bound; update
the signature to accept a fourth parameter (e.g., max = Infinity) and clamp the
parsed value between min and max (use Math.trunc then Math.max(min,
Math.min(max, value))). Modify parseNumberEnv accordingly and ensure callers
that pass a 4th arg (e.g., places that call parseNumberEnv(..., 0, 100) for
CODEX_AUTH_PREEMPTIVE_QUOTA_5H_REMAINING_PCT) behave correctly without further
changes.
In `@scripts/codex.js`:
- Around line 541-554: The branch after runCodexSupervisorIfEnabled that returns
supervisedExitCode without triggering autoSyncManagerActiveSelectionIfEnabled is
untested for the case supervisedExitCode === 0 and supervisorDidForward ===
false; add a deterministic vitest regression that simulates
runCodexSupervisorIfEnabled returning 0 while supervisorDidForward is false and
rawArgs is an interactive command (use isSupervisorInteractiveCommand to pick
args), then assert autoSyncManagerActiveSelectionIfEnabled was called once
(mock/spied) and the wrapper returns 0; reference runCodexSupervisorIfEnabled,
supervisedExitCode, supervisorDidForward, isSupervisorInteractiveCommand and
autoSyncManagerActiveSelectionIfEnabled in the test and avoid flaky timing by
stubbing dependencies instead of real concurrency or filesystem operations.
In `@test/codex-bin-wrapper.test.ts`:
- Around line 68-125: The fixture always hard-codes codexCliSessionSupervisor:
true in writeSupervisorRuntimeFixture which masks runtime env precedence; change
writeSupervisorRuntimeFixture to accept a boolean (e.g., enableSupervisor =
false) and write codexCliSessionSupervisor accordingly, then update the wrapper
test cases that previously relied on the fixture to call
writeSupervisorRuntimeFixture(fixtureRoot, false) and enable the supervisor only
via CODEX_AUTH_CLI_SESSION_SUPERVISOR=1 in the specific test that needs it so
tests exercise the env-path behavior rather than the hard-coded config.
In `@test/codex-supervisor.test.ts`:
- Around line 1326-1328: The busy-wait loop polling setImmediate to wait for
startedAccounts.size to reach 3 is flaky; replace it with a deterministic
deferred/promise that resolves when startedAccounts.size === 3 (and include a
short timeout fallback). Concretely, create a Promise (deferred) in the test and
have whatever code increments the startedAccounts Set resolve that promise as
soon as it observes startedAccounts.size === 3; then await that promise instead
of the for-loop. Ensure the promise also rejects or resolves after a reasonable
timeout to avoid hanging tests.
- Around line 9-26: The env cleanup list envKeys in
test/codex-supervisor.test.ts is missing the environment variable
CODEX_AUTH_CLI_SESSION_LOCK_TTL_MS which the tests set; add
"CODEX_AUTH_CLI_SESSION_LOCK_TTL_MS" to the envKeys const so the test harness
clears it between tests (update the envKeys array near the existing CODEX_AUTH_*
entries and re-run the test suite to ensure no state leakage).
In `@test/plugin-config.test.ts`:
- Around line 991-1008: The test suite misses the explicit environment opt-out
case: add a deterministic vitest test that sets
process.env.CODEX_AUTH_CLI_SESSION_SUPERVISOR = '0', calls
getCodexCliSessionSupervisor({ codexCliSessionSupervisor: true }) and asserts
the result is false, then cleans up the env var; this verifies the env override
in getCodexCliSessionSupervisor correctly wins over a true config value and
prevents regressions in the opt-out branch.
In `@test/storage.test.ts`:
- Around line 1008-1044: Update the test that simulates Windows platform to also
exercise separator normalization: when running inside
withAccountStorageTransaction (using setStoragePathDirect, exportAccounts,
saveAccounts, testStoragePath, exportPath) set the storage path to a
Windows-style path with backslashes (e.g., transform testStoragePath to use "\"
separators) in addition to uppercasing, so the export path comparison covers
both case-insensitivity and separator-insensitivity; ensure the modified test
still toggles process.platform to "win32" and restores it in finally, and assert
the exported accountId as before.
---
Outside diff comments:
In `@lib/storage.ts`:
- Around line 2146-2155: The persist transaction freezes account file but calls
saveFlaggedAccountsUnlocked which resolves its path at call time; capture the
flagged-path up-front (call getFlaggedAccountsPath() inside persist into e.g.
flaggedPathCaptured) and pass that path into saveFlaggedAccountsUnlocked (and
any helpers that resolve paths) so both accounts and flagged writes target the
same storage root; update cloneAccountStorageForPersistence/state.snapshot usage
unchanged but ensure saveAccountsUnlocked and saveFlaggedAccountsUnlocked are
called with the captured paths, add a regression vitest near the existing
storage tests that simulates handler storage-root rotation mid-transaction to
assert rollback still restores both files, and update any affected queue logic
to explicitly retry/handle EBUSY and 429 per existing queue abstractions as
noted in coding guidelines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 10ee12b2-f7cf-4b80-b8cf-a2aeaefb3006
📒 Files selected for processing (16)
lib/codex-manager/settings-hub.tslib/config.tslib/quota-probe.tslib/schemas.tslib/storage.tslib/ui/copy.tspackage.jsonscripts/codex-routing.jsscripts/codex-supervisor.jsscripts/codex.jstest/codex-bin-wrapper.test.tstest/codex-supervisor.test.tstest/plugin-config.test.tstest/quota-probe.test.tstest/settings-hub-utils.test.tstest/storage.test.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (2)
lib/**
⚙️ CodeRabbit configuration file
focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.
Files:
lib/schemas.tslib/ui/copy.tslib/config.tslib/quota-probe.tslib/codex-manager/settings-hub.tslib/storage.ts
test/**
⚙️ CodeRabbit configuration file
tests must stay deterministic and use vitest. demand regression cases that reproduce concurrency bugs, token refresh races, and windows filesystem behavior. reject changes that mock real secrets or skip assertions.
Files:
test/quota-probe.test.tstest/settings-hub-utils.test.tstest/codex-bin-wrapper.test.tstest/plugin-config.test.tstest/storage.test.tstest/codex-supervisor.test.ts
🪛 ast-grep (0.41.1)
test/storage.test.ts
[warning] 998-1002: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(
storage path mismatch: transaction path is ${escapeRegExp( testStoragePath, )}, active path is ${escapeRegExp(secondaryStoragePath)},
)
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html
(regexp-from-variable)
🔇 Additional comments (18)
scripts/codex-supervisor.js (8)
37-60: lgtm - sleep with abort support.proper cleanup of abort listeners and timer unref for process exit. the early return on
signal?.aborted(line 39-41) prevents unnecessary timer allocation.
120-130: lgtm - env overrides read at call time.
getMaxAccountSelectionAttempts()andgetMaxSessionRestarts()correctly read env vars at invocation rather than module load, honoring runtime overrides. this aligns with the commit message fix.
717-749: lgtm - windows transient lock failures handled.
EPERMandEBUSYduring lock creation trigger retry after checking for stale locks (line 722-724). test coverage attest/codex-supervisor.test.ts:598-648validates both error codes. the sleep at line 745 respects the abort signal.
1088-1095: lgtm - correct abort vs unavailable distinction.line 1088-1093 correctly differentiates: the caller who aborted gets
AbortError, while concurrent waiters on the same probe getQuotaProbeUnavailableError. this prevents aborting one caller from killing other waiters. test attest/codex-supervisor.test.ts:935-977validates this behavior.
1729-1754: lgtm - platform-aware signal escalation.non-windows gets
SIGINTfirst (line 1742), thenSIGTERM, thenSIGKILL. windows skipsSIGINTsince it's not well-supported. tests attest/codex-supervisor.test.ts:448-468(windows) andtest/codex-supervisor.test.ts:470-487(linux) cover both paths.
2150-2155: lgtm - prewarm selection cleanup in finally.
preparedResumeSelectionLink.cleanup()at line 2151 aborts the linked controller, and line 2152-2154 awaits the promise with.catch(() => null)to prevent unhandled rejection. this ensures the prewarm probe is cancelled when the session exits without rotation. test attest/codex-supervisor.test.ts:777-863validates this.
2203-2241: lgtm - signal handler cleanup.
process.onceat lines 2211-2212 andprocess.offat lines 2238-2239 ensureSIGINT/SIGTERMhandlers are properly registered and removed. the finally block guarantees cleanup even on early returns.
1541-1583: lgtm - stream cleanup in finally.
extractSessionMetaproperly cleans uplineReaderandstreamin the finally block (lines 1577-1579). the optional chaining handles cases where creation fails early.test/codex-supervisor.test.ts (10)
31-48: lgtm - robust temp dir cleanup for windows.
removeDirectoryWithRetryhandlesENOTEMPTY,EPERM,EBUSYwith backoff. this is necessary on windows where file handles may linger briefly.
64-158: lgtm - FakeManager test double.correctly implements the account management interface with stateful cooling down, index tracking, and candidate selection.
saveToDiskis a no-op which is appropriate for unit tests.
224-254: lgtm - fake quota probe with abort support.
fetchCodexQuotaSnapshotat line 224-253 correctly handles abort signals and supports test hooks (onFetch,onFetchStart,waitForFetchByAccountId). this enables testing concurrent probe sharing and abort scenarios.
650-705: lgtm - lock serialization test."serializes concurrent callers behind the supervisor storage lock" uses deferreds to control entry/exit timing deterministically. the order assertion at line 704 validates that
first-enter→first-exit→second-enter.
1086-1162: lgtm - token refresh race regression.this test validates that
commitPreparedSelectionre-resolves the account from storage at commit time (lines 1147-1160), ensuring stale prewarm snapshots don't override a refreshed token. this is a critical regression case per coding guidelines.
448-487: lgtm - signal escalation tests.lines 448-468 test abort-interrupted restart waits, lines 470-487 test the
SIGINT→SIGTERM→SIGKILLsequence on linux. both usevi.useFakeTimers()for deterministic timing.
1714-1769: lgtm - monitor loop failure test.validates that exceptions from
loadCurrentStateduring the monitor loop (line 1756-1758) result in exit code 1 and a user-facing error message on stderr. important for debugging production issues.
257-272: lgtm - afterEach cleanup.properly restores env vars, clears supervisor caches, resets timers, and removes temp directories. reverse iteration at line 269 handles nested directory cleanup correctly.
274-299: lgtm - boundary test for session metadata scan.tests the 200-line scan limit behavior: line 199 metadata is found (lines 275-299), line 201 metadata is missed (lines 301-321). deterministic test data with explicit assertions.
1561-1598: regression test for storage path mismatch already exists.test/storage.test.ts:964-1005 validates the exact behavioral change—"fails fast when export is called from an active transaction after the storage path changes" explicitly tests that exportAccounts throws with "storage path mismatch: transaction path is X, active path is Y" when the path diverges during a transaction.
|
@coderabbitai full review @coderabbitai summary |
|
@greptileai please review this PR again and refresh the summary for the latest branch state. |
|
✅ Actions performedSummary regeneration triggered. |
|
@greptileai please review this PR again and refresh the summary for the latest branch state. |
|
I rebuilt this from fresh
Closing this one so review stays on the clean replacements. |
Summary
Adds an opt-in Codex CLI session supervisor that keeps interactive sessions on a healthy account, integrates quota-aware rotation into startup, and hardens the storage and quota paths needed to make supervisor handoffs safe.
What Changed
scripts/codex-supervisor.jsand routedscripts/codex.jsthrough it whencodexCliSessionSupervisoris enabled.codexCliSessionSupervisorsetting.realpathchecks so exports do not false-fail when the active path and transaction path resolve to the same file through different prefixes.Validation
Passed:
npx vitest run test/quota-probe.test.tsnpx vitest run test/codex-supervisor.test.tsnpx vitest run test/storage.test.tsnpm run test:session-supervisor:smokenpm run lintnpm run typechecknpm run buildBlocked Baseline
npm teststill fails on an inherited docs issue already present onorigin/main:README.mdanddocs/README.mdboth link to missingdocs/releases/v1.2.0.md.Docs Impact
origin/main.Risk and Rollback
codexCliSessionSupervisor, or revert this PR branch if the experiment needs to be backed out.Notes
note: greptile review for oc-chatgpt-multi-auth. cite files like
lib/foo.ts:123. confirm regression tests + windows concurrency/token redaction coverage.Greptile Summary
this pr adds an opt-in
codexCliSessionSupervisorthat monitors running codex sessions, probes quota, and transparently rotates to a healthy account viacodex resume <sessionId>when a rate-limit or near-exhaustion is detected. it's disabled by default behind an experimental flag and integrates into the existingscripts/codex.jsentrypoint. the storage and quota-probe layers are hardened with symlink-aware path comparison, abort-signal propagation, and a safer stale-lock removal path.key points:
scripts/codex-supervisor.js(2418 lines) is the core: lock acquisition, heartbeat-based lease renewal, quota probing with a bounded snapshot cache, session-binding discovery via.jsonlscan, and a retry loop with pre-warminglib/storage.tsreplaces a silent branch-2 fallback inexportAccountswith a hard throw on storage-path mismatch, guarded by asyncrealpath-aware path comparisonlib/quota-probe.tsaddsAbortSignalsupport with proper listener cleanup and abort-before-fallback guardssafeUnlinkplatform guard,parseNumberEnvmax clamp, stale-lock TOCTOU,buildCodexChildEnvenv sanitisation, module-level constant baking) appear addressed in this revisionhandle.writeFilethrows afterfs.open("wx")succeeds — subsequent acquisitions seeEEXIST, fail to parse the empty payload, and then observe a recent mtime soisSupervisorLockStalereturns false, blocking until TTL; (2)normalizeStorageComparisonPathswallows allrealpatherrors (not justENOENT), causing a silent fallback toresolvePathunder permission errors which can produce false path-mismatch export blockslistJsonlFileshas no file-count limit — a large sessions directory causes unbounded I/O on every polling cycle (refreshed every ~2 s); this is a windows filesystem performance riskConfidence Score: 3/5
catch {}innormalizeStorageComparisonPathsilently degrades symlink detection under permission errors. the unboundedlistJsonlFilesscan is a latency risk but not a correctness bug. test coverage is strong overall but the stranded-lock path has no vitest test.Important Files Changed
normalizeStorageComparisonPathsilently swallows EPERM/EACCES (should only catch ENOENT); also adds defensively-correct path-mismatch throw replacing the silent fallback; sequential awaits inareEquivalentStoragePathsare a minor inefficiencysupervisorDidStartupSyncflag correctly gates the post-session sync, andsupervisorDidForwardcorrectly handles the non-supervisor fallback pathAbortSignalsupport with correct abort-forwarding, listener cleanup, and abort-before-fallback guards; abort signal is properly propagated through the fetch controllernormalizeStorageComparisonPathSequence Diagram
sequenceDiagram participant codex.js participant supervisor as codex-supervisor.js participant runtime as loadSupervisorRuntime participant lock as withSupervisorStorageLock participant manager as AccountManager participant child as codex child process participant monitor as monitor loop codex.js->>supervisor: runCodexSupervisorIfEnabled() supervisor->>runtime: loadSupervisorRuntime() runtime-->>supervisor: { loadPluginConfig, fetchCodexQuotaSnapshot, ... } supervisor->>lock: ensureLaunchableAccount() lock->>manager: loadFromDisk() manager-->>lock: accounts lock->>runtime: fetchCodexQuotaSnapshot(account) runtime-->>lock: snapshot / QuotaProbeUnavailableError lock-->>supervisor: { ok, account } supervisor->>supervisor: syncBeforeLaunch() supervisor->>child: spawnRealCodex(codexBin, args, buildCodexChildEnv()) supervisor->>monitor: start monitor loop loop poll every 300ms monitor->>runtime: probeAccountSnapshot(currentAccount) runtime-->>monitor: snapshot monitor->>monitor: computeQuotaPressure() alt pressure.prewarm monitor->>supervisor: maybeStartPreparedResumeSelection() end alt pressure.rotate && idle monitor->>child: requestChildRestart(SIGINT→SIGTERM→SIGKILL) monitor->>monitor: set requestedRestart end end child-->>supervisor: exit(code) supervisor->>lock: markCurrentAccountForRestart() supervisor->>lock: commitPreparedSelection() or ensureLaunchableAccount() lock->>manager: persistActiveSelection() supervisor->>supervisor: syncBeforeLaunch() supervisor->>child: spawnRealCodex(resume sessionId, buildCodexChildEnv()) supervisor-->>codex.js: exitCodeComments Outside Diff (2)
scripts/codex-supervisor.js, line 704-717 (link)lock file stranded on write failure — blocks future acquisition until TTL expires
fs.open(lockPath, "wx")creates the file exclusively, but ifhandle.writeFilethen throws (e.g.,ENOSPC,EACCES, intermittent I/O), thefinallyblock only closes the handle without removing the empty file. the outer catch re-throws because thewriteFileerror code is not a transient lock-acquire code (EEXIST/ windowsEPERM/EBUSY), so the stranded file is never cleaned up.on the next acquisition attempt any process will see
EEXIST.isSupervisorLockStalewill getnullfromJSON.parse("")→ falls through tofs.stat→ mtime is very recent →isSupervisorLockStalereturnsfalse. the lock is stuck until TTL expires (30 s default), which is the same class of concurrency problem as the stale-lock TOCTOU. this is a windows filesystem safety risk too since NTFS can produceEPERMon short-lived file contention.no vitest coverage exists for this path — a test that mocks
handle.writeFileto throwENOSPCand verifies the lock file is cleaned up and a subsequent acquisition succeeds would close the gap.lib/storage.ts, line 383-390 (link)realpathcatch swallows EPERM/EACCES — silent fallback breaks symlink detection under permission errorsthe
catch {}is documented as "path does not exist yet" but also silently eats permission errors (EACCES,EPERM). if the parent directory of the storage path is unreadable (common in sandboxed CI or on windows with locked directories),realpaththrowsEACCESand the function falls back toresolvePath. two paths that are symlinked to the same underlying file then compare as unequal, causingareEquivalentStoragePathsto returnfalseand theexportAccountspath-mismatch guard to throw unnecessarily.restrict the catch to only
ENOENTto match the stated intent:this also ensures unexpected filesystem errors surface instead of producing a silent wrong answer. worth a vitest coverage case for the
ENOENTvsEACCESdistinction.Prompt To Fix All With AI
Last reviewed commit: "fix: pin flagged sto..."