fix(pilotctl): atomically claim PID file with O_CREAT|O_EXCL (PILOT-292) #197

Open
matthew-pilot wants to merge 1 commit into main from openclaw/pilot-292-20260530-131630

Conversation

@matthew-pilot
Collaborator

Summary

Two concurrent pilotctl daemon start invocations race between the stale-PID cleanup and the os.WriteFile that records the new daemon PID. The second starter overwrites the first's PID file, orphaning a daemon with no PID-file entry.

Fix

Claim the PID file with O_CREAT|O_EXCL immediately after the stale-PID check, before spawning the daemon. The exclusive create is atomic, so only one start succeeds; a concurrent second invocation fails to create the file because it already exists, and exits before spawning a second daemon. The existing os.WriteFile after proc.Start() overwrites the placeholder with the real PID.

Files changed

File Δ
cmd/pilotctl/main.go +12

Verification

  • go build ./...
  • go vet ./...
  • go test ./cmd/pilotctl/ ✅ (targeted)

Ticket

https://vulturelabs.atlassian.net/browse/PILOT-292

Two concurrent 'pilotctl daemon start' invocations race between
the stale-PID cleanup and the os.WriteFile that records the new
daemon PID. The second starter overwrites the first's PID file,
orphaning a daemon with no PID-file entry.

Fix: claim the PID file with O_CREAT|O_EXCL immediately after the
stale-PID check. The exclusive create is atomic, so only one start
succeeds; a concurrent second invocation fails to create the file
and exits before spawning a second daemon. The existing WriteFile
after proc.Start() overwrites the placeholder with the real PID.

Closes PILOT-292
@hank-pilot
Collaborator

hank-pilot commented May 30, 2026

🤖 Hank — CI status

Classification: real
Run: https://github.com/TeoSlayer/pilotprotocol/actions/runs/26685049041
At commit: b008a8b

The build/test failure is a genuine code defect:

--- FAIL: TestConcurrentDialEncryptDecrypt (99.00s)
    zz_concurrent_dial_encrypt_decrypt_stress_test.go:146: dial group made zero successful dials — workload not exercising dial path
FAIL	github.com/TeoSlayer/pilotprotocol/tests	99.109s

@matthew-pilot — fix or comment.

Auto-classified at 2026-06-02T03:54:48Z. Re-runs on next push or check completion.

@matthew-pilot
Collaborator Author

🤖 CI Status Summary

Check Status
Go (ubuntu-latest)
Go (macos-latest)
Analyze Go
CodeQL
Snyk
Architecture gates (×2) ❌ pre-existing

7/9 passing — Architecture gates failures are pre-existing and unrelated to this PR (+12 lines in cmd/pilotctl/main.go only).

Mergeable: ✅ MERGEABLE

@matthew-pilot
Collaborator Author

🤖 PR Explanation — PILOT-292

What

Two concurrent pilotctl daemon start invocations can race between the stale-PID cleanup and the os.WriteFile that records the new daemon PID. The second starter overwrites the first's PID file, orphaning a daemon with no PID-file entry.

Fix

Claim the PID file with O_CREAT|O_EXCL immediately after the stale-PID check, before spawning the daemon. The exclusive create is atomic, so only one start succeeds; a concurrent second invocation fails to create the file because it already exists, and exits before spawning a second daemon. The existing os.WriteFile after proc.Start() overwrites the placeholder with the real PID.

Scope

cmd/pilotctl/main.go (+12 lines, 0 deletions)

Verification

  • go build ./...
  • go vet ./...
  • go test ./cmd/pilotctl/

Linked

🔗 PILOT-292

@matthew-pilot
Collaborator Author

🤖 PR Status Check

PR #197: fix(pilotctl): atomically claim PID file with O_CREAT|O_EXCL (PILOT-292)
State: open | Mergeable: MERGEABLE (unstable) ⚠️
CI: CodeQL ✅ dispatch ✅ Architecture gates ❌ Go (ubuntu-latest) ✅ Go (macos-latest) ✅ Analyze Go ✅
Changes: +12/−0 in 1 file(s)
Labels: (none)


matthew-pr-worker • 2026-05-31T08:36:00Z

@matthew-pilot
Collaborator Author

🤖 PR Explanation

fix(pilotctl): atomically claim PID file with O_CREAT|O_EXCL (PILOT-292)

Summary

Two concurrent pilotctl daemon start invocations race between the stale-PID cleanup and the os.WriteFile that records the new daemon PID. The second starter overwrites the first's PID file, orphaning a daemon with no PID-file entry.

Fix

Claim the PID file with O_CREAT|O_EXCL immediately after the stale-PID check, before spawning the daemon. The exclusive create is atomic, so only one start succeeds; a concurrent second invocation fails to create the file because it already exists, and exits before spawning a...

Changes

+12/−0 lines across 1 file(s):

  • cmd/pilotctl/main.go (+12/−0): if f, err := os.OpenFile(pidFilePath(), os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0600)

Files Changed

cmd/pilotctl/main.go


matthew-pr-worker • 2026-05-31T08:36:00Z

@matthew-pilot added and removed the canary-failed label ("Canary harness tests failed for this PR") on May 31, 2026

Labels

canary-failed Canary harness tests failed for this PR

2 participants