Skip to content

feat(proxy): credential pool auto-failover + approval-prompt coalescing#43

Merged
nnemirovsky merged 51 commits into
mainfrom
feat/credential-pool-and-approval-coalescing
May 16, 2026
Merged

feat(proxy): credential pool auto-failover + approval-prompt coalescing#43
nnemirovsky merged 51 commits into
mainfrom
feat/credential-pool-and-approval-coalescing

Conversation

@nnemirovsky
Copy link
Copy Markdown
Owner

@nnemirovsky nnemirovsky commented May 16, 2026

Implements two planned features on one branch (squash-merge — history contains the original two-feature merge commits). Plans: docs/plans/completed/20260515-approval-coalescing.md, docs/plans/completed/20260515-credential-pool-failover.md.

1. Approval-prompt coalescing

Bursts of requests to the same target while the first approval is pending now coalesce into one prompt instead of a wall of identical ones. Broker dedups by dest:port (the persistence-equivalent key); one resolve fans out to all attached waiters; the final coalesced count is folded into the existing resolve/cancel edit (zero extra Telegram API calls). MCP tool calls opt out (arg-sensitive). Always Allow persist is idempotent under concurrent fan-out.

2. Credential pools with auto-failover

One pool-scoped phantom identity backed by N OAuth members (primary use case: two OpenAI Codex accounts behind one agent). Phase 0 schema/CLI (sluice pool create/list/status/rotate/remove, migration 000006), Phase 1 single chokepoint + R1 refresh-token attribution (fail-closed) + R3 pool-stable synthetic JWT, Phase 2 auto-failover on 429 / 403-quota / 401 / invalid_grant with synchronous in-memory cooldown, cred_failover audit, non-blocking Telegram notice.

Two critical bugs found by self-review and fixed (with regression tests)

  • Token-endpoint failover attribution: original code attributed token-endpoint auth failures via OAuthIndex.Match (returns first index entry) — for two accounts sharing one token URL it cooled the wrong member and thrashed the dead account. Now recovers the true member via the same injected-refresh-token join key the persist path uses (non-consuming Peek), with active-member fallback.
  • Cooldown durability: a synchronous cooldown could be lost across resolver pointer-swaps / a failed detached store write. Fixed with a single process-wide mutex-guarded PoolHealth shared across all resolver generations.

End-to-end coverage (no deferred gaps)

Both behaviours have real, non-vacuous e2e tests (e2e tag, run by CI Linux+macOS):

  • TestPoolFailover_EndToEnd — two fake OAuth upstreams (real HMAC-JWT issuing) sharing one token URL behind one pool: A used until 429 → next request fails over to B → B's rotated tokens persist to B's vault entry, A's untouched → phantom access token byte-identical before/after switch → cred_failover audit present. Verified to fail when failover is reverted.
  • TestApprovalCoalesce_BurstOnePrompt — gated webhook channel holds the first decision; 8 concurrent requests to one Ask target → exactly one approval call (peak concurrency 1) → single always_allow fans out to all 8. Verified to fail when coalescing is disabled.

(WithNoCoalesce/MCP opt-out is not e2e-expressible — no SOCKS5 burst surface for MCP — and remains unit-covered.)

Validation

  • go test ./...2559 pass, 13/13 packages
  • go test -tags=e2e ./e2e/ — green (~141s)
  • go test -race (proxy/vault/channel) — 0 data races
  • golangci-lint v2.9.00 issues; gofumpt, go vet, go build clean
  • Container e2e (e2e && linux/darwin) run by CI workflows.
  • Streamed (SSE / >5 MiB) 429/401 responses bypass failover — documented in CLAUDE.md alongside the existing streaming-DLP bypass note (an inherent go-mitmproxy streaming limitation, consistent with existing behaviour, not a gap in this work).

Generated via /planning:exec; no Codex used (per constraint).

Migration 000006 adds credential_pools, credential_pool_members and
credential_health tables. Store API: CreatePoolWithMembers (atomic;
namespace mutual-exclusion + oauth-member validation in one tx), GetPool,
ListPools, RemovePool, PoolExists, PoolsForMember, and
Set/Get/ListCredentialHealth. Covered by CRUD/ordering/validation tests
and an up/down migration reversibility test.
PoolResolver is the single place a bound pool name expands to a concrete
credential. IsPool/ResolveActive (first healthy or expired-cooldown member
in position order; degrade to soonest-recovering when all down) and
MarkCooldown for Phase 2 synchronous in-memory failover. Locking
discipline documented: membership immutable per instance (atomic-pointer
swap on reload), health mutated in place under an RWMutex. RateLimit
(60s) and AuthFail (300s) cooldown TTL consts. Nil-safe.
sluice pool create|list|status|rotate|remove. status computes the active
member via the same PoolResolver logic the proxy uses so it never
disagrees with what gets injected; rotate parks the active member
(lazy-recovery cooldown) so the next member takes over. cred add rejects
a name colliding with an existing pool; cred remove is blocked (before
the vault delete) when the credential is a live pool member, so no
dangling member rows or destroyed secrets.
Server gains an atomic PoolResolver pointer (parallel to the binding
resolver), StorePool/PoolResolverPtr, and threads it into SluiceAddon via
WithPoolResolver. addon.resolvePoolMember is the chokepoint helper
(non-pool names passthrough). main.go registers the pool subcommand,
loads the resolver at startup, and rebuilds+atomically swaps it in
reloadAll alongside the binding/oauth reloads. Injection does not consult
it yet (Phase 1).

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

…atomic RemovePoolIfUnreferenced; REST cred-remove missing-secret tolerant

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 52 out of 52 changed files in this pull request and generated 1 comment.

Comment thread internal/telegram/commands.go Outdated
@nnemirovsky nnemirovsky merged commit fb5528d into main May 16, 2026
6 checks passed
@nnemirovsky nnemirovsky deleted the feat/credential-pool-and-approval-coalescing branch May 16, 2026 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants