feat(proxy): credential pool auto-failover + approval-prompt coalescing #43
Merged
Migration 000006 adds credential_pools, credential_pool_members and credential_health tables. Store API: CreatePoolWithMembers (atomic; namespace mutual-exclusion + oauth-member validation in one tx), GetPool, ListPools, RemovePool, PoolExists, PoolsForMember, and Set/Get/ListCredentialHealth. Covered by CRUD/ordering/validation tests and an up/down migration reversibility test.
PoolResolver is the single place a bound pool name expands to a concrete credential. IsPool/ResolveActive (first healthy or expired-cooldown member in position order; degrade to soonest-recovering when all down) and MarkCooldown for Phase 2 synchronous in-memory failover. Locking discipline documented: membership immutable per instance (atomic-pointer swap on reload), health mutated in place under an RWMutex. RateLimit (60s) and AuthFail (300s) cooldown TTL consts. Nil-safe.
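The selection rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual `PoolResolver`: the type `poolResolver`, the constant names, the explicit `now` parameter and the fixed `epoch` reference time are all assumptions made so the example is deterministic. Only the 60s/300s TTLs, the position-order rule, the degrade-to-soonest rule and the nil-safety come from the text above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Cooldown TTLs from the commit message; the identifier names are assumed.
const (
	RateLimitCooldown = 60 * time.Second
	AuthFailCooldown  = 300 * time.Second
)

// epoch is a fixed reference time used only to keep this example deterministic.
var epoch = time.Unix(0, 0)

// poolResolver is a hypothetical stand-in for PoolResolver.
type poolResolver struct {
	// members maps pool name -> members in position order.
	// It is never mutated after construction (a reload builds a new instance).
	members map[string][]string

	mu       sync.RWMutex
	cooldown map[string]time.Time // member -> time its cooldown expires
}

func newPoolResolver(members map[string][]string) *poolResolver {
	return &poolResolver{members: members, cooldown: map[string]time.Time{}}
}

// IsPool reports whether name is a pool. Safe on a nil receiver.
func (r *poolResolver) IsPool(name string) bool {
	if r == nil {
		return false
	}
	_, ok := r.members[name]
	return ok
}

// ResolveActive returns the first member, in position order, that is healthy
// or whose cooldown has expired. If every member is cooling down it degrades
// to the member that recovers soonest.
func (r *poolResolver) ResolveActive(pool string, now time.Time) (string, bool) {
	if r == nil || len(r.members[pool]) == 0 {
		return "", false
	}
	r.mu.RLock()
	defer r.mu.RUnlock()
	ms := r.members[pool]
	soonest := ms[0]
	for _, m := range ms {
		until, cooling := r.cooldown[m]
		if !cooling || !now.Before(until) {
			return m, true
		}
		if until.Before(r.cooldown[soonest]) {
			soonest = m
		}
	}
	return soonest, true
}

// MarkCooldown parks a member for ttl, starting at now.
func (r *poolResolver) MarkCooldown(member string, ttl time.Duration, now time.Time) {
	if r == nil {
		return
	}
	r.mu.Lock()
	r.cooldown[member] = now.Add(ttl)
	r.mu.Unlock()
}

func main() {
	r := newPoolResolver(map[string][]string{"codex": {"acct-a", "acct-b"}})

	m, _ := r.ResolveActive("codex", epoch)
	fmt.Println(m) // acct-a: first healthy member

	r.MarkCooldown("acct-a", AuthFailCooldown, epoch)
	m, _ = r.ResolveActive("codex", epoch)
	fmt.Println(m) // acct-b: next member takes over

	r.MarkCooldown("acct-b", RateLimitCooldown, epoch)
	m, _ = r.ResolveActive("codex", epoch)
	fmt.Println(m) // acct-b: all down, it recovers soonest

	m, _ = r.ResolveActive("codex", epoch.Add(AuthFailCooldown+time.Second))
	fmt.Println(m) // acct-a: both cooldowns expired, position order wins
}
```

Passing `now` explicitly is a choice made for this sketch so the behaviour can be checked without sleeping; the real resolver may read the clock itself.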
sluice pool create|list|status|rotate|remove. status computes the active member via the same PoolResolver logic the proxy uses so it never disagrees with what gets injected; rotate parks the active member (lazy-recovery cooldown) so the next member takes over. cred add rejects a name colliding with an existing pool; cred remove is blocked (before the vault delete) when the credential is a live pool member, so no dangling member rows or destroyed secrets.
Server gains an atomic PoolResolver pointer (parallel to the binding resolver), StorePool/PoolResolverPtr, and threads it into SluiceAddon via WithPoolResolver. addon.resolvePoolMember is the chokepoint helper (non-pool names pass through). main.go registers the pool subcommand, loads the resolver at startup, and rebuilds+atomically swaps it in reloadAll alongside the binding/oauth reloads. Injection does not consult it yet (Phase 1).
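The reload pattern described here (build a fresh resolver, then swap one pointer) might look roughly like the sketch below. It uses `atomic.Pointer` from Go 1.19+; the types `resolver` and `server` and the method `Resolve` are hypothetical, and only the names `StorePool` and `resolvePoolMember` are borrowed from the commit message. Their real signatures are not shown in this PR text, so the ones here are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// resolver is a hypothetical, immutable snapshot of pool membership.
type resolver struct {
	pools map[string][]string // pool name -> members in position order
}

// resolvePoolMember expands a pool name to its first member and passes
// non-pool names through unchanged. It is safe on a nil receiver, so
// callers work before any resolver has been stored.
func (r *resolver) resolvePoolMember(name string) string {
	if r == nil {
		return name
	}
	if ms, ok := r.pools[name]; ok && len(ms) > 0 {
		return ms[0]
	}
	return name
}

// server holds the current resolver behind an atomic pointer so readers
// never take a lock and never observe a half-built resolver.
type server struct {
	poolResolver atomic.Pointer[resolver]
}

// StorePool publishes a fully built resolver in one atomic step.
func (s *server) StorePool(r *resolver) { s.poolResolver.Store(r) }

// Resolve loads whichever resolver is current and expands the name.
func (s *server) Resolve(name string) string {
	return s.poolResolver.Load().resolvePoolMember(name)
}

func main() {
	s := &server{}
	fmt.Println(s.Resolve("codex")) // codex: nothing loaded yet, passthrough

	s.StorePool(&resolver{pools: map[string][]string{"codex": {"acct-a", "acct-b"}}})
	fmt.Println(s.Resolve("codex"))  // acct-a
	fmt.Println(s.Resolve("github")) // github: not a pool

	// A reload builds a new snapshot and swaps it in; in-flight readers
	// holding the old pointer keep a consistent view.
	s.StorePool(&resolver{pools: map[string][]string{"codex": {"acct-b"}}})
	fmt.Println(s.Resolve("codex")) // acct-b
}
```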
…en cooldown durability
… JWT payload marshaling, coalesced-count accuracy, single-pool membership)
…WT phantom swap in query/path
…aware QUIC injection
…atomic RemovePoolIfUnreferenced; REST cred-remove missing-secret tolerant
….md QUIC R3 accuracy
… TestCancelApprovalRendersCoalescedCount)
…00 for RemoveCredentialFully
…ils; completed-plan R3 doc accuracy
…v-var binding failure
Implements two planned features on one branch (squash-merge; history contains the original two-feature merge commits). Plans: `docs/plans/completed/20260515-approval-coalescing.md`, `docs/plans/completed/20260515-credential-pool-failover.md`.

1. Approval-prompt coalescing
Bursts of requests to the same target while the first approval is pending now coalesce into one prompt instead of a wall of identical ones. Broker dedups by `dest:port` (the persistence-equivalent key); one resolve fans out to all attached waiters; the final coalesced count is folded into the existing resolve/cancel edit (zero extra Telegram API calls). MCP tool calls opt out (arg-sensitive). `Always Allow` persist is idempotent under concurrent fan-out.

2. Credential pools with auto-failover
One pool-scoped phantom identity backed by N OAuth members (primary use case: two OpenAI Codex accounts behind one agent). Phase 0 schema/CLI (`sluice pool create/list/status/rotate/remove`, migration `000006`), Phase 1 single chokepoint + R1 refresh-token attribution (fail-closed) + R3 pool-stable synthetic JWT, Phase 2 auto-failover on 429 / 403-quota / 401 / `invalid_grant` with synchronous in-memory cooldown, `cred_failover` audit, non-blocking Telegram notice.

Two critical bugs found by self-review and fixed (with regression tests)
- `OAuthIndex.Match` (returns first index entry): for two accounts sharing one token URL it cooled the wrong member and thrashed the dead account. Now recovers the true member via the same injected-refresh-token join key the persist path uses (non-consuming `Peek`), with active-member fallback.
- `PoolHealth` shared across all resolver generations.

End-to-end coverage (no deferred gaps)
Both behaviours have real, non-vacuous e2e tests (`e2e` tag, run by CI Linux+macOS):
- `TestPoolFailover_EndToEnd`: two fake OAuth upstreams (real HMAC-JWT issuing) sharing one token URL behind one pool: A used until 429 → next request fails over to B → B's rotated tokens persist to B's vault entry, A's untouched → phantom access token byte-identical before/after switch → `cred_failover` audit present. Verified to fail when failover is reverted.
- `TestApprovalCoalesce_BurstOnePrompt`: gated webhook channel holds the first decision; 8 concurrent requests to one Ask target → exactly one approval call (peak concurrency 1) → single `always_allow` fans out to all 8. Verified to fail when coalescing is disabled.

(`WithNoCoalesce`/MCP opt-out is not e2e-expressible, as there is no SOCKS5 burst surface for MCP, and remains unit-covered.)

Validation
- `go test ./...`: 2559 pass, 13/13 packages
- `go test -tags=e2e ./e2e/`: green (~141s)
- `go test -race` (proxy/vault/channel): 0 data races
- `golangci-lint v2.9.0`: 0 issues; `gofumpt`, `go vet`, `go build` clean
- …`e2e && linux/darwin`) run by CI workflows.

Generated via `/planning:exec`; no Codex used (per constraint).
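As an illustration of the approval coalescing described in section 1, the sketch below keys pending approvals by `dest:port`, attaches later requests to the pending entry, and fans a single decision out to every waiter while returning the coalesced count. It is a simplified model under stated assumptions: the `broker` and `pending` types, the `Request`/`Resolve` methods and the `prompts` counter are hypothetical names, and the real broker's API, channel integration and MCP opt-out are not shown in this PR text.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"sync"
)

// pending is one outstanding approval shared by all coalesced requests.
type pending struct {
	done    chan struct{} // closed once a decision is made
	allow   bool          // written before done is closed
	waiters int           // number of requests attached so far
}

// broker is a hypothetical, minimal approval broker.
type broker struct {
	mu      sync.Mutex
	pending map[string]*pending
	prompts int // how many prompts were actually sent
}

func newBroker() *broker { return &broker{pending: map[string]*pending{}} }

// Request attaches to the pending approval for dest:port, opening a new one
// (and counting one prompt) only if none exists. The returned function blocks
// until the decision arrives.
func (b *broker) Request(dest string, port int) (wait func() bool) {
	key := net.JoinHostPort(dest, strconv.Itoa(port))
	b.mu.Lock()
	defer b.mu.Unlock()
	p, ok := b.pending[key]
	if !ok {
		p = &pending{done: make(chan struct{})}
		b.pending[key] = p
		b.prompts++
	}
	p.waiters++
	return func() bool { <-p.done; return p.allow }
}

// Resolve delivers one decision to every attached waiter and returns the
// coalesced count, which a caller could fold into its resolve message.
func (b *broker) Resolve(dest string, port int, allow bool) int {
	key := net.JoinHostPort(dest, strconv.Itoa(port))
	b.mu.Lock()
	p, ok := b.pending[key]
	delete(b.pending, key) // later requests open a fresh approval
	b.mu.Unlock()
	if !ok {
		return 0
	}
	p.allow = allow
	close(p.done)
	return p.waiters
}

func main() {
	b := newBroker()
	var waits []func() bool
	for i := 0; i < 8; i++ {
		waits = append(waits, b.Request("api.example.com", 443))
	}
	n := b.Resolve("api.example.com", 443, true)
	allowed := 0
	for _, w := range waits {
		if w() {
			allowed++
		}
	}
	fmt.Println(b.prompts, n, allowed) // 1 8 8
}
```

Removing the entry from the map before closing the channel means a request arriving after the decision starts a new approval instead of attaching to a resolved one.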