fix(gcp): rebuild credentials when token cache is poisoned (LOG-9470) #276
vparfonov wants to merge 1 commit into
Walkthrough
GcpAuthenticator is refactored to store OAuth credentials in an atomic, ArcSwap-backed CredentialsState wrapper. Constructors build this wrapper instead of storing a bare Arc of the credentials directly. On token refresh failure, the authenticator attempts to rebuild the credentials with retry backoff, verifies the rebuilt credentials, and swaps them in; if every rebuild attempt fails, it retries on subsequent refresh cycles. Tests validate credential swapping and rebuild failure modes.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jcantrill, vparfonov
…soning

The google-cloud-auth library's internal TokenCache permanently stops its refresh loop on non-transient errors (e.g. 403, file I/O race during kubelet token rotation). Once stopped, all subsequent access_token() calls return the cached error forever: the credentials are permanently poisoned.

Add credential rebuild logic: when the token refresh fails, Vector now re-creates the AccessTokenCredentials from the original config (path + scopes), verifies the new credentials can obtain a token, and swaps them in via ArcSwap. This gives the system a fresh TokenCache with a healthy refresh loop.

Up to 3 rebuild attempts are made at 30-second intervals before falling back to the normal 55-minute refresh cycle. The rebuild runs on spawn_blocking to keep the Mutex acquisition and file I/O off the async executor. Token reads use lock-free ArcSwap instead of RwLock to avoid contention on the per-request hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
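The bounded retry described in the commit message can be sketched roughly as follows. This is a minimal, synchronous illustration and not the PR's code: `try_rebuild`, `RebuildOutcome`, and the `rebuild_and_verify` closure are hypothetical names, the real implementation is async on tokio, and whether it waits before the first attempt or only between attempts is an assumption here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of one rebuild cycle in this sketch (hypothetical type).
#[derive(Debug, PartialEq)]
enum RebuildOutcome {
    /// Fresh credentials were built and verified on the given 1-based attempt.
    Rebuilt { attempt: u32 },
    /// Every attempt failed; the caller falls back to its normal refresh cycle.
    GaveUp,
}

/// Calls `rebuild_and_verify` up to `max_attempts` times, pausing `interval`
/// after each failed attempt except the last one.
fn try_rebuild<F>(
    max_attempts: u32,
    interval: Duration,
    mut rebuild_and_verify: F,
) -> RebuildOutcome
where
    F: FnMut() -> Result<(), String>,
{
    for attempt in 1..=max_attempts {
        match rebuild_and_verify() {
            // Verified credentials: stop retrying immediately.
            Ok(()) => return RebuildOutcome::Rebuilt { attempt },
            // Failed, but attempts remain: wait before trying again.
            Err(_) if attempt < max_attempts => sleep(interval),
            // Failed on the final attempt: fall through and give up.
            Err(_) => {}
        }
    }
    RebuildOutcome::GaveUp
}
```

With the values from the commit message, a call would look like `try_rebuild(3, Duration::from_secs(30), ...)`, and a `GaveUp` result corresponds to falling back to the normal 55-minute refresh cycle.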
@coderabbitai review
✅ Action performed: Review finished.
/retest-required
/test cargo-fmt-check
/test cluster-logging-operator-e2e
@vparfonov: The following test failed, say
Summary
- The `google-cloud-auth` library's internal `TokenCache` permanently stops its refresh loop on non-transient errors; once the loop is dead, the credentials are poisoned forever.
- `token_regenerator`: on refresh failure, re-create `AccessTokenCredentials` from the original config, verify that the new credentials work, and swap them in via `ArcSwap`.

Root Cause
`google-cloud-auth` v1.8.0 classifies only HTTP 500/503/408/429 as transient. Any other error (e.g. a 403 from STS during kubelet token rotation, or a brief file I/O race) causes the internal `refresh_task` to break permanently. Vector's existing `token_regenerator` called `access_token()` on the same dead cache, so it also failed forever.

Changed
- `Cargo.toml`: add `arc-swap` to the `gcp` feature gate (already a workspace dependency)
- `src/gcp.rs`:
  - `CredentialsState` struct wraps credentials in `Arc<ArcSwap<...>>` (lock-free reads on the per-request hot path) alongside the original path and scopes needed for rebuild
  - `GcpAuthenticator::Credentials` variant now holds `CredentialsState` instead of a bare `Arc<AccessTokenCredentials>`
  - `make_token()` reads credentials via `ArcSwap::load_full()`, so there is no lock contention
  - `rebuild()` runs inside `tokio::task::spawn_blocking` to keep `Mutex` acquisition and file I/O off the async executor
  - `token_regenerator()` detects refresh failure and delegates to `try_rebuild_credentials()`
  - `try_rebuild_credentials()` attempts up to 3 rebuilds at 30s intervals, verifying each with `access_token()` before swapping it in

Fixes: LOG-9470
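The verify-then-swap idea behind `CredentialsState` can be illustrated with a std-only sketch. Everything in it is an assumption made for illustration: `FakeCredentials` is a hypothetical stand-in for `AccessTokenCredentials`, `rebuild_with` is an invented helper, and `RwLock<Arc<_>>` replaces the `ArcSwap` the PR actually uses, so this version is not lock-free.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for real credentials; `healthy == false` models a
/// poisoned token cache that keeps returning its cached error.
struct FakeCredentials {
    generation: u32,
    healthy: bool,
}

impl FakeCredentials {
    fn access_token(&self) -> Result<String, String> {
        if self.healthy {
            Ok(format!("token-gen-{}", self.generation))
        } else {
            Err("cached refresh error".to_string())
        }
    }
}

/// Shared, replaceable credentials slot. The PR uses `Arc<ArcSwap<...>>`;
/// `RwLock<Arc<_>>` is a std-only substitute in this sketch.
#[derive(Clone)]
struct CredentialsState {
    current: Arc<RwLock<Arc<FakeCredentials>>>,
}

impl CredentialsState {
    fn new(creds: FakeCredentials) -> Self {
        Self { current: Arc::new(RwLock::new(Arc::new(creds))) }
    }

    /// Returns a cheap clone of the current credentials handle.
    fn load(&self) -> Arc<FakeCredentials> {
        let guard = self.current.read().unwrap();
        Arc::clone(&guard)
    }

    /// Swaps `candidate` in only if it can actually produce a token,
    /// so a broken rebuild never replaces the existing credentials.
    fn rebuild_with(&self, candidate: FakeCredentials) -> Result<(), String> {
        candidate.access_token()?;
        *self.current.write().unwrap() = Arc::new(candidate);
        Ok(())
    }
}
```

Because every clone of `CredentialsState` shares the same slot, a reader that calls `load()` after a successful `rebuild_with` sees the fresh credentials without being reconstructed itself. Verifying before swapping means a failed rebuild leaves the old credentials in place.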