fix(gcp): rebuild credentials when token cache is poisoned (LOG-9470) #276

Open: vparfonov wants to merge 1 commit into ViaQ:v0.54.0-rh from vparfonov:log9470

vparfonov commented Jun 11, 2026

Summary

  • Fix GCP WIF (Workload Identity Federation) tokens not refreshing after ~1 hour, causing permanent 403 Forbidden errors
  • The google-cloud-auth library's internal TokenCache permanently stops its refresh loop on non-transient errors — once dead, credentials are poisoned forever
  • Add credential rebuild logic in token_regenerator: on refresh failure, re-create AccessTokenCredentials from the original config, verify the new credentials work, and swap them in via ArcSwap

Root Cause

google-cloud-auth v1.8.0 classifies only HTTP 500/503/408/429 as transient. Any other error (e.g. 403 from STS during kubelet token rotation, brief file I/O race) causes the internal refresh_task to break permanently. Vector's existing token_regenerator called access_token() on the same dead cache, so it also failed forever.
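
To make the failure mode concrete, here is a minimal model of the behavior described above. It is a sketch only: `ToyTokenCache` and `CacheError` are hypothetical names invented for illustration and are not types from google-cloud-auth. The status list in `is_transient` is taken from the description above.

```rust
/// Status codes the description above lists as transient for google-cloud-auth v1.8.0.
pub fn is_transient(status: u16) -> bool {
    matches!(status, 408 | 429 | 500 | 503)
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CacheError {
    /// No token has been fetched yet.
    Empty,
    /// The refresh loop stopped after this HTTP status.
    Poisoned(u16),
}

/// Hypothetical stand-in for a token cache whose refresh loop stops for good
/// on the first non-transient error. Not the library's real type.
#[derive(Debug, Default)]
pub struct ToyTokenCache {
    token: Option<String>,
    poisoned_by: Option<u16>,
}

impl ToyTokenCache {
    /// Feed one refresh result into the cache: Ok(token) or Err(http_status).
    pub fn on_refresh(&mut self, result: Result<String, u16>) {
        if self.poisoned_by.is_some() {
            return; // the refresh loop has exited, so nothing updates the cache
        }
        match result {
            Ok(token) => self.token = Some(token),
            Err(status) if is_transient(status) => {} // keep the old token, retry later
            Err(status) => {
                self.token = None;
                self.poisoned_by = Some(status);
            }
        }
    }

    /// Return the cached token, or the reason none is available.
    pub fn access_token(&self) -> Result<&str, CacheError> {
        match (self.poisoned_by, self.token.as_deref()) {
            (Some(status), _) => Err(CacheError::Poisoned(status)),
            (None, Some(token)) => Ok(token),
            (None, None) => Err(CacheError::Empty),
        }
    }
}
```

In this model a 503 leaves the cached token usable, while a single 403 makes every later `access_token()` call fail, even if a later refresh would have succeeded. That is why calling `access_token()` again on the same object cannot recover.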

Changed

Cargo.toml — add arc-swap to the gcp feature gate (already a workspace dependency)
src/gcp.rs:

  • CredentialsState struct wraps credentials in Arc<ArcSwap<...>> (lock-free reads on the per-request hot path) alongside the original path and scopes needed for rebuild
  • GcpAuthenticator::Credentials variant now holds CredentialsState instead of bare Arc<AccessTokenCredentials>
  • make_token() reads credentials via ArcSwap::load_full() — no lock contention
  • rebuild() runs inside tokio::task::spawn_blocking to keep Mutex acquisition and file I/O off the async executor
  • token_regenerator() detects refresh failure and delegates to try_rebuild_credentials()
  • try_rebuild_credentials() attempts up to 3 rebuilds at 30s intervals, verifying each with access_token() before swapping in
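  • Unit tests for credential swapping, visibility of swapped credentials through clones, and rebuild failure on an invalid path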
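
The swap pattern in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/gcp.rs: `FakeCreds` is a hypothetical stand-in for AccessTokenCredentials, and `RwLock<Arc<...>>` is substituted for `ArcSwap` so the example has no external dependencies. The PR itself chooses ArcSwap so that readers never take a lock.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for AccessTokenCredentials.
#[derive(Debug, PartialEq)]
pub struct FakeCreds {
    pub generation: u32,
}

/// Simplified analogue of CredentialsState. The PR wraps credentials in
/// Arc<ArcSwap<...>>; RwLock<Arc<...>> is used here only to stay within std.
#[derive(Clone)]
pub struct CredentialsState {
    creds: Arc<RwLock<Arc<FakeCreds>>>,
    /// Original credentials path, kept so the credentials can be rebuilt.
    pub path: Option<String>,
    /// Original scopes, kept for the same reason.
    pub scopes: Vec<String>,
}

impl CredentialsState {
    pub fn new(creds: FakeCreds, path: Option<String>, scopes: Vec<String>) -> Self {
        Self {
            creds: Arc::new(RwLock::new(Arc::new(creds))),
            path,
            scopes,
        }
    }

    /// Snapshot the current credentials (the role ArcSwap::load_full() plays in the PR).
    pub fn current_creds(&self) -> Arc<FakeCreds> {
        let guard = self.creds.read().expect("credentials lock poisoned");
        Arc::clone(&*guard)
    }

    /// Replace the credentials for every clone of this state.
    pub fn swap_creds(&self, fresh: FakeCreds) {
        let mut guard = self.creds.write().expect("credentials lock poisoned");
        *guard = Arc::new(fresh);
    }
}
```

Because every clone shares the same inner cell, a swap made by the regenerator task is visible to request handlers on their next `current_creds()` call, while a snapshot taken before the swap stays valid until it is dropped.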

Fixes: LOG-9470

Summary by CodeRabbit

  • Refactor

    • Enhanced GCP credential management with improved refresh reliability and automatic retry logic for handling credential renewal failures.
  • Tests

    • Added test coverage for credential state management and validation.

coderabbitai Bot commented Jun 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Enterprise

Run ID: fa472ce8-e5c8-470a-b81c-ac8b9398ad20

📥 Commits

Reviewing files that changed from the base of the PR and between a329118 and 03091a5.

📒 Files selected for processing (2)
  • Cargo.toml
  • src/gcp.rs

📝 Walkthrough

GcpAuthenticator is refactored to store OAuth credentials in an atomic, ArcSwap-backed CredentialsState wrapper. Constructors build this wrapper instead of storing Arc<AccessTokenCredentials> directly. On token refresh failure, the authenticator attempts to rebuild the credentials, retrying at a fixed interval; it verifies each rebuilt set by fetching an access token and swaps it in on success. If every attempt fails, the rebuild is retried on the next refresh cycle. Tests validate credential swapping and rebuild failure modes.

Changes

GCP Atomic Credential Replacement

| Layer | File(s) | Summary |
|---|---|---|
| Dependency, imports, and CredentialsState type | Cargo.toml, src/gcp.rs | Adds arc-swap to the gcp feature; imports ArcSwap; defines CredentialsState struct wrapping AccessTokenCredentials in ArcSwap, tracking optional path and scopes, with methods to load/swap/rebuild credentials and retry constants (REBUILD_RETRY_INTERVAL_SECS, MAX_REBUILD_ATTEMPTS). |
| Constructor and token generation wiring | src/gcp.rs | GcpAuthenticator::from_file and from_adc initialize CredentialsState with converted scopes and optional path; make_token calls current_creds() to retrieve the current access token credentials. |
| Credential refresh loop and rebuild on failure | src/gcp.rs | Token refresh loop now triggers credential rebuild via try_rebuild_credentials when access-token refresh fails. Helper retries rebuild up to MAX_REBUILD_ATTEMPTS with REBUILD_RETRY_INTERVAL_SECS backoff, verifies by fetching an access token, swaps successful credentials, notifies watchers, and logs on exhausted retries. |
| CredentialsState tests and helpers | src/gcp.rs | Adds async tests validating swap_creds updates current_creds, clones observe swapped credentials, rebuild errors on invalid path; includes test helpers to construct fake external-account JSON and CredentialsState instances. |
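
The rebuild-and-verify loop summarized in the table might look roughly like the following. This is a hedged sketch, not the PR's `try_rebuild_credentials`: the constant names come from the walkthrough and their values (3 attempts, 30 seconds) from the PR description, but their types, the `try_rebuild` signature, and the injected `pause` callback are assumptions made so the example stays synchronous and std-only. The real code runs on Tokio and also notifies watchers and logs.

```rust
use std::time::Duration;

pub const MAX_REBUILD_ATTEMPTS: u32 = 3;
pub const REBUILD_RETRY_INTERVAL_SECS: u64 = 30;

/// Try to rebuild credentials and verify them, pausing between failed attempts.
///
/// `rebuild` creates a candidate, `verify` checks that the candidate can
/// obtain a token, and `pause` waits between attempts. All three are
/// hypothetical hooks for this sketch. The caller swaps the returned
/// candidate in only when this function returns Ok.
pub fn try_rebuild<C, E>(
    mut rebuild: impl FnMut() -> Result<C, E>,
    mut verify: impl FnMut(&C) -> Result<(), E>,
    mut pause: impl FnMut(Duration),
) -> Result<C, E> {
    let mut attempt = 1;
    loop {
        // A candidate counts as good only if it was built AND verified.
        let outcome = rebuild().and_then(|candidate| {
            verify(&candidate)?;
            Ok(candidate)
        });
        match outcome {
            Ok(candidate) => return Ok(candidate),
            // Out of attempts: report the last error to the caller.
            Err(err) if attempt >= MAX_REBUILD_ATTEMPTS => return Err(err),
            Err(_) => {
                pause(Duration::from_secs(REBUILD_RETRY_INTERVAL_SECS));
                attempt += 1;
            }
        }
    }
}
```

Verifying before swapping means a rebuilt but still-broken credential set never replaces the current one. In this sketch there is no pause after the final failed attempt; whether the PR behaves the same way is not stated in the conversation.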

🎯 3 (Moderate) | ⏱️ ~25 minutes


🐰 A rabbit hops through the vault so green,
With swap and arc-swap, credentials unseen,
Atomic refreshes, no race in the night,
GCP tokens now danced just right! 🔄✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 26.32% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: rebuilding GCP credentials when token cache becomes poisoned, which is the core fix addressing the 403 Forbidden errors. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@jcantrill

Member

/approve

openshift-ci Bot commented Jun 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcantrill, vparfonov


…soning

The google-cloud-auth library's internal TokenCache permanently stops its
refresh loop on non-transient errors (e.g. 403, file I/O race during
kubelet token rotation). Once stopped, all subsequent access_token() calls
return the cached error forever — credentials are permanently poisoned.

Add credential rebuild logic: when the token refresh fails, Vector now
re-creates the AccessTokenCredentials from the original config (path +
scopes), verifies the new credentials can obtain a token, and swaps them
in via ArcSwap. This gives the system a fresh TokenCache with a healthy
refresh loop. Up to 3 rebuild attempts are made at 30-second intervals
before falling back to the normal 55-minute refresh cycle.

The rebuild runs on spawn_blocking to keep the Mutex acquisition and
file I/O off the async executor. Token reads use lock-free ArcSwap
instead of RwLock to avoid contention on the per-request hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vparfonov

Author

@coderabbitai review

coderabbitai Bot commented Jun 11, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@vparfonov

Author

/retest-required

@vparfonov

Author

/test cargo-fmt-check

@vparfonov

Author

/test cluster-logging-operator-e2e

openshift-ci Bot commented Jun 12, 2026

@vparfonov: The following test failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/cluster-logging-operator-e2e | 03091a5 | link | true | /test cluster-logging-operator-e2e |

