PAT lifecycle: rotator can deadlock on idle skip; proxy token cache can serve revoked tokens

Two related stale-token bugs found while end-to-end testing #PR_TBD. Both are latent in production with the default `900 / 600` rotation params (manifest rarely or briefly) but cleanly reproducible by compressing the rotation cycle via the new `PAT_TOKEN_LIFETIME` / `PAT_ROTATION_INTERVAL` env knobs introduced in that PR.

## Bug 1 — Rotator deadlock when idle-skip outruns token lifetime

`pat_rotator.PATRotator._rotation_loop()` skips rotation when `session_count == 0`. The skip does **not** update `_last_rotation_time` and does **not** check whether the in-process token is near expiry. If the token expires while sessions are absent, the next non-skipped rotation tries to mint a new token using a dead token — `/api/2.0/token/create` returns 403 — and the rotator is permanently stuck:

\`\`\`
13:14:06 first rotation — T1 minted (lifetime 90s)
13:15:06 first scheduled rotation — \"no active sessions — skipping\"
13:15:36 T1 expires
13:16:06 second scheduled rotation — POST /token/create with revoked T1 → 403
13:16:51 same
13:17:36 same  ← deadlock from this point
\`\`\`

In production the lifetime/interval gap is 300s, so the skip window has to exceed ~5min to deadlock. Still possible; also easy to misconfigure via the new env knobs.

**Fix sketch**
- When skipping due to no sessions, check whether the token's remaining lifetime is < `min(rotation_interval, 60s)` and force a rotation anyway, OR
- On every successful rotation, record the absolute expiry timestamp; if the next loop tick finds expiry imminent regardless of sessions, rotate.

## Bug 2 — Proxy token cache can serve revoked tokens

`content_filter_proxy._get_fresh_token()` caches the parsed `~/.databrickscfg` token for `_TOKEN_CACHE_TTL = 30` seconds. The cache is purely time-based — it doesn't notice when the file is rewritten by the rotator. After every rotation there is a window of up to 30s where the proxy will forward the now-revoked token to upstream, producing 403s.

In production (rotation interval 600s) that's ~5% of the time. Concretely visible during the test that uncovered the original Hermes bug.

**Fix sketch**

Stat the cfg file's mtime before deciding to use the cache. If the mtime is newer than the cache's `read_at`, re-parse. Single `os.stat` call per request — orders of magnitude cheaper than the actual HTTP forward, and the cache stays useful for de-duplicating multi-stream chunks within the same request burst.

\`\`\`python
def _get_fresh_token():
    now = time.time()
    try:
        mtime = os.stat(_DATABRICKSCFG_PATH).st_mtime
    except OSError:
        mtime = 0
    cache = _TOKEN_CACHE
    if cache[\"token\"] and mtime <= cache.get(\"mtime\", 0) and (now - cache[\"read_at\"]) < _TOKEN_CACHE_TTL:
        return cache[\"token\"]
    # ... re-read file, store both read_at and mtime
\`\`\`

## Scope

Filing as one issue because they're the same family — \"how PAT lifecycle survives transitions between processes that hold cached views of the token\". A small hotfix PR will follow shortly addressing both.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PAT lifecycle: rotator can deadlock on idle skip; proxy token cache can serve revoked tokens #52

Bug 1 — Rotator deadlock when idle-skip outruns token lifetime

Bug 2 — Proxy token cache can serve revoked tokens

Scope

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

PAT lifecycle: rotator can deadlock on idle skip; proxy token cache can serve revoked tokens #52

Description

Bug 1 — Rotator deadlock when idle-skip outruns token lifetime

Bug 2 — Proxy token cache can serve revoked tokens

Scope

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions