Two related stale-token bugs found while end-to-end testing #PR_TBD. Both are latent in production with the default 900 / 600 rotation params (manifest rarely or briefly) but cleanly reproducible by compressing the rotation cycle via the new PAT_TOKEN_LIFETIME / PAT_ROTATION_INTERVAL env knobs introduced in that PR.
Bug 1 — Rotator deadlock when idle-skip outruns token lifetime
pat_rotator.PATRotator._rotation_loop() skips rotation when session_count == 0. The skip does not update _last_rotation_time and does not check whether the in-process token is near expiry. If the token expires while sessions are absent, the next non-skipped rotation tries to mint a new token using a dead token — /api/2.0/token/create returns 403 — and the rotator is permanently stuck:
```
13:14:06 first rotation — T1 minted (lifetime 90s)
13:15:06 first scheduled rotation — "no active sessions — skipping"
13:15:36 T1 expires
13:16:06 second scheduled rotation — POST /token/create with revoked T1 → 403
13:16:51 same
13:17:36 same ← deadlock from this point
```
In production the lifetime/interval gap is 300s, so the skip window has to exceed ~5min to deadlock. Still possible; also easy to misconfigure via the new env knobs.
Fix sketch
- When skipping due to no sessions, check whether the token's remaining lifetime is <
min(rotation_interval, 60s) and force a rotation anyway, OR
- On every successful rotation, record the absolute expiry timestamp; if the next loop tick finds expiry imminent regardless of sessions, rotate.
Bug 2 — Proxy token cache can serve revoked tokens
content_filter_proxy._get_fresh_token() caches the parsed ~/.databrickscfg token for _TOKEN_CACHE_TTL = 30 seconds. The cache is purely time-based — it doesn't notice when the file is rewritten by the rotator. After every rotation there is a window of up to 30s where the proxy will forward the now-revoked token to upstream, producing 403s.
In production (rotation interval 600s) that's ~5% of the time. Concretely visible during the test that uncovered the original Hermes bug.
Fix sketch
Stat the cfg file's mtime before deciding to use the cache. If the mtime is newer than the cache's read_at, re-parse. Single os.stat call per request — orders of magnitude cheaper than the actual HTTP forward, and the cache stays useful for de-duplicating multi-stream chunks within the same request burst.
```python
def _get_fresh_token():
now = time.time()
try:
mtime = os.stat(_DATABRICKSCFG_PATH).st_mtime
except OSError:
mtime = 0
cache = _TOKEN_CACHE
if cache["token"] and mtime <= cache.get("mtime", 0) and (now - cache["read_at"]) < _TOKEN_CACHE_TTL:
return cache["token"]
# ... re-read file, store both read_at and mtime
```
Scope
Filing as one issue because they're the same family — "how PAT lifecycle survives transitions between processes that hold cached views of the token". A small hotfix PR will follow shortly addressing both.
Two related stale-token bugs found while end-to-end testing #PR_TBD. Both are latent in production with the default
900 / 600rotation params (manifest rarely or briefly) but cleanly reproducible by compressing the rotation cycle via the newPAT_TOKEN_LIFETIME/PAT_ROTATION_INTERVALenv knobs introduced in that PR.Bug 1 — Rotator deadlock when idle-skip outruns token lifetime
pat_rotator.PATRotator._rotation_loop()skips rotation whensession_count == 0. The skip does not update_last_rotation_timeand does not check whether the in-process token is near expiry. If the token expires while sessions are absent, the next non-skipped rotation tries to mint a new token using a dead token —/api/2.0/token/createreturns 403 — and the rotator is permanently stuck:```
13:14:06 first rotation — T1 minted (lifetime 90s)
13:15:06 first scheduled rotation — "no active sessions — skipping"
13:15:36 T1 expires
13:16:06 second scheduled rotation — POST /token/create with revoked T1 → 403
13:16:51 same
13:17:36 same ← deadlock from this point
```
In production the lifetime/interval gap is 300s, so the skip window has to exceed ~5min to deadlock. Still possible; also easy to misconfigure via the new env knobs.
Fix sketch
min(rotation_interval, 60s)and force a rotation anyway, ORBug 2 — Proxy token cache can serve revoked tokens
content_filter_proxy._get_fresh_token()caches the parsed~/.databrickscfgtoken for_TOKEN_CACHE_TTL = 30seconds. The cache is purely time-based — it doesn't notice when the file is rewritten by the rotator. After every rotation there is a window of up to 30s where the proxy will forward the now-revoked token to upstream, producing 403s.In production (rotation interval 600s) that's ~5% of the time. Concretely visible during the test that uncovered the original Hermes bug.
Fix sketch
Stat the cfg file's mtime before deciding to use the cache. If the mtime is newer than the cache's
read_at, re-parse. Singleos.statcall per request — orders of magnitude cheaper than the actual HTTP forward, and the cache stays useful for de-duplicating multi-stream chunks within the same request burst.```python
def _get_fresh_token():
now = time.time()
try:
mtime = os.stat(_DATABRICKSCFG_PATH).st_mtime
except OSError:
mtime = 0
cache = _TOKEN_CACHE
if cache["token"] and mtime <= cache.get("mtime", 0) and (now - cache["read_at"]) < _TOKEN_CACHE_TTL:
return cache["token"]
# ... re-read file, store both read_at and mtime
```
Scope
Filing as one issue because they're the same family — "how PAT lifecycle survives transitions between processes that hold cached views of the token". A small hotfix PR will follow shortly addressing both.