Skip to content

PAT lifecycle: rotator can deadlock on idle skip; proxy token cache can serve revoked tokens #52

@dgokeeffe

Description

@dgokeeffe

Two related stale-token bugs found while end-to-end testing #PR_TBD. Both are latent in production with the default 900 / 600 rotation params (manifest rarely or briefly) but cleanly reproducible by compressing the rotation cycle via the new PAT_TOKEN_LIFETIME / PAT_ROTATION_INTERVAL env knobs introduced in that PR.

Bug 1 — Rotator deadlock when idle-skip outruns token lifetime

pat_rotator.PATRotator._rotation_loop() skips rotation when session_count == 0. The skip does not update _last_rotation_time and does not check whether the in-process token is near expiry. If the token expires while sessions are absent, the next non-skipped rotation tries to mint a new token using a dead token — /api/2.0/token/create returns 403 — and the rotator is permanently stuck:

```
13:14:06 first rotation — T1 minted (lifetime 90s)
13:15:06 first scheduled rotation — "no active sessions — skipping"
13:15:36 T1 expires
13:16:06 second scheduled rotation — POST /token/create with revoked T1 → 403
13:16:51 same
13:17:36 same ← deadlock from this point
```

In production the lifetime/interval gap is 300s, so the skip window has to exceed ~5min to deadlock. Still possible; also easy to misconfigure via the new env knobs.

Fix sketch

  • When skipping due to no sessions, check whether the token's remaining lifetime is < min(rotation_interval, 60s) and force a rotation anyway, OR
  • On every successful rotation, record the absolute expiry timestamp; if the next loop tick finds expiry imminent regardless of sessions, rotate.

Bug 2 — Proxy token cache can serve revoked tokens

content_filter_proxy._get_fresh_token() caches the parsed ~/.databrickscfg token for _TOKEN_CACHE_TTL = 30 seconds. The cache is purely time-based — it doesn't notice when the file is rewritten by the rotator. After every rotation there is a window of up to 30s where the proxy will forward the now-revoked token to upstream, producing 403s.

In production (rotation interval 600s) that's ~5% of the time. Concretely visible during the test that uncovered the original Hermes bug.

Fix sketch

Stat the cfg file's mtime before deciding to use the cache. If the mtime is newer than the cache's read_at, re-parse. Single os.stat call per request — orders of magnitude cheaper than the actual HTTP forward, and the cache stays useful for de-duplicating multi-stream chunks within the same request burst.

```python
def _get_fresh_token():
now = time.time()
try:
mtime = os.stat(_DATABRICKSCFG_PATH).st_mtime
except OSError:
mtime = 0
cache = _TOKEN_CACHE
if cache["token"] and mtime <= cache.get("mtime", 0) and (now - cache["read_at"]) < _TOKEN_CACHE_TTL:
return cache["token"]
# ... re-read file, store both read_at and mtime
```

Scope

Filing as one issue because they're the same family — "how PAT lifecycle survives transitions between processes that hold cached views of the token". A small hotfix PR will follow shortly addressing both.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions