fix(hermes): route via content-filter proxy; close stale-token windows in proxy + rotator #53

Open
dgokeeffe wants to merge 2 commits into main from fix/hermes-route-via-proxy

@dgokeeffe
Collaborator

Summary

  • Hermes was the only CLI in CoDA holding a frozen api_key in memory across PAT rotations. Routing it via the local content-filter proxy (port 4000) lets the proxy inject a fresh token from ~/.databrickscfg per request, the same trick OpenCode already relies on.
  • Testing surfaced two related stale-token windows, filed as #52 ("PAT lifecycle: rotator can deadlock on idle skip; proxy token cache can serve revoked tokens"): the proxy's 30s token cache served revoked tokens after each rotation, and the rotator could deadlock when idle skips outran the in-process token's lifetime. Both are fixed here too.

Changes

| File | Change |
| --- | --- |
| setup_hermes.py | base_url: http://127.0.0.1:4000 instead of the gateway/serving endpoint. Diagnostic banner updated. |
| pat_rotator.py | PAT_ROTATION_INTERVAL / PAT_TOKEN_LIFETIME env knobs (prod defaults unchanged). Idle-skip now still rotates when the in-process token is within one rotation interval of expiry, which prevents the deadlock. |
| content_filter_proxy.py | _get_fresh_token consults file mtime, invalidating the cache the instant the rotator rewrites ~/.databrickscfg. 30s TTL kept as a backstop. |
| cli_auth.py | Unchanged; _update_hermes was already in place as defence-in-depth. |
| tests/test_cli_token_rotation.py | New TestUpdateHermes (3 cases); omnibus test bumped from "four" to "five". |
| tests/test_pat_rotator.py | New TestRotationOnNearExpiry (2 cases). |
| tests/test_content_filter_proxy.py | New file: 4 cases for mtime-based cache invalidation, stat fallback, missing file. |
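
For reviewers who want the shape of the content_filter_proxy.py change without opening the diff, here is a minimal sketch of mtime-keyed caching with a TTL backstop. It is an illustration under assumptions, not the actual _get_fresh_token: the TokenCache class, its attribute names, and the use of the DEFAULT profile are all hypothetical.

```python
import configparser
import os
import time

CACHE_TTL_SECONDS = 30.0  # backstop TTL mentioned in the PR description


class TokenCache:
    """Hypothetical cache: re-reads the cfg file when its mtime changes."""

    def __init__(self, cfg_path, ttl=CACHE_TTL_SECONDS):
        self.cfg_path = cfg_path
        self.ttl = ttl
        self._token = None
        self._read_at = 0.0
        self._mtime = None

    def _stat_mtime(self):
        try:
            return os.stat(self.cfg_path).st_mtime_ns
        except OSError:
            return None  # stat failed: fall back to TTL-only behaviour

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        mtime = self._stat_mtime()
        fresh_by_ttl = self._token is not None and (now - self._read_at) < self.ttl
        unchanged = mtime is None or mtime == self._mtime
        if fresh_by_ttl and unchanged:
            return self._token
        # Cache miss: the file was rewritten, the TTL lapsed, or nothing is cached.
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(self.cfg_path)  # a missing file simply yields no values
        token = parser.get("DEFAULT", "token", fallback=None)
        self._token, self._read_at, self._mtime = token, now, mtime
        return token
```

The stat call runs on every lookup, so a rewrite of the file is noticed on the very next request, while the parse only happens on a miss.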

Test plan

  • Unit: 37 / 37 pass (pytest tests/test_pat_rotator.py tests/test_cli_token_rotation.py tests/test_content_filter_proxy.py).
  • E2E on daveok (pre-hotfix): minted PAT, opened hermes chat, sent prompt #1 ("hello-one"), waited 90s during which two rotations fired (per app logs: "Old token ELIMINATED"), sent prompt #2 ("hello-two") in the same long-running Hermes process. Both turns succeeded; the second would have 403'd against the unfixed code.
  • E2E on daveok (post-hotfix): same flow with "hotfix-one" / "hotfix-two" — both succeeded, no regression.
  • Verified via grep inside the container that ~/.hermes/config.yaml now points base_url at http://127.0.0.1:4000.

Closes #52.

This pull request and its description were written by Isaac.

dgokeeffe added 2 commits May 19, 2026 10:11
Hermes is the only long-running interactive CLI in CoDA — Claude/Codex/Gemini
are re-spawned per turn so they re-read their token from disk on every call,
and OpenCode already routes through the local content-filter proxy. Hermes
loaded ~/.hermes/config.yaml once at startup, cached the api_key in memory,
and silently 403'd on its next turn after the in-process PAT rotator swapped
tokens (#issue).

setup_hermes.py now points base_url at http://127.0.0.1:4000 instead of the
gateway/serving endpoint. The proxy reads ~/.databrickscfg per request (with a
short cache) and injects a fresh Bearer, so Hermes' stale in-memory api_key
is overridden transparently — same trick OpenCode uses.
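
The override step can be sketched as below. This is a simplified illustration, not the proxy's real code: with_fresh_bearer and the get_token callable are hypothetical names, and the real proxy may handle headers differently.

```python
from typing import Callable, Dict, Optional


def with_fresh_bearer(
    headers: Dict[str, str],
    get_token: Callable[[], Optional[str]],
) -> Dict[str, str]:
    """Return a copy of headers whose Authorization carries the current token.

    get_token is assumed to return the token currently on disk, or None
    if it cannot be read.
    """
    token = get_token()
    if token is None:
        # Nothing fresher is available: forward the request unchanged.
        return dict(headers)
    # Drop whatever Authorization the client sent (header names are
    # case-insensitive), then inject the fresh one.
    out = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    out["Authorization"] = f"Bearer {token}"
    return out
```

Because the replacement happens per request, the api_key that Hermes cached at startup never reaches the upstream endpoint.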

Also adds PAT_ROTATION_INTERVAL / PAT_TOKEN_LIFETIME env knobs to pat_rotator
so e2e tests can compress the rotation cycle without a code change. Prod
defaults (900s/600s) unchanged.
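
One way such env knobs could be read is sketched here. The env_seconds helper is hypothetical and the fallback behaviour on bad input is an assumption; pat_rotator.py may parse these variables differently.

```python
import os


def env_seconds(name, default, environ=None):
    """Read a duration in seconds from the environment, else use default.

    Hypothetical helper: unset, empty, unparsable, or non-positive values
    all fall back to the default so a typo cannot disable rotation.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return float(default)
    try:
        value = float(raw)
    except ValueError:
        return float(default)
    return value if value > 0 else float(default)
```

An e2e run would then export short values for PAT_ROTATION_INTERVAL and PAT_TOKEN_LIFETIME before starting the app, while an unset environment keeps the production defaults.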

Also closes a test-coverage gap: test_cli_token_rotation.py was missing a
TestUpdateHermes class and the omnibus test was literally named "four". Added
the class and bumped to "five".

Verified end-to-end on daveok: minted PAT, opened hermes chat, sent prompt,
waited 90s for two rotations (old tokens ELIMINATED per app logs), sent a
second prompt in the same long-running Hermes process — second turn succeeded
where it would have 403'd against the unfixed code.

Co-authored-by: Isaac
…r skip

Two related bugs surfaced while end-to-end testing the Hermes-routing fix
(GH-52):

1. **Proxy cache could serve revoked tokens for up to 30s after rotation.**
   content_filter_proxy._get_fresh_token cached the parsed ~/.databrickscfg
   value purely on read time, so a rotation that rewrote the file mid-cache
   was invisible until TTL expired. In prod (10-min rotation) that's a ~5%
   failure window per cycle. Now stats the cfg file's mtime and invalidates
   the cache the instant mtime advances; the TTL stays as a backstop for
   weird filesystem behaviour.

2. **Rotator deadlocked if an idle skip outran the token's lifetime.** The
   _rotation_loop skipped when session_count == 0, without checking whether
   the in-process token was approaching expiry. If it expired during the
   skip, the next attempt to mint a replacement 403'd (we'd authenticate
   with a dead token) and the rotator was stuck forever. Now still skips
   when idle but only while the token has > one rotation_interval of life
   left; otherwise rotates anyway to keep our own auth alive.
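
The idle-skip rule from item 2 reduces to a small predicate, sketched here. The function name and parameters are hypothetical and the real _rotation_loop is structured differently; only the decision is shown.

```python
def should_rotate(session_count, token_expires_at, now, rotation_interval):
    """Decide whether this loop iteration should mint a new token.

    All times are in seconds on the same clock.
    """
    if session_count > 0:
        return True  # active sessions: rotate on schedule as usual
    remaining = token_expires_at - now
    # Idle: skip only while more than one full interval of life remains.
    # The next check is at most one interval away, so the token we
    # authenticate with cannot expire before we get another chance.
    return remaining <= rotation_interval
```

For example, with a 600s interval an idle rotator holding a token with 1000s left would skip, and one holding a token with 500s left would rotate anyway.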

Tests added for both: tests/test_content_filter_proxy.py (4 cases) plus
TestRotationOnNearExpiry in test_pat_rotator.py (2 cases). All 37 tests in
the PAT-lifecycle group pass.

Closes #52

Co-authored-by: Isaac