feat(auth): Phase 2 — AWS Sigv4, GCP IAP, Azure AD providers (v0.11.0)#79

Open
naveen-kurra wants to merge 12 commits into initializ:main from naveen-kurra:pr2-cloud-native-auth-providers

naveen-kurra commented May 24, 2026

Summary

Phase 2 of the pluggable auth provider work — three cloud-native providers
on top of the Phase 1 foundation (#77 / 7998f12). Customers authenticate
to Forge using identities they already have in their cloud; no parallel
IdP required.

| Provider | What it does | Client token format |
|---|---|---|
| aws_sigv4 | Caller mints a pre-signed STS GetCallerIdentity URL with their AWS SDK; Forge invokes it. STS returns the caller's canonical ARN. Same pattern as aws-iam-authenticator (EKS). | Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url> |
| gcp_iap | Verify the JWT GCP IAP forwards as X-Goog-Iap-Jwt-Assertion when Forge sits behind a GCP HTTPS LB + IAP. | X-Goog-Iap-Jwt-Assertion: <jwt> |
| azure_ad | Verify Microsoft Entra ID Bearer tokens with tenant lock-in and optional Microsoft Graph group enrichment. Composes the Phase 1 oidc provider. | Authorization: Bearer <aad-jwt> |

Forge never holds any IdP secrets — all three providers verify a caller-
minted credential against a third party (STS / GCP JWKS / AAD JWKS).

Why this matters

Today, putting Forge behind any of the three big cloud IdPs requires
standing up a parallel OIDC issuer (Cognito for AWS, Workspace SAML, etc.).
This PR removes that friction:

  • AWS: any Lambda / EC2 / EKS / IAM user calls Forge using their
    existing IAM credentials. Zero secrets stored on Forge, no token endpoint
    to host.
  • GCP: Forge sitting behind GCP LB+IAP consumes IAP's signed user
    assertion directly.
  • Azure: Entra tokens (humans, CI, Azure workloads) work natively with
    AAD-specific quirks (tenant gate, groups overage) handled correctly.

Design pivot during PR review — aws_sigv4 switched to pre-signed URL pattern

The PR went through an in-flight design correction caught by real-AWS
smoke testing. The TL;DR:

What was wrong (original design — commits 9a1ebae through 382294e)

The first design had clients sign their POST to Forge using AWS Sigv4
("header reflection"). Forge would forward the signed headers to STS,
expecting STS to validate them.

This is broken in a deterministic way: Sigv4 binds its signature to
the destination host as part of the canonicalized signing input.
Headers signed for Forge's hostname can't be replayed against STS.
STS canonicalizes host: sts.<region>.amazonaws.com, recomputes the
signature, gets a different hash, and rejects with
SignatureDoesNotMatch.
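
The host binding can be reproduced with nothing but the standard library. The sketch below is a simplified SigV4 header signer (empty-body POST to /, signing only host and x-amz-date). The secret key, timestamp, and the Forge hostname forge.example.com are made-up placeholders, not values from this PR. It shows that the same credentials yield different signatures for different hosts, which is why reflected headers cannot validate at STS.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def sigv4_signature(secret_key: str, host: str, amz_date: str,
                    region: str, service: str) -> str:
    """Sign an empty-body POST to "/" whose signed headers are host and x-amz-date."""
    date = amz_date[:8]
    canonical_request = "\n".join([
        "POST",                                    # HTTP method
        "/",                                       # canonical URI
        "",                                        # canonical query string
        f"host:{host}\nx-amz-date:{amz_date}\n",   # canonical headers
        "host;x-amz-date",                         # signed header names
        hashlib.sha256(b"").hexdigest(),           # hash of the empty payload
    ])
    scope = f"{date}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    key = _hmac(("AWS4" + secret_key).encode(), date)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    return hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()


# Placeholder inputs, identical for both calls except the host.
args = dict(secret_key="not-a-real-secret", amz_date="20260524T120000Z",
            region="us-east-1", service="sts")
signed_for_forge = sigv4_signature(host="forge.example.com", **args)
signed_for_sts = sigv4_signature(host="sts.us-east-1.amazonaws.com", **args)
print(signed_for_forge != signed_for_sts)  # the host alone changes the signature
```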

Standard tools (awscurl, boto3.client('sts'), all AWS SDKs) always
sign for the URL they're calling, so there was no working client path.

What's correct (current design — commit 8568535)

Switched to the pre-signed URL pattern that aws-iam-authenticator
uses for EKS:

  1. Caller's AWS SDK mints a pre-signed GetCallerIdentity URL — signature
    embedded in query params, signed for STS's host.
  2. Caller base64-encodes the URL, prepends forge-aws-v1., sends as a
    standard Authorization: Bearer … header.
  3. Forge decodes, validates the URL host matches sts.<region>.amazonaws.com
    (SSRF guard), GETs the URL, parses the XML response, stamps Identity.
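
Steps 2 and 3 can be sketched in a few lines of Python. This is an illustration of the token format and host check only, not Forge's Go implementation: the names encode_token, decode_token, and TokenError are hypothetical, and the https scheme check is an added assumption beyond the host match described above.

```python
import base64
from urllib.parse import urlsplit

TOKEN_PREFIX = "forge-aws-v1."


class TokenError(Exception):
    """Carries an audit-style reason: not_for_me, invalid, or rejected."""


def encode_token(presigned_url: str) -> str:
    """Client side: unpadded URL-safe base64 of the URL, behind the prefix."""
    payload = base64.urlsafe_b64encode(presigned_url.encode()).rstrip(b"=").decode()
    return TOKEN_PREFIX + payload


def decode_token(token: str, region: str) -> str:
    """Server side: return the raw pre-signed URL, or raise before any network I/O."""
    if not token.startswith(TOKEN_PREFIX):
        raise TokenError("not_for_me")
    payload = token[len(TOKEN_PREFIX):]
    try:
        padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
        raw_url = base64.urlsafe_b64decode(padded).decode("ascii")
        parts = urlsplit(raw_url)
    except ValueError as exc:  # bad base64, non-ASCII bytes, or an unparseable URL
        raise TokenError("invalid") from exc
    # SSRF guard: exact match on the authority, so ports, userinfo, and
    # look-alike hosts are refused. The https check is an assumption.
    if parts.scheme != "https" or parts.netloc != f"sts.{region}.amazonaws.com":
        raise TokenError("rejected")
    return raw_url  # returned unmodified; never rebuilt from the parsed parts
```

A token whose decoded URL is https://sts.us-east-1.amazonaws.com@evil.example.com/ fails the check because the whole authority is compared, not just a prefix.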

Why this is the right design (not just a fix)

Three design properties land cleanly:

  1. Client UX matches OIDC/JWT/Azure AD. aws_sigv4 callers now have
    the same experience as oidc and azure_ad callers: mint a Bearer
    token, attach it, send. (gcp_iap is the exception, since IAP itself
    injects X-Goog-Iap-Jwt-Assertion.) Three lines of Python, or the
    equivalent in any language with an AWS SDK. No custom signing
    logic, no header manipulation, no per-call procedure.
  2. No re-engineering of standard tools. aws-iam-authenticator-style
    token format works with any AWS SDK's SigV4QueryAuth API.
  3. Server is simpler. Forge doesn't reflect headers; it just GETs a
    URL. ~80 LOC of STS client code instead of header-forwarding plumbing.

Locked design decisions (unchanged through the pivot)

  • §9.1: No aws-sdk-go-v2 dependency. STS client is hand-rolled HTTP + XML.
  • §9.2: azure_ad composes Phase 1 oidc.Provider; no JWT/JWKS code in azure_ad/.
  • §9.3: allowed_principals uses path.Match shell globs (no regex).
  • §9.4: IAP issuer + JWKS URL hardcoded; no operator-overridable knob.
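
For decision §9.3, the property that matters is that shell globs in the style of Go's path.Match never let * or ? cross a / separator. The Python sketch below approximates only that subset (character classes and escapes are deliberately omitted, the function names are hypothetical, and the ARN is a made-up example). It treats an empty allowlist as accept-all, matching the empty-allowlist scenario in the live validation.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob into a regex in which * and ? never match '/'."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append("[^/]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def arn_allowed(arn: str, allowed_principals: list[str]) -> bool:
    """True when the allowlist is empty or any pattern matches the whole ARN."""
    if not allowed_principals:
        return True
    return any(glob_to_regex(p).fullmatch(arn) for p in allowed_principals)


arn = "arn:aws:sts::123456789012:assumed-role/Deployer/ci-session"
print(arn_allowed(arn, ["arn:aws:sts::123456789012:assumed-role/Deployer/*"]))
print(arn_allowed(arn, ["arn:aws:sts::123456789012:assumed-role/*"]))
```

The second pattern does not match: a single * stops at the / between role name and session name, so assumed-role ARNs need one wildcard per path segment.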

What clients write (the actual 3 lines)

import base64, boto3, requests
from botocore.auth import SigV4QueryAuth
from botocore.awsrequest import AWSRequest

creds = boto3.Session().get_credentials().get_frozen_credentials()
req = AWSRequest(method='GET',
                 url='https://sts.us-east-1.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15')
SigV4QueryAuth(creds, 'sts', 'us-east-1', expires=900).add_auth(req)
token = 'forge-aws-v1.' + base64.urlsafe_b64encode(req.url.encode()).rstrip(b'=').decode()

# forge_url and msg are supplied by the caller: the Forge endpoint and the request body.
requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, data=msg)

A reference client lives at scripts/forge-aws-sign.py (~80 LOC; CLI flags
for --region, --profile, --token-only, etc.).

Note on boto3: the high-level boto3.client('sts').generate_presigned_url('get_caller_identity', ...) returns a URL STS rejects with SignatureDoesNotMatch even on direct curl — known boto3 limitation (it signs as if the request were a POST). The reference client uses the lower-level SigV4QueryAuth API directly, same as aws-iam-authenticator does internally.

✅ Real-AWS validation (live AWS account)

End-to-end validated against real AWS STS using SSO assumed-role
credentials in account 412664885516. Forge instance built from
commit 8568535, running locally.

| # | Scenario | Expected | Actual | Validates |
|---|---|---|---|---|
| 1 | curl with no auth headers | 401, missing_token | ✅ 401 + valid bearer token required | Phase 1 middleware enforces auth; review #4 reason-code preserved |
| 2 | Pre-signed URL token, empty allowlist | 200 / handler-error, auth_verify fires | ✅ HTTP 400 (body shape; auth path complete), auth_verify emitted | STS reflection works end-to-end against real STS |
| 3 | Same token, matching allowed_principals | 200 / handler-error (auth pass) | ✅ HTTP 400, auth_verify fires | ArnMatcher accepts matching ARNs |
| 4 | Same token, wrong-account allowed_principals | 401, rejected | ✅ 401 + token rejected by auth provider | ArnMatcher correctly denies non-matching ARNs |

Audit log emitted (Test #2 success path)

{
  "event":"auth_verify",
  "fields":{
    "method":"POST",
    "path":"/tasks/send",
    "provider":"aws_sigv4",
    "user_id":"arn:aws:sts::412664885516:assumed-role/AWSReservedSSO_PowerUserAccess_c794d5f2c2fe4370/Naveen",
    "org_id":"412664885516",
    "token_kind":"sigv4",
    "groups_count":0,
    "remote_addr":"[::1]:62448"
  }
}

Every field is correct: provider matches the configured name, user_id is the STS-returned canonical ARN (assumed-role form, including session name), org_id is the AWS account number, and token_kind is sigv4, set by the new forge-aws-v1. prefix detection.

Two bugs caught during live testing (both fixed in 8568535)

  1. URL re-encoding through net/url. Round-tripping the presigned URL through Go's *url.URL.String() re-encoded query params in ways that differed from how the AWS SDK emitted them (e.g., / in X-Amz-Credential, + inside X-Amz-Security-Token). STS recomputed the canonical request from those re-encoded bytes and rejected. Fix: PresignedToken keeps a RawURL string field with the byte-for-byte original; the parsed *url.URL is used only for SSRF host validation and query-param inspection.
  2. boto3's generate_presigned_url quirk. The reference client originally used boto3.client('sts').generate_presigned_url('get_caller_identity', …), which produces a URL STS rejects with SignatureDoesNotMatch (known boto3 quirk — signs as if POST). Fix: the reference client now uses the lower-level SigV4QueryAuth.add_auth() directly, same pattern aws-iam-authenticator uses.
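
The first bug is easy to reproduce in miniature. The sketch below uses Python's urllib rather than Go's net/url, so the exact characters that change differ from what Forge hit, and the URL is a made-up example rather than a real pre-signed one. It illustrates why a parsed URL is safe for inspection but not for rebuilding the outbound request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical pre-signed URL with an unescaped "/" in X-Amz-Credential.
RAW_URL = (
    "https://sts.us-east-1.amazonaws.com/?Action=GetCallerIdentity"
    "&X-Amz-Credential=AKIDEXAMPLE/20260524/us-east-1/sts/aws4_request"
    "&X-Amz-Signature=deadbeef"
)


def rebuild(url: str) -> str:
    """Parse the query into pairs and serialize it again (the lossy path)."""
    parts = urlsplit(url)
    query = urlencode(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=query))


rebuilt = rebuild(RAW_URL)
# The decoded parameters are identical, but the bytes on the wire are not:
# urlencode escapes "/" as "%2F", so a signature computed over the original
# string no longer matches what the receiver canonicalizes.
same_params = parse_qsl(urlsplit(rebuilt).query) == parse_qsl(urlsplit(RAW_URL).query)
print(same_params, rebuilt == RAW_URL)
```

Keeping the original string next to the parsed form, as the RawURL field does, sidesteps the problem: inspect the parsed copy, send the raw one.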

Other layers also exercised live

  • Egress allowlist auto-extended to include sts.us-east-1.amazonaws.com — Forge's outbound STS call was permitted.
  • Loopback static_token still works for the dashboard at http://localhost:9999/ (auto-prepended via PrependChain; the Phase 1 review #10 invariant is intact under Phase 2).
  • Config hot-reload picks up forge.yaml content changes for most fields. Caveat: auth-chain providers are constructed once at startup; modifying allowed_principals requires a hard restart (Ctrl-C + forge run again) for the new allowlist to take effect. Documented as a follow-up — affects all providers, not just aws_sigv4.
  • Auth-before-Egress wizard ordering (639bfa9) ensures wizard-scaffolded configs include the STS host in egress_hosts.

What was NOT live-tested (and why)

  • gcp_iap provider — requires a GCP project with HTTPS LB + IAP enabled. Covered by unit tests with a fake JWKS signer.
  • azure_ad provider — requires an Entra tenant + app registration. Covered by unit tests with a fake AAD.

Per PHASE2_TEST_STRATEGY.md §8.2, those live tests run at release-tag time, not on every PR.

What landed (12 commits)

| # | Commit | Theme |
|---|---|---|
| PR 1 | 55942d8 | Header contract extension (foundation) |
| PR 2 | 9a1ebae | aws_sigv4 provider (initial header-reflection design) |
| PR 3 | 5b71071 | gcp_iap provider |
| PR 4 | 1e23140 | azure_ad provider |
| PR 5 | 47c4748 | Wizard + non-interactive flags + Web UI integration |
| PR 6 | 98578f9 | Operator docs + CHANGELOG |
| fix | 639bfa9 | Auth step runs before Egress; auth hosts auto-added to allowlist |
| audit | b5f303b | Audit-pass refinements (iap_jwt token_kind, URL parse caching, cleanup) |
| docs | aaf8375 | Gitignore docs/auth (per reviewer feedback) |
| client | 382294e | Reference client + "client signing contract" docstring (original pattern) |
| pivot | b3444c2 | Switch aws_sigv4 to pre-signed URL pattern |
| live | 8568535 | Live-test fixes: preserve raw URL; use SigV4QueryAuth in client |

Total: ~42 files, +5,500 / -700 lines net (mostly tests + the design-pivot rewrite).

Phase 1 compatibility

  • Zero forge.yaml changes required for callers using Phase 1 providers (static_token, oidc, http_verifier). Phase 1 test suite passes unmodified.
  • The auth Headers map gained one new key — X-Goog-Iap-Jwt-Assertion for gcp_iap. Existing keys unchanged.
  • The oidc package gained an internal SkipIssuerCheck field with yaml:"-" — unreachable from forge.yaml, only set by azure_ad multi-tenant. Operators see no change.

Security model highlights

  • No IdP secret on Forge. STS / GCP / AAD do the cryptographic work; Forge reflects or verifies, never holds keys.
  • SSRF guard on aws_sigv4 — the pre-signed URL host MUST match sts.<configured-region>.amazonaws.com. A token whose URL points elsewhere is rejected at parse time, before any outbound request.
  • Algorithm whitelist before key lookup on the OIDC / IAP / AAD providers — algorithm-confusion attacks rejected before reaching JWKS.
  • Tenant gate before Graph on azure_ad — multi-tenant requires explicit opt-in; Graph calls only fire after tid validation.
  • Caller's Bearer reflected to Graph — Forge holds no Graph credentials.

Known deferred work

  • TUI sub-step input flows for the three new providers — non-interactive flag path is the production-critical surface; TUI users pick "Custom" and edit forge.yaml directly until the TUI follow-up lands.
  • Hot-reload of auth chain — config edits to auth.providers (incl. allowed_principals) require a hard forge run restart. Affects all providers, not just Phase 2.

Test plan

  • Unit tests: happy path, malformed input, scope/tenant rejection, allowlist miss, cache hit/miss, stale-grace, soft-fail, alg confusion, body caps, redirect attacks — all three new providers
  • Pre-signed URL parser fuzz test (corpus-seeded, runs in CI)
  • Phase 1 regression suite passes unmodified
  • go test -race -count=1 ./... — 42 packages green
  • golangci-lint v2.10.1 — 0 issues
  • gofmt -l forge-core forge-cli forge-plugins — clean
  • Anti-pattern grep gates (no aws-sdk-go import, IAP constants confined, no JWT in azure_ad, skip_issuer_check never in YAML) — all pass
  • Live AWS validation — happy + deny paths against real STS (see "Real-AWS validation" above)
  • Manual smoke for gcp_iap and azure_ad — runs at release-tag time per PHASE2_TEST_STRATEGY.md §8.2

Design artifacts (offline)

Full design package in ~/Desktop/forge_designs_and_PRD/phase2_implementation/:

  • PHASE2_CLOUD_NATIVE_PROVIDERS.md — top-level design + §9 locked decisions
  • PHASE2_PROGRESS_MAP.md — diagram-to-PR tracker
  • PHASE2_TEST_STRATEGY.md — pyramid, harnesses, security catalog, CI gates, manual smoke
  • 7 PlantUML diagrams (component / class / startup / request flow / per-provider deep-dives)
  • PR1_HEADER_CONTRACT.md through PR6_DOCS.md — per-PR checklists with code sketches and acceptance criteria

Naveen Kurra added 12 commits May 23, 2026 19:19
…r 1)

HeadersFromRequest gains Authorization, X-Goog-Iap-Jwt-Assertion,
X-Amz-Date, X-Amz-Security-Token so future providers consuming
non-Bearer formats (aws_sigv4, gcp_iap) can read what they need
without changing the Provider.Verify signature.

TokenKind recognizes the "AWS4-HMAC-SHA256 " prefix and returns
"sigv4", so audit logs can distinguish Sigv4 requests from "empty"
even though the Bearer extractor returns "".

Middleware now consults the chain even when no Bearer token was
extracted, provided a non-Bearer auth header is present (Sigv4
Authorization or IAP assertion). When NO auth headers at all are
present, the audit reason still resolves to ErrMissingBearer —
preserving review initializ#4's stable "missing_token" reason code.

Phase 1 providers see zero behavior change; their Verify path is
unchanged. All Phase 1 tests pass without modification.
…e 2 pr 2)

aws_sigv4 authenticates AWS-IAM callers by reflecting their Sigv4
signature to STS GetCallerIdentity. No aws-sdk-go-v2 dependency
(decision §9.1): the STS RPC is ~150 LOC of hand-rolled HTTP +
XML. Forge never holds the caller's secret key — STS validates
the signature on Forge's behalf.

Key pieces:
- sigv4_parser.go: pure string parser, fuzz-tested, never panics
- sts_client.go: 200/4xx/5xx classification per review initializ#6 contract
- identity_cache.go: hash(AKID|YYYYMMDD)-keyed TTL cache, opportunistic
  eviction past 10k entries, Put does NOT extend prior expiry
- arn_matcher.go: shell-style globs via path.Match (decision §9.3),
  invalid patterns fail at Factory time
- provider.go: scope check (service=sts, region match) before any
  STS round trip, cache hit avoids RPC, rejection does NOT poison
  the cache
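
The cache contract above can be sketched as follows. This is one reading of it, not the Go code: SHA-256 as the hash, the TTL value, and the interpretation of "Put does NOT extend prior expiry" (a re-Put of a live entry keeps its original deadline) are all assumptions, and the class and method names are hypothetical.

```python
import hashlib
import time


class IdentityCache:
    def __init__(self, ttl_seconds: float, max_entries: int = 10_000,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}  # key -> (arn, expires_at)

    @staticmethod
    def key(access_key_id: str, yyyymmdd: str) -> str:
        """Bucket by day so a cached identity never outlives the signing date."""
        return hashlib.sha256(f"{access_key_id}|{yyyymmdd}".encode()).hexdigest()

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        arn, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the expired entry on read
            return None
        return arn

    def put(self, key: str, arn: str) -> None:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self._entries[key] = (arn, entry[1])  # keep the original deadline
            return
        if len(self._entries) >= self._max:
            # Opportunistic eviction: sweep expired entries only when full.
            self._entries = {k: v for k, v in self._entries.items() if v[1] > now}
        self._entries[key] = (arn, now + self._ttl)

    def __len__(self) -> int:
        return len(self._entries)
```

Only successful verifications would call put, which is how a rejection avoids poisoning the cache.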

security:
- Algorithm: only AWS4-HMAC-SHA256 prefix is claimed
- Scope: cross-service replay (s3->sts) and cross-region replay
  (eu-west-1->us-east-1) rejected at parse-time
- Cache: bucketing by YYYYMMDD bounds stolen-key window to a day
- Body cap: 64 KiB on STS responses
- Logs: STS error bodies summarized at 200 chars, newlines stripped

audit:
- ErrTokenNotForMe   -> not_for_me   (no AWS4 prefix)
- ErrInvalidToken    -> invalid      (malformed Sigv4)
- ErrTokenRejected   -> rejected     (scope/allowlist/STS 4xx)
- ErrProviderUnavailable -> provider_unavailable (STS 5xx/network)
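
As a rough Python illustration of the classification above (the real client is Go), the sketch below maps an STS status code to an audit reason and pulls the ARN and account out of a 200 response. Treating an unparseable 200 body as provider_unavailable is an assumption, the function names are hypothetical, and the sample XML uses placeholder values.

```python
import xml.etree.ElementTree as ET

STS_NS = {"sts": "https://sts.amazonaws.com/doc/2011-06-15/"}

SAMPLE_OK = (
    '<GetCallerIdentityResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">'
    "<GetCallerIdentityResult>"
    "<Arn>arn:aws:sts::123456789012:assumed-role/Deployer/ci-session</Arn>"
    "<UserId>AROAEXAMPLEID:ci-session</UserId>"
    "<Account>123456789012</Account>"
    "</GetCallerIdentityResult>"
    "</GetCallerIdentityResponse>"
)


def summarize(body: str, limit: int = 200) -> str:
    """Log-safe summary of an error body: newlines stripped, capped at 200 chars."""
    return body.replace("\r", " ").replace("\n", " ")[:limit]


def classify(status: int, body: str):
    """Return (audit_reason, identity); audit_reason is None on success."""
    if 400 <= status < 500:
        return "rejected", None
    if status != 200:
        return "provider_unavailable", None
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "provider_unavailable", None  # assumption: bad XML counts as an outage
    result = root.find("sts:GetCallerIdentityResult", STS_NS)
    if result is None:
        return "provider_unavailable", None
    arn = result.findtext("sts:Arn", namespaces=STS_NS)
    account = result.findtext("sts:Account", namespaces=STS_NS)
    if not arn or not account:
        return "provider_unavailable", None
    return None, {"arn": arn, "account": account}
```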

extras:
- security.AuthDomains gains sts.<region>.amazonaws.com (+ override
  host when sts_endpoint set for tests)
- forge-cli/runtime/runner.go side-effect imports aws_sigv4

gcp_iap consumes the X-Goog-Iap-Jwt-Assertion header that GCP's
Identity-Aware Proxy forwards on every authenticated request when
Forge sits behind a GCP HTTPS load balancer with IAP enabled.

Decision §9.4: IAP issuer (https://cloud.google.com/iap) and JWKS
URL (https://www.gstatic.com/iap/verify/public_key-jwk) are
hardcoded. They're the only stable contract GCP exposes; an
override knob would be a footgun.

key pieces:
- iap_jwks.go: ES256-only JWKS cache, TTL refresh + backoff +
  stale-grace (mirrors Phase 1 OIDC review initializ#1 pattern)
- provider.go: header-presence check, claims projection, iss/aud
  gates, sub/email required-claims check
- parseECJWKSet drops non-EC / non-P-256 / non-ES256-labeled keys
  during parse — defense in depth against compromised JWKS
- alg whitelist rejects RS256 BEFORE key lookup (algorithm-
  confusion defense)
- aud as string OR array both parse (JWT spec allows either)
- audit reasons follow Phase 1 contract:
    rejected             — iss/aud mismatch, expired, bad signature
    invalid              — alg != ES256, missing sub/email, bad kid
    provider_unavailable — JWKS fetch failed AND no prior key cached
    not_for_me           — header absent
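
The ordering of these gates can be sketched in Python. Signature verification is deliberately left out (it needs a crypto library), so this is not a usable verifier. It only shows the alg allowlist running before any key lookup and the string-or-array handling of aud. The names precheck, fake_token, and AuthError are hypothetical, and lookup_key stands in for the JWKS cache.

```python
import base64
import json

IAP_ISSUER = "https://cloud.google.com/iap"


class AuthError(Exception):
    """Carries an audit reason: invalid or rejected."""


def _decode_segment(segment: str) -> dict:
    padded = segment + "=" * (-len(segment) % 4)
    value = json.loads(base64.urlsafe_b64decode(padded))
    if not isinstance(value, dict):
        raise ValueError("JWT segment is not a JSON object")
    return value


def precheck(token: str, audience: str, lookup_key):
    try:
        header_b64, claims_b64, _signature = token.split(".")
        header = _decode_segment(header_b64)
        claims = _decode_segment(claims_b64)
    except ValueError as exc:
        raise AuthError("invalid") from exc
    # Algorithm allowlist comes first: a non-ES256 token never reaches the JWKS.
    if header.get("alg") != "ES256":
        raise AuthError("invalid")
    key = lookup_key(header.get("kid"))
    # A real verifier checks the ES256 signature with `key` at this point.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if claims.get("iss") != IAP_ISSUER or audience not in audiences:
        raise AuthError("rejected")
    if not claims.get("sub") or not claims.get("email"):
        raise AuthError("invalid")
    return key, claims


def fake_token(header: dict, claims: dict) -> str:
    """Build an UNSIGNED token for exercising the gates above."""
    def seg(obj):
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()
    return ".".join([seg(header), seg(claims), "sig"])
```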

extras:
- security.AuthDomains returns www.gstatic.com when gcp_iap is configured
- forge-cli/runtime/runner.go side-effect imports gcp_iap

azure_ad authenticates Microsoft Entra ID tokens. Composes the
Phase 1 oidc.Provider (decision §9.2) for signature verify + base
claim validation; layers AAD-specific concerns on top:
  - Tenant lock-in via the tid claim
  - Optional Microsoft Graph group enrichment when JWT groups claim
    is empty (AAD truncates at ~200 groups)
  - Single-tenant vs multi-tenant issuer template

key pieces:
- provider.go: composed oidc + tenant gate + Source overwrite to
  "azure_ad" (replaces the inner "oidc" stamp)
- tenant.go: ExtractTenantID — typed accessor for the tid claim
- graph_client.go: Graph /me/transitiveMemberOf with pagination,
  same-host enforcement (rejects redirect attacks), 401/403 ->
  ErrTokenRejected, 5xx -> ErrProviderUnavailable, defensive
  cap at 5000 groups, body cap 1 MiB per page
- graph_cache.go: 5 min TTL, same shape as aws_sigv4's cache
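
The pagination guard in graph_client.go can be pictured with the Python sketch below. fetch_page is a hypothetical injected callable that returns one decoded Graph page. Raising ValueError on a foreign host and returning a truncated list at the cap are assumptions about behavior the commit message does not spell out.

```python
from urllib.parse import urlsplit

MAX_GROUPS = 5000


def collect_group_ids(first_url: str, fetch_page) -> list[str]:
    """Follow @odata.nextLink pages, refusing any link that leaves the first host."""
    expected_host = urlsplit(first_url).netloc  # parsed once, compared per page
    groups: list[str] = []
    url = first_url
    while url:
        if urlsplit(url).netloc != expected_host:
            raise ValueError(f"nextLink left {expected_host}")
        page = fetch_page(url)
        for item in page.get("value", []):
            groups.append(item["id"])
            if len(groups) >= MAX_GROUPS:
                return groups  # defensive cap reached; stop paginating
        url = page.get("@odata.nextLink")
    return groups
```

Because the caller's Bearer token rides along on every page request, following a nextLink to another host would hand that token to a third party, which is what the host comparison prevents.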

key decisions:
- oidc.Config gains internal SkipIssuerCheck flag with yaml:"-"
  so it CANNOT be set via forge.yaml — only callable from another
  Go package. AAD multi-tenant uses it; everything else leaves it
  off. Surfacing it in YAML would let operators disable iss
  validation by accident.
- Soft-fail on Graph 5xx/401: Identity returned with empty Groups
  rather than blocking prod traffic. Hard-fail mode (graph_required)
  out of scope for v0.11.
- Forge reflects the CALLER's Bearer to Graph; holds no Graph
  credentials of its own.

audit reasons:
- ErrTokenRejected     -> rejected (tid mismatch, bad sig, Graph 401)
- ErrInvalidToken      -> invalid  (missing tid, malformed claims)
- ErrProviderUnavailable -> provider_unavailable (Graph 5xx, JWKS down)

extras:
- security.AuthDomains returns login.microsoftonline.com always;
  graph.microsoft.com when groups_mode=graph
- forge-cli/runtime/runner.go side-effect imports azure_ad
…(phase 2 pr 5)

Wires aws_sigv4, gcp_iap, and azure_ad into the operator surfaces:

cli (forge-cli/cmd/init*.go):
- New non-interactive flags namespaced --auth-aws-* / --auth-gcp-iap-* /
  --auth-azure-* (StringSlice for repeatable allowed-principal globs)
- buildAuthFromFlags validates required combinations and emits the
  right egress hosts per provider (sts.<region>.amazonaws.com,
  www.gstatic.com, login.microsoftonline.com + graph.microsoft.com
  when groups_mode=graph)
- authEgressHostsFromSettings mirrors the same logic for the Web UI
- renderAuthBlock supports []string lists with proper YAML quoting
  (allowed_principals)

web ui (forge-ui/handlers_create.go):
- AuthProviderTypeMeta lists the three new types with helpful labels

validate (forge-core/validate/auth.go):
- knownAuthProviderTypes admits aws_sigv4 / gcp_iap / azure_ad
- validateProviderSettings enforces per-type required keys
  (aws_sigv4.region, gcp_iap.audience, azure_ad.audience +
   tenant_id-unless-multi-tenant, azure_ad.groups_mode whitelist)

tests:
- 11 new renderer + flag-parsing tests
- Round-trip YAML parse used instead of brittle quote-pattern asserts
- Updated wizard-meta test to expect 7 auth provider types

deliberate scope cut:
- TUI step_auth.go sub-step input flows for the 3 new providers are
  NOT included. Adding them is mechanical (~100 LOC per provider,
  mirroring the OIDC issuer→audience→groups_claim phase chain) but
  out of scope for v0.11.0 cut. Non-interactive flag path covers the
  production-critical CI/CD case; operators using the TUI can pick
  "Custom" and edit forge.yaml directly until the follow-up lands.
…pr 6)

Adds the operator-facing documentation for the three Phase 2 providers
that shipped in PRs 1–5, plus a top-level auth index, chain-semantics
concepts page, CHANGELOG, and a README link.

new docs:
- docs/auth/index.md — provider matrix and chain-semantics overview
- docs/auth/concepts/chain.md — first-match-wins, no-fall-through on
  reject, non-Bearer header support, mixed-chain worked example
- docs/auth/providers/aws_sigv4.md — STS reflection setup, awscurl
  example, assumed-role-vs-IAM-role gotcha called out twice
- docs/auth/providers/gcp_iap.md — backend service ID lookup steps,
  hardcoded JWKS rationale, GCP IAM Conditions for allowlisting
- docs/auth/providers/azure_ad.md — app registration walkthrough,
  single/multi/graph mode configs, multi-tenant warning prominent
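
The chain semantics named for chain.md can be sketched as below, assuming that not_for_me is the only outcome that lets the next provider try and that any other error ends the chain. The exception and function names are hypothetical, and what Forge returns when no provider claims a request is not stated here, so the final raise is a placeholder.

```python
class NotForMe(Exception):
    """The provider does not recognize this credential format."""


class Rejected(Exception):
    """The provider claimed the credential and refused it."""


def verify_chain(providers, request):
    """providers is an ordered list of (name, verify) pairs."""
    for name, verify in providers:
        try:
            identity = verify(request)
        except NotForMe:
            continue  # not claimed: let the next provider look at it
        # Rejected (or any other error) propagates: no fall-through on reject.
        return name, identity  # first match wins
    raise NotForMe("no provider claimed the request")


def aws_like(request):
    token = request.get("bearer", "")
    if not token.startswith("forge-aws-v1."):
        raise NotForMe()
    raise Rejected("principal not in allowlist")


def catch_all(request):
    return {"user_id": "anyone"}


chain = [("aws_sigv4", aws_like), ("oidc", catch_all)]
```

With this chain, a forge-aws-v1. token that fails the allowlist raises Rejected and never reaches the permissive second provider, while any other token falls through to it.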

every provider doc includes:
- Prerequisites checklist
- forge.yaml example
- Configuration reference table
- Audit log shape (literal JSON)
- Troubleshooting matrix (grep-able reason codes)
- Security model + limitations sections

CHANGELOG.md (new file):
- Lists Added / Changed entries for v0.11.0
- "Notes for upgraders" makes the non-breaking nature explicit
- Calls out the known TUI sub-flow gap from PR 5

README.md:
- Adds Auth Providers row to the Security documentation table
…owlist

The wizard was asking for Egress confirmation before the operator had
picked an auth provider, so STS / AAD authority / IAP JWKS hosts never
appeared in the egress list. Forge would scaffold a forge.yaml whose
egress_hosts blocked its own auth-provider RPC calls — failure happens
later at `forge run`, with no signal the wizard could have caught.

changes:
- Swap step order in init.go: Auth now runs immediately before Egress
- Extend DeriveEgressFunc with (authMode, authSettings) so the Egress
  step's Prepare(ctx) pulls the operator's auth choice from
  WizardContext and forwards it into deriveEgressDomains
- deriveEgressDomains calls authEgressHostsFromSettings (same helper
  the non-interactive --auth=… path uses) — TUI and CLI now produce
  identical egress lists for any given auth choice
- EgressStep's inferSource() learns to label auth-derived hosts:
    sts.<region>.amazonaws.com → "aws_sigv4 auth"
    www.gstatic.com           → "gcp_iap auth"
    login.microsoftonline.com → "azure_ad auth"
    graph.microsoft.com       → "azure_ad auth (graph)"
    <oidc issuer host>        → "oidc auth"
    <http_verifier url host>  → "http_verifier auth"
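
The host derivation described in this commit amounts to a small lookup. The Python sketch below mirrors the mapping listed above. The settings keys region and groups_mode come from the validation rules in PR 5, while issuer and url as key names for oidc and http_verifier are assumptions, as are the function names.

```python
from urllib.parse import urlsplit


def auth_egress_hosts(provider_type: str, settings: dict) -> list[str]:
    """Hosts Forge must be allowed to reach for the chosen auth provider."""
    if provider_type == "aws_sigv4":
        return [f"sts.{settings['region']}.amazonaws.com"]
    if provider_type == "gcp_iap":
        return ["www.gstatic.com"]
    if provider_type == "azure_ad":
        hosts = ["login.microsoftonline.com"]
        if settings.get("groups_mode") == "graph":
            hosts.append("graph.microsoft.com")
        return hosts
    if provider_type == "oidc":
        return [urlsplit(settings["issuer"]).hostname]
    if provider_type == "http_verifier":
        return [urlsplit(settings["url"]).hostname]
    return []  # e.g. static_token needs no outbound call


def merge_egress(existing: list[str], auth_hosts: list[str]) -> list[str]:
    """Additive merge: auth hosts are appended, nothing already present is dropped."""
    merged = list(existing)
    for host in auth_hosts:
        if host not in merged:
            merged.append(host)
    return merged
```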

tests:
- TestDeriveEgressDomains_AuthProviderHostsMerged: 8 cases pinning the
  per-provider host emission (incl. graph-mode adds graph host)
- TestDeriveEgressDomains_AuthHostsMergeNotOverwrite: auth pass is
  additive — provider / channel hosts still emit alongside auth hosts

docs:
- docs/auth/concepts/chain.md gains a "TUI wizard ordering" section
  explaining the Auth-before-Egress invariant
…, cleanup

Final-pass audit findings against the phase 2 design doc surfaced one
correctness bug and several small improvements. All gates clean
(go test -race / golangci-lint / gofmt). 42 packages pass.

BUG fix — middleware emits token_kind="iap_jwt" for IAP requests:
  The strategy doc §5/§10 lists five token_kind values: empty, opaque,
  jwt, sigv4, iap_jwt. PR1 wired sigv4 detection but missed iap_jwt,
  so successful GCP IAP requests audited with token_kind="empty" — the
  same value as no-auth requests, defeating the audit-pipeline goal of
  counting IAP traffic distinctly. Middleware now classifies
  X-Goog-Iap-Jwt-Assertion presence as kind="iap_jwt" on the
  empty-Bearer path. New regression test pins it.
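
  Combined with the forge-aws-v1. prefix detection from the pivot commit, the five-way classification can be sketched in Python as below. How Forge tells jwt from opaque is not described in this PR, so the three-non-empty-segments rule is an assumption, as is the function name.

```python
IAP_HEADER = "X-Goog-Iap-Jwt-Assertion"
AWS_PREFIX = "forge-aws-v1."


def token_kind(bearer: str, headers: dict) -> str:
    """Classify a request for audit logging: empty, iap_jwt, sigv4, jwt, or opaque."""
    if bearer == "":
        # Empty-Bearer path: an IAP assertion still counts as a credential.
        return "iap_jwt" if headers.get(IAP_HEADER) else "empty"
    if bearer.startswith(AWS_PREFIX):
        return "sigv4"
    segments = bearer.split(".")
    if len(segments) == 3 and all(segments):  # assumption: JWT-shaped means "jwt"
        return "jwt"
    return "opaque"
```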

Improvement — graph_client.go avoids per-page URL re-parse:
  ensureGraphHost was parsing GraphClient.endpoint via url.Parse on
  EVERY pagination step. Pre-parse the endpoint Host once at construction
  and compare against that string instead. Trims redundant work on
  multi-page Graph responses.

Improvement — gcp_iap classifyJWTErr ordering hardened:
  Replaced the bare substring match on "kid" (which would catch unrelated
  errors) with the specific patterns: "kid " (e.g. "kid X not found")
  and "not found" (covers JWKS-resolution failures). Pre-existing
  ordering invariant comment is now actually defended.

Cleanup — drop redundant single-function file:
  Moved ExtractTenantID from azure_ad/tenant.go into provider.go alongside
  other claim accessors and removed the empty tenant.go. The function
  was a 1-liner and didn't justify its own file.

Cleanup — inline audienceContains shim:
  Replaced the audienceContains() wrapper (one-liner around slices.Contains)
  with a direct call at the use site. Less indirection, same behavior.

Cleanup — middleware: simplify hasNonBearerAuth boolean expr:
  Folded the multi-line if-chain into a single boolean expression. Same
  semantics, less noise.

audit findings deferred as nits, not fixed:
- aws_sigv4 Parser as zero-value struct (cosmetic; keeps symmetry)
- egress_step.go hostOf manual URL parsing (cosmetic; non-hot path)
- 10k eviction comment wording

audit findings confirmed not bugs:
- GraphCache TTL test (already exists in graph_cache_test.go)
- PrependChain loopback invariant intact (runner.go line 2036)

The Phase 2 provider docs were committed as MD files under docs/auth/
but we don't want to version-control them — the source-of-truth lives
in the design folder, and we'll deliver via the doc site separately.

- .gitignore: add docs/auth/
- git rm --cached docs/auth/** (local files preserved)
- README.md: drop the now-broken "Auth Providers" docs row
- CHANGELOG.md: drop the docs/auth/*.md links from the v0.11.0 entry

No code or test changes.
…ontract

Real-AWS testing surfaced a documentation gap: callers cannot use raw
`awscurl` / `aws-sdk-go` against Forge's `aws_sigv4` provider because
Sigv4 binds the signature to the destination host. Standard tools sign
for the URL they're addressing (Forge) — STS then rejects the reflected
signature because the host bytes don't match.

The server-side code is correct. The client just needs to sign a
hypothetical STS request, then attach the resulting headers to its real
POST to Forge. Same pattern as aws-iam-authenticator for EKS.

This commit:
- Ships `scripts/forge-aws-sign.py`, a ~100 LOC reference client using
  boto3.session + SigV4Auth. CLI flags for --region, --url, --profile,
  --body, --verbose. Reads SSO/IRSA/profile/env credentials via boto3's
  standard chain.
- Extends the package-level docstring in
  `forge-core/auth/providers/aws_sigv4/provider.go` with a
  "Client-side signing contract" section spelling out the 4-step pattern
  and pointing readers to the reference script.
- Adds a "Client-side requirement" section to CHANGELOG.md so adopters
  know to grab the helper or write their own before integrating.

Validated against real AWS:
- STS reflection: 200, identity stamped, correct ARN/Account/UserID
- ARN allowlist match: 200 (matching pattern)
- ARN allowlist miss: 401 reason=rejected (correct authz gate)
- No-auth: 401 reason=missing_token (Phase 1 contract preserved)
…uthenticator)

Phase 2 PR 2's original "reflect Sigv4 headers" design was broken in
the obvious way: Sigv4 binds its signature to the destination host as
part of the canonicalized signing input. Headers signed for Forge's
host could not be replayed against STS — STS sees host:sts.<region>.
amazonaws.com, recomputes the signature, gets a different hash,
rejects with "SignatureDoesNotMatch". Caught during real-AWS smoke;
documented in PR initializ#79 description.

This commit replaces the pattern with the same approach
aws-iam-authenticator uses for EKS:

  Client (3 lines):
    url   = boto3.client('sts').generate_presigned_url('get_caller_identity', ExpiresIn=900)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(url.encode()).rstrip(b'=').decode()
    requests.post(forge_url, headers={'Authorization': f'Bearer {token}'}, ...)

  Server:
    Authorization: Bearer forge-aws-v1.<base64-of-presigned-sts-url>
    → decode + validate host (SSRF guard) + GET on the URL → STS → identity

Net effect on caller experience: identical to JWT/OIDC/azure_ad —
"mint token, send Bearer, done." Three lines of client code, hidden in
~15 lines of any AWS SDK in any language.

what changed:

  forge-core/auth/providers/aws_sigv4/
    sigv4_parser.go  — was parsing AWS4-HMAC-SHA256 Authorization header
                       now parses forge-aws-v1.<base64-url> Bearer tokens
                       (URL host validation, SSRF guard, X-Amz-Credential
                       parsing for cache key derivation)
    sts_client.go    — was POST with reflected headers
                       now GET on the pre-signed URL; same 200/4xx/5xx
                       classification and 64 KiB body cap
    provider.go      — Verify() now reads the Bearer token (not raw
                       headers); SSRF guard via expectedHost field;
                       same cache + ARN allowlist semantics

  forge-core/auth/
    provider.go      — HeadersFromRequest reverts X-Amz-Date and
                       X-Amz-Security-Token (no longer needed); keeps
                       X-Goog-Iap-Jwt-Assertion for gcp_iap
    provider.go      — TokenKind detects "forge-aws-v1." prefix → "sigv4"
                       (was: "AWS4-HMAC-SHA256 " on raw Authorization)
    middleware.go    — simplify: empty-Bearer fallback only handles IAP
                       (aws_sigv4 rides standard Bearer flow now)

  scripts/forge-aws-sign.py — rewrite as a clean reference client.
    --token-only: print just the token for use with curl/other tools
    Otherwise: do the round-trip POST and print the response

  CHANGELOG.md — replace "client wrapper required" friction note with
    the 3-line happy path snippet

what stays unchanged:

  - forge.yaml shape (still type: aws_sigv4, region:, allowed_principals:)
  - identity_cache.go, arn_matcher.go (cache and authz logic untouched)
  - security.AuthDomains (sts.<region>.amazonaws.com derivation)
  - forge-cli/cmd/init* flag set and renderer
  - validate.ValidateAuthConfig (region still required)
  - forge-ui/handlers_create.go (AuthProviderTypeMeta entry)

Tests: 42 packages pass, golangci-lint v2.10.1 clean, gofmt clean,
no aws-sdk-go imports (decision §9.1 still holds).

Net diff: +732 / -625 lines (mostly test rewrites; ~80 LOC net less
in the provider package because the new flow is structurally simpler).
…n client

Two correctness fixes surfaced by live AWS testing of the pre-signed URL
pattern from b3444c2.

1. Preserve the raw URL byte-for-byte.

Round-tripping the presigned URL through Go's net/url package
re-encoded query params in subtle ways (e.g. "/" in X-Amz-Credential,
"+" inside X-Amz-Security-Token) that didn't match how the AWS SDK
emitted them on the caller side. STS recomputes the canonical request
using whatever bytes we send and gets a different hash → 4xx
SignatureDoesNotMatch → audit reason "rejected".

  - PresignedToken gains a RawURL field — the exact bytes from the
    decoded token payload.
  - The parsed *url.URL is kept ONLY for SSRF host validation and
    query-param inspection. It is NEVER used to construct the outbound
    request.
  - Provider.Verify now passes parsed.RawURL to STSClient.GetCallerIdentity.

2. Use SigV4QueryAuth directly in the reference client (not boto3's
   high-level generate_presigned_url).

boto3.client('sts').generate_presigned_url('get_caller_identity', ...)
produces a URL STS rejects with SignatureDoesNotMatch when GET. Known
quirk — the high-level presigner signs as if the request were a POST.
aws-iam-authenticator works around this by signing the AWSRequest
explicitly; scripts/forge-aws-sign.py now does the same:

    req = AWSRequest(method='GET', url='https://sts.{region}.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15')
    SigV4QueryAuth(creds, 'sts', region, expires=900).add_auth(req)
    token = 'forge-aws-v1.' + base64.urlsafe_b64encode(req.url.encode()).rstrip(b'=').decode()

Live validation against real AWS (account 412664885516, SSO assumed-role):
- Happy path: HTTP 400 body-shape error + auth_verify with correct ARN
- Deny path:  HTTP 401 + auth_fail reason="rejected" + token_kind="sigv4"

42 packages still pass; golangci-lint clean; gofmt clean.

(Known follow-up surfaced but out of scope: hot-reload of forge.yaml
doesn't rebuild the auth chain, so allowlist changes require a hard
restart. Same caveat affects all providers, not just aws_sigv4.)