Skip to content

feat: chaos for proxy mode (drop, disconnect, malformed)#134

Closed
marthakelly wants to merge 13 commits intomainfrom
feat/chaos-proxy-mode
Closed

feat: chaos for proxy mode (drop, disconnect, malformed)#134
marthakelly wants to merge 13 commits intomainfrom
feat/chaos-proxy-mode

Conversation

@marthakelly
Copy link
Copy Markdown
Contributor

Summary

Chaos (drop, disconnect, malformed) now applies in proxy mode. Previously chaos was skipped for proxied requests, making them unrealistically reliable.

Clean replacement for PR #122 — fresh branch off current main, only the chaos-proxy work, organized as 13 commits by area of concern.

What it does

  • Pre-flight drop/disconnect pre-empt upstream (nothing is contacted).
  • Post-response malformed proxies to upstream, captures the response, replaces the body with invalid JSON before relay.
  • Chaos is rolled once per request, after fixture match, with resolved config (headers > fixture.chaos > server defaults).
  • SSE bypass: when upstream streams SSE progressively, malformed chaos is skipped (bytes are already on the wire) and aimock_chaos_bypassed_total fires so the bypass is visible.
  • Source labeling: aimock_chaos_triggered_total and journal entries carry source: "fixture" | "proxy".
  • Video status: handleVideoStatus (GET /v1/videos/:id) now evaluates chaos, consistent with other handlers.

Design highlights

  • applyChaos split into roll+dispatch (applyChaosAction) so callers that pre-roll can't re-roll by accident.
  • proxyAndRecord returns "not_configured" | "relayed" | "handled_by_hook" instead of a ghost flag shared with the hook.
  • beforeWriteResponse hook on proxyAndRecord for post-response mutation; onHookBypassed companion for SSE visibility.
  • JournalEntry.response.source disambiguates null-fixture entries.
  • applyChaos requires an explicit source parameter so a future handler that grows a proxy path can't silently journal the wrong label.

Commits

  1. feat: apply pre-flight chaos to proxy mode (phase 1)
  2. test: add fixture-mode pre-flight chaos test
  3. feat: post-response chaos for proxy mode (malformed)
  4. refactor: single chaos roll per request
  5. refactor: apply review feedback — explicit API, no ghost flag
  6. test: pin binary-safe hook contract + signal unused response arg
  7. test: proxy-failure integration through handleCompletions
  8. refactor: address remaining review points
  9. refactor(chaos): require explicit source on applyChaos
  10. feat(chaos): record SSE bypass when hook can't mutate streamed response
  11. test(metrics): assert source label on chaos counter (fixture + proxy)
  12. docs: chaos in proxy mode + sequential rate semantics
  13. feat(chaos): extend chaos support to handleVideoStatus

Scope

In: drop / disconnect / malformed for proxy mode, non-streaming.
Deferred: status_override, truncate, streaming (SSE/NDJSON) mutation.

Test plan

  • Pre-flight drop/disconnect: upstream counter stays at 0, client gets 500/disconnect
  • Post-response malformed: upstream IS called, body is invalid JSON, journal records chaosAction
  • Single-roll semantics: drop journals the matched fixture + increments match count
  • SSE bypass: malformed chaos skipped, bypass metric emitted, triggered metric NOT emitted
  • Proxy failure interaction: 502 reaches client, journal correct, no chaosAction
  • Prometheus source labels: source="fixture" vs source="proxy" on chaos counter
  • Video status chaos: drop fires before video-not-found 404
  • Binary-safe hook: beforeWriteResponse receives raw upstream bytes
  • Content-type preserved on no-chaos replay

pnpm format:checkpnpm lintpnpm test ✓ (2504 passed)

Context

Supersedes #122, which collected a large backlog of unrelated fixes via merged blitz/omnibus branches and became difficult to review.


Generated by Claude Code

marthakelly and others added 13 commits April 23, 2026 18:50
Previously, chaos in proxy mode only supported pre-flight actions — drop
and disconnect fired before upstream was ever contacted. Malformed was
evaluated pre-flight too, so it short-circuited the proxy call and never
actually exercised the upstream path. This landed malformed as a
post-response mutation so the proxy does hit upstream, captures the
response, and then (probabilistically) replaces the body with invalid
JSON before relay.

Design:

- Split applyChaos into a roll+dispatch pair. applyChaosAction dispatches
  a known action without re-rolling the dice, so callers that pre-roll
  to decide phase-appropriate dispatch can't drift from their own
  decision.
- Gate pre-flight chaos in handleCompletions to drop/disconnect only.
  Malformed flows through to either fixture post-match or proxy
  post-response.
- Gate post-match applyChaos on fixture presence (previously ran
  unconditionally). Removes a stale roll on the no-fixture/proxy path.
- Add an optional beforeWriteResponse hook to proxyAndRecord. The hook
  receives the captured upstream response and can either relay it
  normally (return false) or replace it (return true). Non-breaking for
  the other nine proxyAndRecord callers — options defaults to undefined.
- handleCompletions uses the hook to apply post-response malformed
  chaos and journal it, with a chaosHandled flag that suppresses the
  outer journal.add so the request records exactly one entry.

Tests (red/green verified — malformed test fails on pre-refactor code
because upstream is never called, then passes after the changes):

- applies malformed chaos AFTER proxying (upstream called, body
  corrupted, journaled exactly once with chaosAction).
- preserves upstream content-type on replay when no chaos fires.

All 2440 existing tests continue to pass.
Phase 1 noted that pre-flight + post-match evaluation was a tradeoff
to preserve fixture-level chaos overrides without a larger refactor,
and that the double-evaluation would be removed in phase 2. This lands
that change.

Match the fixture first, then roll chaos once with the resolved config
(headers > fixture.chaos > server defaults). Dispatch by action + path:
drop/disconnect apply immediately; malformed on the fixture path writes
invalid JSON instead of the fixture response; malformed on the proxy
path goes through the beforeWriteResponse hook, which now closes over
the pre-rolled action rather than re-evaluating.

Observable behavior change: when drop/disconnect fires on a request
that matches a fixture, the journal entry now records the matched
fixture (not null) and the fixture's match count is incremented. Under
the old double-roll, whether the fixture was recorded depended on
which phase rolled first — an inconsistency that's now gone.

New test pins the single-roll semantics via the journal entry's
fixture field on a drop.
Review feedback on the phase-2 draft surfaced four legitimate issues;
this addresses them by making the API boundaries more explicit.

1. chaosHandled ghost flag → ProxyOutcome return type.
   proxyAndRecord previously returned boolean, so the server.ts caller
   had to share a mutated closure variable with the beforeWriteResponse
   hook to know whether the hook already journaled. Replaced boolean
   with "not_configured" | "relayed" | "handled_by_hook". The caller
   pattern-matches on the outcome instead of reading a flag, and the
   other thirteen proxyAndRecord callers (bedrock, cohere, ollama,
   speech, images, video, gemini, messages, embeddings, transcription,
   responses, bedrock-converse, multimedia-record test) are updated
   mechanically from `if (proxied)` to `if (outcome !== "not_configured")`.
   Those callers were also latently hazardous — any future refactor
   returning a truthy non-boolean would silently break them.

2. Logic leakage in the chaos hook → caller gates hook existence.
   Previously the hook did `if (chaosAction !== "malformed") return false`
   as its first line. Now the hook is only passed when the action is
   malformed, so inside the hook it can unconditionally apply + return
   true. Decision lives at the call site, executor lives in the hook.

3. isBinary dead code → removed from ProxyCapturedResponse.
   Was added preemptively, never read by the one hook that consumes
   the captured response. YAGNI.

4. Journal source ambiguity → added response.source: "fixture" | "proxy".
   A null fixture in a chaos journal entry was ambiguous (proxy path?
   unmatched 404?). applyChaosAction now requires a source argument
   (required, not optional — TypeScript won't let a branch silently
   omit it). The applyChaos wrapper used by the ten non-chat handlers
   defaults to "fixture" since those callers only apply chaos in
   fixture-serving contexts. Chat completions passes the appropriate
   source at every site.

Not addressed:
- Hardcoded Content-Type: application/json in applyChaosAction's
  malformed branch. For the current PR scope (chat completions, always
  JSON upstream) this is correct; for phase 3 (streaming / non-JSON)
  the dispatch will need to branch on content type.
- Full journal schema consistency (source field on every handler's
  journal.add path, not just chaos entries). Non-chat handlers set
  source via applyChaos's "fixture" default; their non-chaos journal
  paths stay unchanged here.

All 2441 tests pass (previous 2440 + one pinning the single-roll
semantics). Format + lint clean.
Two small points from review:

1. The chaos hook in server.ts declared a `response` parameter it
   didn't use. Dropped the parameter entirely rather than prefixing
   with `_` (the project's ESLint config doesn't honor that pattern).
   TypeScript allows a callback to declare fewer params than its
   signature type, which is itself a valid signal that the hook
   doesn't consume the response. A comment notes where phase 3 will
   need it.

2. Added a proxyAndRecord unit test that stands up a real HTTP
   upstream returning non-UTF8 bytes (0xff 0xfe 0xfd ...) and
   verifies the `beforeWriteResponse` hook receives a Buffer
   byte-equal to the upstream response. Pins the refactor's claim
   that the hook sees raw upstream bytes rather than a
   UTF-8-decoded-then-re-encoded view.
Unit tests in recorder.test.ts prove proxyAndRecord writes 502 on
upstream failure. These two integration tests pin the corresponding
server-level behavior:

1. Proxy failure (no chaos): client gets 502, journal records one
   entry with status=502, fixture=null, source="proxy",
   chaosAction=undefined. Confirms handleCompletions's "relayed"
   branch journals the failure correctly and doesn't hang.

2. Proxy failure with malformedRate=1.0: client still gets 502
   (not a malformed-JSON body), journal has no chaosAction. Pins
   that when upstream fails before the hook is invoked, chaos is
   "rolled but never fired" — the journal reflects what happened,
   not what was intended.
Four small changes from review:

1. Tag aimock_chaos_triggered_total with source. Observability wanted
   to see "N% of chaos fires are fixture vs proxy"; the label was
   already plumbed through applyChaosAction, just wasn't on the
   metric itself.

2. Comment at the collapse/hook site in recorder.ts noting that the
   upstream response is buffered in memory. Fine for chat traffic;
   revisit if this path ever proxies large or long-lived streams.

3. Comment at the fixture-write in recorder.ts clarifying that the
   persisted artifact is always the real upstream response, even
   when chaos later mutates the relay. Chaos is a runtime
   decoration; the recorded artifact must stay truthful so replay
   sees what upstream said.

4. Disconnect journal-pinning test in chaos-fixture-mode.test.ts,
   symmetric to the existing drop test. Disconnect's status: 0 is
   an unusual shape worth a regression pin.
The applyChaos helper hard-coded source: "fixture" with a comment
asserting non-chat handlers only apply chaos in fixture-serving
contexts. True today, enforced by convention. Lifts the parameter to
required so a future handler that grows a proxy path cannot silently
journal the wrong source. All 15 non-chat call sites pass "fixture"
explicitly.

No behavioral change — every caller was already journaling source:
"fixture" via the hard-coded default.
When upstream streams SSE, the beforeWriteResponse hook can't fire —
the bytes are already on the wire by the time proxyAndRecord's relay
code path runs. Previously malformedRate: 1.0 against SSE traffic
silently meant 0% corruption: no log, no metric, no journal trace.
Lifting the gate in a future refactor (phase 3: streaming mutation)
could then silently corrupt SSE mid-flight with zero test failures.

Adds an onHookBypassed callback to ProxyOptions; recorder.ts invokes
it when a hook was registered but streaming bypassed it.
handleCompletions wires telemetry: a logger.warn plus an
aimock_chaos_bypassed_total{action,source,reason} counter increment.

Also widens the beforeWriteResponse JSDoc. The old docs described
only the "skip default relay" obligation; a hook that returns `true`
actually carries three: it wrote the response, it journaled the
outcome, AND it skipped default relay. All three now explicit.

New proxy-only test pins the end-to-end behavior:
  - SSE stream reaches client intact (content-type + frames)
  - Journal records a clean proxy relay, no chaosAction
  - aimock_chaos_bypassed_total increments with action+source+reason
  - aimock_chaos_triggered_total does NOT increment (didn't fire)
Existing aimock_chaos_triggered_total assertions matched
[^}]*action="drop" — they passed regardless of whether source was
present or correct. The source label was added to this public
Prometheus metric when chaos was extended to proxy mode, so shipping
it unasserted is a regression hazard.

Tightened the fixture-source assertion in the existing test and
added a paired proxy-source variant so both values of the source
dimension are pinned. A future change that drops the source label
or flips fixture/proxy will now fail loudly.
Two documentation surfaces needed updates for this PR's behavior.

ChaosConfig JSDoc (src/types.ts):
Rates are evaluated sequentially with first-hit semantics, so
{ dropRate: 0.5, malformedRate: 0.5 } yields a ~25% effective
malformed rate, not 50%. Worth a paragraph on the type so users
don't reason about them as independent probabilities.

Chaos Testing page (docs/chaos-testing):
The site page described chaos modes but never mentioned proxy mode,
which is this PR's whole premise. Adds a "Proxy Mode" section
covering:
  - Pre-flight drop/disconnect (upstream never contacted).
  - Post-response malformed (upstream called, body replaced).
  - The recorded fixture stays truthful; chaos is a live-traffic
    decoration, not a fixture mutation.
  - SSE bypass: malformed can't mutate streamed responses; visible
    via aimock_chaos_bypassed_total + server warning.

Also updates the Prometheus Metrics section to show the `source`
label (added when chaos was extended to proxy mode) and documents
the new `aimock_chaos_bypassed_total` counter.

README.md is not changed — its one-line chaos summary is still
accurate; the detail belongs on the site page.
handleVideoStatus (GET /v1/videos/{id}) was the only handler missing
chaos support. Wires applyChaos into the polling path so tests can
inject drop / disconnect / malformed for video-status checks too.

Since video status polling has no proxy path, it passes null for
fixture and "fixture" for source.

Widens ChaosJournalContext.body to accept null (matching
JournalEntry.body) so handleVideoStatus can pass null without an
unsafe cast.
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Apr 23, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@copilotkit/aimock@134

commit: 22fe6bd

@jpr5
Copy link
Copy Markdown
Contributor

jpr5 commented Apr 23, 2026

Closing in favor of #122 which has the same work plus CR fixes (chaos source labels, CORS headers, video status chaos, docs accuracy) and a release bump to v1.15.0.

@jpr5 jpr5 closed this Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants