Security research disclosure. This document maps the complete safety architecture of Anthropic's Claude Code CLI (v2.1.119), including system prompt construction, permission systems, refusal handlers, remote killswitches, telemetry, and classifier trust boundaries.
All findings are from analysis of the publicly distributed Claude Code binary and the open-sourced Claude Code repository.
Claude Code ships with multiple layers of safety controls that are entirely client-side. The binary contains the full system prompt in plaintext, all permission logic, classifier decisions, and remote killswitch handlers. Any user with a hex editor can read, modify, or remove these controls. This document demonstrates the architectural implications.
- System Prompt Internals - Every safety instruction embedded in the binary, with exact strings
- Permission Pipeline - How tool calls are approved/denied, including the YOLO classifier
- Refusal Handler Chain - How the client processes server-side refusals
- Remote Killswitches - GrowthBook feature flags that Anthropic can trigger to disable features
- Classifier Trust Boundary - An architectural weakness where `tool_result` content bypasses the input classifier
- Telemetry Map - All outbound data collection endpoints and event names
- Undocumented Features - Opus 4.7 gating, undercover mode, advisor tool encryption
The Anthropic API runs two classifiers:
| Classifier | When | What it scans | What it skips |
|---|---|---|---|
| Input (pre-model) | Before inference | System prompt, assistant messages, `tool_use` blocks | `tool_result` blocks |
| Output (post-model) | During inference | Model-generated content | N/A |
The input classifier trusts all tool_result content (role: user). This means any content that reaches the model via a tool result (file reads, command output, web fetches) is not scanned by the input classifier. This is an architectural trust boundary decision with security implications for deployments that rely on the classifier for content filtering.
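A deployment that relies on content filtering could cover this gap by inspecting tool results itself before they are sent. The following is a minimal sketch under stated assumptions, not a description of how Anthropic's classifiers work: it assumes messages are plain dictionaries shaped like Messages API content blocks, and `tool_result_texts`, `flag_tool_results`, and the `looks_suspicious` predicate are hypothetical names introduced here for illustration.

```python
from typing import Callable, Iterable

# Hypothetical helpers for illustration only; not part of any SDK.
# Assumed message shape: {"role": ..., "content": str | list[dict]}.


def tool_result_texts(messages: Iterable[dict]) -> list[str]:
    """Collect the text carried by tool_result blocks in user messages."""
    texts: list[str] = []
    for message in messages:
        content = message.get("content")
        # Plain-string content carries no tool_result blocks.
        if message.get("role") != "user" or not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") != "tool_result":
                continue
            inner = block.get("content", "")
            if isinstance(inner, str):
                texts.append(inner)
            else:
                # A tool_result may also hold a list of text blocks.
                texts.extend(
                    b.get("text", "") for b in inner if b.get("type") == "text"
                )
    return texts


def flag_tool_results(
    messages: Iterable[dict], looks_suspicious: Callable[[str], bool]
) -> list[str]:
    """Return the tool_result texts that the caller's predicate flags."""
    return [t for t in tool_result_texts(messages) if looks_suspicious(t)]
```

The predicate is left to the caller on purpose: what counts as suspicious depends on the deployment, and a simple keyword check is only a placeholder for a real filter.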
The patches are designed to be applied by Claude Code itself. The irony is intentional.
- Open Claude Code in a terminal with the target binary's directory accessible
- Paste the relevant patch section from PATCHES.md into the conversation
- Claude will write a Python script that performs length-matched binary replacements on its own executable
- Keep this Claude Code session open after patching -- do NOT close it
- Open a second Claude Code instance in a new terminal to verify the patches took effect
- The second instance runs the patched binary; test it by asking something the unpatched version would refuse
- If the second instance behaves correctly, close the first (unpatched) session
- If something broke, the backup (`claude.exe.bak`) is in the same directory -- restore and retry
Critical notes:
- All patches MUST be length-matched (same byte count old and new) or the binary will corrupt
- Always back up the binary before patching: `copy claude.exe claude.exe.bak`
- When Claude Code auto-updates, patches are overwritten -- re-apply from this manifest
- The binary path is: `%APPDATA%\Claude\claude-code\<VERSION>\claude.exe`
- Find current version: check the most recently modified directory in `%APPDATA%\Claude\claude-code\`
- `PATCHES.md` - Complete patch manifest with exact before/after strings for all safety controls
- `LICENSE` - Proprietary license (view/study permitted, commercial use prohibited)
Copyright (c) 2025-2026 Andrew C. Dorman. All Rights Reserved. Proprietary license. View and study permitted. Commercial use prohibited. See LICENSE for full terms.