Skip to content

fix: sanitize non-UTF-8 source in get_code_snippet#541

Closed
piiiico wants to merge 3 commits into
DeusData:mainfrom
piiiico:fix/utf8-sanitize-get-code-snippet
Closed

fix: sanitize non-UTF-8 source in get_code_snippet#541
piiiico wants to merge 3 commits into
DeusData:mainfrom
piiiico:fix/utf8-sanitize-get-code-snippet

Conversation

@piiiico

@piiiico piiiico commented Jun 21, 2026

Copy link
Copy Markdown

Problem

get_code_snippet reads file bytes via read_file_lines and passes them directly to yyjson_mut_obj_add_str without validating the encoding. When a source file uses a non-UTF-8 encoding (CP949, EUC-KR, Shift_JIS, GBK — common in legacy CJK Windows codebases), the emitted JSON contains raw bytes that are invalid UTF-8. Over the MCP stdio JSON-RPC transport, the client's stream decoder stalls on a malformed multi-byte lead byte, hanging the session indefinitely (observed 40+ minutes).

The existing sanitize_ascii() used in search_code handles this by replacing all non-ASCII bytes with ?, which also destroys valid UTF-8 (Korean comments, Japanese identifiers, emoji).

Fix

Add sanitize_utf8() — a single-pass UTF-8 validator (RFC 3629 §3) that replaces each invalid byte with U+FFFD (EF BF BD) per Unicode §3.9 (Maximal Subpart substitution), while preserving valid multi-byte UTF-8.

Applied in build_snippet_response before JSON serialization.

What it handles

Case Byte pattern Action
ASCII 00..7F pass through
Valid 2-byte C2..DF + 80..BF pass through
Valid 3-byte E0..EF + continuations (checked) pass through
Valid 4-byte F0..F4 + continuations (checked) pass through
Overlong 2-byte C0, C1 replace with U+FFFD
Overlong 3-byte E0 + 80..9F replace with U+FFFD
Overlong 4-byte F0 + 80..8F replace with U+FFFD
Surrogates ED + A0..BF replace with U+FFFD
Above U+10FFFF F4 + 90..BF, or F5..FF replace with U+FFFD
Truncated sequence missing continuation bytes replace lead with U+FFFD
Stray continuation 80..BF where lead expected replace with U+FFFD

Scope

  • 85 lines added, single file (src/mcp/mcp.c)
  • No new dependencies — uses only <string.h> and <stdlib.h> (both already included)
  • No changes to sanitize_ascii or search_code (separate concern)
  • Syntax-checked with gcc -fsyntax-only

Note

The broader pattern — yy_doc_to_str using YYJSON_WRITE_ALLOW_INVALID_UNICODE (line 113) — means other tools could emit invalid UTF-8 from database-stored strings. This PR fixes the known reproduction path; a follow-up could audit the other tool handlers.

Fixes #511.

get_code_snippet reads file bytes via read_file_lines and passes them
directly to yyjson without validating the encoding. When a source file
uses a non-UTF-8 encoding (CP949, EUC-KR, Shift_JIS, GBK — common in
legacy CJK codebases), the emitted JSON contains raw bytes that are
invalid UTF-8. Over the MCP stdio JSON-RPC transport this hangs clients
indefinitely because the stream decoder stalls on a malformed multi-byte
lead byte.

Add sanitize_utf8() — a single-pass UTF-8 validator that replaces each
invalid byte with U+FFFD (EF BF BD) per Unicode §3.9 (Maximal Subpart
substitution). Unlike the existing sanitize_ascii() used in search_code,
this preserves valid multi-byte UTF-8 (CJK, emoji, accented Latin, etc.)
while only replacing genuinely invalid sequences.

Applied in build_snippet_response before JSON serialization so that both
MCP and CLI paths emit valid UTF-8.

Handles: overlong encodings, truncated sequences, unexpected continuation
bytes, surrogates (U+D800..U+DFFF), and codepoints above U+10FFFF.

Fixes DeusData#511.

Signed-off-by: piiiico <pico@aamdal.dev>
@piiiico piiiico force-pushed the fix/utf8-sanitize-get-code-snippet branch from 0fce176 to 93e33d7 Compare June 21, 2026 01:47
Compact multi-condition if-checks that fit within the 100-char column
limit onto single lines, and repack the 4-byte sequence check so three
conditions share the first line (95 chars) with the fourth wrapping to
the continuation, as clang-format-20 requires.

Signed-off-by: piiiico <pico@amdal.dev>
@piiiico piiiico force-pushed the fix/utf8-sanitize-get-code-snippet branch from ab981a3 to ba5e876 Compare June 21, 2026 03:03
@piiiico

piiiico commented Jun 21, 2026

Copy link
Copy Markdown
Author

Lint passes ✅ and DCO is now fixed (the clang-format fixup commit was missing the Signed-off-by line — amended and force-pushed).

Remaining jobs (build/test matrix, CodeQL) are still running — all green so far. Ready for review once those complete.

@DeusData

Copy link
Copy Markdown
Owner

Thanks @piiiico — of the two #511 fixes, your sanitizer is the cleaner implementation; the (c & 0xF0) == 0xE0 bitmask checks are easier to audit against RFC 3629.

Two things needed before it can land:

  1. Add a reproduce-first test — the project requires a failing-first test per fix. fix(mcp): return valid UTF-8 snippets #526 has a good one (writes bad CP949 bytes, asserts valid JSON + replacement chars); we'd pair that with your sanitizer.
  2. Add an overflow guard on the src_len * 3 + 1 allocation (fix(mcp): return valid UTF-8 snippets #526 has one) — strictly it's UB if src_len > SIZE_MAX/3, even if practically unreachable.

Plan: land your sanitizer + #526's test + the guard as one combined fix, crediting both contributors. Sound good? 🙏

…f8 (DeusData#511)

Per maintainer review on DeusData#541, two additions to the snippet UTF-8
sanitizer:

1. **Reproduce-first test** (`snippet_source_invalid_utf8`)
   Writes a Go source file containing the CP949 sequence 0xC0 0xD4
   0xB7 0xC2 inside a comment, then calls `get_code_snippet` and
   asserts:
   - the outer MCP envelope parses as strict JSON,
   - the embedded snippet JSON parses as strict JSON,
   - the original invalid bytes (0xC0 0xD4) do not appear verbatim,
   - surrounding valid bytes (HandleRequest, return nil) survive,
   - U+FFFD (EF BF BD) marks where invalid bytes were.

   Without `sanitize_utf8`, the response contains raw non-UTF-8 bytes
   and step 1 fails (yyjson rejects it) — i.e. the test fails on
   pre-fix `main`. Byte pattern and structure borrowed from DeusData#526
   (jstar0) so both contributions converge on the same regression.

2. **Allocation overflow guard**
   `malloc(src_len * 3 + 1)` is technically UB if
   `src_len > (SIZE_MAX - 1) / 3`. Snippet sources can't reach that
   in practice, but the guard makes the safety property local to
   the function rather than relying on call-site bounds. The same
   bound is enforced in DeusData#526.

Refs DeusData#511.

Co-authored-by: King Star <54024410+jstar0@users.noreply.github.com>
Signed-off-by: piiiico <pico@amdal.dev>
@piiiico

piiiico commented Jun 22, 2026

Copy link
Copy Markdown
Author

Pushed c0ae935. Two things:

  1. New test snippet_source_invalid_utf8 (tests/test_mcp.c) — borrows fix(mcp): return valid UTF-8 snippets #526's CP949 byte pattern (0xC0 0xD4 0xB7 0xC2 inside a Go comment) and assertion structure: outer envelope parses as JSON, embedded snippet parses as JSON, no raw 0xC0 0xD4, surrounding bytes (HandleRequest, return nil) preserved, U+FFFD present where invalid bytes were. Without sanitize_utf8 the very first assertion (yyjson_read on the raw MCP response) fails — i.e. failing-first on pre-fix main. Credited @jstar0 via Co-authored-by.

  2. Overflow guard added to sanitize_utf8: early return NULL when src_len > (SIZE_MAX - 1) / 3, before the * 3 + 1 allocation. Matches the bound used in fix(mcp): return valid UTF-8 snippets #526.

Single commit for the pair; happy to split into test: + fix: if you prefer separate reviews. CI is running.

@piiiico

piiiico commented Jun 23, 2026

Copy link
Copy Markdown
Author

codeql-gate timed out at 21:53Z polling for CodeQL — but the CodeQL SAST run itself completed successfully at 21:44Z (run 27984113003). Looks like a timing race in the gate script. Could you re-run the failed jobs?

@DeusData

Copy link
Copy Markdown
Owner

Thanks @piiiico — your non-UTF-8 sanitizer was the best-written of the two #511 fixes, and the overflow guard you added was a real improvement.

We're landing #526 for #511 — it's green and merge-ready, whereas this one is currently blocked on the codeql-gate/ci-ok checks. The reproduce-first test you co-authored with @jstar0 carries over there, so your work isn't lost. Closing this as superseded — genuinely appreciate the contribution. 🙏

@DeusData DeusData closed this Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_code_snippet hangs MCP client for non-UTF-8 (CP949/EUC-KR) source files — raw bytes emitted as invalid UTF-8 in the source field

2 participants