fix: sanitize non-UTF-8 source in get_code_snippet by piiiico · Pull Request #541 · DeusData/codebase-memory-mcp

piiiico · 2026-06-21T01:13:29Z

Problem

get_code_snippet reads file bytes via read_file_lines and passes them directly to yyjson_mut_obj_add_str without validating the encoding. When a source file uses a non-UTF-8 encoding (CP949, EUC-KR, Shift_JIS, GBK — common in legacy CJK Windows codebases), the emitted JSON contains raw bytes that are invalid UTF-8. Over the MCP stdio JSON-RPC transport, the client's stream decoder stalls on a malformed multi-byte lead byte, hanging the session indefinitely (observed 40+ minutes).

The existing sanitize_ascii() used in search_code handles this by replacing all non-ASCII bytes with ?, which also destroys valid UTF-8 (Korean comments, Japanese identifiers, emoji).

Fix

Add sanitize_utf8() — a single-pass UTF-8 validator (RFC 3629 §3) that replaces each invalid byte with U+FFFD (EF BF BD) per Unicode §3.9 (Maximal Subpart substitution), while preserving valid multi-byte UTF-8.

Applied in build_snippet_response before JSON serialization.

What it handles

Case	Byte pattern	Action
ASCII	`00..7F`	pass through
Valid 2-byte	`C2..DF` + `80..BF`	pass through
Valid 3-byte	`E0..EF` + continuations (checked)	pass through
Valid 4-byte	`F0..F4` + continuations (checked)	pass through
Overlong 2-byte	`C0`, `C1`	replace with U+FFFD
Overlong 3-byte	`E0` + `80..9F`	replace with U+FFFD
Overlong 4-byte	`F0` + `80..8F`	replace with U+FFFD
Surrogates	`ED` + `A0..BF`	replace with U+FFFD
Above U+10FFFF	`F4` + `90..BF`, or `F5..FF`	replace with U+FFFD
Truncated sequence	missing continuation bytes	replace lead with U+FFFD
Stray continuation	`80..BF` where lead expected	replace with U+FFFD

Scope

85 lines added, single file (src/mcp/mcp.c)
No new dependencies — uses only <string.h> and <stdlib.h> (both already included)
No changes to sanitize_ascii or search_code (separate concern)
Syntax-checked with gcc -fsyntax-only

Note

The broader pattern — yy_doc_to_str using YYJSON_WRITE_ALLOW_INVALID_UNICODE (line 113) — means other tools could emit invalid UTF-8 from database-stored strings. This PR fixes the known reproduction path; a follow-up could audit the other tool handlers.

Fixes #511.

get_code_snippet reads file bytes via read_file_lines and passes them directly to yyjson without validating the encoding. When a source file uses a non-UTF-8 encoding (CP949, EUC-KR, Shift_JIS, GBK — common in legacy CJK codebases), the emitted JSON contains raw bytes that are invalid UTF-8. Over the MCP stdio JSON-RPC transport this hangs clients indefinitely because the stream decoder stalls on a malformed multi-byte lead byte. Add sanitize_utf8() — a single-pass UTF-8 validator that replaces each invalid byte with U+FFFD (EF BF BD) per Unicode §3.9 (Maximal Subpart substitution). Unlike the existing sanitize_ascii() used in search_code, this preserves valid multi-byte UTF-8 (CJK, emoji, accented Latin, etc.) while only replacing genuinely invalid sequences. Applied in build_snippet_response before JSON serialization so that both MCP and CLI paths emit valid UTF-8. Handles: overlong encodings, truncated sequences, unexpected continuation bytes, surrogates (U+D800..U+DFFF), and codepoints above U+10FFFF. Fixes DeusData#511. Signed-off-by: piiiico <pico@aamdal.dev>

Compact multi-condition if-checks that fit within the 100-char column limit onto single lines, and repack the 4-byte sequence check so three conditions share the first line (95 chars) with the fourth wrapping to the continuation, as clang-format-20 requires. Signed-off-by: piiiico <pico@amdal.dev>

piiiico · 2026-06-21T03:04:26Z

Lint passes ✅ and DCO is now fixed (the clang-format fixup commit was missing the Signed-off-by line — amended and force-pushed).

Remaining jobs (build/test matrix, CodeQL) are still running — all green so far. Ready for review once those complete.

DeusData · 2026-06-22T20:27:09Z

Thanks @piiiico — of the two #511 fixes, your sanitizer is the cleaner implementation; the (c & 0xF0) == 0xE0 bitmask checks are easier to audit against RFC 3629.

Two things needed before it can land:

Add a reproduce-first test — the project requires a failing-first test per fix. fix(mcp): return valid UTF-8 snippets #526 has a good one (writes bad CP949 bytes, asserts valid JSON + replacement chars); we'd pair that with your sanitizer.
Add an overflow guard on the src_len * 3 + 1 allocation (fix(mcp): return valid UTF-8 snippets #526 has one) — strictly it's UB if src_len > SIZE_MAX/3, even if practically unreachable.

Plan: land your sanitizer + #526's test + the guard as one combined fix, crediting both contributors. Sound good? 🙏

…f8 (DeusData#511) Per maintainer review on DeusData#541, two additions to the snippet UTF-8 sanitizer: 1. **Reproduce-first test** (`snippet_source_invalid_utf8`) Writes a Go source file containing the CP949 sequence 0xC0 0xD4 0xB7 0xC2 inside a comment, then calls `get_code_snippet` and asserts: - the outer MCP envelope parses as strict JSON, - the embedded snippet JSON parses as strict JSON, - the original invalid bytes (0xC0 0xD4) do not appear verbatim, - surrounding valid bytes (HandleRequest, return nil) survive, - U+FFFD (EF BF BD) marks where invalid bytes were. Without `sanitize_utf8`, the response contains raw non-UTF-8 bytes and step 1 fails (yyjson rejects it) — i.e. the test fails on pre-fix `main`. Byte pattern and structure borrowed from DeusData#526 (jstar0) so both contributions converge on the same regression. 2. **Allocation overflow guard** `malloc(src_len * 3 + 1)` is technically UB if `src_len > (SIZE_MAX - 1) / 3`. Snippet sources can't reach that in practice, but the guard makes the safety property local to the function rather than relying on call-site bounds. The same bound is enforced in DeusData#526. Refs DeusData#511. Co-authored-by: King Star <54024410+jstar0@users.noreply.github.com> Signed-off-by: piiiico <pico@amdal.dev>

piiiico · 2026-06-22T21:08:31Z

Pushed c0ae935. Two things:

New test snippet_source_invalid_utf8 (tests/test_mcp.c) — borrows fix(mcp): return valid UTF-8 snippets #526's CP949 byte pattern (0xC0 0xD4 0xB7 0xC2 inside a Go comment) and assertion structure: outer envelope parses as JSON, embedded snippet parses as JSON, no raw 0xC0 0xD4, surrounding bytes (HandleRequest, return nil) preserved, U+FFFD present where invalid bytes were. Without sanitize_utf8 the very first assertion (yyjson_read on the raw MCP response) fails — i.e. failing-first on pre-fix main. Credited @jstar0 via Co-authored-by.
Overflow guard added to sanitize_utf8: early return NULL when src_len > (SIZE_MAX - 1) / 3, before the * 3 + 1 allocation. Matches the bound used in fix(mcp): return valid UTF-8 snippets #526.

Single commit for the pair; happy to split into test: + fix: if you prefer separate reviews. CI is running.

piiiico · 2026-06-23T00:26:46Z

codeql-gate timed out at 21:53Z polling for CodeQL — but the CodeQL SAST run itself completed successfully at 21:44Z (run 27984113003). Looks like a timing race in the gate script. Could you re-run the failed jobs?

DeusData · 2026-06-23T21:46:28Z

Thanks @piiiico — your non-UTF-8 sanitizer was the best-written of the two #511 fixes, and the overflow guard you added was a real improvement.

We're landing #526 for #511 — it's green and merge-ready, whereas this one is currently blocked on the codeql-gate/ci-ok checks. The reproduce-first test you co-authored with @jstar0 carries over there, so your work isn't lost. Closing this as superseded — genuinely appreciate the contribution. 🙏

piiiico force-pushed the fix/utf8-sanitize-get-code-snippet branch from 0fce176 to 93e33d7 Compare June 21, 2026 01:47

piiiico force-pushed the fix/utf8-sanitize-get-code-snippet branch from ab981a3 to ba5e876 Compare June 21, 2026 03:03

DeusData mentioned this pull request Jun 22, 2026

fix(mcp): return valid UTF-8 snippets #526

Merged

DeusData closed this Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: sanitize non-UTF-8 source in get_code_snippet#541

fix: sanitize non-UTF-8 source in get_code_snippet#541
piiiico wants to merge 3 commits into
DeusData:mainfrom
piiiico:fix/utf8-sanitize-get-code-snippet

piiiico commented Jun 21, 2026

Uh oh!

piiiico commented Jun 21, 2026

Uh oh!

DeusData commented Jun 22, 2026

Uh oh!

piiiico commented Jun 22, 2026

Uh oh!

piiiico commented Jun 23, 2026

Uh oh!

DeusData commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

piiiico commented Jun 21, 2026

Problem

Fix

What it handles

Scope

Note

Uh oh!

piiiico commented Jun 21, 2026

Uh oh!

DeusData commented Jun 22, 2026

Uh oh!

piiiico commented Jun 22, 2026

Uh oh!

piiiico commented Jun 23, 2026

Uh oh!

DeusData commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants