fix: sanitize non-UTF-8 source in get_code_snippet#541
Conversation
get_code_snippet reads file bytes via read_file_lines and passes them directly to yyjson without validating the encoding. When a source file uses a non-UTF-8 encoding (CP949, EUC-KR, Shift_JIS, GBK — common in legacy CJK codebases), the emitted JSON contains raw bytes that are invalid UTF-8. Over the MCP stdio JSON-RPC transport this hangs clients indefinitely because the stream decoder stalls on a malformed multi-byte lead byte. Add sanitize_utf8() — a single-pass UTF-8 validator that replaces each invalid byte with U+FFFD (EF BF BD) per Unicode §3.9 (Maximal Subpart substitution). Unlike the existing sanitize_ascii() used in search_code, this preserves valid multi-byte UTF-8 (CJK, emoji, accented Latin, etc.) while only replacing genuinely invalid sequences. Applied in build_snippet_response before JSON serialization so that both MCP and CLI paths emit valid UTF-8. Handles: overlong encodings, truncated sequences, unexpected continuation bytes, surrogates (U+D800..U+DFFF), and codepoints above U+10FFFF. Fixes DeusData#511. Signed-off-by: piiiico <pico@aamdal.dev>
0fce176 to
93e33d7
Compare
Compact multi-condition if-checks that fit within the 100-char column limit onto single lines, and repack the 4-byte sequence check so three conditions share the first line (95 chars) with the fourth wrapping to the continuation, as clang-format-20 requires. Signed-off-by: piiiico <pico@amdal.dev>
ab981a3 to
ba5e876
Compare
|
Lint passes ✅ and DCO is now fixed (the clang-format fixup commit was missing the Signed-off-by line — amended and force-pushed). Remaining jobs (build/test matrix, CodeQL) are still running — all green so far. Ready for review once those complete. |
|
Thanks @piiiico — of the two #511 fixes, your sanitizer is the cleaner implementation; the Two things needed before it can land:
Plan: land your sanitizer + #526's test + the guard as one combined fix, crediting both contributors. Sound good? 🙏 |
…f8 (DeusData#511) Per maintainer review on DeusData#541, two additions to the snippet UTF-8 sanitizer: 1. **Reproduce-first test** (`snippet_source_invalid_utf8`) Writes a Go source file containing the CP949 sequence 0xC0 0xD4 0xB7 0xC2 inside a comment, then calls `get_code_snippet` and asserts: - the outer MCP envelope parses as strict JSON, - the embedded snippet JSON parses as strict JSON, - the original invalid bytes (0xC0 0xD4) do not appear verbatim, - surrounding valid bytes (HandleRequest, return nil) survive, - U+FFFD (EF BF BD) marks where invalid bytes were. Without `sanitize_utf8`, the response contains raw non-UTF-8 bytes and step 1 fails (yyjson rejects it) — i.e. the test fails on pre-fix `main`. Byte pattern and structure borrowed from DeusData#526 (jstar0) so both contributions converge on the same regression. 2. **Allocation overflow guard** `malloc(src_len * 3 + 1)` is technically UB if `src_len > (SIZE_MAX - 1) / 3`. Snippet sources can't reach that in practice, but the guard makes the safety property local to the function rather than relying on call-site bounds. The same bound is enforced in DeusData#526. Refs DeusData#511. Co-authored-by: King Star <54024410+jstar0@users.noreply.github.com> Signed-off-by: piiiico <pico@amdal.dev>
|
Pushed
Single commit for the pair; happy to split into |
|
|
|
Thanks @piiiico — your non-UTF-8 sanitizer was the best-written of the two #511 fixes, and the overflow guard you added was a real improvement. We're landing #526 for #511 — it's green and merge-ready, whereas this one is currently blocked on the |
Problem
get_code_snippetreads file bytes viaread_file_linesand passes them directly toyyjson_mut_obj_add_strwithout validating the encoding. When a source file uses a non-UTF-8 encoding (CP949, EUC-KR, Shift_JIS, GBK — common in legacy CJK Windows codebases), the emitted JSON contains raw bytes that are invalid UTF-8. Over the MCP stdio JSON-RPC transport, the client's stream decoder stalls on a malformed multi-byte lead byte, hanging the session indefinitely (observed 40+ minutes).The existing
sanitize_ascii()used insearch_codehandles this by replacing all non-ASCII bytes with?, which also destroys valid UTF-8 (Korean comments, Japanese identifiers, emoji).Fix
Add
sanitize_utf8()— a single-pass UTF-8 validator (RFC 3629 §3) that replaces each invalid byte with U+FFFD (EF BF BD) per Unicode §3.9 (Maximal Subpart substitution), while preserving valid multi-byte UTF-8.Applied in
build_snippet_responsebefore JSON serialization.What it handles
00..7FC2..DF+80..BFE0..EF+ continuations (checked)F0..F4+ continuations (checked)C0,C1E0+80..9FF0+80..8FED+A0..BFF4+90..BF, orF5..FF80..BFwhere lead expectedScope
src/mcp/mcp.c)<string.h>and<stdlib.h>(both already included)sanitize_asciiorsearch_code(separate concern)gcc -fsyntax-onlyNote
The broader pattern —
yy_doc_to_strusingYYJSON_WRITE_ALLOW_INVALID_UNICODE(line 113) — means other tools could emit invalid UTF-8 from database-stored strings. This PR fixes the known reproduction path; a follow-up could audit the other tool handlers.Fixes #511.