Skip to content

fix(mcp): return valid UTF-8 snippets#526

Open
jstar0 wants to merge 1 commit into
DeusData:mainfrom
jstar0:fix/snippet-valid-utf8
Open

fix(mcp): return valid UTF-8 snippets#526
jstar0 wants to merge 1 commit into
DeusData:mainfrom
jstar0:fix/snippet-valid-utf8

Conversation

@jstar0

@jstar0 jstar0 commented Jun 19, 2026

Copy link
Copy Markdown

Summary

  • Sanitize get_code_snippet source text to valid UTF-8 before serializing the MCP response.
  • Preserve valid UTF-8 bytes and replace only invalid source bytes with U+FFFD.
  • Add a regression test for non-UTF-8 source bytes in the snippet file.

Fixes #511.

Changes

  • Adds a small lossy UTF-8 sanitizer for snippet source strings.
  • Uses the sanitized copy for the source field in build_snippet_response().
  • Verifies both the raw MCP response envelope and the embedded snippet JSON parse as strict JSON when the source contains CP949/EUC-KR-like bytes.

Verification

make -f Makefile.cbm test
make -f Makefile.cbm lint-format
make -f Makefile.cbm lint-no-suppress
git diff --check

I also attempted:

make -f Makefile.cbm lint-ci

That local run stopped at cppcheck because cppcheck is not installed in my environment.

Signed-off-by: King Star <mcxin.y@gmail.com>
@DeusData

Copy link
Copy Markdown
Owner

Thanks @jstar0 — your reproduce-first test here (write known-bad CP949 bytes, assert the response is valid JSON with U+FFFD replacements) is excellent and we want to keep it.

There's a parallel PR #541 fixing the same issue (#511) with a cleaner sanitizer (canonical (c & 0xF0) == 0xE0 lead-byte checks). Our plan is to land one combined fix: #541's sanitizer + your test + an explicit overflow guard, crediting both of you.

One change needed here regardless: the overflow guard len > (((size_t)-1) - SKIP_ONE) / UTF8_REPLACEMENT_LEN hinges on the opaque SKIP_ONE macro — please make the + 1 explicit (with a comment) so the bound is auditable without chasing the macro.

Would you be open to combining your test with #541's sanitizer? 🙏

piiiico added a commit to piiiico/codebase-memory-mcp that referenced this pull request Jun 22, 2026
…f8 (DeusData#511)

Per maintainer review on DeusData#541, two additions to the snippet UTF-8
sanitizer:

1. **Reproduce-first test** (`snippet_source_invalid_utf8`)
   Writes a Go source file containing the CP949 sequence 0xC0 0xD4
   0xB7 0xC2 inside a comment, then calls `get_code_snippet` and
   asserts:
   - the outer MCP envelope parses as strict JSON,
   - the embedded snippet JSON parses as strict JSON,
   - the original invalid bytes (0xC0 0xD4) do not appear verbatim,
   - surrounding valid bytes (HandleRequest, return nil) survive,
   - U+FFFD (EF BF BD) marks where invalid bytes were.

   Without `sanitize_utf8`, the response contains raw non-UTF-8 bytes
   and step 1 fails (yyjson rejects it) — i.e. the test fails on
   pre-fix `main`. Byte pattern and structure borrowed from DeusData#526
   (jstar0) so both contributions converge on the same regression.

2. **Allocation overflow guard**
   `malloc(src_len * 3 + 1)` is technically UB if
   `src_len > (SIZE_MAX - 1) / 3`. Snippet sources can't reach that
   in practice, but the guard makes the safety property local to
   the function rather than relying on call-site bounds. The same
   bound is enforced in DeusData#526.

Refs DeusData#511.

Co-authored-by: King Star <54024410+jstar0@users.noreply.github.com>
Signed-off-by: piiiico <pico@amdal.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_code_snippet hangs MCP client for non-UTF-8 (CP949/EUC-KR) source files — raw bytes emitted as invalid UTF-8 in the source field

2 participants