fix: replace surrogate codepoints with U+FFFD in std.char and %c #965

Draft
He-Pin wants to merge 1 commit into databricks:master from He-Pin:fix/surrogate-codepoint-validation

@He-Pin He-Pin commented Jun 18, 2026


Motivation

std.char() and the %c format specifier accepted surrogate codepoints (U+D800-U+DFFF) and passed them directly to Character.toString(), which produces lone surrogates. Lone surrogates are not valid UTF-16 on their own, so they corrupt string values and cause downstream encoding failures ("malformed" errors). To align with go-jsonnet, this change replaces surrogates with U+FFFD (the replacement character).
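
As a rough illustration of the failure mode, the sketch below shows one way a lone surrogate can trigger a "malformed" error on the JVM: a strict UTF-8 encoder reports it instead of replacing it. This is not taken from sjsonnet's renderers; `LoneSurrogateDemo` and `encodesCleanly` are hypothetical names used only for this example.

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, StandardCharsets}

object LoneSurrogateDemo {
  /** Returns true if `s` can be encoded as UTF-8 by a strict encoder. */
  def encodesCleanly(s: String): Boolean =
    try {
      // A fresh CharsetEncoder reports malformed input by default
      // (it throws MalformedInputException) rather than replacing it.
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s))
      true
    } catch {
      case _: CharacterCodingException => false
    }

  def main(args: Array[String]): Unit = {
    // Character.toChars accepts a surrogate codepoint and yields a single
    // unpaired UTF-16 code unit.
    val lone = new String(Character.toChars(0xD800))
    val replaced = new String(Character.toChars(0xFFFD))
    println(s"lone surrogate encodes cleanly: ${encodesCleanly(lone)}")
    println(s"U+FFFD encodes cleanly: ${encodesCleanly(replaced)}")
  }
}
```

Substituting U+FFFD before the string is built avoids this class of failure, because U+FFFD is an ordinary BMP character with a valid UTF-8 encoding.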

Modification

  • StringModule.scala (std.char): detect surrogate range (0xD800-0xDFFF) and substitute U+FFFD before calling Character.toString()
  • Format.scala (%c): apply the same surrogate→U+FFFD substitution in the format operator
  • UnicodeHandlingTests.scala: update tests to expect U+FFFD replacement instead of raw surrogate preservation
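
The substitution described above can be sketched as follows. This is a minimal sketch, not the actual code in StringModule.scala or Format.scala: `SurrogateGuard`, `sanitize`, and `charString` are hypothetical names, and the sketch assumes the caller has already checked that the codepoint lies in 0..0x10FFFF.

```scala
object SurrogateGuard {
  /** U+FFFD REPLACEMENT CHARACTER. */
  val Replacement: Int = 0xFFFD

  /** Maps any surrogate codepoint (0xD800-0xDFFF) to U+FFFD; others pass through. */
  def sanitize(codePoint: Int): Int =
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) Replacement else codePoint

  /**
   * Builds the single-codepoint string for `codePoint`.
   * Assumption: `codePoint` is already known to be in 0..0x10FFFF;
   * Character.toChars throws IllegalArgumentException otherwise.
   */
  def charString(codePoint: Int): String =
    new String(Character.toChars(sanitize(codePoint)))
}
```

Keeping the range check in one helper would let std.char and the %c formatter share the same rule, although the PR description does not say whether the two call sites share code.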

Result

  • std.char(55296) and "%c" % 55296 now produce U+FFFD, matching go-jsonnet
  • Lone surrogates no longer leak into string values, preventing encoding errors in JSON/YAML/TOML renderers
  • std.codepoint(std.char(0xD800)) returns 65533 (U+FFFD) instead of 55296

| Implementation | std.char(0xD800) | "%c" % 0xD800 | std.codepoint(std.char(0xD800)) |
| --- | --- | --- | --- |
| go-jsonnet v0.22.0 | U+FFFD (replace) | U+FFFD (replace) | 65533 |
| jrsonnet v0.5.0-pre99 | Error (reject) | Error (reject) | Error |
| C++ jsonnet | Invalid UTF-8 | Invalid UTF-8 | 55296 (raw) |
| sjsonnet (before) | "malformed" crash | "malformed" crash | 55296 (raw) |
| sjsonnet (after) | U+FFFD (replace) | U+FFFD (replace) | 65533 |

Test plan

  • std.char(55296) returns "\uFFFD" (U+D800 high surrogate → U+FFFD)
  • std.char(56320) returns "\uFFFD" (U+DC00 low surrogate → U+FFFD)
  • std.codepoint(std.char(55296)) returns 65533
  • "%c" % 55296 produces U+FFFD (Format.scala fix)
  • Valid codepoints (0, 65, 0x10FFFF) unchanged
  • All existing tests pass
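
The codepoint cases in this plan can be written as a small table-driven check. The sketch below is independent of sjsonnet's test framework and of UnicodeHandlingTests.scala: `SurrogateTestPlan` and `failures` are hypothetical names, and the function under test is passed in as a plain `Int => String`.

```scala
object SurrogateTestPlan {
  // (input codepoint, expected codepoint of the resulting one-codepoint string)
  val cases: Seq[(Int, Int)] = Seq(
    55296    -> 0xFFFD,   // U+D800 high surrogate is replaced
    56320    -> 0xFFFD,   // U+DC00 low surrogate is replaced
    0        -> 0,        // valid codepoints are unchanged
    65       -> 65,
    0x10FFFF -> 0x10FFFF
  )

  /** Returns the input codepoints for which `charFn` gave an unexpected result. */
  def failures(charFn: Int => String): Seq[Int] =
    cases.filter { case (in, expected) =>
      val s = charFn(in)
      // The result must be exactly one codepoint, and it must be the expected one.
      s.codePointCount(0, s.length) != 1 || s.codePointAt(0) != expected
    }.map(_._1)
}
```

Passing a function that preserves raw surrogates should report 55296 and 56320 as failures, while one that substitutes U+FFFD should report none.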

Motivation:
std.char() and the %c format specifier accepted surrogate codepoints
(U+D800-U+DFFF) and passed them directly to Character.toString(), which
produces lone surrogates. These corrupt UTF-16 strings and cause
downstream encoding failures. go-jsonnet replaces surrogates with
U+FFFD (replacement character); sjsonnet should align.

Modification:
- StringModule.scala (std.char): detect surrogate range and substitute
  U+FFFD before calling Character.toString()
- Format.scala (%c): apply the same surrogate→U+FFFD substitution
- UnicodeHandlingTests.scala: update tests to expect U+FFFD replacement
  instead of raw surrogate preservation

Result:
std.char(55296) and "%c" % 55296 now produce U+FFFD, matching go-jsonnet.
Lone surrogates no longer leak into string values, preventing encoding
errors in JSON/YAML/TOML renderers.

References:
- go-jsonnet behavior: std.char(0xD800) → U+FFFD
- jrsonnet behavior: rejects surrogates as errors
- C++ jsonnet behavior: produces invalid UTF-8 (no validation)
@He-Pin He-Pin marked this pull request as draft June 18, 2026 10:29