fix: replace surrogate codepoints with U+FFFD in std.char and %c by He-Pin · Pull Request #965 · databricks/sjsonnet

He-Pin · 2026-06-18T06:51:07Z

Motivation

std.char() and the %c format specifier accepted surrogate codepoints (U+D800-U+DFFF) and passed them directly to Character.toString(), which produces lone surrogates. A lone surrogate is not a valid Unicode scalar value, so the resulting UTF-16 string is ill-formed and causes downstream encoding failures ("malformed" errors). This PR aligns sjsonnet with go-jsonnet's behavior, which replaces surrogates with U+FFFD (the replacement character).
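To illustrate the failure mode described above, here is a minimal sketch of one way a lone surrogate can surface as a "malformed" error on the JVM. It assumes a strict UTF-8 encoder; the object and method names (`LoneSurrogateDemo`, `encodesCleanly`) are hypothetical, and this is not claimed to be the exact code path that fails inside sjsonnet's renderers.

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, StandardCharsets}

// Hypothetical demo object, not part of sjsonnet.
object LoneSurrogateDemo {
  // A fresh CharsetEncoder reports malformed input by default,
  // so an unpaired surrogate raises a CharacterCodingException
  // instead of being silently replaced.
  def encodesCleanly(s: String): Boolean =
    try {
      StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s))
      true
    } catch {
      case _: CharacterCodingException => false
    }

  def main(args: Array[String]): Unit = {
    // What an unvalidated std.char(0xD800) would effectively hand downstream.
    val loneSurrogate = 0xD800.toChar.toString
    println(encodesCleanly(loneSurrogate)) // false: unpaired high surrogate
    println(encodesCleanly("\uFFFD"))      // true: U+FFFD is an ordinary BMP character
  }
}
```

Note that `String.getBytes(StandardCharsets.UTF_8)` would not throw here, because it substitutes a replacement byte; the strict encoder is used only to make the ill-formed input visible.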

Modification

StringModule.scala (std.char): detect surrogate range (0xD800-0xDFFF) and substitute U+FFFD before calling Character.toString()
Format.scala (%c): apply the same surrogate→U+FFFD substitution in the format operator
UnicodeHandlingTests.scala: update tests to expect U+FFFD replacement instead of raw surrogate preservation
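The substitution described in the list above could look roughly like the following. This is a sketch under stated assumptions rather than the PR's actual diff: `SurrogateReplacementSketch`, `isSurrogate`, and `charOrReplacement` are hypothetical names, and `Character.toChars` is used in place of `Character.toString()` so the snippet does not depend on the JDK version.

```scala
// Hypothetical helper, not the actual sjsonnet code.
object SurrogateReplacementSketch {
  private val ReplacementChar = "\uFFFD"

  // True for any codepoint in the UTF-16 surrogate range U+D800-U+DFFF.
  def isSurrogate(cp: Int): Boolean = cp >= 0xD800 && cp <= 0xDFFF

  // Convert a codepoint to a string, substituting U+FFFD for surrogates
  // so that a lone surrogate never reaches the resulting string.
  def charOrReplacement(cp: Int): String =
    if (isSurrogate(cp)) ReplacementChar
    else new String(Character.toChars(cp))

  def main(args: Array[String]): Unit = {
    println(charOrReplacement(55296).codePointAt(0)) // 65533
    println(charOrReplacement(65))                   // A
  }
}
```

Sharing one helper between the std.char path and the %c path would keep the two behaviors from drifting apart, although the source does not say whether the PR factors the check this way.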

Result

std.char(55296) and "%c" % 55296 now produce U+FFFD, matching go-jsonnet
Lone surrogates no longer leak into string values, preventing encoding errors in JSON/YAML/TOML renderers
std.codepoint(std.char(0xD800)) returns 65533 (U+FFFD) instead of 55296

| Implementation | `std.char(0xD800)` | `"%c" % 0xD800` | `std.codepoint(std.char(0xD800))` |
| --- | --- | --- | --- |
| go-jsonnet v0.22.0 | U+FFFD (replace) | U+FFFD (replace) | 65533 |
| jrsonnet v0.5.0-pre99 | Error (reject) | Error (reject) | Error |
| C++ jsonnet | Invalid UTF-8 | Invalid UTF-8 | 55296 (raw) |
| sjsonnet (before) | "malformed" crash | "malformed" crash | 55296 (raw) |
| sjsonnet (after) | U+FFFD (replace) | U+FFFD (replace) | 65533 |

Test plan

std.char(55296) returns "\uFFFD" (U+D800 high surrogate → U+FFFD)
std.char(56320) returns "\uFFFD" (U+DC00 low surrogate → U+FFFD)
std.codepoint(std.char(55296)) returns 65533
"%c" % 55296 produces U+FFFD (Format.scala fix)
Valid codepoints (0, 65, 0x10FFFF) unchanged
All existing tests pass

He-Pin marked this pull request as draft June 18, 2026 10:29

He-Pin wants to merge 1 commit into databricks:master from He-Pin:fix/surrogate-codepoint-validation
