Conversation
nojaf
left a comment
There was a problem hiding this comment.
I'm not sure if this is the way to go.
I'd prefer a solution where everything happens in one pass.
There was a problem hiding this comment.
Pull request overview
This PR addresses issue #964 by improving Markdown-to-HTML generation so that emoji-heavy content (common in AI-generated docs) is more robustly representable in HTML output.
Changes:
- Add
encodeHighUnicodein the Markdown HTML formatter to emit numeric HTML entities for certain high-Unicode characters. - Apply the new encoding when rendering
Literalspans to HTML. - Add regression tests for emoji entity encoding and a heading+list parsing scenario.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/FSharp.Markdown.Tests/Markdown.fs | Adds tests validating emoji-to-numeric-entity output and related Markdown rendering behaviors. |
| src/FSharp.Formatting.Markdown/HtmlFormatting.fs | Introduces high-Unicode/emoji encoding logic and uses it when emitting literal text spans. |
Comments suppressed due to low confidence (2)
tests/FSharp.Markdown.Tests/Markdown.fs:48
should not' (contain "&#")is a fairly broad assertion and may become brittle if the HTML renderer introduces unrelated numeric entities in the future (e.g., in some contexts). Consider narrowing this to assert that the specific Cyrillic/Chinese characters are preserved as-is (and/or that none of those characters are replaced with entities), rather than banning any numeric entity anywhere in the output.
let html = "This is regular text with пристаням Cyrillic and 中文 Chinese" |> Markdown.ToHtml
html |> should contain "пристаням"
html |> should contain "中文"
html |> should not' (contain "&#") // No HTML entities for regular international text
src/FSharp.Formatting.Markdown/HtmlFormatting.fs:53
encodeHighUnicodedoes not encode Variation Selectors (e.g., U+FE0F) used by many emojis like "⚠️ ". This means the output can still contain non-ASCII Unicode even after encoding (e.g.,⚠followed by a raw VS16), which undercuts the goal of making emoji-safe HTML. Consider extendingneedsEncodingto also encode U+FE00–U+FE0F (and any other emoji-related BMP code points you want to guarantee as entities).
let codePoint = int c
// Encode specific ranges that contain emojis and symbols:
// U+2000-U+2BFF: General Punctuation, Superscripts, Currency, Dingbats, Arrows, Math, Technical, Box Drawing, etc.
// U+1F000-U+1FFFF: Supplementary Multilingual Plane emojis (handled above via surrogates)
(codePoint >= 0x2000 && codePoint <= 0x2BFF), codePoint, false
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | AnchorLink(id, _) -> ctx.Writer.Write("<a name=\"" + htmlEncodeQuotes id + "\"> </a>") | ||
| | EmbedSpans(cmd, _) -> formatSpans ctx (cmd.Render()) | ||
| | Literal(str, _) -> ctx.Writer.Write(str) | ||
| | Literal(str, _) -> ctx.Writer.Write(encodeHighUnicode str) | ||
| | HardLineBreak(_) -> ctx.Writer.Write("<br />" + ctx.Newline) |
There was a problem hiding this comment.
Emojis/high Unicode are only being encoded for Literal spans. InlineCode and CodeBlock still write the raw emoji characters via htmlEncode, and attribute values (e.g., alt, title) go through htmlEncodeQuotes which also leaves emojis untouched. If the intent is to avoid encoding/serialization issues, consider applying encodeHighUnicode as part of htmlEncode/htmlEncodeQuotes (or at least in the InlineCode/CodeBlock branches) so emoji handling is consistent across all HTML output paths.
This issue also appears on line 48 of the same file.
| let fullCodePoint = Char.ConvertToUtf32(c, text.[i + 1]) | ||
| // Encode all characters outside BMP (>= 0x10000) as they're typically emojis | ||
| true, fullCodePoint, true |
There was a problem hiding this comment.
The comment "Encode all characters outside BMP (>= 0x10000) as they're typically emojis" is misleading: many non-emoji scripts/symbol sets live outside the BMP. If the intended behavior is "encode all non-BMP code points to avoid output encoding issues", consider rewording the comment to reflect that rationale rather than implying they're mostly emojis.
| let ``Emojis are encoded as HTML numeric entities`` () = | ||
| let html = "Like this 🎉🚧⭐⚠️✅" |> Markdown.ToHtml | ||
| html |> should contain "🎉" // 🎉 party popper | ||
| html |> should contain "🚧" // 🚧 construction | ||
| html |> should contain "⭐" // ⭐ star | ||
| html |> should contain "⚠" // ⚠️ warning | ||
| html |> should contain "✅" // ✅ check mark |
There was a problem hiding this comment.
The emoji test currently checks for the base warning sign entity (U+26A0) but not the variation selector that is part of the literal "️) or otherwise assert that no raw emoji/variation-selector characters remain in the HTML output.
This issue also appears on line 45 of the same file.
|
@copilot open a new pull request to apply changes based on the comments in this thread |
|
@copilot open a new pull request to apply changes based on the comments in this thread #965 (review) |
|
/repo-assist implement all the code suggestions in this thread |
- Add encodeHighUnicode function encoding non-BMP code points, emoji BMP blocks (U+2600-U+26FF, U+2700-U+27BF, U+2B00-U+2BFF), and variation selectors (U+FE00-U+FE0F) as numeric HTML entities - Apply encodeHighUnicode inside htmlEncode so InlineCode, CodeBlock, and attribute values all encode emoji consistently (not just Literal spans) - Apply encodeHighUnicode to Literal spans which bypass htmlEncode - Fix misleading comment: encode is for output encoding safety, not just emojis - Add test asserting variation selector U+FE0F is also encoded - Add tests for emoji encoding, regular text passthrough, and list after heading Addresses review suggestions from PR #965. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🤖 This is an automated response from Repo Assist. I've implemented all three unresolved code suggestions from the review. A new draft PR has been created with the following changes:
The BMP range was narrowed from the broad U+2000–U+2BFF to targeted emoji blocks (U+2600–U+26FF, U+2700–U+27BF, U+2B00–U+2BFF) to avoid breaking CommonMark spec tests for box-drawing characters. All 182 markdown tests pass.
|
|
Looks like Repo Assist created a new PR addressing the remaining code review points rather than patching up this PR |
|
That is totally fine, I only care about getting my issue fixed, not who actually fixes it. |
|
We took a look at this. The original problem reported looks real, but we'd like to get a systematic test suite in place and have an agent solve it. |
This fixes #964