[Repo Assist] Encode emoji/high-Unicode as HTML entities in all HTML output paths#986
Conversation
- Add encodeHighUnicode function encoding non-BMP code points, emoji BMP blocks (U+2600-U+26FF, U+2700-U+27BF, U+2B00-U+2BFF), and variation selectors (U+FE00-U+FE0F) as numeric HTML entities - Apply encodeHighUnicode inside htmlEncode so InlineCode, CodeBlock, and attribute values all encode emoji consistently (not just Literal spans) - Apply encodeHighUnicode to Literal spans which bypass htmlEncode - Fix misleading comment: encode is for output encoding safety, not just emojis - Add test asserting variation selector U+FE0F is also encoded - Add tests for emoji encoding, regular text passthrough, and list after heading Addresses review suggestions from PR #965. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
✅ Pull request created: #986 |
|
@dsyme as this is in a draft state, what should I make out of this? |
|
@nojaf This PR was the result of me asking Repo Assist to fix up #965. This comment It's not indicative of what Repo Assist would normally do autonomously (it would usually start with an issue if at all, and never work on something so obscure) It also incorrectly made a new PR targeting main, effectively a replacement PR, instead of fixing up the PR itself. I'll fix that for future. Next step is I'll manually merge this into OG PR #965 and we can decide what to do there. |
|
Oh I see, the OG PR came from a fork so Repo Assist couldn't fix it up. |
🤖 This is an automated PR from Repo Assist, implementing the code review suggestions from PR #965.
Addresses the three unresolved review threads in #965.
What this PR does
Implements the code review suggestions from PR #965:
1. Consistent emoji encoding across all HTML output paths (Thread 3)
PR #965 added
encodeHighUnicodebut only applied it toLiteralspans. This PR applies it insidehtmlEncodeitself so thatInlineCode,CodeBlock, attribute values (alt,title), and all other paths automatically encode emoji/high-Unicode as numeric HTML entities — not justLiteralspans.Literalspans (which bypasshtmlEncode) also getencodeHighUnicodeapplied directly.2. Fix misleading comment (Thread 4)
The original comment said "Encode all characters outside BMP (>= 0x10000) as they're typically emojis". Rewrote to clarify the actual rationale: encoding non-BMP code points and selected BMP ranges to avoid output encoding issues, since many non-emoji scripts also live outside the BMP.
3. Variation selector encoding + test assertion (Thread 5)
The
⚠️sequence consists of⚠(U+26A0) followed by variation selector VS16 (U+FE0F). The review noted that the test checked for⚠(⚠) but not for️(VS16). This PR:html |> should contain "️"to the emoji testNote: The PR also narrows the broad U+2000–U+2BFF range from #965 to targeted emoji/symbol blocks only (U+2600–U+26FF, U+2700–U+27BF, U+2B00–U+2BFF). This avoids breaking CommonMark spec tests that expect box-drawing characters (e.g., U+255C
╜) to pass through unchanged in<pre><code>blocks.Tests
All 182 markdown tests pass, including the CommonMark spec suite.
New tests added:
Emojis are encoded as HTML numeric entities— verifies 🎉🚧⭐Regular text without emojis is not modified— verifies Cyrillic and Chinese characters are not touchedList without blank line after heading— covers the Conversions from fsx comments markdown to html containing emoji (on AI-generated documents) #964 scenarioTest Status
✅
dotnet test tests/FSharp.Markdown.Tests/FSharp.Markdown.Tests.fsproj— 182 passed, 0 failed