Skip to content

Markdown to html improvement#965

Closed
Thorium wants to merge 3 commits intofsprojects:mainfrom
Thorium:fix-964
Closed

Markdown to html improvement#965
Thorium wants to merge 3 commits intofsprojects:mainfrom
Thorium:fix-964

Conversation

@Thorium
Copy link
Member

@Thorium Thorium commented Nov 11, 2025

This fixes #964

Copy link
Collaborator

@nojaf nojaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the way to go.
I'd prefer a solution where everything happens in one pass.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #964 by improving Markdown-to-HTML generation so that emoji-heavy content (common in AI-generated docs) is more robustly representable in HTML output.

Changes:

  • Add encodeHighUnicode in the Markdown HTML formatter to emit numeric HTML entities for certain high-Unicode characters.
  • Apply the new encoding when rendering Literal spans to HTML.
  • Add regression tests for emoji entity encoding and a heading+list parsing scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/FSharp.Markdown.Tests/Markdown.fs Adds tests validating emoji-to-numeric-entity output and related Markdown rendering behaviors.
src/FSharp.Formatting.Markdown/HtmlFormatting.fs Introduces high-Unicode/emoji encoding logic and uses it when emitting literal text spans.
Comments suppressed due to low confidence (2)

tests/FSharp.Markdown.Tests/Markdown.fs:48

  • should not' (contain "&#") is a fairly broad assertion and may become brittle if the HTML renderer introduces unrelated numeric entities in the future (e.g.,   in some contexts). Consider narrowing this to assert that the specific Cyrillic/Chinese characters are preserved as-is (and/or that none of those characters are replaced with entities), rather than banning any numeric entity anywhere in the output.
    let html = "This is regular text with пристаням Cyrillic and 中文 Chinese" |> Markdown.ToHtml
    html |> should contain "пристаням"
    html |> should contain "中文"
    html |> should not' (contain "&#") // No HTML entities for regular international text

src/FSharp.Formatting.Markdown/HtmlFormatting.fs:53

  • encodeHighUnicode does not encode Variation Selectors (e.g., U+FE0F) used by many emojis like "⚠️". This means the output can still contain non-ASCII Unicode even after encoding (e.g., ⚠ followed by a raw VS16), which undercuts the goal of making emoji-safe HTML. Consider extending needsEncoding to also encode U+FE00–U+FE0F (and any other emoji-related BMP code points you want to guarantee as entities).
                    let codePoint = int c
                    // Encode specific ranges that contain emojis and symbols:
                    // U+2000-U+2BFF: General Punctuation, Superscripts, Currency, Dingbats, Arrows, Math, Technical, Box Drawing, etc.
                    // U+1F000-U+1FFFF: Supplementary Multilingual Plane emojis (handled above via surrogates)
                    (codePoint >= 0x2000 && codePoint <= 0x2BFF), codePoint, false


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 141 to 144
| AnchorLink(id, _) -> ctx.Writer.Write("<a name=\"" + htmlEncodeQuotes id + "\">&#160;</a>")
| EmbedSpans(cmd, _) -> formatSpans ctx (cmd.Render())
| Literal(str, _) -> ctx.Writer.Write(str)
| Literal(str, _) -> ctx.Writer.Write(encodeHighUnicode str)
| HardLineBreak(_) -> ctx.Writer.Write("<br />" + ctx.Newline)
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emojis/high Unicode are only being encoded for Literal spans. InlineCode and CodeBlock still write the raw emoji characters via htmlEncode, and attribute values (e.g., alt, title) go through htmlEncodeQuotes which also leaves emojis untouched. If the intent is to avoid encoding/serialization issues, consider applying encodeHighUnicode as part of htmlEncode/htmlEncodeQuotes (or at least in the InlineCode/CodeBlock branches) so emoji handling is consistent across all HTML output paths.

This issue also appears on line 48 of the same file.

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +46
let fullCodePoint = Char.ConvertToUtf32(c, text.[i + 1])
// Encode all characters outside BMP (>= 0x10000) as they're typically emojis
true, fullCodePoint, true
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "Encode all characters outside BMP (>= 0x10000) as they're typically emojis" is misleading: many non-emoji scripts/symbol sets live outside the BMP. If the intended behavior is "encode all non-BMP code points to avoid output encoding issues", consider rewording the comment to reflect that rationale rather than implying they're mostly emojis.

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +40
let ``Emojis are encoded as HTML numeric entities`` () =
let html = "Like this 🎉🚧⭐⚠️✅" |> Markdown.ToHtml
html |> should contain "&#127881;" // 🎉 party popper
html |> should contain "&#128679;" // 🚧 construction
html |> should contain "&#11088;" // ⭐ star
html |> should contain "&#9888;" // ⚠️ warning
html |> should contain "&#9989;" // ✅ check mark
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The emoji test currently checks for the base warning sign entity (U+26A0) but not the variation selector that is part of the literal "⚠️" sequence. If you keep/expect VS16 in the input, it would be good to assert it is also encoded (e.g., &#65039;) or otherwise assert that no raw emoji/variation-selector characters remain in the HTML output.

This issue also appears on line 45 of the same file.

Copilot uses AI. Check for mistakes.
@dsyme
Copy link
Contributor

dsyme commented Feb 22, 2026

@copilot open a new pull request to apply changes based on the comments in this thread

@dsyme
Copy link
Contributor

dsyme commented Feb 22, 2026

@copilot open a new pull request to apply changes based on the comments in this thread #965 (review)

@dsyme
Copy link
Contributor

dsyme commented Feb 23, 2026

/repo-assist implement all the code suggestions in this thread

github-actions bot pushed a commit that referenced this pull request Feb 23, 2026
- Add encodeHighUnicode function encoding non-BMP code points,
  emoji BMP blocks (U+2600-U+26FF, U+2700-U+27BF, U+2B00-U+2BFF),
  and variation selectors (U+FE00-U+FE0F) as numeric HTML entities
- Apply encodeHighUnicode inside htmlEncode so InlineCode, CodeBlock,
  and attribute values all encode emoji consistently (not just Literal spans)
- Apply encodeHighUnicode to Literal spans which bypass htmlEncode
- Fix misleading comment: encode is for output encoding safety, not
  just emojis
- Add test asserting variation selector U+FE0F is also encoded
- Add tests for emoji encoding, regular text passthrough, and list
  after heading

Addresses review suggestions from PR #965.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Contributor

🤖 This is an automated response from Repo Assist.

I've implemented all three unresolved code suggestions from the review. A new draft PR has been created with the following changes:

  1. Consistent encoding across all HTML pathsencodeHighUnicode is now applied inside htmlEncode, so InlineCode, CodeBlock, attribute values (alt, title), etc. all encode emoji/high-Unicode as numeric entities, not just Literal spans.

  2. Fixed misleading comment — The comment "they're typically emojis" has been replaced with language clarifying the actual rationale: encoding to avoid output encoding issues.

  3. Variation selector test — Extended the encoded BMP ranges to include variation selectors (U+FE00–U+FE0F) and added html |> should contain "️" to assert that the VS16 in ⚠️ is also encoded.

The BMP range was narrowed from the broad U+2000–U+2BFF to targeted emoji blocks (U+2600–U+26FF, U+2700–U+27BF, U+2B00–U+2BFF) to avoid breaking CommonMark spec tests for box-drawing characters. All 182 markdown tests pass.

Generated by Repo Assist for issue #965

To install this workflow, run gh aw add githubnext/agentics/workflows/repo-assist.md@828ac109efb43990f59475cbfce90ede5546586c. View source at https://github.com/githubnext/agentics/tree/828ac109efb43990f59475cbfce90ede5546586c/workflows/repo-assist.md.

@dsyme
Copy link
Contributor

dsyme commented Feb 23, 2026

Looks like Repo Assist created a new PR addressing the remaining code review points rather than patching up this PR

#986

@Thorium
Copy link
Member Author

Thorium commented Feb 23, 2026

That is totally fine, I only care about getting my issue fixed, not who actually fixes it.

@dsyme
Copy link
Contributor

dsyme commented Feb 23, 2026

We took a look at this. The original problem reported looks real, but we'd like to get a systematic test suite in place and have an agent solve it.

@dsyme dsyme closed this Feb 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Conversions from fsx comments markdown to html containing emoji (on AI-generated documents)

3 participants