Markdown to html improvement by Thorium · Pull Request #965 · fsprojects/FSharp.Formatting

Thorium · 2025-11-11T11:14:56Z

This fixes #964

nojaf

I'm not sure if this is the way to go.
I'd prefer a solution where everything happens in one pass.

src/FSharp.Formatting.Markdown/HtmlFormatting.fs

Copilot

Pull request overview

This PR addresses issue #964 by improving Markdown-to-HTML generation so that emoji-heavy content (common in AI-generated docs) is more robustly representable in HTML output.

Changes:

- Add encodeHighUnicode in the Markdown HTML formatter to emit numeric HTML entities for certain high-Unicode characters.
- Apply the new encoding when rendering Literal spans to HTML.
- Add regression tests for emoji entity encoding and a heading+list parsing scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
tests/FSharp.Markdown.Tests/Markdown.fs	Adds tests validating emoji-to-numeric-entity output and related Markdown rendering behaviors.
src/FSharp.Formatting.Markdown/HtmlFormatting.fs	Introduces high-Unicode/emoji encoding logic and uses it when emitting literal text spans.

Comments suppressed due to low confidence (2)

tests/FSharp.Markdown.Tests/Markdown.fs:48

should not' (contain "&#") is a fairly broad assertion and may become brittle if the HTML renderer introduces unrelated numeric entities in the future (e.g., &#160; in some contexts). Consider narrowing this to assert that the specific Cyrillic/Chinese characters are preserved as-is (and/or that none of those characters are replaced with entities), rather than banning any numeric entity anywhere in the output.

    let html = "This is regular text with пристаням Cyrillic and 中文 Chinese" |> Markdown.ToHtml
    html |> should contain "пристаням"
    html |> should contain "中文"
    html |> should not' (contain "&#") // No HTML entities for regular international text
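
One way to narrow the assertion, sketched without FsUnit so that it stands alone: check that the specific words survive verbatim and that none of their characters show up as decimal numeric entities. The helper names entityFor and preservedAsIs are hypothetical, and hexadecimal entities (&#x...;) are not covered.

```fsharp
/// Hypothetical helper: the decimal numeric entity for one UTF-16 code unit.
let entityFor (c: char) = sprintf "&#%d;" (int c)

/// True when `word` appears verbatim in `html` and none of its characters
/// also appear there as decimal numeric entities.
let preservedAsIs (word: string) (html: string) =
    html.Contains(word)
    && (word |> Seq.forall (fun c -> not (html.Contains(entityFor c))))

// An unrelated entity such as &#160; no longer fails the check.
let html = "<p>Text with пристаням and 中文&#160;here</p>"
printfn "%b" (preservedAsIs "пристаням" html && preservedAsIs "中文" html)
```

In the real test this would presumably be wrapped in the existing FsUnit style, for example by asserting that preservedAsIs returns true for each word.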

src/FSharp.Formatting.Markdown/HtmlFormatting.fs:53

encodeHighUnicode does not encode Variation Selectors (e.g., U+FE0F) used by many emojis like "⚠️". This means the output can still contain non-ASCII Unicode even after encoding (e.g., ⚠ followed by a raw VS16), which undercuts the goal of making emoji-safe HTML. Consider extending needsEncoding to also encode U+FE00–U+FE0F (and any other emoji-related BMP code points you want to guarantee as entities).

                    let codePoint = int c
                    // Encode specific ranges that contain emojis and symbols:
                    // U+2000-U+2BFF: General Punctuation, Superscripts, Currency, Dingbats, Arrows, Math, Technical, Box Drawing, etc.
                    // U+1F000-U+1FFFF: Supplementary Multilingual Plane emojis (handled above via surrogates)
                    (codePoint >= 0x2000 && codePoint <= 0x2BFF), codePoint, false

Copilot · 2026-02-22T01:47:09Z

src/FSharp.Formatting.Markdown/HtmlFormatting.fs

    | AnchorLink(id, _) -> ctx.Writer.Write("<a name=\"" + htmlEncodeQuotes id + "\">&#160;</a>")
    | EmbedSpans(cmd, _) -> formatSpans ctx (cmd.Render())
-    | Literal(str, _) -> ctx.Writer.Write(str)
+    | Literal(str, _) -> ctx.Writer.Write(encodeHighUnicode str)
    | HardLineBreak(_) -> ctx.Writer.Write("<br />" + ctx.Newline)


Emojis/high Unicode are only being encoded for Literal spans. InlineCode and CodeBlock still write the raw emoji characters via htmlEncode, and attribute values (e.g., alt, title) go through htmlEncodeQuotes which also leaves emojis untouched. If the intent is to avoid encoding/serialization issues, consider applying encodeHighUnicode as part of htmlEncode/htmlEncodeQuotes (or at least in the InlineCode/CodeBlock branches) so emoji handling is consistent across all HTML output paths.

This issue also appears on line 48 of the same file.
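
If the encoding is folded into htmlEncode, the order of the two steps matters: markup escaping has to run first, otherwise the ampersand of each freshly written entity gets escaped again. The sketch below illustrates this with two stand-in functions (escapeMarkup and encodeSelectors are hypothetical simplifications, not the project's htmlEncode or the PR's encodeHighUnicode).

```fsharp
// Stand-in for the project's htmlEncode; the real function may escape more.
let escapeMarkup (s: string) =
    s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;")

// Stand-in encoder that only handles variation selectors (U+FE00-U+FE0F).
let encodeSelectors (s: string) =
    s
    |> String.collect (fun c ->
        if int c >= 0xFE00 && int c <= 0xFE0F then sprintf "&#%d;" (int c) else string c)

// Escape markup first, then emit numeric entities.
let rightOrder (s: string) = s |> escapeMarkup |> encodeSelectors

// Reversed order: the '&' of the new entity is escaped a second time.
let wrongOrder (s: string) = s |> encodeSelectors |> escapeMarkup

printfn "%s" (rightOrder "a<\uFE0F") // a&lt;&#65039;
printfn "%s" (wrongOrder "a<\uFE0F") // a&lt;&amp;#65039;
```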

Copilot · 2026-02-22T01:47:09Z

src/FSharp.Formatting.Markdown/HtmlFormatting.fs

+                    let fullCodePoint = Char.ConvertToUtf32(c, text.[i + 1])
+                    // Encode all characters outside BMP (>= 0x10000) as they're typically emojis
+                    true, fullCodePoint, true


The comment "Encode all characters outside BMP (>= 0x10000) as they're typically emojis" is misleading: many non-emoji scripts/symbol sets live outside the BMP. If the intended behavior is "encode all non-BMP code points to avoid output encoding issues", consider rewording the comment to reflect that rationale rather than implying they're mostly emojis.
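
To illustrate the point about non-emoji content outside the BMP, here is a small sketch that decodes a few supplementary-plane characters with Char.ConvertToUtf32 and prints the numeric entity a blanket "encode everything at or above U+10000" rule would produce. The sample list is chosen for illustration only.

```fsharp
open System

// Each sample is a single code point outside the BMP, stored as a surrogate pair.
let samples =
    [ "\U0001F389", "party popper (emoji)"
      "\U0001D11E", "musical symbol G clef"
      "\U00020000", "CJK Extension B ideograph" ]

for (s, label) in samples do
    // ConvertToUtf32 combines the surrogate pair at index 0 into one code point.
    let codePoint = Char.ConvertToUtf32(s, 0)
    printfn "U+%X %s -> &#%d;" codePoint label codePoint
```

Only the first sample is an emoji; the other two would be encoded by the same rule.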

Copilot · 2026-02-22T01:47:09Z

tests/FSharp.Markdown.Tests/Markdown.fs

+let ``Emojis are encoded as HTML numeric entities`` () =
+    let html = "Like this 🎉🚧⭐⚠️✅" |> Markdown.ToHtml
+    html |> should contain "&#127881;" // 🎉 party popper
+    html |> should contain "&#128679;" // 🚧 construction
+    html |> should contain "&#11088;" // ⭐ star
+    html |> should contain "&#9888;" // ⚠️ warning
+    html |> should contain "&#9989;" // ✅ check mark


The emoji test currently checks for the base warning sign entity (U+26A0) but not the variation selector that is part of the literal "⚠️" sequence. If you keep/expect VS16 in the input, it would be good to assert it is also encoded (e.g., &#65039;) or otherwise assert that no raw emoji/variation-selector characters remain in the HTML output.

This issue also appears on line 45 of the same file.
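
The second option, asserting that no raw emoji or variation-selector characters remain, could look roughly like the following. The predicate names are hypothetical, and the ranges are an assumption based on the blocks discussed in this thread rather than a complete definition of "emoji".

```fsharp
open System

/// Hypothetical check: is this UTF-16 code unit part of a raw emoji sequence?
let isRawEmojiOrSelector (c: char) =
    let cp = int c
    Char.IsSurrogate c                 // half of a non-BMP code point
    || (cp >= 0x2600 && cp <= 0x27BF)  // Miscellaneous Symbols and Dingbats
    || (cp >= 0x2B00 && cp <= 0x2BFF)  // Miscellaneous Symbols and Arrows
    || (cp >= 0xFE00 && cp <= 0xFE0F)  // Variation Selectors

/// True when the rendered HTML still contains any such character.
let containsRawEmoji (html: string) = html |> Seq.exists isRawEmojiOrSelector

printfn "%b" (containsRawEmoji "<p>&#9888;\uFE0F</p>")   // true: VS16 left raw
printfn "%b" (containsRawEmoji "<p>&#9888;&#65039;</p>") // false
```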

dsyme · 2026-02-22T13:52:13Z

@copilot open a new pull request to apply changes based on the comments in this thread

dsyme · 2026-02-22T19:57:21Z

@copilot open a new pull request to apply changes based on the comments in this thread #965 (review)

dsyme · 2026-02-23T02:15:15Z

/repo-assist implement all the code suggestions in this thread

- Add encodeHighUnicode function encoding non-BMP code points, emoji BMP blocks (U+2600-U+26FF, U+2700-U+27BF, U+2B00-U+2BFF), and variation selectors (U+FE00-U+FE0F) as numeric HTML entities
- Apply encodeHighUnicode inside htmlEncode so InlineCode, CodeBlock, and attribute values all encode emoji consistently (not just Literal spans)
- Apply encodeHighUnicode to Literal spans which bypass htmlEncode
- Fix misleading comment: encode is for output encoding safety, not just emojis
- Add test asserting variation selector U+FE0F is also encoded
- Add tests for emoji encoding, regular text passthrough, and list after heading

Addresses review suggestions from PR #965.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-02-23T02:24:37Z

🤖 This is an automated response from Repo Assist.

I've implemented all three unresolved code suggestions from the review. A new draft PR has been created with the following changes:

- Consistent encoding across all HTML paths: encodeHighUnicode is now applied inside htmlEncode, so InlineCode, CodeBlock, attribute values (alt, title), etc. all encode emoji/high-Unicode as numeric entities, not just Literal spans.
- Fixed misleading comment: the comment "they're typically emojis" has been replaced with language clarifying the actual rationale: encoding to avoid output encoding issues.
- Variation selector test: extended the encoded BMP ranges to include variation selectors (U+FE00–U+FE0F) and added html |> should contain "&#65039;" to assert that the VS16 in ⚠️ is also encoded.

The BMP range was narrowed from the broad U+2000–U+2BFF to targeted emoji blocks (U+2600–U+26FF, U+2700–U+27BF, U+2B00–U+2BFF) to avoid breaking CommonMark spec tests for box-drawing characters. All 182 markdown tests pass.
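
For readers following along, a single-pass encoder covering the ranges listed above might look like the sketch below. This is an illustration written for this summary, not the code from #965 or #986; the needsEncoding name is borrowed from the review comments, and the treatment of lone surrogates (passed through unchanged) is an assumption.

```fsharp
open System
open System.Text

/// Sketch: true for code points in the ranges named in this thread.
let needsEncoding (codePoint: int) =
    codePoint >= 0x10000                              // everything outside the BMP
    || (codePoint >= 0x2600 && codePoint <= 0x26FF)   // Miscellaneous Symbols
    || (codePoint >= 0x2700 && codePoint <= 0x27BF)   // Dingbats
    || (codePoint >= 0x2B00 && codePoint <= 0x2BFF)   // Miscellaneous Symbols and Arrows
    || (codePoint >= 0xFE00 && codePoint <= 0xFE0F)   // Variation Selectors

/// Sketch: walks the string once and replaces matching code points
/// with decimal numeric entities.
let encodeHighUnicode (text: string) =
    let sb = StringBuilder(text.Length)
    let mutable i = 0
    while i < text.Length do
        let c = text.[i]
        // A valid surrogate pair occupies two UTF-16 code units.
        let isPair =
            Char.IsHighSurrogate c
            && i + 1 < text.Length
            && Char.IsLowSurrogate text.[i + 1]
        let codePoint = if isPair then Char.ConvertToUtf32(c, text.[i + 1]) else int c
        if needsEncoding codePoint then
            sb.Append("&#").Append(codePoint).Append(';') |> ignore
        else
            // Pairs always decode to >= 0x10000, so only single units reach here.
            sb.Append(c) |> ignore
        i <- i + (if isPair then 2 else 1)
    sb.ToString()

printfn "%s" (encodeHighUnicode "\u26A0\uFE0F")  // &#9888;&#65039;
printfn "%s" (encodeHighUnicode "\U0001F389")    // &#127881;
```

Box-drawing characters such as U+2500 fall outside these ranges and pass through unchanged, which matches the motivation given for narrowing the range.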

Generated by Repo Assist for issue #965

To install this workflow, run gh aw add githubnext/agentics/workflows/repo-assist.md@828ac109efb43990f59475cbfce90ede5546586c. View source at https://github.com/githubnext/agentics/tree/828ac109efb43990f59475cbfce90ede5546586c/workflows/repo-assist.md.

dsyme · 2026-02-23T11:13:25Z

Looks like Repo Assist created a new PR addressing the remaining code review points rather than patching up this PR

#986

Thorium · 2026-02-23T11:29:54Z

That is totally fine, I only care about getting my issue fixed, not who actually fixes it.

dsyme · 2026-02-23T14:34:43Z

We took a look at this. The original problem reported looks real, but we'd like to get a systematic test suite in place and have an agent solve it.

markdown to html improvement

b35a76c

nojaf reviewed Nov 12, 2025

Thorium added 2 commits November 16, 2025 11:17

addressed nojaf feedback

0ba7b77

Merge branch 'main' into fix-964

2fbbbac

github-actions bot mentioned this pull request Feb 22, 2026

[Repo Assist] Monthly Activity 2026-02 #973

Open

26 tasks

dsyme requested a review from Copilot February 22, 2026 01:40

Copilot started reviewing on behalf of dsyme February 22, 2026 01:40

dsyme approved these changes Feb 22, 2026

Copilot AI reviewed Feb 22, 2026

github-actions bot mentioned this pull request Feb 22, 2026

Conversions from fsx comments markdown to html containing emoji (on AI-generated documents) #964

Closed

github-actions bot mentioned this pull request Feb 23, 2026

[Repo Assist] Encode emoji/high-Unicode as HTML entities in all HTML output paths #986

Closed

dsyme closed this Feb 23, 2026
