Python: Fix HTML-like tags being dropped in ChatHistory.from_rendered_prompt#13659
Python: Fix HTML-like tags being dropped in ChatHistory.from_rendered_prompt#13659jgarrison929 wants to merge 1 commit intomicrosoft:mainfrom
Conversation
…_prompt When a prompt contained valid XML elements like <p>, <div>, etc., the XML parser in from_rendered_prompt treated them as elements but silently discarded their content because they weren't recognized as chat message tags (message, history). This fix serializes unrecognized elements back to their XML string representation using tostring(), preserving both the tags and their content in the final prompt text. Added regression tests for: - Single HTML-like tag (<p>) - Nested HTML-like tags (<div><p>...</p></div>) Fixes microsoft#13632
There was a problem hiding this comment.
Pull request overview
Fixes ChatHistory.from_rendered_prompt so valid XML elements that are not recognized as chat message tags (for example <p>, <div>, <b>) are preserved in the resulting user prompt, instead of being silently dropped.
Changes:
- Update
ChatHistory.from_rendered_promptto accumulate free text and serialize unrecognized XML elements back into text usingtostring(). - Add unit tests to cover prompts containing HTML like tags and nested HTML like tags.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/semantic_kernel/contents/chat_history.py | Preserve unrecognized XML elements by serializing them back into the prompt text while parsing rendered prompts. |
| python/tests/unit/contents/test_chat_history.py | Add regression coverage for prompts containing HTML like tags. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| flush_pending_text() | ||
| messages.append(ChatMessageContent.from_element(item)) | ||
| # Tail text after a recognized message element is treated as USER content | ||
| if item.tail and item.tail.strip(): | ||
| messages.append(ChatMessageContent(role=AuthorRole.USER, content=unescape(item.tail.strip()))) | ||
| elif item.tag == CHAT_HISTORY_TAG: | ||
| flush_pending_text() |
There was a problem hiding this comment.
flush_pending_text() defaults to SYSTEM, so when an unrecognized element (for example
) appears between already-parsed chat messages and a subsequent element, the preserved HTML gets emitted as a SYSTEM message. That is inconsistent with how inter-message text (item.tail) is treated as USER, and can unintentionally escalate user-provided content into the system role. Consider flushing pending text with role=USER when messages already exist (and only SYSTEM when no messages have been emitted yet).
| flush_pending_text() | |
| messages.append(ChatMessageContent.from_element(item)) | |
| # Tail text after a recognized message element is treated as USER content | |
| if item.tail and item.tail.strip(): | |
| messages.append(ChatMessageContent(role=AuthorRole.USER, content=unescape(item.tail.strip()))) | |
| elif item.tag == CHAT_HISTORY_TAG: | |
| flush_pending_text() | |
| flush_pending_text(role=AuthorRole.USER if messages else AuthorRole.SYSTEM) | |
| messages.append(ChatMessageContent.from_element(item)) | |
| # Tail text after a recognized message element is treated as USER content | |
| if item.tail and item.tail.strip(): | |
| messages.append(ChatMessageContent(role=AuthorRole.USER, content=unescape(item.tail.strip()))) | |
| elif item.tag == CHAT_HISTORY_TAG: | |
| # Flush any pending text before a structured history block | |
| flush_pending_text(role=AuthorRole.USER if messages else AuthorRole.SYSTEM) |
| if item.tail and item.tail.strip(): | ||
| messages.append(ChatMessageContent(role=AuthorRole.USER, content=unescape(item.tail.strip()))) | ||
| elif item.tag == CHAT_HISTORY_TAG: | ||
| flush_pending_text() |
There was a problem hiding this comment.
Same role issue as the branch: flushing pending_text_parts before a <chat_history> element currently uses the default SYSTEM role, even if earlier messages have already been parsed. If pending text contains preserved HTML that occurs between message blocks, it will become a SYSTEM message; it should likely be USER whenever messages already exist (to match tail handling).
| flush_pending_text() | |
| flush_pending_text(role=AuthorRole.USER if messages else AuthorRole.SYSTEM) |
| # Prompt with HTML-like tags | ||
| prompt_with_html = 'Translate this: "<p>What is your name?</p>"' | ||
| # Same prompt without HTML tags | ||
| prompt_without_html = 'Translate this: "What is your name?"' | ||
|
|
||
| history_with_html = ChatHistory.from_rendered_prompt(prompt_with_html) | ||
| history_without_html = ChatHistory.from_rendered_prompt(prompt_without_html) | ||
|
|
||
| # Both should produce a single user message | ||
| assert len(history_with_html.messages) == 1 | ||
| assert len(history_without_html.messages) == 1 | ||
|
|
||
| # Both should contain the question text | ||
| assert "What is your name?" in history_with_html.messages[0].content | ||
| assert "What is your name?" in history_without_html.messages[0].content | ||
|
|
||
| # The HTML version should preserve the <p> tags | ||
| assert "<p>" in history_with_html.messages[0].content | ||
| assert "</p>" in history_with_html.messages[0].content |
There was a problem hiding this comment.
These tests only assert that the HTML tags and inner text appear somewhere in the parsed content, but they do not verify that content surrounding the tags (for example, trailing text after
such as the closing quote in this prompt) is preserved. Strengthen the regression by asserting the full expected message content (or at least asserting the suffix/prefix around the tag), so a future regression that drops element tail text or punctuation will be caught.
Motivation and Context
Fixes #13632
When a prompt contains valid XML elements like
<p>,<div>,<b>, etc., the XML parser inChatHistory.from_rendered_prompttreats them as elements but silently discards their content.Description
This PR fixes the issue by serializing unrecognized XML elements back to their string representation using
tostring().After the fix:
Contribution Checklist