FEAT Add WordDocConverter for Word document generation #1365
…te rendering and integration with PyRIT's data serialization system.
@microsoft-github-policy-service agree
Pull request overview
Adds a new PyRIT file converter that emits Word documents (.docx) from text prompts, complementing the existing PDF file-converter tooling and documentation.
Changes:
- Introduces `WordDocConverter` with direct `.docx` generation and template-based placeholder injection.
- Exports `WordDocConverter` from `pyrit.prompt_converter`.
- Adds unit tests and updates converter documentation; updates project dependencies for `python-docx`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pyrit/prompt_converter/word_doc_converter.py | Implements .docx generation + template injection and serialization. |
| pyrit/prompt_converter/__init__.py | Exposes WordDocConverter in the package exports. |
| pyproject.toml | Adds python-docx dependency (but also introduces a duplicate pypdf entry). |
| tests/unit/converter/test_word_doc_converter.py | Adds unit tests for direct + template modes and identifier behavior. |
| doc/code/converters/5_file_converters.py | Documents Word document conversion usage alongside PDF converters. |

    async def _serialize_docx(self, docx_bytes: bytes) -> DataTypeSerializer:
        """
        Save the generated ``.docx`` bytes through PyRIT's data serializer.

        The serializer picks a unique filename and writes the bytes to the configured storage location (local disk by default).

        Args:
            docx_bytes: Raw content of the Word document.

        Returns:
            DataTypeSerializer: Serializer whose ``.value`` contains the output path.
        """
        docx_serializer = data_serializer_factory(
            category="prompt-memory-entries",
            data_type="binary_path",
            extension="docx",
        )

        await docx_serializer.save_data(docx_bytes)

        return docx_serializer

Copilot AI, Feb 10, 2026
Async private method _serialize_docx doesn’t follow the project convention that async methods must end with _async. Rename it (e.g., _serialize_docx_async) and update the call site and related tests/patches accordingly.
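A convention like this is easier to keep when a small check can list the offenders. The following is a minimal sketch using only the standard library; the helper `find_misnamed_async_methods` and the `Example` class are hypothetical illustrations, not part of PyRIT.

```python
import inspect


def find_misnamed_async_methods(cls: type) -> list[str]:
    """Return names of coroutine methods defined on cls that do not end with '_async'."""
    misnamed = []
    for name, member in vars(cls).items():
        # Only inspect coroutine functions defined directly on this class.
        if inspect.iscoroutinefunction(member) and not name.endswith("_async"):
            misnamed.append(name)
    return sorted(misnamed)


class Example:
    # Hypothetical stand-in for a converter class.
    async def _serialize_docx(self, docx_bytes: bytes) -> None:
        ...

    async def convert_async(self, prompt: str) -> str:
        return prompt


print(find_misnamed_async_methods(Example))
```

A check of this kind could run in a unit test, so a rename such as `_serialize_docx` to `_serialize_docx_async` is flagged before review rather than during it.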

    # Rewind to read from the start of the stored bytes.
    self._existing_doc_bytes.seek(0)
    document = Document(self._existing_doc_bytes)

Copilot AI, Feb 10, 2026
self._existing_doc_bytes is a shared BytesIO whose cursor is mutated via seek(0). If convert_async is called concurrently on the same converter instance, the shared cursor can cause races/corrupted reads. Store immutable bytes instead and create a new BytesIO per conversion (or otherwise guard access).
Suggested change:

    - # Rewind to read from the start of the stored bytes.
    - self._existing_doc_bytes.seek(0)
    - document = Document(self._existing_doc_bytes)
    + existing_doc_bytes = self._existing_doc_bytes
    + if isinstance(existing_doc_bytes, BytesIO):
    +     template_bytes = existing_doc_bytes.getvalue()
    + else:
    +     template_bytes = existing_doc_bytes
    + document_stream = BytesIO(template_bytes)
    + document = Document(document_stream)

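The idea behind this comment can be shown in isolation: keep the template as immutable `bytes` and hand every caller its own `BytesIO`, so no cursor is ever shared. This is a minimal sketch under that assumption; `TemplateHolder` and `open_stream` are hypothetical names, not the converter's actual API.

```python
from io import BytesIO


class TemplateHolder:
    """Hypothetical stand-in that keeps template content as immutable bytes."""

    def __init__(self, template: bytes) -> None:
        self._template_bytes = bytes(template)

    def open_stream(self) -> BytesIO:
        # Each caller gets an independent stream whose cursor starts at position 0.
        return BytesIO(self._template_bytes)


holder = TemplateHolder(b"template-content")
first = holder.open_stream()
second = holder.open_stream()

first.read(8)  # advances only the first stream's cursor
print(first.tell(), second.tell())
```

Because reading from `first` does not move the cursor of `second`, no `seek(0)` is needed before each conversion and concurrent calls cannot interfere with each other's reads.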

    (e.g. ``{{ prompt }}``) while preserving all original formatting.
    If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
    becomes a new paragraph with the configured font settings.

    Args:
        prompt: The text to embed in the Word document.
        input_type: Must be ``text``.

Copilot AI, Feb 10, 2026
This docstring claims template-based injection preserves “all original formatting”, but _render_paragraph rewrites runs and can collapse mixed formatting when placeholders span runs (it clears subsequent runs and applies first-run formatting to the whole rendered text). Please either (a) implement run-aware replacement that preserves mixed formatting, or (b) adjust the docstring/behavior description to reflect this limitation.
Suggested change:

    - (e.g. ``{{ prompt }}``) while preserving all original formatting.
    - If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
    - becomes a new paragraph with the configured font settings.
    - Args:
    -     prompt: The text to embed in the Word document.
    -     input_type: Must be ``text``.
    + (e.g. ``{{ prompt }}``), using the original document's layout and styling as a base. Mixed formatting within or
    + across placeholder regions may not be preserved exactly.
    + If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
    + becomes a new paragraph with the configured font settings.
    + Args:
    +     prompt (str): The text to embed in the Word document.
    +     input_type (PromptDataType): Must be ``text``.


        template = Template(full_text)
        rendered_text = template.render(**template_vars)
    except Exception as e:
        logger.warning(f"Failed to render paragraph template: {e}")
        return

Copilot AI, Feb 10, 2026
Rendering arbitrary Jinja2 templates from document text via Template(full_text).render(...) is unsafe if the template content is not fully trusted (Jinja2 templates can be abused for code execution/data access). Consider using jinja2.sandbox.SandboxedEnvironment, restricting to a simple {{ prompt }} replacement, or otherwise documenting and enforcing that templates must be trusted.
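Of the options listed, restricting rendering to a plain `{{ prompt }}` substitution is the simplest to illustrate without any template engine. The sketch below is one possible shape of that restriction, not the converter's implementation; the function name `render_prompt_placeholder` is a hypothetical choice.

```python
import re

# Matches only the literal placeholder {{ prompt }}, with optional inner whitespace.
_PROMPT_PLACEHOLDER = re.compile(r"\{\{\s*prompt\s*\}\}")


def render_prompt_placeholder(text: str, prompt: str) -> str:
    """Replace {{ prompt }} with the prompt text; leave every other construct untouched."""
    # A function replacement keeps backslashes in the prompt from being read as escapes.
    return _PROMPT_PLACEHOLDER.sub(lambda _match: prompt, text)


print(render_prompt_placeholder("Dear team, {{ prompt }}", "hello"))
print(render_prompt_placeholder("{{ config.items() }} {{prompt}}", "x"))
```

Any other expression in the document text, such as `{{ config.items() }}`, is left as literal text instead of being evaluated. The trade-off is that templates lose access to Jinja2 features such as filters and conditionals.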

    from io import BytesIO
    from pathlib import Path
    from typing import Optional
    from docx import Document
    from docx.shared import Pt
    from jinja2 import Template

    from pyrit.common.logger import logger
    from pyrit.identifiers import ConverterIdentifier
    from pyrit.models import PromptDataType, data_serializer_factory
    from pyrit.models.data_type_serializer import DataTypeSerializer
    from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter

Copilot AI, Feb 10, 2026
Import grouping is inconsistent with other modules (stdlib vs third-party vs local). Add a blank line between standard-library imports (io/pathlib/typing) and third-party imports (docx/jinja2) to match the repository’s import organization pattern.

    "pyodbc>=5.1.0",
    "python-dotenv>=1.0.1",
    "python-docx>=1.2.0",
    "pypdf>=5.1.0",

Copilot AI, Feb 10, 2026
dependencies lists pypdf twice with different minimum versions (>=5.1.0 and >=6.6.2). This is conflicting/ambiguous for resolvers and should be collapsed to a single requirement (likely keep only the stricter >=6.6.2 unless there’s a specific reason to lower it).
Suggested change:

    - "pypdf>=5.1.0",

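Duplicate entries of this kind are easy to miss in a long dependency list. As a rough sketch, a helper can group requirement strings by normalized package name; `find_duplicate_requirements` is a hypothetical name, and the sample list reuses the entries quoted in this review.

```python
import re
from collections import defaultdict


def find_duplicate_requirements(dependencies: list[str]) -> dict[str, list[str]]:
    """Group requirement strings by package name and return names listed more than once."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for requirement in dependencies:
        # The package name is the leading run of letters, digits, '.', '_' and '-'.
        match = re.match(r"[A-Za-z0-9._-]+", requirement.strip())
        if match is None:
            continue
        # Normalize names: lowercase, and collapse runs of '-', '_', '.' into '-'.
        name = re.sub(r"[-_.]+", "-", match.group(0)).lower()
        by_name[name].append(requirement)
    return {name: reqs for name, reqs in by_name.items() if len(reqs) > 1}


deps = [
    "pyodbc>=5.1.0",
    "python-dotenv>=1.0.1",
    "python-docx>=1.2.0",
    "pypdf>=5.1.0",
    "pypdf>=6.6.2",
]
print(find_duplicate_requirements(deps))
```

The helper only reports duplicates; deciding which bound to keep (here, likely the stricter `>=6.6.2`) remains a manual choice.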

    # The `WordDocConverter` generates Word documents (.docx) from text using `python-docx`. It supports two modes:
    #
    # 1. **Direct generation**: Convert plain text strings into Word documents. The prompt becomes the document content.
    # 2. **Template-based generation**: Supply an existing `.docx` file containing jinja2 placeholders (e.g., `{{ prompt }}`). The converter replaces placeholders with the prompt text while preserving the original document's formatting, tables, headers, and footers. The original file is never modified — a new file is always generated.

Copilot AI, Feb 10, 2026
The docs state that template-based generation preserves the original document’s formatting. Given the current implementation can collapse run-level formatting when placeholders span multiple runs, please either update the documentation to mention this limitation or improve the implementation to truly preserve mixed formatting.

    # This mode takes an existing `.docx` file that contains jinja2 `{{ prompt }}` placeholders and replaces them with the provided prompt text. This is useful for embedding adversarial content into realistic document templates (e.g., resumes, reports, invoices) while preserving all original formatting.

    # %%
    import tempfile

Copilot AI, Feb 10, 2026
This import of module tempfile is redundant, as it was previously imported on line 144.
Suggested change:

    - import tempfile


    from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter


    class WordDocConverter(PromptConverter):

Copilot AI, Feb 10, 2026
This class does not call `PromptConverter.__init__` during initialization. (`WordDocConverter.__init__` may be missing a call to a base class `__init__`.)
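The usual remedy is to chain to the base initializer with `super().__init__()`. The sketch below shows the pattern with hypothetical classes; `BaseConverter` and `DocConverter` are stand-ins, and nothing here assumes what PyRIT's `PromptConverter.__init__` actually does.

```python
class BaseConverter:
    """Hypothetical base class; it is not PyRIT's PromptConverter."""

    def __init__(self) -> None:
        self.initialized_by_base = True


class DocConverter(BaseConverter):
    """Hypothetical subclass that chains to the base initializer."""

    def __init__(self, font_size: int = 12) -> None:
        super().__init__()  # run base-class setup before adding subclass state
        self.font_size = font_size


converter = DocConverter(font_size=14)
print(converter.initialized_by_base, converter.font_size)
```

Without the `super().__init__()` call, any attribute the base class sets up would be missing on the subclass instance.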
Description:
Adds WordDocConverter - a file converter that transforms text prompts into Word documents (.docx). Issue #424.
Two modes:
Files changed:
Tests and Documentation:
Tests: 21 unit tests covering init/validation, direct generation, template-based generation (body paragraphs, tables, multiple placeholders, no-placeholder passthrough), end-to-end with real .docx output, and identifier correctness. All passed. (21/21)