@ducktv1203

Description:
Adds WordDocConverter, a file converter that transforms text prompts into Word documents (.docx). Issue #424.

Two modes:

  • Direct generation: plain text prompt is written into a new .docx with configurable font name and size
  • Template-based generation: an existing .docx file containing jinja2 {{ prompt }} placeholders is used as a template; placeholders are replaced with the prompt text while preserving all original formatting, tables, headers, and footers. The original file is never modified.

Files changed:

  • pyrit/prompt_converter/word_doc_converter.py: converter implementation
  • pyrit/prompt_converter/__init__.py: export WordDocConverter
  • pyproject.toml: added python-docx>=1.2.0 dependency
  • tests/unit/converter/test_word_doc_converter.py: 21 unit tests

Tests and Documentation:
Tests: 21 unit tests covering init/validation, direct generation, template-based generation (body paragraphs, tables, multiple placeholders, no-placeholder passthrough), end-to-end with real .docx output, and identifier correctness. All passed. (21/21)

@ducktv1203
Author

@microsoft-github-policy-service agree

@ducktv1203 ducktv1203 marked this pull request as draft February 10, 2026 12:13
Contributor

Copilot AI left a comment

Pull request overview

Adds a new PyRIT file converter that emits Word documents (.docx) from text prompts, complementing the existing PDF file-converter tooling and documentation.

Changes:

  • Introduces WordDocConverter with direct .docx generation and template-based placeholder injection.
  • Exports WordDocConverter from pyrit.prompt_converter.
  • Adds unit tests and updates converter documentation; updates project dependencies for python-docx.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| pyrit/prompt_converter/word_doc_converter.py | Implements .docx generation + template injection and serialization. |
| pyrit/prompt_converter/__init__.py | Exposes WordDocConverter in the package exports. |
| pyproject.toml | Adds python-docx dependency (but also introduces a duplicate pypdf entry). |
| tests/unit/converter/test_word_doc_converter.py | Adds unit tests for direct + template modes and identifier behavior. |
| doc/code/converters/5_file_converters.py | Documents Word document conversion usage alongside PDF converters. |

Comment on lines +256 to +276
async def _serialize_docx(self, docx_bytes: bytes) -> DataTypeSerializer:
    """
    Save the generated ``.docx`` bytes through PyRIT's data serializer.

    The serializer picks a unique filename and writes the bytes to the configured storage location (local disk by default).

    Args:
        docx_bytes: Raw content of the Word document.

    Returns:
        DataTypeSerializer: Serializer whose ``.value`` contains the output path.
    """
    docx_serializer = data_serializer_factory(
        category="prompt-memory-entries",
        data_type="binary_path",
        extension="docx",
    )

    await docx_serializer.save_data(docx_bytes)

    return docx_serializer

Copilot AI Feb 10, 2026

Async private method _serialize_docx doesn’t follow the project convention that async methods must end with _async. Rename it (e.g., _serialize_docx_async) and update the call site and related tests/patches accordingly.

Copilot generated this review using guidance from repository custom instructions.
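
A convention like this is easy to check mechanically. The following is a minimal sketch, not part of the PR: `find_misnamed_async_methods` and `ExampleConverter` are hypothetical names used only for illustration, and the stand-in class mirrors just the method names quoted above.

```python
import inspect


def find_misnamed_async_methods(cls: type) -> list[str]:
    """Return names of coroutine methods defined directly on cls that lack the _async suffix."""
    misnamed = []
    for name, member in vars(cls).items():
        if inspect.iscoroutinefunction(member) and not name.endswith("_async"):
            misnamed.append(name)
    return sorted(misnamed)


class ExampleConverter:  # hypothetical stand-in, not the real WordDocConverter
    async def convert_async(self, prompt: str) -> str:
        return await self._serialize_docx(prompt.encode())

    async def _serialize_docx(self, docx_bytes: bytes) -> str:
        # Stub body; the real method saves the bytes through a serializer.
        return f"{len(docx_bytes)} bytes"


print(find_misnamed_async_methods(ExampleConverter))
```

Renaming the method to `_serialize_docx_async` (and updating the call site and any test patches that reference the old name) would make the check return an empty list.
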
Comment on lines +180 to +183
# Rewind to read from the start of the stored bytes.
self._existing_doc_bytes.seek(0)
document = Document(self._existing_doc_bytes)


Copilot AI Feb 10, 2026

self._existing_doc_bytes is a shared BytesIO whose cursor is mutated via seek(0). If convert_async is called concurrently on the same converter instance, the shared cursor can cause races/corrupted reads. Store immutable bytes instead and create a new BytesIO per conversion (or otherwise guard access).

Suggested change
- # Rewind to read from the start of the stored bytes.
- self._existing_doc_bytes.seek(0)
- document = Document(self._existing_doc_bytes)
+ existing_doc_bytes = self._existing_doc_bytes
+ if isinstance(existing_doc_bytes, BytesIO):
+     template_bytes = existing_doc_bytes.getvalue()
+ else:
+     template_bytes = existing_doc_bytes
+ document_stream = BytesIO(template_bytes)
+ document = Document(document_stream)

Copilot uses AI. Check for mistakes.
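
The shared-cursor concern can be illustrated with the standard library alone. This is a sketch under stated assumptions: both reader classes are hypothetical, and the `await asyncio.sleep(0)` between reads is an artificial yield point added to force an interleaving. Whether the real converter yields between `seek(0)` and the end of the read is not shown in the excerpt.

```python
import asyncio
from io import BytesIO


class SharedStreamReader:
    """Keeps one BytesIO, so concurrent readers share its cursor."""

    def __init__(self, data: bytes) -> None:
        self._stream = BytesIO(data)

    async def read_async(self) -> bytes:
        self._stream.seek(0)
        first = self._stream.read(3)
        await asyncio.sleep(0)  # artificial yield: another task may move the cursor
        return first + self._stream.read()


class PerCallStreamReader:
    """Keeps immutable bytes and builds a fresh BytesIO for every call."""

    def __init__(self, data: bytes) -> None:
        self._data = bytes(data)

    async def read_async(self) -> bytes:
        stream = BytesIO(self._data)  # private cursor for this call only
        first = stream.read(3)
        await asyncio.sleep(0)
        return first + stream.read()


async def read_twice(reader) -> list[bytes]:
    # Run two reads concurrently on the same reader instance.
    return list(await asyncio.gather(reader.read_async(), reader.read_async()))


shared_results = asyncio.run(read_twice(SharedStreamReader(b"abcdef")))
per_call_results = asyncio.run(read_twice(PerCallStreamReader(b"abcdef")))
print(shared_results, per_call_results)
```

With a private stream per call, both concurrent reads return the full payload; with the shared stream, at least one read comes back truncated once the calls interleave.
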
Comment on lines +97 to +103
(e.g. ``{{ prompt }}``) while preserving all original formatting.
If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
becomes a new paragraph with the configured font settings.

Args:
prompt: The text to embed in the Word document.
input_type: Must be ``text``.

Copilot AI Feb 10, 2026

This docstring claims template-based injection preserves “all original formatting”, but _render_paragraph rewrites runs and can collapse mixed formatting when placeholders span runs (it clears subsequent runs and applies first-run formatting to the whole rendered text). Please either (a) implement run-aware replacement that preserves mixed formatting, or (b) adjust the docstring/behavior description to reflect this limitation.

Suggested change
- (e.g. ``{{ prompt }}``) while preserving all original formatting.
- If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
- becomes a new paragraph with the configured font settings.
- Args:
-     prompt: The text to embed in the Word document.
-     input_type: Must be ``text``.
+ (e.g. ``{{ prompt }}``), using the original document's layout and styling as a base. Mixed formatting within or
+ across placeholder regions may not be preserved exactly.
+ If no template was provided, a new document is generated where each line of the prompt (split on ``\\n``)
+ becomes a new paragraph with the configured font settings.
+ Args:
+     prompt (str): The text to embed in the Word document.
+     input_type (PromptDataType): Must be ``text``.
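
For option (a) in the comment above, one possible shape of a run-aware replacement is sketched below. It models a paragraph's runs as a plain list of strings so it runs without python-docx; `replace_across_runs` is a hypothetical helper, not code from the PR, and it handles only the first placeholder occurrence. The idea is to edit only the runs the placeholder overlaps and to keep the run count unchanged, so each run's formatting can stay attached to its index.

```python
PLACEHOLDER = "{{ prompt }}"


def replace_across_runs(runs: list[str], replacement: str, placeholder: str = PLACEHOLDER) -> list[str]:
    """Replace the first placeholder occurrence, touching only the runs it overlaps."""
    full_text = "".join(runs)
    start = full_text.find(placeholder)
    if start == -1:
        return list(runs)  # nothing to replace; runs are returned unchanged
    end = start + len(placeholder)

    result = []
    offset = 0
    inserted = False
    for text in runs:
        run_start, run_end = offset, offset + len(text)
        offset = run_end
        if run_end <= start or run_start >= end:
            result.append(text)  # run lies entirely outside the placeholder
            continue
        before = text[: max(start - run_start, 0)]  # text preceding the placeholder
        after = text[max(end - run_start, 0):]      # text following the placeholder
        if not inserted:
            result.append(before + replacement + after)
            inserted = True
        else:
            result.append(before + after)  # placeholder fragment removed from this run
    return result


runs = ["Dear ", "{{ pro", "mpt }}", ", welcome"]
print(replace_across_runs(runs, "Alice"))
```

In python-docx terms, the returned strings could be assigned back to `run.text` for each entry of `paragraph.runs`, which leaves every run's own font properties in place. The rendered text takes the formatting of the run where the placeholder starts, which is a design choice rather than the only option.
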

Comment on lines +234 to +238
template = Template(full_text)
rendered_text = template.render(**template_vars)
except Exception as e:
logger.warning(f"Failed to render paragraph template: {e}")
return

Copilot AI Feb 10, 2026

Rendering arbitrary Jinja2 templates from document text via Template(full_text).render(...) is unsafe if the template content is not fully trusted (Jinja2 templates can be abused for code execution/data access). Consider using jinja2.sandbox.SandboxedEnvironment, restricting to a simple {{ prompt }} replacement, or otherwise documenting and enforcing that templates must be trusted.
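
Of the options listed, restricting rendering to a plain `{{ prompt }}` substitution needs no template engine at all. A minimal sketch, assuming the placeholder name is always `prompt` and that optional whitespace inside the braces should be tolerated; `render_prompt_only` is a hypothetical helper, not code from the PR:

```python
import re

# Matches {{ prompt }} with optional whitespace inside the braces.
_PROMPT_PLACEHOLDER = re.compile(r"\{\{\s*prompt\s*\}\}")


def render_prompt_only(text: str, prompt: str) -> str:
    """Replace {{ prompt }} placeholders and leave every other expression untouched."""
    # A callable replacement keeps backslashes in the prompt literal.
    return _PROMPT_PLACEHOLDER.sub(lambda _match: prompt, text)


untrusted = "Hello {{ prompt }} and {{ ''.__class__.__mro__ }}"
print(render_prompt_only(untrusted, r"C:\temp"))
```

Any other expression in the document text is left as literal text instead of being evaluated. If full Jinja2 syntax is needed, `jinja2.sandbox.SandboxedEnvironment` is the alternative the comment names.
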

Comment on lines +4 to +15
from io import BytesIO
from pathlib import Path
from typing import Optional
from docx import Document
from docx.shared import Pt
from jinja2 import Template

from pyrit.common.logger import logger
from pyrit.identifiers import ConverterIdentifier
from pyrit.models import PromptDataType, data_serializer_factory
from pyrit.models.data_type_serializer import DataTypeSerializer
from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter

Copilot AI Feb 10, 2026

Import grouping is inconsistent with other modules (stdlib vs third-party vs local). Add a blank line between standard-library imports (io/pathlib/typing) and third-party imports (docx/jinja2) to match the repository’s import organization pattern.

"pyodbc>=5.1.0",
"python-dotenv>=1.0.1",
"python-docx>=1.2.0",
"pypdf>=5.1.0",

Copilot AI Feb 10, 2026

dependencies lists pypdf twice with different minimum versions (>=5.1.0 and >=6.6.2). This is conflicting/ambiguous for resolvers and should be collapsed to a single requirement (likely keep only the stricter >=6.6.2 unless there’s a specific reason to lower it).

Suggested change
- "pypdf>=5.1.0",

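
Duplicate entries like this can be caught with a small check over the dependency list. The sketch below is illustrative only: `find_duplicate_requirements` is a hypothetical helper, and the sample list assumes the second `pypdf>=6.6.2` entry mentioned in the comment sits somewhere in the same `dependencies` array (its exact position is not shown in the diff excerpt).

```python
import re
from collections import defaultdict


def find_duplicate_requirements(dependencies: list[str]) -> dict[str, list[str]]:
    """Group requirement strings by normalized project name and return names listed more than once."""
    by_name = defaultdict(list)
    for requirement in dependencies:
        match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
        if match is None:
            continue  # not a recognizable requirement string
        # Normalize names: lowercase, with runs of '-', '_' and '.' collapsed to '-'.
        name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
        by_name[name].append(requirement)
    return {name: specs for name, specs in by_name.items() if len(specs) > 1}


dependencies = [
    "pyodbc>=5.1.0",
    "python-dotenv>=1.0.1",
    "python-docx>=1.2.0",
    "pypdf>=5.1.0",
    "pypdf>=6.6.2",  # assumed location of the pre-existing entry
]
print(find_duplicate_requirements(dependencies))
```
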
Comment on lines +211 to +214
# The `WordDocConverter` generates Word documents (.docx) from text using `python-docx`. It supports two modes:
#
# 1. **Direct generation**: Convert plain text strings into Word documents. The prompt becomes the document content.
# 2. **Template-based generation**: Supply an existing `.docx` file containing jinja2 placeholders (e.g., `{{ prompt }}`). The converter replaces placeholders with the prompt text while preserving the original document's formatting, tables, headers, and footers. The original file is never modified — a new file is always generated.

Copilot AI Feb 10, 2026

The docs state that template-based generation preserves the original document’s formatting. Given the current implementation can collapse run-level formatting when placeholders span multiple runs, please either update the documentation to mention this limitation or improve the implementation to truly preserve mixed formatting.

# This mode takes an existing `.docx` file that contains jinja2 `{{ prompt }}` placeholders and replaces them with the provided prompt text. This is useful for embedding adversarial content into realistic document templates (e.g., resumes, reports, invoices) while preserving all original formatting.

# %%
import tempfile

Copilot AI Feb 10, 2026

This import of module tempfile is redundant, as it was previously imported on line 144.

Suggested change
- import tempfile

from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter


class WordDocConverter(PromptConverter):

Copilot AI Feb 10, 2026

This class does not call PromptConverter.__init__ during initialization. (WordDocConverter.__init__ may be missing a call to a base class __init__)
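
The general failure mode behind this warning can be shown with stand-in classes. Everything below is hypothetical: `BaseConverter` is not the real `PromptConverter`, whose `__init__` signature and state are not shown in this thread, and the `font_name` default is an assumption made for illustration.

```python
class BaseConverter:  # hypothetical stand-in for PromptConverter
    def __init__(self) -> None:
        self._identifier = None  # state that base-class methods expect to exist

    def get_identifier(self) -> str:
        return self._identifier or type(self).__name__


class WithoutSuperInit(BaseConverter):
    def __init__(self, font_name: str = "Calibri") -> None:
        self._font_name = font_name  # base-class state is never set up


class WithSuperInit(BaseConverter):
    def __init__(self, font_name: str = "Calibri") -> None:
        super().__init__()  # initialize base-class state first
        self._font_name = font_name


print(WithSuperInit().get_identifier())
try:
    WithoutSuperInit().get_identifier()
except AttributeError as error:
    print(f"missing base state: {error}")
```

Whether the real base class sets any state in `__init__` determines how serious the omission is; calling `super().__init__()` is the safe default either way.
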
