Unhandled exceptions (zlib.error, EOFError, RuntimeError) during malicious/corrupt DOCX parsing #1561

Description

@MR-SS

Summary

When parsing a .docx file using docx.Document(stream), the library relies heavily on Python's standard zipfile module to extract the internal XML files.

If a user supplies a malformed, corrupt, or tampered ZIP structure, the underlying zipfile and zlib modules raise built-in Python exceptions (zlib.error, EOFError, and RuntimeError). python-docx does not catch these during PackageReader.from_file(), so they propagate out of the library and crash any application that does not handle them itself.

These issues were discovered via fuzzing.


Details & Tracebacks

1. zlib.error (Corrupted Compression Stream)

If the compressed data of a ZIP member is modified, zlib can hit an invalid back-reference distance while inflating it.

Traceback (most recent call last):
  File "poc.py", line 4, in <module>
    doc = docx.Document("crash_zlib.docx")
  ...
  File "/usr/lib/python3.11/zipfile.py", line 1027, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid distance too far back
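
For anyone who wants to reproduce this class of failure without the fuzzed file, here is a minimal sketch using only the standard library. It does not recreate the exact "invalid distance too far back" case above. Instead it overwrites the first byte of a member's deflate stream with a reserved block type, which I would expect to surface as a zlib.error from the same zipfile read path. The member name and helper names are illustrative only.

```python
import io
import zipfile
import zlib

MEMBER = "word/document.xml"  # illustrative member name, not a full DOCX package


def build_archive_with_bad_deflate_block() -> bytes:
    """Build a one-member ZIP in memory, then corrupt its deflate stream."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(MEMBER, b"<w:document/>" * 50)
    raw = bytearray(buf.getvalue())

    # The local file header is 30 fixed bytes followed by the file name and
    # the extra field; the compressed data starts right after those.
    name_len = int.from_bytes(raw[26:28], "little")
    extra_len = int.from_bytes(raw[28:30], "little")
    data_start = 30 + name_len + extra_len

    # 0x07 sets BFINAL=1 and BTYPE=0b11, which deflate reserves as invalid.
    raw[data_start] = 0x07
    return bytes(raw)


def read_member(archive: bytes) -> bytes:
    """Read the member the same way a ZIP-based package reader would."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(MEMBER)


if __name__ == "__main__":
    try:
        read_member(build_archive_with_bad_deflate_block())
    except zlib.error as exc:
        print("zlib.error escaped zipfile:", exc)
```

The archive still opens cleanly because the central directory is untouched. The error only appears once the member is actually decompressed.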

2. EOFError (Truncated ZIP Data)

If the ZIP headers declare a larger size than the actual file stream contains, zipfile hits EOF unexpectedly.

Traceback (most recent call last):
  File "poc.py", line 4, in <module>
    doc = docx.Document("crash_eof.docx")
  ...
  File "/usr/lib/python3.11/zipfile.py", line 1054, in _read2
    raise EOFError
EOFError

3. RuntimeError (Unexpected Encryption Flag)

If the general-purpose flag bits of a ZIP entry are modified to mark an internal XML file (e.g. word/_rels/document.xml.rels) as encrypted, zipfile raises a RuntimeError because no password was provided.

Traceback (most recent call last):
  File "poc.py", in <module>
    doc = docx.Document("crash_encrypted.docx")
  File "/usr/lib/python3.11/zipfile.py", line 1598, in open
    raise RuntimeError("File 'word/_rels/document.xml.rels' is encrypted, password required for extraction")
RuntimeError: File 'word/_rels/document.xml.rels' is encrypted, password required for extraction
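
This case is easy to reproduce with the standard library alone. The sketch below builds a one-member archive and sets bit 0 (the "encrypted" flag) in that member's central directory entry, which is where zipfile reads the flag from. The payload and helper names are illustrative, and the archive is not a complete DOCX package.

```python
import io
import zipfile

MEMBER = "word/_rels/document.xml.rels"


def build_archive_with_encryption_flag() -> bytes:
    """Build a one-member ZIP in memory and mark the member as encrypted."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(MEMBER, b"<Relationships/>")
    raw = bytearray(buf.getvalue())

    # A central directory entry starts with the signature PK\x01\x02; the
    # general-purpose flag bits are the 2 bytes at offset 8 from there.
    entry = raw.rfind(b"PK\x01\x02")
    raw[entry + 8] |= 0x01  # bit 0 = "this entry is encrypted"
    return bytes(raw)


def read_member(archive: bytes) -> bytes:
    """Read the member without supplying a password."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(MEMBER)


if __name__ == "__main__":
    try:
        read_member(build_archive_with_encryption_flag())
    except RuntimeError as exc:
        print("RuntimeError escaped zipfile:", exc)
```

Nothing in the payload is actually encrypted. Flipping a single bit is enough to make zipfile demand a password.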

Suggested Remediation

Any application parsing user-uploaded .docx files should be able to detect a corrupt document by catching a single, documented exception such as PackageNotFoundError or BadZipFile. The application developer should not have to catch zlib.error or EOFError manually.
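
Until the library wraps these errors, a caller has to guard the load itself. The following is a rough sketch of that workaround, not part of python-docx. CorruptDocumentError and open_untrusted are hypothetical names, and the loader is passed in as a callable so the snippet runs without the library installed.

```python
import zipfile
import zlib


class CorruptDocumentError(Exception):
    """Hypothetical application-level error for unreadable uploads."""


# Low-level errors observed escaping the ZIP layer in this report.
LOW_LEVEL_ERRORS = (zipfile.BadZipFile, zlib.error, EOFError, RuntimeError)


def open_untrusted(loader, source):
    """Call loader(source) and translate ZIP-layer errors into one type."""
    try:
        return loader(source)
    except LOW_LEVEL_ERRORS as exc:
        # Chain the original exception so the root cause stays in the traceback.
        raise CorruptDocumentError(f"could not read document: {exc!r}") from exc
```

With python-docx installed this would be called as open_untrusted(docx.Document, stream). RuntimeError is a broad class, so this wrapper may also swallow unrelated bugs. That is one more reason to do the translation inside the library, next to the zipfile call.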

I suggest updating docx.opc.pkgreader.PackageReader.from_file() (or the underlying phys_pkg.blob_for method) to catch these native extraction exceptions and wrap them in a standard python-docx exception (such as PackageNotFoundError or a new InvalidPackageError).

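import zipfile
import zlib

from docx.opc.exceptions import PackageNotFoundError

# Sketch of the body of blob_for(self, pack_uri) in docx/opc/phys_pkg.py:
try:
    return self._zipf.read(pack_uri.membername)
except (zipfile.BadZipFile, zlib.error, EOFError, RuntimeError) as exc:
    # KeyError is deliberately not wrapped here, on the assumption that callers
    # such as rels_xml_for() rely on it to detect a member that does not exist.
    raise PackageNotFoundError("Package is corrupt, truncated, or encrypted.") from exc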
