Summary
When parsing a .docx file using docx.Document(stream), the library relies heavily on Python's standard zipfile module to extract the internal XML files.
If a user provides a malformed, corrupt, or unexpectedly modified ZIP structure, the underlying zipfile and zlib modules raise native Python exceptions (zlib.error, EOFError, and RuntimeError). Because python-docx does not catch these exceptions during PackageReader.from_file(), they propagate out of the library and crash any calling application that does not handle them explicitly.
These issues were discovered via fuzzing.
Details & Tracebacks
1. zlib.error (Corrupted Compression Stream)
If the compressed data of a ZIP member is modified, zlib can encounter an invalid back-reference distance while decompressing it.
Traceback (most recent call last):
  File "poc.py", line 4, in <module>
    doc = docx.Document("crash_zlib.docx")
  ...
  File "/usr/lib/python3.11/zipfile.py", line 1027, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid distance too far back
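A similar failure can be approximated with the standard library alone. The sketch below is not the fuzzer-generated file: it builds a one-member archive in memory and overwrites the first compressed byte with 0x07, which selects a reserved DEFLATE block type, so zlib rejects the stream (with a different message than the one in the traceback above). The helper names, member name, and payload are illustrative assumptions.

```python
import io
import struct
import zipfile
import zlib

MEMBER = "word/document.xml"  # illustrative member name


def build_zip(payload=b"<w:document/>" * 50):
    """Return the bytes of a one-member, DEFLATE-compressed archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(MEMBER, payload)
    return buf.getvalue()


def corrupt_deflate_stream(zip_bytes):
    """Overwrite the first compressed byte of the first member with 0x07."""
    data = bytearray(zip_bytes)
    # Local file header: 30 fixed bytes; the name and extra-field lengths sit
    # at offsets 26 and 28, and the compressed data follows both fields.
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    data[30 + name_len + extra_len] = 0x07  # BFINAL=1, BTYPE=3 (reserved)
    return bytes(data)


def read_member(zip_bytes):
    """Read MEMBER through ZipFile.read(), as a package reader would."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(MEMBER)


if __name__ == "__main__":
    try:
        read_member(corrupt_deflate_stream(build_zip()))
    except zlib.error as exc:
        print("zlib.error escaped zipfile:", exc)
```

The central directory and local header are left untouched, so the archive still opens normally and the error only surfaces when the member is actually decompressed.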
2. EOFError (Truncated ZIP Data)
If the ZIP headers declare a larger size than the file stream actually contains, zipfile reaches the end of the stream before it has read the declared number of bytes.
Traceback (most recent call last):
  File "poc.py", line 4, in <module>
    doc = docx.Document("crash_eof.docx")
  ...
  File "/usr/lib/python3.11/zipfile.py", line 1054, in _read2
    raise EOFError
EOFError
3. RuntimeError (Unexpected Encryption Flag)
If a ZIP flag is modified to mark an internal XML file (e.g. word/_rels/document.xml.rels) as encrypted, zipfile raises a RuntimeError because no password was provided.
Traceback (most recent call last):
  File "poc.py", in <module>
    doc = docx.Document("crash_encrypted.docx")
  File "/usr/lib/python3.11/zipfile.py", line 1598, in open
    raise RuntimeError("File 'word/_rels/document.xml.rels' is encrypted, password required for extraction")
RuntimeError: File 'word/_rels/document.xml.rels' is encrypted, password required for extraction
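This case can also be approximated with only the standard library. The sketch below assumes a single-member archive with no archive comment, locates the central directory through the end-of-central-directory record, and sets bit 0 (the encryption bit) of the entry's general-purpose flags. The helper names and payload are illustrative and are not taken from the fuzzer output.

```python
import io
import struct
import zipfile

MEMBER = "word/_rels/document.xml.rels"


def build_zip(payload=b"<Relationships/>"):
    """Return the bytes of a one-member archive containing MEMBER."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(MEMBER, payload)
    return buf.getvalue()


def set_encrypted_flag(zip_bytes):
    """Set bit 0 (encrypted) in the first central-directory entry's flags."""
    data = bytearray(zip_bytes)
    # With no archive comment, the end-of-central-directory record is the last
    # 22 bytes; the central directory offset is its 4-byte field at byte 16.
    (cd_offset,) = struct.unpack_from("<I", data, len(data) - 6)
    if data[cd_offset:cd_offset + 4] != b"PK\x01\x02":
        raise ValueError("central directory header not found")
    data[cd_offset + 8] |= 0x01  # low byte of the general-purpose flags
    return bytes(data)


def read_member(zip_bytes):
    """Read MEMBER without supplying a password."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(MEMBER)


if __name__ == "__main__":
    try:
        read_member(set_encrypted_flag(build_zip()))
    except RuntimeError as exc:
        print("RuntimeError escaped zipfile:", exc)
```

A single flipped bit is enough here because zipfile takes the flags from the central directory entry and refuses to open the member before any data is read.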
Suggested Remediation
An application that parses user-uploaded .docx files should only need to catch a standard PackageNotFoundError or BadZipFile when a document is corrupt. The application developer should not have to catch zlib.error or EOFError manually.
I suggest updating docx.opc.pkgreader.PackageReader.from_file() (or the underlying phys_pkg.blob_for method) to catch these native extraction exceptions and wrap them in a standard python-docx exception (such as PackageNotFoundError or a new InvalidPackageError).
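import zipfile
import zlib

from docx.opc.exceptions import PackageNotFoundError

def blob_for(self, pack_uri):
    try:
        return self._zipf.read(pack_uri.membername)
    # KeyError (missing member) is left to propagate; callers may rely on it.
    except (zipfile.BadZipFile, zlib.error, EOFError, RuntimeError) as exc:
        raise PackageNotFoundError("Package is corrupt, truncated, or encrypted.") from exc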
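To show the wrapping behaviour in isolation, here is a sketch that runs without python-docx. InvalidPackageError, read_member, and make_corrupt_zip are hypothetical stand-ins, not part of the library, and the corrupt archive is built by overwriting the first compressed byte with a reserved DEFLATE block type. KeyError is deliberately left alone in this sketch, on the assumption that callers may want to treat "member absent" differently from "member unreadable".

```python
import io
import struct
import zipfile
import zlib


class InvalidPackageError(Exception):
    """Hypothetical stand-in for a python-docx package exception."""


def read_member(zipf, membername):
    """Read one member, translating low-level extraction failures."""
    try:
        return zipf.read(membername)
    except (zipfile.BadZipFile, zlib.error, EOFError, RuntimeError) as exc:
        # `from exc` keeps the original error available as __cause__.
        raise InvalidPackageError(
            f"cannot extract {membername!r}: {exc.__class__.__name__}"
        ) from exc


def make_corrupt_zip(membername):
    """Build a one-member archive whose DEFLATE stream starts with 0x07."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(membername, b"<w:document/>" * 50)
    data = bytearray(buf.getvalue())
    # Compressed data starts after the 30-byte local header, name, and extra.
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    data[30 + name_len + extra_len] = 0x07  # reserved block type
    return bytes(data)


if __name__ == "__main__":
    blob = make_corrupt_zip("word/document.xml")
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        try:
            read_member(zf, "word/document.xml")
        except InvalidPackageError as exc:
            print(exc, "| cause:", type(exc.__cause__).__name__)
```

With this shape, an application needs a single except clause for the package-level exception, and the original zlib or zipfile error stays reachable through __cause__ for logging.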