Skip to content

fix(security): Cap untrusted XML/archive reads to bound memory use#1342

Open
reddraconi wants to merge 2 commits intoTagStudioDev:mainfrom
reddraconi:security-xml-archive-caps
Open

fix(security): Cap untrusted XML/archive reads to bound memory use#1342
reddraconi wants to merge 2 commits intoTagStudioDev:mainfrom
reddraconi:security-xml-archive-caps

Conversation

@reddraconi
Copy link
Copy Markdown

@reddraconi reddraconi commented Apr 25, 2026

Two-part fix to TagStudio's XML/archive parsing, delivered as two commits in this PR to address #1341

NOTE: This adds a new dependency, defusedxml.ElementTree and updates the settings UI to include a new Advanced page for these (probably-not-often-touched settings).

  1. Swap xml.etree.ElementTree to defusedxml.ElementTree

The stdlib parser is known to be unsafe on untrusted input. It happily expands DTD entities (billion-laughs /
OOM bombs), follows external-entity references that read local files or fetch over HTTP (XXE), and applies
no internal bounds. defusedxml is the recommended drop-in replacement that disables those parser behaviors by default. The Element import is kept from stdlib since defusedxml only ships replacement parsers, not data classes.

We left the legacy core/library/json/library.py xml.etree.ElementTree.parse alone, per
CONTRIBUTING.md guidance. If we need to swap that too, we can.

  1. Cap attacker-controlled XML reads

defusedxml only protects the parser. Since there's a handful of calls that happen before the parser can protect us, we'd like to cap those read sizes so a malicious or accidentally-large input doesn't OOM TagStudio

Each path now checks the declared/on-disk size against a byte cap before allocating, raising ValueError with a message that names the exact GlobalSettings field to raise. Caps live as fields on GlobalSettings (canonical defaults in core/constants.py) so users can raise them if they need to:

Setting Default MiB Protects
comic_info_max_mb 1 ePub / CB* ComicInfo.xml zip-bomb members
mdp_header_max_mb 1 .mdp (MDIPACK) headers — 32-bit declared length
pdn_header_max_mb 16 .pdn (Paint.NET) headers — 24-bit declared length (matches format ceiling)
dupe_results_max_mb 128 DupeGuru results XML before ET.parse

A new Advanced tab in the settings panel exposes these settings. The renderer paths' ValueError is caught by each method's exception handler and throws as Couldn't render thumbnail error=ValueError log line, so users hitting their cap get useful messages. The dupe registry raises ValueError before ET.parse runs
and bubbles up to the caller in fix_dupe_files.py.

_epub_cover, _mdp_thumb, and _pdn_thumb are no longer @staticmethod because they now read self.driver.settings. All existing call sites already used self.<method>(...), so this should be a no-op for callers.

Tasks Completed

  • Platforms Tested:
    • Windows x86
    • Windows ARM
    • macOS x86
    • macOS ARM
    • Linux x86
    • Linux ARM
  • Tested For:
    • Basic Functionality
    • PyInstaller Executable
    • pytest: All 231 tests still pass. I added 5 new tests
    • ruff and mypy came back clean.
    • Manual: Opened the new Advanced tab, changed values, save + reloaded to ensure values persisted.

The thumbnail rendered reads XMLs out of user-supplied files and the
duplicate-files registry reads DupeGuru results.

The native Python xml.etree.ElementTree is known to be vulnerable to
Billion laughs attacks and other parsing-based issues.

Malformed or malicious XMLs can cause the app to freeze or crash, so I
swapped in defusedxml, since it's specifically designed to be a safe
drop-in replacement.

This fix updates the imports in `dupe_files_registry.py` and `renderer.py` to
use defusedxml rather than `xml.etree.ElementTree`. Element is still
imported from `xml.etree.Elementree` since defusedxml only replaces the
parsers, not the data class stuff.

We avoided `json/library.py`, since the CONTRIBUTING.md file said to
leave legacy stuff alone.
Note: This updates the requirements.txt to add `defusedxml`!

Multiple small changes:

1. Swapped `defusedxml` in to replace python's native XML parser which
   is known to be vulernable to billion laughs and XXE attacks. I left
   the native Python module in place for PEP484 stuff and didn't touch
   any of the legacy uses per the CONTRIBUTION.md file.

2. Added memory caps for XML and Archive items to help prevent
   parser-side issues:
   - ePub/CBZ/CBR/CBT ComicInfo.xml: These can manifest as zip-bombs
     before `defusedxml` can save us.
   - .mdp headers: `bin_header[1]` is a 32-bit read from the file, but
     uses an unbounded size of whatever `f.read()` is on the system.
   - .pdn headers: Same as `.mdb`, but 24-bits
   - DupeGuru XML results: A multi-gig results file shipped entirely
     into memory will cause issues and was only bounded by available system
     memory

Each of these are now compared against a byte cap. The caps have been
added to a new "Advanced" tab in the settings window. If a user needs to
bump the cap up for any reason, that's the place to do it.

The following defaults have been added to `core/constants.py`:
  - comic_info_max_mb (default 1)
  - mdp_header_max_mb (default 1 since .mdp header_size has a 32-bit ceiling)
  - pdn_header_max_mb (default 16 since .pdn header_size has a 24-bit ceiling)
  - dupe_results_max_mb (default 128)

The ePub/CBZ/CBR/CBT ComicInfo.xml, `.mdp`, and `.pdn` paths will toss a
ValueError, which is logged as `Couldn't render thumbnail` with a note
where the user can update the cap if they need to.

As a part of tis change, `_epub_cover`, `_mdp_thumb`, and `_pdn_thumb`
are no longer @staticmethod because they now read self.driver.settings;
all existing call sites already used self.<method>(...), so this is a
no-op for callers.

Added tests for Dupe results, `.mdp` and `.pdn` guardrails. Updated
GlobalSettings roundtrips to make sure the new configurations are
working.

I didn't add a test for ComicInfo because I'm lazy and adding a 1MB file
to the repo just for testing seemed like overkill.

Tests: 234 passed (plus 5 new ones!). Ruff and mypy passed.
@reddraconi reddraconi changed the title Sec: Cap untrusted XML/archive reads to bound memory use fix(security): Cap untrusted XML/archive reads to bound memory use Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant