fix(security): Cap untrusted XML/archive reads to bound memory use#1342
Open
reddraconi wants to merge 2 commits intoTagStudioDev:mainfrom
Open
fix(security): Cap untrusted XML/archive reads to bound memory use#1342reddraconi wants to merge 2 commits intoTagStudioDev:mainfrom
reddraconi wants to merge 2 commits intoTagStudioDev:mainfrom
Conversation
The thumbnail rendered reads XMLs out of user-supplied files and the duplicate-files registry reads DupeGuru results. The native Python xml.etree.ElementTree is known to be vulnerable to Billion laughs attacks and other parsing-based issues. Malformed or malicious XMLs can cause the app to freeze or crash, so I swapped in defusedxml, since it's specifically designed to be a safe drop-in replacement. This fix updates the imports in `dupe_files_registry.py` and `renderer.py` to use defusedxml rather than `xml.etree.ElementTree`. Element is still imported from `xml.etree.Elementree` since defusedxml only replaces the parsers, not the data class stuff. We avoided `json/library.py`, since the CONTRIBUTING.md file said to leave legacy stuff alone.
Note: This updates the requirements.txt to add `defusedxml`!
Multiple small changes:
1. Swapped `defusedxml` in to replace python's native XML parser which
is known to be vulernable to billion laughs and XXE attacks. I left
the native Python module in place for PEP484 stuff and didn't touch
any of the legacy uses per the CONTRIBUTION.md file.
2. Added memory caps for XML and Archive items to help prevent
parser-side issues:
- ePub/CBZ/CBR/CBT ComicInfo.xml: These can manifest as zip-bombs
before `defusedxml` can save us.
- .mdp headers: `bin_header[1]` is a 32-bit read from the file, but
uses an unbounded size of whatever `f.read()` is on the system.
- .pdn headers: Same as `.mdb`, but 24-bits
- DupeGuru XML results: A multi-gig results file shipped entirely
into memory will cause issues and was only bounded by available system
memory
Each of these are now compared against a byte cap. The caps have been
added to a new "Advanced" tab in the settings window. If a user needs to
bump the cap up for any reason, that's the place to do it.
The following defaults have been added to `core/constants.py`:
- comic_info_max_mb (default 1)
- mdp_header_max_mb (default 1 since .mdp header_size has a 32-bit ceiling)
- pdn_header_max_mb (default 16 since .pdn header_size has a 24-bit ceiling)
- dupe_results_max_mb (default 128)
The ePub/CBZ/CBR/CBT ComicInfo.xml, `.mdp`, and `.pdn` paths will toss a
ValueError, which is logged as `Couldn't render thumbnail` with a note
where the user can update the cap if they need to.
As a part of tis change, `_epub_cover`, `_mdp_thumb`, and `_pdn_thumb`
are no longer @staticmethod because they now read self.driver.settings;
all existing call sites already used self.<method>(...), so this is a
no-op for callers.
Added tests for Dupe results, `.mdp` and `.pdn` guardrails. Updated
GlobalSettings roundtrips to make sure the new configurations are
working.
I didn't add a test for ComicInfo because I'm lazy and adding a 1MB file
to the repo just for testing seemed like overkill.
Tests: 234 passed (plus 5 new ones!). Ruff and mypy passed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two-part fix to TagStudio's XML/archive parsing, delivered as two commits in this PR to address #1341
NOTE: This adds a new dependency,
defusedxml.ElementTreeand updates the settings UI to include a new Advanced page for these (probably-not-often-touched settings).xml.etree.ElementTreetodefusedxml.ElementTreeThe stdlib parser is known to be unsafe on untrusted input. It happily expands DTD entities (billion-laughs /
OOM bombs), follows external-entity references that read local files or fetch over HTTP (XXE), and applies
no internal bounds.
defusedxmlis the recommended drop-in replacement that disables those parser behaviors by default. TheElementimport is kept from stdlib sincedefusedxmlonly ships replacement parsers, not data classes.We left the legacy
core/library/json/library.pyxml.etree.ElementTree.parsealone, perCONTRIBUTING.mdguidance. If we need to swap that too, we can.defusedxmlonly protects the parser. Since there's a handful of calls that happen before the parser can protect us, we'd like to cap those read sizes so a malicious or accidentally-large input doesn't OOM TagStudioEach path now checks the declared/on-disk size against a byte cap before allocating, raising
ValueErrorwith a message that names the exact GlobalSettings field to raise. Caps live as fields onGlobalSettings(canonical defaults incore/constants.py) so users can raise them if they need to:comic_info_max_mbComicInfo.xmlzip-bomb membersmdp_header_max_mb.mdp(MDIPACK) headers — 32-bit declared lengthpdn_header_max_mb.pdn(Paint.NET) headers — 24-bit declared length (matches format ceiling)dupe_results_max_mbET.parseA new Advanced tab in the settings panel exposes these settings. The renderer paths'
ValueErroris caught by each method's exception handler and throws asCouldn't render thumbnail error=ValueErrorlog line, so users hitting their cap get useful messages. The dupe registry raisesValueErrorbeforeET.parserunsand bubbles up to the caller in
fix_dupe_files.py._epub_cover,_mdp_thumb, and_pdn_thumbare no longer@staticmethodbecause they now readself.driver.settings. All existing call sites already usedself.<method>(...), so this should be a no-op for callers.Tasks Completed
pytest: All 231 tests still pass. I added 5 new testsruffandmypycame back clean.