Harden XML configuration parsing with commons-xml#4162
Draft
ppkarwasz wants to merge 5 commits into
Draft
Conversation
Replace the hand-rolled `DocumentBuilderFactory` hardening in `XmlConfiguration` with the hardened JAXP factories from `eu.copernik:copernik-xml-factory`, and harden the schema-validation path the same way: * `newDocumentBuilder` uses `XmlFactories.newDocumentBuilderFactory()`, * `validateDocument` uses `XmlFactories.newSchemaFactory()`. By the library's contract these factories, and everything they produce, never fetch external entities, DTDs or schemas, across the JAXP implementations the library recognizes. The OSGi integration tests link the `eu.copernik.xml.factory` bundle. Assisted-By: Claude Opus 4.8 <noreply@anthropic.com>
4 tasks
Assisted-By: Claude Opus 4.8 <noreply@anthropic.com>
`XmlConfigurationSchemaTest` exercises schema validation whose `xsd:include`/`import` resources are resolved through an `LSResourceResolver`. On JDK 8 Xerces enforces the `accessExternalSchema` restriction on the resources the resolver returns; the `isCreatedByResolver` exemption that lets resolver-supplied resources bypass that check was only added in JDK 9. copernik-xml-factory sets `accessExternalSchema` to "" at API precedence (deliberately not overridable by a system property), so this resolution path cannot run on JDK 8. Skip the whole class on JDK 8 via a `@BeforeEach` assumption. The previous guard only covered `schemaIncludeRespectsAllowedProtocols` (which already passed on JDK 8), leaving the actually-failing `validConfigurationValidates` unguarded. Assisted-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note
Stacked on #4161.
Since Commons XML has not yet been released, this PR is based on
copernik-xml-factory. The target dependency will be swapped after the first release.Replaces Log4j's hand-rolled
DocumentBuilderFactoryhardening witheu.copernik:copernik-xml-factory, and hardens the schema path, which until #4161 used a defaultSchemaFactory.The change to
XmlConfigurationis two method calls (XmlFactories.newDocumentBuilderFactory()andXmlFactories.newSchemaFactory()), plus the build, OSGi and JPMS plumbing.The point of this PR is not the lines it removes; it is the hardening Log4j is missing today, and the cost of writing and owning that hardening ourselves.
This is a probe PR
This PR exists to evaluate the library in a real consumer. It should not be merged before the first release of Apache Commons XML, into which the library's code has migrated and is being reviewed Apache-style. At that point the dependency is no longer "third party": it is an ASF-governed artifact with the same provenance as the rest of the Log4j supply chain. Until then, treat this as a spike that measures whether the integration is clean, and whether the hardening is worth a managed dependency.
What Log4j fails to harden today
The current
XmlConfigurationhardening isdisableDtdProcessing()+setFeature(), 19 lines, and it is incomplete:FEATURE_SECURE_PROCESSINGis never set, so on an external Xerces, Billion-Laughs-class entity expansion is not bounded. External entity resolution is covered decently (external general and parameter entities and external DTD loading are disabled); entity expansion is not.In
2.26.0, XInclude is on by design, yetALLOWED_PROTOCOLSrestricts only the main configuration file and not the files it includes. The main document is policed; its includes are not.SchemaFactoryis not hardened at all,neither against entity expansion nor against external resource access (external DTD/schema). A
strictconfiguration validating against a schema parses that schema with stock, unhardened JAXP.The honest line count
Before counting what the library removes, count what it takes to fill these gaps.
the
ConfigurationSourceResolverand the schema-validation routing, ~190 lines inXmlConfigurationalone.The library does not remove this and does not claim to: per its current threat model, installing a resolver makes resolution the caller's responsibility, so this stays Log4j's regardless (the next revision of that model is expected to lift the responsibility).
FEATURE_SECURE_PROCESSINGon both factories.That is the one feature the JAXP specification requires every implementation to support, and yet not every implementation honors it (Android, for one), so even this single line is not portable on its own, before one even reaches the per-provider entity, DTD and external-access settings.
So the real comparison is not "library vs. 10 lines". It is two hardened-factory calls against a hand-rolled, single-provider, partial hardening that Log4j would have to complete, test, and keep current.
The tests come with it
The hardening Log4j would otherwise write and maintain already exists in the library, tested. Restricting the library's suite to the two JAXP factories Log4j uses:
Those 90 parameterized security cases exercise
DocumentBuilderFactoryandSchemaFactorydirectly:Billion Laughs, external general and parameter entities, external DTD, doctype handling, the entity-resolver floor, XInclude, and schemaimport/include/redefine/schemaLocation, across providers. They are backed by ~2,800 lines of test code and their leaked-resource fixtures, against a library of ~3,000 lines of hardening logic.Put next to the handful of lines of hardening Log4j would otherwise inline, that is two to three orders of magnitude of security code and tests that adopting the library imports, and that reimplementing the hardening would mean writing and owning. Reimplementing the hardening means reimplementing the tests too.
Lightning rod
GHSA-xm28-xvqc-gxxg
(High, CVSS 8.2, fixed in
0.1.2) is precisely Log4j's path:newDocumentBuilderFactory()+setXIncludeAware(true)on the stock JDK provider.It shows that even someone well versed in JAXP can get a detail slightly wrong, here believing that XInclude fallback resolution is governed by
ACCESS_EXTERNAL_DTDorACCESS_EXTERNAL_SCHEMA.The relevant point for Log4j is not the bug; it is where the report landed. A dedicated, threat-modeled library acts as a lightning rod: it attracts the scanners and the researchers and triages them against a published model. More often than not it can reject a report on every consumer's behalf; when a report is valid, it ships one fix for everybody.
A consuming project can then meet the steady stream of XXE scanner reports with a single response that points at the library's threat model, instead of re-triaging each one in its own parser code.
Scope of this PR
XmlFactories.newDocumentBuilderFactory()innewDocumentBuilder(removesdisableDtdProcessing/setFeature).XmlFactories.newSchemaFactory()invalidateDocument.log4j-osgi-test) wiring.ALLOWED_PROTOCOLSon includes and schema imports) is already covered by the tests added in Resolve XML configuration resources through ConfigurationSource #4161.The changelog
<issue>will be set to this PR number once assigned.Adoption modalities
Commons XML does not have to be an external dependency. In apache/commons-xml#5 the
DocumentBuilderFactory-relevant code was moved into a single class, so Log4j can shade and relocate just that part. An early attempt (withminimizeJarenabled) pulled about 30% of the library into Log4j, a figure that can certainly be improved.Shading and relocating part of a library, rather than all of it, is the intended use of the Maven Shade Plugin, one that even I, a fervent opponent of shading, approve of.