Copilot commented on code in PR #3932:
URL: https://github.com/apache/solr/pull/3932#discussion_r2601701286
##########
solr/modules/extraction/src/resources/solr-default-tika-config.xml:
##########
@@ -17,4 +17,19 @@
-->
<properties>
<service-loader initializableProblemHandler="ignore"/>
+ <parsers>
+ <!--
+ Security hardening: Disable XFA parsing in PDFs to mitigate CVE-2025-54988.
+ This prevents XXE (XML External Entity) injection attacks via crafted XFA content.
+ -->
+ <parser class="org.apache.tika.parser.DefaultParser">
+ <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
+ </parser>
+ <parser class="org.apache.tika.parser.pdf.PDFParser">
+ <params>
+ <param name="extractAcroFormContent" type="bool">false</param>
+ <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
+ </params>
Review Comment:
The PR description states this approach "really targets Tika 2.x" and hasn't
been tested, but this codebase uses Tika 1.28.5 (see versions.props). The
parameter names `extractAcroFormContent` and `ifXFAExtractOnlyXFA` may not be
compatible with Tika 1.x's configuration format.
Additionally, the existing codebase configures PDFParser through
`parseContext.xml` using `PDFParserConfig` (see
solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/parseContext.xml),
which is a different mechanism from the `tika-config.xml` params approach used
here.
If Tika 1.x does not recognize these params, the CVE-2025-54988 mitigation
would be ineffective. Please verify that:
1. These parameter names are valid for Tika 1.28.5
2. The parser exclusion/re-addition pattern works correctly with Tika 1.x
3. The configuration actually disables XFA parsing as intended
Consider testing with a PDF containing XFA content to verify the mitigation
is effective.
```suggestion
<!--
XFA parsing must be disabled via parseContext.xml using
PDFParserConfig for Tika 1.x.
Example (parseContext.xml):
<bean id="org.apache.tika.parser.pdf.PDFParserConfig"
class="org.apache.tika.parser.pdf.PDFParserConfig">
<property name="ifXFAExtractOnlyXFA" value="false"/>
<property name="extractAcroFormContent" value="false"/>
</bean>
-->
```
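For comparison, here is a minimal sketch of the same two settings in the `<entries>`/`<entry>` layout that, as far as I can tell, Solr's existing `parseContext.xml` files use instead of a `<bean>` element. The layout and the assumption that each `<property>` maps to a setter on `PDFParserConfig` are taken from the referenced test file's conventions and have not been verified against Tika 1.28.5, so treat this as a starting point for testing rather than a confirmed fix.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: element layout is assumed from Solr's existing
     parseContext.xml test file; untested against Tika 1.28.5. -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Assumption: each property name maps to a PDFParserConfig setter -->
    <property name="extractAcroFormContent" value="false"/>
    <property name="ifXFAExtractOnlyXFA" value="false"/>
  </entry>
</entries>
```

Whichever mechanism is chosen, extracting a PDF that contains XFA content and checking that no XFA text appears in the output would confirm whether the settings take effect.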
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]