Copilot commented on code in PR #3932:
URL: https://github.com/apache/solr/pull/3932#discussion_r2601701286


##########
solr/modules/extraction/src/resources/solr-default-tika-config.xml:
##########
@@ -17,4 +17,19 @@
   -->
 <properties>
   <service-loader initializableProblemHandler="ignore"/>
+  <parsers>
+    <!--
+      Security hardening: Disable XFA parsing in PDFs to mitigate CVE-2025-54988.
+      This prevents XXE (XML External Entity) injection attacks via crafted XFA content.
+    -->
+    <parser class="org.apache.tika.parser.DefaultParser">
+      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
+    </parser>
+    <parser class="org.apache.tika.parser.pdf.PDFParser">
+      <params>
+        <param name="extractAcroFormContent" type="bool">false</param>
+        <param name="ifXFAExtractOnlyXFA" type="bool">false</param>
+      </params>

Review Comment:
   The PR description states this approach "really targets Tika 2.x" and hasn't 
been tested, but this codebase uses Tika 1.28.5 (see versions.props). The 
parameter names `extractAcroFormContent` and `ifXFAExtractOnlyXFA` may not be 
compatible with Tika 1.x's configuration format.
   
   Additionally, the existing codebase configures PDFParser through 
`parseContext.xml` using `PDFParserConfig` (see 
solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/parseContext.xml), 
which is a different mechanism from the `tika-config.xml` params approach used 
here.
   
   This compatibility issue could render the CVE-2025-54988 mitigation 
ineffective. Please verify that:
   1. These parameter names are valid for Tika 1.28.5
   2. The parser exclusion/re-addition pattern works correctly with Tika 1.x
   3. The configuration actually disables XFA parsing as intended
   
   Consider testing with a PDF containing XFA content to verify the mitigation 
is effective.
   ```suggestion
         <!--
           XFA parsing must be disabled via parseContext.xml using PDFParserConfig for Tika 1.x.
           Example (parseContext.xml):
             <bean id="org.apache.tika.parser.pdf.PDFParserConfig" class="org.apache.tika.parser.pdf.PDFParserConfig">
               <property name="ifXFAExtractOnlyXFA" value="false"/>
               <property name="extractAcroFormContent" value="false"/>
             </bean>
         -->
   ```
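
   For comparison, here is a minimal sketch of a standalone `parseContext.xml` 
carrying the same two settings. It assumes the `<entries>`/`<entry>` layout 
described in the Solr Reference Guide for the extraction module, rather than the 
`<bean>` form shown in the suggestion above, and it assumes the two property 
names map onto `PDFParserConfig` setters in Tika 1.28.5. Neither assumption has 
been verified here; the layout should be checked against the existing test file 
referenced earlier in this comment.

   ```xml
   <?xml version="1.0" encoding="UTF-8"?>
   <!-- Sketch only: layout assumed from the Solr Reference Guide, untested against Tika 1.28.5. -->
   <entries>
     <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
            impl="org.apache.tika.parser.pdf.PDFParserConfig">
       <!-- Assumed to map to PDFParserConfig.setExtractAcroFormContent(false) -->
       <property name="extractAcroFormContent" value="false"/>
       <!-- Assumed to map to PDFParserConfig.setIfXFAExtractOnlyXFA(false) -->
       <property name="ifXFAExtractOnlyXFA" value="false"/>
     </entry>
   </entries>
   ```

   If this layout is right, the file would be wired into the extraction request 
handler through its `parseContext.config` parameter in `solrconfig.xml`. Whether 
these two settings actually stop XFA content from being parsed still needs to be 
confirmed with a PDF that contains XFA.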



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

