Re: [PR] SOLR-17961 Remove deprecated Tika Extraction Backend [solr]

via GitHub Mon, 20 Oct 2025 00:53:13 -0700


Copilot commented on code in PR #3784:
URL: https://github.com/apache/solr/pull/3784#discussion_r2444069685



##########
solr/solr-ref-guide/modules/indexing-guide/pages/post-tool.adoc:
##########
@@ -134,6 +134,8 @@ The 
xref:indexing-with-update-handlers.adoc#csv-formatted-index-updates[CSV hand
 
 Index a PDF file into `gettingstarted`.
 
+NOTE: This requires a Tika Serer to be configured. See 
xref:indexing-with-tika.adoc#tika-server[Indexing With Tika] for details.

Review Comment:
   Typo: 'Tika Serer' should be 'Tika Server'.
   ```suggestion
   NOTE: This requires a Tika Server to be configured. See 
xref:indexing-with-tika.adoc#tika-server[Indexing With Tika] for details.
   ```



##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc:
##########
@@ -54,29 +54,27 @@ This is provided via the `extraction` 
xref:configuration-guide:solr-modules.adoc
 The "techproducts" example included with Solr is pre-configured to have Solr 
Cell configured.
 If you are not using the example, you will want to pay attention to the 
section <<solrconfig.xml Configuration>> below.
 
-== Tika Extraction Backends
+== Extraction Backends
 
-There are two backends for this module. The `local` backend embeds Tika inside 
Solr's own process, while the `tikaserver` backend uses an external Tika server 
process to do the extraction.
+The ExtractionRequestHandler supports multiple backends, selectable with the 
`extraction.backend` parameter. The only backend currently supported is the 
`tikaserver` backend, which uses an external Tika server process to do the 
extraction.

Review Comment:
   Class name is incorrect; it should be 'ExtractingRequestHandler' (with 
'ing'). Please update the reference.
   ```suggestion
   The ExtractingRequestHandler supports multiple backends, selectable with the 
`extraction.backend` parameter. The only backend currently supported is the 
`tikaserver` backend, which uses an external Tika server process to do the 
extraction.
   ```



##########
solr/solr-ref-guide/modules/getting-started/pages/tutorial-diy.adoc:
##########
@@ -53,6 +53,9 @@ Local Files with `bin/solr post`::
 If you have a local directory of files, the Post Tool (`bin/solr post`) can 
index a directory of files.
 We saw this in action in our first exercise.
 +
+// NOCOMMIT: The user will need to add /update/extract handler?
+// TODO: The user will need to start a Tika server

Review Comment:
   Leftover editorial markers ('NOCOMMIT'/'TODO') should be removed or resolved 
before publishing the docs.
   ```suggestion
   Note: To index rich document formats (such as PDF, Microsoft Office files, 
etc.), you may need to enable the `/update/extract` handler in your Solr 
configuration and ensure that a Tika server is available.
   ```



##########
solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml:
##########
@@ -151,14 +151,13 @@
   </requestHandler>
 
   <requestHandler name="/update/extract" 
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
-    <str name="parseContext.config">parseContext.xml</str>
-    <str name="extraction.backend">${solr.test.extraction.backend:local}</str>
+    <str 
name="extraction.backend">${solr.test.extraction.backend:tikaserver}</str>
     <str name="tikaserver.url">${solr.test.tikaserver.url:}</str>
     <str 
name="tikaserver.metadata.compatibility">${solr.test.tikaserver.metadata.compatibility:false}</str>
   </requestHandler>
 
   <requestHandler name="/update/extract/lit-def" 
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
-    <str name="extraction.backend">${solr.test.extraction.backend:local}</str>
+    <str 
name="extraction.backend">${solr.test.extraction.backend:tikaserver}</str>
     <str name="tikaserver.url">${solr.test.tikaserver.url:}</str>
     <str 
name="tikaserver.metadata.compatibility">${solr.test.tikaserver.metadata.compatibility:false}</str>
     <lst name="defaults">

Review Comment:
   With the handler now requiring a non-empty 'tikaserver.url', the empty 
default of the 'solr.test.tikaserver.url' property will cause handler 
initialization to fail at core load. Either add startup="lazy" to these 
handlers, give the property a non-empty default, or ensure the tests set 
'solr.test.tikaserver.url'.
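
   A minimal, untested sketch of the two configuration-side options. It 
assumes the standard `startup="lazy"` attribute on `<requestHandler>`, and it 
uses `http://localhost:9998` only as a placeholder fallback URL that is not 
taken from this PR. Whether lazy startup avoids the failure entirely, or only 
defers it to the first request, depends on where the URL check runs.
   ```xml
   <!-- Option 1: defer handler initialization until its first request. -->
   <requestHandler name="/update/extract" startup="lazy"
                   class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
     <str name="extraction.backend">${solr.test.extraction.backend:tikaserver}</str>
     <str name="tikaserver.url">${solr.test.tikaserver.url:}</str>
   </requestHandler>

   <!-- Option 2: non-empty fallback; the URL is an assumed placeholder. -->
   <requestHandler name="/update/extract/lit-def"
                   class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
     <str name="extraction.backend">${solr.test.extraction.backend:tikaserver}</str>
     <str name="tikaserver.url">${solr.test.tikaserver.url:http://localhost:9998}</str>
   </requestHandler>
   ```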



##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc:
##########
@@ -589,26 +540,9 @@ So you can use the other URPs without worrying about 
unexpected field additions.
 
 === Parser-Specific Properties
 
-NOTE: This setting currently applies to the `local` backend only. When using 
`tikaserver` you can configure similar settings on the Tika Server side.
-
-Parsers used by Tika may have specific properties to govern how data is 
extracted.
-These can be passed through Solr for special parsing situations.
-
-For instance, when using the Tika library from a Java program, the 
`PDFParserConfig` class has a method `setSortByPosition(boolean)` that can 
extract vertically oriented text.
-To access that method via configuration with the `ExtractingRequestHandler`, 
one can add the `parseContext.config` property to `solrconfig.xml` and then set 
properties in Tika's `PDFParserConfig` as in the example below.
-
-[source,xml]
-----
-<entries>
-  <entry class="org.apache.tika.parser.pdf.PDFParserConfig" 
impl="org.apache.tika.parser.pdf.PDFParserConfig">
-    <property name="extractInlineImages" value="true"/>
-    <property name="sortByPosition" value="true"/>
-  </entry>
-  <entry>...</entry>
-</entries>
-----
+Parser-specific properties for Tika must be configured directly on your Tika 
Server instance. Consult the https://tika.apache.org/[Apache Tika 
documentation] documentation of this.

Review Comment:
   Redundant wording: 'documentation documentation'. Suggest: 'Consult the 
Apache Tika documentation for details.'
   ```suggestion
   Parser-specific properties for Tika must be configured directly on your Tika 
Server instance. Consult the https://tika.apache.org/[Apache Tika 
documentation] for details.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

