Copilot commented on code in PR #3780:
URL: https://github.com/apache/solr/pull/3780#discussion_r2437382078
##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc:
##########
@@ -355,6 +410,40 @@ See the section <<Indexing Encrypted Documents>> for more information about usin
 +
 Example: `passwordsFile=/path/to/passwords.txt`
+
+// TODO: Feature exists but keep undocumented for now
+// `tikaserver.metadata.compatibility`::
+// +
+// [%autowidth,frame=none]
+// |===
+// |Optional |Default: false
+// |===
+// +
+// When enabled, Solr Cell tries to map some common metadata to other common names, e.g. `dc:author` is mapped also to `Author`. This can be useful if switching from `local` to `tikaserver` backend, since `tikaserver` uses more industry standard name-spaced metadata keys.
+// +
+// Only applicable for `tikaserver` backend. Can only be set in `solrconfig.xml`, not per request.
+
+`tikaserver.maxChars`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: 100 MBytes
+|===
++
+Sets a hard limit on the number of bytes Solr will accept from the Tika Server response body when using the `tikaserver` backend. If the extracted content exceeds this limit, the request will fail with HTTP 400 (Bad Request).
++
+Only applicable for the `tikaserver` backend. This parameter can only be configured in the request handler configuration (`solrconfig.xml`), not per request.
++
+Example: In `solrconfig.xml`: `<long name="tikaserver.maxChars">1000000</long>`

Review Comment:
   Documentation states the limit is on 'bytes' but the parameter name is 'maxChars' and the default description says '100 MBytes'. This is confusing - clarify whether this limits bytes or characters, and ensure parameter name matches its purpose.

```suggestion
|Optional |Default: 100 million characters
|===
+
Sets a hard limit on the number of characters Solr will accept from the Tika Server response body when using the `tikaserver` backend. If the extracted content exceeds this limit, the request will fail with HTTP 400 (Bad Request).
+
Only applicable for the `tikaserver` backend. This parameter can only be configured in the request handler configuration (`solrconfig.xml`), not per request.
+
Example: In `solrconfig.xml`: `<long name="tikaserver.maxChars">1000000</long>` (sets the limit to 1 million characters)
```

##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc:
##########
@@ -355,6 +410,40 @@ See the section <<Indexing Encrypted Documents>> for more information about usin
++
+Example: In `solrconfig.xml`: `<long name="tikaserver.maxChars">1000000</long>`
+
+`tikaserver.recursive`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: false
+|===
++
+**Only applicable for the `tikaserver` backend`.** For `local` backend parsing is always recursive, i.e. extracts text from embedded documents. For `tikaserver` you have to enable it explicitly.

Review Comment:
   Extra backtick after 'backend' causing incorrect formatting.
```suggestion
**Only applicable for the `tikaserver` backend.** For `local` backend parsing is always recursive, i.e. extracts text from embedded documents. For `tikaserver` you have to enable it explicitly.
```

##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java:
##########
@@ -0,0 +1,450 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.handler.extraction;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.ConnectException;
+import java.net.SocketTimeoutException;
+import java.nio.channels.ClosedChannelException;
+import java.time.Duration;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.ThreadFactory;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.TimeoutException;
+import org.apache.solr.common.SolrException;
+import org.apache.solr.common.util.ExecutorUtil;
+import org.apache.solr.common.util.NamedList;
+import org.apache.solr.common.util.SolrNamedThreadFactory;
+import org.apache.solr.handler.extraction.fromtika.BodyContentHandler;
+import org.apache.solr.util.RefCounted;
+import org.eclipse.jetty.client.HttpClient;
+import org.eclipse.jetty.client.api.Request;
+import org.eclipse.jetty.client.api.Response;
+import org.eclipse.jetty.client.util.InputStreamRequestContent;
+import org.eclipse.jetty.client.util.InputStreamResponseListener;
+import org.eclipse.jetty.io.EofException;
+import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler;
+import org.xml.sax.helpers.DefaultHandler;
+
+/** Extraction backend using the Tika Server. It uses a shared Jetty HttpClient. */
+public class TikaServerExtractionBackend implements ExtractionBackend {
+  /**
+   * Default maximum response size (100MB) to prevent excessive memory usage from large documents
+   */
+  public static final long DEFAULT_MAXCHARS_LIMIT = 100 * 1024 * 1024;
+
+  private static final Object INIT_LOCK = new Object();
+  private final String baseUrl;
+  private static final int DEFAULT_TIMEOUT_SECONDS = 3 * 60;
+  private final Duration defaultTimeout;
+  private final TikaServerParser tikaServerResponseParser = new TikaServerParser();
+  private boolean tikaMetadataCompatibility;
+  private HashMap<String, Object> initArgsMap = new HashMap<>();
+  private final long maxCharsLimit;
+
+  // Singleton holder for the shared HttpClient/Executor resources (one per JVM)
+  private static volatile RefCounted<HttpClientResources> SHARED_RESOURCES;
+  // Per-backend handle (same RefCounted instance as SHARED_RESOURCES) that this instance will
+  // decref() on close
+  private RefCounted<HttpClientResources> acquiredResourcesRef;
+
+  public TikaServerExtractionBackend(String baseUrl) {
+    this(baseUrl, DEFAULT_TIMEOUT_SECONDS, null, DEFAULT_MAXCHARS_LIMIT);
+  }
+
+  public TikaServerExtractionBackend(
+      String baseUrl, int timeoutSeconds, NamedList<?> initArgs, long maxCharsLimit) {
+    // Validate baseUrl
+    if (baseUrl == null || baseUrl.trim().isEmpty()) {
+      throw new IllegalArgumentException("baseUrl cannot be null or empty");
+    }
+    // Validate URL format and scheme
+    try {
+      java.net.URI uri = new java.net.URI(baseUrl);
+      String scheme = uri.getScheme();
+      if (scheme == null
+          || (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https"))) {
+        throw new IllegalArgumentException(
+            "baseUrl must use http or https scheme, got: " + baseUrl);
+      }
+      uri.toURL(); // Additional validation that it's a valid URL
+    } catch (java.net.URISyntaxException | java.net.MalformedURLException e) {
+      throw new IllegalArgumentException("Invalid baseUrl: " + baseUrl, e);
+    }
+
+    this.maxCharsLimit = maxCharsLimit;
+    if (initArgs != null) {
+      initArgs.toMap(this.initArgsMap);
+    }
+    Object metaCompatObh = this.initArgsMap.get(ExtractingParams.TIKASERVER_METADATA_COMPATIBILITY);
+    if (metaCompatObh != null) {
+      this.tikaMetadataCompatibility = Boolean.parseBoolean(metaCompatObh.toString());
+    }
+    if (timeoutSeconds <= 0) {
+      timeoutSeconds = DEFAULT_TIMEOUT_SECONDS;
+    }
+    if (baseUrl.endsWith("/")) {
+      this.baseUrl = baseUrl.substring(0, baseUrl.length() - 1);
+    } else {
+      this.baseUrl = baseUrl;
+    }
+    this.defaultTimeout =
+        Duration.ofSeconds(timeoutSeconds > 0 ? timeoutSeconds : DEFAULT_TIMEOUT_SECONDS);
+
+    // Acquire a reference to the shared resources; keep a handle so we can decref() on close
+    acquiredResourcesRef = initializeHttpClient().incref();
+  }
+
+  public static final String NAME = "tikaserver";
+
+  @Override
+  public String name() {
+    return NAME;
+  }
+
+  @Override
+  public ExtractionResult extract(InputStream inputStream, ExtractionRequest request)
+      throws Exception {
+    try (InputStream tikaResponse = callTikaServer(inputStream, request)) {
+      ExtractionMetadata md = buildMetadataFromRequest(request);
+      BodyContentHandler bodyContentHandler = new BodyContentHandler(-1);
+      if (request.tikaServerRecursive) {
+        tikaServerResponseParser.parseRmetaJson(tikaResponse, bodyContentHandler, md);
+      } else {
+        tikaServerResponseParser.parseXml(tikaResponse, bodyContentHandler, md);
+      }
+      if (tikaMetadataCompatibility) {
+        appendBackCompatTikaMetadata(md);
+      }
+      return new ExtractionResult(bodyContentHandler.toString(), md);
+    }
+  }
+
+  @Override
+  public void extractWithSaxHandler(
+      InputStream inputStream,
+      ExtractionRequest request,
+      ExtractionMetadata md,
+      DefaultHandler saxContentHandler)
+      throws Exception {
+    try (InputStream tikaResponse = callTikaServer(inputStream, request)) {
+      if (request.tikaServerRecursive) {
+        tikaServerResponseParser.parseRmetaJson(tikaResponse, saxContentHandler, md);
+      } else {
+        tikaServerResponseParser.parseXml(tikaResponse, saxContentHandler, md);
+      }
+      if (tikaMetadataCompatibility) {
+        appendBackCompatTikaMetadata(md);
+      }
+    }
+  }
+
+  /**
+   * Call the Tika Server to extract text and metadata. Depending on <code>request.recursive</code>,
+   * will either return XML (false) or JSON array (true). <b>The recursive mode consumes more memory
+   * both on the TikaServer side and on the Solr side</b>
+   *
+   * @return InputStream of the response body, either XML or JSON depending on <code>
+   *     request.tikaserverRecursive</code>

Review Comment:
   Typo in method javadoc - 'tikaserverRecursive' should be 'tikaServerRecursive' (capital S in Server).

```suggestion
   *     request.tikaServerRecursive</code>
```

##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java:
##########
@@ -0,0 +1,450 @@
+  /**
+   * Call the Tika Server to extract text and metadata. Depending on <code>request.recursive</code>,
+   * will either return XML (false) or JSON array (true). <b>The recursive mode consumes more memory
+   * both on the TikaServer side and on the Solr side</b>
+   *
+   * @return InputStream of the response body, either XML or JSON depending on <code>
+   *     request.tikaserverRecursive</code>

Review Comment:
   Javadoc references `request.recursive` but the actual field is `request.tikaServerRecursive`. Update documentation to use the correct field name.

```suggestion
   * Call the Tika Server to extract text and metadata. Depending on <code>request.tikaServerRecursive</code>,
   * will either return XML (false) or JSON array (true). <b>The recursive mode consumes more memory
   * both on the TikaServer side and on the Solr side</b>
   *
   * @return InputStream of the response body, either XML or JSON depending on <code>
   *     request.tikaServerRecursive</code>
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
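The chars-vs-bytes distinction raised in the first review comment can be illustrated with a short standalone snippet (not part of the PR; the class name and sample string are hypothetical). In Java, `String.length()` counts UTF-16 code units while the size of the HTTP response body is measured in bytes, so the two figures diverge for non-ASCII content, and `100 * 1024 * 1024` is a MiB-style count rather than a round "100 million":

```java
import java.nio.charset.StandardCharsets;

public class CharsVsBytes {
  public static void main(String[] args) {
    // "héllo" is 5 characters (UTF-16 code units) but 6 bytes in UTF-8,
    // because 'é' encodes to two bytes.
    String s = "h\u00e9llo";
    System.out.println(s.length());                                // 5
    System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6

    // The PR's default, 100 * 1024 * 1024, equals 104,857,600 --
    // "100 MBytes" if counted in bytes, not exactly "100 million characters".
    System.out.println(100L * 1024 * 1024);                        // 104857600
  }
}
```

This is why the suggestion asks the docs to commit to one unit: whichever unit the limit is actually enforced in, the default description and the parameter name should agree with it.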
