janhoy commented on code in PR #3670:
URL: https://github.com/apache/solr/pull/3670#discussion_r2404180680


##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/ExtractionRequest.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.handler.extraction;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/** Immutable request info needed by extraction backends. */
+public class ExtractionRequest {
+  public final String streamType; // explicit MIME type (optional)
+  public final String resourceName; // filename hint
+  public final String contentType; // HTTP content-type header
+  public final String charset; // derived charset if available
+  public final String streamName;
+  public final String streamSourceInfo;
+  public final Long streamSize;
+  public final String resourcePassword; // optional password for encrypted docs
+  public final java.util.LinkedHashMap<java.util.regex.Pattern, String>
+      passwordsMap; // optional passwords map
+  public final String extractFormat;
+  public final boolean recursive;
+  public final Map<String, String> tikaRequestHeaders = new HashMap<>();

Review Comment:
   The `recursive` and `tikaRequestHeaders` properties are only used by the 
TikaServer backend. The local backend has always parsed recursively by default. 
`tikaRequestHeaders` is tightly coupled to the TikaServer request and makes no 
sense for the local backend. It is currently only used to pass a header for a 
unit test.
   
   We could generify these so they make sense for any backend, but that is 
also somewhat inflexible, as there may be tons of request headers you could pass 
to TikaServer. Then again, there is no way for an end user to pass any header; 
it is all set in code, so the headers could probably be deduced from some 
`features` map instead...
   
   Alternatively, we could rename these to include `tikaServer` in their names 
and document them as applying to that backend only?
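   
   A rough sketch of what the rename option could look like. The class name 
`TikaServerOptions` and its field names are hypothetical and not part of this 
PR. It also copies the headers into an unmodifiable map, since the class 
javadoc says "Immutable" while `tikaRequestHeaders` is currently a mutable 
`HashMap`:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for options that only the TikaServer backend reads. */
final class TikaServerOptions {
  final boolean recursive;
  final Map<String, String> requestHeaders;

  TikaServerOptions(boolean recursive, Map<String, String> requestHeaders) {
    this.recursive = recursive;
    // Defensive copy wrapped as unmodifiable, so later changes to the
    // caller's map do not leak into the request object.
    this.requestHeaders =
        Collections.unmodifiableMap(new LinkedHashMap<>(requestHeaders));
  }
}
```

   `ExtractionRequest` could then carry a single, clearly named field of this 
type that the local backend simply ignores.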



##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/XmlSanitizingReader.java:
##########
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.handler.extraction;
+
+import java.io.IOException;
+import java.io.Reader;
+
+/**
+ * Minimal reader that drops only null numeric character entities from the stream.
+ *
+ * <p>Recognizes decimal and hexadecimal numeric entities that resolve to code point 0 (e.g. "&#0;",
+ * "&#00;", "&#x0;", "&#x0000;") and removes them. If a null entity is unterminated (no ';'), it is
+ * still removed when a non-entity character terminates the sequence. Everything else is passed
+ * through unchanged.
+ */
+final class XmlSanitizingReader extends Reader {

Review Comment:
   This reader feels like overkill just to avoid the null character entities 
(e.g. `&#0;`) that tika-server returns in some cases. I could do a simple 
string replace, but then I'd need to consume the entire stream first. I looked 
for commons libraries that could achieve the same, and checked whether the SAX 
parser could be configured to be lenient, but to no avail so far.
   
   I'd appreciate tips on how to get rid of the weight of this class.
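   
   For comparison, here is a minimal sketch of the string-replace alternative 
mentioned above. The class name `NullEntityStripper` and the regex are my own 
assumptions, not code from this PR. It only approximates the reader's 
behaviour, and it buffers the whole response in memory, which is exactly the 
trade-off in question:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Pattern;

/** Hypothetical non-streaming alternative: buffer everything, then regex-replace. */
final class NullEntityStripper {
  // Decimal or hex numeric entities whose digits are all zero. The lookaheads
  // keep entities such as "&#01;" or "&#x0A;" intact; the ';' is optional.
  private static final Pattern NULL_ENTITY =
      Pattern.compile("&#(?:0+(?![0-9])|[xX]0+(?![0-9a-fA-F]));?");

  static String strip(String xml) {
    return NULL_ENTITY.matcher(xml).replaceAll("");
  }

  /** Reads the entire input into memory before filtering it. */
  static Reader sanitize(Reader in) throws IOException {
    StringWriter buffer = new StringWriter();
    in.transferTo(buffer); // Reader.transferTo requires Java 10 or later
    return new StringReader(strip(buffer.toString()));
  }
}
```

   This is much less code, but memory use grows with the size of the extracted 
document, so it may not be acceptable for large files.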



##########
solr/modules/extraction/gradle.lockfile:
##########
@@ -99,7 +102,8 @@ javax.inject:javax.inject:1=annotationProcessor,errorprone,testAnnotationProcess
 javax.measure:unit-api:1.0=compileClasspath,jarValidation,runtimeClasspath,runtimeLibs,testCompileClasspath,testRuntimeClasspath
 joda-time:joda-time:2.14.0=compileClasspath,jarValidation,runtimeClasspath,runtimeLibs,testCompileClasspath,testRuntimeClasspath
 junit:junit:4.13.2=jarValidation,testCompileClasspath,testRuntimeClasspath
-net.java.dev.jna:jna:5.12.1=compileClasspath,jarValidation,runtimeClasspath,runtimeLibs,testCompileClasspath,testRuntimeClasspath
+net.java.dev.jna:jna:5.12.1=compileClasspath,runtimeClasspath,runtimeLibs
+net.java.dev.jna:jna:5.13.0=jarValidation,testCompileClasspath,testRuntimeClasspath

Review Comment:
   Is it expected to have two different versions of JNA here?



##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/TikaServerExtractionBackend.java:
##########
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.handler.extraction;
+
+import java.io.InputStream;
+import java.time.Duration;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.ThreadFactory;
+import java.util.concurrent.TimeUnit;
+import org.apache.solr.common.SolrException;
+import org.apache.solr.common.util.ExecutorUtil;
+import org.apache.solr.common.util.SolrNamedThreadFactory;
+import org.apache.tika.sax.BodyContentHandler;
+import org.eclipse.jetty.client.HttpClient;
+import org.eclipse.jetty.client.InputStreamRequestContent;
+import org.eclipse.jetty.client.InputStreamResponseListener;
+import org.eclipse.jetty.client.Request;
+import org.eclipse.jetty.client.Response;
+import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler;
+import org.xml.sax.helpers.DefaultHandler;
+
+/** Extraction backend using the Tika Server. It uses a shared Jetty HttpClient. */
+public class TikaServerExtractionBackend implements ExtractionBackend {
+  private static volatile HttpClient SHARED_CLIENT;
+  private static volatile ExecutorService SHARED_EXECUTOR;
+  private static final Object INIT_LOCK = new Object();
+  private static volatile boolean INITIALIZED = false;
+  private static volatile boolean SHUTDOWN = false;
+  private final String baseUrl; // e.g., http://localhost:9998
+  private final Duration timeout = Duration.ofMinutes(3);

Review Comment:
   Should we make the timeout configurable?
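   
   Something along these lines might work, keeping the current 3 minutes as 
the default. The setting name `tikaserver.timeoutSeconds` and the helper class 
are hypothetical, and how the value would actually reach the backend (handler 
init args, system property, ...) is left open:

```java
import java.time.Duration;
import java.util.Map;

/** Hypothetical helper that resolves the TikaServer request timeout. */
final class TikaServerTimeouts {
  static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(3);

  /** Returns the configured timeout, or the default when the setting is absent. */
  static Duration resolve(Map<String, String> config) {
    String raw = config.get("tikaserver.timeoutSeconds"); // assumed setting name
    if (raw == null || raw.isBlank()) {
      return DEFAULT_TIMEOUT;
    }
    long seconds = Long.parseLong(raw.trim());
    if (seconds <= 0) {
      throw new IllegalArgumentException("timeout must be positive: " + raw);
    }
    return Duration.ofSeconds(seconds);
  }
}
```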



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

