Re: [PR] TIKA-4746 -- sweep docs [tika]

via GitHub Mon, 01 Jun 2026 05:40:59 -0700


Copilot commented on code in PR #2852:
URL: https://github.com/apache/tika/pull/2852#discussion_r3334170529



##########
tika-plugins-core/src/main/java/org/apache/tika/plugins/ThreadSafeUnzipper.java:
##########
@@ -71,6 +71,19 @@ public static void unzipPlugin(Path source) throws IOException {
             return;
         }
 
+        // Destination exists but has no completion marker. Possible causes:
+        // a previous extraction was killed mid-stream, the marker was deleted
+        // out from under us, or something other than our extractor put files
+        // there. Without this cleanup the subsequent Files.move() below will
+        // fail with DirectoryNotEmptyException on every run until a human
+        // manually removes the directory. Treat the half-extracted state as
+        // garbage and rebuild.
+        if (Files.exists(destination)) {
+            LOG.warn("destination {} exists without a completion marker; "
+                    + "treating as stale partial extraction and removing", destination);
+            deleteRecursively(destination);
+        }

Review Comment:
   If deleteRecursively(destination) fails to fully remove the stale 
destination (e.g., due to Windows file locks), the subsequent Files.move() will 
keep failing and the code will throw a misleading timeout from 
waitForExtractionComplete(). Consider verifying deletion succeeded and failing 
fast with a clear IOException when it did not.
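
A minimal sketch of the fail-fast idea, under stated assumptions: `removeStaleDestination` and the class name are hypothetical and are not the PR's `deleteRecursively()`, whose actual error handling is not shown in this diff. The final existence check is the part that matters when the deleter is best-effort and may leave files behind.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class StaleDestinationCleanup {

    // Hypothetical helper: delete a stale directory tree, then verify that
    // nothing survived. Throws IOException instead of letting a later
    // Files.move() fail and surface as an unrelated timeout.
    public static void removeStaleDestination(Path destination) throws IOException {
        if (Files.exists(destination)) {
            Files.walkFileTree(destination, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    Files.delete(file); // throws if the file cannot be removed
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                        throws IOException {
                    if (exc != null) {
                        throw exc;
                    }
                    Files.delete(dir); // directory is empty at this point
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        // Explicit post-condition: fail fast with a clear message if the
        // destination is still present for any reason.
        if (Files.exists(destination)) {
            throw new IOException("could not remove stale destination " + destination
                    + "; remove it manually and retry");
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stale-plugin");
        Files.writeString(dir.resolve("partial.txt"), "half-extracted");
        removeStaleDestination(dir);
        System.out.println("still exists: " + Files.exists(dir));
    }
}
```

If the existing deleter swallows per-file errors, the same post-condition check could follow the call to it without changing the deleter itself.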



##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -24,91 +24,203 @@ This section covers running Apache Tika as a REST server via `tika-server`.
 Tika Server provides a RESTful HTTP interface for parsing documents and extracting
 content. It can be deployed as a standalone service or in a containerized environment.
 
+In Tika 4.x, all parsing happens in forked child processes via the Tika Pipes
+infrastructure — the request-handling process never loads parser libraries directly.
+This provides process isolation (a parser crash or OOM cannot take down the server)
+at the cost of requiring a Pipes configuration. See
+xref:migration-to-4x/migrating-tika-server-4x.adoc[Migrating Tika Server to 4.x]
+for the full breaking-change list when upgrading from 3.x.

Review Comment:
   The overview claims *all* parsing happens in forked child processes and the 
request-handling process never loads parser libraries. However, some endpoints 
(e.g., `/meta`) still parse in-process via 
TikaResource.createParser()/TikaResource.parse(). This should be qualified to 
avoid misleading readers about isolation guarantees.



##########
docs/modules/ROOT/pages/pipes/troubleshooting.adoc:
##########
@@ -192,6 +192,40 @@ response-body bytes for HTTP-style fetchers (configurable via
 log catches the thrown exception. Lower `maxErrMsgSize` -- or set it to
 zero -- if your responses can contain sensitive data.
 
+== Logging
+
+Tika uses https://logging.apache.org/log4j/2.x/[Log4j 2] for both tika-app and tika-server. Default output goes to `SYSTEM_ERR` with the pattern `%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n`. Each forked PipesServer logs with its own line prefix so parent and child output stays distinguishable; see <<_telling_fork_lines_from_parent_lines,Telling fork lines from parent lines>>.
+
+=== Default log4j2 configuration
+
+Each distribution ships its own `log4j2.xml` bundled inside the jar:
+
+* tika-app: `org/apache/tika/cli/log4j2.xml` (in `tika-app-<version>.jar`).
+* tika-server: `org/apache/tika/server/log4j2.xml` (in the relevant `tika-server-*.jar`).
+

Review Comment:
   The documented locations of the bundled log4j2.xml files inside the jars 
don't match the actual resource paths in this repo (they are at the jar root as 
`log4j2.xml`, not under `org/apache/tika/...`). This makes it hard for users to 
extract/override the right config.
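
One way to check the documented paths against a built artifact is to list the matching entries in the jar. The sketch below is illustrative only: the class and method names are hypothetical, and the jar path is whatever distribution jar you pass in. It uses only the standard `java.util.jar.JarFile` API.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

public class FindBundledLog4jConfig {

    // Hypothetical helper: return every entry in the jar whose file name is
    // log4j2.xml, whether at the jar root or nested under a package path.
    public static List<String> findConfigEntries(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .map(JarEntry::getName)
                    .filter(n -> n.equals("log4j2.xml") || n.endsWith("/log4j2.xml"))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FindBundledLog4jConfig <path-to-jar>");
            return;
        }
        // Print each matching entry path so it can be compared with the docs.
        for (String name : findConfigEntries(Path.of(args[0]))) {
            System.out.println(name);
        }
    }
}
```

An entry printed as plain `log4j2.xml` sits at the jar root; anything else shows the package-style path the docs would need to name.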



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.