Copilot commented on code in PR #2852:
URL: https://github.com/apache/tika/pull/2852#discussion_r3334170529
##########
tika-plugins-core/src/main/java/org/apache/tika/plugins/ThreadSafeUnzipper.java:
##########
@@ -71,6 +71,19 @@ public static void unzipPlugin(Path source) throws IOException {
return;
}
+ // Destination exists but has no completion marker. Possible causes:
+ // a previous extraction was killed mid-stream, the marker was deleted
+ // out from under us, or something other than our extractor put files
+ // there. Without this cleanup the subsequent Files.move() below will
+ // fail with DirectoryNotEmptyException on every run until a human
+ // manually removes the directory. Treat the half-extracted state as
+ // garbage and rebuild.
+ if (Files.exists(destination)) {
+ LOG.warn("destination {} exists without a completion marker; "
+ + "treating as stale partial extraction and removing", destination);
+ deleteRecursively(destination);
+ }
Review Comment:
If deleteRecursively(destination) does not fully remove the stale
destination (for example, because of Windows file locks), the subsequent
Files.move() will keep failing, and the caller will eventually see a misleading
timeout from waitForExtractionComplete() instead of the real cause. Consider
verifying that the deletion succeeded and failing fast with a clear IOException
when it did not.
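
A minimal sketch of the fail-fast check suggested above, not the PR's actual
code. The class name `StaleDestinationCleanup` and the helper `deleteOrFail`
are hypothetical; the real `deleteRecursively` in ThreadSafeUnzipper may have a
different signature and behavior. The sketch uses only standard
`java.nio.file` calls: it deletes the tree, then confirms the path is gone and
throws an IOException that names the directory if it is not.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class StaleDestinationCleanup {

    // Hypothetical helper: remove the tree rooted at dir, then verify it is gone.
    static void deleteOrFail(Path dir) throws IOException {
        if (Files.exists(dir)) {
            try {
                Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                            throws IOException {
                        Files.delete(file); // throws if the file is locked or undeletable
                        return FileVisitResult.CONTINUE;
                    }

                    @Override
                    public FileVisitResult postVisitDirectory(Path d, IOException exc)
                            throws IOException {
                        if (exc != null) {
                            throw exc; // surface errors hit while listing the directory
                        }
                        Files.delete(d); // directory is empty at this point
                        return FileVisitResult.CONTINUE;
                    }
                });
            } catch (IOException e) {
                // Fail fast with the real cause rather than a later timeout.
                throw new IOException("could not remove stale destination " + dir
                        + "; remove it manually and retry", e);
            }
        }
        // Guard against a deletion that reported success but left the path behind.
        if (Files.exists(dir)) {
            throw new IOException("stale destination " + dir
                    + " still exists after cleanup; remove it manually and retry");
        }
    }
}
```

With a helper like this, the caller would invoke `deleteOrFail(destination)`
before the move, so a locked file surfaces as an immediate, descriptive
IOException.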
##########
docs/modules/ROOT/pages/using-tika/server/index.adoc:
##########
@@ -24,91 +24,203 @@ This section covers running Apache Tika as a REST server via `tika-server`.
Tika Server provides a RESTful HTTP interface for parsing documents and extracting
content. It can be deployed as a standalone service or in a containerized environment.
+In Tika 4.x, all parsing happens in forked child processes via the Tika Pipes
+infrastructure — the request-handling process never loads parser libraries
directly.
+This provides process isolation (a parser crash or OOM cannot take down the server)
+at the cost of requiring a Pipes configuration. See
+xref:migration-to-4x/migrating-tika-server-4x.adoc[Migrating Tika Server to 4.x]
+for the full breaking-change list when upgrading from 3.x.
Review Comment:
The overview claims *all* parsing happens in forked child processes and the
request-handling process never loads parser libraries. However, some endpoints
(e.g., `/meta`) still parse in-process via
TikaResource.createParser()/TikaResource.parse(). This should be qualified to
avoid misleading readers about isolation guarantees.
##########
docs/modules/ROOT/pages/pipes/troubleshooting.adoc:
##########
@@ -192,6 +192,40 @@ response-body bytes for HTTP-style fetchers (configurable via
log catches the thrown exception. Lower `maxErrMsgSize` -- or set it to
zero -- if your responses can contain sensitive data.
+== Logging
+
+Tika uses https://logging.apache.org/log4j/2.x/[Log4j 2] for both tika-app and tika-server. Default output goes to `SYSTEM_ERR` with the pattern `%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n`. Each forked PipesServer logs with its own line prefix so parent and child output stays distinguishable; see <<_telling_fork_lines_from_parent_lines,Telling fork lines from parent lines>>.
+
+=== Default log4j2 configuration
+
+Each distribution ships its own `log4j2.xml` bundled inside the jar:
+
+* tika-app: `org/apache/tika/cli/log4j2.xml` (in `tika-app-<version>.jar`).
+* tika-server: `org/apache/tika/server/log4j2.xml` (in the relevant `tika-server-*.jar`).
+
Review Comment:
The documented locations of the bundled log4j2.xml files inside the jars
don't match the actual resource paths in this repo (they are at the jar root as
`log4j2.xml`, not under `org/apache/tika/...`). This makes it hard for users to
extract/override the right config.
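
One way to confirm where a bundled config actually lives before documenting
it is to list the matching entries in the jar. The sketch below is
illustrative only: the class name `FindBundledLog4jConfig` and the method
`findEntries` are hypothetical, and it relies solely on the standard
`java.util.zip` API. It returns every entry whose file name is `log4j2.xml`,
whether at the jar root or in a subdirectory.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class FindBundledLog4jConfig {

    // Hypothetical helper: list the jar entries whose file name equals fileName.
    static List<String> findEntries(Path jar, String fileName) throws IOException {
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            return zip.stream()
                    .map(ZipEntry::getName)
                    // match the entry at the jar root or in any subdirectory
                    .filter(n -> n.equals(fileName) || n.endsWith("/" + fileName))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Running `findEntries` against the built tika-app and tika-server jars would
show whether the documented `org/apache/tika/...` paths or the jar-root path
is correct.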
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.