This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new d02dc13903 TIKA-4740 -- tika-server-core fix (#2841)
d02dc13903 is described below
commit d02dc13903c9087ccdf9c33d8a39f2da4426c0c2
Author: Tim Allison <[email protected]>
AuthorDate: Wed May 27 08:53:37 2026 -0400
TIKA-4740 -- tika-server-core fix (#2841)
---
docs/modules/ROOT/pages/pipes/troubleshooting.adoc | 132 +++++++++++--------
.../tika/pipes/core/PerClientServerManager.java | 51 ++++++--
.../apache/tika/pipes/core/ServerProcessIO.java | 140 +++++++--------------
.../tika/pipes/core/SharedServerManager.java | 45 +++++--
.../apache/tika/pipes/core/server/PipesServer.java | 54 ++++++++
.../tika/server/core/IntegrationTestBase.java | 31 +++++
6 files changed, 288 insertions(+), 165 deletions(-)
diff --git a/docs/modules/ROOT/pages/pipes/troubleshooting.adoc b/docs/modules/ROOT/pages/pipes/troubleshooting.adoc
index ff119c5324..3765bae2e7 100644
--- a/docs/modules/ROOT/pages/pipes/troubleshooting.adoc
+++ b/docs/modules/ROOT/pages/pipes/troubleshooting.adoc
@@ -33,19 +33,10 @@ ERROR clientId=2: Process exited with code 1 before connecting to socket
ERROR Shared server process exited with code 1 before becoming ready
----
-For most failures (bad JVM args, missing classpath entry, OOM at boot), the
-parent logger additionally prints the tail of the child's stderr immediately
-after the exit-code line:
-
-[source]
-----
-ERROR clientId=2: child stderr tail:
-Error: Could not find or load main class org.apache.tika.pipes.core.server.PipesServer
-----
-
-For native crashes (segfault in a JNI parser, JVM bug), the JVM writes an
-`hs_err_pid<N>.log` file to the child's working directory. The parent logger
-will read and print that file too:
+For *native* JVM crashes (e.g. a segfault in a JNI parser), the JVM writes an
+`hs_err_pid<N>.log` file. We direct that file via `-XX:ErrorFile=` into the
+manager's per-server temp directory, then read it into the parent's SLF4J
+logger before cleanup:
[source]
----
@@ -57,51 +48,84 @@ ERROR clientId=2: JVM crash log hs_err_pid12345.log:
...
----
-In short: read the *parent* application's log first. The diagnostics from the
-dead child are inlined there, so you don't have to find anything on disk.
+So for native crashes, read the parent application's log first -- the hs_err
+contents are inlined there.
-== Keeping child log files for post-mortem analysis
+== Child JVM stdout/stderr
-By default, each forked server's stdout, stderr, and any JVM crash logs are
-written into a per-server temp directory. The temp directory is cleaned up
-when the manager shuts the server down. If you need to keep those files
-around -- for example, to diff stderr across multiple failed restart attempts,
-or to ship crash logs to a support contact -- set the
-`tika.pipes.server.logDir` system property on the *parent* JVM:
+By default the child `PipesServer` JVM inherits its stdout and stderr from
+the parent. This is the 12-factor / container-friendly default: when Tika
+runs in Docker or Kubernetes, the pipes-server's log records flow through
+to the container's stdio stream where the runtime (Docker, containerd) and
+any log aggregator (fluentd, fluent-bit, Promtail, the K8s log API, etc.)
+pick them up automatically. The default `pipes-fork-server-default-log4j2.xml`
+writes to `SYSTEM_ERR`, so inheritance is what makes those records visible
+to your observability stack.
-[source,bash]
-----
-java -Dtika.pipes.server.logDir=/var/log/tika-pipes-crashes \
- -jar your-app.jar ...
-----
+If you don't want the pipes-server's output interleaved with your own --
+e.g. an embedded use case where the parent is producing its own structured
+stdout, or a test environment where you want a quieter console -- set the
+system property `tika.pipes.server.stdio=discard` on the parent JVM:
-When set, the manager copies the child's `server-stdout.log`,
-`server-stderr.log`, and any `hs_err_pid*.log` files to that directory on
-every abnormal exit, with a timestamp prefix:
-
-[source]
+[source,bash]
----
-/var/log/tika-pipes-crashes/
- 1748307123456-server-stderr.log
- 1748307123456-hs_err_pid12345.log
- 1748307145001-server-stderr.log # later restart attempt
+java -Dtika.pipes.server.stdio=discard -jar your-app.jar ...
----
-The property is off by default. Leave it off in steady-state production; turn
-it on when you are actively debugging a recurring fork failure.
-
-== What does *not* go to those files
-
-Steady-state log output from the parser (every parse, every emitter, every
-embedded-document warning) does **not** go to `server-stderr.log`. It goes
-through SLF4J inside the child JVM and lands in whatever your `log4j2.xml` or
-`logback.xml` directs it to. The child's stderr is only useful for things the
-JVM writes before logging is wired up, or that bypass logging entirely:
-
-* JVM startup errors (bad classpath, unrecognized flag, "could not find main
- class").
-* Uncaught throwables on the main thread that never reached an SLF4J logger.
-* Output from `System.err.println` calls (if any).
-
-For native crash investigation, the JVM-generated `hs_err_pid<N>.log` is the
-primary artifact, and it is collected automatically as described above.
+With this set, the child's stdout and stderr are routed to the null sink
+and the pipes server's log records are silently dropped at the OS level.
+(Records written via SLF4J inside the child can still be captured by
+configuring `log4j2.xml` / `logback.xml` to write to your own file or
+network appender, independent of the stdio setting.)
+
+=== Safety of the inherit default on Windows
+
+Earlier versions of Tika hit a surefire hang on Windows when inheriting
+child stdio: a forked child held a duplicate of the parent JVM's stderr
+handle, and any reader upstream of the parent (a maven-surefire controller,
+typically) never saw EOF after the parent died -- the child kept the pipe
+open. That class of hang is now mitigated structurally: every child
+`PipesServer` watches its parent's process handle via
+`ProcessHandle.onExit()` (see <<parent-death-detection>>) and
+self-terminates within milliseconds of parent exit. The inherited handle is
+released essentially synchronously with the parent's death, and upstream
+readers see EOF promptly.
+
+[#parent-death-detection]
+== Parent-death detection
+
+The child `PipesServer` JVMs watch their parent's PID via
+`ProcessHandle.onExit()` and self-terminate within milliseconds if the
+parent dies. The parent passes its own PID via the
+`TIKA_PIPES_PARENT_PID` environment variable when spawning the child.
+
+This matters because the parent (e.g. tika-server) can be killed in ways
+that skip its JVM shutdown hooks -- for instance,
+`Process.destroy()` on Windows is equivalent to `TerminateProcess`, which
+bypasses all hooks. Without parent-death detection, an orphaned PipesServer
+would only notice via TCP RST on its next socket read, and would not
+notice at all while busy in a parse, leaving it (and any external
+subprocess it had spawned, such as a tesseract OCR worker) running
+indefinitely.
+
+When the watcher fires, the child exits via `System.exit`, which runs
+`AbstractExternalProcessParser`'s shutdown hook and cleans up any
+in-flight external subprocesses.
+
+== Configuration knobs reference
+
+[cols="2,3"]
+|===
+|System property / env var |Effect
+
+|`tika.pipes.server.stdio` (system property)
+|`discard` suppresses child stdout/stderr at the OS level. Anything else
+ (or unset) inherits the child's stdio from the parent JVM. Default: inherit.
+
+|`TIKA_PIPES_PARENT_PID` (env var)
+|Set automatically by the parent manager when spawning a `PipesServer`
+ child. The child uses it to watch its parent and self-terminate if the
+ parent dies. Not normally set by users; if you launch `PipesServer`
+ standalone (outside the normal manager flow) and leave it unset, the
+ parent-watch is simply skipped.
+|===
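The `tika.pipes.server.stdio` switch documented in the hunk above can be sketched in isolation. This is a minimal illustration, assuming only what the diff shows: the class name `StdioModeSketch` and the `configure` helper are hypothetical, and the property value is passed in as a parameter instead of being read from `System.getProperty`, so the logic is easy to exercise.

```java
public class StdioModeSketch {
    // Property name and value as they appear in the diff.
    static final String STDIO_MODE_PROPERTY = "tika.pipes.server.stdio";
    static final String STDIO_MODE_DISCARD = "discard";

    /** Inherit unless the value is "discard" (case-insensitive); null means unset. */
    static boolean inheritStdio(String propertyValue) {
        return !STDIO_MODE_DISCARD.equalsIgnoreCase(propertyValue);
    }

    /** Hypothetical helper: applies the chosen mode to a ProcessBuilder. */
    static ProcessBuilder configure(ProcessBuilder pb, String propertyValue) {
        if (inheritStdio(propertyValue)) {
            // Child shares the parent's stdin, stdout and stderr.
            pb.inheritIO();
        } else {
            // Child output goes to the platform null sink.
            pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
            pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        }
        return pb;
    }
}
```

A caller would pass `System.getProperty(StdioModeSketch.STDIO_MODE_PROPERTY)` as the second argument. Note that in the diff the shared-server manager applies the switch to stderr only, because stdout stays on a parent-owned pipe for the `READY:port` signal.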
diff --git a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
index 3b53cecc41..50165dce1b 100644
--- a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
+++ b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
@@ -34,6 +34,7 @@ import org.apache.commons.io.FileUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
+import org.apache.tika.pipes.core.server.PipesServer;
import org.apache.tika.utils.ProcessUtils;
/**
@@ -270,15 +271,33 @@ public class PerClientServerManager implements ServerManager {
tmpDir = Files.createTempDirectory("pipes-server-" + clientId + "-");
ProcessBuilder pb = new ProcessBuilder(getCommandline());
- // Run the child in tmpDir so any hs_err_pid<N>.log JVM crash log lands
- // where surfaceCrashDiagnostics() looks for it. Redirect stdio to per-
- // server files instead of inheriting the parent JVM's handles -- on
- // Windows inheritIO() duplicates surefire's stderr handle into the
- // child, blocking the controller's pipe reader past parent exit and
- // hanging CI.
- pb.directory(tmpDir.toFile());
- pb.redirectOutput(ServerProcessIO.stdoutLog(tmpDir));
- pb.redirectError(ServerProcessIO.stderrLog(tmpDir));
+ // Tell the child our PID so it can watch ProcessHandle.onExit() and
+ // self-terminate promptly if we die. Without this, an orphan child
+ // can only notice via socket-read timeout (default 60s) and can't
+ // notice at all while it is mid-parse -- the leak that TIKA-4740
+ // surfaced via @TempDir cleanup failures.
+ pb.environment().put(PipesServer.PARENT_PID_ENV,
+ Long.toString(ProcessHandle.current().pid()));
+ // Default: inherit stdio so the pipes-server's log records show up
+ // in the parent's stdio stream (the production case is Docker/K8s
+ // where container stdio is picked up by log aggregators -- writing
+ // to files in a container is an anti-pattern). Set
+ // -Dtika.pipes.server.stdio=discard on the parent to suppress.
+ //
+ // The Windows surefire hang that previously made inheritIO() risky
+ // is mitigated by PipesServer.watchParentProcess(): when the parent
+ // exits, the child detects it via ProcessHandle.onExit() within
+ // milliseconds and System.exit()s, releasing its inherited stderr
+ // handle so upstream pipe readers see EOF promptly.
+ //
+ // hs_err crash logs are pointed at tmpDir via -XX:ErrorFile in
+ // getCommandline() and surfaced via SLF4J on abnormal exit.
+ if (ServerProcessIO.inheritStdio()) {
+ pb.inheritIO();
+ } else {
+ pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
+ pb.redirectError(ProcessBuilder.Redirect.DISCARD);
+ }
try {
process = pb.start();
@@ -383,6 +402,7 @@ public class PerClientServerManager implements ServerManager {
boolean hasExitOnOOM = false;
boolean hasLog4j = false;
boolean hasActiveProcessorCount = false;
+ boolean hasErrorFile = false;
String origGCString = null;
String newGCLogString = null;
@@ -402,12 +422,25 @@ public class PerClientServerManager implements ServerManager {
if (arg.startsWith("-XX:ActiveProcessorCount=")) {
hasActiveProcessorCount = true;
}
+ if (arg.startsWith("-XX:ErrorFile=")) {
+ hasErrorFile = true;
+ }
if (arg.startsWith("-Xloggc:")) {
origGCString = arg;
newGCLogString = arg.replace("${pipesClientId}", "id-" + clientId);
}
}
+ // Direct native-crash dumps (hs_err_pid<N>.log) into tmpDir so
+ // ServerProcessIO.surfaceCrashDiagnostics() can find and emit them on
+ // abnormal exit. The child JVM inherits the parent's CWD (we do NOT
+ // call pb.directory()), so without this the JVM would write hs_err
+ // wherever the parent was launched -- typically lost.
+ if (!hasErrorFile) {
+ configArgs.add("-XX:ErrorFile=" + tmpDir.resolve("hs_err_pid%p.log")
+ .toAbsolutePath());
+ }
+
// If the user hasn't explicitly set -XX:ActiveProcessorCount, size each
// forked JVM's view of CPUs to a fair slice of the host. Otherwise each
// JVM defaults its GC, JIT, and common ForkJoinPool to "all cores", which
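The `-XX:ErrorFile=` defaulting added in this file can be read as a small, self-contained step: add the flag only when the user has not supplied one. The sketch below is a hedged restatement of that logic; `ErrorFileArgSketch` and `withErrorFile` are hypothetical names, and unlike the diff (which mutates `configArgs` in place) it copies the list, so it also works with immutable inputs.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ErrorFileArgSketch {
    /**
     * Returns a copy of configArgs with -XX:ErrorFile pointing into tmpDir,
     * unless the caller already supplied an -XX:ErrorFile argument.
     */
    static List<String> withErrorFile(List<String> configArgs, Path tmpDir) {
        List<String> args = new ArrayList<>(configArgs);
        boolean hasErrorFile = false;
        for (String arg : args) {
            if (arg.startsWith("-XX:ErrorFile=")) {
                hasErrorFile = true;
            }
        }
        if (!hasErrorFile) {
            // %p is expanded by the JVM to the crashing process's pid.
            args.add("-XX:ErrorFile="
                    + tmpDir.resolve("hs_err_pid%p.log").toAbsolutePath());
        }
        return args;
    }
}
```

Respecting a user-supplied flag keeps the override path open. In that case the crash log is written wherever the user pointed it, which may be outside the directory the manager scans.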
diff --git a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/ServerProcessIO.java b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/ServerProcessIO.java
index 34054b8642..a18e4dd0ad 100644
--- a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/ServerProcessIO.java
+++ b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/ServerProcessIO.java
@@ -16,72 +16,78 @@
*/
package org.apache.tika.pipes.core;
-import java.io.File;
import java.io.IOException;
-import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;
import org.slf4j.Logger;
/**
- * Helpers for routing child pipes-server JVM stdout/stderr to per-server log
- * files in the manager's temp dir, and for surfacing those files (and any
- * native JVM crash logs) via the parent's SLF4J logger when the child exits
- * abnormally.
+ * Helpers for child pipes-server JVM stdio.
* <p>
- * Background: previously the managers used {@code pb.inheritIO()} /
- * {@code Redirect.INHERIT}, which duplicated the parent JVM's stdio handles
- * into the child. On Windows that leaks the parent JVM's stderr handle past
- * the parent JVM's own lifetime -- a surefire pipe reader thread on the
- * controller side then blocks forever waiting for EOF, hanging CI.
+ * The managers default to {@link ProcessBuilder#inheritIO()} for the child's
+ * stdout/stderr -- the production case is container deployments
+ * (Docker/Kubernetes) where stdio is the canonical log stream and gets
+ * picked up by the container runtime / log aggregator. The pipes server's
+ * default log4j2 config writes to {@code SYSTEM_ERR}; inheriting routes
+ * those records to the operator's existing observability stack with no
+ * extra wiring.
+ * <p>
+ * A previous version of this code defaulted to DISCARD to work around a
+ * Windows-only CI hang where surefire's pipe reader waited on a child that
+ * had inherited (and was still holding) the parent JVM's stderr handle.
+ * That hang is now mitigated structurally by the parent-PID watch in
+ * {@code PipesServer.watchParentProcess()} -- when the parent dies, the
+ * child notices via {@code ProcessHandle.onExit()} and exits within
+ * milliseconds, releasing the inherited handle and letting upstream
+ * readers see EOF.
+ * <p>
+ * To suppress child stdio (e.g. embedded use cases that don't want
+ * pipes-server log records interleaved with their own stream, or rare
+ * test environments where the parent-watch isn't enough), set the system
+ * property {@value #STDIO_MODE_PROPERTY} to {@value #STDIO_MODE_DISCARD}
+ * on the parent JVM.
+ * <p>
+ * Native JVM crashes ({@code hs_err_pid<N>.log}) are always redirected into
+ * the manager's per-server tmpDir via {@code -XX:ErrorFile=}, and
+ * {@link #surfaceCrashDiagnostics(Logger, String, Path)} reads any found
+ * crash logs into the parent SLF4J logger before cleanup.
*/
final class ServerProcessIO {
- /** System property opt-in: when set, child log files and any hs_err
- * crash logs are copied here before tmpDir cleanup. */
- static final String LOG_DIR_PROPERTY = "tika.pipes.server.logDir";
-
- static final String STDOUT_LOG = "server-stdout.log";
- static final String STDERR_LOG = "server-stderr.log";
+ /** System property selecting child stdio handling.
+ * See class javadoc. */
+ static final String STDIO_MODE_PROPERTY = "tika.pipes.server.stdio";
- private static final int TAIL_BYTES = 64 * 1024;
+ /** Property value to suppress child stdio. Any other value (or unset)
+ * inherits. */
+ static final String STDIO_MODE_DISCARD = "discard";
private ServerProcessIO() {
}
- static File stdoutLog(Path tmpDir) {
- return tmpDir.resolve(STDOUT_LOG).toFile();
- }
-
- static File stderrLog(Path tmpDir) {
- return tmpDir.resolve(STDERR_LOG).toFile();
+ /**
+ * Returns {@code true} (the default) unless the operator has set
+ * {@value #STDIO_MODE_PROPERTY}={@value #STDIO_MODE_DISCARD} to silence
+ * the child's stdio.
+ */
+ static boolean inheritStdio() {
+ return !STDIO_MODE_DISCARD.equalsIgnoreCase(
+ System.getProperty(STDIO_MODE_PROPERTY));
}
/**
- * Emits the child's stderr tail and any {@code hs_err_pid<N>.log} JVM
- * crash logs via {@code log.error} so they show up in the parent's log
- * output. Call this on every abnormal-exit path before {@code tmpDir}
- * gets deleted, otherwise the diagnostics disappear with the temp dir.
- * <p>
- * If {@code tika.pipes.server.logDir} is set, the same files are also
- * copied to that directory for post-mortem inspection.
+ * Emits any {@code hs_err_pid<N>.log} JVM crash logs found in
+ * {@code tmpDir} via {@code log.error} so they show up in the parent's
+ * log output. Call this on every abnormal-exit path before tmpDir gets
+ * deleted, otherwise the diagnostics disappear with the temp dir.
*/
static void surfaceCrashDiagnostics(Logger log, String contextLabel, Path tmpDir) {
if (tmpDir == null || !Files.isDirectory(tmpDir)) {
return;
}
- Path stderr = tmpDir.resolve(STDERR_LOG);
- if (Files.isRegularFile(stderr)) {
- String tail = readTail(stderr);
- if (!tail.isEmpty()) {
- log.error("{}: child stderr tail:\n{}", contextLabel, tail);
- }
- }
try (Stream<Path> entries = Files.list(tmpDir)) {
entries.filter(ServerProcessIO::isJvmCrashLog).forEach(p -> {
try {
@@ -97,64 +103,10 @@ final class ServerProcessIO {
log.warn("{}: failed to list tmpDir for hs_err logs: {}",
contextLabel, e.toString());
}
-
- String persistDir = System.getProperty(LOG_DIR_PROPERTY);
- if (persistDir != null && !persistDir.isBlank()) {
- persistCrashFiles(log, contextLabel, tmpDir, Paths.get(persistDir));
- }
}
private static boolean isJvmCrashLog(Path p) {
String name = p.getFileName().toString();
return name.startsWith("hs_err_pid") && name.endsWith(".log");
}
-
- private static void persistCrashFiles(Logger log, String contextLabel,
- Path tmpDir, Path dest) {
- try {
- Files.createDirectories(dest);
- } catch (IOException e) {
- log.warn("{}: failed to create persist dir {}: {}",
- contextLabel, dest, e.toString());
- return;
- }
- String stamp = Long.toString(System.currentTimeMillis());
- try (Stream<Path> entries = Files.list(tmpDir)) {
- entries.filter(p -> {
- String name = p.getFileName().toString();
- return name.equals(STDOUT_LOG) || name.equals(STDERR_LOG)
- || isJvmCrashLog(p);
- }).forEach(p -> {
- Path target = dest.resolve(stamp + "-" + p.getFileName());
- try {
- Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
- log.info("{}: persisted {} to {}", contextLabel,
- p.getFileName(), target);
- } catch (IOException e) {
- log.warn("{}: failed to copy {} to {}: {}", contextLabel,
- p.getFileName(), target, e.toString());
- }
- });
- } catch (IOException e) {
- log.warn("{}: failed to enumerate tmpDir for persistence: {}",
- contextLabel, e.toString());
- }
- }
-
- private static String readTail(Path file) {
- try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
- long len = raf.length();
- long start = Math.max(0, len - TAIL_BYTES);
- raf.seek(start);
- byte[] buf = new byte[(int) (len - start)];
- raf.readFully(buf);
- String s = new String(buf, StandardCharsets.UTF_8);
- if (start > 0) {
- s = "...[truncated, showing last " + TAIL_BYTES + " bytes]...\n" + s;
- }
- return s;
- } catch (IOException e) {
- return "";
- }
- }
}
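After this change, `surfaceCrashDiagnostics` only has to locate `hs_err_pid<N>.log` files in the per-server temp directory. A rough sketch of that lookup follows, assuming the same filename test as `isJvmCrashLog` in the diff; `CrashLogScanSketch` and `findCrashLogs` are hypothetical names. The part of the real method that reads each file and logs it is elided from the hunk, so it is not reproduced here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CrashLogScanSketch {
    /** Same filename test as isJvmCrashLog in the diff. */
    static boolean isJvmCrashLog(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith("hs_err_pid") && name.endsWith(".log");
    }

    /** Lists crash logs directly inside tmpDir; empty if tmpDir is missing. */
    static List<Path> findCrashLogs(Path tmpDir) {
        if (tmpDir == null || !Files.isDirectory(tmpDir)) {
            return List.of();
        }
        // Files.list holds a directory handle, so close it via try-with-resources.
        try (Stream<Path> entries = Files.list(tmpDir)) {
            return entries.filter(CrashLogScanSketch::isJvmCrashLog)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The real method logs a warning on listing failure instead of throwing. The unchecked exception here is a simplification for the sketch.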
diff --git a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/SharedServerManager.java b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/SharedServerManager.java
index 3778b082cf..ea18f31b0e 100644
--- a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/SharedServerManager.java
+++ b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/SharedServerManager.java
@@ -283,15 +283,30 @@ public class SharedServerManager implements ServerManager {
// eliminating the TOCTOU race between probing a free port and binding it.
pb.environment().put("TIKA_PIPES_PORT", "0");
pb.environment().put("TIKA_PIPES_AUTH_TOKEN", HexFormat.of().formatHex(token));
- // Run the child in tmpDir so any hs_err_pid<N>.log JVM crash log
- // lands where surfaceCrashDiagnostics() looks for it. Keep stdout on
- // a parent-owned pipe so we can read the READY:port signal. Redirect
- // stderr to a file rather than INHERIT -- on Windows, inheriting
- // stderr duplicates surefire's stderr handle into the child, blocking
- // the controller's pipe reader past parent exit and hanging CI.
- pb.directory(tmpDir.toFile());
+ // Tell the child our PID so it can watch ProcessHandle.onExit() and
+ // self-terminate promptly if we die. See PipesServer.watchParentProcess.
+ pb.environment().put(PipesServer.PARENT_PID_ENV,
+ Long.toString(ProcessHandle.current().pid()));
+ // stdout stays on a parent-owned pipe so we can read the READY:port
+ // signal. stderr defaults to INHERIT so the shared-server's log
+ // records show up in the parent's stdio stream (the production case
+ // is Docker/K8s where container stdio is picked up by log
+ // aggregators). Set -Dtika.pipes.server.stdio=discard on the parent
+ // to suppress.
+ //
+ // The Windows surefire hang that previously made INHERIT risky is
+ // mitigated by PipesServer.watchParentProcess(): when the parent
+ // exits, the child detects it via ProcessHandle.onExit() within
+ // milliseconds and System.exit()s, releasing its inherited stderr.
+ //
+ // hs_err crash logs are pointed at tmpDir via -XX:ErrorFile in
+ // getCommandline() and surfaced via SLF4J on abnormal exit.
pb.redirectErrorStream(false);
- pb.redirectError(ServerProcessIO.stderrLog(tmpDir));
+ if (ServerProcessIO.inheritStdio()) {
+ pb.redirectError(ProcessBuilder.Redirect.INHERIT);
+ } else {
+ pb.redirectError(ProcessBuilder.Redirect.DISCARD);
+ }
try {
process = pb.start();
@@ -417,6 +432,7 @@ public class SharedServerManager implements ServerManager {
boolean hasHeadless = false;
boolean hasExitOnOOM = false;
boolean hasLog4j = false;
+ boolean hasErrorFile = false;
for (String arg : configArgs) {
if (arg.startsWith("-Djava.awt.headless")) {
@@ -431,6 +447,19 @@ public class SharedServerManager implements ServerManager {
if (arg.startsWith("-Dlog4j.configuration") ||
arg.startsWith("-Dlog4j2.configuration")) {
hasLog4j = true;
}
+ if (arg.startsWith("-XX:ErrorFile=")) {
+ hasErrorFile = true;
+ }
+ }
+
+ // Direct native-crash dumps (hs_err_pid<N>.log) into tmpDir so
+ // ServerProcessIO.surfaceCrashDiagnostics() can find and emit them on
+ // abnormal exit. The child JVM inherits the parent's CWD (we do NOT
+ // call pb.directory()), so without this the JVM would write hs_err
+ // wherever the parent was launched.
+ if (!hasErrorFile) {
+ configArgs.add("-XX:ErrorFile=" + tmpDir.resolve("hs_err_pid%p.log")
+ .toAbsolutePath());
}
List<String> commandLine = new ArrayList<>();
diff --git a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesServer.java b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesServer.java
index 2056c6aab7..e84e0c9048 100644
--- a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesServer.java
+++ b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/server/PipesServer.java
@@ -33,6 +33,7 @@ import java.time.Duration;
import java.time.Instant;
import java.util.HexFormat;
import java.util.Locale;
+import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
@@ -95,6 +96,15 @@ public class PipesServer implements AutoCloseable {
public static final int AUTH_TOKEN_LENGTH_BYTES = 32;
+ /** Env var the parent manager sets so the child can watch the parent's
+ * process handle and exit promptly if the parent dies. */
+ public static final String PARENT_PID_ENV = "TIKA_PIPES_PARENT_PID";
+
+ /** Exit code used when the child self-terminates because its parent JVM
+ * disappeared. Distinct from UNSPECIFIED_CRASH (19) so log readers can
+ * tell the difference between "I crashed" and "my parent went away". */
+ public static final int PARENT_GONE_EXIT_CODE = 23;
+
private final long heartbeatIntervalMs;
private final String pipesClientId;
@@ -196,6 +206,10 @@ public class PipesServer implements AutoCloseable {
public static void main(String[] args) throws Exception {
+ // Register parent-death watcher FIRST so that even bootstrap failures
+ // below don't strand us if the parent has already died.
+ watchParentProcess();
+
// Check for shared mode: --shared <numConnections> <tikaConfigPath>
if (args.length > 0 && "--shared".equals(args[0])) {
String portEnv = System.getenv("TIKA_PIPES_PORT");
@@ -517,6 +531,46 @@ public class PipesServer implements AutoCloseable {
System.exit(exitCode);
}
+ /**
+ * Registers a {@link ProcessHandle#onExit()} callback on the parent PID
+ * (read from the {@value #PARENT_PID_ENV} env var) so that this JVM
+ * self-terminates promptly if its parent disappears. Without this, an
+ * orphaned PipesServer would only notice the parent is gone when the
+ * next socket read fails -- which can take up to
+ * {@code socketTimeoutMs} (default 60s) and doesn't fire at all while
+ * the server is mid-parse. {@code System.exit} here lets the
+ * {@code AbstractExternalProcessParser} shutdown hook run, killing any
+ * in-flight external subprocess (e.g. tesseract) cleanly.
+ */
+ private static void watchParentProcess() {
+ String parentPidStr = System.getenv(PARENT_PID_ENV);
+ if (parentPidStr == null || parentPidStr.isEmpty()) {
+ LOG.info("{} not set; skipping parent-watch", PARENT_PID_ENV);
+ return;
+ }
+ long parentPid;
+ try {
+ parentPid = Long.parseLong(parentPidStr);
+ } catch (NumberFormatException e) {
+ LOG.warn("invalid {} value '{}'; skipping parent-watch",
+ PARENT_PID_ENV, parentPidStr);
+ return;
+ }
+ Optional<ProcessHandle> parent = ProcessHandle.of(parentPid);
+ if (parent.isEmpty()) {
+ LOG.error("parent pid {} not found at startup; exiting to avoid orphan",
+ parentPid);
+ System.exit(PARENT_GONE_EXIT_CODE);
+ return;
+ }
+ parent.get().onExit().thenRun(() -> {
+ LOG.error("parent pid {} exited; shutting down to avoid orphan",
+ parentPid);
+ System.exit(PARENT_GONE_EXIT_CODE);
+ });
+ LOG.info("watching parent pid {} for exit", parentPid);
+ }
+
protected void initializeResources() throws TikaException, IOException, SAXException {
TikaJsonConfig tikaJsonConfig = tikaLoader.getConfig();
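The parent-watch pattern added by `watchParentProcess()` can be sketched with the exit action factored out as a `Runnable`, so the PID handling can be tried without terminating a JVM. This is a simplified illustration, not the committed code: `ParentWatchSketch`, `parsePid`, and `watch` are hypothetical names, and logging is omitted.

```java
import java.util.Optional;
import java.util.OptionalLong;

public class ParentWatchSketch {
    /** Parses a PID string; empty for null, blank, or non-numeric input. */
    static OptionalLong parsePid(String raw) {
        if (raw == null || raw.isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(raw));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /**
     * Registers onParentGone to run when the process with the given PID exits.
     * Returns true if a watcher was registered. If the PID cannot be parsed
     * the watch is skipped; if the process is already gone, the action runs
     * immediately. Do not pass the current JVM's own PID: onExit() is not
     * permitted on the current process handle.
     */
    static boolean watch(String rawPid, Runnable onParentGone) {
        OptionalLong pid = parsePid(rawPid);
        if (pid.isEmpty()) {
            return false;
        }
        Optional<ProcessHandle> parent = ProcessHandle.of(pid.getAsLong());
        if (parent.isEmpty()) {
            onParentGone.run();
            return false;
        }
        // The callback fires on a JDK-managed thread once the parent exits.
        parent.get().onExit().thenRun(onParentGone);
        return true;
    }
}
```

In the commit the equivalent call would be roughly `watch(System.getenv("TIKA_PIPES_PARENT_PID"), () -> System.exit(23))`, where 23 is `PARENT_GONE_EXIT_CODE`. Exiting through `System.exit` is what lets the shutdown hooks run.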
diff --git a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/IntegrationTestBase.java b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/IntegrationTestBase.java
index 2299a65f6a..19ee324ef0 100644
--- a/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/IntegrationTestBase.java
+++ b/tika-server/tika-server-core/src/test/java/org/apache/tika/server/core/IntegrationTestBase.java
@@ -90,9 +90,40 @@ public class IntegrationTestBase extends TikaTest {
if (process.isAlive()) {
throw new RuntimeException("process still alive!");
}
+ // DIAGNOSTIC (TIKA-4740): on Windows, Process.destroy() is the same
+ // as destroyForcibly() (supportsNormalTermination() is false), so
+ // tika-server's JVM shutdown hooks never run. That means
+ // PipesParser.close() never fires, and the forked PipesServer
+ // children become orphans. They eventually self-terminate when
+ // their socket to the parent breaks, but the timing is racy --
+ // if @TempDir cleanup runs before they exit, they're still
+ // holding the redirect log files open and the test fails.
+ //
+ // We give the OS a moment for the kill to propagate, then list
+ // any PipesServer JVMs still alive. Anything that shows up here
+ // is an orphan and explains downstream @TempDir cleanup failures.
+ logOrphanPipesServers();
}
}
+ private static void logOrphanPipesServers() {
+ try {
+ Thread.sleep(500);
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+ return;
+ }
+ long count = ProcessHandle.allProcesses()
+ .filter(p -> p.info().commandLine()
+ .map(c -> c.contains("org.apache.tika.pipes.core.server.PipesServer"))
+ .orElse(false))
+ .peek(p -> LOG.warn(
+ "ORPHAN PipesServer alive after tika-server exit: pid={} cmd={}",
+ p.pid(), p.info().commandLine().orElse("?")))
+ .count();
+ LOG.info("post-teardown orphan PipesServer count: {}", count);
+ }
+
public void startProcess(String[] extraArgs) throws IOException {
String[] base = new String[]{"java",
"-Djava.io.tmpdir=" + TEMP_WORKING_DIR.toAbsolutePath(), // make sure we're using subdir cleaned up by JUnit