(tika) branch main updated: Automatically add slices and/or log underprovisioned pipes configurations (#2793)

tallison Tue, 28 Apr 2026 18:09:39 -0700

This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git



The following commit(s) were added to refs/heads/main by this push:
     new 6b53816661 Automatically add slices and/or log underprovisioned pipes configurations (#2793)
6b53816661 is described below

commit 6b538166612ef83aaefd838349bf8c59713faf40
Author: Tim Allison <[email protected]>
AuthorDate: Tue Apr 28 21:09:26 2026 -0400

    Automatically add slices and/or log underprovisioned pipes configurations (#2793)
---
 docs/modules/ROOT/nav.adoc                         |   1 +
 docs/modules/ROOT/pages/pipes/configuration.adoc   |   4 +-
 docs/modules/ROOT/pages/pipes/cpu-sizing.adoc      | 134 +++++++++++++++++++++
 .../tika/pipes/core/PerClientServerManager.java    | 101 ++++++++++++++++
 4 files changed, 238 insertions(+), 2 deletions(-)

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 9ae77c03fb..16429e45bf 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -30,6 +30,7 @@
 ** xref:pipes/parse-modes.adoc[Parse Modes]
 ** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
 ** xref:pipes/timeouts.adoc[Timeouts]
+** xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing]
 * xref:configuration/index.adoc[Configuration]
 ** xref:configuration/parsers/pdf-parser.adoc[PDF Parser]
 ** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
diff --git a/docs/modules/ROOT/pages/pipes/configuration.adoc b/docs/modules/ROOT/pages/pipes/configuration.adoc
index 7204d39589..c6614e7811 100644
--- a/docs/modules/ROOT/pages/pipes/configuration.adoc
+++ b/docs/modules/ROOT/pages/pipes/configuration.adoc
@@ -42,11 +42,11 @@ how many forked JVMs to run, timeouts, memory management, and parse behavior.
 
 |`numClients`
 |`4`
-|Number of parallel forked JVM processes. Each processes one document at a time.
+|Number of parallel forked JVM processes. Each processes one document at a time. See xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing] for guidance on choosing this value relative to host CPU count.
 
 |`forkedJvmArgs`
 |`[]`
-|JVM arguments for forked processes (e.g., `["-Xmx512m", "-Xms256m"]`).
+|JVM arguments for forked processes (e.g., `["-Xmx512m", "-Xms256m"]`). When `numClients > 1`, Tika auto-injects `-XX:ActiveProcessorCount` to right-size each fork's GC and JIT thread pools unless you provide your own; see xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing].
 
 |`javaPath`
 |`java`
diff --git a/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
new file mode 100644
index 0000000000..e7dd9810ed
--- /dev/null
+++ b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
@@ -0,0 +1,134 @@
+= Forked-JVM CPU Sizing
+
+Tika Pipes runs multiple forked JVMs in per-client mode (one per `numClients`).
+Each JVM independently sizes its garbage collector, JIT compiler, and common
+`ForkJoinPool` based on the host CPU count. Without intervention, this causes
+thread-pool blowup at high `numClients`: e.g., 4 forks on a 16-core host
+default to ~16 GC threads × 4 = ~64 GC threads, all competing for the same 16
+cores.
+
+To fix this, Tika Pipes auto-injects `-XX:ActiveProcessorCount` into each
+forked JVM's command line, sizing each fork's view of the CPU count to a fair
+slice of the host. This is on by default in per-client mode (`numClients > 1`)
+when the user has not already supplied `-XX:ActiveProcessorCount` in
+`forkedJvmArgs`.
+
+== Mental model
+
+----
+pod_cpus  =  parent_overhead (≈ 2)  +  numClients × per_fork_slice
+----
+
+Where `per_fork_slice ≥ 2`:
+
+* 1 CPU for the parser thread
+* 1 CPU for everything else the JVM does (GC concurrent worker, JIT,
+  protocol heartbeat, socket I/O thread)
+
+The parent JVM (the one running `tika-async-cli` / `tika-app -a`) is light
+on CPU — it just serializes requests, deserializes responses, and runs
+the heartbeat — but it must not be CPU-starved. A starved parent shows up
+as pathological tail latency on small operations like `socket.write()`,
+because the calling thread gets preempted between clock reads. We reserve
+2 cores for the parent by default.
+
+== Formula
+
+[source]
+----
+slice = (hostCores - PARENT_RESERVED_CORES) / numClients
+
+PARENT_RESERVED_CORES = 2
+MIN_AUTO_CAP_SLICE    = 2
+----
+
+If `slice ≥ 2`, Tika injects `-XX:ActiveProcessorCount=<slice>` into each
+forked JVM. If `slice < 2`, the auto-cap is *skipped* and a `WARN` is
+logged advising the operator to lower `numClients`. Skipping is intentional:
+at `slice=1` the fork's only CPU is fully consumed by parsing, so its
+socket-reader thread cannot run and the parent's writes block on
+receiver-side back-pressure — measurably worse than no cap at all.
+
+== Recommended sizing
+
+For typical cloud-VM core counts:
+
+[cols="1,1,1,3"]
+|===
+|hostCores |numClients |slice |Notes
+
+|2  |1 |n/a    |Tight; auto-cap not applied (single fork). Acceptable for low throughput.
+|4  |1 |n/a    |Comfortable single-fork deployment.
+|4  |2 |1 → skipped |Auto-cap declines; consider `numClients=1`.
+|8  |1 |n/a    |Lots of headroom; single-fork lifecycle isolation is fine.
+|8  |3 |2      |Sweet spot for medium pods.
+|16 |4 |3      |Sweet spot for 16-core hosts. Measured winner in benchmarks.
+|16 |6 |2      |Higher concurrency; tighter per-fork breathing room.
+|16 |8 |1 → skipped |Doesn't fit 16 cores. Keep at 4 or 6.
+|32 |8 |3      |Same shape as 16/4.
+|===
+
+The general rule is: pick the largest `numClients` that satisfies
+`numClients × 2 + 2 ≤ hostCores`. Beyond that point, adding workers
+starts hurting throughput.
+
+== Diagnostics
+
+Every `PipesParser` startup emits a one-shot summary line on its main
+logger so operators can see what was decided:
+
+[source]
+----
+INFO  pipes-cpu-sizing: hostCores=16, numClients=4, parentReserved=2, autoCap=slice=3
+----
+
+The `autoCap` field is one of:
+
+* `slice=N` — the auto-cap fired; each fork sees N CPUs.
+* `skipped (slice<2)` — over-provisioned; operator should reduce `numClients`.
+* `n/a (single fork; not capped)` — `numClients=1`; fork sees the whole host.
+* `user-set in forkedJvmArgs` — operator set `-XX:ActiveProcessorCount` 
themselves.
+
+Two `WARN`-level messages call out clearly-bad provisioning:
+
+* `hostCores < 2` — the host has no room for the parser plus background JVM 
threads.
+* `numClients × 2 + 2 > hostCores` — the host is too small for the requested 
concurrency.
+
+`grep pipes-cpu-sizing` on the parent's logs surfaces all sizing-related output.
+
+== Disabling or overriding
+
+If you want to manage `ActiveProcessorCount` yourself (e.g., to allocate a
+different slice based on workload knowledge), just include it in your config:
+
+[source,json]
+----
+"pipes": {
+  "numClients": 4,
+  "forkedJvmArgs": ["-Xmx512m", "-XX:ActiveProcessorCount=4"]
+}
+----
+
+When Tika sees an explicit `-XX:ActiveProcessorCount` in `forkedJvmArgs`, it
+respects your value and skips the auto-injection — the sizing summary will
+report `autoCap=user-set in forkedJvmArgs`.
+
+== Container & cgroup behavior
+
+The formula uses `Runtime.availableProcessors()` for the host CPU count,
+which on JDK 17+ honors cgroup CPU limits. So in Kubernetes:
+
+* If a pod has `resources.limits.cpu` set, the JVM sees that limit and the
+  formula sizes accordingly.
+* If a pod runs without an explicit `limits.cpu`, the JVM sees the *node's*
+  full CPU count, which may not match what the pod can actually use. **Always
+  set explicit CPU limits on pipes pods.**
+
+== Shared-server mode
+
+This document only covers per-client (forked-JVM) mode, which is the
+default. In shared-server mode (`useSharedServer=true`) all clients use a
+single forked JVM, so the multi-process thread-blowup problem doesn't
+apply and the auto-cap is not applied. See
+xref:pipes/shared-server-mode.adoc[Shared Server Mode] for that mode's
+trade-offs.
diff --git a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
index 722f1ba362..2f085ed198 100644
--- a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
+++ b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
@@ -51,6 +51,17 @@ public class PerClientServerManager implements ServerManager {
     private static final Logger LOG = LoggerFactory.getLogger(PerClientServerManager.class);
     private static final long WAIT_ON_DESTROY_MS = 10000;
     public static final int SOCKET_CONNECT_TIMEOUT_MS = 60000;
+    /** Cores reserved for the parent JVM when auto-sizing forked JVMs'
+     *  -XX:ActiveProcessorCount. The parent has client-side serialization,
+     *  response deserialization, and heartbeat bookkeeping; if it's CPU-starved
+     *  small operations like socket flush show pathological tail latency. */
+    private static final int PARENT_RESERVED_CORES = 2;
+    /** Don't auto-cap below this many CPUs per fork. At cap=1 the fork's only
+     *  CPU is fully consumed by parsing, so its socket-reader thread can't run
+     *  and the parent's writes block on receiver-side back-pressure -- worse
+     *  than no cap at all. This guard matters for small k8s pods where the
+     *  formula could otherwise produce slice=1. */
+    private static final int MIN_AUTO_CAP_SLICE = 2;
 
     private final PipesConfig pipesConfig;
     private final Path tikaConfigPath;
@@ -67,6 +78,62 @@ public class PerClientServerManager implements ServerManager {
         this.pipesConfig = pipesConfig;
         this.tikaConfigPath = tikaConfigPath;
         this.clientId = clientId;
+        // Emit CPU-sizing diagnostics once per PipesParser (only on the first client).
+        if (clientId == 0) {
+            logCpuSizing();
+        }
+    }
+
+    /**
+     * Emits a one-shot summary of how the auto-cap will behave for this PipesParser,
+     * plus warnings for clearly-pathological provisioning. Grep for "pipes-cpu-sizing"
+     * in logs to see the decision the JVM made.
+     */
+    private void logCpuSizing() {
+        int hostCores = Runtime.getRuntime().availableProcessors();
+        int numClients = pipesConfig.getNumClients();
+        boolean userSetCap = pipesConfig.getForkedJvmArgs().stream()
+                .anyMatch(a -> a.startsWith("-XX:ActiveProcessorCount="));
+
+        // Hostile environment: fewer than 2 cores means the parser thread, GC, JIT,
+        // and protocol heartbeat all share one CPU. Pipes will run but tail latency
+        // will be poor regardless of numClients.
+        if (hostCores < 2) {
+            LOG.warn("pipes-cpu-sizing: hostCores={} is below the practical minimum. " +
+                    "Each fork JVM needs roughly 2 CPUs (1 for parsing, 1 for GC/JIT/" +
+                    "protocol heartbeat); on a single-CPU host these contend with each " +
+                    "other and performance will be poor.", hostCores);
+        }
+
+        // Over-provisioned: numClients packed too tightly given the host's cores.
+        // Triggers earlier than the slice<MIN guard so the user is warned even
+        // when they explicitly set -XX:ActiveProcessorCount themselves.
+        if (numClients > 1 && numClients * MIN_AUTO_CAP_SLICE + PARENT_RESERVED_CORES > hostCores) {
+            int recommendedMax = Math.max(1,
+                    (hostCores - PARENT_RESERVED_CORES) / MIN_AUTO_CAP_SLICE);
+            LOG.warn("pipes-cpu-sizing: numClients={} is over-provisioned for {}-core " +
+                    "host. Recommended max for this host: numClients={}. Forks need at " +
+                    "least {} CPUs each plus {} reserved for the parent JVM; otherwise " +
+                    "GC/JIT/protocol threads contend with parser threads across forks.",
+                    numClients, hostCores, recommendedMax,
+                    MIN_AUTO_CAP_SLICE, PARENT_RESERVED_CORES);
+        }
+
+        // Always-on summary so ops can see what was decided. Grep for "pipes-cpu-sizing".
+        String capDecision;
+        if (userSetCap) {
+            capDecision = "user-set in forkedJvmArgs";
+        } else if (numClients <= 1) {
+            capDecision = "n/a (single fork; not capped)";
+        } else {
+            int budget = Math.max(1, hostCores - PARENT_RESERVED_CORES);
+            int slice = budget / numClients;
+            capDecision = (slice >= MIN_AUTO_CAP_SLICE)
+                    ? "slice=" + slice
+                    : "skipped (slice<" + MIN_AUTO_CAP_SLICE + ")";
+        }
+        LOG.info("pipes-cpu-sizing: hostCores={}, numClients={}, parentReserved={}, " +
+                "autoCap={}", hostCores, numClients, PARENT_RESERVED_CORES, capDecision);
     }
 
     @Override
@@ -305,6 +372,7 @@ public class PerClientServerManager implements ServerManager {
         boolean hasHeadless = false;
         boolean hasExitOnOOM = false;
         boolean hasLog4j = false;
+        boolean hasActiveProcessorCount = false;
         String origGCString = null;
         String newGCLogString = null;
 
@@ -321,12 +389,45 @@ public class PerClientServerManager implements ServerManager {
             if (arg.startsWith("-Dlog4j.configuration") || arg.startsWith("-Dlog4j2.configuration")) {
                 hasLog4j = true;
             }
+            if (arg.startsWith("-XX:ActiveProcessorCount=")) {
+                hasActiveProcessorCount = true;
+            }
             if (arg.startsWith("-Xloggc:")) {
                 origGCString = arg;
                 newGCLogString = arg.replace("${pipesClientId}", "id-" + clientId);
             }
         }
 
+        // If the user hasn't explicitly set -XX:ActiveProcessorCount, size each
+        // forked JVM's view of CPUs to a fair slice of the host. Otherwise each
+        // JVM defaults its GC, JIT, and common ForkJoinPool to "all cores", which
+        // means N forked JVMs collectively spawn N x cores GC threads etc. and
+        // fight each other. We also reserve PARENT_RESERVED_CORES so the parent
+        // JVM (which serializes requests, deserializes responses, runs heartbeat
+        // bookkeeping) isn't starved for CPU.
+        // Skip the auto-cap when the computed slice would drop below
+        // MIN_AUTO_CAP_SLICE -- below that, the fork can't keep its socket
+        // reader responsive and back-pressures the parent.
+        if (!hasActiveProcessorCount && pipesConfig.getNumClients() > 1) {
+            int hostCores = Runtime.getRuntime().availableProcessors();
+            int forkBudget = Math.max(1, hostCores - PARENT_RESERVED_CORES);
+            int slice = forkBudget / pipesConfig.getNumClients();
+            if (slice >= MIN_AUTO_CAP_SLICE) {
+                configArgs.add("-XX:ActiveProcessorCount=" + slice);
+                LOG.debug("clientId={}: auto-injected -XX:ActiveProcessorCount={} " +
+                        "(hostCores={}, parentReserved={}, numClients={})",
+                        clientId, slice, hostCores, PARENT_RESERVED_CORES,
+                        pipesConfig.getNumClients());
+            } else {
+                LOG.info("clientId={}: skipping -XX:ActiveProcessorCount auto-cap " +
+                        "(would yield slice={} < MIN_AUTO_CAP_SLICE={}; " +
+                        "hostCores={}, parentReserved={}, numClients={}). " +
+                        "Consider lowering numClients on this host.",
+                        clientId, slice, MIN_AUTO_CAP_SLICE, hostCores,
+                        PARENT_RESERVED_CORES, pipesConfig.getNumClients());
+            }
+        }
+
         if (origGCString != null && newGCLogString != null) {
             configArgs.remove(origGCString);
             configArgs.add(newGCLogString);
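
Reader's note (not part of the commit): the sizing decision in the diff above can be tried out in isolation with a small standalone program. This is only a sketch under stated assumptions. The class name `CpuSizingSketch` and the helper `recommendedMaxClients` are hypothetical and do not exist in Tika; the two constants and the decision strings are copied from the diff. The host core count is passed in as a parameter instead of being read from `Runtime.getRuntime().availableProcessors()`, so that any host size can be simulated.

```java
// Hypothetical standalone sketch; mirrors the arithmetic in logCpuSizing() from the diff.
class CpuSizingSketch {
    // Values copied from the diff above.
    static final int PARENT_RESERVED_CORES = 2;
    static final int MIN_AUTO_CAP_SLICE = 2;

    /** Returns the text that the summary line would report in its autoCap field. */
    static String capDecision(int hostCores, int numClients, boolean userSetCap) {
        if (userSetCap) {
            return "user-set in forkedJvmArgs";
        }
        if (numClients <= 1) {
            return "n/a (single fork; not capped)";
        }
        // Cores left for forks after reserving the parent's share, floored at 1.
        int budget = Math.max(1, hostCores - PARENT_RESERVED_CORES);
        int slice = budget / numClients; // integer division, as in the diff
        return slice >= MIN_AUTO_CAP_SLICE
                ? "slice=" + slice
                : "skipped (slice<" + MIN_AUTO_CAP_SLICE + ")";
    }

    /** Same expression the over-provisioning warning uses for its recommended maximum. */
    static int recommendedMaxClients(int hostCores) {
        return Math.max(1, (hostCores - PARENT_RESERVED_CORES) / MIN_AUTO_CAP_SLICE);
    }

    public static void main(String[] args) {
        // (hostCores, numClients) pairs taken from the "Recommended sizing" table.
        int[][] cases = {{4, 2}, {8, 3}, {16, 4}, {16, 6}, {16, 8}, {32, 8}};
        for (int[] c : cases) {
            System.out.printf("hostCores=%d, numClients=%d, autoCap=%s, recommendedMax=%d%n",
                    c[0], c[1], capDecision(c[0], c[1], false), recommendedMaxClients(c[0]));
        }
    }
}
```

Running `main` prints one line per table row, which gives a quick way to check a planned `numClients` against a given core count before deploying.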
