This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 6b53816661 Automatically add slices and/or log underprovisioned pipes
configurations (#2793)
6b53816661 is described below
commit 6b538166612ef83aaefd838349bf8c59713faf40
Author: Tim Allison <[email protected]>
AuthorDate: Tue Apr 28 21:09:26 2026 -0400
Automatically add slices and/or log underprovisioned pipes configurations
(#2793)
---
docs/modules/ROOT/nav.adoc | 1 +
docs/modules/ROOT/pages/pipes/configuration.adoc | 4 +-
docs/modules/ROOT/pages/pipes/cpu-sizing.adoc | 134 +++++++++++++++++++++
.../tika/pipes/core/PerClientServerManager.java | 101 ++++++++++++++++
4 files changed, 238 insertions(+), 2 deletions(-)
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 9ae77c03fb..16429e45bf 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -30,6 +30,7 @@
** xref:pipes/parse-modes.adoc[Parse Modes]
** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
** xref:pipes/timeouts.adoc[Timeouts]
+** xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing]
* xref:configuration/index.adoc[Configuration]
** xref:configuration/parsers/pdf-parser.adoc[PDF Parser]
** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
diff --git a/docs/modules/ROOT/pages/pipes/configuration.adoc
b/docs/modules/ROOT/pages/pipes/configuration.adoc
index 7204d39589..c6614e7811 100644
--- a/docs/modules/ROOT/pages/pipes/configuration.adoc
+++ b/docs/modules/ROOT/pages/pipes/configuration.adoc
@@ -42,11 +42,11 @@ how many forked JVMs to run, timeouts, memory management,
and parse behavior.
|`numClients`
|`4`
-|Number of parallel forked JVM processes. Each processes one document at a
time.
+|Number of parallel forked JVM processes. Each processes one document at a
time. See xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing] for guidance on
choosing this value relative to host CPU count.
|`forkedJvmArgs`
|`[]`
-|JVM arguments for forked processes (e.g., `["-Xmx512m", "-Xms256m"]`).
+|JVM arguments for forked processes (e.g., `["-Xmx512m", "-Xms256m"]`). When
`numClients > 1`, Tika auto-injects `-XX:ActiveProcessorCount` to right-size
each fork's GC and JIT thread pools unless you provide your own; see
xref:pipes/cpu-sizing.adoc[Forked-JVM CPU Sizing].
|`javaPath`
|`java`
diff --git a/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
new file mode 100644
index 0000000000..e7dd9810ed
--- /dev/null
+++ b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
@@ -0,0 +1,134 @@
+= Forked-JVM CPU Sizing
+
+Tika Pipes runs multiple forked JVMs in per-client mode (one per `numClients`).
+Each JVM independently sizes its garbage collector, JIT compiler, and common
+`ForkJoinPool` based on the host CPU count. Without intervention, this causes
+thread-pool blowup at high `numClients`: e.g., 4 forks on a 16-core host
+default to ~16 GC threads × 4 = ~64 GC threads, all competing for the same 16
+cores.
+
+To fix this, Tika Pipes auto-injects `-XX:ActiveProcessorCount` into each
+forked JVM's command line, sizing each fork's view of the CPU count to a fair
+slice of the host. This is on by default in per-client mode (`numClients > 1`)
+when the user has not already supplied `-XX:ActiveProcessorCount` in
+`forkedJvmArgs`.
+
+== Mental model
+
+----
+pod_cpus = parent_overhead (≈ 2) + numClients × per_fork_slice
+----
+
+Where `per_fork_slice ≥ 2`:
+
+* 1 CPU for the parser thread
+* 1 CPU for everything else the JVM does (GC concurrent worker, JIT,
+ protocol heartbeat, socket I/O thread)
+
+The parent JVM (the one running `tika-async-cli` / `tika-app -a`) is light
+on CPU — it just serializes requests, deserializes responses, and runs
+the heartbeat — but it must not be CPU-starved. A starved parent shows up
+as pathological tail latency on small operations like `socket.write()`,
+because the calling thread gets preempted between clock reads. We reserve
+2 cores for the parent by default.
+
+== Formula
+
+[source]
+----
+slice = (hostCores - PARENT_RESERVED_CORES) / numClients
+
+PARENT_RESERVED_CORES = 2
+MIN_AUTO_CAP_SLICE = 2
+----
+
+If `slice ≥ 2`, Tika injects `-XX:ActiveProcessorCount=<slice>` into each
+forked JVM. If `slice < 2`, the auto-cap is *skipped* and a `WARN` is
+logged advising the operator to lower `numClients`. Skipping is intentional:
+at `slice=1` the fork's only CPU is fully consumed by parsing, so its
+socket-reader thread cannot run and the parent's writes block on
+receiver-side back-pressure — measurably worse than no cap at all.
+
+== Recommended sizing
+
+For typical cloud-VM core counts:
+
+[cols="1,1,1,3"]
+|===
+|hostCores |numClients |slice |Notes
+
+|2 |1 |n/a |Tight; auto-cap not applied (single fork). Acceptable for low
throughput.
+|4 |1 |n/a |Comfortable single-fork deployment.
+|4 |2 |1 → skipped |Auto-cap declines; consider `numClients=1`.
+|8 |1 |n/a |Lots of headroom; single-fork lifecycle isolation is fine.
+|8 |3 |2 |Sweet spot for medium pods.
+|16 |4 |3 |Sweet spot for 16-core hosts. Measured winner in benchmarks.
+|16 |6 |2 |Higher concurrency; tighter per-fork breathing room.
+|16 |8 |1 → skipped |Doesn't fit 16 cores. Keep at 4 or 6.
+|32 |8 |3 |Same shape as 16/4.
+|===
+
+The general rule is: pick the largest `numClients` that satisfies
+`numClients × 2 + 2 ≤ hostCores`. Beyond that point, adding workers
+starts hurting throughput.
+
+== Diagnostics
+
+Every `PipesParser` startup emits a one-shot summary line on its main
+logger so operators can see what was decided:
+
+[source]
+----
+INFO pipes-cpu-sizing: hostCores=16, numClients=4, parentReserved=2,
autoCap=slice=3
+----
+
+The `autoCap` field is one of:
+
+* `slice=N` — the auto-cap fired; each fork sees N CPUs.
+* `skipped (slice<2)` — over-provisioned; operator should reduce `numClients`.
+* `n/a (single fork; not capped)` — `numClients=1`; fork sees the whole host.
+* `user-set in forkedJvmArgs` — operator set `-XX:ActiveProcessorCount`
themselves.
+
+Two `WARN`-level messages call out clearly-bad provisioning:
+
+* `hostCores < 2` — the host has no room for the parser plus background JVM
threads.
+* `numClients × 2 + 2 > hostCores` — the host is too small for the requested
concurrency.
+
+`grep pipes-cpu-sizing` on the parent's logs surfaces all sizing-related
output.
+
+== Disabling or overriding
+
+If you want to manage `ActiveProcessorCount` yourself (e.g., to allocate a
+different slice based on workload knowledge), just include it in your config:
+
+[source,json]
+----
+"pipes": {
+ "numClients": 4,
+ "forkedJvmArgs": ["-Xmx512m", "-XX:ActiveProcessorCount=4"]
+}
+----
+
+When Tika sees an explicit `-XX:ActiveProcessorCount` in `forkedJvmArgs`, it
+respects your value and skips the auto-injection — the sizing summary will
+report `autoCap=user-set in forkedJvmArgs`.
+
+== Container & cgroup behavior
+
+The formula uses `Runtime.availableProcessors()` for the host CPU count,
+which on JDK 17+ honors cgroup CPU limits. So in Kubernetes:
+
+* If a pod has `resources.limits.cpu` set, the JVM sees that limit and the
+ formula sizes accordingly.
+* If a pod runs without an explicit `limits.cpu`, the JVM sees the *node's*
+ full CPU count, which may not match what the pod can actually use. **Always
+ set explicit CPU limits on pipes pods.**
+
+== Shared-server mode
+
+This document only covers per-client (forked-JVM) mode, which is the
+default. In shared-server mode (`useSharedServer=true`) all clients use a
+single forked JVM, so the multi-process thread-blowup problem doesn't
+apply and the auto-cap is not applied. See
+xref:pipes/shared-server-mode.adoc[Shared Server Mode] for that mode's
+trade-offs.
diff --git
a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
index 722f1ba362..2f085ed198 100644
---
a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
+++
b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/PerClientServerManager.java
@@ -51,6 +51,17 @@ public class PerClientServerManager implements ServerManager
{
private static final Logger LOG =
LoggerFactory.getLogger(PerClientServerManager.class);
private static final long WAIT_ON_DESTROY_MS = 10000;
public static final int SOCKET_CONNECT_TIMEOUT_MS = 60000;
+ /** Cores reserved for the parent JVM when auto-sizing forked JVMs'
+ * -XX:ActiveProcessorCount. The parent has client-side serialization,
+ * response deserialization, and heartbeat bookkeeping; if it's
CPU-starved
+ * small operations like socket flush show pathological tail latency. */
+ private static final int PARENT_RESERVED_CORES = 2;
+ /** Don't auto-cap below this many CPUs per fork. At cap=1 the fork's only
+ * CPU is fully consumed by parsing, so its socket-reader thread can't run
+ * and the parent's writes block on receiver-side back-pressure -- worse
+ * than no cap at all. This guard matters for small k8s pods where the
+ * formula could otherwise produce slice=1. */
+ private static final int MIN_AUTO_CAP_SLICE = 2;
private final PipesConfig pipesConfig;
private final Path tikaConfigPath;
@@ -67,6 +78,62 @@ public class PerClientServerManager implements ServerManager
{
this.pipesConfig = pipesConfig;
this.tikaConfigPath = tikaConfigPath;
this.clientId = clientId;
+ // Emit CPU-sizing diagnostics once per PipesParser (only on the first
client).
+ if (clientId == 0) {
+ logCpuSizing();
+ }
+ }
+
+ /**
+ * Emits a one-shot summary of how the auto-cap will behave for this
PipesParser,
+ * plus warnings for clearly-pathological provisioning. Grep for
"pipes-cpu-sizing"
+ * in logs to see the decision the JVM made.
+ */
+ private void logCpuSizing() {
+ int hostCores = Runtime.getRuntime().availableProcessors();
+ int numClients = pipesConfig.getNumClients();
+ boolean userSetCap = pipesConfig.getForkedJvmArgs().stream()
+ .anyMatch(a -> a.startsWith("-XX:ActiveProcessorCount="));
+
+ // Hostile environment: fewer than 2 cores means the parser thread,
GC, JIT,
+ // and protocol heartbeat all share one CPU. Pipes will run but tail
latency
+ // will be poor regardless of numClients.
+ if (hostCores < 2) {
+ LOG.warn("pipes-cpu-sizing: hostCores={} is below the practical
minimum. " +
+ "Each fork JVM needs roughly 2 CPUs (1 for parsing, 1 for
GC/JIT/" +
+ "protocol heartbeat); on a single-CPU host these contend
with each " +
+ "other and performance will be poor.", hostCores);
+ }
+
+ // Over-provisioned: numClients packed too tightly given the host's
cores.
+ // Triggers earlier than the slice<MIN guard so the user is warned even
+ // when they explicitly set -XX:ActiveProcessorCount themselves.
+ if (numClients > 1 && numClients * MIN_AUTO_CAP_SLICE +
PARENT_RESERVED_CORES > hostCores) {
+ int recommendedMax = Math.max(1,
+ (hostCores - PARENT_RESERVED_CORES) / MIN_AUTO_CAP_SLICE);
+ LOG.warn("pipes-cpu-sizing: numClients={} is over-provisioned for
{}-core " +
+ "host. Recommended max for this host: numClients={}. Forks
need at " +
+ "least {} CPUs each plus {} reserved for the parent JVM;
otherwise " +
+ "GC/JIT/protocol threads contend with parser threads
across forks.",
+ numClients, hostCores, recommendedMax,
+ MIN_AUTO_CAP_SLICE, PARENT_RESERVED_CORES);
+ }
+
+ // Always-on summary so ops can see what was decided. Grep for
"pipes-cpu-sizing".
+ String capDecision;
+ if (userSetCap) {
+ capDecision = "user-set in forkedJvmArgs";
+ } else if (numClients <= 1) {
+ capDecision = "n/a (single fork; not capped)";
+ } else {
+ int budget = Math.max(1, hostCores - PARENT_RESERVED_CORES);
+ int slice = budget / numClients;
+ capDecision = (slice >= MIN_AUTO_CAP_SLICE)
+ ? "slice=" + slice
+ : "skipped (slice<" + MIN_AUTO_CAP_SLICE + ")";
+ }
+ LOG.info("pipes-cpu-sizing: hostCores={}, numClients={},
parentReserved={}, " +
+ "autoCap={}", hostCores, numClients, PARENT_RESERVED_CORES,
capDecision);
}
@Override
@@ -305,6 +372,7 @@ public class PerClientServerManager implements
ServerManager {
boolean hasHeadless = false;
boolean hasExitOnOOM = false;
boolean hasLog4j = false;
+ boolean hasActiveProcessorCount = false;
String origGCString = null;
String newGCLogString = null;
@@ -321,12 +389,45 @@ public class PerClientServerManager implements
ServerManager {
if (arg.startsWith("-Dlog4j.configuration") ||
arg.startsWith("-Dlog4j2.configuration")) {
hasLog4j = true;
}
+ if (arg.startsWith("-XX:ActiveProcessorCount=")) {
+ hasActiveProcessorCount = true;
+ }
if (arg.startsWith("-Xloggc:")) {
origGCString = arg;
newGCLogString = arg.replace("${pipesClientId}", "id-" +
clientId);
}
}
+ // If the user hasn't explicitly set -XX:ActiveProcessorCount, size
each
+ // forked JVM's view of CPUs to a fair slice of the host. Otherwise
each
+ // JVM defaults its GC, JIT, and common ForkJoinPool to "all cores",
which
+ // means N forked JVMs collectively spawn N x cores GC threads etc. and
+ // fight each other. We also reserve PARENT_RESERVED_CORES so the
parent
+ // JVM (which serializes requests, deserializes responses, runs
heartbeat
+ // bookkeeping) isn't starved for CPU.
+ // Skip the auto-cap when the computed slice would drop below
+ // MIN_AUTO_CAP_SLICE -- below that, the fork can't keep its socket
+ // reader responsive and back-pressures the parent.
+ if (!hasActiveProcessorCount && pipesConfig.getNumClients() > 1) {
+ int hostCores = Runtime.getRuntime().availableProcessors();
+ int forkBudget = Math.max(1, hostCores - PARENT_RESERVED_CORES);
+ int slice = forkBudget / pipesConfig.getNumClients();
+ if (slice >= MIN_AUTO_CAP_SLICE) {
+ configArgs.add("-XX:ActiveProcessorCount=" + slice);
+ LOG.debug("clientId={}: auto-injected
-XX:ActiveProcessorCount={} " +
+ "(hostCores={}, parentReserved={}, numClients={})",
+ clientId, slice, hostCores, PARENT_RESERVED_CORES,
+ pipesConfig.getNumClients());
+ } else {
+ LOG.info("clientId={}: skipping -XX:ActiveProcessorCount
auto-cap " +
+ "(would yield slice={} < MIN_AUTO_CAP_SLICE={}; " +
+ "hostCores={}, parentReserved={}, numClients={}). " +
+ "Consider lowering numClients on this host.",
+ clientId, slice, MIN_AUTO_CAP_SLICE, hostCores,
+ PARENT_RESERVED_CORES, pipesConfig.getNumClients());
+ }
+ }
+
if (origGCString != null && newGCLogString != null) {
configArgs.remove(origGCString);
configArgs.add(newGCLogString);