This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 4bfbdf22cf TIKA-4737 -- improve docs for tika-pipes via tika-app (#2836)
4bfbdf22cf is described below
commit 4bfbdf22cf0d42f3661e5f0abcd42fd7e6190735
Author: Tim Allison <[email protected]>
AuthorDate: Tue May 26 08:54:27 2026 -0400
TIKA-4737 -- improve docs for tika-pipes via tika-app (#2836)
---
docs/modules/ROOT/pages/migration-to-4x/index.adoc | 2 +-
docs/modules/ROOT/pages/pipes/configuration.adoc | 2 +-
docs/modules/ROOT/pages/pipes/cpu-sizing.adoc | 2 +-
docs/modules/ROOT/pages/using-tika/cli/index.adoc | 44 ++++++++++++++-----
.../src/main/java/org/apache/tika/cli/TikaCLI.java | 50 +++++++++-------------
5 files changed, 57 insertions(+), 43 deletions(-)
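
Reviewer note: the activation rules that this commit documents in `using-tika/cli/index.adoc` ("How Pipes mode is activated") can be sketched as a small predicate. This is a minimal illustration, not the code in `TikaCLI.java`. The class name `PipesModeSketch`, the method `looksLikePipesMode`, and the handling of `--option=value` forms are assumptions made for the example.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import java.util.Set;

// Hypothetical helper; it mirrors the documented rules, not TikaCLI's actual code.
final class PipesModeSketch {

    // Options that the updated docs list as switching tika-app into Pipes mode.
    private static final Set<String> PIPES_OPTIONS = Set.of(
            "-i", "-o", "--input", "--output", "--fileList",
            "-z", "-Z", "--extract", "--extract-dir", "-a", "--async");

    static boolean looksLikePipesMode(String[] args) {
        // Rule: any Pipes option is present.
        for (String arg : args) {
            // Assumption: an option may be written as --name=value.
            int eq = arg.indexOf('=');
            String name = eq > 0 ? arg.substring(0, eq) : arg;
            if (PIPES_OPTIONS.contains(name)) {
                return true;
            }
        }
        // Rule: a single .json argument is treated as a Tika Pipes config file.
        if (args.length == 1 && !args[0].startsWith("-") && args[0].endsWith(".json")) {
            return true;
        }
        // Rule: two positional arguments, the first an existing directory.
        if (args.length == 2 && !args[0].startsWith("-") && !args[1].startsWith("-")) {
            try {
                return Files.isDirectory(Paths.get(args[0]));
            } catch (InvalidPathException e) {
                return false; // not a usable path, so not Pipes mode
            }
        }
        // Anything else stays in standard single-document mode.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikePipesMode(args));
    }
}
```

Under these assumptions, `-i /in -o /out` and a lone `pipes-config.json` both select Pipes mode, while `--config=tika-config.json document.pdf` stays in single-document mode because `--config` is not one of the listed triggers.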
diff --git a/docs/modules/ROOT/pages/migration-to-4x/index.adoc b/docs/modules/ROOT/pages/migration-to-4x/index.adoc
index 39675318c5..9333046aa6 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/index.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/index.adoc
@@ -49,4 +49,4 @@ The following tika-app options for dumping configuration are not yet available:
These require completing the JSON serialization support for TikaConfig objects. The underlying serialization infrastructure exists (see xref:migration-to-4x/serialization-4x.adoc[Serialization]) but the CLI integration is pending.
-*Workaround:* Manually create JSON config files using the templates in `tika-pipes/tika-async-cli/src/main/resources/config-template.json` as a starting point.
+*Workaround:* Manually create JSON config files using the xref:pipes/configuration.adoc#config-template[Tika Pipes config template] as a starting point.
diff --git a/docs/modules/ROOT/pages/pipes/configuration.adoc b/docs/modules/ROOT/pages/pipes/configuration.adoc
index f6b3d5c2b6..f0a004ae98 100644
--- a/docs/modules/ROOT/pages/pipes/configuration.adoc
+++ b/docs/modules/ROOT/pages/pipes/configuration.adoc
@@ -191,7 +191,7 @@ icon:github[] https://github.com/apache/tika/blob/main/tika-pipes/tika-pipes-int
See xref:pipes/shared-server-mode.adoc[Shared Server Mode] for the trade-offs.
[#config-template]
-=== `tika-async-cli` config template
+=== Tika Pipes config template
[source,json,subs=none]
----
diff --git a/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
index 997ea159b1..0745ba0db1 100644
--- a/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
+++ b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
@@ -41,7 +41,7 @@ Where `per_fork_slice ≥ 2`:
* 1 CPU for everything else the JVM does (GC concurrent worker, JIT,
protocol heartbeat, socket I/O thread)
-The parent JVM (the one running `tika-async-cli` / `tika-app -a`) is light
+The parent JVM (the one running `tika-app` in Tika Pipes mode) is light
on CPU — it just serializes requests, deserializes responses, and runs
the heartbeat — but it must not be CPU-starved. A starved parent shows up
as pathological tail latency on small operations like `socket.write()`,
diff --git a/docs/modules/ROOT/pages/using-tika/cli/index.adoc b/docs/modules/ROOT/pages/using-tika/cli/index.adoc
index c9f9da8f03..594828fc78 100644
--- a/docs/modules/ROOT/pages/using-tika/cli/index.adoc
+++ b/docs/modules/ROOT/pages/using-tika/cli/index.adoc
@@ -32,7 +32,7 @@ text content and metadata from all sorts of files.
NOTE: As of 4.x, `tika-app` is distributed as a zip archive rather than a single
self-contained jar. The bare `tika-app-<version>.jar` is only a thin launcher and
will fail with `NoClassDefFoundError` if run on its own — the parsers and supporting
-modules (including the batch processor) live in the adjacent `lib/` directory.
+modules (including the Tika Pipes processor) live in the adjacent `lib/` directory.
Download `tika-app-<version>.zip`, unzip it, and run `tika-app-<version>.jar` from
inside the unzipped directory so that `lib/` and `plugins/` sit alongside the jar:
@@ -138,9 +138,10 @@ Extract text from a remote document and search for keywords:
curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
----
-=== Batch processing
+=== Tika Pipes processing
-Process entire directories by specifying input and output paths:
+Process many documents by specifying input and output paths. Inputs can be a
+local directory, S3, GCS, Azure, JDBC, and others via Tika Pipes fetchers:
[source,bash]
----
@@ -163,13 +164,28 @@ Use a custom configuration file:
java -jar tika-app.jar --config=tika-config.json document.pdf
----
-== Batch Processing
+== Tika Pipes Processing
-For processing large numbers of files, run `tika-app` with input/output directories.
-Under the hood this uses Tika Pipes batch processing, with forked JVM processes for
-fault tolerance.
+For processing many documents — from a local directory, S3, GCS, Azure, JDBC,
+or any other Tika Pipes source — run `tika-app` with input/output paths.
+Under the hood this is Tika Pipes, dispatched asynchronously into forked JVM
+processes for fault tolerance. Tika prints a one-line banner to `stderr` when
+it switches into Pipes mode so you can confirm which path is running.
-=== Basic Batch Usage
+=== How Pipes mode is activated
+
+`tika-app` enters Pipes mode automatically when any of the following are true:
+
+* Two positional arguments are given and the first is an existing directory
+ (`tika-app.jar /in /out`).
+* Any of these options are present: `-i`, `-o`, `--input`, `--output`,
+ `--fileList`, `-z`/`-Z`/`--extract`/`--extract-dir`, or `-a`/`--async`.
+* A single `.json` argument is given — it is treated as a Tika Pipes config file.
+
+Anything else (single file, URL, stdin, `--gui`, `--server`) stays in standard
+single-document mode.
+
+=== Basic Pipes Usage
[source,bash]
----
@@ -179,7 +195,7 @@ java -jar tika-app.jar -i /path/to/input -o /path/to/output
This processes all files in the input directory and writes JSON metadata (RMETA format)
to the output directory.
-=== Batch Options
+=== Tika Pipes Options
[cols="1,3"]
|===
@@ -213,7 +229,7 @@ to the output directory.
|Plugins directory
|===
-=== Batch Examples
+=== Tika Pipes Examples
Extract markdown content only (no metadata) from all files:
@@ -231,3 +247,11 @@ Extract text with all metadata in concatenated mode:
----
java -jar tika-app.jar -i /path/to/input -o /path/to/output --concatenate
----
+
+Use a Tika config file alongside the Pipes options. Both `--config=foo.json`
+(the standard-mode long form) and `-c foo.json` work:
+
+[source,bash]
+----
+java -jar tika-app.jar -i /path/to/input -o /path/to/output --config=tika-config.json
+----
diff --git a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
index d4a5628489..6ed26567b6 100644
--- a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
+++ b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
@@ -285,6 +285,8 @@ public class TikaCLI {
}
private static void async(String[] args) throws Exception {
+ System.err.println("tika-app: running in Tika Pipes mode "
+ + "(async dispatch via -a/--async); use --help for options.");
args = AsyncHelper.translateArgs(args);
String tikaConfigPath = "";
//TODO - runpack is a smelly. fix this.
@@ -305,14 +307,14 @@ public class TikaCLI {
invokeAsyncCLI(args);
return;
}
- // For batch mode (two directories), pass directly to TikaAsyncCLI.
+ // For Pipes mode (two directories), pass directly to TikaAsyncCLI.
// It will create its own config with PluginsWriter that includes
// plugin-roots, fetcher, emitter, and pipes-iterator configuration.
invokeAsyncCLI(args);
}
/**
- * Invokes the batch/async processor ({@code tika-async-cli}). The async
+ * Invokes the Tika Pipes async processor ({@code tika-async-cli}). The async
* processor and the parsers it forks live in the {@code lib/} directory of
* the tika-app distribution rather than inside the bare {@code tika-app.jar}.
* If tika-app is run as a standalone jar (without the surrounding unzipped
@@ -326,9 +328,9 @@ public class TikaCLI {
try {
TikaAsyncCLI.main(args);
} catch (NoClassDefFoundError e) {
- System.err.println("Error: could not load the Tika batch/async processor (" +
+ System.err.println("Error: could not load the Tika Pipes processor (" +
    e.getMessage() + ").");
- System.err.println("Batch mode requires the full tika-app distribution, not the "
+ System.err.println("Tika Pipes mode requires the full tika-app distribution, not the "
    + "standalone jar.");
System.err.println("Download tika-app-<version>.zip, unzip it, and run "
    + "tika-app-<version>.jar from inside the unzipped directory so that the "
@@ -401,7 +403,7 @@ public class TikaCLI {
}
}
- // Check if last two args are directories (batch mode with options)
+ // Check if last two args are directories (Pipes mode with options)
if (args.length >= 2) {
String lastArg = args[args.length - 1];
String secondLastArg = args[args.length - 2];
@@ -413,7 +415,7 @@ public class TikaCLI {
return true;
}
} catch (Exception e) {
- // Invalid path, not batch mode
+ // Invalid path, not Pipes mode
}
}
}
@@ -852,20 +854,25 @@ public class TikaCLI {
out.println(" a normal file explorer to the GUI window to extract");
out.println(" text content and metadata from the files.");
out.println();
- out.println("- Batch mode");
+ out.println("- Tika Pipes mode");
out.println();
- out.println(" Simplest method.");
- out.println(" Specify two directories as args with no other args:");
+ out.println(" For processing many documents from a directory, S3, GCS, Azure, JDBC, etc.");
+ out.println(" Simplest invocation is two directories as args with no other args:");
out.println(" java -jar tika-app.jar <inputDirectory> <outputDirectory>");
out.println();
- out.println("Batch/Pipes Options:");
+ out.println("Tika Pipes Options:");
out.println(" -i Input directory");
out.println(" -o Output directory");
- out.println(" -n Number of forked processes");
+ out.println(" -n, --numClients Number of forked processes");
out.println(" -X -Xmx in the forked processes");
- out.println(" -T Timeout in milliseconds");
- out.println(" --fileList File list (one path per line, relative to -i or absolute)");
+ out.println(" -T, --timeoutMs Timeout for each parse in milliseconds");
+ out.println(" -c, --config=<file> Tika config file (--config=<file> also accepted)");
+ out.println(" -p, --pluginsDir Plugins directory");
+ out.println(" --fileList File list (one path per line, relative to -i or absolute)");
out.println(" --handler Handler type: t=text, h=html, x=xml, m=markdown, b=body, i=ignore");
+ out.println(" --concatenate Concatenate content from all embedded documents");
+ out.println(" --content-only Output only extracted content (no JSON wrapper); implies --concatenate");
+ out.println(" --on-exists Behavior when an output file exists: exception (default), replace, skip");
out.println(" -Z Recursively unpack all the attachments, too");
out.println(" --unpack-format=<format> Output format: REGULAR (default) or FRICTIONLESS");
out.println(" --unpack-mode=<mode> Output mode: ZIPPED (default) or DIRECTORY");
@@ -887,23 +894,6 @@ public class TikaCLI {
return false;
}
- private boolean testForBatch(String[] args) {
- if (args.length == 2 && !args[0].startsWith("-") && !args[1].startsWith("-")) {
- Path inputCand = Paths.get(args[0]);
- Path outputCand = Paths.get(args[1]);
- if (Files.isDirectory(inputCand) && !Files.isRegularFile(outputCand)) {
- return true;
- }
- }
-
- for (String s : args) {
- if (s.equals("-inputDir") || s.equals("--inputDir") || s.equals("-i")) {
- return true;
- }
- }
- return false;
- }
-
private void configure() throws TikaException, IOException, SAXException {
if (configFilePath != null) {
tikaLoader = TikaLoader.load(Paths.get(configFilePath));