This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/main by this push:
     new 4bfbdf22cf TIKA-4737 -- improve docs for tika-pipes via tika-app (#2836)
4bfbdf22cf is described below

commit 4bfbdf22cf0d42f3661e5f0abcd42fd7e6190735
Author: Tim Allison <[email protected]>
AuthorDate: Tue May 26 08:54:27 2026 -0400

    TIKA-4737 -- improve docs for tika-pipes via tika-app (#2836)
---
 docs/modules/ROOT/pages/migration-to-4x/index.adoc |  2 +-
 docs/modules/ROOT/pages/pipes/configuration.adoc   |  2 +-
 docs/modules/ROOT/pages/pipes/cpu-sizing.adoc      |  2 +-
 docs/modules/ROOT/pages/using-tika/cli/index.adoc  | 44 ++++++++++++++-----
 .../src/main/java/org/apache/tika/cli/TikaCLI.java | 50 +++++++++-------------
 5 files changed, 57 insertions(+), 43 deletions(-)

diff --git a/docs/modules/ROOT/pages/migration-to-4x/index.adoc b/docs/modules/ROOT/pages/migration-to-4x/index.adoc
index 39675318c5..9333046aa6 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/index.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/index.adoc
@@ -49,4 +49,4 @@ The following tika-app options for dumping configuration are not yet available:
 
 These require completing the JSON serialization support for TikaConfig objects. The underlying serialization infrastructure exists (see xref:migration-to-4x/serialization-4x.adoc[Serialization]) but the CLI integration is pending.
 
-*Workaround:* Manually create JSON config files using the templates in `tika-pipes/tika-async-cli/src/main/resources/config-template.json` as a starting point.
+*Workaround:* Manually create JSON config files using the xref:pipes/configuration.adoc#config-template[Tika Pipes config template] as a starting point.
diff --git a/docs/modules/ROOT/pages/pipes/configuration.adoc b/docs/modules/ROOT/pages/pipes/configuration.adoc
index f6b3d5c2b6..f0a004ae98 100644
--- a/docs/modules/ROOT/pages/pipes/configuration.adoc
+++ b/docs/modules/ROOT/pages/pipes/configuration.adoc
@@ -191,7 +191,7 @@ icon:github[] https://github.com/apache/tika/blob/main/tika-pipes/tika-pipes-int
 See xref:pipes/shared-server-mode.adoc[Shared Server Mode] for the trade-offs.
 
 [#config-template]
-=== `tika-async-cli` config template
+=== Tika Pipes config template
 
 [source,json,subs=none]
 ----
diff --git a/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
index 997ea159b1..0745ba0db1 100644
--- a/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
+++ b/docs/modules/ROOT/pages/pipes/cpu-sizing.adoc
@@ -41,7 +41,7 @@ Where `per_fork_slice ≥ 2`:
 * 1 CPU for everything else the JVM does (GC concurrent worker, JIT,
   protocol heartbeat, socket I/O thread)
 
-The parent JVM (the one running `tika-async-cli` / `tika-app -a`) is light
+The parent JVM (the one running `tika-app` in Tika Pipes mode) is light
 on CPU — it just serializes requests, deserializes responses, and runs
 the heartbeat — but it must not be CPU-starved. A starved parent shows up
 as pathological tail latency on small operations like `socket.write()`,
diff --git a/docs/modules/ROOT/pages/using-tika/cli/index.adoc b/docs/modules/ROOT/pages/using-tika/cli/index.adoc
index c9f9da8f03..594828fc78 100644
--- a/docs/modules/ROOT/pages/using-tika/cli/index.adoc
+++ b/docs/modules/ROOT/pages/using-tika/cli/index.adoc
@@ -32,7 +32,7 @@ text content and metadata from all sorts of files.
 NOTE: As of 4.x, `tika-app` is distributed as a zip archive rather than a single
 self-contained jar. The bare `tika-app-<version>.jar` is only a thin launcher and
 will fail with `NoClassDefFoundError` if run on its own — the parsers and supporting
-modules (including the batch processor) live in the adjacent `lib/` directory.
+modules (including the Tika Pipes processor) live in the adjacent `lib/` directory.
 
 Download `tika-app-<version>.zip`, unzip it, and run `tika-app-<version>.jar` from
 inside the unzipped directory so that `lib/` and `plugins/` sit alongside the jar:
@@ -138,9 +138,10 @@ Extract text from a remote document and search for keywords:
 curl http://example.com/document.doc | java -jar tika-app.jar --text | grep -q keyword
 ----
 
-=== Batch processing
+=== Tika Pipes processing
 
-Process entire directories by specifying input and output paths:
+Process many documents by specifying input and output paths. Inputs can be a
+local directory, S3, GCS, Azure, JDBC, and others via Tika Pipes fetchers:
 
 [source,bash]
 ----
@@ -163,13 +164,28 @@ Use a custom configuration file:
 java -jar tika-app.jar --config=tika-config.json document.pdf
 ----
 
-== Batch Processing
+== Tika Pipes Processing
 
-For processing large numbers of files, run `tika-app` with input/output directories.
-Under the hood this uses Tika Pipes batch processing, with forked JVM processes for
-fault tolerance.
+For processing many documents — from a local directory, S3, GCS, Azure, JDBC,
+or any other Tika Pipes source — run `tika-app` with input/output paths.
+Under the hood this is Tika Pipes, dispatched asynchronously into forked JVM
+processes for fault tolerance. Tika prints a one-line banner to `stderr` when
+it switches into Pipes mode so you can confirm which path is running.
 
-=== Basic Batch Usage
+=== How Pipes mode is activated
+
+`tika-app` enters Pipes mode automatically when any of the following are true:
+
+* Two positional arguments are given and the first is an existing directory
+  (`tika-app.jar /in /out`).
+* Any of these options are present: `-i`, `-o`, `--input`, `--output`,
+  `--fileList`, `-z`/`-Z`/`--extract`/`--extract-dir`, or `-a`/`--async`.
+* A single `.json` argument is given — it is treated as a Tika Pipes config file.
+
+Anything else (single file, URL, stdin, `--gui`, `--server`) stays in standard
+single-document mode.
+
+=== Basic Pipes Usage
 
 [source,bash]
 ----
@@ -179,7 +195,7 @@ java -jar tika-app.jar -i /path/to/input -o /path/to/output
 This processes all files in the input directory and writes JSON metadata (RMETA format)
 to the output directory.
 
-=== Batch Options
+=== Tika Pipes Options
 
 [cols="1,3"]
 |===
@@ -213,7 +229,7 @@ to the output directory.
 |Plugins directory
 |===
 
-=== Batch Examples
+=== Tika Pipes Examples
 
 Extract markdown content only (no metadata) from all files:
 
@@ -231,3 +247,11 @@ Extract text with all metadata in concatenated mode:
 ----
 java -jar tika-app.jar -i /path/to/input -o /path/to/output --concatenate
 ----
+
+Use a Tika config file alongside the Pipes options. Both `--config=foo.json`
+(the standard-mode long form) and `-c foo.json` work:
+
+[source,bash]
+----
+java -jar tika-app.jar -i /path/to/input -o /path/to/output --config=tika-config.json
+----
diff --git a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
index d4a5628489..6ed26567b6 100644
--- a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
+++ b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
@@ -285,6 +285,8 @@ public class TikaCLI {
     }
 
     private static void async(String[] args) throws Exception {
+        System.err.println("tika-app: running in Tika Pipes mode "
+                + "(async dispatch via -a/--async); use --help for options.");
         args = AsyncHelper.translateArgs(args);
         String tikaConfigPath = "";
         //TODO - runpack is a smelly. fix this.
@@ -305,14 +307,14 @@ public class TikaCLI {
             invokeAsyncCLI(args);
             return;
         }
-        // For batch mode (two directories), pass directly to TikaAsyncCLI.
+        // For Pipes mode (two directories), pass directly to TikaAsyncCLI.
         // It will create its own config with PluginsWriter that includes
         // plugin-roots, fetcher, emitter, and pipes-iterator configuration.
         invokeAsyncCLI(args);
     }
 
     /**
-     * Invokes the batch/async processor ({@code tika-async-cli}). The async
+     * Invokes the Tika Pipes async processor ({@code tika-async-cli}). The async
      * processor and the parsers it forks live in the {@code lib/} directory of
      * the tika-app distribution rather than inside the bare {@code tika-app.jar}.
      * If tika-app is run as a standalone jar (without the surrounding unzipped
@@ -326,9 +328,9 @@ public class TikaCLI {
         try {
             TikaAsyncCLI.main(args);
         } catch (NoClassDefFoundError e) {
-            System.err.println("Error: could not load the Tika batch/async processor (" +
+            System.err.println("Error: could not load the Tika Pipes processor (" +
                     e.getMessage() + ").");
-            System.err.println("Batch mode requires the full tika-app distribution, not the "
+            System.err.println("Tika Pipes mode requires the full tika-app distribution, not the "
                     + "standalone jar.");
             System.err.println("Download tika-app-<version>.zip, unzip it, and run "
                     + "tika-app-<version>.jar from inside the unzipped directory so that the "
@@ -401,7 +403,7 @@ public class TikaCLI {
             }
         }
 
-        // Check if last two args are directories (batch mode with options)
+        // Check if last two args are directories (Pipes mode with options)
         if (args.length >= 2) {
             String lastArg = args[args.length - 1];
             String secondLastArg = args[args.length - 2];
@@ -413,7 +415,7 @@ public class TikaCLI {
                         return true;
                     }
                 } catch (Exception e) {
-                    // Invalid path, not batch mode
+                    // Invalid path, not Pipes mode
                 }
             }
         }
@@ -852,20 +854,25 @@ public class TikaCLI {
         out.println("    a normal file explorer to the GUI window to extract");
         out.println("    text content and metadata from the files.");
         out.println();
-        out.println("- Batch mode");
+        out.println("- Tika Pipes mode");
         out.println();
-        out.println("    Simplest method.");
-        out.println("    Specify two directories as args with no other args:");
+        out.println("    For processing many documents from a directory, S3, GCS, Azure, JDBC, etc.");
+        out.println("    Simplest invocation is two directories as args with no other args:");
         out.println("         java -jar tika-app.jar <inputDirectory> <outputDirectory>");
         out.println();
-        out.println("Batch/Pipes Options:");
+        out.println("Tika Pipes Options:");
         out.println("    -i                         Input directory");
         out.println("    -o                         Output directory");
-        out.println("    -n                         Number of forked processes");
+        out.println("    -n, --numClients           Number of forked processes");
         out.println("    -X                         -Xmx in the forked processes");
-        out.println("    -T                         Timeout in milliseconds");
-        out.println("    --fileList                  File list (one path per line, relative to -i or absolute)");
+        out.println("    -T, --timeoutMs            Timeout for each parse in milliseconds");
+        out.println("    -c, --config=<file>        Tika config file (--config=<file> also accepted)");
+        out.println("    -p, --pluginsDir           Plugins directory");
+        out.println("    --fileList                 File list (one path per line, relative to -i or absolute)");
         out.println("    --handler                  Handler type: t=text, h=html, x=xml, m=markdown, b=body, i=ignore");
+        out.println("    --concatenate              Concatenate content from all embedded documents");
+        out.println("    --content-only             Output only extracted content (no JSON wrapper); implies --concatenate");
+        out.println("    --on-exists                Behavior when an output file exists: exception (default), replace, skip");
         out.println("    -Z                         Recursively unpack all the attachments, too");
         out.println("    --unpack-format=<format>   Output format: REGULAR (default) or FRICTIONLESS");
         out.println("    --unpack-mode=<mode>       Output mode: ZIPPED (default) or DIRECTORY");
@@ -887,23 +894,6 @@ public class TikaCLI {
         return false;
     }
 
-    private boolean testForBatch(String[] args) {
-        if (args.length == 2 && !args[0].startsWith("-") && !args[1].startsWith("-")) {
-            Path inputCand = Paths.get(args[0]);
-            Path outputCand = Paths.get(args[1]);
-            if (Files.isDirectory(inputCand) && !Files.isRegularFile(outputCand)) {
-                return true;
-            }
-        }
-
-        for (String s : args) {
-            if (s.equals("-inputDir") || s.equals("--inputDir") || s.equals("-i")) {
-                return true;
-            }
-        }
-        return false;
-    }
-
     private void configure() throws TikaException, IOException, SAXException {
         if (configFilePath != null) {
             tikaLoader = TikaLoader.load(Paths.get(configFilePath));
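
Reader's note (not part of the commit): the "How Pipes mode is activated" rules that this commit adds to `using-tika/cli/index.adoc` can be read as a single predicate over the argument list. The sketch below illustrates only those documented rules. The class and method names (`PipesModeSketch`, `looksLikePipesMode`) are hypothetical, and so is the handling of `--option=value` forms and the requirement that both positional arguments lack a leading `-`. It should not be taken as the detection logic in `TikaCLI.java`, which this diff shows only in fragments.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import java.util.Set;

// Hypothetical illustration of the documented activation rules; not Tika source code.
class PipesModeSketch {

    // Options the updated docs list as switching tika-app into Pipes mode.
    private static final Set<String> PIPES_OPTIONS = Set.of(
            "-i", "-o", "--input", "--output", "--fileList",
            "-z", "-Z", "--extract", "--extract-dir", "-a", "--async");

    static boolean looksLikePipesMode(String[] args) {
        // Rule: any of the listed options is present.
        // Assumption: "--name=value" is matched on the part before '='.
        for (String arg : args) {
            int eq = arg.indexOf('=');
            String name = eq < 0 ? arg : arg.substring(0, eq);
            if (PIPES_OPTIONS.contains(name)) {
                return true;
            }
        }
        // Rule: a single .json argument is treated as a Tika Pipes config file.
        if (args.length == 1 && args[0].endsWith(".json")) {
            return true;
        }
        // Rule: two positional arguments, the first an existing directory.
        if (args.length == 2 && !args[0].startsWith("-") && !args[1].startsWith("-")) {
            try {
                return Files.isDirectory(Paths.get(args[0]));
            } catch (InvalidPathException e) {
                return false; // not a usable path, so not Pipes mode
            }
        }
        // Everything else stays in standard single-document mode.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikePipesMode(args) ? "pipes" : "standard");
    }
}
```

Under these assumptions, `-i /path/to/input -o /path/to/output` and a lone `.json` argument both select Pipes mode, while `--config=tika-config.json document.pdf` stays in standard mode because `--config` is not in the documented list of triggering options.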
