(tika) 01/07: update parse modes and configuration.adoc

tallison Mon, 11 May 2026 18:00:37 -0700

This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch docs/pipes-updates
in repository https://gitbox.apache.org/repos/asf/tika.git


commit 5b30fcaad8f0f41362696d4299658452217f8e86
Author: tallison <[email protected]>
AuthorDate: Mon May 11 09:24:40 2026 -0400

    update parse modes and configuration.adoc
---
 docs/modules/ROOT/pages/pipes/configuration.adoc |   2 +-
 docs/modules/ROOT/pages/pipes/index.adoc         |   2 +-
 docs/modules/ROOT/pages/pipes/parse-modes.adoc   | 143 ++++++++++++++++-------
 3 files changed, 103 insertions(+), 44 deletions(-)

diff --git a/docs/modules/ROOT/pages/pipes/configuration.adoc b/docs/modules/ROOT/pages/pipes/configuration.adoc
index c6614e7811..e9c75ab060 100644
--- a/docs/modules/ROOT/pages/pipes/configuration.adoc
+++ b/docs/modules/ROOT/pages/pipes/configuration.adoc
@@ -98,7 +98,7 @@ See also xref:pipes/timeouts.adoc[Timeouts] for the full timeout model.
 
 |`parseMode`
 |`RMETA`
-|How embedded documents are handled: `RMETA` (recursive metadata list), `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`. See xref:pipes/parse-modes.adoc[Parse Modes].
+|How embedded documents are handled: `RMETA` (recursive metadata list), `CONCATENATE`, `CONTENT_ONLY`, `NO_PARSE`, `UNPACK`. See xref:pipes/parse-modes.adoc[Parse Modes].
 
 |`onParseException`
 |`EMIT`
diff --git a/docs/modules/ROOT/pages/pipes/index.adoc b/docs/modules/ROOT/pages/pipes/index.adoc
index 796f9d7f1f..7bd2078238 100644
--- a/docs/modules/ROOT/pages/pipes/index.adoc
+++ b/docs/modules/ROOT/pages/pipes/index.adoc
@@ -48,7 +48,7 @@ against problematic files.
 * xref:pipes/iterators.adoc[Iterators] -- document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)
 * xref:pipes/reporters.adoc[Reporters] -- track per-document processing status
 * xref:pipes/configuration.adoc[Pipeline Configuration] -- numClients, timeouts, JVM args, parse modes, emit batching
-* xref:pipes/parse-modes.adoc[Parse Modes] -- control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`)
+* xref:pipes/parse-modes.adoc[Parse Modes] -- control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `NO_PARSE`, `UNPACK`)
 * xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] -- extract raw bytes from embedded documents
 * xref:pipes/timeouts.adoc[Timeouts] -- two-tier timeout system for handling long-running and hung parsers
 
diff --git a/docs/modules/ROOT/pages/pipes/parse-modes.adoc b/docs/modules/ROOT/pages/pipes/parse-modes.adoc
index a023d0b406..2a1af6a593 100644
--- a/docs/modules/ROOT/pages/pipes/parse-modes.adoc
+++ b/docs/modules/ROOT/pages/pipes/parse-modes.adoc
@@ -16,6 +16,8 @@
 //
 
 = Parse Modes
+:toc:
+:toclevels: 3
 
 Tika Pipes uses `ParseMode` to control how documents are parsed and how results are emitted.
 The parse mode is set on the `ParseContext` or configured in `PipesConfig`.
@@ -27,28 +29,60 @@ The parse mode is set on the `ParseContext` or configured in `PipesConfig`.
 |Mode |Description
 
 |`RMETA`
-|Default mode. Each embedded document produces a separate `Metadata` object.
-Results are returned as a JSON array of metadata objects.
+|Default mode. Each embedded document produces its own `Metadata` object.
+Results are returned as a JSON array of metadata objects, preserving per-embedded metadata.
 
 |`CONCATENATE`
-|All content from embedded documents is concatenated into a single content field.
-Results are returned as a single `Metadata` object with all metadata preserved.
+|All embedded-document text is concatenated into a single content field on the **container's** `Metadata` object.
+Per-embedded metadata is **not** retained in the result. See <<concatenate-mode>>.
 
 |`CONTENT_ONLY`
-|Parses like `CONCATENATE` but emits only the raw extracted content — no JSON wrapper,
-no metadata fields. Useful when you want just the text, markdown, or HTML output.
+|Same parsing as `CONCATENATE`, but emitters write only the raw content — no JSON wrapper,
+no metadata fields. See <<content-only-mode>>.
 
 |`NO_PARSE`
-|Skip parsing entirely. Useful for pipelines that only need to fetch and emit raw bytes.
+|Skips parsing. Container-level MIME detection and digesting (if configured) still run.
+See <<no-parse-mode>>.
 
 |`UNPACK`
 |Extract raw bytes from embedded documents. See xref:pipes/unpack-config.adoc[Extracting Embedded Bytes].
 |===
 
+== Content Handler Types
+
+The content handler type determines the format of the extracted text. It is set on the
+`ContentHandlerFactory` configured in `parseContext` (or via the CLI `-h` flag), and applies
+to all modes that produce content (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`).
+
+[cols="1,1,2"]
+|===
+|Handler |Extension |Description
+
+|`t` (text)
+|`.txt`
+|Plain text output
+
+|`h` (html)
+|`.html`
+|HTML output
+
+|`x` (xml)
+|`.xml`
+|XHTML output
+
+|`m` (markdown)
+|`.md`
+|Markdown output
+
+|`b` (body)
+|`.txt`
+|Body content handler output (text from the document body only)
+|===
+
+[#concatenate-mode]
 == CONCATENATE Mode
 
-`CONCATENATE` merges all content from embedded documents into a single content field
-while preserving all metadata from parsing:
+`CONCATENATE` merges all extracted text — from the container and all embedded documents — into a single content field on the container's `Metadata` object.
 
 [source,json]
 ----
@@ -59,12 +93,28 @@ while preserving all metadata from parsing:
 }
 ----
 
-The result is a single `Metadata` object containing the concatenated content in
-`X-TIKA:content` along with all other metadata fields (title, author, content type, etc.).
+=== What's in the result
+
+* A **single** `Metadata` object (the container's).
+* `X-TIKA:content` contains the concatenated text of the container and all reachable embedded documents.
+* Container-level metadata fields (title, author, content type, etc.) are present.
+* The handler type used is recorded in `X-TIKA:content_handler_type`.
+
+=== What's NOT in the result
+
+* **Per-embedded-document metadata is discarded.** If an embedded PDF has its own title and author, those values are not in the output. Only the container's metadata is returned. Use `RMETA` if you need per-embedded metadata.
+* Individual embedded-document parse exceptions are not surfaced as separate entries. They are handled by Tika's embedded document extractor and may appear as embedded-exception fields on the container metadata, but there is no per-embedded `Metadata` object to inspect.
+
+=== Container-level exceptions
+
+If the container parse fails (`SAXException`, `EncryptedDocumentException`, or any other `Exception`), the stack trace is caught, logged, and stored on the container metadata as `X-TIKA:container_exception`. The parse continues to a return value rather than throwing — callers must check this field if they need to detect failure.
 
+If the configured write limit is reached during concatenation, `X-TIKA:write_limit_reached` is set to `true`.
+
+[#content-only-mode]
 == CONTENT_ONLY Mode
 
-`CONTENT_ONLY` is designed for use cases where you want just the extracted content
+`CONTENT_ONLY` is designed for cases where you want just the extracted content
 written to storage — no JSON wrapping, no metadata overhead. This is particularly
 useful for:
 
@@ -81,22 +131,20 @@ useful for:
 }
 ----
 
-=== How It Works
+=== How it works
 
-1. Documents are parsed identically to `CONCATENATE` mode — all embedded content is
-   merged into a single content field.
-2. A metadata filter automatically strips all metadata except `X-TIKA:content` and
-   `X-TIKA:CONTAINER_EXCEPTION` (for error tracking).
+1. Documents are parsed identically to `CONCATENATE` mode — all embedded text is merged into the container's content field, and the same caveats around per-embedded metadata apply.
+2. A metadata filter automatically strips all metadata except `X-TIKA:content` and `X-TIKA:container_exception` (for error tracking).
 3. When the emitter is a `StreamEmitter` (such as the filesystem or S3 emitter), the
    raw content string is written directly as bytes — no JSON serialization.
 
-=== Metadata Filtering
+=== Metadata filtering
 
 By default, `CONTENT_ONLY` mode applies an `IncludeFieldMetadataFilter` that retains
-only `X-TIKA:content` and `X-TIKA:CONTAINER_EXCEPTION`. If you set your own
+only `X-TIKA:content` and `X-TIKA:container_exception`. If you set your own
 `MetadataFilter` on the `ParseContext`, your filter takes priority.
 
-=== CLI Usage
+=== CLI usage
 
 The `tika-async-cli` batch processor supports `CONTENT_ONLY` via the `--content-only`
 flag:
@@ -107,33 +155,44 @@ java -jar tika-async-cli.jar -i /input -o /output -h m --content-only
 ----
 
 This produces `.md` files (when using the `m` handler type) containing only the
-extracted markdown content.
+extracted markdown content. See <<_content_handler_types>> for the available handler types.
 
-=== Content Handler Types
+[#no-parse-mode]
+== NO_PARSE Mode
 
-The content format depends on the configured handler type:
+`NO_PARSE` skips parsing entirely. The container's content type is still detected, and any configured digester still runs against the raw bytes. No text is extracted, no embedded documents are recursed into.
 
-[cols="1,1,2"]
-|===
-|Handler |Extension |Description
+[source,json]
+----
+{
+  "parseContext": {
+    "parseMode": "NO_PARSE"
+  }
+}
+----
 
-|`t` (text)
-|`.txt`
-|Plain text output
+=== What still runs
 
-|`h` (html)
-|`.html`
-|HTML output
+* **MIME detection.** The configured `Detector` runs against the input stream and populates `Content-Type` and `X-TIKA:content_type_parser_override` on the container metadata.
+* **Digesting.** If a `DigesterFactory` is configured on the `ParseContext`, it runs against the raw bytes and writes the digest fields (e.g., `X-TIKA:digest:SHA256`) to the container metadata before the parse-mode check.
 
-|`x` (xml)
-|`.xml`
-|XHTML output
+=== What does NOT run
 
-|`m` (markdown)
-|`.md`
-|Markdown output
+* No parser is invoked. `X-TIKA:content` is empty.
+* No embedded documents are extracted.
+* No content handler is constructed (handler-type configuration is ignored for this mode).
 
-|`b` (body)
-|`.txt`
-|Body content handler output
-|===
+=== When to use
+
+* **Fetch-and-emit pipelines** that move bytes from one store to another and need only the content type and a fixed-bytes digest for downstream routing or deduplication.
+* **Hash-only inventories** of large corpora where parsing every document is too expensive but a stable digest per file is required.
+* **MIME triage**: detect content types across a large set so a downstream pipeline can pick the right parser, parse mode, or skip rule.
+
+Because digest and detection run in `_preParse` regardless of parse mode, switching between `NO_PARSE` and the parsing modes leaves digest values stable for the same input — useful for cross-stage joins.
+
+[#unpack-mode]
+== UNPACK Mode
+
+`UNPACK` extracts the raw bytes of embedded documents (rather than their parsed text) and emits them via the configured emitter. See xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] for the full configuration model.
+
+The recursive parsing pass for `UNPACK` uses the same code path as `RMETA`; the difference is at setup and emit time, where mandatory byte extraction is enabled and emitted bytes are routed through the `UnpackHandler`.
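
Reader's note, not part of the patch: the CONCATENATE section in this diff says callers must check `X-TIKA:container_exception` themselves, because a failed container parse still returns a result. A minimal sketch of that check follows. It assumes the emitted result is JSON holding either one metadata object or a one-element array (the exact emitted shape is an assumption), and `check_concatenate_result` is a hypothetical helper name. The three field names are the ones used in the documentation text.

```python
import json

# Field names as they appear in the parse-modes documentation in this patch.
CONTENT = "X-TIKA:content"
CONTAINER_EXCEPTION = "X-TIKA:container_exception"
WRITE_LIMIT_REACHED = "X-TIKA:write_limit_reached"


def check_concatenate_result(raw_json: str) -> dict:
    """Summarize one CONCATENATE-mode result (hypothetical helper)."""
    parsed = json.loads(raw_json)
    # Assumption: a single metadata object, possibly wrapped in a list.
    metadata = parsed[0] if isinstance(parsed, list) else parsed
    write_limit = str(metadata.get(WRITE_LIMIT_REACHED, "false")).lower()
    return {
        # The parse does not throw, so the field's presence signals failure.
        "failed": CONTAINER_EXCEPTION in metadata,
        # Accepts either a JSON boolean or the string "true".
        "truncated": write_limit == "true",
        "content_length": len(metadata.get(CONTENT, "")),
    }


if __name__ == "__main__":
    sample = json.dumps({
        "Content-Type": "application/pdf",
        CONTENT: "hello world",
        WRITE_LIMIT_REACHED: "true",
    })
    print(check_concatenate_result(sample))
```

The same check would apply to `CONTENT_ONLY` results only where the metadata object is still available, since stream emitters in that mode write the raw content without a JSON wrapper.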
