This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 1849dc7bbd TIKA-4750 - improve docs (#2879)
1849dc7bbd is described below
commit 1849dc7bbd32cd9722f672be7411151eb4009456
Author: Tim Allison <[email protected]>
AuthorDate: Sat Jun 6 12:16:07 2026 -0400
TIKA-4750 - improve docs (#2879)
---
docs/modules/ROOT/nav.adoc | 2 +-
docs/modules/ROOT/pages/configuration/index.adoc | 2 +-
.../pages/configuration/parsers/tess4j-parser.adoc | 21 +++++++++++++++++++++
.../apache/tika/parser/ocr/tess4j/Tess4JParser.java | 12 ++++++++++++
4 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 070f535ff8..3c7ae7a011 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -54,7 +54,7 @@
** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
** xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
** xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]
-** xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process)]
+** xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process, advanced)]
* xref:migration-to-4x/index.adoc[Migration to 4.x]
** xref:migration-to-4x/migrating-to-4x.adoc[Migration Guide]
** xref:migration-to-4x/migrating-tika-server-4x.adoc[Tika Server Migration]
diff --git a/docs/modules/ROOT/pages/configuration/index.adoc b/docs/modules/ROOT/pages/configuration/index.adoc
index c8cb3ab7e7..37a176e0a3 100644
--- a/docs/modules/ROOT/pages/configuration/index.adoc
+++ b/docs/modules/ROOT/pages/configuration/index.adoc
@@ -97,7 +97,7 @@ JSON uses the backslash as an escape character, so path options (e.g. `tesseract
* xref:configuration/parsers/pdf-parser.adoc[PDFParser] — PDF parsing options
* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser] — OCR options for image-based text extraction
-* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR Parser] — in-process OCR via tess4j JNI bindings
+* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR Parser] — in-process OCR via tess4j JNA bindings (advanced users only; most users should prefer the TesseractOCRParser above)
* xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers] — Claude, Gemini, OpenAI, Ollama, vLLM
* xref:configuration/parsers/external-parser.adoc[External Parser] — wrap external tools (ffmpeg, exiftool, etc.)
diff --git a/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
index 4dccac2c4d..c81a173f98 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
@@ -17,6 +17,27 @@
= Tess4J OCR Parser
+[IMPORTANT]
+====
+*Advanced users only.* `Tess4JParser` loads the Tesseract native library
+directly into your JVM through https://github.com/java-native-access/jna[JNA]
+(Java Native Access). Operating it safely means locating and linking the
+correct platform-specific native libraries, reasoning about the Java/native
+boundary, and accepting that a fault in the native code can crash the entire
+JVM. In short, it assumes you are comfortable working with native-library
+integration via JNA.
+
+If that doesn't describe you, please don't reach for this parser — and that's
+perfectly fine. The standard
+xref:configuration/parsers/tesseract-ocr-parser.adoc[`TesseractOCRParser`]
+performs the same OCR by running the `tesseract` command-line program in a
+separate process. It needs no native linking, is far easier to set up, and a
+crash in Tesseract can never take down your application, so it is the
+recommended choice for almost everyone. Choose `Tess4JParser` only when you
+have a measured need for in-process OCR throughput *and* the expertise to run
+native bindings safely.
+====
+
The `Tess4JParser` is an OCR parser that calls the Tesseract native library
in-process via https://github.com/nguyenq/tess4j[Tess4J] and JNA, rather
than spawning a `tesseract` child process for every image. This eliminates
diff --git a/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java b/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
index d02952e6bb..d240190d26 100644
--- a/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
+++ b/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
@@ -58,6 +58,18 @@ import org.apache.tika.utils.StringUtils;
/**
* OCR parser using <a href="https://github.com/nguyenq/tess4j">Tess4J</a>,
* which provides a Java JNA wrapper around the native Tesseract library.
+ *
+ * <p><b>Advanced users only.</b> This parser loads the Tesseract native library
+ * directly into the JVM via JNA (Java Native Access). Using it safely requires
+ * locating and linking the correct platform-specific native libraries and
+ * accepting that a fault in the native code can crash the entire JVM. If you are
+ * not comfortable with native-library integration via JNA, please prefer the
+ * standard {@code TesseractOCRParser}, which performs the same OCR by running the
+ * {@code tesseract} command-line program in a separate process: it needs no
+ * native linking and a crash in Tesseract can never take down your application,
+ * so it is the recommended choice for almost everyone. Reach for
+ * {@code Tess4JParser} only when you have a measured need for in-process OCR
+ * throughput <em>and</em> the expertise to operate native bindings safely.
* <p>
* Unlike the command-line {@code TesseractOCRParser}, this parser calls Tesseract
* in-process via JNA, eliminating the per-file process-spawn overhead.