(tika) branch main updated: TIKA-4750 - improve docs (#2879)

tallison Sat, 06 Jun 2026 09:16:22 -0700

This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git



The following commit(s) were added to refs/heads/main by this push:
     new 1849dc7bbd TIKA-4750 - improve docs (#2879)
1849dc7bbd is described below

commit 1849dc7bbd32cd9722f672be7411151eb4009456
Author: Tim Allison <[email protected]>
AuthorDate: Sat Jun 6 12:16:07 2026 -0400

    TIKA-4750 - improve docs (#2879)
---
 docs/modules/ROOT/nav.adoc                          |  2 +-
 docs/modules/ROOT/pages/configuration/index.adoc    |  2 +-
 .../pages/configuration/parsers/tess4j-parser.adoc  | 21 +++++++++++++++++++++
 .../apache/tika/parser/ocr/tess4j/Tess4JParser.java | 12 ++++++++++++
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 070f535ff8..3c7ae7a011 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -54,7 +54,7 @@
 ** xref:configuration/parsers/tesseract-ocr-parser.adoc[Tesseract OCR]
 ** xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
 ** xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]
-** xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process)]
+** xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process, advanced)]
 * xref:migration-to-4x/index.adoc[Migration to 4.x]
 ** xref:migration-to-4x/migrating-to-4x.adoc[Migration Guide]
 ** xref:migration-to-4x/migrating-tika-server-4x.adoc[Tika Server Migration]
diff --git a/docs/modules/ROOT/pages/configuration/index.adoc b/docs/modules/ROOT/pages/configuration/index.adoc
index c8cb3ab7e7..37a176e0a3 100644
--- a/docs/modules/ROOT/pages/configuration/index.adoc
+++ b/docs/modules/ROOT/pages/configuration/index.adoc
@@ -97,7 +97,7 @@ JSON uses the backslash as an escape character, so path options (e.g. `tesseract
 
 * xref:configuration/parsers/pdf-parser.adoc[PDFParser] — PDF parsing options
 * xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser] — OCR options for image-based text extraction
-* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR Parser] — in-process OCR via tess4j JNI bindings
+* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR Parser] — in-process OCR via tess4j JNA bindings (advanced users only; most users should prefer the TesseractOCRParser above)
 * xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers] — Claude, Gemini, OpenAI, Ollama, vLLM
 * xref:configuration/parsers/external-parser.adoc[External Parser] — wrap external tools (ffmpeg, exiftool, etc.)
 
diff --git a/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc b/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
index 4dccac2c4d..c81a173f98 100644
--- a/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
+++ b/docs/modules/ROOT/pages/configuration/parsers/tess4j-parser.adoc
@@ -17,6 +17,27 @@
 
 = Tess4J OCR Parser
 
+[IMPORTANT]
+====
+*Advanced users only.* `Tess4JParser` loads the Tesseract native library
+directly into your JVM through https://github.com/java-native-access/jna[JNA]
+(Java Native Access). Operating it safely means locating and linking the
+correct platform-specific native libraries, reasoning about the Java/native
+boundary, and accepting that a fault in the native code can crash the entire
+JVM. In short, it assumes you are comfortable working with native-library
+integration via JNA.
+
+If that doesn't describe you, please don't reach for this parser — and that's
+perfectly fine. The standard
+xref:configuration/parsers/tesseract-ocr-parser.adoc[`TesseractOCRParser`]
+performs the same OCR by running the `tesseract` command-line program in a
+separate process. It needs no native linking, is far easier to set up, and a
+crash in Tesseract can never take down your application, so it is the
+recommended choice for almost everyone. Choose `Tess4JParser` only when you
+have a measured need for in-process OCR throughput *and* the expertise to run
+native bindings safely.
+====
+
 The `Tess4JParser` is an OCR parser that calls the Tesseract native library
 in-process via https://github.com/nguyenq/tess4j[Tess4J] and JNA, rather
 than spawning a `tesseract` child process for every image. This eliminates
diff --git a/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java b/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
index d02952e6bb..d240190d26 100644
--- a/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
+++ b/tika-parsers/tika-parsers-ml/tika-parser-tess4j-module/src/main/java/org/apache/tika/parser/ocr/tess4j/Tess4JParser.java
@@ -58,6 +58,18 @@ import org.apache.tika.utils.StringUtils;
 /**
  * OCR parser using <a href="https://github.com/nguyenq/tess4j">Tess4J</a>,
  * which provides a Java JNA wrapper around the native Tesseract library.
+ *
+ * <p><b>Advanced users only.</b> This parser loads the Tesseract native library
+ * directly into the JVM via JNA (Java Native Access). Using it safely requires
+ * locating and linking the correct platform-specific native libraries and
+ * accepting that a fault in the native code can crash the entire JVM. If you are
+ * not comfortable with native-library integration via JNA, please prefer the
+ * standard {@code TesseractOCRParser}, which performs the same OCR by running the
+ * {@code tesseract} command-line program in a separate process: it needs no
+ * native linking and a crash in Tesseract can never take down your application,
+ * so it is the recommended choice for almost everyone. Reach for
+ * {@code Tess4JParser} only when you have a measured need for in-process OCR
+ * throughput <em>and</em> the expertise to operate native bindings safely.
  * <p>
  * Unlike the command-line {@code TesseractOCRParser}, this parser calls Tesseract
  * in-process via JNA, eliminating the per-file process-spawn overhead.
