[jira] [Created] (TIKA-4458) PDFParser with Tesseract: Improve documentation about embedded JP2 and JB2 files

Peter Hoogendijk (Jira) Thu, 24 Jul 2025 11:44:05 -0700

Peter Hoogendijk created TIKA-4458:
--------------------------------------

             Summary: PDFParser with Tesseract: Improve documentation about 
embedded JP2 and JB2 files
                 Key: TIKA-4458
                 URL: https://issues.apache.org/jira/browse/TIKA-4458
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 3.2.1
            Reporter: Peter Hoogendijk



When using tika-app 3.2.1 with Tesseract 5.3.0 to parse PDF files with embedded 
JP2 (JPEG 2000) and JB2 (JBIG2) data, the following errors are reported:
{code:java}
ERROR [main] 20:26:27,356 org.apache.pdfbox.contentstream.PDFStreamEngine 
Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not 
installed
{code}
Installing jai 1.1.3 and jai-imageio 1.1 in the OpenJDK 17 lib directory does 
not change the error messages.
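
One possible reason the copied jars have no effect: since Java 9 the JDK no 
longer has an extension directory, so jars placed under the JDK installation 
are not loaded automatically and have to be on the application classpath 
instead. A minimal diagnostic sketch, assuming PDFBox looks up its decoders 
through javax.imageio under the format names "JPEG2000" and "JBIG2" (these 
names and the class name ImageReaderCheck are assumptions for illustration, 
not taken from the Tika documentation), lists which ImageIO readers the JVM 
can actually see:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper: reports which ImageIO readers are visible on the
// current classpath for a given format name.
class ImageReaderCheck {

    // Returns the class names of all registered readers for formatName.
    static List<String> readersFor(String formatName) {
        List<String> names = new ArrayList<>();
        Iterator<ImageReader> it = ImageIO.getImageReadersByFormatName(formatName);
        while (it.hasNext()) {
            names.add(it.next().getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Re-scan the classpath for ImageIO plugins before querying.
        ImageIO.scanForPlugins();
        // Format names are assumptions; adjust if the plugin registers others.
        for (String format : new String[] {"JPEG2000", "JBIG2"}) {
            List<String> readers = readersFor(format);
            System.out.println(format + ": "
                    + (readers.isEmpty() ? "no reader found" : readers));
        }
    }
}
```

If this is run with the same classpath as Tika and reports "no reader found", 
the plugin jars are not visible to the JVM, which would be consistent with 
the error above.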

Please provide instructions (or a link to existing instructions) on how to 
configure Apache Tika to resolve this error. After a lot of searching I only 
found instructions on how to configure PDFBox (in pom.xml), but this does not 
solve the issue for Apache Tika. How do I translate the required PDFBox 
configuration sections to the Apache Tika configuration file? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
