[ https://issues.apache.org/jira/browse/TIKA-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085853#comment-18085853 ]
Adrian Bird commented on TIKA-4749:
-----------------------------------
Yes, I can share the PDF and some other info.
[^TIKA-4749.zip]
The zip file contains the following:
* PDFTesseractExample.json - the config file (I removed my Tesseract paths)
* MyTestFile.pdf - the input PDF file, which happens to be some notes on using
Apache FOP (FYI, I generated it with Apache FOP, and it contains various image
types generated with Graphviz)
* MyTestFile.pdf.json - the output
I ran it using this:
%JAVA_HOME%\bin\java -jar %TIKA_JAR% -i Input -o Output --handler x --config=config\PDFTesseractExample.json
> Improve inline image handling in PDFs
> -------------------------------------
>
> Key: TIKA-4749
> URL: https://issues.apache.org/jira/browse/TIKA-4749
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA-4749.zip
>
>
> [~birdya22] reported an exception from tesseract reading an extracted inline
> image from a PDF. We should figure out exactly what's going wrong and fix it.
>
> [~birdya22] if you're able to share the triggering pdf with us, that would
> be helpful...even if offline.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)