[ 
https://issues.apache.org/jira/browse/TIKA-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085853#comment-18085853
 ] 

Adrian Bird commented on TIKA-4749:
-----------------------------------

Yes, I can share the PDF and some other info.

[^TIKA-4749.zip]

The zip file contains the following:
 * PDFTesseractExample.json - the config file (I removed my Tesseract paths)
 * MyTestFile.pdf - the input PDF file, which happens to be some notes on using 
Apache FOP (FYI, I generated this with Apache FOP, and it contains various image 
types generated with Graphviz)
 * MyTestFile.pdf.json - the output

I ran it using this:
%JAVA_HOME%\bin\java -jar %TIKA_JAR% -i Input -o Output --handler x --config=config\PDFTesseractExample.json

 

> Improve inline image handling in PDFs
> -------------------------------------
>
>                 Key: TIKA-4749
>                 URL: https://issues.apache.org/jira/browse/TIKA-4749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA-4749.zip
>
>
> [~birdya22] reported an exception from tesseract reading an extracted inline 
> image from a PDF. We should figure out exactly what's going wrong and fix it.
>  
> [~birdya22]  if you're able to share the triggering pdf with us, that would 
> be helpful...even if offline.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
