[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695001#comment-16695001
 ] 

Tim Allison commented on TIKA-2749:
-----------------------------------

bq. Note: I have no need for OCR recently, so this is just talk
Thank you for this... :D

bq. Perhaps we can assume that when Tesseract has been installed and configured 
then images in PDFs should be automatically extracted and OCR'd.

Right.  Agreed, I think.  However, part of the challenge is automatically 
determining whether to 1) extract every image and run each one through OCR 
or 2) render the entire page and then run OCR on that.  Some PDFs can 
contain thousands of images per page, and option 2 is better for those.  So, we 
will need some heuristics, and I think adding a heuristic on whether text 
extraction "got enough" might be useful as well.

Substitute "deep learning" for "heuristic" above, and throw in a ground truth 
set with an evaluation of performance tradeoffs, and we'll be all set.
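
To make the trade-off concrete, here is a rough Java sketch of what a 
per-page heuristic along these lines could look like.  Everything in it is 
hypothetical: the class name, the enum, and both thresholds are made up for 
illustration and are not part of Tika's API or configuration.  Real cutoffs 
would have to come from the kind of ground truth evaluation mentioned above.

```java
/**
 * Illustrative sketch only: picks an OCR strategy for a single PDF page.
 * None of these names or numbers exist in Tika; they are assumptions.
 */
final class OcrStrategyChooser {

    /** Possible outcomes for one page. */
    enum Strategy { NO_OCR, OCR_INLINE_IMAGES, OCR_RENDERED_PAGE }

    // Hypothetical cutoff: above this many inline images, OCR'ing each
    // image separately is assumed to cost more than rendering the page once.
    static final int MAX_INLINE_IMAGES_PER_PAGE = 10;

    // Hypothetical cutoff for deciding that text extraction "got enough".
    static final int MIN_TEXT_CHARS_PER_PAGE = 100;

    static Strategy choose(int inlineImageCount, int extractedTextChars) {
        if (inlineImageCount < 0 || extractedTextChars < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        boolean textGotEnough = extractedTextChars >= MIN_TEXT_CHARS_PER_PAGE;

        if (inlineImageCount == 0) {
            // Nothing to extract: skip OCR if regular text extraction worked,
            // otherwise fall back to rendering the whole page.
            return textGotEnough ? Strategy.NO_OCR : Strategy.OCR_RENDERED_PAGE;
        }
        if (inlineImageCount > MAX_INLINE_IMAGES_PER_PAGE) {
            // Option 2: too many images to OCR one by one.
            return Strategy.OCR_RENDERED_PAGE;
        }
        // Option 1: a manageable number of images, OCR each of them.
        return Strategy.OCR_INLINE_IMAGES;
    }
}
```

With these made-up thresholds, a page with 3 inline images would go through 
option 1, a page with thousands of images would go through option 2, and a 
text-only page with plenty of extracted text would skip OCR entirely.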

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
