[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695001#comment-16695001 ]
Tim Allison commented on TIKA-2749:
-----------------------------------

bq. Note: I have no need for OCR recently, so this is just talk

Thank you for this... :D

bq. Perhaps we can assume that when Tesseract has been installed and configured then images in PDF's should be automatically extracted and OCR'd.

Right. Agreed, I think. However, part of the challenge is automatically determining whether to go with option 1) extract every image and run it through OCR or 2) render the entire page and then run OCR on that. Some pdfs can contain thousands of images per page, and option 2 is better for that. So, we will need some heuristics, and I think adding a heuristic on whether text extraction "got enough" might be useful as well.

Substitute deeplearning above for "heuristic", and throw in a ground truth set with an evaluation of performance tradeoffs, and we'll be all set.

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
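
For illustration only, the per-page choice discussed in the comment (option 1, OCR each inline image, versus option 2, render the page and OCR that, plus a check on whether text extraction "got enough") might be sketched roughly as below. Everything here is hypothetical: the class, the enum, and both thresholds are assumptions made for this sketch and are not part of Tika's API, and real cutoffs would have to come from the kind of ground truth evaluation mentioned above.

```java
// Hypothetical sketch; none of these names exist in Tika.
final class OcrStrategySketch {

    enum Strategy { NO_OCR, OCR_EACH_IMAGE, OCR_RENDERED_PAGE }

    // Assumed threshold: a page with at least this many extracted
    // characters is treated as having "got enough" text.
    static final int MIN_CHARS_PER_PAGE = 50;

    // Assumed threshold: above this many inline images on a page,
    // rendering the page once is taken to be cheaper than OCR'ing
    // every image separately.
    static final int MAX_IMAGES_PER_PAGE = 20;

    static Strategy choose(int extractedChars, int inlineImages) {
        if (inlineImages == 0) {
            // No inline images to extract. If text extraction already
            // "got enough", skip OCR; otherwise fall back to OCR on the
            // rendered page (an assumption of this sketch).
            return extractedChars >= MIN_CHARS_PER_PAGE
                    ? Strategy.NO_OCR
                    : Strategy.OCR_RENDERED_PAGE;
        }
        // Pages with very many images go through option 2 (render the
        // page); the rest go through option 1 (OCR each image).
        return inlineImages > MAX_IMAGES_PER_PAGE
                ? Strategy.OCR_RENDERED_PAGE
                : Strategy.OCR_EACH_IMAGE;
    }
}
```

Under these assumed thresholds, a page with thousands of inline images would be routed to the rendered-page path, while a page with a handful of images would have each image OCR'd individually. Users who configure a strategy explicitly would bypass such a chooser entirely.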