[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259813#comment-17259813 ]

David Pilato commented on TIKA-3258:
------------------------------------

I really like having {{auto}} as the default mode. I was thinking of making this the default for my FSCrawler project.

I'm unsure how it works today, but if that's the case, I'm not sure that automatically detecting tesseract is a good thing. I think that using OCR on images, PDFs, etc. should be a decision activated by the developer, not by the presence of a binary in the path. But that's another story.

> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
>                 Key: TIKA-3258
>                 URL: https://issues.apache.org/jira/browse/TIKA-3258
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> In Tika 1.x we currently have the fiddly mess that users have to configure
> OCR of PDFs...it doesn't just work out of the box. We did this initially
> because of concerns (well, reality) of crazy resource consumption for some
> PDFs that can have thousands of images per page that are stitched together to
> make a reasonable composite.
>
> Since then, we've added option 2, which renders each page and then runs OCR
> on that composite image rather than running OCR on each inline image...so
> we'll only call tesseract once per page. Second, we've added an 'auto' mode
> that runs OCR only on pages that didn't have much text extracted. While
> there is plenty of room for improvement in the 'auto' heuristic, I think we
> should move to running OCR automatically on PDFs as default in 2.0.0.
>
> Under this proposal, users will now have to disable OCR if they have
> tesseract installed but don't want to run it on PDFs.
>
> This will be a breaking change, and we'll make sure to document it early and
> often in the "Breaking Changes" sections of the readme.txt.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)