[ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260846#comment-17260846 ]
Luís Filipe Nassif edited comment on TIKA-3258 at 1/7/21, 9:59 PM:
-------------------------------------------------------------------

Regarding [~tilman]'s concern: PDF-to-image conversion caused a lot of OOMs in our project in the past. We solved that by doing the conversion in a separate process. I think ForkParser or the Tika server's spawnChild should take care of that.

Rendering a page to an image is usually faster than OCR, so I think rendering and OCRing the whole page is not much slower than OCRing each image in the page. OCRing each image separately also has other issues, e.g. image masks/filters, images broken into smaller adjacent images, images drawn with polygons...

The "OCR the page if fewer than X chars/words are found in it" approach was enough for our use case of OCRing scanned PDFs.

Maybe if PDFBox had an option to leave the page text out of the rendered page, that would avoid OCRing searchable text again, saving a lot of time and resources and avoiding duplicated text with the OCR "always" option.

Does that make sense, [~tilman] and [~tallison]?
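To make the "OCR the page if fewer than X chars/words are found" idea concrete, here is a minimal sketch in plain Java. It is not Tika's actual 'auto' heuristic and not the code from our project: the class name, the method names and the threshold of 10 non-whitespace characters are all hypothetical, and the per-page text is assumed to have been extracted already by some other step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OcrPageHeuristic {

    // Hypothetical threshold: pages with fewer non-whitespace characters go to OCR.
    static final int MIN_CHARS_PER_PAGE = 10;

    // Counts the characters of the extracted text that are not whitespace.
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // A page is OCRed when text extraction returned nothing or almost nothing.
    static boolean shouldOcr(String extractedText, int minChars) {
        return extractedText == null || countNonWhitespace(extractedText) < minChars;
    }

    // Returns the zero-based indexes of the pages that should be rendered and OCRed.
    static List<Integer> pagesNeedingOcr(List<String> pageTexts, int minChars) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (shouldOcr(pageTexts.get(i), minChars)) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> pageTexts = Arrays.asList(
                "A searchable page with plenty of extracted text.",
                "  \n ",   // scanned page: extraction yields only whitespace
                "p. 3");   // scanned page carrying only a small text overlay
        System.out.println(pagesNeedingOcr(pageTexts, MIN_CHARS_PER_PAGE)); // prints [1, 2]
    }
}
```

Only the pages returned by pagesNeedingOcr would be rendered and passed to tesseract, so fully searchable pages are skipped. A word-based threshold would work the same way with a different counting function.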
> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
>                 Key: TIKA-3258
>                 URL: https://issues.apache.org/jira/browse/TIKA-3258
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> In Tika 1.x we currently have the fiddly mess that users have to configure
> OCR of PDFs...it doesn't just work out of the box. We did this initially
> because of concerns (well, reality) of crazy resource consumption for some
> PDFs that can have thousands of images per page that are stitched together to
> make a reasonable composite.
> Since then, we've added option 2, which renders each page and then runs OCR
> on that composite image rather than running OCR on each inline image...so
> we'll only call tesseract once per page. Second, we've added an 'auto' mode
> that runs OCR only on pages that didn't have much text extracted. While
> there is plenty of room for improvement in the 'auto' heuristic, I think we
> should move to running OCR automatically on PDFs as default in 2.0.0.
> Under this proposal, users will now have to disable OCR if they have
> tesseract installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and
> often in the "Breaking Changes" sections of the readme.txt.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)