[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

Markus Mandalka (JIRA) Wed, 02 Jan 2019 05:19:43 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732035#comment-16732035
 ]


Markus Mandalka commented on TIKA-2749:
---------------------------------------

Some ideas, experience and wishes from my side for the development of Open 
Semantic Search. I'd like it if

 
 * OCR could be deactivated at document level or by an HTTP/REST option (at 
the moment I disable it by setting /bin/false as the path of the tesseract 
binary, following a tip from Tim Allison)
 * in that case Tika would add a state/boolean/info telling whether the 
document is OCRable, i.e. whether it contains images that would be OCR'd but 
aren't because OCR was disabled as in the first point (or I could infer this 
from metadata fields; maybe such info exists even when /bin/false is used, I 
have not yet had time to look deeper)

 

This is because, for Open Semantic ETL, I plan to

- first extract and index documents without OCR, without changing the global 
Tika config

and would like to be able to

- re-extract and re-index documents with OCR later. For performance, this 
second pass could use such an info from the earlier OCR-free extraction to 
filter down to only those documents where there is something to OCR, instead 
of running the full extraction a second time on documents where OCR would 
make no difference.

 

This split makes sense because OCR often needs a lot of processing time for 
often "only" a little additional info. It could therefore run afterwards, and 
only on documents that include images. Users could then find most info much 
earlier: they could work with a relatively good index soon and an OCR'd, 
better index later, instead of waiting days or weeks for the first indexing 
of a large document set.
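
A minimal sketch of how the second pass could pick its documents, assuming the 
first pass stored a per-document flag. The record type and the names 
ocr_candidate and ocr_done are hypothetical, invented for illustration only: 
Tika does not provide such a flag today, which is exactly the wish above.

```python
from dataclasses import dataclass


@dataclass
class FirstPassRecord:
    """What a first, OCR-free extraction pass might store per document."""
    doc_id: str
    text: str
    # Hypothetical flag: True if the document contains images that would
    # have been OCR'd had OCR been enabled. Not an existing Tika field.
    ocr_candidate: bool
    # Hypothetical bookkeeping: set once the OCR pass has run.
    ocr_done: bool = False


def select_for_ocr_pass(records):
    """Return only the documents worth re-extracting with OCR enabled."""
    return [r for r in records if r.ocr_candidate and not r.ocr_done]


# Example index content after the first (fast, OCR-free) pass.
records = [
    FirstPassRecord("report.pdf", "plain text layer", ocr_candidate=False),
    FirstPassRecord("scan.pdf", "", ocr_candidate=True),
    FirstPassRecord("mixed.pdf", "text plus figures", ocr_candidate=True),
]

second_pass = [r.doc_id for r in select_for_ocr_pass(records)]
print(second_pass)
```

With such a flag, documents like report.pdf above would never be sent through 
the slow OCR extraction at all.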

 

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
