Okay, thanks. That means I have to figure out whether Tess4J takes care of that.
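
Until that is clear, one cautious option is to scan the Ghostscript log for the font warnings quoted below and treat the OCR result as suspect when any of them appears. This is only a minimal sketch, assuming the log lines are available as strings; the class and method names are hypothetical and not part of Tess4J or Ghost4J.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Tess4J or Ghost4J.
final class GhostscriptLogCheck {

    // Substrings taken from the Ghostscript messages quoted in this thread,
    // lower-cased so the comparison can ignore case.
    private static final List<String> FONT_WARNINGS = List.of(
            "can't process embedded font stream",
            "attempting to load the font using its name",
            "output may be incorrect");

    private GhostscriptLogCheck() {}

    // Returns true if any log line contains one of the font warnings.
    static boolean hasFontWarning(List<String> logLines) {
        for (String line : logLines) {
            String lower = line.toLowerCase(Locale.ROOT);
            for (String warning : FONT_WARNINGS) {
                if (lower.contains(warning)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

If hasFontWarning returns true for a document, the rendered page images may contain substituted glyphs, so the recognized text should be reviewed rather than trusted.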

On Tuesday, April 2, 2019 at 17:28:15 UTC+2, shree wrote:
>
> Tesseract does not take PDFs as direct input. You have to convert the PDF to 
> images and provide those to Tesseract.
>
> However, there are many third-party applications that take a PDF as input and 
> use Tesseract as the backend to do OCR.
>
> On Tue, Apr 2, 2019 at 5:02 PM Kristóf Horváth <vazzz...@gmail.com> wrote:
>
>> I just tried to run doOCR on a PDF that has embedded CID fonts, and it gave 
>> me the following error:
>>
>>> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript  -    **** Error: can't process embedded font stream,
>>> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript  -        attempting to load the font using its name.
>>> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript  -                Output may be incorrect.
>>>
>> Some of the CID fonts are correctly embedded and have font names that I 
>> recognize, but the PDF also has fonts with names such as Fd64459. 
>>
>> I figured it has to do with the fonts, although Ghostscript's website 
>> says:
>>
>>> NOTE: care must be exercised since poor or incorrect output may result 
>>> from inappropriate CIDFont substitution. We therefore *strongly* recommend 
>>> embedding CIDFonts in your Postscript and PDF files if at all possible.
>>
>>
>> So if we try to do OCR on this PDF, it won't produce anything, because 
>> Ghostscript detects the faulty CID fonts and throws an error.
>>
>> So my first question is: is my assessment correct?
>> My second question is: if a PDF has CID fonts but the situation is not as 
>> bad, meaning Ghostscript can work with it but will produce incorrect 
>> output, does Tesseract handle this in any way? To put it another way, 
>> can I be sure that Tesseract will not give me false output, and that it 
>> will also report this error or something similar?
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesser...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/d6741e58-5894-4b52-a39d-e684243b6498%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
> -- 
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

