[tesseract-ocr] Doing OCR on pdfs with embedded CID fonts

Kristóf Horváth Tue, 02 Apr 2019 04:32:33 -0700

I just tried to run OCR on a PDF that has embedded CID fonts, and it gave me 
the following error:


> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript  -    **** Error: can't process embedded font stream,
> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript  -         attempting to load the font using its name.
> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript  -                Output may be incorrect.

Some of the CID fonts are correctly embedded and have font names that I 
recognize, but the file also has fonts with names such as Fd64459.

I figured it has to do with the fonts, although Ghostscript's website says:

> NOTE: care must be exercised since poor or incorrect output may result 
> from inappropriate CIDFont substitution. We therefore *strongly* recommend 
> embedding CIDFonts in your Postscript and PDF files if at all possible.


So if we try to do OCR on this PDF, it won't produce anything, because 
Ghostscript detects the faulty CID fonts and throws an error.

So my first question is: is my assessment correct?
My second question is: if a PDF has CID fonts but the situation is not as 
bad, meaning Ghostscript can work with it but will produce incorrect 
output, does Tesseract handle this in any way? To put it another way, 
can I be sure that Tesseract will not give me false output, and will also 
throw this error or something similar?
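
In the meantime, one workaround I am considering is to scan the Ghostscript log for that warning myself and flag the file before trusting any OCR text. This is only a rough sketch: it assumes the ghost4j log lines have already been captured as strings, and the class and method names (FontWarningCheck, hasFontWarning) are made up for illustration. It only matches the exact warning text from the log above, so other font problems would slip through.

```java
import java.util.List;

public class FontWarningCheck {
    // Warning text copied from the Ghostscript log above.
    // Assumption: other font-related messages are not covered by this check.
    static final String FONT_WARNING = "can't process embedded font stream";

    /** Returns true if any captured log line contains the font warning. */
    static boolean hasFontWarning(List<String> logLines) {
        for (String line : logLines) {
            if (line.contains(FONT_WARNING)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample lines modelled on the log output quoted above.
        List<String> log = List.of(
            "INFO org.ghost4j.Ghostscript  -    **** Error: can't process embedded font stream,",
            "INFO org.ghost4j.Ghostscript  -         attempting to load the font using its name.");
        if (hasFontWarning(log)) {
            System.out.println("Font warning found; treat the OCR text for this file as unreliable.");
        }
    }
}
```

That would at least let me mark such files for manual review instead of silently accepting whatever text comes out.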


