I just tried to run OCR on a PDF that has embedded CID fonts, and it gave me the following error:
> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - **** Error: can't process embedded font stream,
> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - attempting to load the font using its name.
> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - Output may be incorrect.

Some of the CID fonts are correctly embedded and have font names that I recognize, but the file also has fonts with names such as Fd64459. I figured it has to do with the fonts, although Ghostscript's website says:

> NOTE: care must be exercised since poor or incorrect output may result
> from inappropriate CIDFont substitution. We therefore *strongly* recommend
> embedding CIDFonts in your Postscript and PDF files if at all possible.

So if we try to do OCR on this PDF, it won't produce anything, because Ghostscript recognizes the false CID fonts and throws an error.

So my first question is: did I make my assessment correctly?

My second question is: if a PDF has CID fonts but the situation is not as bad, meaning Ghostscript can work with it but will produce incorrect output, does Tesseract handle this in any way? To put it another way, can I be sure that Tesseract will not give me false output, and will also throw this error or something similar?