Okay, thanks. That means I have to figure out whether Tess4J takes care of that.

On Tuesday, April 2, 2019 at 17:28:15 UTC+2, shree wrote:
>
> Tesseract does not take PDFs as direct input. You have to convert the PDF to
> images and provide those to Tesseract.
>
> However, there are many third-party applications that take PDF as input and
> use Tesseract as the backend to do OCR.
>
> On Tue, Apr 2, 2019 at 5:02 PM Kristóf Horváth <vazzz...@gmail.com> wrote:
>
>> I just tried to doOCR on a PDF that has embedded CID fonts, and it gave me
>> the following error:
>>
>>> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - **** Error:
>>> can't process embedded font stream,
>>> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript -
>>> attempting to load the font using its name.
>>> 6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript -
>>> Output may be incorrect.
>>
>> Some of the CID fonts are correctly embedded and have font names that I
>> recognize, but the PDF also has fonts with names such as Fd64459.
>>
>> I figured it has to do with the fonts, although Ghostscript's website
>> says:
>>
>>> NOTE: care must be exercised since poor or incorrect output may result
>>> from inappropriate CIDFont substitution. We therefore *strongly* recommend
>>> embedding CIDFonts in your Postscript and PDF files if at all possible.
>>
>> So if we try to do OCR on this PDF, it won't produce anything because
>> Ghostscript recognizes the false CID fonts and throws an error.
>>
>> So my first question is: Did I make my assessment correctly?
>> My second question is: If a PDF has CID fonts but the situation is not as
>> bad, meaning Ghostscript can work with it but will produce incorrect
>> output, does Tesseract handle this in any way? To put it another way,
>> can I be sure that Tesseract will not give me false output, and that it
>> also throws this error or something similar?
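Since the Ghostscript messages quoted above are logged at INFO level and are not raised as exceptions, one way to notice a suspect rendering is to scan the captured log lines for those messages before trusting the OCR text. The following is only a minimal sketch: `GhostscriptLogCheck` and its methods are hypothetical names, not part of Tess4J or Ghost4J, and it assumes you have already collected the log output as a list of strings (how to capture it depends on your logging setup).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Tess4J or Ghost4J.
final class GhostscriptLogCheck {

    // Substrings copied from the Ghostscript messages quoted in this thread.
    private static final List<String> FONT_WARNINGS = Arrays.asList(
            "can't process embedded font stream",
            "attempting to load the font using its name",
            "Output may be incorrect");

    private GhostscriptLogCheck() {
    }

    // Returns true if a single log line contains one of the font warnings.
    static boolean isFontWarning(String logLine) {
        for (String marker : FONT_WARNINGS) {
            if (logLine.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Returns true if any captured log line contains a font warning, meaning
    // the rendered page images (and the OCR text derived from them) should be
    // treated with suspicion.
    static boolean renderingIsSuspect(List<String> logLines) {
        for (String line : logLines) {
            if (isFontWarning(line)) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could run `GhostscriptLogCheck.renderingIsSuspect(capturedLines)` after the PDF has been converted and flag the document for manual review when it returns true. This only detects that Ghostscript complained; it cannot tell whether the glyphs in the rendered image are actually wrong.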
-- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/40916659-da75-4fc8-b595-220bcba5458e%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.