On Thu, 22 Feb 2024, 07:32 Dror Musai, <dror...@gmail.com> wrote:

> Hi
>
> using version 5.3 of tesseract with hebrew lang. still not understand
> why adobe + foxit , can not find word in the pdf after ocr.
>
PDF does not equal "text"! PDF is a *complex* format where, more often than not, human-visible "text" is actually just a bunch of pictures instead of rendered glyphs:

https://en.m.wikipedia.org/wiki/Glyph
https://en.m.wikipedia.org/wiki/PDF

Your line IMPLIES that the PDFs you struggle with are generated by/via tesseract. Lacking further information, that is what I assume for now.

OCR is complex machinery, and the first order of business when diagnosing complex machinery is reducing the *scope* of the error. For that, and hence for anyone to be able to assist you, you need to *check* and *reduce*.

*Check*: no human needs actual text to "read" (meaning: view on screen or on printed paper) PDF content. We look at images, i.e. pictures, and that is what "PDF readers" produce, except for specialized software for blind and otherwise visually impaired people. As you mention "search" as the problem area, which DOES require machine text rather than plain pictures, you must first find out whether the OCR process actually produces "text" and, if so, what that text actually IS. PDF viewers *hide* "text overlays" by default, so you need *specialized* *tools* to uncover the text inside the PDF or, much easier, change the OCR output format.

For that it is *strongly* *advised* (I'd say *mandatory*) to adjust your OCR process so that it produces hOCR format, which is a kind of augmented HTML: you can open such a file in Notepad and actually read the raw content. Some of us are okay with plain TEXT output, because that is the simplest format, but it drops information that is available in hOCR and thus hides several problem types. Hence my advice to find out how you can produce hOCR output *directly* from tesseract.
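A minimal sketch of such a command line, assuming a single page image named page.png and the Hebrew language data (heb) installed. The file names are placeholders of mine, not taken from your setup. It asks tesseract for hOCR and plain text output in one run, and it only executes when both tesseract and the image are actually present:

```shell
#!/bin/sh
# Placeholder names (assumptions): replace with your own page image.
img="page.png"
outbase="page_out"   # tesseract appends the extension (.hocr, .txt) itself
lang="heb"

# Build the exact command line once, so that what you run is what you report.
set -- tesseract "$img" "$outbase" -l "$lang" hocr txt
cmdline="$*"
echo "command line: $cmdline"

if command -v tesseract >/dev/null 2>&1 && [ -f "$img" ]; then
    tesseract -v   # version/build info, needed for any report
    "$@"           # writes page_out.hocr and page_out.txt
else
    echo "tesseract or $img not found; nothing was run"
fi
```

Open page_out.hocr in a text editor afterwards: if the Hebrew words you expect are in there, the OCR step itself worked and the problem sits further down the pipeline, in the PDF stage.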
https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html

*Reduce*: To enable anyone to assist, you must reduce (boil down) the issue to tesseract itself, in a structure and mini process that makes it potentially reproducible. Along the way you may find that your issue is not tesseract related but located elsewhere in your process/pipeline. Here we'll assume your issue is with tesseract or its immediate surroundings.

*Required action*

Here's what you need to do. *Everyone* has to, because there's a plethora of processes around, before, after and on top of tesseract out there, and those only make things easy as long as things go exactly as *advertised*. *You*, on the other hand, have an issue, so you will have to divide and conquer, i.e. *reduce* your problem zone/area/scope, or you will forever be unable to discover where the problem originates. Reduce your (OCR) process to this and report:

>>>>>>>>>>> (Checklist)

- You use the tesseract CLI (aka "tesseract executable/binary with its command line interface"). This is not a Python script, nor anything otherwise "script"-ish: you execute *tesseract* directly in *bash/cmd* and specify the precise command line (tesseract + argument set). This command line is also needed by anyone else out there to reproduce your issue and help diagnose & fix it.
- You feed tesseract a (page) image, preferably in PNG format. If your original source is JPEG, use the JPEG.
- Your tesseract command line is such that tesseract outputs hOCR format (my preference) or plain text. This already empowers you to diagnose your issue more deeply yourself, as you can easily check whether tesseract produces the desired/expected output or something else. That is also useful to know as you're looking for the root cause here.
https://en.m.wikipedia.org/wiki/Root_cause_analysis

In your particular case, with the minimal information handed over, three general main problem sources are to be expected, and reduction must be applied to discover which of these is *yours*:

1. Errors in the PDF text embedding process (part of the OCR postprocess): failure to correctly and compatibly embed text in the PDF.
2. Failure to produce a *page image* that is *ready for OCR* by tesseract (the OCR preprocess). Lots of issues are due to this.
3. Unexpected/faulty OCR results for the given input image (the OCR process itself: tesseract).

- For reporting, anyone will need your tesseract command line, the input image(s) used and the results you get (error + info console output; output text/file(s)), plus the tesseract version/build info, which can, for example, be obtained by running

  tesseract -v

<<<<<<<<<(Checklist ends)

> with google find work fine.
>

:-S To have a PDF indexed and searchable by Google, you need to publish the PDF online, and the Google index bot must go and find and access it. That is a nontrivial process, so I wonder... Besides, once Google gets to your PDF, it will judiciously run its own OCR process internally before indexing your PDF content, which makes this a non-starter for diagnosing your own process/pipeline. At the very least, this is *way* *off* into any postprocessing pipeline and definitely not instantaneous for anyone; Google indexing is arbitrary in time.

This is also indicative that you might want to seek additional, local, technical support while diagnosing your issue.

> aslo just view the file in adobe + foxit looks fine.
>

As I stated near the beginning: these are PDF viewers, and they are happy to show you page scans or any other picture format/potpourri in your PDF, next to *possible* text glyphs. PDF is a very complex format: you don't need machine text to show text, and "text overlays" are not shown on screen or in print.
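If you do want to look inside the PDF rather than switch the output format, one tool that can dump the embedded text layer is pdftotext from the poppler utilities. That tool is my own suggestion and an assumption about what you have installed, and ocr_output.pdf is a placeholder name. A minimal sketch:

```shell
#!/bin/sh
# Placeholder name (assumption): the PDF that came out of your OCR pipeline.
pdf="ocr_output.pdf"
status="not-checked"

if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf" ]; then
    # Dump the embedded machine text to stdout ("-") and strip all whitespace.
    text=$(pdftotext -enc UTF-8 "$pdf" - | tr -d '[:space:]')
    if [ -n "$text" ]; then
        status="has-text"   # there is a text layer; next, check what it says
    else
        status="no-text"    # pictures only: nothing for a viewer to search
    fi
fi
echo "$pdf: $status"
```

A result of "no-text" points at the PDF embedding step or earlier. With "has-text", write the dump to a file and compare it with the word you searched for.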
Meanwhile, SEARCHING in a PDF requires TEXT (machine text) plus PDF search permissions (PDFs can be "secured" against search, copy-paste, etc., which complicates those PDF text search issues even further).

Hence the advice to REDUCE your problem surface area; currently, also due to the minimal information provided, it is... without bounds.

> the revered issue is on searching something
>

I'm sure you meant something other than "*revered*" here. ;-) Perhaps a Google Translate to English automaton mistake?

Cheers,

Ger

>
> On Sunday, 7 July 2013 at 12:45:18 UTC+3, Daniel wrote:
>
>> Hi everyone,
>>
>> I worked on a project that I need to do training for rtl languages.
>> (hebrew and arabic)
>> After I do the training process everything works great, except that the
>> text printed as ltr text.
>> Is there any flag to set during the training process that tell tesseract
>> to treat the trained file as rtl language file so he can print the text in
>> the right order?
>>
>

PS: you quoted this message from a long time ago but, given your own message, this is only *potentially* an issue (much) further down the road (after much *reducing!*) and, while related to search issues, not the top contender. *You must first discover what is actually produced and analyse that.* Only once that is cleared might you possibly run into RTL vs LTR, etc.

>> Thanks for helping!
>> Daniel
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/010f76b3-da27-445a-9d22-652b6f14a9e0n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/010f76b3-da27-445a-9d22-652b6f14a9e0n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .