On Thu, 22 Feb 2024, 07:32 Dror Musai, <dror...@gmail.com> wrote:

> Hi
>
> using version 5.3  of tesseract with hebrew  lang.  still not understand
> why adobe + foxit   ,  can not find word in   the pdf   after ocr.
>

Pdf does not equal "text"! Pdf is a *complex* format where, more often than
not, human-visible "text" is actually just a bunch of picture(s) instead of
rendered glyphs: https://en.m.wikipedia.org/wiki/Glyph
https://en.m.wikipedia.org/wiki/PDF

Your line IMPLIES that the pdf(s) you struggle with are generated by/via
tesseract. Lacking information, this is what I assume, for now.

OCR is complex machinery, and the first order of business when diagnosing
complex machinery is reducing the *scope* of the error. For that, and hence
for anyone to possibly be able to assist you, you need to *check* and
*reduce*.

*Check*: no human needs actual machine text to "read" (meaning: view on
screen or on printed paper) pdf content. We look at images (pictures) and
that is what "pdf readers" produce, except for specialized software for
blind and otherwise visually impaired people.
As you mention "search" as the problem area, which DOES require machine
text rather than basic pictures, first you must find out whether the OCR
process actually does produce "text", and if so, what that text actually
IS: pdf viewers *hide* "text overlays" by default, so you need *specialized*
*tools* to uncover the text inside the pdf or, much easier, change the OCR
output format.

For that it is *strongly* *advised* (I'd say *mandatory*) to adjust your
OCR process so that it produces hOCR format, which is a kind of augmented
HTML: you can open such a file in notepad and actually read the raw
content. Some of us are okay with the plain TEXT output format, because
that is the simplest format, but it drops info that is available in hOCR
and thus obscures/hides several problem types. Hence my advice to find out
how you can produce hOCR format *directly* from tesseract.
https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html
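
Once you have an hOCR file, a few lines of script can list the words
tesseract actually recognised, in file order. What follows is a minimal
sketch in Python, not a full hOCR reader. It assumes the file was produced
by a command line along the lines of `tesseract page.png out -l heb hocr`
(`page.png` and `out` are placeholder names) and that each word sits in an
element with class `ocrx_word`. The names `HocrWords`, `hocr_words` and
`SAMPLE` are made up for this illustration, and `SAMPLE` is a hand-written
fragment, not real tesseract output.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect the text of every element whose class list has 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []   # finished words, in file order
        self._depth = 0   # > 0 while we are inside an ocrx_word element
        self._buf = []    # text pieces of the word being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word, e.g. <strong> or <em>.
            self._depth += 1
        elif "ocrx_word" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def hocr_words(hocr_text):
    """Return the recognised words of an hOCR document as a list of strings."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written fragment with two Hebrew words (written as escapes so the
# source stays readable in any editor): "shalom" and "olam".
SAMPLE = (
    "<div class='ocr_page'><span class='ocr_line'>"
    "<span class='ocrx_word' title='bbox 10 10 50 30; x_wconf 93'>"
    "\u05e9\u05dc\u05d5\u05dd</span> "
    "<span class='ocrx_word' title='bbox 60 10 90 30; x_wconf 90'>"
    "\u05e2\u05d5\u05dc\u05dd</span>"
    "</span></div>"
)

for word in hocr_words(SAMPLE):
    print(word)
```

To inspect a real file, read it with `open(path, encoding="utf-8").read()`
and pass the result to `hocr_words`. If the words you expect are not in
that list, the pdf cannot contain them either.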

*Reduce*:
To enable anyone to possibly assist, you must reduce (boil down) the issue
to tesseract itself, in a structure and mini process that makes it
potentially reproducible; along the way you may find that the issue you have
is not tesseract related but located elsewhere in your process/pipeline.
Here we'll assume your issue is with tesseract or its immediate
surroundings.



*Required action*

Here's what you need to do. *Everyone* has to, because there's a plethora
of processes around, before, after and on top of tesseract out there and
those only make things easy as long as things go exactly as *advertised*.
*You*, on the other hand, have an issue, so you will have to divide and
conquer, i.e. *reduce* your problem zone/area/scope, or you will forever be
unable to discover where the problem originates. Reduce your (OCR) process
to this and report:

>>>>>>>>>>> (Checklist)

- you use the tesseract CLI (aka "tesseract executable/binary with its
command line interface"); this is not a python script, not anything
"script"-ish otherwise; you execute *tesseract* directly in *bash/cmd* and
specify the precise command line (tesseract + argument set). This command
line is also needed by anyone else out there to possibly reproduce your
issue and help diagnose & fix.

- you feed tesseract a (page) image, preferably PNG format. If your
original source is jpeg, use the jpeg.

- your tesseract commandline is such that tesseract outputs hOCR format (my
preference) or plain text; this already empowers you to dig deeper into
your issue yourself, as you can easily check whether tesseract produces the
desired/expected output or something else. That is also useful to know, as
you're looking for the root cause here.
 https://en.m.wikipedia.org/wiki/Root_cause_analysis

In your particular case, with the minimal information handed over, three
general main problem sources are to be expected and reduction must be
applied to discover which of these is *yours*:

1. errors in pdf text embedding process (part of OCR postprocess); failure
to correctly and compatibly embed text in pdf

2. failure to produce a *page image* that is *ready for OCR* by tesseract.
(OCR preprocess) Lots of issues are due to this.

3. unexpected/faulty OCR results for the given input image (the OCR process
itself: tesseract)

- for reporting, anyone will need your tesseract commandline, the input
image(s) used and the results you get (error+info console output; output
text/file(s)) plus the tesseract version/build info, which can, for
example, be obtained by running

tesseract -v


<<<<<<<<<(Checklist ends)


> with google find work fine.
>

:-S   To have a pdf indexed and searchable by Google, you need to publish
the pdf online and the Google index bot must go and find and access it;
that is a nontrivial process, so I wonder... Besides, once Google gets to
your pdf, it will judiciously run its own OCR process internally before
indexing your pdf content, which makes this a non-starter for diagnostic
purposes regarding your own process/pipeline... At the very least, this is
*way* *off* into any postprocessing pipeline and definitely not
instantaneous for anyone; Google indexing is arbitrary in time.

This is also indicative that you might want to seek additional, local,
technical support while diagnosing your issue.

> aslo just view the file   in adobe + foxit  looks fine.
>

As I stated near the beginning: these are pdf viewers and they are happy to
show you page scans or any other picture format/potpourri in your pdf, next
to *possible* text glyphs. Pdf is a very complex format: you don't need
machine text to show text, and "text overlays" are not shown on screen or in
print.

Meanwhile, SEARCHING in a pdf requires TEXT (machine text) plus pdf search
permissions (pdfs can be "secured" against search, copy-paste, etc. to
complicate those pdf text search issues even further).
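
A very crude first look at a pdf can be had without any pdf library. The
sketch below (Python; the name `pdf_hints` and the two sample byte strings
are made up for this illustration, and the samples are not valid pdfs) only
scans the raw bytes for the `/Font` and `/Encrypt` keys. It is a heuristic,
not a parser: modern pdfs may keep their dictionaries inside compressed
object streams, so a missing `/Font` is not conclusive, and a present
`/Font` says nothing about whether the embedded text is *correct*.

```python
def pdf_hints(data: bytes) -> dict:
    """Return crude byte-level hints about a pdf; this is NOT a pdf parser."""
    return {
        # Some readers tolerate junk before the header, so look a bit
        # further than byte 0.
        "looks_like_pdf": b"%PDF-" in data[:1024],
        # Text drawn as text needs a font resource somewhere in the file.
        "mentions_font": b"/Font" in data,
        # An /Encrypt entry means a security handler applies to the file.
        "mentions_encrypt": b"/Encrypt" in data,
    }


# Made-up fragments for illustration only; these are not valid pdf files.
image_only = b"%PDF-1.5\n1 0 obj << /Type /XObject /Subtype /Image >> endobj\n"
with_text = b"%PDF-1.5\n1 0 obj << /Type /Font /Subtype /Type0 >> endobj\n"

print(pdf_hints(image_only))
print(pdf_hints(with_text))
```

For a real file, pass `open(path, "rb").read()`. Anything this flags still
needs confirming with a proper text extraction tool.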

Hence the advice to REDUCE your problem surface area; currently, also due
to the minimal provided information, it is... without bounds.

> the revered issue is on searching something
>

I'm sure you meant something other than "*revered*" here.  ;-)
 Perhaps a Google Translate to English automaton mistake?

Cheers,

Ger


>
> On Sunday, 7 July 2013 at 12:45:18 UTC+3, Daniel wrote:
>
>> Hi everyone,
>>
>> I worked on a project that I need to do training for rtl languages.
>> (hebrew and arabic)
>> After I do the training process everything works great, except that the
>> text printed as ltr text.
>> Is there any flag to set during the training process that tell tesseract
>> to treat the trained file as rtl language file so he can print the text in
>> the right order?
>>
>

PS: you quoted this message from a long time ago, but, given your own
message, this is only *potentially* an issue (much) further down the road
(after much *reducing!*) and, while related to search issues, not the top
contender.
*You must first discover what is actually produced and analyse that.* Only
once that is cleared up might you possibly run into rtl vs ltr, etc.
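
Should you get that far: here is a minimal sketch, in Python, of one way to
check whether a search term occurs in the text you extracted in logical
order, or only reversed, which would hint at visual (ltr) ordering. The
names `normalize` and `find_order` are made up for this illustration. The
normalization choices (NFKC, dropping bidi control characters and combining
marks such as niqqud) are my assumptions about what may differ between the
two strings, not something tesseract or any pdf viewer is known to do.

```python
import unicodedata

# Invisible bidi control characters that may sit inside extracted text.
_BIDI_CONTROLS = dict.fromkeys(
    map(ord, "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")
)


def normalize(text):
    """NFKC-normalize, then drop bidi controls and combining marks (Mn)."""
    text = unicodedata.normalize("NFKC", text).translate(_BIDI_CONTROLS)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def find_order(needle, extracted):
    """Say whether needle occurs as typed, only reversed, or not at all."""
    n, hay = normalize(needle), normalize(extracted)
    if n in hay:
        return "logical"
    if n[::-1] in hay:
        return "reversed"
    return "missing"


# The Hebrew word "shalom", written as escapes to keep the source readable.
SHALOM = "\u05e9\u05dc\u05d5\u05dd"

print(find_order(SHALOM, "xx " + SHALOM))   # term stored as typed
print(find_order(SHALOM, SHALOM[::-1]))     # term stored back to front
print(find_order(SHALOM, "abc"))            # term absent
```

A "reversed" result for words you can plainly see on the page would be a
reason to look at rtl vs ltr handling; "missing" points back at the OCR
output itself.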

>> Thanks for helping!
>> Daniel
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/010f76b3-da27-445a-9d22-652b6f14a9e0n%40googlegroups.com
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAFP60fpKoGmXAMDku3fXrYtX3EsMxPOcL-hCjLRsXKBfcTH7RA%40mail.gmail.com.
