Re: [tesseract-ocr] OCRing screenshots faster

Ger Hobbelt Mon, 21 Oct 2024 06:08:14 -0700

Since you're talking screenshots:

- tesseract is designed and trained to process books and published papers,
i.e. black printed text on a white background. If you have your UI set to
"dark mode", i.e. bright text on a dark background, you can help tesseract a
lot by preprocessing your image, e.g. inverting the colors, so the input
image is much closer to black text on a white background. What tesseract
does under the hood is try both ways (regular + inverted word image
snippet), but only for any word/particle that resulted in a lower than 0.7
confidence estimate on the first try. By making sure your input image is as
clean as possible and has black/dark text on a white/bright background, you
save yourself and tesseract up to half of the OCR attempts.


- cleanliness is godliness in OCR ;-) : remove any noise from your input
image, including window borders and other graphical elements that are not
text. This saves tesseract time in its image-to-line/word segmenter and
will consequently produce fewer and cleaner bboxes (bounding boxes) of
image snippets to feed into the neural net that does the image pixels to
text transformation. Fewer pixels to munge means more speed going through a
'page' (= input image).

Tesseract has an internal image preprocess which detects long lines (window
borders and such) and a few other bits of graphic content, but that is a
very generic machine: you surely can do better in a bespoke solution as
part of your own image preprocessing stage of the entire
screen-to-searchable-text process.

- where text scraping is possible, it will always win: across the board
it's lower CPU cost than running an image-based neural net and has FAR
fewer quality issues due to the inherent statistics of both procedures. OCR
is and always should be: a last resort.

- in the old days, with lower-res displays, yes, the computer text was
'crisp', but in a very specific technical way that's not conducive to good
generic OCR, which is usually printed-book trained and oriented. With
modern displays you get some human-visual improvements, but do realize
that those new 'crisp'-looking characters carry some edge noise, thanks to
modern anti-aliasing (ClearType and other algos used by the various OSes
and display drivers) and ubiquitous subpixel positioning. Hence, an 'A'
here does not have to match an 'A' there, pixel for pixel, in the same
window+screenshot any more.
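
A rough sketch of the first two points above (crop away the non-text
chrome, then make sure the text ends up dark on bright), under stated
assumptions: the pixel data is plain 0-255 grayscale rows, the 128
brightness threshold is a guess, and the function names and the `text_box`
argument are made up for illustration. `preprocess_file` does the same on
an image file and assumes the third-party Pillow library is installed.

```python
def mean_brightness(gray_rows):
    """Average of the 0-255 grayscale values in a list of pixel rows."""
    total = sum(sum(row) for row in gray_rows)
    count = sum(len(row) for row in gray_rows)
    return total / count


def looks_dark_mode(gray_rows, threshold=128):
    """Guess 'bright text on dark background' from the average brightness.

    The threshold is an assumption; tune it on your own screenshots.
    """
    return mean_brightness(gray_rows) < threshold


def invert(gray_rows):
    """Flip every pixel so bright-on-dark becomes dark-on-bright."""
    return [[255 - p for p in row] for row in gray_rows]


def crop(gray_rows, left, top, right, bottom):
    """Keep only the region expected to contain text (right/bottom exclusive)."""
    return [row[left:right] for row in gray_rows[top:bottom]]


def preprocess(gray_rows, text_box=None):
    """Crop to the (hypothetical) text_box first, then invert if it looks dark."""
    if text_box is not None:
        gray_rows = crop(gray_rows, *text_box)
    if looks_dark_mode(gray_rows):
        gray_rows = invert(gray_rows)
    return gray_rows


def preprocess_file(src_path, dst_path, text_box=None):
    """Same steps on an image file; needs Pillow (pip install pillow)."""
    # Imported here so the pure-Python helpers above work without Pillow.
    from PIL import Image, ImageOps, ImageStat

    img = Image.open(src_path).convert("L")  # 8-bit grayscale
    if text_box is not None:
        img = img.crop(text_box)  # (left, upper, right, lower)
    if ImageStat.Stat(img).mean[0] < 128:
        img = ImageOps.invert(img)
    img.save(dst_path)
```

With the input normalized this way, the `tessedit_do_invert=0` setting from
the FAQ might stop garbling the output, since the inverted retry should no
longer be needed. That is untested here, so compare results before relying
on it.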

That being said, it might be useful to check other, more direct, pattern
recognition approaches when your input is rendered text consoles.
Maybe look around at OpenCV, for instance. I don't know: I haven't dealt
with your particular input myself.
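
For illustration only, one shape such a "more direct" approach could take
for fixed-width console text: compare each character cell against stored
glyph templates and accept the closest one if it is close enough. This is a
sketch under assumptions, not something tried on real console screenshots:
it assumes a known cell grid, templates captured from the same font and
size, and an arbitrary tolerance of 40 to absorb anti-aliasing edge noise.
All function and parameter names are made up.

```python
def glyph_distance(patch, template):
    """Mean absolute difference between two equal-sized grayscale grids."""
    total = 0
    count = 0
    for patch_row, template_row in zip(patch, template):
        for p, t in zip(patch_row, template_row):
            total += abs(p - t)
            count += 1
    return total / count


def classify_cell(patch, templates, max_distance=40):
    """Return the closest template's character, or None if none is close.

    max_distance is an assumed tolerance, not a recommended value.
    """
    best_char, best_dist = None, float("inf")
    for char, template in templates.items():
        dist = glyph_distance(patch, template)
        if dist < best_dist:
            best_char, best_dist = char, dist
    return best_char if best_dist <= max_distance else None


def read_console_line(gray_rows, top, cell_w, cell_h, templates):
    """Decode one text row of a fixed-cell console, left to right."""
    chars = []
    width = len(gray_rows[0])
    for left in range(0, width - cell_w + 1, cell_w):
        patch = [row[left:left + cell_w] for row in gray_rows[top:top + cell_h]]
        char = classify_cell(patch, templates)
        chars.append("?" if char is None else char)  # "?" marks unknown cells
    return "".join(chars)
```

OpenCV's `cv2.matchTemplate` covers similar ground when the glyphs do not
sit on a known grid.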

Cheers,

Ger


On Mon, 14 Oct 2024, 06:37 Billy Croan, <bi...@croan.org> wrote:

> I have a 1920x1080 screen and I have a script to screenshot it every so
> often (usually 30 seconds) and I run tesseract on those screenshots to make
> them searchable, so I can go back in time and find something that I thought
> I recall seeing.
>
> This works well, and has given me much appreciated certainty many times.
> It is perhaps a little cpu/power hungry though.  It's the only thing that
> pegs the cpu most times.  So today I optimized it to only run the OCR when
> the battery is full and using AC power.
>
> Then I got to thinking.  Tesseract takes about 4 seconds to process one
> screenshot.  Or about 13% of my whole cpu.  That's only okay for web
> browsers, right? :-p
>
> Is there a way to speed that up?  So I read
> https://tesseract-ocr.github.io/tessdoc/FAQ.html#can-i-increase-speed-of-ocr
> And I tried "tessedit_do_invert=0" and it wrecked the output:
> completely unusable, garbled output.
>
> I've been specifying dpi 96 all this time and maybe dpi could affect
> performance?
>
> I tried "OMP_THREAD_LIMIT=1" as well.  But 1, 2, and 4 performed the
> same.  My cheap laptop has an "i5-1235U" cpu, so 2 performance cores and 8
> efficiency cores.  I have no idea how to tell tesseract to use the
> performance cores only but maybe the e-cores slow it down.
>
> I also wonder if there's some parts of tesseract that I can shut off to
> reduce CPU usage... Knowing that my input is "perfect" text.  i.e. it will
> never be tilted or rotated 90 or 180 degrees.  I only want to recognise
> English.  And it is guaranteed never to have defects common to
> printed/scanned paper images. Tesseract could be 'lazier' maybe and still
> do a good job in this case.
>
> Any suggestions, feedback?  maybe I should be trying to text-scrape via
> X11 or gtk somehow?  But I do often use ipmi kvmoip consoles or remote
> terminals where my local PC wouldn't have the text in a buffer but it
> should still be exceptionally clean text.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/CADUq1f5tFHjEqC_S4fD%2BoeBhwmBV%3DmtqFxe9scPCcRBcoRgctw%40mail.gmail.com
> <https://groups.google.com/d/msgid/tesseract-ocr/CADUq1f5tFHjEqC_S4fD%2BoeBhwmBV%3DmtqFxe9scPCcRBcoRgctw%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>

