Re: [tesseract-ocr] Re: Improve ocr on screendump

Ger Hobbelt Fri, 25 Dec 2020 04:42:53 -0800

Also keep in mind that a lot of folks using tesseract have problems with
output quality due to feeding it inverted color images, i.e. White text on
black background (after converting their inputs to b&w images).

Since you say your inputs are game screens, chances are high you're in that
same boat.

For best results, make sure your text is BLACK (darkest), your background
WHITE (lightest). This may be done by inverting your image colours before
thresholding (converting to pure b&w).
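
A minimal sketch of that "analyze & invert?" decision, in plain Python on a
grayscale image stored as rows of 0..255 integers. The function name
`ensure_dark_text` and the median test are my own assumptions, not something
tesseract provides: the heuristic assumes the background covers more pixels
than the text, which will not hold for every game screen.

```python
from statistics import median


def ensure_dark_text(gray):
    """Return a copy of `gray` with dark text on a light background.

    Assumption: background pixels outnumber text pixels, so a median
    below mid-gray (128) means the background is dark and we must invert.
    """
    pixels = [p for row in gray for p in row]
    if median(pixels) < 128:
        # Dark background: flip every pixel so the text becomes the darkest part.
        return [[255 - p for p in row] for row in gray]
    return [list(row) for row in gray]


# White-on-black sample: mostly dark background, two bright "text" pixels.
sample = [
    [10, 10, 10, 10],
    [10, 240, 240, 10],
    [10, 10, 10, 10],
]
fixed = ensure_dark_text(sample)
```

With a real image you would do the same on the pixel array your imaging
library hands you; the point is only that the check runs before thresholding.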

The generic preprocess for tesseract would thus be:

1: analyze & invert? =>
Making sure the text(s) to ocr are the darkest pixels in your image.
2: analyze and improve color contrast locally? =>
Locate and remove shadow, vignettes, etc. in any areas of the page (image).
Goal: improve outcome of next step by feeding it input that produces the
least amount of pixel noise.
With game inputs, depending on the styling of the game, one simple filter
might be to pick one of the color channels (r, g, b) or rotated color
channels. In other words: does my image contrast / legibility improve when
I look at it through a color filter, e.g. a purple filter or yellow or
green? When the text pops and the background "disappears" you've got an
easy winner. Sometimes that's all you need.
3: thresholding
Turn your image into b&w, binary color. That is: all is black or white, no
more grays.
There's plenty to find on that on the net, most of it research (and open
source code) aimed at improving scans of old books, manuscripts, but also
stuff like license plates. Test and use what works best for you: which
thresholding algorithm you pick does affect the result.
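
As an illustration of steps 2 and 3, here is a dependency-free sketch that
picks the single r/g/b channel where text and background separate best and
then binarizes it. Otsu's method is just one of the many thresholding
algorithms out there, and scoring channels by Otsu's between-class variance
is my own assumption for "the text pops"; the helper names are hypothetical
and rotated color channels are not covered.

```python
def otsu(values):
    """Return (threshold, score) using Otsu's method on 0..255 integers.

    Pixels <= threshold form the dark class. The score is the between-class
    variance: higher means the two classes are further apart.
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_score = 0, 0.0
    w_dark = sum_dark = 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one class is empty, nothing to separate
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        score = w_dark * w_light * (mean_dark - mean_light) ** 2 / total ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


def best_channel(rgb):
    """Return (channel_index, gray) for the channel with the highest Otsu score."""
    best = None
    for c in range(3):
        gray = [[px[c] for px in row] for row in rgb]
        _, score = otsu([p for row in gray for p in row])
        if best is None or score > best[0]:
            best = (score, c, gray)
    return best[1], best[2]


def binarize(gray):
    """Turn a grayscale image into pure black (0) and white (255)."""
    t, _ = otsu([p for row in gray for p in row])
    return [[0 if p <= t else 255 for p in row] for row in gray]


# Toy image: text differs from the background mostly in the red channel.
bg, text = (220, 210, 200), (40, 190, 180)
image = [[bg, text, bg], [bg, text, bg]]
channel, gray = best_channel(image)
bw = binarize(gray)
```

A global threshold like this only works once shadows and gradients are gone;
for unevenly lit screens a local (adaptive) method is probably a better fit.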

The entire preprocessing endeavor is for one reason only: feeding the ocr
engine images that look closest to the training set: black text on white
background.
If you end up with white text on black background, results will be rotten,
random quality, until you manage to flip it around to black text on white
bg. Anything goes to make it so. If you come up with a preprocess that's
mixing or re-ordering stages 1&2, or 1,2 *and* 3, that's fine: those stages
are only there to organize the human thought model: when you come up with a
process that consistently delivers clean(est) BLACK text on WHITE bg as its
end result, you're golden.

Sorry for repeating the message, but I've found that the "feed tesseract
black-on-white, not white-on-black" mantra is the most important,
particularly for "unconventional inputs". (Not providing any white margin
comes second, i.e. cropping images so severely that the text touches the
image edges: always leave (or crop and then *add*) a white border.)
When visually evaluating your (trial) preprocess, evaluate based on this
question: could this output I got have been printed in a regular book and
is it easily legible to me?
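
The "crop and then *add* a white border" part can be sketched like this, again
in plain Python on rows of 0..255 grayscale values. The function name and the
default margin of 10 pixels are assumptions for illustration only; pick a
margin that suits your text size.

```python
def add_white_border(img, margin=10, white=255):
    """Return a copy of `img` padded with `margin` white pixels on every side."""
    width = len(img[0]) + 2 * margin
    # Full-width white rows above the image.
    out = [[white] * width for _ in range(margin)]
    # Original rows with white padding left and right.
    for row in img:
        out.append([white] * margin + list(row) + [white] * margin)
    # Full-width white rows below the image.
    out.extend([white] * width for _ in range(margin))
    return out


# A crop so tight that the black "text" touches the image edges.
tight = [[0, 0], [0, 255]]
padded = add_white_border(tight, margin=2)
```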

HTH

Ger

On Sun, Dec 20, 2020, 18:26 Quan Nguyen <nguyen...@gmail.com> wrote:

> You may need to scale the image to 300 DPI for better results. This is
> especially true for screenshots, where the resolution is typically at 72 or
> 96 DPI.
>
> On Tuesday, November 10, 2020 at 3:40:40 AM UTC-6 player1 wrote:
>
>> Hi Folks
>>
>> I'm new to Tesseract and need some pointers on how to improve the output
>> from a game screen dump.
>>
>> It has some game stats with different types of fonts, at different sizes
>> and one font is skewed to the side.
>>
>> The screendump has background graphics but it's toned down so as not to
>> disturb a human reading the page.
>>
>> The screendump might have different resolutions but the positions of the
>> texts are fixed to particular regions.
>>
>> So far I have tried reading the page (with Tess4J) at 120 DPI and only
>> the simplest text, which looks to be about 20pt in size, is read out
>> correctly; bigger fonts are completely lost.
>>
>> What options do I have to improve the output from Tesseract?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/14e8cf91-b1bf-4301-9652-a03aa661a387n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/14e8cf91-b1bf-4301-9652-a03aa661a387n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

