Also keep in mind that a lot of folks using tesseract have problems with output quality because they feed it inverted-color images, i.e. white text on a black background (after converting their inputs to b&w images).
Since you say your inputs are game screens, chances are high you're in that same boat. For best results, make sure your text is BLACK (darkest) and your background WHITE (lightest). This may be done by inverting your image colours before thresholding (converting to pure b&w).

The generic preprocess for tesseract would thus be:

1: analyze & invert?
=> Make sure the text(s) to OCR are the darkest pixels in your image.

2: analyze and improve color contrast locally?
=> Locate and remove shadows, vignettes, etc. in any areas of the page (image). Goal: improve the outcome of the next step by feeding it input that produces the least amount of pixel noise.
With game inputs, depending on the styling of the game, one simple filter might be to pick one of the color channels (r, g, b) or rotated color channels. In other words: does my image contrast / legibility improve when I look at it through a color filter, e.g. a purple filter or yellow or green? When the text pops and the background "disappears" you've got an easy winner. Sometimes that's all you need.

3: thresholding
=> Turn your image into b&w, binary color. That is: everything is black or white, no more grays. There's plenty to be found about this on the net, most of it research (and open source code) aimed at improving scans of old books and manuscripts, but also stuff like license plates. Which thresholding algorithm you pick matters: a global method such as Otsu is a common starting point, while local (adaptive) methods tend to cope better with uneven backgrounds. Test and use what works best for you.

The entire preprocessing endeavor is for one reason only: feeding the OCR engine images that look closest to its training set, which is black text on a white background. If you end up with white text on a black background, results will be rotten and of random quality until you manage to flip it around to black text on a white bg. Anything goes to make it so.
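To make the three stages concrete, here is a rough, dependency-free Python sketch. It is only an illustration under some assumptions of mine, not a recipe from the tesseract project: the function names are made up, "best channel" is judged by plain variance as a crude contrast proxy, Otsu's method stands in for "some thresholding algorithm", and the polarity check assumes that text covers fewer pixels than background. It also flips polarity after thresholding rather than before, which gives the same end result. Stage 2's shadow/vignette removal is not covered.

```python
from statistics import pvariance


def split_channels(pixels):
    """pixels: list of rows of (r, g, b) tuples, values 0..255.
    Returns three grayscale images (lists of rows of ints)."""
    return [[[px[c] for px in row] for row in pixels] for c in range(3)]


def contrast(gray):
    """Crude contrast proxy (an assumption): variance of all pixel values."""
    return pvariance([v for row in gray for v in row])


def otsu_threshold(gray):
    """Global threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_black_on_white(gray):
    """Threshold to pure 0/255, then flip if the image came out mostly dark.
    Heuristic assumption: text occupies fewer pixels than background."""
    t = otsu_threshold(gray)
    bw = [[255 if v > t else 0 for v in row] for row in gray]
    dark = sum(v == 0 for row in bw for v in row)
    total = sum(len(row) for row in bw)
    if dark * 2 > total:  # mostly dark => probably light text on dark bg
        bw = [[255 - v for v in row] for row in bw]
    return bw


def preprocess(pixels):
    """Pick the most contrasty channel, then binarize to black-on-white."""
    gray = max(split_channels(pixels), key=contrast)
    return binarize_black_on_white(gray)
```

The result is a list of rows holding only 0 (text) and 255 (background); how you turn that back into an image for tesseract depends on the imaging library you use. For real screenshots you would want the same steps from an image library rather than nested lists, since pure Python gets slow on large images.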
If you come up with a preprocess that mixes or re-orders stages 1&2, or 1, 2 *and* 3, that's fine: those stages are only there to organize the human thought model. When you come up with a process that consistently delivers clean(est) BLACK text on a WHITE bg as its end result, you're golden.

Sorry for repeating the message, but I've found that the "feed tesseract black-on-white, not white-on-black" mantra is the most important one, particularly for "unconventional inputs". (Not providing any white margin comes second, i.e. cropping images so severely that the text touches the image edges: always leave (or crop and then *add*) a white border.)

When visually evaluating your (trial) preprocess, evaluate based on this question: could this output I got have been printed in a regular book, and is it easily legible to me?

HTH

Ger

On Sun, Dec 20, 2020, 18:26 Quan Nguyen <nguyen...@gmail.com> wrote:

> You may need to scale the image to 300 DPI for better results. This is
> especially true for screenshots, where the resolution is typically 72 or
> 96 DPI.
>
> On Tuesday, November 10, 2020 at 3:40:40 AM UTC-6 player1 wrote:
>
>> Hi Folks
>>
>> I'm new to Tesseract and need some pointers on how to improve the output
>> from a game screen dump.
>>
>> It has some game stats with different types of fonts, at different sizes,
>> and one font is skewed to the side.
>>
>> The screen dump has background graphics, but they're toned down so as not
>> to disturb humans reading the page.
>>
>> The screen dump might have different resolutions, but the positions of the
>> texts are fixed to particular regions.
>>
>> So far I have tried reading the page (with tess4J) at 120 DPI and only
>> the simplest text, which looks to be about 20pt in size, is read out
>> correctly; bigger fonts are completely lost.
>>
>> What options do I have to improve the output from Tesseract?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/14e8cf91-b1bf-4301-9652-a03aa661a387n%40googlegroups.com