Re: [tesseract-ocr] Re: Traineddata distorted and provides bad read, last trained sample is as usual

Ger Hobbelt Sat, 05 Apr 2025 08:07:37 -0700

Haven't checked your info further, but note your remark:

*IMPORTANT*: I use images in same color pallete: black background
white(close to gray) font, without any masks applied.

Please, do NOT train with inverted imagery like that (white on black),
particularly when you are working with existing models, as those have been
trained to deal with book sources (black printed text on white paper) and
feeding such a model with inverted images and forcing it to learn those too
can only lead to total model confusion and consequently headaches and
"weird, inexplicable results".

Yes, when you dig deep ("rtfc") you'll find tesseract carries a bit of code
to detect white-on-black inverted image inputs and invert those for you
before feeding them to the core engine, but forget about that bit as it is
only triggered under a set of very particular circumstances and (AFAIR)
never in a training scenario.

TL;DR: any and all training is best done with black/dark text on
white/light background, as the code flows for both training and OCR-ing
(processing) images *implicitly* assume this type of input.

 (When you use tesseract for a longer while, you will discover that feeding
it white-on-black works just well enough to give you the idea that this
might fly, but "weird shit" keeps happening in your decoded outputs and the
hassle never goes away whatever you try, until you adjust your preprocess
to always pump out black-on-white, guaranteed, and that "sometimes it's
plain weird!" stuff ... just goes away. There are technical explanations for
this, surely, but way too many ifs and buts there for easy comprehension
and a simple story.)

If, for instance, you plan to train and use tesseract for screen reader /
subtitle action, where often light text occurs on black backgrounds, the
above statement implies that your customized process MUST *invert* all
source images, both in the training and the using/decoding paths, as the
tesseract core is meant to receive black text on white BG, always, for
optimal results.
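
To make the "always pump out black-on-white" step concrete, here is a
minimal sketch of such a polarity guard. It is an illustration only, not
tesseract code: it assumes you already have the image as a flat buffer of
8-bit grayscale pixels, and the function names and the mean-brightness
threshold of 128 are my own assumptions (the idea being that text covers
few pixels, so the mean follows the background).

```python
def is_light_on_dark(pixels: bytes, threshold: int = 128) -> bool:
    """Guess polarity from mean brightness (hypothetical heuristic).

    Text usually covers a minority of the pixels, so a low mean
    suggests a dark background with light text on top.
    """
    if not pixels:
        raise ValueError("empty image buffer")
    return sum(pixels) / len(pixels) < threshold


def invert(pixels: bytes) -> bytes:
    """Invert every 8-bit grayscale pixel (0 <-> 255)."""
    return bytes(255 - p for p in pixels)


def to_dark_on_light(pixels: bytes) -> bytes:
    """Return the image as dark text on a light background.

    Images that already look dark-on-light are passed through unchanged.
    """
    return invert(pixels) if is_light_on_dark(pixels) else pixels


if __name__ == "__main__":
    # 12 black background pixels plus 4 light-gray "text" pixels.
    sample = bytes([0] * 12 + [220] * 4)
    print(list(to_dark_on_light(sample)))
```

The point is to run the very same guard in both paths: once over every
training image before training, and once over every image before it goes
to tesseract for decoding. In a real pipeline you would likely do the
inversion with whatever imaging library you already use; the mean-based
check can misfire on images that are mostly text, so treat it as a
starting point.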

In your case, may I suggest re-running everything you did, but with inverted
source, i.e. all your training images turned into black text on top of
white background? I expect this will deliver fewer "weird results" versus
what you currently experience.

Take care,

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

On Sat, 5 Apr 2025, 16:08 Mitya, <mityaholi...@gmail.com> wrote:

> *Summary:*
> I decided to train on one source image (without any filters), but I am still
> getting a major issue, presumably with the set of commands used to train the
> model or (highly likely) in the area where we update eng.traineddata or
> interfere with checkpoints!
> Could you please take a look?
>
> Best Regards,
> Mitya
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/260866d4-8131-4b62-86a3-e9bb88d18187n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/260866d4-8131-4b62-86a3-e9bb88d18187n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
