A couple of general notes, some of which I'm sure you already tried:

- all input images: convert to black text on white background. Think
*greyscale*, rather than pure binarization! Pixel values are fed straight
into the neural net, so it MAY help to have the lighter pixels near the
edge of the glyph/character not be hard pure black: they then "weigh in"
differently compared to having the whole thing binarized (your
"tesseract internal" sample). I now know I *effed* that one up in a pull
request I did some time ago, where Stefan Weil got (slightly) worse
results in his tests versus spot improvements in my own. 😰 (Sketch below.)
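
A minimal sketch of that preparation, assuming Pillow; the file name is one
of your own samples, and the corner-pixel probe is a crude stand-in for a
proper background check:

```python
from PIL import Image, ImageOps

img = Image.open("CJR21.png").convert("L")  # 8-bit greyscale: keeps the soft edge pixels

# If the capture is light text on a dark background, flip it to dark-on-white.
# (Crude heuristic: sample one corner pixel as "background".)
if img.getpixel((0, 0)) < 128:
    img = ImageOps.invert(img)

img.save("CJR21.grey.png")  # greyscale, deliberately NOT thresholded to pure B/W
```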

- preprocessing: font size is a big factor. See also the tesseract docs
(responding from mobile here, sorry, no link). Also search the mailing list
if you can: the original research behind the "30px" measure comes with a
chart. Anyway, bottom line: scale/resize your input images and observe the
changing confidence numbers output by tesseract (tsv and hocr output;
confidence per character is what's relevant here). Sketch of such a sweep below.
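
A hedged sketch of such a sweep, assuming the pytesseract wrapper; its
image_to_data() exposes the same fields as the tsv output, including the
'conf' column (word-level; per-symbol confidences take more digging via
hocr or the API):

```python
import pytesseract
from pytesseract import Output
from PIL import Image

src = Image.open("CJR21.grey.png")
for target_height in (20, 30, 40, 60):  # sweep around the "30px" sweet spot
    width = round(src.width * target_height / src.height)
    img = src.resize((width, target_height), Image.LANCZOS)
    data = pytesseract.image_to_data(img, config="--psm 6", output_type=Output.DICT)
    # conf == -1 marks boxes without recognized text; skip those.
    pairs = [(t, c) for t, c in zip(data["text"], data["conf"]) if float(c) >= 0]
    print(target_height, pairs)
```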

- your input is, by your definition, essentially fully random, with an
alphabet size of 26+26+10+2 = 64 characters. At "word size" 5-12 that
means the acceptable word set ('dictionary') size is sum[5..12]{64^i},
which is roughly 4.8E21. Huge. Even when we reduce the alphabet to
capitals only, as a very rough lower estimate trying to account for human
behaviour ("nobody starts their login name with a . dot", ...), you're still
looking at upwards of 26^5 ≈ 12 million words. Hence the obvious 😉
conclusion: disable the dictionary for scenarios such as yours (sketch below).
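
A minimal sketch of that, again via pytesseract; load_system_dawg and
load_freq_dawg are stock tesseract config variables:

```python
import pytesseract
from PIL import Image

# Switch the wordlist-based language models ("dawgs") off at init time.
cfg = "--psm 6 -c load_system_dawg=0 -c load_freq_dawg=0"
# (A whitelist via tessedit_char_whitelist may help too, though its support
# under the LSTM engine has historically been spotty.)
print(pytesseract.image_to_string(Image.open("CJR21.grey.png"), config=cfg))
```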


- LSTM looks at the input one VERTICAL SCANLINE at a time, while keeping a
kind of "memory" of what came before. While I'm still somewhat vague on the
precise internal workings, this implies that LSTM also "remembers" the
previous characters in the input.
Sounds like a Markov chain 🤔...
How far back does that memory go?

 #1: I haven't looked at the tesseract code deeply enough (while still
*grokking* what I'm seeing) to know whether it actually does a
bidirectional LSTM scan. Probably it does, as that would be the more
reasonable choice for RTL language support such as Arabic; besides,
published papers mention that bidirectional LSTM usually has slightly
better prediction performance than unidirectional. So we must reckon with
a left-to-right Markov chain plus a right-to-left one influencing this
character's estimate...

 #2: "Markov chain" here means the LSTM engine's predictions are not solely
based on looking at the current (vertical) SCANLINE, not only at the couple
of scanlines that constitute the current character, but previous and
subsequent characters' pixels will influence the current prediction! Think
"expect u after q in English" (e.g. "query"), that sort of thing. IIRC
there was the mention somewhere of '50 scanlines', which, for 30px fonts,
would imply at least 2 'historic' characters. Immediately we complicate
matters as such simple numbers are only a very rough and shoddy indication
of reality: proportional fonts mean '50 lines' span a varying number of
characters ('w' is about as wide as 'iii', f.e.), plus, more importantly,
at least as far as I understand LSTM today, that memory is not hard, as in:
"looking at the previous 50 as well", no, it's more like, what in finance
models is called an EMA (exponential moving average): an LSTM keeps a kind
of running summary of what came before, and thus the "history" has a tail
into infinity (in both directions), where farther away previously seen
characters/scanlines have much less impact than the ones that just came
before, i.e. are very close to the current scanline.
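
A toy illustration of that EMA-style tail (this is not the actual LSTM
math, which uses learned input/forget/output gates; it only shows how
influence fades instead of cutting off at a hard window):

```python
def ema_memory(scanlines, decay=0.9):
    """Running summary: old info never vanishes abruptly, it only fades."""
    state = 0.0
    for x in scanlines:
        state = decay * state + (1.0 - decay) * x
        yield round(state, 4)

# One bright scanline (1.0) among dark ones (0.0):
print(list(ema_memory([0, 0, 1, 0, 0, 0])))
# -> the spike's influence decays geometrically over the following scanlines
```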

All that theory leads to the conclusion that an LSTM neural net shares
characteristics with a Markov chain.

For you, that's relevant because tesseract was trained on a dictionary
(constructed by Ray Smith at Google), and if my numbers are anywhere near
reality, this means you must (a) unlearn some of the teachings the LSTM
learned (encoded) from that original training set, as it also encoded
that "q is always followed by u" stuff thanks to being trained on a very
large dictionary of human-language words (plus some extras like numeric
values etc.), and (b) make a first rough estimate for your own training
set. Given your alphabet, the near-pure-random sequencing of your "words"
and the above very crude Markov-chain impact-length estimate of 2 historic
characters, both ways if the LSTM is bidirectional, we get a 2+1(current
char)+2 = 5 character wide word segment for training a net like that, thus
requiring a training set of alphabetSize ^ relevantSegmentSizeEstimate =
64^5 =~ 1 billion (1E9) words (!) to ensure the LSTM won't be negatively
surprised by unexpected inputs afterwards.
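
Back-of-envelope check of those numbers (plain Python, nothing
tesseract-specific):

```python
A = 64                                  # alphabet size: 26+26+10+2
print(sum(A**i for i in range(5, 13)))  # ~4.8e21 possible 5..12-char "words"
print(26**5)                            # ~11.9 million, capitals-only lower bound
print(A**5)                             # ~1.07e9, the 2+1+2 segment estimate
```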

That's a heck of a lot of training to do (thanks to the "login name can be
anything" rule), unless someone can show me the errors in my
reasoning+calculus. (I'm here to learn 🙏)

So ... declaring that "unfeasible", we're going for less safety, less
accuracy, less work....

Since your font, as you already noted yourself, has shapes that, like
ligatures, blend and overlap when considered from the perspective of the
vertical scanline and its first adjacent neighbour (your "CJ"
examples!), the absolute minimum segment length is 2 instead of 5. Slightly
better is 3, accounting for the alleged bidirectionality. Which means your
minimum training set size is 64^2 = 4096 words (all 2-character
permutations of your alphabet), or 64^3 = 262144 combos of 3 characters.

Any combo not trained is (with regrettably high probability) a combo that
won't be recognized.

I don't know how much it matters if or how you combine those 256K 3-char
combos into your actual training words (I have yet to tackle tesseract
training myself); all I've seen is that leaving certain combos out of the
training set is detrimental to (tesseract) prediction confidence. An
example of which was a gentleman earlier this year who was ocr-ing Dutch
VAT numbers: oops, those start with "NL" and then it often enough is
000yourpersonaltaxidentificationnumber (which is a VERY blatant security
flaw/leak in Dutch VAT, but I digress) and, yup, tesseract very probably
never saw that particular "NL000" sequence during training, so recognition
was shot to hell: by the time it hit the first zero, current character
confidence dropped below 70%, and that's close to some internal threshold
where the machine decides "nah... can't be real. Forget it!" and the
result you get is garbage. *ouch*!

Ergo: make sure at least those 64^2 combos are in your training set.
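
One cheap way to guarantee that coverage: a De Bruijn sequence of order 2
over your 64-symbol alphabet contains every bigram exactly once, and you
can chop it into ground-truth "words" with a one-character overlap so no
bigram is lost at a word boundary. A sketch (the word length of 8 is my
arbitrary pick):

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "._"

def de_bruijn(alphabet, order):
    """Standard FKM algorithm: a cyclic sequence containing every
    `order`-length combination of `alphabet` exactly once."""
    k, n = len(alphabet), order
    a = [0] * k * n
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)

s = de_bruijn(ALPHABET, 2)
s += s[0]  # unwrap the cycle so the last/first bigram is covered as well
# Overlap consecutive words by one character; tail words may come out short.
words = [s[i:i + 8] for i in range(0, len(s) - 1, 7)]
print(len(s), len(words))  # 4097 characters, 586 words
```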

Thus it was also a very smart move to take the OCRD training as a base:
that one shares the certainly-not-a-human-word-in-here characteristic of
your usage scenario; not the near-random input you've got, but closer to it
than the base set everyone uses for ocr-ing human-language texts.




*Re: overfitting*


The way I read it until recently was "the net learns to match the quirks of
your particular training set too well, and now it expects those very same
quirks everywhere", but that doesn't explain increasing error rates during
training.

🤦 What I (think I) missed is the TIMELINE in the training process:
overfitting is what happens when the following chain of events occurs. This
is how *any* training is done:

One sample from the training set is taken and fed to the net; the result is
observed and compared against the desired outcome. (German has *beautiful*
words for this in control engineering: "Istwert", the value that is, and
"Sollwert", the value that MUST be.) The difference between Istwert and
Sollwert is used as a factor and a direction for correcting/adjusting
the net (down that rabbit hole: backpropagation, transfer function
differentiability, gradient descent, ...): the larger the difference, the
stronger the adjustment. As the net ages, adjustment factors are reduced to
help stabilise the thing. Lots more is involved, but these are the crude
training basics; a toy version is sketched below.
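
A toy, single-weight rendition of that loop, just to make the
Istwert/Sollwert mechanics concrete (illustrative names, not tesseract's
actual trainer):

```python
weight, learning_rate = 0.2, 0.1

def train_step(x, sollwert):
    global weight, learning_rate
    istwert = weight * x                 # forward pass: "the value that is"
    error = sollwert - istwert           # versus "the value that MUST be"
    weight += learning_rate * error * x  # adjust proportionally to the error
    learning_rate *= 0.999               # reduce the factor as the net "ages"
    return error

for _ in range(100):
    last_error = train_step(x=1.0, sollwert=0.8)
print(weight, last_error)  # weight crawls toward 0.8, error toward 0
```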
Now take any training sample, run a single cycle like that and observe a
small error: still not perfect! Hence a tiny adjustment follows, aiming to
improve the future outcome for that sample. However, the edge weights being
adjusted in a net are used all the time, for every sample: once such a tiny
adjustment due to backprop for a single sample impacts OTHER samples'
predictions negatively, it can be argued that overfitting is starting to
occur. While we may match sample X slightly better, we happen to
(accidentally, but as a consequence of how a net works, fundamentally)
have decreased the confidence, and thus the quality of prediction, for
(some) other samples in the training set. When the next training cycle for
those now-worsened samples Y and Z cannot compensate any more through their
own net adjustments, the human/monitor outside this inner training loop may
start to notice, and that's when we call the behaviour overfitting. It's a
gradual thing, and the "magic touch" is knowing when to stop training
and/or pull other tricks out of your hat, such as switching training
schedule/mechanics.

(Fun aside: I see Stefan Weil is coding dropout in an experimental branch
this month; I'm very curious what results that will produce. 😁 That is one
of the many ideas out there to counteract overfitting. Philosophically
speaking from my armchair, I'd say the better phrase might be "postponing
the moment you cross that pain threshold and call this fitting an
overfitting".)

If we take this description of the training timeline into account,
overfitting is then a human-chosen spot along the timeline where the BCER
curve starts to bend (the knee in the curve), or, when you consider a
training set to be a subsample of reality, where the BCER (or whatever
other metric you use as your KPI) starts to level off: if further training
does not improve results for known knowns, you're probably worsening them
for known and unknown unknowns.
(Was that Rumsfeld I'm paraphrasing? 🤔 Anyhoo ...)
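
A crude sketch of that knee-watching, expressed as plain early stopping
(the BCER numbers are made up; in practice you'd feed it the validation
values from your training log and keep the best checkpoint):

```python
def early_stop(bcer_stream, patience=5):
    """Stop once the metric hasn't improved for `patience` evaluations."""
    best, since_best = float("inf"), 0
    for step, bcer in enumerate(bcer_stream):
        if bcer < best:
            best, since_best = bcer, 0  # new best: this is the checkpoint to keep
        else:
            since_best += 1
        if since_best >= patience:      # levelled off / turned upward: the knee
            return step, best
    return step, best

print(early_stop([9.1, 7.4, 6.2, 5.9, 5.8, 5.9, 6.1, 6.4, 7.0, 8.2]))
# -> (9, 5.8): stops a few evaluations past the minimum, instead of riding
#    the BCER all the way back up
```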


Hope that helps you get a slightly better feel for what overfitting might
constitute.




And to anyone else who sees minor or major errors in my blathering: please
do correct me. Thank you! (It moves nobody forward when I fill the ML and
the interwebz with more faulty intel about tesseract et al. than is already
extant.)







On Tue, 18 Jun 2024, 03:13 John Roxton, <prisonersdilemma01010...@gmail.com>
wrote:

> Update:
> After searching all the threads/discussions and reading posts, I decided
> to try out the example 'ocrd-testset' that comes with `tesstrain`.
> Following a recommendation to another user by @zednop, I ran the command
> `make training MODEL_NAME=ocrd START_MODEL=deu_latf
> TESSDATA=~/tessdata_best MAX_ITERATIONS=10000` and was able to see
> significant improvement, which I was able to verify compared to the default
> model.
>
> Inspired, I tried training my own model (again) using the "Droid Sans"
> font with random ground-truth text generated from a limited character set
> ("A-Za-z0-9._"), of variable lengths 5-12 characters,
> with a starting model of the tesseract_best eng.traineddata.  Initially,
> for the first ~35,000 iterations, training was showing signs of improvement
> with a BCER decreasing to about 92%.  However, then I noticed the BCER
> began to rise so I ended the training. Soon after, I continued hoping it
> wasn't abnormal, but the BCER continued to rise and rise all the way back
> to a BCER of 99.99%, at which point I ended it and haven't restarted it
> since.
>
> The AIs tell me it's likely due to "over-fitting".  This is something I
> don't quite understand, yet.  I am wondering if the arbitrary nature of the
> text in the test set might be "short-circuiting" the prediction, and if
> maybe I should disable the dictionary.
>
> Any suggestions?
>
> On Monday, June 17, 2024 at 12:39:23 PM UTC-4 John Roxton wrote:
>
>> I should clarify my issues with training my own model:
>> I can generate all the needed data, but I simply cannot find a consistent
>> source that can guide me through the LSTM training process.  So, in case
>> anyone is wondering, I have not yet actually successfully trained and tried
>> my own model.  I have produced some .traineddata files that are larger than
>> the default eng.traineddata file, but fail to solve even the few images
>> above.  Furthermore, I cannot seem to replicate the training process!
>>
>> I will also mention that my solutions for post-processing with some sort
>> of fuzzy-matching process can be useful with longer strings, but fail
>> miserably with the shortest of strings, where the impact of a single
>> character being misinterpreted is more significant.
>>
>> On Monday, June 17, 2024 at 12:16:51 PM UTC-4 John Roxton wrote:
>>
>>> I'm using Tesseract 5.3.3
>>>
>>> My use-case is to perform OCR on username strings captured from various
>>> ROIs of screenshots.  These strings are 5-12 characters in length and make
>>> use of a set of allowable characters consisting of:  A-Za-z0-9._
>>>
>>> In general, it seems that Tesseract already does a pretty good job on my
>>> images, but due to the particular font that seems to be used (I believe it
>>> is "Droid Sans"), it often struggles with particular characters or
>>> character combinations.
>>>
>>> The most common mistake it makes is with O (capital o) and 0 (zero).
>>> Another particularly tricky character/combination is with either case of
>>> the letter "J" as the "hook" in this letter for this font hangs below the
>>> horizon.  It also may mischaracterize a "I" (capital i) for "l" (lowercase
>>> L).
>>>
>>> I've found that `--psm 6` usually works best for my use-case.
>>>
>>> Reading through the `tesseract-ocr` and `tesstrain` documentation, and
>>> learning from what I can find elsewhere online, it seems:
>>> - it is recommended that pre-processing images is better than training
>>> - fine-tuning should be preferred over training from scratch
>>>
>>> Albeit, I am having great trouble in training my own model.  I have
>>> generated 10,000 `.tif` images of text  of assorted string lengths from
>>> 5-12 characters utilizing my restricted character set in random
>>> combinations using the "Droid Sans" font, along with associated "ground
>>> truth" files with matching file names and a `.gt.txt` extension.
>>> Additionally, I have many "in-the-field" images (such as those seen below)
>>> that I can provide "ground truth" text for.
>>>
>>>
>>> Here are some particularly tricky images I've encountered:
>>>
>>> "CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
>>> [image: CJR21.png]
>>>
>>> "WPJ777" - Interpreted correctly using `--psm 6`
>>> [image: WPJ777.png]
>>>
>>> "SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital
>>> "O"
>>> [image: SeenorC0le.png]
>>>
>>> "Iamagod" - capital i misinterpreted as a lowercase L[image:
>>> Iamagod.png]
>>>
>>> Example of Tesseract's "internal" pre-processing:
>>> [image: Olympic-seat_4-25-3503-screenshot.processed.png]
>>>
