On Tuesday, January 30, 2024 at 11:13:06 AM UTC-5 Ilyas wrote:

The output I'm wondering about is:

At iteration 1/600/600, Mean rms=-2147483.6%, delta=0.033%, char train=275.696%, word train=100%, skip ratio=0%, New worst char error = 275.696 wrote checkpoint.

I expected training to proceed normally, with the Mean RMS error showing 
sensible values as it does on smaller datasets. With around 100k lstmf 
files this behaviour does not occur, but with 400k it does.

Am I looking in the wrong direction, or missing something?


As Ger pointed out, the underflow is likely the symptom of a bug: 
-2147483.6 is suspiciously close to INT_MIN (-2147483648) divided by 
1000, which suggests a signed 32-bit value has wrapped around somewhere. 
But no one is likely to be able to help much without a much smaller 
reproducer.

The first thing I'd try would be to eliminate possible bad data in the 300K 
new files as a source of the error. Can you run 100K chunks of the added 
files separately without any error?
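A minimal sketch of that check, assuming your samples are listed one path per line in a file such as all_lstmf.txt (the file names, output paths, and iteration count below are placeholders, not taken from your setup; the lstmtraining flags themselves are real options):

```shell
#!/bin/sh
# Demo setup: stand-in list of 250000 sample paths.
# Replace with your real list of .lstmf files, one path per line.
seq -f "sample_%06g.lstmf" 250000 > all_lstmf.txt

# Split the full list into 100K-line chunks: chunk_aa, chunk_ab, chunk_ac.
split -l 100000 all_lstmf.txt chunk_

for list in chunk_*; do
    echo "would train on $list ($(wc -l < "$list") files)"
    # Illustrative invocation -- uncomment and fill in your own
    # traineddata and output paths:
    # lstmtraining --traineddata eng.traineddata \
    #              --train_listfile "$list" \
    #              --model_output "out_$list/" \
    #              --max_iterations 400
done
```

If one chunk reproduces the bad Mean rms value on its own, you have localized the suspect files to 100K candidates and can keep splitting from there.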

If that works, I'd try to figure out the upper limit that works - 200K? 
300K? 350K? Perhaps you'll find an upper bound that's high enough for your 
use case and you can avoid the hard work of tracking down the bug.
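That size search can also be scripted by training on growing prefixes of the same list file; again, the file names and flags here are illustrative placeholders:

```shell
#!/bin/sh
# Demo setup: stand-in list file; use your real list of .lstmf paths.
seq -f "sample_%06g.lstmf" 400000 > all_lstmf.txt

# Probe increasing prefix sizes to find where training starts to fail.
for n in 100000 200000 300000 350000 400000; do
    head -n "$n" all_lstmf.txt > "subset_${n}.txt"
    echo "subset_${n}.txt: $(wc -l < "subset_${n}.txt") files"
    # lstmtraining --traineddata eng.traineddata \
    #              --train_listfile "subset_${n}.txt" \
    #              --model_output "out_${n}/" \
    #              --max_iterations 400
done
```

The largest n that trains cleanly is your working upper bound; the first failing n tells you roughly where in the list the problem begins.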

There's unlikely to be any easy way to figure out what's going on.

Tom

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.