On Tuesday, January 30, 2024 at 11:13:06 AM UTC-5 Ilyas wrote:

The output I'm wondering about is:

At iteration 1/600/600, Mean rms=-2147483.6%, delta=0.033%, char train=275.696%, word train=100%, skip ratio=0%, New worst char error = 275.696 wrote checkpoint.

I expected training to proceed normally, with the Mean RMS error showing 
sensible values as it does on smaller datasets. With around 100k lstmf 
files this behaviour does not occur, but with 400k it does.

Am I looking in the wrong direction, or missing something?


As Ger pointed out, the underflow is likely the symptom of a bug: 
-2147483.6 is suspiciously close to INT_MIN (-2147483648) divided by 
1000, which suggests a signed 32-bit value has wrapped around somewhere. 
But no one is likely to be able to help much without a much smaller 
reproducer.

The first thing I'd try would be to eliminate possible bad data in the 300K 
new files as a source of the error. Can you run 100K chunks of the added 
files separately without any error?
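A minimal sketch of that check, assuming your samples are listed one path per line in a file such as all_lstmf.txt (the file names, output paths, and iteration count below are placeholders, not taken from your setup; the lstmtraining flags themselves are real options):

```shell
#!/bin/sh
# Demo setup: stand-in list of 250000 sample paths.
# Replace with your real list of .lstmf files, one path per line.
seq -f "sample_%06g.lstmf" 250000 > all_lstmf.txt

# Split the full list into 100K-line chunks: chunk_aa, chunk_ab, chunk_ac.
split -l 100000 all_lstmf.txt chunk_

for list in chunk_*; do
    echo "would train on $list ($(wc -l < "$list") files)"
    # Illustrative invocation -- uncomment and fill in your own
    # traineddata and output paths:
    # lstmtraining --traineddata eng.traineddata \
    #              --train_listfile "$list" \
    #              --model_output "out_$list/" \
    #              --max_iterations 400
done
```

If one chunk reproduces the bad Mean rms value on its own, you have localized the suspect files to 100K candidates and can keep splitting from there.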

If that works, I'd try to figure out the upper limit that works - 200K? 
300K? 350K? Perhaps you'll find an upper bound that's high enough for your 
use case and you can avoid the hard work of tracking down the bug.
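That size search can also be scripted by training on growing prefixes of the same list file; again, the file names and flags here are illustrative placeholders:

```shell
#!/bin/sh
# Demo setup: stand-in list file; use your real list of .lstmf paths.
seq -f "sample_%06g.lstmf" 400000 > all_lstmf.txt

# Probe increasing prefix sizes to find where training starts to fail.
for n in 100000 200000 300000 350000 400000; do
    head -n "$n" all_lstmf.txt > "subset_${n}.txt"
    echo "subset_${n}.txt: $(wc -l < "subset_${n}.txt") files"
    # lstmtraining --traineddata eng.traineddata \
    #              --train_listfile "subset_${n}.txt" \
    #              --model_output "out_${n}/" \
    #              --max_iterations 400
done
```

The largest n that trains cleanly is your working upper bound; the first failing n tells you roughly where in the list the problem begins.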

There's unlikely to be any easy way to figure out what's going on.

Tom

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.