Re: [tesseract-ocr] Re: why are there no new trained models since 2018?

Liam Doherty Fri, 15 Mar 2024 20:13:11 -0700

As far as I can tell, the only changes in that release are 2019 tweaks
to the model files, which are just config fixes, not retraining.

The idea that retraining stopped because it was no longer necessary
seems a bit of a stretch to me, given the hundreds of languages involved -
for example, the Traditional Chinese training data seems to indicate
it's missing quite a few of the standard characters, if I'm
interpreting 
https://github.com/tesseract-ocr/langdata_lstm/blob/main/chi_tra/chi_tra.unicharset
correctly. (I am not a Chinese speaker, but there are 4808 very common
characters, plus 6329 less-common standard characters, and 18,319
rarely used but still standard characters, according to Wikipedia -
and that file only has 4591 lines, including a bunch of non-Chinese
characters.) That said, languages with simpler character sets and/or
better training data may well have reached the point where further
retraining no longer helps.
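
For anyone who wants to check the count rather than eyeball the line
total, here is a rough Python sketch. It rests on assumptions I have not
verified against the Tesseract source: that the first line of a
unicharset file is the entry count, that each following line starts with
the character itself followed by space-separated properties, and that
the Unicode ranges below are a good enough test for "Chinese character".
The function names are my own, not part of any Tesseract tooling, and
the tier sizes are simply the figures quoted above.

```python
from typing import Tuple

# Tier sizes as quoted in this message (taken from Wikipedia, unverified).
COMMON, LESS_COMMON, RARE = 4808, 6329, 18319
STANDARD_TOTAL = COMMON + LESS_COMMON + RARE  # 29456
FILE_LINES = 4591  # line count of chi_tra.unicharset as quoted above


def is_cjk_ideograph(ch: str) -> bool:
    """Rough test: CJK Unified Ideographs, Extension A/B, compatibility block."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF
        or 0x3400 <= cp <= 0x4DBF
        or 0x20000 <= cp <= 0x2A6DF
        or 0xF900 <= cp <= 0xFAFF
    )


def count_cjk_entries(unicharset_text: str) -> Tuple[int, int]:
    """Return (total entries, entries that are a single CJK ideograph).

    Assumes line 1 is the entry count and every later line begins with
    the character followed by a space and its properties.
    """
    total = 0
    cjk = 0
    for line in unicharset_text.split("\n")[1:]:
        if not line.strip():
            continue  # skip blank or trailing lines
        total += 1
        token = line.split(" ", 1)[0]
        if len(token) == 1 and is_cjk_ideograph(token):
            cjk += 1
    return total, cjk


# Tiny made-up sample in the assumed layout; the property columns are
# placeholders, not real Tesseract values.
SAMPLE = "4\nNULL 0 Common 0\n中 1 Han 1\nA 5 Latin 2\n文 1 Han 3\n"

if __name__ == "__main__":
    print(count_cjk_entries(SAMPLE))
    # Upper bound on coverage if every line in the file were a standard character.
    print(f"at most {FILE_LINES / STANDARD_TOTAL:.1%} of all three tiers")
    print(f"at most {FILE_LINES / COMMON:.1%} of the common tier")
```

To run it on the real file, download chi_tra.unicharset from the URL
above and pass `open(path, encoding="utf-8").read()` to
`count_cjk_entries`. Even before running it, the arithmetic shows that
4591 lines cannot cover all 4808 very common characters.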

My naive assumption when I originally encountered issues with
tesseract was that there would be some central repository of training
data which we would collaborate on extending and improving in an
open-source way, including with examples of bad results on fairly
clean inputs. Given that tesseract is focused on OCR of
machine-created text in the first place, creating synthetic datasets
also seems very viable.

Just to be clear, none of this is intended as a criticism of the
contributors to this project - just an attempt to understand the
situation.

On Fri, Mar 15, 2024 at 2:15 PM W.t <[email protected]> wrote:
>
> https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0 has models 
> uploaded in 2021. There may be newer ones for 5 but I don't know where they 
> are. 2021 is still a pretty long time ago, though. I suppose they achieved as much 
> as they could for general application, and anything more requires training.
>
> On Tuesday, February 20, 2024 at 12:43:36 AM UTC-5 Liam Doherty wrote:
>>
>> Is this an issue of access to compute resources? Access to training data? 
>> Are the current models considered as good as they can be?
>>
>> Thanks,
>> Liam

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CADwNSq6ueoBe6Wc%3DYAVniiGwqVfE2pJAgwkyYrbLtF7-OM%2BhcQ%40mail.gmail.com.
