As far as I can tell, the model files in that release include tweaks from 2019, but those are just fixes to the config, not retraining.
The idea that retraining stopped because it was no longer necessary seems like a bit of a stretch to me, given the hundreds of languages involved. For example, the Traditional Chinese training data seems to be missing quite a few of the standard characters, if I'm interpreting https://github.com/tesseract-ocr/langdata_lstm/blob/main/chi_tra/chi_tra.unicharset correctly. (I am not a Chinese speaker, but according to Wikipedia there are 4,808 very common characters, plus 6,329 less-common standard characters and 18,319 rarely used but still standard characters, and that file has only 4,591 lines, including a bunch of non-Chinese characters.) Perhaps languages with simpler character sets and/or better training data have hit this limit, though.

My naive assumption when I originally encountered issues with tesseract was that there would be some central repository of training data which we would collaborate on extending and improving in an open-source way, including examples of bad results on fairly clean inputs. Given that tesseract is focused on OCR of machine-created text in the first place, creating synthetic datasets also seems very viable.

Just to be clear, none of this is intended as a criticism of the contributors to this project; it is just an attempt to understand the situation.

On Fri, Mar 15, 2024 at 2:15 PM W.t <willtay...@gmail.com> wrote:
>
> https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0 has models
> uploaded in 2021. There may be newer ones for 5 but I don't know where they
> are. 2021 is still a pretty long time though, I suppose they achieved as much
> as they could for general application and anything more requires training
>
> On Tuesday, February 20, 2024 at 12:43:36 AM UTC-5 Liam Doherty wrote:
>>
>> Is this an issue of access to compute resources? access to training data?
>> Are the current models considered as good as they can be?
>>
>> Thanks,
>> Liam
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/e0ccfe29-b055-401a-8d1f-8cd684f36113n%40googlegroups.com.
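
A rough way to check the chi_tra coverage point raised in this message, as a minimal Python sketch rather than anything official. It assumes the unicharset file starts with an entry-count line and that each later line begins with the character itself followed by space-separated properties; the function names, the sample text, and the reference string are all made up for illustration. A real check would read the downloaded chi_tra.unicharset and a proper list of standard characters instead.

```python
from typing import Iterable, List, Set, Tuple


def unichars_from_unicharset(text: str) -> Set[str]:
    """Collect the leading token of each entry line.

    Assumption: the first line is the entry count, and every later line
    starts with the unichar followed by space-separated properties.
    """
    lines = text.splitlines()
    # Skip the count header if it is present.
    entries = lines[1:] if lines and lines[0].strip().isdigit() else lines
    return {line.split()[0] for line in entries if line.strip()}


def is_cjk_ideograph(token: str) -> bool:
    """True for one code point in the basic CJK block or Extension A."""
    if len(token) != 1:
        return False
    cp = ord(token)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def coverage(unicharset_text: str, reference: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (count of CJK ideographs present, reference characters missing)."""
    present = unichars_from_unicharset(unicharset_text)
    ideographs = sum(1 for token in present if is_cjk_ideograph(token))
    missing = sorted(set(reference) - present)
    return ideographs, missing


# Made-up four-entry sample in the assumed layout (not real chi_tra data).
sample = "4\nNULL 0 Common 0\n中 1 Han 1\n文 1 Han 2\nA 5 Latin 3\n"
ideograph_count, missing_chars = coverage(sample, "中文字")
print(ideograph_count, missing_chars)
```

Counting only ideographs, rather than raw lines, avoids crediting the non-Chinese entries mentioned above, and the missing list shows exactly which reference characters the model was never given.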