I recently retrained the chi_tra model with a new font. The existing model would confuse certain characters. In addition, the source images (I'm decoding TV subtitles) had a weirdly shaped question mark. In the sample below, the last two characters are output as the number "7".
[image: chi_tra_7_0_QM.png]

I managed to find and buy a font that was very close to the subtitle font, but its question mark didn't match. So I rendered the text to images without question marks, duplicated the data set, and appended the question mark *image* to each line in the copy. I then merged the two data sets as input to training.

Training on my poky old 2.5 GHz i5 CPU takes about 6 to 18 hours of unattended operation. Getting the ground truth sorted out took about 1-2 days of my direct work.

But despite all this training, it still fails with no output at all <https://groups.google.com/g/tesseract-ocr/c/hwX_YFRUXf4> if the input has an ellipsis or three dots appended:

[image: bad_sub_243.png]

This seems to be a problem with the image preprocessing tesseract does when identifying blocks or glyphs or something, rather than a problem with the model. I'm debugging it now, but it is tough going. The code is exactly what you'd expect from a massive C program from 1985 worked on by multiple researcher-types over the past 40 years...

On Tuesday, March 19, 2024 at 1:38:27 PM UTC+8 lfdo...@gmail.com wrote:

> Thanks, that's helpful. Is the collaboration with Google ongoing then?
> Can you give me a sense of what magnitude of computing resources
> training on the full dataset involves? Is it simply the days-to-weeks
> per model described in the documentation? Would it be reasonable to
> continually retrain existing models with additional
> community-contributed data, rather than starting from scratch each
> time?
>
> On Sun, Mar 17, 2024 at 3:51 AM Tom Morris <tfmo...@gmail.com> wrote:
> >
> > On Friday, March 15, 2024 at 11:13:15 PM UTC-4 lfdo...@gmail.com wrote:
> >
> > My naive assumption when I originally encountered issues with
> > tesseract was that there would be some central repository of training
> > data which we would collaborate on extending and improving in an
> > open-source way, including with examples of bad results on fairly
> > clean inputs.
> >
> >
> > Ray Smith has been very generous with his time and Google's resources,
> > but it's a bit of an asymmetric situation and the open source community, by
> > and large, has not organized around wide scale retraining. The work that
> > has been done is typically isolated, "one-of"s with the results not
> > captured and used to improve the state of play. The groups that have put
> > significant resources into training typically have a very focused goal such
> > as early German blackletter, early modern printing, etc.
> >
> >
> > Given that tesseract is focused on OCR of
> > machine-created text in the first place, creating synthetic datasets
> > also seems very viable.
> >
> >
> > I think one issue with creating synthetic datasets is access to
> > commercially licensed fonts. Google has the resources to purchase licenses
> > for hundreds of commercial fonts and use them to render a great variety of
> > line images, but there's no economical way for them to provide those fonts
> > to the open source community for reuse.
> >
> > Training also requires a non-trivial amount of computing resources as
> > well as some specialized knowledge.
> >
> > Tom
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to tesseract-oc...@googlegroups.com.
> > To view this discussion on the web visit
> > https://groups.google.com/d/msgid/tesseract-ocr/44dd22af-42fb-48f4-bf3b-9bbbe2c21a37n%40googlegroups.com
> > .
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/819ef5c5-22a9-4ae5-bc22-789b45bb54f7n%40googlegroups.com.