Update: After searching the threads and discussions, I decided to try the 'ocrd-testset' example that comes with `tesstrain`. Following a recommendation @zednop made to another user, I ran `make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000` and saw a significant improvement, which I verified against the default model.
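A comparison like this can be made repeatable by scoring each model's output against the ground-truth text. Below is a minimal sketch of a character error rate based on plain edit distance. The function names are illustrative, and this is not necessarily how `tesstrain` computes the BCER it reports during training.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(pairs) -> float:
    """Character error rate over (ground_truth, ocr_output) pairs."""
    pairs = list(pairs)
    errors = sum(levenshtein(gt, out) for gt, out in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    return errors / chars if chars else 0.0
```

For example, `cer([("CJR21", "R21"), ("WPJ777", "WPJ777")])` counts 2 errors over 11 ground-truth characters, about 0.18. Running the same image set through both the default and the fine-tuned model and comparing the two numbers gives a like-for-like check.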
Inspired, I tried training my own model (again) using the "Droid Sans" font with random ground-truth text generated from a limited character set ("A-Za-z0-9._"), with variable lengths of 5-12 characters, starting from the tessdata_best eng.traineddata. For the first ~35,000 iterations, training showed signs of improvement, with the BCER decreasing to about 92%. Then I noticed the BCER beginning to rise, so I stopped the training. Soon after, I resumed it, hoping the rise was only temporary, but the BCER kept climbing all the way back to 99.99%, at which point I stopped it and haven't restarted it since.

The AIs tell me it's likely due to "over-fitting", which I don't quite understand yet. I am wondering whether the arbitrary nature of the text in the test set might be "short-circuiting" the prediction, and whether I should disable the dictionary. Any suggestions?

On Monday, June 17, 2024 at 12:39:23 PM UTC-4 John Roxton wrote:

> I should clarify my issues with training my own model:
> I can generate all the needed data, but I simply cannot find a consistent
> source that can guide me through the LSTM training process. So, in case
> anyone is wondering, I have not yet actually successfully trained and tried
> my own model. I have produced some .traineddata files that are larger than
> the default eng.traineddata file, but fail to solve even the few images
> above. Furthermore, I cannot seem to replicate the training process!
>
> I will also mention that my solutions for post-processing with some sort
> of fuzzy-matching process can be useful with longer strings, but fail
> miserably with the shortest of strings, where the impact of a single
> character being misinterpreted is more significant.
>
> On Monday, June 17, 2024 at 12:16:51 PM UTC-4 John Roxton wrote:
>
>> I'm using Tesseract 5.3.3
>>
>> My use-case is to perform OCR on username strings captured from various
>> ROIs of screenshots.
>> These strings are 5-12 characters in length and make
>> use of a set of allowable characters consisting of: A-Za-z0-9._
>>
>> In general, it seems that Tesseract already does a pretty good job on my
>> images, but due to the particular font that seems to be used (I believe it
>> is "Droid Sans"), it often struggles with particular characters or
>> character combinations.
>>
>> The most common mistake it makes is with O (capital o) and 0 (zero).
>> Another particularly tricky character/combination is with either case of
>> the letter "J" as the "hook" in this letter for this font hangs below the
>> horizon. It also may mischaracterize a "I" (capital i) for "l" (lowercase
>> L).
>>
>> I've found that `--psm 6` usually works best for my use-case.
>>
>> Reading through the `tesseract-ocr` and `tesstrain` documentation, and
>> learning from what I can find elsewhere online, it seems:
>> - it is recommended that pre-processing images is better than training
>> - fine-tuning should be preferred over training from scratch
>>
>> Albeit, I am having great trouble in training my own model. I have
>> generated 10,000 `.tif` images of text of assorted string lengths from
>> 5-12 characters utilizing my restricted character set in random
>> combinations using the "Droid Sans" font, along with associated "ground
>> truth" files with matching file names and a `.gt.txt` extension.
>> Additionally, I have many "in-the-field" images (such as those seen below)
>> that I can provide "ground truth" text for.
>>
>> Here are some particularly tricky images I've encountered:
>>
>> "CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
>> [image: CJR21.png]
>>
>> "WPJ777" - Interpreted correctly using `--psm 6`
>> [image: WPJ777.png]
>>
>> "SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital
>> "O"
>> [image: SeenorC0le.png]
>>
>> "Iamagod" - capital i misinterpreted as a lowercase L
>> [image: Iamagod.png]
>>
>> Example of Tesseract's "internal" pre-processing:
>> [image: Olympic-seat_4-25-3503-screenshot.processed.png]
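For reference, here is a minimal sketch of how random ground-truth strings like those described in this thread could be generated. Only the character set (A-Za-z0-9._), the 5-12 length range, and the `.gt.txt` extension come from the thread. The function names, the fixed seed, and the numbered file-name pattern are illustrative assumptions, and rendering the matching `.tif` images is left out.

```python
import random
import string
from pathlib import Path

# Allowed characters from the thread: A-Z, a-z, 0-9, '.', '_'
CHARSET = string.ascii_letters + string.digits + "._"


def random_username(rng: random.Random, min_len: int = 5, max_len: int = 12) -> str:
    """Return one random string of min_len..max_len characters from CHARSET."""
    length = rng.randint(min_len, max_len)  # both bounds inclusive
    return "".join(rng.choice(CHARSET) for _ in range(length))


def make_ground_truth(count: int, seed: int = 0) -> list:
    """Generate `count` random strings; the seed makes runs reproducible."""
    rng = random.Random(seed)
    return [random_username(rng) for _ in range(count)]


def write_gt_files(lines, out_dir) -> None:
    """Write each string to its own <index>.gt.txt file (hypothetical naming)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(lines):
        (out / f"{i:05d}.gt.txt").write_text(text + "\n", encoding="utf-8")
```

Calling `write_gt_files(make_ground_truth(10000), "ground-truth")` would produce one text file per sample. Each image rendered from a string then needs the same base name as its `.gt.txt` file.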