To clarify my issues with training my own model: I can generate all the needed data, but I cannot find a consistent source to guide me through the LSTM training process. So, in case anyone is wondering, I have not yet successfully trained and tested my own model. I have produced some .traineddata files that are larger than the default eng.traineddata file, but they fail on even the few images from my original message. Furthermore, I cannot seem to replicate the training process!
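Since the data generation side is the part that works, one cheap sanity check before any training run is to confirm that every `.tif` has a matching `.gt.txt` and that the ground truth only uses the allowed characters. The following is a minimal sketch, not part of `tesstrain`: the function name `check_ground_truth` is hypothetical, and it assumes all images and ground-truth files sit together in a single directory.

```python
import re
from pathlib import Path

# Character set and length range as described in the original message:
# A-Za-z0-9._ and 5-12 characters.
ALLOWED = re.compile(r"[A-Za-z0-9._]{5,12}")


def check_ground_truth(data_dir):
    """Return (image name, problem) pairs for the .tif files in data_dir.

    Hypothetical helper: it only checks file pairing and text content,
    it does not inspect the images themselves.
    """
    problems = []
    for image in sorted(Path(data_dir).glob("*.tif")):
        # foo.tif is expected to pair with foo.gt.txt
        gt = image.with_suffix(".gt.txt")
        if not gt.exists():
            problems.append((image.name, "missing .gt.txt"))
            continue
        text = gt.read_text(encoding="utf-8").strip()
        if not ALLOWED.fullmatch(text):
            problems.append((image.name, f"unexpected text: {text!r}"))
    return problems
```

An empty list from `check_ground_truth("path/to/ground-truth")` would at least rule out mismatched or malformed ground truth as the reason the trained models perform poorly.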
I will also mention that my attempts at post-processing with some sort of fuzzy matching can be useful with longer strings, but they fail miserably with the shortest strings, where a single misinterpreted character has a much larger impact.

On Monday, June 17, 2024 at 12:16:51 PM UTC-4 John Roxton wrote:

> I'm using Tesseract 5.3.3
>
> My use-case is to perform OCR on username strings captured from various
> ROIs of screenshots. These strings are 5-12 characters in length and make
> use of a set of allowable characters consisting of: A-Za-z0-9._
>
> In general, it seems that Tesseract already does a pretty good job on my
> images, but due to the particular font that seems to be used (I believe it
> is "Droid Sans"), it often struggles with particular characters or
> character combinations.
>
> The most common mistake it makes is with O (capital o) and 0 (zero).
> Another particularly tricky character/combination is with either case of
> the letter "J" as the "hook" in this letter for this font hangs below the
> horizon. It also may mischaracterize a "I" (capital i) for "l" (lowercase
> L).
>
> I've found that `--psm 6` usually works best for my use-case.
>
> Reading through the `tesseract-ocr` and `tesstrain` documentation, and
> learning from what I can find elsewhere online, it seems:
> - it is recommended that pre-processing images is better than training
> - fine-tuning should be preferred over training from scratch
>
> Albeit, I am having great trouble in training my own model. I have
> generated 10,000 `.tif` images of text of assorted string lengths from
> 5-12 characters utilizing my restricted character set in random
> combinations using the "Droid Sans" font, along with associated "ground
> truth" files with matching file names and a `.gt.txt` extension.
> Additionally, I have many "in-the-field" images (such as those seen below)
> that I can provide "ground truth" text for.
>
> Here are some particularly tricky images I've encountered:
>
> "CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
> [image: CJR21.png]
>
> "WPJ777" - Interpreted correctly using `--psm 6`
> [image: WPJ777.png]
>
> "SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital "O"
> [image: SeenorC0le.png]
>
> "Iamagod" - capital i misinterpreted as a lowercase L
> [image: Iamagod.png]
>
> Example of Tesseract's "internal" pre-processing:
> [image: Olympic-seat_4-25-3503-screenshot.processed.png]
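For reference, the kind of fuzzy-matching post-processing mentioned at the top of this message could look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes a list of known usernames is available to match against (the thread does not say where candidates come from), the names `canonical` and `match_username` are hypothetical, and collapsing O/0 and I/l to one character each is a choice made for this sketch. It uses only `difflib` from the Python standard library.

```python
from difflib import get_close_matches

# Confusable pairs named in the original message: O/0 and I/l.
# Mapping each pair to a single canonical character is an assumption
# of this sketch, not something Tesseract does.
CANONICAL = str.maketrans({"0": "O", "l": "I"})


def canonical(text):
    """Collapse characters that this font makes hard to tell apart."""
    return text.translate(CANONICAL)


def match_username(ocr_text, known_usernames, cutoff=0.8):
    """Return the known username closest to the OCR output, or None.

    known_usernames is a hypothetical list of valid names; cutoff is
    the minimum difflib similarity ratio (0.0 to 1.0) to accept.
    """
    # Group known names by their canonical form.
    by_key = {}
    for name in known_usernames:
        by_key.setdefault(canonical(name), []).append(name)

    hits = get_close_matches(canonical(ocr_text), list(by_key), n=1, cutoff=cutoff)
    if not hits:
        return None
    candidates = by_key[hits[0]]
    # If two known names differ only by confusable characters, refuse to guess.
    return candidates[0] if len(candidates) == 1 else None
```

With this sketch, `match_username("SenorCOle", ["SenorC0le", "Iamagod"])` resolves to `"SenorC0le"`, while `match_username("R21", ["CJR21"])` returns `None` because the similarity ratio (0.75) falls below the cutoff. That second case is the short-string problem in miniature: losing two of five characters leaves too little to match on, and lowering the cutoff would start accepting wrong names.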