I should clarify my trouble with training my own model:
I can generate all the needed data, but I cannot find a single consistent
source that walks me through the LSTM training process.  So, in case
anyone is wondering, I have not yet successfully trained and tested my own
model.  I have produced some .traineddata files that are larger than the
default eng.traineddata file, but they fail on even the few images from my
original message.  Furthermore, I cannot seem to reproduce the training
process!
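
Since I cannot yet rule out a data problem, here is a minimal sketch of a
sanity check over the generated files.  It assumes a single folder of `.tif`
images with matching `.gt.txt` transcriptions, plus the 5-12 character
strings over A-Za-z0-9._ described in my original message.  The function
name `check_ground_truth` is only illustrative and is not part of Tesseract
or tesstrain.

```python
import re
from pathlib import Path

# Assumption: every transcription is one username of 5-12 allowed characters.
ALLOWED = re.compile(r"[A-Za-z0-9._]{5,12}")


def check_ground_truth(directory):
    """Return (file name, problem) pairs for a folder of .tif/.gt.txt pairs."""
    problems = []
    root = Path(directory)

    # Every image needs a transcription with the same base name.
    for image in sorted(root.glob("*.tif")):
        gt = image.with_name(image.stem + ".gt.txt")
        if not gt.exists():
            problems.append((image.name, "missing .gt.txt"))
            continue
        text = gt.read_text(encoding="utf-8").strip()
        if not ALLOWED.fullmatch(text):
            problems.append((gt.name, f"unexpected transcription: {text!r}"))

    # Every transcription needs an image with the same base name.
    for gt in sorted(root.glob("*.gt.txt")):
        stem = gt.name[: -len(".gt.txt")]
        if not (root / f"{stem}.tif").exists():
            problems.append((gt.name, "missing .tif"))

    return problems
```

An empty list only means the pairs are consistent with these assumptions; it
says nothing about whether the training run itself is set up correctly.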

I will also mention that post-processing with some sort of fuzzy matching
can be useful for longer strings, but it fails miserably on the shortest
strings, where a single misread character has a much larger impact.
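
To illustrate the short-string problem, here is a minimal sketch using
Python's standard `difflib`.  It assumes there is a list of known usernames
to match against; the list, the 0.8 cutoff, and the helper names
`similarity` and `best_match` are placeholders for illustration, not my
actual post-processing code.

```python
import difflib


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def best_match(ocr_text, known_names, cutoff=0.8):
    """Return the closest known name, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(ocr_text, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical list of valid usernames (taken from the example images).
known = ["CJR21", "WPJ777", "SenorC0le", "Iamagod"]

# One wrong character in a 9-character name still scores about 0.89.
print(similarity("SenorCOle", "SenorC0le"), best_match("SenorCOle", known))

# Losing two characters of a 5-character name scores only 0.75.
print(similarity("R21", "CJR21"), best_match("R21", known))
```

With these values the long name is recovered, while "R21" falls below the
cutoff and returns no match.  Lowering the cutoff would recover it, but with
names this short it would also start accepting matches to the wrong name.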

On Monday, June 17, 2024 at 12:16:51 PM UTC-4 John Roxton wrote:

> I'm using Tesseract 5.3.3
>
> My use-case is to perform OCR on username strings captured from various 
> ROIs of screenshots.  These strings are 5-12 characters in length and make 
> use of a set of allowable characters consisting of:  A-Za-z0-9._
>
> In general, it seems that Tesseract already does a pretty good job on my 
> images, but due to the particular font that seems to be used (I believe it 
> is "Droid Sans"), it often struggles with particular characters or 
> character combinations.
>
> The most common mistake it makes is confusing O (capital o) with 0 (zero).
> Another particularly tricky character is the letter "J", in either case,
> as the "hook" of this letter in this font hangs below the baseline.  It
> may also mistake an "I" (capital i) for an "l" (lowercase L).
>
> I've found that `--psm 6` usually works best for my use-case.
>
> Reading through the `tesseract-ocr` and `tesstrain` documentation, and 
> learning from what I can find elsewhere online, it seems:
> - pre-processing the images is recommended over training
> - fine-tuning should be preferred over training from scratch
>
> Even so, I am having great trouble training my own model.  I have
> generated 10,000 `.tif` images of text in the "Droid Sans" font, with
> strings of 5-12 characters built from random combinations of my
> restricted character set, along with associated "ground truth" files
> with matching file names and a `.gt.txt` extension.
> Additionally, I have many "in-the-field" images (such as those seen below) 
> that I can provide "ground truth" text for.
>
>
> Here are some particularly tricky images I've encountered:
>
> "CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
> [image: CJR21.png]
>
> "WPJ777" - Interpreted correctly using `--psm 6`
> [image: WPJ777.png]
>
> "SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital "O"
> [image: SeenorC0le.png]
>
> "Iamagod" - capital i misinterpreted as a lowercase L
> [image: Iamagod.png]
>
> Example of Tesseract's "internal" pre-processing:
> [image: Olympic-seat_4-25-3503-screenshot.processed.png]
>
