I recently retrained the chi_tra model with a new font. The existing model 
would confuse certain characters. In addition, the source images (I'm 
decoding TV subtitles) had a weirdly shaped question mark. In the sample 
below, the last two characters are output as the number "7".

[image: chi_tra_7_0_QM.png]

I managed to find and buy a font that was very close to the subtitle font, 
but its question mark didn't match.  So I rendered the text to images without 
question marks, duplicated the data set, and appended the question mark 
*image* to each line of the copy. Then I merged the two data sets for input 
to training.
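
The duplicate-and-append step described above could be sketched roughly as 
follows. This is not the script actually used: images are modelled as plain 
lists of grayscale pixel rows so the example stays self-contained, and the 
names append_glyph and build_training_set, the gap width, and the "?" 
ground-truth character are all assumptions made for illustration (a subtitle 
font may well need the fullwidth form instead).

```python
# Sketch only: an "image" is a list of pixel rows (0 = black, 255 = white).
# A real pipeline would use an imaging library; every name here is hypothetical.


def pad_to_height(img, height, background=255):
    """Centre img vertically on a canvas of the given height."""
    width = len(img[0]) if img else 0
    missing = height - len(img)
    top = missing // 2
    bottom = missing - top
    blank = [background] * width
    return (
        [blank[:] for _ in range(top)]
        + [row[:] for row in img]
        + [blank[:] for _ in range(bottom)]
    )


def append_glyph(line_img, glyph_img, gap=2, background=255):
    """Return line_img with glyph_img pasted to its right, separated by a gap."""
    height = max(len(line_img), len(glyph_img))
    left = pad_to_height(line_img, height, background)
    right = pad_to_height(glyph_img, height, background)
    spacer = [background] * gap
    return [l + spacer + r for l, r in zip(left, right)]


def build_training_set(samples, glyph_img, mark="?"):
    """samples is a list of (image, ground_truth_text) pairs.

    Returns the original samples followed by a copy of each one with the
    glyph image appended and the mark added to its ground-truth text.
    """
    augmented = [
        (append_glyph(img, glyph_img), text + mark) for img, text in samples
    ]
    return list(samples) + augmented
```

The point of merging rather than replacing is that the model still sees 
every line without the question mark, so the appended glyph is the only 
difference between the two halves of the data.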

Training on my poky old 2.5 GHz i5 CPU takes about 6 to 18 hours of unattended 
operation.  Getting the ground-truth sorted out took about 1 to 2 days of my 
direct work.

But despite all this training, it still produces no output at all 
<https://groups.google.com/g/tesseract-ocr/c/hwX_YFRUXf4> if the 
input ends with an ellipsis or three dots:
[image: bad_sub_243.png]
This seems to be a problem with the image preprocessing tesseract does when 
identifying blocks or glyphs or something rather than a problem with the 
model.  I'm debugging it now but it is tough going. The code is exactly 
what you'd expect from a massive C program from 1985 worked on by multiple 
researcher-types over the past 40 years...
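
One cheap way to probe whether layout analysis rather than the model is at 
fault is to run the same image under several page segmentation modes and 
compare the output. The sketch below only assumes the standard tesseract 
command line (-l, --psm, and "stdout" as the output base); the helper names 
and the choice of modes are illustrative, and a retrained model may also 
need --tessdata-dir, which is left out here.

```python
import subprocess

# Page segmentation modes worth comparing on a single subtitle line:
#   3 = fully automatic (the default), 6 = single uniform block,
#   7 = single text line, 13 = raw line.
DEFAULT_MODES = (3, 6, 7, 13)


def build_commands(image_path, lang="chi_tra", psm_modes=DEFAULT_MODES):
    """Build one tesseract command line per page segmentation mode."""
    return [
        ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
        for psm in psm_modes
    ]


def run_all(image_path, **kwargs):
    """Run every command and map each PSM value to the recognised text."""
    results = {}
    for cmd in build_commands(image_path, **kwargs):
        proc = subprocess.run(
            cmd, capture_output=True, text=True, encoding="utf-8", check=False
        )
        results[cmd[-1]] = proc.stdout.strip()
    return results
```

If some modes return text and others return nothing for the same image, 
that would point at segmentation rather than recognition.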

On Tuesday, March 19, 2024 at 1:38:27 PM UTC+8 lfdo...@gmail.com wrote:

> Thanks, that's helpful. Is the collaboration with Google ongoing then?
> Can you give me a sense of what magnitude of computing resources
> training on the full dataset involves? Is it simply the days-to-weeks
> per model described in the documentation? Would it be reasonable to
> continually retrain existing models with additional
> community-contributed data, rather than starting from scratch each
> time?
>
> On Sun, Mar 17, 2024 at 3:51 AM Tom Morris <tfmo...@gmail.com> wrote:
> >
> > On Friday, March 15, 2024 at 11:13:15 PM UTC-4 lfdo...@gmail.com wrote:
> >
> > My naive assumption when I originally encountered issues with
> > tesseract was that there would be some central repository of training
> > data which we would collaborate on extending and improving in an
> > open-source way, including with examples of bad results on fairly
> > clean inputs.
> >
> >
> > Ray Smith has been very generous with his time and Google's resources,
> > but it's a bit of an asymmetric situation and the open source community,
> > by and large, has not organized around wide scale retraining. The work
> > that has been done is typically isolated "one-offs", with the results not
> > captured and used to improve the state of play. The groups that have put
> > significant resources into training typically have a very focused goal
> > such as early German blackletter, early modern printing, etc.
> >
> >
> > Given that tesseract is focused on OCR of
> > machine-created text in the first place, creating synthetic datasets
> > also seems very viable.
> >
> >
> > I think one issue with creating synthetic datasets is access to
> > commercially licensed fonts. Google has the resources to purchase licenses
> > for hundreds of commercial fonts and use them to render a great variety of
> > line images, but there's no economical way for them to provide those fonts
> > to the open source community for reuse.
> >
> > Training also requires a non-trivial amount of computing resources as
> > well as some specialized knowledge.
> >
> > Tom
> >
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/819ef5c5-22a9-4ae5-bc22-789b45bb54f7n%40googlegroups.com.
