Re: [tesseract-ocr] Re: why are there no new trained models since 2018?

Liam Doherty Mon, 18 Mar 2024 22:38:21 -0700

Thanks, that's helpful. Is the collaboration with Google ongoing then?
Can you give me a sense of what magnitude of computing resources
training on the full dataset involves? Is it simply the days-to-weeks
per model described in the documentation? Would it be reasonable to
continually retrain existing models with additional
community-contributed data, rather than starting from scratch each
time?


On Sun, Mar 17, 2024 at 3:51 AM Tom Morris <[email protected]> wrote:
>
> On Friday, March 15, 2024 at 11:13:15 PM UTC-4 [email protected] wrote:
>
> My naive assumption when I originally encountered issues with
> tesseract was that there would be some central repository of training
> data which we would collaborate on extending and improving in an
> open-source way, including with examples of bad results on fairly
> clean inputs.
>
>
> Ray Smith has been very generous with his time and Google's resources, but 
> it's a bit of an asymmetric situation and the open source community, by and 
> large, has not organized around wide-scale retraining. The work that has been 
> done is typically isolated one-offs, with the results not captured and used 
> to improve the state of play. The groups that have put significant resources 
> into training typically have a very focused goal such as early German 
> blackletter, early modern printing, etc.
>
>
> Given that tesseract is focused on OCR of
> machine-created text in the first place, creating synthetic datasets
> also seems very viable.
>
>
> I think one issue with creating synthetic datasets is access to commercially 
> licensed fonts. Google has the resources to purchase licenses for hundreds of 
> commercial fonts and use them to render a great variety of line images, but 
> there's no economical way for them to provide those fonts to the open source 
> community for reuse.
>
> Training also requires a non-trivial amount of computing resources as well as 
> some specialized knowledge.
>
> Tom
>
> --
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/tesseract-ocr/44dd22af-42fb-48f4-bf3b-9bbbe2c21a37n%40googlegroups.com.

