Thanks, that's helpful. Is the collaboration with Google ongoing then? Can you give me a sense of what magnitude of computing resources training on the full dataset involves? Is it simply the days-to-weeks per model described in the documentation? Would it be reasonable to continually retrain existing models with additional community-contributed data, rather than starting from scratch each time?
On Sun, Mar 17, 2024 at 3:51 AM Tom Morris <tfmor...@gmail.com> wrote:
>
> On Friday, March 15, 2024 at 11:13:15 PM UTC-4 lfdo...@gmail.com wrote:
>
> My naive assumption when I originally encountered issues with
> tesseract was that there would be some central repository of training
> data which we would collaborate on extending and improving in an
> open-source way, including with examples of bad results on fairly
> clean inputs.
>
>
> Ray Smith has been very generous with his time and Google's resources, but
> it's a bit of an asymmetric situation and the open source community, by and
> large, has not organized around wide scale retraining. The work that has been
> done is typically isolated, "one-of"s with the results not captured and used
> to improve the state of play. The groups that have put significant resources
> into training typically have a very focused goal such as early German
> blackletter, early modern printing, etc.
>
>
> Given that tesseract is focused on OCR of
> machine-created text in the first place, creating synthetic datasets
> also seems very viable.
>
>
> I think one issue with creating synthetic datasets is access to commercially
> licensed fonts. Google has the resources to purchase licenses for hundreds of
> commercial fonts and use them to render a great variety of line images, but
> there's no economical way for them to provide those fonts to the open source
> community for reuse.
>
> Training also requires a non-trivial amount of computing resources as well as
> some specialized knowledge.
>
> Tom
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/44dd22af-42fb-48f4-bf3b-9bbbe2c21a37n%40googlegroups.com.