Hello, please read the wiki pages at http://code.google.com/p/tesseract-ocr/wiki, especially http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract, which describes the training process for Tesseract 2.04.
In svn (http://code.google.com/p/tesseract-ocr/source/checkout) there is already a (pre?)release of version 3.00, with language data for your language as well (see http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata%3Fstate%3Dclosed). Based on some remarks on the wiki pages, the training process for 3.00 should be different; see also the postings in this forum. There is no information on when 3.00 will be released.

Zd.

On 23.04.2010 16:28, Lars Aronsson wrote:
> I'm the founder of Project Runeberg, the Scandinavian
> volunteer book scanning project, http://runeberg.org/
> where we have mainly been using Abbyy Finereader,
> with subsequent manual, online proofreading.
> I'm also involved in Wikisource, the book scanning
> and proofreading project of the Wikimedia Foundation.
>
> Is anybody training Tesseract to read Swedish and
> other Scandinavian languages? Is there a tutorial
> for how to train new languages in Tesseract?
>
> I'm running Ubuntu Linux 9.10. The included package
> for Tesseract 2.03 contains man pages that are next
> to useless. There seem to be some programs: mftraining,
> cntraining, unicharset_extractor, but they talk about
> "box files" and I have no clue what these are.
>
> In Project Runeberg, we already have 186,000 pages
> that are fully proofread, mostly in Swedish and
> Danish, in various fonts and from different years,
> meaning different spelling standards. Could these
> be used for training Tesseract? How do I start?
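
On the "box files" question: as far as I understand the 2.x training wiki page, a box file is a plain-text file that accompanies a training image and lists one character per line, followed by the left, bottom, right and top pixel coordinates of its bounding box, with the origin at the bottom-left corner of the image. The Python sketch below only illustrates that layout. The names `Box` and `parse_box_lines` and the sample coordinates are made up for illustration and are not part of Tesseract, and any extra fields a newer version might add are simply ignored here.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One box-file entry (hypothetical helper type, not a Tesseract API)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text, assuming '<char> <left> <bottom> <right> <top>' per line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            # Skip blank or malformed lines instead of failing.
            continue
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


# Made-up sample data, only to show the assumed layout.
sample = "D 101 504 131 535\nä 135 504 152 528\n"
boxes = parse_box_lines(sample)
for box in boxes:
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Proofread pages like those in Project Runeberg give you the correct characters, but the training tools also need the box coordinates for each character, so the text alone is probably not enough without the matching page images.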