Re: Training for Swedish, Danish, Norwegian, old spelling, fraktur

Zdenko Podobný Fri, 23 Apr 2010 10:48:01 -0700

Hello,

please read the wiki pages at http://code.google.com/p/tesseract-ocr/wiki,
especially http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract,
which describes the training process for Tesseract 2.04.
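
Regarding the "box files" mentioned in the question: as far as I
understand the training page, a box file is a plain text file that lists,
one per line, a character from the training image followed by the pixel
coordinates of its bounding box. Below is a minimal sketch of reading such
lines, assuming the layout <char> <left> <bottom> <right> <top>; the names
Box, parse_box_line and parse_box_text are hypothetical helpers for
illustration, not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One entry of a box file (hypothetical helper type, not a Tesseract API)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line.

    Assumed layout: <char> <left> <bottom> <right> <top>.
    Any further fields on the line are ignored by this sketch.
    """
    parts = line.split()
    if len(parts) < 5:
        raise ValueError(f"not a box line: {line!r}")
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


def parse_box_text(text: str) -> List[Box]:
    """Parse the contents of a whole box file, skipping blank lines."""
    return [parse_box_line(l) for l in text.splitlines() if l.strip()]
```

Proofread page text could then be compared against the character column
of such a file, but how well that works for training is something the
wiki page itself should settle.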


In svn (http://code.google.com/p/tesseract-ocr/source/checkout) there is
already a (pre?) release of version 3.00, with language data for your
language as well (see
http://code.google.com/p/tesseract-ocr/source/browse/#svn/trunk/tessdata%3Fstate%3Dclosed).


Based on some remarks on the wiki pages, the training process for 3.00
should be different; see also the postings in this forum. There is no
information on when 3.00 will be released.

Zd.

On 23.04.2010 16:28, Lars Aronsson wrote:
> I'm the founder of Project Runeberg, the Scandinavian
> volunteer book scanning project, http://runeberg.org/
> where we have mainly been using Abbyy Finereader,
> with subsequent manual, online proofreading.
> I'm also involved in Wikisource, the book scanning
> and proofreading project of the Wikimedia Foundation.
>
> Is anybody training Tesseract to read Swedish and
> other Scandinavian languages? Is there a tutorial
> for how to train new languages in Tesseract?
>
> I'm running Ubuntu Linux 9.10. The included package
> for Tesseract 2.03 contains man pages that are next
> to useless. There seem to be some programs: mftraining,
> cntraining, unicharset_extractor, but they talk about
> "box files" and I have no clue what these are.
>
> In Project Runeberg, we already have 186,000 pages
> that are fully proofread, mostly in Swedish and
> Danish, in various fonts and from different years,
> meaning different spelling standards. Could these
> be used for training Tesseract? How do I start?
>
>
