Dear Matthew, thank you for your long letter.

To make a long story short, I'm familiar with the problems of old typography, but I have no experience with tesseract training.

I can, however, point you to a report on an experiment in training tesseract on old Polish texts with the same problems you describe:

http://lib.psnc.pl/publication/428

The texts, both as images and as PAGE files, are publicly available at

http://dl.psnc.pl/activities/projekty/impact/results/

Please note that the trained dataset is also available at

http://dl.psnc.pl/download/tesseract_traineddata.zip

The training used the "classical" rectangular method.

To tell the truth, I don't know how effective the training was, as I'm not aware of any large-scale application of the trained dataset. Using it is one of the user options in the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl/wlt-web/index.xhtml), but I have no idea who uses it or for what.

It would be interesting to retrain tesseract using your approach on the data described above and to compare the results, but I'm afraid nobody has the time and motivation for it.

Best regards and good luck with your project

Janusz


--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
