Dear Matthew, thank you for your long letter.
To make a long story short, I'm familiar with the problems of old
typography, but I have no experience with training tesseract.
I can, however, point you to a report on an experiment in training
tesseract on old Polish texts that show the same problems you
describe:
http://lib.psnc.pl/publication/428
The texts, both as images and as PAGE files, are publicly available at
http://dl.psnc.pl/activities/projekty/impact/results/
Please note that the trained dataset is also available at
http://dl.psnc.pl/download/tesseract_traineddata.zip
The training used the "classical" rectangular method.
To tell the truth, I don't know how effective the training was, as I'm
not aware of any large-scale application of the trained dataset. Using
it is one of the user options in the Virtual Transcription Laboratory
(http://wlt.synat.pcss.pl/wlt-web/index.xhtml), but I have no idea who
uses it or for what.
It would be interesting to retrain tesseract using your approach on
the data described above and to compare the results, but I'm afraid
nobody has the time or motivation for it.
Best regards and good luck with your project
Janusz
--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en