[tesseract-ocr] Evaluation of model trained with generated text against real-world data

Inductiveload Sun, 25 Jul 2021 22:59:05 -0700

Hi,

I am training an LSTM model for old-style English printing (i.e. a typeface 
somewhat like Caslon, with long s and substantial printing defects). I am 
hoping to eventually submit it to tessdata_contrib.


I have had quite some success with a script that generates line data using a 
modified version of Adobe Caslon Pro plus some noise generation, and then 
training on top of the eng model [1]. I use generated data mostly because I 
do not want to have to extract lines from thousands of images and correct 
them all first.

Because I am training on artificial data, but the actual aim is to OCR real 
images, I would like to be able to evaluate the effects of various 
parameters more objectively. However, I am struggling to figure out how to 
generate the data that lstmeval requires as input. The inputs I have are a 
directory of images and text files, laid out in the same way as the 
directory of generated ground-truth images I am training with.

What is the correct way to generate the required data for running lstmeval 
manually in this case?
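
For illustration, here is a rough Python sketch of one plausible route. It 
rests on several assumptions that are not stated above: that each real image 
holds a single text line, that the transcription sits next to it as 
<base>.gt.txt (the tesstrain naming convention), and that a <base>.box file 
has already been produced for each image, for example with one of the box 
generation scripts shipped with tesstrain. The helper names are hypothetical; 
only the tesseract and lstmeval command lines are meant to mirror the real 
tools.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".tif", ".tiff"}  # assumed image types


def find_pairs(data_dir, gt_suffix=".gt.txt"):
    """Return sorted (image, text) pairs that share a base name.

    gt_suffix is an assumption (tesstrain convention); adjust as needed.
    """
    pairs = []
    for image in sorted(Path(data_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text = image.with_name(image.stem + gt_suffix)
        if text.exists():
            pairs.append((image, text))
    return pairs


def lstmf_command(image, psm=13):
    """Build the argv that writes <base>.lstmf for one image.

    tesseract looks for <base>.box next to the image, so the box file
    must exist before this is run. psm 13 (raw line) is an assumption
    that fits single-line images.
    """
    base = image.with_suffix("")
    return ["tesseract", str(image), str(base), "--psm", str(psm), "lstm.train"]


def write_listfile(pairs, listfile):
    """Write one expected .lstmf path per line and return those paths."""
    lines = [str(image.with_suffix(".lstmf")) for image, _ in pairs]
    Path(listfile).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def lstmeval_command(model, listfile):
    """Build the argv that evaluates a .traineddata model on the list."""
    return ["lstmeval", "--model", str(model), "--eval_listfile", str(listfile)]


def evaluate(data_dir, model, listfile="real.eval.list"):
    """Run the whole pipeline; needs tesseract and lstmeval on PATH."""
    pairs = find_pairs(data_dir)
    for image, _ in pairs:
        subprocess.run(lstmf_command(image), check=True)
    write_listfile(pairs, listfile)
    subprocess.run(lstmeval_command(model, listfile), check=True)
```

Calling evaluate("real_lines", "mymodel.traineddata") (both names are 
placeholders) would then report the error rates over the real images. When 
evaluating a training checkpoint rather than a finished .traineddata file, 
lstmeval also needs the starter model passed with --traineddata.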

[1]: https://en.wikisource.org/wiki/User:Inductiveload/Tesseract
