[tesseract-ocr] Estimating duration of the train data creation

Sim Tov Sat, 21 Aug 2021 14:00:47 -0700

Hello,

I want to train Tesseract 4 (LSTM) from scratch to recognize a certain font
family, and I ran this command:


/usr/share/tesseract-ocr/tesstrain.sh --fonts_dir Fonts --lang heb \
  --linedata_only --noextract_font_properties --langdata_dir ./langdata \
  --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ \
  --output_dir output/train --fontlist <list> <of> <eight> <fonts>

My training_text file is 26M and the wordlist is 6.3M. I launched the
command above 2 days ago, and the process is still running. I get output
like this:

Page 3357
Loaded 171819/171819 pages (1-171819) of document /tmp/tmp.M6Ams42Ik5/...

1. Is there a way to estimate how long all this will take or how many pages
are going to be loaded?
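
To make the first question concrete, here is a minimal sketch of the kind of
estimate I have in mind. It assumes (unverified) that the "Page N" counter
eventually reaches the total shown in the "Loaded X/Y pages" line, and that
pages are processed at a roughly constant rate. The function name and the
elapsed-time argument are hypothetical; nothing here is part of tesstrain.sh.

```python
import re


def estimate_remaining_seconds(log_text, elapsed_seconds):
    """Linearly extrapolate the remaining time from the log counters.

    Assumption (not confirmed by the Tesseract docs): the last "Page N"
    line is the current position, and the Y in "Loaded X/Y pages" is the
    total number of pages this stage has to process.
    """
    pages = re.findall(r"^Page (\d+)\s*$", log_text, flags=re.MULTILINE)
    totals = re.findall(r"Loaded \d+/(\d+) pages", log_text)
    if not pages or not totals or elapsed_seconds <= 0:
        return None  # not enough information to estimate

    done = int(pages[-1])    # most recent "Page N" value
    total = int(totals[-1])  # most recent reported total
    if done <= 0:
        return None

    rate = done / elapsed_seconds  # pages per second so far
    return max(total - done, 0) / rate
```

Here elapsed_seconds would have to be the time spent in this stage only, not
the full 2 days, since part of that time went into rendering.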


In the previous stage, the text was rendered with output like this:

Rendered page 1796 to file /tmp/tmp.vmJd24cTIt/...

2. Is there a way to estimate how many pages are going to be rendered?

Thank you!
