Hello,

I want to train Tesseract 4 (LSTM) from scratch to recognize a certain font family, and I ran this command:

    /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir Fonts --lang heb \
        --linedata_only --noextract_font_properties \
        --langdata_dir ./langdata \
        --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/ \
        --output_dir output/train \
        --fontlist <list> <of> <eight> <fonts>

My training_text file is 26M and the wordlist is 6.3M. I launched the command above 2 days ago and the process is still running. I get output like this:

    Page 3357
    Loaded 171819/171819 pages (1-171819) of document /tmp/tmp.M6Ams42Ik5/...

1. Is there a way to estimate how long all this will take, or how many pages are going to be loaded?

In the previous stage, the text was rendered with output like this:

    Rendered page 1796 to file /tmp/tmp.vmJd24cTIt/...

2. Is there a way to estimate how many pages are going to be rendered?

Thank you!