I use v4.1.1 because it is the version shipped with the recently released stable Debian 11. It will stay that way for roughly the next two years.
So my question is: is it OK to use the .lstmf files I have so far for training, or must their generation finish properly? In other words, if I stop the process myself midway, is the .lstmf file still usable? Is there a way to check its consistency?

On Wednesday, September 22, 2021 at 11:14:01 AM UTC+3 zdenop wrote:

> And what about testing the latest code?
> "tesstrain.sh" training is not supported anymore, and for creating issues
> you must use the latest code anyway.
>
> Zdenko
>
>
> On Wed, 22 Sep 2021 at 9:20, Sim Tov <smn...@gmail.com> wrote:
>
>> Maybe it is just a bug I need to open an issue?
>>
>> On Monday, September 20, 2021 at 2:52:18 PM UTC+3 Sim Tov wrote:
>>
>>> Hello,
>>>
>>> I use v4.1.1 on Linux (Debian 11) and try to generate train and evaluate
>>> data. The commands I used were:
>>>
>>> train:
>>>
>>> /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working
>>> --lang heb --linedata_only --noextract_font_properties --langdata_dir
>>> ./langdata --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/
>>> --output_dir output/train --fontlist 'BenOr Rashi' 'Guttman Rashi Bold'
>>>
>>> and
>>>
>>> evaluate:
>>>
>>> /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working
>>> --lang heb --linedata_only --noextract_font_properties --langdata_dir
>>> ./langdata --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/
>>> --output_dir output/evaluate --fontlist 'Guttman Rashi'
>>>
>>> After several days of running both commands stopped with errors like
>>> this for each of the 3 fonts:
>>>
>>> Page 8365
>>> Loaded 386170/386170 lines (1-386170) of document
>>> /tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
>>> Page 8366
>>> Loaded 386216/386216 lines (1-386216) of document
>>> /tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
>>> /usr/share/tesseract-ocr/tesstrain_utils.sh: line 72: 2271 Segmentation
>>> fault "${cmd}" "$@" 2>&1
>>> 2272 Done | tee -a ${LOG_FILE}
>>> ERROR: Program tesseract failed.
>>> Abort.
>>>
>>> Interestingly, heb.Guttman_Rashi.exp0.lstmf and both other .lstmf
>>> files were exactly 1 GB in size...
>>>
>>> Does it have something to do with what is written here:
>>>
>>>
>>> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>>>
>>> "The text is divided by language automatically, so there is a separate
>>> stream for each of the Devanagari-based languages (as there is for the
>>> Latin-based languages) and *clipped to 1GB* for each language."
>>>
>>> 1. So is this Segmentation fault an expected behavior?
>>>
>>> 2. What should I do now? Should I rerun the commands hoping that they
>>> will finish properly, or should I copy the .lstmf files that I got so far
>>> to the train/evaluate directories and start training?
>>>
>>> 3. Both output/evaluate and output/train directories remained empty
>>> after the commands above failed. What files should be there at the end so I
>>> can start the training process?
>>>
>>>
>>> Thank you in advance!
>>>
>>> tesseract --version
>>> tesseract 4.1.1
>>> leptonica-1.79.0
>>> libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 :
>>> libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>>> Found AVX
>>> Found FMA
>>> Found SSE
>>> Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8
>>> liblz4/1.9.3 libzstd/1.4.8
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/cec596a9-cdfb-4a68-ab49-d275f27a82a5n%40googlegroups.com
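
P.S. A minimal sketch for confirming the size symptom described in this thread. It is not a Tesseract tool and it does not validate the contents of an .lstmf file; it only lists the .lstmf files in a directory and flags those whose size has reached a given limit. The function name report_lstmf_sizes and the reading of "1Gb" as 2**30 bytes are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: "exactly 1Gb" in the report means 2**30 bytes (1 GiB).
ONE_GIB = 1024 ** 3


def report_lstmf_sizes(directory, limit=ONE_GIB):
    """Return a list of (file name, size in bytes, reached_limit) tuples,
    one per .lstmf file found directly inside `directory`, sorted by name."""
    rows = []
    for path in sorted(Path(directory).glob("*.lstmf")):
        size = path.stat().st_size
        # A file at or above the limit matches the symptom in this thread;
        # a smaller file is not thereby proven to be complete or valid.
        rows.append((path.name, size, size >= limit))
    return rows
```

For example, report_lstmf_sizes("/tmp/heb-2021-09-16.1dB") would list the files left in the temporary directory named in the log, provided that directory still exists.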