And what about testing the latest code?
"tesstrain.sh" training is not supported anymore, and for creating issues
you must use the latest code anyway.

Zdenko


On Wed, 22 Sep 2021 at 9:20, Sim Tov <smn...@gmail.com> wrote:

> Maybe it is just a bug and I need to open an issue?
>
> On Monday, September 20, 2021 at 2:52:18 PM UTC+3 Sim Tov wrote:
>
>> Hello,
>>
>> I use v4.1.1 on Linux (Debian 11) and try to generate train and evaluate
>> data. The commands I used were:
>>
>> train:
>>
>> /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working
>> --lang heb --linedata_only --noextract_font_properties --langdata_dir
>> ./langdata  --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/
>> --output_dir output/train --fontlist 'BenOr Rashi' 'Guttman Rashi Bold'
>>
>> and
>>
>> evaluate:
>>
>> /usr/share/tesseract-ocr/tesstrain.sh --fonts_dir FontsRashi/Working
>> --lang heb --linedata_only --noextract_font_properties --langdata_dir
>> ./langdata  --tessdata_dir /usr/share/tesseract-ocr/4.00/tessdata/
>> --output_dir output/evaluate --fontlist 'Guttman Rashi'
>>
>> After several days of running, both commands stopped with errors like this
>> for each of the 3 fonts:
>>
>> Page 8365
>> Loaded 386170/386170 lines (1-386170) of document
>> /tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
>> Page 8366
>> Loaded 386216/386216 lines (1-386216) of document
>> /tmp/heb-2021-09-16.1dB/heb.Guttman_Rashi.exp0.lstmf
>> /usr/share/tesseract-ocr/tesstrain_utils.sh: line 72:  2271 Segmentation
>> fault      "${cmd}" "$@" 2>&1
>>       2272 Done                    | tee -a ${LOG_FILE}
>> ERROR: Program tesseract failed. Abort.
>>
>> Interestingly, heb.Guttman_Rashi.exp0.lstmf and both other .lstmf files
>> were exactly 1 GB in size...
>>
>> Does it have something to do with what is written here:
>>
>>
>> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>>
>> "The text is divided by language automatically, so there is a separate
>> stream for each of the Devanagari-based languages (as there is for the
>> Latin-based languages) and *clipped to 1GB* for each language."
>>
>> 1. So is this segmentation fault expected behavior?
>>
>> 2. What should I do now? Should I rerun the commands hoping that they
>> will finish properly or should I copy those .lstmf files that I got so far
>> to the train/evaluate directories and start training?
>>
>> 3. Both output/evaluate and output/train directories remained empty after
>> the commands above failed. What files should be there at the end so I can
>> start the training process?
>>
>>
>> Thank you in advance!
>>
>> tesseract --version
>> tesseract 4.1.1
>>  leptonica-1.79.0
>>   libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 :
>> libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0
>>  Found AVX
>>  Found FMA
>>  Found SSE
>>  Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8
>> liblz4/1.9.3 libzstd/1.4.8
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/cec596a9-cdfb-4a68-ab49-d275f27a82a5n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/cec596a9-cdfb-4a68-ab49-d275f27a82a5n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
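
Regarding the observation in the quoted message that every .lstmf file was exactly 1 GB: a minimal sketch for checking this, assuming GNU coreutils stat (as shipped with Debian) and assuming that "1 GB" means 2^30 bytes. The helper names is_exactly_1gib and report_lstmf_sizes are hypothetical and not part of Tesseract; whether such a size actually indicates a limit in tesstrain.sh is not confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tesseract: succeeds if the file's size
# is exactly 1 GiB (1073741824 bytes). Assumes GNU stat (stat -c %s).
is_exactly_1gib() {
    size=$(stat -c %s -- "$1") || return 2
    [ "$size" -eq 1073741824 ]
}

# Print the size of every .lstmf file in a directory (default: current
# directory) and flag the ones that are exactly 1 GiB.
report_lstmf_sizes() {
    dir=${1:-.}
    for f in "$dir"/*.lstmf; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal, skip it
        if is_exactly_1gib "$f"; then
            printf '%s: exactly 1 GiB (may have hit a size limit)\n' "$f"
        else
            printf '%s: %s bytes\n' "$f" "$(stat -c %s -- "$f")"
        fi
    done
}
```

After sourcing the functions, running report_lstmf_sizes /tmp/heb-2021-09-16.1dB would list the files from the log above, if that temporary directory still exists.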

